板凳-------Mysql cookbook学习 (十)
5.6 改变字符串的字符集或字符排序
mysql> set @s1 = 'my string';
Query OK, 0 rows affected (0.01 sec)mysql> set @s2 = convert(@s1 using utf8);
Query OK, 0 rows affected, 1 warning (0.00 sec)mysql> select charset(@s1), charset(@s2);
+--------------+--------------+
| charset(@s1) | charset(@s2) |
+--------------+--------------+
| latin7 | utf8mb3 |
+--------------+--------------+
1 row in set (0.00 sec)mysql> SET @s1 = _latin7 'some string'; -- 示例设置一个latin7字符串
Query OK, 0 rows affected (0.00 sec)mysql>
mysql> SET @s2 = CONVERT(@s1 USING latin1) COLLATE latin1_spanish_ci;
Query OK, 0 rows affected (0.00 sec)mysql> select collation(@s1), collation(@s2);
+-------------------+-------------------+
| collation(@s1) | collation(@s2) |
+-------------------+-------------------+
| latin7_general_ci | latin1_spanish_ci |
+-------------------+-------------------+
1 row in set (0.00 sec)mysql> -- 创建 latin1 字符集的变量
mysql> SET @s1 = _latin1 'Texto en español';
Query OK, 0 rows affected (0.00 sec)mysql>
mysql> -- 应用西班牙语校对规则
mysql> SET @s2 = @s1 COLLATE latin1_spanish_ci;
Query OK, 0 rows affected (0.00 sec)mysql>
mysql> -- 验证结果
mysql> SELECT-> @s1 AS original,-> @s2 AS modified,-> CHARSET(@s1) AS charset_s1,-> CHARSET(@s2) AS charset_s2,-> COLLATION(@s1) AS collation_s1,-> COLLATION(@s2) AS collation_s2;
+------------------+------------------+------------+------------+-------------------+-------------------+
| original | modified | charset_s1 | charset_s2 | collation_s1 | collation_s2 |
+------------------+------------------+------------+------------+-------------------+-------------------+
| Texto en espa?ol | Texto en espa?ol | latin1 | latin1 | latin1_swedish_ci | latin1_spanish_ci |
+------------------+------------------+------------+------------+-------------------+-------------------+
1 row in set (0.00 sec)
字符集和校对规则关系说明
字符集 (Character Set) 支持的校对规则 (Collation) 用途说明
latin1 latin1_spanish_ci, latin1_general_ci 西欧语言
latin7 latin7_general_ci, latin7_bin 波罗的海语言
utf8mb4 utf8mb4_spanish_ci, utf8mb4_unicode_ci 多语言支持
最佳实践建议
保持一致:变量、表和连接的字符集/校对规则应保持一致
优先使用 Unicode:对于多语言支持,推荐使用 utf8mb4
sql
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;
显式转换:当需要不同字符集时,使用 CONVERT() 函数
校对规则选择:_ci:不区分大小写 (case-insensitive)_bin:二进制比较_es:特定语言规则(如西班牙语)mysql> set @s1 = 'my string';
Query OK, 0 rows affected (0.00 sec)mysql> set @s2 = convert(@s1 using binary);
Query OK, 0 rows affected (0.00 sec)mysql> set @s3 = convert(@s2 using utf8);
Query OK, 0 rows affected, 1 warning (0.00 sec)mysql> select charset(@s1), charset(@s2), charset(@s3);
+--------------+--------------+--------------+
| charset(@s1) | charset(@s2) | charset(@s3) |
+--------------+--------------+--------------+
| latin7 | binary | utf8mb3 |
+--------------+--------------+--------------+
1 row in set (0.00 sec)mysql> set @s1 = 'my string';
Query OK, 0 rows affected (0.00 sec)mysql> set @s2 = binary @s2;
Query OK, 0 rows affected, 1 warning (0.00 sec)mysql> select charset(@s1), charset(@s2);
+--------------+--------------+
| charset(@s1) | charset(@s2) |
+--------------+--------------+
| latin7 | binary |
+--------------+--------------+
1 row in set (0.00 sec)
5.7 更改字符串字母的大小写
mysql> use cookbook
Database changed
--整体替换大小写
mysql> select thing, upper(thing), lower(thing) from limbs;
+--------------+--------------+--------------+
| thing | upper(thing) | lower(thing) |
+--------------+--------------+--------------+
| human | HUMAN | human |
| insect | INSECT | insect |
| squid | SQUID | squid |
| octopus | OCTOPUS | octopus |
| fish | FISH | fish |
| centipede | CENTIPEDE | centipede |
| table | TABLE | table |
| armchair | ARMCHAIR | armchair |
| phonograph | PHONOGRAPH | phonograph |
| tripod | TRIPOD | tripod |
| Peg Leg Pete | PEG LEG PETE | peg leg pete |
| space alien | SPACE ALIEN | space alien |
| armchair | ARMCHAIR | armchair |
| centipede | CENTIPEDE | centipede |
| fish | FISH | fish |
| human | HUMAN | human |
| insect | INSECT | insect |
| Peg Leg Pete | PEG LEG PETE | peg leg pete |
| phonograph | PHONOGRAPH | phonograph |
| space alien | SPACE ALIEN | space alien |
| squid | SQUID | squid |
| table | TABLE | table |
| tripod | TRIPOD | tripod |
+--------------+--------------+--------------+
23 rows in set (0.02 sec)--局部修改大小写
mysql> select thing, concat(upper(left(thing, 1)), mid(thing, 2)) from limbs;
+--------------+----------------------------------------------+
| thing | concat(upper(left(thing, 1)), mid(thing, 2)) |
+--------------+----------------------------------------------+
| human | Human |
| insect | Insect |
| squid | Squid |
| octopus | Octopus |
| fish | Fish |
| centipede | Centipede |
| table | Table |
| armchair | Armchair |
| phonograph | Phonograph |
| tripod | Tripod |
| Peg Leg Pete | Peg Leg Pete |
| space alien | Space alien |
| armchair | Armchair |
| centipede | Centipede |
| fish | Fish |
| human | Human |
| insect | Insect |
| Peg Leg Pete | Peg Leg Pete |
| phonograph | Phonograph |
| space alien | Space alien |
| squid | Squid |
| table | Table |
| tripod | Tripod |
+--------------+----------------------------------------------+
23 rows in set (0.01 sec)--自定义函数局部修改大小写
mysql> create function initial_cap(s varchar(255))-> returns varchar(255) deterministic-> return concat(upper(left(s, 1)), mid(s, 2));
Query OK, 0 rows affected (0.06 sec)mysql> select thing, initial_cap(thing) from limbs;
+--------------+--------------------+
| thing | initial_cap(thing) |
+--------------+--------------------+
| human | Human |
| insect | Insect |
| squid | Squid |
| octopus | Octopus |
| fish | Fish |
| centipede | Centipede |
| table | Table |
| armchair | Armchair |
| phonograph | Phonograph |
| tripod | Tripod |
| Peg Leg Pete | Peg Leg Pete |
| space alien | Space alien |
| armchair | Armchair |
| centipede | Centipede |
| fish | Fish |
| human | Human |
| insect | Insect |
| Peg Leg Pete | Peg Leg Pete |
| phonograph | Phonograph |
| space alien | Space alien |
| squid | Squid |
| table | Table |
| tripod | Tripod |
+--------------+--------------------+
23 rows in set (0.01 sec)
https://blog.csdn.net/liqfyiyi/article/details/50886752
5.8 更改字符串大小写失败的情况
--改变非二进制字符串大小写
set @s = 'aBcD';
select upper(@s), lower(@s);
+-----------+-----------+
| upper(@s) | lower(@s) |
+-----------+-----------+
| ABCD | abcd |
+-----------+-----------+
1 row in set (0.00 sec)--改变二进制字符串大小写
mysql> drop table if exists t;
Query OK, 0 rows affected (0.09 sec)mysql> create table t (b BLOB) select 'aBcD' as b;
Query OK, 1 row affected (0.08 sec)
Records: 1 Duplicates: 0 Warnings: 0mysql> select b, upper(b), lower(b) from t; /*upper、lower改变二进制字符串大小写不生效*/
+------------+--------------------+--------------------+
| b | upper(b) | lower(b) |
+------------+--------------------+--------------------+
| 0x61426344 | 0x61426344 | 0x61426344 |
+------------+--------------------+--------------------+
1 row in set (0.00 sec) mysql> DROP TABLE IF EXISTS t;
Query OK, 0 rows affected (0.04 sec)mysql> CREATE TABLE t (txt TEXT) SELECT 'aBcD' AS txt;
Query OK, 1 row affected (0.05 sec)
Records: 1 Duplicates: 0 Warnings: 0mysql> SELECT txt, UPPER(txt), LOWER(txt) FROM t;
+------+------------+------------+
| txt | UPPER(txt) | LOWER(txt) |
+------+------------+------------+
| aBcD | ABCD | abcd |
+------+------------+------------+
1 row in set (0.00 sec)mysql> DROP TABLE IF EXISTS t;
Query OK, 0 rows affected (0.03 sec)mysql> create table t (b BLOB) select 'aBcD' as b;
Query OK, 1 row affected (0.06 sec)
Records: 1 Duplicates: 0 Warnings: 0mysql> SELECT b,-> UPPER(CAST(b AS CHAR)),-> LOWER(CAST(b AS CHAR))-> FROM t;
+------------+------------------------+------------------------+
| b | UPPER(CAST(b AS CHAR)) | LOWER(CAST(b AS CHAR)) |
+------------+------------------------+------------------------+
| 0x61426344 | ABCD | abcd |
+------------+------------------------+------------------------+
1 row in set (0.00 sec)set @s = binary 'aBcD';
select @s, lower(@s), upper(@s); /*MySQL4.1之前生效*/
mysql> select @s, lower(@s), upper(@s);
+------------+----------------------+----------------------+
| @s | lower(@s) | upper(@s) |
+------------+----------------------+----------------------+
| 0x61426344 | 0x61426344 | 0x61426344 |
+------------+----------------------+----------------------+
1 row in set (0.00 sec)/*先转换二进制字符串到非二机制字符串,再进行大小写转换*/
mysql> select b,-> upper(convert(b using latin1)) as upper,-> lower(convert(b using latin1)) as lower-> from t;
+------------+-------+-------+
| b | upper | lower |
+------------+-------+-------+
| 0x61426344 | ABCD | abcd |
+------------+-------+-------+
1 row in set (0.00 sec)mysql> select charset(version()), charset(md5('some string'));
+--------------------+-----------------------------+
| charset(version()) | charset(md5('some string')) |
+--------------------+-----------------------------+
| utf8mb3 | gbk |
+--------------------+-----------------------------+
1 row in set (0.00 sec)mysql> select upper(version());
+------------------+
| upper(version()) |
+------------------+
| 8.0.40 |
+------------------+
1 row in set (0.00 sec)mysql> select md5('some thing'), upper(convert(md5('some thing') using latin1));
+----------------------------------+------------------------------------------------+
| md5('some thing') | upper(convert(md5('some thing') using latin1)) |
+----------------------------------+------------------------------------------------+
| 35290446ab01e3fc9ec7837f0bf84e40 | 35290446AB01E3FC9EC7837F0BF84E40 |
+----------------------------------+------------------------------------------------+
1 row in set (0.00 sec)
5.9 控制字符串比较中的大小写敏感
mysql> select 'cat' = 'cat', 'cat' = 'dpg';
+---------------+---------------+
| 'cat' = 'cat' | 'cat' = 'dpg' |
+---------------+---------------+
| 1 | 0 |
+---------------+---------------+
1 row in set (0.00 sec)mysql> select 'cat' != 'cat', 'cat' != 'dpg';
+----------------+----------------+
| 'cat' != 'cat' | 'cat' != 'dpg' |
+----------------+----------------+
| 0 | 1 |
+----------------+----------------+
1 row in set (0.00 sec)mysql> select 'cat' < 'awk', 'cat' < 'dpg';
+---------------+---------------+
| 'cat' < 'awk' | 'cat' < 'dpg' |
+---------------+---------------+
| 0 | 1 |
+---------------+---------------+
1 row in set (0.00 sec)mysql> select 'cat' between 'awk' and 'engret';
+----------------------------------+
| 'cat' between 'awk' and 'engret' |
+----------------------------------+
| 1 |
+----------------------------------+
1 row in set (0.00 sec)set @s1 = binary 'cat', @s2 = binary 'CAT';
select @s1 = @s2; /*二进制字符串比较大小写敏感,结果不等*/
set @s1 = convert(@s1 using latin1) collate latin1_swedish_ci;
set @s2 = convert(@s2 using latin1) collate latin1_swedish_ci;
select @s1 = @s2; /*转换成非二进制字符串,并指定字符集collation大小写不敏感,结果相等*//*latin1默认collation为latin1_swedish_ci*/
set @s1 = convert(@s1 using latin1);
set @s2 = convert(@s2 using latin1);
select @s1 = @s2; mysql> set @s1 = binary 'cat', @s2 = binary 'CAT';
Query OK, 0 rows affected, 2 warnings (0.00 sec)mysql> select @s1 = @s2;
+-----------+
| @s1 = @s2 |
+-----------+
| 0 |
+-----------+
1 row in set (0.00 sec)mysql> set @s1 = binary 'cat', @s2 = binary 'CAT'; //大小写不敏感时, 相等的。
Query OK, 0 rows affected, 2 warnings (0.00 sec)mysql> select @s1 = @s2;
+-----------+
| @s1 = @s2 |
+-----------+
| 0 |
+-----------+
1 row in set (0.00 sec)mysql> set @s1 = convert(@s1 using latin1) collate latin1_swedish_ci;
Query OK, 0 rows affected (0.00 sec)mysql> set @s2 = convert(@s2 using latin1) collate latin1_swedish_ci;
Query OK, 0 rows affected (0.00 sec)mysql> select @s1 = @s2;
+-----------+
| @s1 = @s2 |
+-----------+
| 1 |
+-----------+
1 row in set (0.00 sec)select @s1 collate latin1_general_cs = @s2 collate latin1_general_cs as '@s1 = @s2';
+-----------+
| @s1 = @s2 |
+-----------+
| 0 |
+-----------+
1 row in set (0.00 sec)select _latin1 'cat' = binary 'CAT';mysql> set @s1 = _latin1 'cat', @s2 = _latin1 'CAT';
Query OK, 0 rows affected (0.00 sec)mysql> select @s1 = @s2, binary @s1 = @s2, @s1 = binary @s2;
+-----------+------------------+------------------+
| @s1 = @s2 | binary @s1 = @s2 | @s1 = binary @s2 |
+-----------+------------------+------------------+
| 1 | 0 | 0 |
+-----------+------------------+------------------+
1 row in set, 2 warnings (0.00 sec)--修改表定义,改变列大小写敏感模式
-- 删除表(如果存在)
DROP TABLE IF EXISTS news;-- 创建新表
CREATE TABLE news (id INT UNSIGNED NOT NULL AUTO_INCREMENT,article TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,PRIMARY KEY(id)
);
Query OK, 0 rows affected, 2 warnings (0.05 sec)-- 修改表(正确语法)
ALTER TABLE news
MODIFY article TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
Query OK, 0 rows affected (0.05 sec)
Records: 0 Duplicates: 0 Warnings: 0
注意事项
1. utf8mb4 是 MySQL 中完整的 UTF-8 实现(支持4字节字符,如emoji)
2. utf8mb4_general_ci 是大小写不敏感的排序规则
3. 如果您需要大小写敏感的比较,可以使用 utf8mb4_bin
这样修改后,您的 article 列将会正确支持大小写不敏感的文本比较和搜索。
5.10 使用sql模式进行模式匹配
--消除可能存在的表冲突
drop table if exists metal;--创建表
create table metal (name varchar(20));--初始化表数据
insert into metal (name) values('copper'), ('gold'), ('iron'), ('lead'), ('mercury'), ('platinum'), ('silver'), ('tin');--查询初始化数据
select * from metal;--查询以特定字符开头的字符串
mysql> select name from metal where name like 'co%';
+--------+
| name |
+--------+
| copper |
+--------+
1 row in set (0.00 sec)
--查询以特定字符结尾的字符串
mysql> select name from metal where name like '%er';
+--------+
| name |
+--------+
| copper |
| silver |
+--------+
2 rows in set (0.00 sec)
--查询含有特定字符的字符串
mysql> select name from metal where name like '%er%';
+---------+
| name |
+---------+
| copper |
| mercury |
| silver |
+---------+
3 rows in set (0.00 sec)
--查询特定位置出现特定字符
mysql> select name from metal where name like '__pp%';
+--------+
| name |
+--------+
| copper |
+--------+
1 row in set (0.00 sec)
--查询不含有特定字符的字符串
mysql> select name from metal where name not like '%i%';
+---------+
| name |
+---------+
| copper |
| gold |
| lead |
| mercury |
+---------+
4 rows in set (0.00 sec)
--Null值
mysql> select null like '%', null not like '%';
+---------------+-------------------+
| null like '%' | null not like '%' |
+---------------+-------------------+
| NULL | NULL |
+---------------+-------------------+
1 row in set (0.00 sec)
5.11 使用正则表达式进行模式匹配
--模式字符与定义
模式字符 定义
^ 匹配字符串的开始部分
$ 匹配字符串的结束部分
. 匹配任何字符(包括回车和新行)
[...] 括号内任意一个字符
[^...] 除了括号内所列字符之外的任意一个字符
p1 | p2 | p3 p1、p2或p3中任意一个模式串
* 匹配0或多个*之前的任何序列
+ 匹配1或多个+之前的任何序列
{n} n个{n}之前的任何序列
{m, n} 最少m个,最多n个,{m, n}之前的任何序列--查询以特定字符开头的字符串
mysql> select name from metal where name regexp '^co';
+--------+
| name |
+--------+
| copper |
+--------+
1 row in set (0.03 sec)--查询以特定字符结尾的字符串
mysql> select name from metal where name regexp 'er$';
+--------+
| name |
+--------+
| copper |
| silver |
+--------+
2 rows in set (0.00 sec)--查询含有特定字符的字符串
mysql> select name from metal where name regexp 'er';
+---------+
| name |
+---------+
| copper |
| mercury |
| silver |
+---------+
3 rows in set (0.00 sec)--查询特定位置出现特定字符
mysql> select name from metal where name regexp '^..pp';
+--------+
| name |
+--------+
| copper |
+--------+
1 row in set (0.00 sec)--查询不含有特定字符的字符串mysql> select name from metal where name not like '%i%';
+---------+
| name |
+---------+
| copper |
| gold |
| lead |
| mercury |
+---------+
4 rows in set (0.00 sec)--posix字符集定义正则表达式
posix类 匹配定义
[:alnum:] 字符和数字
[:alpha:] 字母
[:blank:] 空格或制表符(tab)
[:cntrl:] 控制符
[:digit:] 数字
[:graph:] 图形符号(不包括空格)
[:lower:] 小写字母
[:print:] 图形符号(包括空格)
[:punct:] 标点符号
[:space:] 空格、制表符、换行、回车换行
[:upper:] 大写字母
[:xdigit:] 十六进制符(0-9, a-f, A-F)--检查是否含有十六进制字符
mysql> select name, name regexp '[[:xdigit:]]' from metal;
+----------+----------------------------+
| name | name regexp '[[:xdigit:]]' |
+----------+----------------------------+
| copper | 1 |
| gold | 1 |
| iron | 0 |
| lead | 1 |
| mercury | 1 |
| platinum | 1 |
| silver | 1 |
| tin | 0 |
+----------+----------------------------+
8 rows in set (0.00 sec)--选择性匹配
mysql> select name from metal where name regexp '^[aeiou]|er$';
+--------+
| name |
+--------+
| copper |
| iron |
| silver |
+--------+
3 rows in set (0.00 sec)--对正则表达式进行分组
select '0m' regexp '^[[:digit:]]+|[[:alpha:]]+$'; /*以数字开头或字母结尾*/
+-------------------------------------------+
| '0m' regexp '^[[:digit:]]+|[[:alpha:]]+$' |
+-------------------------------------------+
| 1 |
+-------------------------------------------+
1 row in set (0.00 sec)
select '0m' regexp '^([[:digit:]]+|[[:alpha:]]+)$'; /*完全由数字或者完全由字母组成*/
+---------------------------------------------+
| '0m' regexp '^([[:digit:]]+|[[:alpha:]]+)$' |
+---------------------------------------------+
| 0 |
+---------------------------------------------+
1 row in set (0.00 sec)--Null值
mysql> select null regexp '.*', null not regexp '.*';
+------------------+----------------------+
| null regexp '.*' | null not regexp '.*' |
+------------------+----------------------+
| NULL | NULL |
+------------------+----------------------+
1 row in set (0.00 sec)mysql> -- 创建测试表
mysql> CREATE TABLE test (str VARCHAR(100));
Query OK, 0 rows affected (0.09 sec)mysql> INSERT INTO test VALUES ('abc123'), ('ABCxyz'), ('defabc'), ('ab cde');
Query OK, 4 rows affected (0.01 sec)
Records: 4 Duplicates: 0 Warnings: 0mysql>
mysql> -- 使用REGEXP
mysql> SELECT str FROM test WHERE str REGEXP '^abc'; -- 返回 'abc123'
+--------+
| str |
+--------+
| abc123 |
| ABCxyz |
+--------+
2 rows in set (0.00 sec)mysql> SELECT str FROM test WHERE str REGEXP BINARY '^abc'; -- 返回 'abc123'(区分大小写)
ERROR 3995 (HY000): Character set 'utf8mb4_0900_ai_ci' cannot be used in conjunction with 'binary' in call to regexp_like.
mysql>
mysql> -- 使用LEFT()
mysql> SELECT str FROM test WHERE LEFT(str, 3) = 'abc'; -- 返回 'abc123'
+--------+
| str |
+--------+
| abc123 |
| ABCxyz |
+--------+
2 rows in set (0.00 sec)mysql> SELECT str FROM test WHERE BINARY LEFT(str, 3) = 'abc'; -- 返回 'abc123'(区分大小写)
+--------+
| str |
+--------+
| abc123 |
+--------+
1 row in set, 1 warning (0.00 sec)
- 默认情况下不区分大小写的问题
您观察到 REGEXP ‘^abc’ 和 LEFT(str, 3) = ‘abc’ 都匹配了 abc123 和 ABCxyz,这是因为:
• 您的表使用了默认的 utf8mb4_0900_ai_ci 排序规则(ai = accent insensitive,ci = case insensitive)
• 这种排序规则会使比较操作不区分大小写 - 使用 BINARY 的问题
当您尝试 REGEXP BINARY ‘^abc’ 时出现错误,这是因为:
• MySQL 8.0 的新正则表达式实现不支持直接与 BINARY 关键字结合使用
解决方案
方法1:使用 COLLATE 指定区分大小写的排序规则
sql
复制
下载
SELECT str FROM test WHERE str REGEXP '^abc' COLLATE utf8mb4_bin;
方法2:使用 BINARY 与 LEFT() 函数(您已成功使用)
sql
复制
下载
SELECT str FROM test WHERE BINARY LEFT(str, 3) = 'abc';
方法3:修改列或表的排序规则
sql
复制
下载
ALTER TABLE test MODIFY str VARCHAR(100) COLLATE utf8mb4_bin;
方法4:使用 LIKE 操作符(区分大小写)
sql
复制
下载
SELECT str FROM test WHERE str LIKE 'abc%' COLLATE utf8mb4_bin;
性能建议
• 对于简单的前缀匹配,LIKE 'abc%' 或 LEFT(str, 3) = 'abc' 通常比 REGEXP 性能更好
• 如果需要创建索引来提高查询性能,考虑使用前缀索引:
sql
复制
下载
CREATE INDEX idx_str_prefix ON test (str(3));
str regexp ‘1+’
在 MySQL 中,str REGEXP '^[0-9]+' 是用来检查字符串是否以一个或多个数字开头的正则表达式模式。让我详细解释它的用法和注意事项:1. 基本用法
sql
SELECT * FROM table_name WHERE str REGEXP '^[0-9]+';
正则表达式解析:
^ - 匹配字符串开头[0-9] - 匹配任意数字(0到9)+ - 匹配前面的元素一次或多次示例结果:
str 是否匹配
"123abc" ✅ 匹配
"1apple" ✅ 匹配
"a456" ❌ 不匹配
" test" ❌ 不匹配
2. 常见变体
(1) 匹配纯数字的字符串:
sql
SELECT * FROM table_name WHERE str REGEXP '^[0-9]+$';
$ 表示字符串结尾(2) 匹配特定长度的数字开头:
sql
-- 匹配以3位数字开头
SELECT * FROM table_name WHERE str REGEXP '^[0-9]{3}';
(3) 包含小数点的数字:
sql
SELECT * FROM table_name WHERE str REGEXP '^[0-9]+\\.[0-9]+';
需要双反斜杠转义小数点3. 性能优化建议
问题:
REGEXP 通常无法使用索引,全表扫描效率低优化方案:
sql
-- 方案1:使用函数索引(MySQL 8.0+)
ALTER TABLE table_name ADD INDEX ((str REGEXP '^[0-9]+'));-- 方案2:改用LEFT+RLIKE组合
SELECT * FROM table_name
WHERE LEFT(str, 1) BETWEEN '0' AND '9'
AND str RLIKE '^[0-9]+';
4. 实际应用示例
案例:找出所有以年份开头的日志条目
sql
-- 匹配2020-2029开头的字符串
SELECT log_text FROM logs
WHERE log_text REGEXP '^202[0-9]';
案例:验证产品编号格式
sql
-- 要求以3位数字开头,后跟大写字母
SELECT product_code FROM products
WHERE product_code REGEXP '^[0-9]{3}[A-Z]';
5. 注意事项
大小写敏感:默认情况下,REGEXP 不区分大小写如需区分大小写,使用:sql
SELECT * FROM table_name
WHERE str REGEXP BINARY '^[0-9]+';
特殊字符转义:sql
-- 匹配以"1.0"开头的字符串
SELECT * FROM table_name
WHERE str REGEXP '^1\\.0';
性能对比:sql
EXPLAIN SELECT * FROM large_table
WHERE str REGEXP '^[0-9]+';
-- 检查是否使用了索引
建议对大数据表使用更高效的替代方案,如存储生成列或应用层过滤。
mysql> show tables;
+----------------------+
| Tables_in_employees |
+----------------------+
| current_dept_emp |
| departments |
| dept_emp |
| dept_emp_latest_date |
| dept_manager |
| employees |
| news |
| salaries |
| t |
| test |
| titles |
+----------------------+
11 rows in set (0.00 sec)mysql> -- 查看salaries表结构
mysql> DESCRIBE salaries;
+-----------+------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+------+------+-----+---------+-------+
| emp_no | int | NO | PRI | NULL | |
| salary | int | NO | | NULL | |
| from_date | date | NO | PRI | NULL | |
| to_date | date | NO | | NULL | |
+-----------+------+------+-----+---------+-------+
4 rows in set (0.02 sec)mysql>
mysql> -- 查看employees表结构
mysql> DESCRIBE employees;
+------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------------+------+-----+---------+-------+
| emp_no | int | NO | PRI | NULL | |
| birth_date | date | NO | | NULL | |
| first_name | varchar(14) | NO | | NULL | |
| last_name | varchar(16) | NO | | NULL | |
| gender | enum('M','F') | NO | | NULL | |
| hire_date | date | NO | | NULL | |
+------------+---------------+------+-----+---------+-------+
6 rows in set (0.00 sec)mysql>
mysql> -- 查看departments表结构
mysql> DESCRIBE departments;
+-----------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-------------+------+-----+---------+-------+
| dept_no | char(4) | NO | PRI | NULL | |
| dept_name | varchar(40) | NO | UNI | NULL | |
+-----------+-------------+------+-----+---------+-------+
2 rows in set (0.00 sec)解决方案
1. 首先查看表结构
sql
复制
下载
-- 查看salaries表结构
DESCRIBE salaries;-- 查看employees表结构
DESCRIBE employees;-- 查看departments表结构
DESCRIBE departments;
2. 查找可能包含数字的列
例如,在 employees 表中查找 emp_no 是否为纯数字:
sql
复制
下载
SELECT emp_no FROM employees
WHERE emp_no REGEXP '^[0-9]+$';
3. 实际应用示例
如果您想查找特定模式的列值:
sql
复制
下载
-- 在departments表中查找dept_no是否为纯数字
SELECT dept_no FROM departments
WHERE dept_no REGEXP '^[0-9]+$';-- 在employees表中查找first_name是否意外包含数字
SELECT first_name FROM employees
WHERE first_name REGEXP '[0-9]';
- 更智能的查询
mysql> -- 查找所有表中可能包含数字的字符串列
mysql> SELECT-> CONCAT('SELECT "',-> table_name, '.', column_name,-> '" AS source, COUNT(*) AS invalid_rows FROM ',-> table_name, ' WHERE ', column_name,-> ' REGEXP ''[0-9]'' AND ', column_name,-> ' NOT REGEXP ''^[0-9]+$'';') AS check_query-> FROM-> information_schema.columns-> WHERE-> table_schema = 'employees'-> AND data_type IN ('varchar', 'char', 'text')-> ORDER BY-> table_name, column_name;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| check_query |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SELECT "current_dept_emp.dept_no" AS source, COUNT(*) AS invalid_rows FROM current_dept_emp WHERE dept_no REGEXP '[0-9]' AND dept_no NOT REGEXP '^[0-9]+$'; |
| SELECT "departments.dept_name" AS source, COUNT(*) AS invalid_rows FROM departments WHERE dept_name REGEXP '[0-9]' AND dept_name NOT REGEXP '^[0-9]+$'; |
| SELECT "departments.dept_no" AS source, COUNT(*) AS invalid_rows FROM departments WHERE dept_no REGEXP '[0-9]' AND dept_no NOT REGEXP '^[0-9]+$'; |
| SELECT "dept_emp.dept_no" AS source, COUNT(*) AS invalid_rows FROM dept_emp WHERE dept_no REGEXP '[0-9]' AND dept_no NOT REGEXP '^[0-9]+$'; |
| SELECT "dept_manager.dept_no" AS source, COUNT(*) AS invalid_rows FROM dept_manager WHERE dept_no REGEXP '[0-9]' AND dept_no NOT REGEXP '^[0-9]+$'; |
| SELECT "employees.first_name" AS source, COUNT(*) AS invalid_rows FROM employees WHERE first_name REGEXP '[0-9]' AND first_name NOT REGEXP '^[0-9]+$'; |
| SELECT "employees.last_name" AS source, COUNT(*) AS invalid_rows FROM employees WHERE last_name REGEXP '[0-9]' AND last_name NOT REGEXP '^[0-9]+$'; |
| SELECT "news.article" AS source, COUNT(*) AS invalid_rows FROM news WHERE article REGEXP '[0-9]' AND article NOT REGEXP '^[0-9]+$'; |
| SELECT "test.dept_no" AS source, COUNT(*) AS invalid_rows FROM test WHERE dept_no REGEXP '[0-9]' AND dept_no NOT REGEXP '^[0-9]+$'; |
| SELECT "titles.title" AS source, COUNT(*) AS invalid_rows FROM titles WHERE title REGEXP '[0-9]' AND title NOT REGEXP '^[0-9]+$'; |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows in set (0.00 sec)
然后对找到的列应用正则表达式查询。
注意事项
不是所有列都适合用正则表达式检查
主键/外键通常是数字,但可能存储为字符串
大表上的正则查询可能很慢
以下是详细分析和建议:
数据质量检查执行方案
- 部门相关表检查
mysql> -- 检查部门编号格式(是否混用字母数字)
mysql> SELECT "departments.dept_no" AS source,-> COUNT(*) AS mixed_rows-> FROM departments-> WHERE dept_no REGEXP '[0-9]'-> AND dept_no NOT REGEXP '^[0-9]+$';
+---------------------+------------+
| source | mixed_rows |
+---------------------+------------+
| departments.dept_no | 9 |
+---------------------+------------+
1 row in set (0.00 sec)mysql>
mysql> -- 检查部门名称是否包含数字
mysql> SELECT "departments.dept_name" AS source,-> dept_name AS sample_data,-> COUNT(*) AS numeric_rows-> FROM departments-> WHERE dept_name REGEXP '[0-9]'-> GROUP BY dept_name;
Empty set (0.00 sec)
- 员工姓名检查(重点检查)
mysql> -- 检查名字中的数字(不应出现)
mysql> SELECT "employees.first_name" AS source,-> first_name AS invalid_data,-> COUNT(*) AS error_count-> FROM employees-> WHERE first_name REGEXP '[0-9]'-> GROUP BY first_name;
Empty set (0.53 sec)mysql>
mysql> -- 检查姓氏中的数字(不应出现)
mysql> SELECT "employees.last_name" AS source,-> last_name AS invalid_data,-> COUNT(*) AS error_count-> FROM employees-> WHERE last_name REGEXP '[0-9]'-> GROUP BY last_name;
Empty set (0.20 sec)
3. 职位标题检查```python
mysql> -- 检查职位名称中的数字(合理情况如"工程师II")
mysql> SELECT "titles.title" AS source,-> title AS mixed_data,-> COUNT(*) AS count-> FROM titles-> WHERE title REGEXP '[0-9]'-> GROUP BY title-> ORDER BY count DESC;
Empty set (1.05 sec)
高级检查技巧
- 提取字符串中的数字部分
mysql> SELECT-> first_name,-> REGEXP_REPLACE(first_name, '[^0-9]', '') AS extracted_numbers-> FROM employees-> WHERE first_name REGEXP '[0-9]';
Empty set (0.20 sec)
- 检查数字位置
mysql> -- 查找数字不在开头的情况
mysql> SELECT dept_no-> FROM departments-> WHERE dept_no REGEXP '[0-9][^0-9]|[^0-9][0-9]';
+---------+
| dept_no |
+---------+
| d009 |
| d005 |
| d002 |
| d003 |
| d001 |
| d004 |
| d006 |
| d008 |
| d007 |
+---------+
9 rows in set (0.00 sec)
- 综合数据质量报告
mysql> SELECT-> 'departments.dept_no' AS column_name,-> COUNT(*) AS total_rows,-> SUM(dept_no REGEXP '^[0-9]+$') AS pure_numeric,-> SUM(dept_no REGEXP '[0-9]' AND dept_no NOT REGEXP '^[0-9]+$') AS mixed_alphanumeric,-> SUM(NOT dept_no REGEXP '[0-9]') AS no_numbers-> FROM departments-> UNION ALL-> SELECT-> 'employees.first_name',-> COUNT(*),-> SUM(first_name REGEXP '^[0-9]+$'),-> SUM(first_name REGEXP '[0-9]' AND first_name NOT REGEXP '^[0-9]+$'),-> SUM(NOT first_name REGEXP '[0-9]')-> FROM employees;
+----------------------+------------+--------------+--------------------+------------+
| column_name | total_rows | pure_numeric | mixed_alphanumeric | no_numbers |
+----------------------+------------+--------------+--------------------+------------+
| departments.dept_no | 9 | 0 | 9 | 0 |
| employees.first_name | 300024 | 0 | 0 | 300024 |
+----------------------+------------+--------------+--------------------+------------+
2 rows in set (0.53 sec)
执行建议
先对小型表执行检查,确认查询效果
对大表添加LIMIT子句先采样检查
考虑创建临时表存储检查结果:
sql
CREATE TABLE data_quality_results AS
SELECT 'employees.first_name' AS column_name, first_name AS sample_data
FROM employees WHERE first_name REGEXP '[0-9]' LIMIT 100;
数据质量分析报告
1. departments.dept_no (部门编号)
总行数: 9纯数字编号: 0 (0%)字母数字混合: 9 (100%)不含数字: 0 (0%)🔍 发现: 所有部门编号都是字母数字混合格式,没有纯数字的部门编号。这是否符合预期设计?2. employees.first_name (员工名)
总行数: 300,024纯数字名字: 0 (0%)含数字的名字: 0 (0%)不含数字: 300,024 (100%)✅ 验证通过: 所有员工名字都不包含数字,符合姓名字段的数据质量要求。
深入建议
对于部门编号的进一步检查
mysql> -- 查看部门编号的具体格式
mysql> SELECT-> dept_no,-> LENGTH(dept_no) AS length,-> dept_no REGEXP '^[A-Za-z][0-9]+$' AS format_check-> FROM departments;
+---------+--------+--------------+
| dept_no | length | format_check |
+---------+--------+--------------+
| d009 | 4 | 1 |
| d005 | 4 | 1 |
| d002 | 4 | 1 |
| d003 | 4 | 1 |
| d001 | 4 | 1 |
| d004 | 4 | 1 |
| d006 | 4 | 1 |
| d008 | 4 | 1 |
| d007 | 4 | 1 |
+---------+--------+--------------+
9 rows in set (0.00 sec)mysql> -- 检查last_name和title字段
mysql> SELECT-> 'employees.last_name' AS column_name,-> COUNT(*) AS total,-> SUM(last_name REGEXP '[0-9]') AS with_numbers-> FROM employees-> UNION ALL-> SELECT-> 'titles.title',-> COUNT(*),-> SUM(title REGEXP '[0-9]')-> FROM titles;
+---------------------+--------+--------------+
| column_name | total | with_numbers |
+---------------------+--------+--------------+
| employees.last_name | 300024 | 0 |
| titles.title | 443308 | 0 |
+---------------------+--------+--------------+
2 rows in set (0.58 sec)
数据治理建议
部门编号规范:
确认字母数字混合的设计是否符合业务规则考虑添加检查约束确保一致性:sql
ALTER TABLE departments
ADD CONSTRAINT chk_dept_no_format
CHECK (dept_no REGEXP '^[A-Za-z][0-9]{3}$');
姓名字段保护:保持现有的数据清洁状态添加应用层验证防止未来输入数字定期质量检查:sql
-- 创建定期执行的检查视图
CREATE VIEW data_quality_metrics AS
SELECT ... [上述查询] ...;数据质量深度分析报告
1. 部门编号(dept_no)格式验证
部门编号 长度 符合格式检查
d001-d009 4 全部通过 ✅
🔍 关键发现:所有部门编号严格遵循"1个字母+3个数字"的格式(如d001)长度统一为4个字符符合正则表达式 ^[A-Za-z][0-9]{3}$ 的验证2. 员工姓氏(last_name)检查
总记录数: 300,024包含数字的记录: 0 (完全清洁) ✅3. 职位标题(title)检查
总记录数: 443,308包含数字的记录: 0 (完全清洁) ✅
数据治理强化建议
- 部门编号标准化
mysql> -- 添加正式约束确保未来数据质量
mysql> ALTER TABLE departments-> ADD CONSTRAINT chk_dept_no_pattern-> CHECK (dept_no REGEXP '^d[0-9]{3}$');
Query OK, 9 rows affected (0.27 sec)
Records: 9 Duplicates: 0 Warnings: 0mysql>
mysql> -- 验证约束效果
mysql> INSERT INTO departments VALUES ('dx12', 'Invalid Dept'); -- 应失败
ERROR 3819 (HY000): Check constraint 'chk_dept_no_pattern' is violated.
- 建立数据质量监控视图
mysql> CREATE VIEW data_quality_monitor AS-> SELECT-> 'departments' AS table_name,-> 'dept_no' AS column_name,-> COUNT(*) AS total,-> SUM(dept_no REGEXP '^d[0-9]{3}$') AS valid_count,-> ROUND(SUM(dept_no REGEXP '^d[0-9]{3}$')/COUNT(*)*100,2) AS valid_percent-> FROM departments-> UNION ALL-> SELECT-> 'employees',-> 'last_name',-> COUNT(*),-> SUM(NOT last_name REGEXP '[0-9]'),-> 100.00-> FROM employees;
Query OK, 0 rows affected (0.02 sec)
- 自动化检测脚本
bash
#!/bin/bash
# 每月自动运行数据质量检查
mysql -e "SELECT * FROM employees.data_quality_monitor" > /reports/$(date +%Y%m)_data_quality.txt
异常处理方案
虽然当前数据质量完美,建议建立防御机制:
sql
-- 1. 创建异常记录表
CREATE TABLE data_quality_exceptions (id INT AUTO_INCREMENT PRIMARY KEY,table_name VARCHAR(50),column_name VARCHAR(50),invalid_value VARCHAR(255),check_type VARCHAR(100),detected_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);-- 2. 设置触发器捕获未来异常
DELIMITER //
CREATE TRIGGER check_last_name_before_insert
BEFORE INSERT ON employees
FOR EACH ROW
BEGINIF NEW.last_name REGEXP '[0-9]' THENINSERT INTO data_quality_exceptions (table_name, column_name, invalid_value, check_type)VALUES ('employees', 'last_name', NEW.last_name, 'numeric_check');END IF;
END//
DELIMITER ;
0-9 ↩︎