mysql database with latin1 charset and utf8 data

I know that it's 2021, but we are still having problems with encoding in mysql (MariaDB in this cane, but problem is smilar). This time, it's application which I inherited which saves utf-8 data into database which is declared as latin1.

How can you check if this is problem with your database too?

MariaDB [ompdb]> show create database ompdb ;
+----------+------------------------------------------------------------------+
| Database | Create Database                                                  |
+----------+------------------------------------------------------------------+
| ompdb    | CREATE DATABASE `ompdb` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+------------------------------------------------------------------+

Alternative way is to invoke mysqldump ompdb and example file generated. Why is this a problem? If we try SQL query on one of tables:

MariaDB [ompdb]> select * from submission_search_keyword_list where keyword_text like 'al%ir' ;
+------------+--------------+
| keyword_id | keyword_text |
+------------+--------------+
|       3657 | alzir        |
|       1427 | alÅ¾ir       |
+------------+--------------+

You can clearly see double-encoded utf8 which should be alžir. This is because our client is connecting using utf8 charset, getting utf8 data in binary form so we see double-encoding. So we can try to conntect using latin1 with:

root@omp-clone:/home/dpavlin# mysql --default-character-set=latin1 ompdb
MariaDB [ompdb]> select * from submission_search_keyword_list where keyword_text like 'al%ir' ;
+------------+--------------+
| keyword_id | keyword_text |
+------------+--------------+
|       3657 | alzir        |
|       1427 | alžir       |
+------------+--------------+

Note that everything is still not well, because grid after our utf8 data is not aligned well.

Googling around, you might find that possible solution is to add --default-character-set=latin1 to mysqldump, edit all occurrences of latin1 to utf8 (utf8mb4 is better choice) and reload database, and problem is solved, right?

If we try to do that, we will get following error:

ERROR 1062 (23000) at line 1055 in file: '1.sql': Duplicate entry 'alžir' for key 'submission_search_keyword_text'

Why is this? MySQL uses collation setting to remove accents from data, so it treats alzir and alžir as same string. Since we have both of them in our data, this is not good enough. Also, editing database manually always makes me nervous, so we will using following to get database dump without declaration of encoding (due to --skip-opt option), but using latin1 for dumping data:

mysqldump ompdb --skip-set-charset --default-character-set=latin1 --skip-opt > /tmp/1.sql

Next, we need to create database with collation which preserves everything (utf8mb4_bin) using:

CREATE DATABASE omp2 CHARACTER SET = 'utf8mb4' COLLATE 'utf8mb4_bin' ;

Finally we should be able to reload created dump without errors:

mysql omp2 < /tmp/1.sql

One additional benefit of using --skip-opt for mysqldump is that every insert is split into individual line. So if you want to have correct collation and skip data which is invalid (which might be possible depending on where data is) you can use same mysqldump file and add -f flag when reloading dump like mysql -f omp2 < /tmp/1.sql which will report data that have errors, but insert everything else into database.