utf8

What is the difference utf8_unicode_ci and utf8_general_ci?

From the MySQL manual:

For any Unicode character set, operations performed using the xxx_general_ci collation are faster than those for the xxx_unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci.

They have a amusing “examples of the effect of collation” set on “sorting German umlauts,” but it unhelpfully uses latin1_* collations. And another table that helpfully explains:

A difference between the collations is that this is true for utf8_general_ci:

ß = s

Whereas this is true for utf8_unicode_ci, which supports the German DIN-1 ordering (also known as dictionary order):

ß = ss

This forum post adds more info, but nowhere do they explain how a ☃ sorts against ☁ or ⛅.

How much faster is utf8_general_ci than utf8_unicode_ci, though? An August 2010 message in the MySQL forums seems to suggest the performance for specific operations could be 30% faster, but then dismisses the performance difference as unimportant compared to good indexing and writing efficient queries.

The Difference Between MySQL’s utf8_unicode_ci and. utf8_general_ci Collations

MySQL answer: utf8_unicode_ci vs. utf8_general_ci.

Collation controls sorting behavior. Unicode rationalizes the character set, but doesn’t, on it’s own, rationalize sorting behavior for all the various languages it supports. utf8_general_ci (ci = case insensitive) is apparently a bit faster, but sloppier, and only appropriate for English language data sets.

Converting MySQL Character Sets

This Gentoo Wiki page suggests dumping the table and using iconv to convert the characters, then insert the dump into a new table with the new charset.

Alex King solved a different problem: his apps were talking UTF8, but his tables were Latin1. His solution was to dump the tables, change the charset info in the dump file, then re-insert the contents.