Freaking MySQL Character Set Encodings

Derek Sivers‘ plan, with all it’s bin2hex and regexp and back and forth between MySQL and PHP almost looks good compared to what I’m about to do. Really, why is it so difficult to go from latin1 (tables created back in MySQL 3) to utf8? Not only do you have to set the charset on the table, but also the connection, in PHP, and flipping everywhere. And then you’ve gotta deal with all this old data that’s in the wrong character set.

I got all excited when I learned of CONVERT() and CAST(), and felt completely betrayed when they turned out to be basically useless to me.

CONVERT() with USING is used to convert data between different character sets. In MySQL, transcoding names are the same as the corresponding character set names. For example, this statement converts the string 'abc' in the default character set to the corresponding string in the utf8 character set:

SELECT CONVERT('abc' USING utf8);

Thing is, I can’t figure out where it’s inheriting “the default character set,” and I can’t seem to come up with a combination of settings that will do the job. This old doc looked promising at first, but quickly let me down.

Ack. It looks like I’m going to have to read the contents of a table using PHP and latin1 connection encoding, then insert the data into a duplicate table using utf8 connection encoding. Either that, or Derek Sivers’ 28 step process.

In the interest of documenting even the things that don’t work: I had high hopes for this, but it doesn’t do anything useful.

if($tables = $wpdb->get_col('SHOW TABLES;')) {
	// For every table in the database
	foreach($tables as $table) {
		$changes = array();
 
		// create the new table, set it to UTF8
		$changes[] = “CREATE TABLE `conv_$table` LIKE `$table`”;
		$changes[] = “ALTER TABLE `conv_$table` DEFAULT CHARACTER SET=utf8”;
 
		// copy the data into the new table, transcoding to UTF8
		$fields = $wpdb->get_results(“DESCRIBE {$table};);			
		$allfields = $selectfields = array();
		foreach($fields as $field){
			$allfields[] = $field->Field;
			$selectfields[] = (stripos($field->Type, 'char') || stripos($field->Type, 'text')) ? “CONVERT(`$field->Field` USING utf8): “`$field->Field`”;
		}
		$changes[] = “INSERT INTO `conv_$table` (`”. implode($allfields, '`, `') .“`) SELECT ”. implode($selectfields, ', ') .“ FROM $table”;
 
		// move the old table aside and put the new one in place
		$changes[] =RENAME TABLE `$table` TO `preutf8_$table`”;
		$changes[] =RENAME TABLE `conv_$table` TO `$table`”;
 
		// echo out the mysql commands
		echo<h2 id="11972_changes-to-table-tab_1" >Changes to table <code>$table</code>:</h2>;
		echo implode($changes,;<br />\n”);
	}
}

mysql, character set, encoding, utf8 conversion

2 thoughts on “Freaking MySQL Character Set Encodings

  1. Hi!

    You should write
    (stripos($field->Type, ‘char’) !== false || stripos($field->Type, ‘text’) !== false)
    instead of
    (stripos($field->Type, ‘char’) || stripos($field->Type, ‘text’))

    because stripos($field->Type, ‘text’) returns 0 with mysql field type text

    But nice code anyway!

  2. Pingback: Dummy’s guide for setting/converting character set for a web application « memento

Comments are closed.