Menu

Speeding up MySQL joins on tables with TEXT columns, maybe

The thing about WordPress’ DB schema is that TEXT and VARCHAR content is mixed in the posts table (to say nothing of the frustrations of DATETIME columns). That’s not such a problem for a blog with a few hundred posts, but it’s a different matter when you have a few hundred thousand posts. And it wouldn’t even be a problem then, except for this quirk in MySQL:

Instances of BLOB or TEXT columns in the result of a query that is processed using a temporary table causes the server to use a table on disk rather than in memory because the MEMORY storage engine does not support those data types (see Section 8.4.3.3, “How MySQL Uses Internal Temporary Tables”). Use of disk incurs a performance penalty, so include BLOB or TEXT columns in the query result only if they are really needed. For example, avoid using SELECT *, which selects all columns.

Please note that this is not the same as using filesort, there’s no setting for max memory or heap table size that will fix it, and it’s a frustration that grows with your tables. Simply put, if you do a join on a table with a TEXT column on it, you’re going to end up with a temporary table on disk along the way.

That’s when Matthew Yonkovit points out what you might already be thinking:

If your using lots of text fields, doing lots of sorting, group by’s, etc you may see a nice performance boost by pointing over your tmpdir to tmpfs.

And at 2bits you’ll find graphical proof, “note how many slowqueries per second before and after the change:”

mysql-ramdisk-mysql_slowqueries-day

But in the MySQL docs I found:

If the MySQL server is acting as a replication slave, you should not set –tmpdir to point to a directory on a memory-based file system or to a directory that is cleared when the server host restarts. A replication slave needs some of its temporary files to survive a machine restart so that it can replicate temporary tables or LOAD DATA INFILE operations. If files in the temporary file directory are lost when the server restarts, replication fails.

And if I stopped there I’d feel lost, but then I found this from Matt Reid:

there is an option for slaves to use their own tmpdir location. If you read the rest of the manual… http://dev.mysql.com/doc/refman/5.0/en/replication-options-slave.html#option_mysqld_slave-load-tmpdir

So does any of this work? I don’t know yet.

Whatever performance I do gain, it’ll likely just be a stop-gap. The real solution to the complex joins I’m doing (this is why, here’s a small implementation) will be to move them out of MySQL and into something like Sphinx.