regex

Why PHP’s RegEx Is Slow, And What You Can Do About It (if you happen to be a committer on the PHP project)

From “Regular Expression Matching Can Be Simple And Fast,” by Russ Cox:

“Perl [and PHP and others] could not now remove backreference support, of course, but they could employ much faster algorithms when presented with regular expressions that don’t have backreferences.”

How much faster? About a million times on the pathological pattern Cox benchmarks (no, I do not exaggerate).

I use a lot of regular expressions, and relatively few of them use backreferences. It’d be worth optimizing.
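To get a feel for the slowdown Cox describes, here is a minimal sketch in Python rather than PHP. The assumption is that Python's re module, like PHP's, is a backtracking matcher, so it should show the same kind of blowup. The pattern family (n copies of a? followed by n copies of a, matched against n copies of a) is the example from Cox's article; the helper names are hypothetical, and the timings will vary by machine.

```python
import re
import time


def pathological(n):
    """Build Cox's test case: the pattern a?...a?a...a and the text a...a."""
    pattern = "a?" * n + "a" * n
    text = "a" * n
    return pattern, text


def time_match(n):
    """Return (matched, seconds) for one match attempt of size n."""
    pattern, text = pathological(n)
    start = time.perf_counter()
    matched = re.fullmatch(pattern, text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: on a backtracking engine the work can roughly
    # double with each step up in n.
    for n in (5, 10, 15, 18):
        matched, seconds = time_match(n)
        print(f"n={n:2d} matched={matched} seconds={seconds:.4f}")
```

Note that the pattern has no backreferences at all, which is exactly the case Cox argues could be handed to a faster algorithm.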

Matching Multi-line Regex in BBEdit

I love BBEdit on my Mac, but I was left scratching my head again today when I was trying to remember how to make its regex engine match a pattern across multiple lines. My hope was to extract a list of initial articles from a page that had HTML like this:

<table>
  <tr>
    <td valign="top" colspan="34" align="left">
      am
    </td>
    <td valign="top" colspan="10" align="left">
      Scottish Gaelic
    </td>
  </tr>
</table>

<table>
  <tr>
    <td valign="top" colspan="34" align="left">
      an
    </td>
    <td valign="top" colspan="10" align="left">
      English,
    </td>
    <td valign="top" colspan="10" align="left">
      Irish,
    </td>
    <td valign="top" colspan="10" align="left">
      Scots,
    </td>
    <td valign="top" colspan="10" align="left">
      Scottish Gaelic,
    </td>
    <td valign="top" colspan="10" align="left">
      Yiddish
    </td>
  </tr>
</table>

<table>
  <tr>
    <td valign="top" colspan="34" align="left">
      an t-
    </td>
    <td valign="top" colspan="10" align="left">
      Irish,
    </td>
    <td valign="top" colspan="10" align="left">
      Scottish Gaelic
    </td>
  </tr>
</table>

The page has well over 100 tables like these, and I wanted the contents of the first TD in each. The following regex does it:

(?s)[^<]*<table>[^<]*<tr>[^<]*<td[^>]*>([^<]*)</td>.*?</table>

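The most significant part of this is the (?s) at the beginning, which tells BBEdit to let the dot match line breaks as well, so the .*? near the end can run across lines to the closing </table>. (The [^<]* pieces are negated character classes, which already match line breaks on their own.) A more ninja-like regex assassin would probably be able to do it better, but this worked.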
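For anyone who wants to try the same pattern outside BBEdit, here is a rough sketch in Python, whose re module also accepts an inline (?s) at the start of a pattern. The function name first_cells and the trimmed-down sample are made up for illustration; the regex itself is the one above, unchanged.

```python
import re

# The regex from the post. (?s) lets "." match line breaks, which the
# lazy ".*?" needs in order to reach the closing </table>.
FIRST_CELL = re.compile(
    r"(?s)[^<]*<table>[^<]*<tr>[^<]*<td[^>]*>([^<]*)</td>.*?</table>"
)


def first_cells(html):
    """Return the whitespace-trimmed contents of the first TD in each table."""
    return [m.group(1).strip() for m in FIRST_CELL.finditer(html)]


# A shortened, hypothetical sample in the same shape as the page's HTML.
sample = """<table>
  <tr>
    <td valign="top" colspan="34" align="left">
      am
    </td>
    <td valign="top" colspan="10" align="left">
      Scottish Gaelic
    </td>
  </tr>
</table>

<table>
  <tr>
    <td valign="top" colspan="34" align="left">
      an
    </td>
    <td valign="top" colspan="10" align="left">
      English,
    </td>
  </tr>
</table>
"""

if __name__ == "__main__":
    for cell in first_cells(sample):
        print(cell)
```

The capture group picks up the indentation and line breaks around each article, hence the strip(). Dropping the (?s) makes the pattern find nothing in this sample, because the .*? can then no longer cross the line ends between the first </td> and </table>.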