HTML and regular expressions

Posted by cafuego on Tuesday 15 November 2011.

I sort of dislike regular expressions. They're usually annoying to read and not infrequently incomprehensible without reading a book (or two) first. Still, they're useful.

I wanted to apply some regex magic to HTML content, to change words that are not part of the HTML markup. However, anecdotally, regular expressions don't play well with HTML. Still, I wasn't interested in the tags, only what is between them, so it struck me as a not impossible task.

Eventually, I found two things: a regex that matches all content that is not an HTML tag, (?<=^|>)([^><]+?)(?=<|$), and the PHP preg_replace_callback() function.
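
To see what that pattern picks out, here is a quick sketch using preg_match_all(). The sample string and variable names are made up for illustration. The lookbehind (?<=^|>) requires a match to start at the beginning of the string or right after a >, [^><]+? consumes characters that are neither angle bracket, and the lookahead (?=<|$) requires the match to stop just before a < or at the end of the string.

```php
<?php
// Hypothetical sample input, not from the original post.
$html = '<p>Hello <b>big</b> world</p>';

// Collect every run of text that sits between tags.
preg_match_all('/(?<=^|>)([^><]+?)(?=<|$)/', $html, $found);

// $found[0] holds the full matches: "Hello ", "big" and " world".
print_r($found[0]);
```

Note that this assumes every < and > in the input belongs to a tag. A literal > inside an attribute value, a comment or a script block would likely confuse it.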

Using these, I made a snippet that does what I need; it replaces words in HTML content, leaving all tags unaffected:

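  $output = preg_replace_callback(
    // Match each run of text that sits between tags (or at either end of the string).
    '/(?<=^|>)([^><]+?)(?=<|$)/',
    // An anonymous function replaces create_function(), which was removed in PHP 8.0.
    function ($matches) {
      // Replace each word-boundary-delimited run of non-space characters with "Word".
      return preg_replace('/\b([^\s]+)\b/', 'Word', $matches[0]);
    },
    $string);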

This extracts all substrings of $string that are not HTML tags. It then passes each of these substrings to a function that replaces each run of contiguous non-space characters (delimited by word boundaries) with the string "Word".
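
The same idea can be wrapped in a small helper that takes any callback, which makes it easy to check that the tags really are left alone. This is only a sketch: the function name replace_outside_tags(), the $wordify variable and the sample input are all hypothetical, and it assumes PHP 5.3 or later for anonymous functions.

```php
<?php
// Hypothetical helper: applies $transform to every run of text outside tags.
function replace_outside_tags($html, $transform) {
  return preg_replace_callback(
    '/(?<=^|>)([^><]+?)(?=<|$)/',
    function ($matches) use ($transform) {
      // $matches[0] is one run of text between two tags.
      return $transform($matches[0]);
    },
    $html);
}

// The replacement from the snippet above: each word becomes "Word".
$wordify = function ($text) {
  return preg_replace('/\b([^\s]+)\b/', 'Word', $text);
};

$output = replace_outside_tags('<p>Hello <b>big</b> world</p>', $wordify);
echo $output, "\n"; // <p>Word <b>Word</b> Word</p>

// Any other text transformation works the same way; the tag names stay lowercase.
echo replace_outside_tags('<p>Hello <b>big</b> world</p>', 'strtoupper'), "\n";
// <p>HELLO <b>BIG</b> WORLD</p>
```

Passing the transformation in as a parameter keeps the tag-skipping regex in one place, so only the per-text callback changes between uses.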

Of course, I plan to use this for evil.

Update: Hurray, the patch was accepted into misery.module :-)
