I sort of dislike regular expressions. They're usually annoying to read and not infrequently incomprehensible without reading a book (or two) first. Still, they're useful.
I wanted to apply some regex magic to HTML content, to change words that are not part of the HTML markup. However, anecdotally, regular expressions don't play well with HTML. Still, I wasn't interested in the tags, only what is between them, so it struck me as a not impossible task.
Eventually, I found two things: a regex that matches all content that is not part of an HTML tag, (?<=^|>)([^><]+?)(?=<|$), and the PHP function preg_replace_callback().
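To see what that pattern actually picks up, here is a small sketch that runs it through preg_match_all(). The sample markup in $html is my own made-up input, not something from the original use case.

```php
<?php
// Hypothetical sample input, used for illustration only.
$html = '<p>Hello <b>big</b> world</p>';

// The pattern from the post: a run of characters that are neither "<" nor ">",
// preceded by ">" (or the start of the string) and followed by "<" (or the end).
preg_match_all('/(?<=^|>)([^><]+?)(?=<|$)/', $html, $matches);

// $matches[0] now holds the text between the tags:
// ['Hello ', 'big', ' world']
print_r($matches[0]);
```

Note that the whitespace around the words is part of each match, so whatever processes the matches has to cope with leading and trailing spaces.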
Using these, I made a snippet that does what I need; it replaces words in HTML content, leaving all tags unaffected:
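$output = preg_replace_callback(
    // Match every stretch of text that sits between tags.
    '/(?<=^|>)([^><]+?)(?=<|$)/',
    // Replace each word inside that stretch, leaving whitespace alone.
    function ($matches) {
        return preg_replace('/\b([^\s]+)\b/', 'Word', $matches[0]);
    },
    $string
);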
This extracts all substrings that are not HTML tags from $string. It then passes each of these substrings to a function that replaces every run of consecutive non-space characters with the string "Word".
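As a worked example of that two-step process, here is a minimal sketch wrapped in a function. The name replace_words_outside_tags(), its $replacement parameter, and the sample input are all hypothetical, introduced only for illustration.

```php
<?php
// Hypothetical helper (not from the original snippet): replaces every word
// outside HTML tags with $replacement and leaves the tags untouched.
function replace_words_outside_tags($html, $replacement = 'Word')
{
    return preg_replace_callback(
        // Step 1: find each stretch of text between tags.
        '/(?<=^|>)([^><]+?)(?=<|$)/',
        // Step 2: within that stretch, swap each word for the replacement.
        function ($matches) use ($replacement) {
            return preg_replace('/\b([^\s]+)\b/', $replacement, $matches[0]);
        },
        $html
    );
}

// Made-up sample input.
echo replace_words_outside_tags('<p>Hello <b>big</b> world</p>');
// prints: <p>Word <b>Word</b> Word</p>
```

One thing to keep in mind: the outer pattern treats anything between a ">" and a "<" as text, so the contents of script or style elements would be rewritten too.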
Of course, I plan to use this for evil.
Update: Hurray, the patch was accepted into misery.module :-)
Comments
thank you!
awesome regexp, you save my day