
Question to html purifier experts

Posted by Dmitri 
December 23, 2009 01:18PM


I am thinking of writing something like HTML Purifier, and I would like your opinion on what is wrong with doing it this way, in 3 basic steps:

1) Convert the input to UTF-8. If the charset is already reported as UTF-8 or ASCII, validate that it contains no illegal characters and strip the low control characters (except tab, newline and space). If the UTF-8 does not pass the well-formedness test, recode it using iconv from UTF-8 to UTF-8 with the //IGNORE flag, or similarly use mb_convert_encoding with "none" as the substitute character. If the charset is anything other than UTF-8, convert it to UTF-8 using utf8_encode, iconv or mb_convert_encoding, depending on which extension is available; of course, utf8_encode only works on Latin-1 input.

2) Fix the HTML string by running it through Tidy. This takes care of closing unclosed tags and can also remove garbage from HTML created by MS Word. It also wraps an HTML fragment in the necessary <html><body></body></html>.

3) The actual stripping of the dangerous tags: do it with PHP's DOMDocument/DOMElement classes. Since by now we have a valid UTF-8 string and fixed HTML, we should be able to load the document into DOMDocument without problems.
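Steps 1 and 2 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not HTML Purifier's code: the function names to_clean_utf8 and repair_html are hypothetical, the mbstring path is used rather than iconv, keeping carriage returns is an extra assumption, and the Tidy options shown are only one plausible configuration.

```php
<?php
// Sketch only: the function names are hypothetical, not part of HTML Purifier.

// Step 1: return $input as well-formed UTF-8 with low control characters removed.
function to_clean_utf8(string $input, string $charset = 'UTF-8'): string
{
    $charset = strtoupper($charset);
    if (in_array($charset, ['UTF-8', 'UTF8', 'ASCII', 'US-ASCII'], true)) {
        if (!mb_check_encoding($input, 'UTF-8')) {
            // With the substitute character set to "none", mbstring drops
            // invalid byte sequences instead of replacing them with "?".
            $previous = mb_substitute_character();
            mb_substitute_character('none');
            $input = mb_convert_encoding($input, 'UTF-8', 'UTF-8');
            mb_substitute_character($previous);
        }
    } else {
        // Assumes $charset is a name mbstring recognises, e.g. "ISO-8859-1".
        $input = mb_convert_encoding($input, 'UTF-8', $charset);
    }
    // Strip bytes 0x00-0x1F except tab, LF and CR (keeping CR is an assumption).
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $input) ?? '';
}

// Step 2: let Tidy close unclosed tags and wrap the fragment in a full document.
function repair_html(string $utf8Html): string
{
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The tidy extension is required for this step.');
    }
    $config = [
        'word-2000' => true, // strip surplus markup from Word-generated HTML
        'wrap'      => 0,    // do not re-wrap long lines
    ];
    $repaired = tidy_repair_string($utf8Html, $config, 'utf8');
    if ($repaired === false) {
        throw new RuntimeException('Tidy could not repair the input.');
    }
    return $repaired;
}
```

Calling repair_html(to_clean_utf8($raw, $reportedCharset)) would then give the string that step 3 loads into DOMDocument.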

Now we can use DOM interface or even Xpath to find all tags that are not on our whitelist (or on our blacklist), and can also find all tags that have dangerous attributes and remove all of them.

Removing a DOMElement also removes all of its descendants, so nesting is solved right there.

Now we have a clean DOMDocument and can dump it back as a string. There are also some tricks to dump only the contents of what's inside the <body> tag, so the actual <html><body></body></html> will not be included in the result if we don't want that.
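Step 3 as described above might look something like the sketch below. The function name strip_unlisted and its two whitelist parameters are hypothetical, and the `<?xml encoding="UTF-8">` prefix is a commonly used workaround to make libxml treat the input as UTF-8, not something the thread prescribes. Note that this only filters tag and attribute names; it does not validate attribute values (for example the URL inside an href), so it is not a complete purifier.

```php
<?php
// Sketch only: strip_unlisted() and its parameters are hypothetical names.
function strip_unlisted(string $html, array $allowedTags, array $allowedAttrs): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8"><html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $xpath = new DOMXPath($doc);
    // Copy the node list into an array before modifying the tree.
    foreach (iterator_to_array($xpath->query('.//*', $body), false) as $node) {
        if (!in_array(strtolower($node->nodeName), $allowedTags, true)) {
            // Removing the element takes all of its descendants with it.
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
            continue;
        }
        foreach (iterator_to_array($node->attributes, false) as $attr) {
            if (!in_array(strtolower($attr->nodeName), $allowedAttrs, true)) {
                $node->removeAttributeNode($attr);
            }
        }
    }

    // Dump only the children of <body>, so the wrapper tags are left out.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

For example, strip_unlisted($html, ['p', 'b', 'a'], ['href']) would drop a script element together with its contents and remove an onclick attribute from a kept paragraph.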

This looks easy and uses only extensions that are already in PHP: mbstring or iconv, Tidy and DOM.

What's your take on this approach?

Edited 1 time(s). Last edit at 12/23/2009 01:27PM by Ambush Commander.

Re: Question to html purifier experts
December 23, 2009 01:21PM

Nice one! My whole message has been purified? Shit

Re: Question to html purifier experts
December 23, 2009 01:31PM

It mostly works, and it is the architecture HTML Purifier wants to move toward soon. There are some technical details, which I will introduce here:

  • DOM has pretty terrible XML support; I've mostly concluded by now (from my adventures in html5lib) that I'll have to roll my own support if I want first-class support for things like SVG and MathML
  • DOM's HTML parsing algorithm is decent, but it's not the best. html5lib implements something much closer to what browsers actually do, but putting that in will be a pretty big performance expense
  • You still need all of the infrastructure for attribute validation and whatnot, which is really the hard part about getting a comprehensive purification; so while you'd get to rewrite a lot of the HTMLPurifier_Strategy classes to be smaller, that code stays.

HTML Purifier adopted the token-based approach because it wanted to be PHP4 compatible. Since this is no longer the case, moving in this direction is possible.

Re: Question to html purifier experts
December 23, 2009 01:44PM

The purifier is mostly needed for HTML, and I think PHP's DOM is good enough for HTML parsing. I use DOMDocument/DOMElement for parsing all sorts of RSS/Atom feeds and have never had a problem.

Why do you need to support MathML and SVG? These could and should be added as plugin filters, the core should work on HTML only and be good at it.

I think using the simple strategy and the already included PHP extensions will make this a pretty lightweight package. I mean, dump PHP4: people can still use the current version of Purifier for PHP4, but the new one for PHP5 will be much faster with a lot less code. It will have Tidy and either iconv or mbstring as required dependencies, but that's just a small inconvenience people will have to accept. Most PHP installations these days already have all these libs installed.
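Whether a given installation really has those extensions can be checked at startup. The following is only a sketch; the function name missing_requirements and the shape of its argument (a list of groups, each group being a list of interchangeable extensions) are assumptions made for illustration.

```php
<?php
// Sketch only: missing_requirements() is a hypothetical helper.
// Each entry of $groups lists alternatives; one loaded extension per group is enough.
function missing_requirements(array $groups): array
{
    $missing = [];
    foreach ($groups as $alternatives) {
        $loaded = array_filter($alternatives, 'extension_loaded');
        if (count($loaded) === 0) {
            $missing[] = implode(' or ', $alternatives);
        }
    }
    return $missing;
}

// The dependencies named in this thread: Tidy, DOM, and iconv or mbstring.
$missing = missing_requirements([['tidy'], ['dom'], ['iconv', 'mbstring']]);
if ($missing !== []) {
    error_log('Missing PHP extensions: ' . implode(', ', $missing));
}
```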

Re: Question to html purifier experts
December 23, 2009 01:52PM

"You'd be surprised!" I get a lot of people who come in with custom built PHPs and they've disabled DOM.

Why do you need to support MathML and SVG?

HTML5 is codifying support for them.

You should take a look at the source tree under library. There's not actually very much code we'd be able to ditch under the new scheme.

Re: Question to html purifier experts
December 23, 2009 01:55PM

OK. By the way, I just saw the new HTML_Safe; it's in svn only for now. It does not look much different from the old one, and things like $parser=& new XML_HTMLSax3(); are still in code that claims to be for PHP 5.2.

Here is the link:

