With the advent of Web 2.0, the end user has gone from passive consumer to active producer of content on the World Wide Web. Wikis, Social Software and Blogs all put the user in control.
Give the user too much control, however, and you set yourself up for XSS attacks. For this reason, HTML's flexibility has proven to be both a blessing and a curse, and the software that processes it must strike a fine balance between security and usability. How do we prevent users from injecting JavaScript or inserting malformed HTML while allowing a rich syntax of tags, attributes and CSS? How do we put HTML inside an RSS feed without worrying about sloppy coding messing up the XML parsing? Almost every PHP developer has come across this problem before, and many have tried, unsuccessfully, to solve it. We will analyze existing libraries to demonstrate how they are ineffective and, of course, how HTML Purifier solves all our problems and achieves standards-compliance.
I will give no quarter and pull no punches: as of the time of writing, no other library comes even close to solving the problem effectively for richly formatted documents. Nonetheless, a disclaimer is necessary:
This comparison document was written by the author of HTML Purifier, and clearly is in favor of HTML Purifier. However, that doesn't mean that it is biased: I have made every attempt to be factual and fair, and I hope that you will agree, by the time you finish reading this document, that HTML Purifier is the only satisfactory HTML filter out there today.
Table of Contents
Summary
A table summarizing the differences for the impatient.
Library | Version | Date | License | Whitelist | Removal | Well-formed | Nesting | Attributes | XSS safe | Standards safe |
---|---|---|---|---|---|---|---|---|---|---|
strip_tags() | n/a | n/a | n/a | Yes (user) | Buggy | No | No | No | No | No |
PHP Input Filter | 1.2.2 | 2005-10-05 | GPL | Yes (user) | Yes | No | No | Partial | Probably | No |
HTML_Safe | 0.9.9beta | 2005-12-21 | BSD (3) | Mostly No | Yes | Yes | No | Partial | Probably | No |
kses | 0.2.2 | 2005-02-06 | GPL | Yes (user) | Yes | No | No | Partial | Probably | No |
htmLawed | 1.1.9.1 | 2009-02-26 | GPL | Yes (not default) | Yes (user) | Yes (user) | Partial | Partial | Probably | No |
Safe HTML Checker | n/a | 2003-09-15 | n/a | Yes (bare) | Yes | Yes | Almost | Partial | Yes | Almost |
HTML Purifier | 4.15.0 | 2022-09-18 | LGPL | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
HTML Tidy is omitted from this list because it is not an HTML filter.
Look Ma, No HTML!
A clever person solves a problem. A wise person avoids it.— Albert Einstein
Before we jump into the weird and not-so-wonderful world of HTML filters, we must first consider another domain: non-HTML markup libraries. While libraries of this type really shouldn't be considered HTML filters, they are the number one method of taking user input and processing it into something more than plain old text. These libraries forgo HTML and define their own markup syntax. BBCode, Wikitext, Markdown and Textile are all examples of such markup languages (although it should be noted that Wikitext and Markdown can allow HTML within them). The benefits (to those who use it, anyway) are clear: simplicity and security.
Markup language | Sample |
---|---|
BBCode | [b]B[/b] [i]i[/i] [url=http://www.example.com/]link[/url] |
Wikitext1 | '''B''' ''i'' [http://www.example.com/ link] |
Markdown2 | **B** *i* [link](http://www.example.com/) |
Textile | *B* _i_ "link":http://www.example.com/ |
HTML | <b>B</b> <i>i</i> <a href="http://www.example.com/">link</a> |
WYSIWYG | B i link |
1. Wikitext shown is modeled after MediaWiki style. There are many variants of Wikitext currently extant.
2. Strictly speaking, the Markdown syntax is not equivalent: bold text is expressed as `<strong>` and italicized text as `<em>`. Most browser default stylesheets, however, map those two semantic tags to the associated styling, so many users assume that it really is italics (and use it improperly for, say, book titles).
Simplicity
HTML source code is often criticized for being difficult to read. For example, compare:
```
* Item 1
* Item 2
```
...with:
```html
<ul>
  <li>Item 1</li>
  <li>Item 2</li>
</ul>
```
Which would you prefer to edit? The answer seems obvious, but be careful not to fall into the fallacy of false dilemma. There is a third choice: the WYSIWYG (rich text) editor, which blows earlier choices out of the water in terms of usability.
Note that rich text editors and alternate markup syntaxes are not mutually exclusive, but, when push comes to shove, it's easier to implement this sort of editor on top of HTML than on top of some obscure markup language. And in the cases where it is done, you usually end up with a live preview, not a true rich text editor.
“Now just wait a second,” you may be saying, “WYSIWYG editors aren't all that great.” There are many good arguments against these editors, and intelligent people have written essays devoted to criticizing WYSIWYG. In addition to the usual arguments against said editors, the web poses another limitation: no JavaScript means no editor, and no editor means... (gasp) manually typing in code.
Even the most dogmatic purist, however, should recognize that for all its faults, prospective clients really want rich text editors. There are steps you can take to mitigate the associated drawbacks of these editors.
It is often asserted that WYSIWYG editors encourage excessive presentational markup. As it turns out, this is the case with any markup language that allows the smallest iota of presentational tags, be it <font> or [color=red]. A good way to reduce this trouble is to simply eliminate the dialogue boxes that allow users to change colors or fonts (which usually have no legitimate use) and adopt a WYSIWYM scheme, allowing users to select contextually correct formatting styles for segments of text.
Simplicity is also a double-edged sword. The moment any remotely complex markup is needed, these lightweight markup languages fail to deliver. Sure, you can make '''this text bold''' with Wikitext, but that infobox all “rendered nicely in aqua blue” will require a gaggle of <div>s and CSS. These languages face the same troubles as regular HTML filters in that their whitelist is too restrictive (besides the fact that their table markup is extraordinarily complex).
Security
BBCode boils down to a “wanna-be” version of HTML. I mean, replacing the angled brackets with square brackets and omitting the occasional parameter name? How much more unoriginal can you get? Somehow, I don't think BBCode was meant to be readable. Wikipedia agrees:
BBCode was devised and put to use in order to provide a safer, easier and more limited way of allowing users to format their messages. Previously, many message boards allowed the users to include HTML, which could be used to break/imitate parts of the layout, or run JavaScript. Some implementations of BBCode have suffered problems related to the way they translate the BBCode into HTML, which could negate the security that was intended to be given by BBCode.
Or, put more simply:
BBCode came to life when developers were too lazy to parse HTML correctly and decided to invent their own markup language. As with all products of laziness, the result is completely inconsistent, unstandardized, and widely adopted.
Well, developers, the whole point of HTML Purifier is that I do the work so you can just execute the ridiculously simple $purifier->purify($html) call and go on to do, well, whatever you developers do. :-P
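For reference, here is what that call looks like in context (a minimal sketch using HTML Purifier's documented API; the sample input is my own):

```php
<?php
// HTML Purifier ships with an auto-loader include.
require_once 'HTMLPurifier.auto.php';

// Start from the default (safe) configuration.
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);

// Dangerous markup is removed; legitimate formatting survives.
$dirty = '<b>Bold</b><script>alert("xss")</script>';
$clean = $purifier->purify($dirty);
// $clean now contains only the <b>Bold</b> portion.
```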
Conclusion
These alternative markup languages have their shiny points, and HTML Purifier is not meant to replace them. However, a major reason for their existence has been called into question. Why are you using these languages?
HTML Tidy
Dave Raggett's HTML Tidy is a neat program; neat enough, at least, to have made it into PHP as a PECL extension. The premise is simple, the execution effective. Tidy is, in short, a great tool.
It is not, however, a filter. I am often surprised when people ask me, “What about Tidy?” There's nothing against Tidy: Tidy tackles a different problem set. Let's see what man tidy has to say:
Tidy reads HTML, XHTML and XML files and writes cleaned up markup. For HTML variants, it detects and corrects many common coding errors and strives to produce visually equivalent markup that is both W3C compliant and works on most browsers. A common use of Tidy is to convert plain HTML to XHTML.
Hmm... why do I not see the words “filter” or “XSS” in here? Perhaps it's because Tidy accepts any valid HTML. Including script tags. Which leads us to our second part: Tidy parses documents, not document fragments.
This is not to say that I haven't seen Tidy used in this sort of fashion. MediaWiki, for instance, uses Tidy to clean up the final HTML output before shuttling it off to the browser. The developers, nevertheless, agree that this is only a band-aid solution, and that the real fix lies in the parser itself. Tidy's great, but in terms of security, it's not suitable for untrusted sources.
OWASP AntiSamy
Although OWASP AntiSamy is implemented in Java and .NET, it is worth a quick mention here because it purports to do the same thing as HTML Purifier. The bottom line? It gets pretty close, but it just doesn't have the same depth as HTML Purifier.
Architecturally speaking, OWASP AntiSamy is highly dependent on what are called “policy files”, a highly extended form of XML Schema with information on which attributes and elements to allow. As such, the actual code for filtering is relatively lightweight. AntiSamy gets lots of points for using legitimate HTML and CSS parsers (extra props for the CSS parser; HTML Purifier doesn't use one, but we should!)
Unfortunately, while XML Schema files can exert a high level of control over validation, the regular-expression-heavy approach begins showing signs of stress when data types are complex (e.g. URIs), and XML Schema is ill-suited for large-scale DOM manipulation, which is necessary when transforming HTML for standards compliance. Nonetheless, I would be fairly confident in its XSS cleaning abilities, so long as it removes things it doesn't recognize by default (something I find slightly perplexing in its policy files, since some rules indicate things to be removed.)
Preface
I've ordered my analyses according to how bad a library is. The worst is first, and then we move up the spectrum. I will point out the most flagrant problems with the libraries, but note that I will omit more advanced vulnerabilities: if you can't catch an onmouseover attribute, I really shouldn't reprimand you for letting non-SGML code points through. The ideal solution, however, must do all these things.
Note that besides strip_tags(), most of the libraries are moderately effective against the most common XSS attacks. None of them (save Safe HTML Checker) fare very well in the standards-compliance department, though.
strip_tags()
Whitelist | Yes, user-specified |
---|---|
Removes foreign tags | Buggy |
Makes well-formed | No |
Fixes nesting | No |
Validates attributes | No |
The PHP function strip_tags() is the classic solution for attempting to clean up HTML. It is also the worst solution, and should be avoided like the plague. Because it doesn't validate attributes at all, anyone can insert an onmouseover='xss();' and exploit your application.
While this can be band-aided with a series of regular expressions that strip out on[event] attributes (you're still vulnerable to XSS and at the mercy of quirky browser behavior), strip_tags() is fundamentally flawed and should not be used.
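The flaw is easy to demonstrate: the function's second argument whitelists tags, but any attributes on an allowed tag pass through untouched (a sketch with a made-up payload):

```php
<?php
$input = '<b onmouseover="xss();">hello</b><script>evil()</script>';

// Allow only <b>. The <script> tags themselves are stripped
// (though their text content "evil()" is left behind as plain text)...
$output = strip_tags($input, '<b>');

// ...and the onmouseover attribute sails right through:
echo $output;
// <b onmouseover="xss();">hello</b>evil()
```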
PHP Input Filter
Though its title may not imply it, PHP Input Filter is a souped up version of striptags() with the ability to inspect attributes. (Don't mind the hastily tacked on query escaping function).
Version | 1.2.2 |
---|---|
Last update | 2005-10-05 |
License | GPL |
Whitelist | Yes, user defined |
Removes foreign tags | Yes |
Makes well-formed | No |
Fixes nesting | No |
Validates attributes | Partial |
XSS safe | Probably |
Standards safe | No |
PHP Input Filter implements an HTML parser, and performs very basic checks on whether or not tags and attributes have been defined in the whitelist as well as some smarter XSS checks. It is left up to the user to define what they'll permit.
With absolutely no checking of well-formedness, it is trivially easy to trick the filter into leaving unclosed tags lying around. While standards-compliance may be viewed by some as merely a “nice feature”, basic sanity checks like this must be implemented; otherwise, a user can mangle a website's layout.
More troubles: woe to any user that allows the style attribute: you can't just let CSS through and expect your layout not to be badly mutilated. To top things off, the filter doesn't even preserve data properly: attributes have all spaces stripped out of them. Stay away, stay away!
HTML_Safe/SafeHTML
HTML_Safe is PEAR's HTML filtering library. It should be noted that this is the same library as SafeHTML, though with different branding (and a different version number).
Version | 0.9.9beta |
---|---|
Last update | 2005-12-21 |
License | BSD (3 clause) |
Whitelist | Mostly No |
Removes foreign tags | Yes |
Makes well-formed | Yes |
Fixes nesting | No |
Validates attributes | Partial |
XSS safe | Probably |
Standards safe | No |
HTML_Safe's mechanism of action involves parsing HTML with a SAX parser and performing validation and filtering as the handlers are called. HTML_Safe does a lot of things right, which is why I say it probably isn't vulnerable to XSS, but its approach is fundamentally flawed: blacklists.
This library maintains arrays of dangerous tags, attributes and CSS properties. (It also has a blacklist of dangerous URI protocols, but this is intelligently disabled by default in favor of a protocol whitelist.) What this means is that HTML_Safe has no qualms about accepting input like <foobar> Bang </foobar>. Anything goes except the tags in those arrays. Scratch standards-compliance (and that is without even considering proper nesting).
For now, HTML_Safe might be safe from XSS. In the future, however, one of the infinitely many tags that HTML_Safe lets through might just possibly be given special functionality by browser vendors. And it might just turn out that this can be exploited. Any blacklist solution puts you at a perpetual arms race against crackers who are constantly discovering new and inventive ways to abuse tags and attributes that you didn't blacklist.
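The difference between the two approaches can be sketched in a few lines (hypothetical helper functions, not HTML_Safe's actual code):

```php
<?php
// Blacklist: reject known-bad tags, accept everything else.
function blacklist_allows($tag) {
    $dangerous = array('script', 'object', 'embed', 'iframe');
    return !in_array(strtolower($tag), $dangerous);
}

// Whitelist: accept known-good tags, reject everything else.
function whitelist_allows($tag) {
    $known_good = array('b', 'i', 'a', 'p', 'ul', 'li');
    return in_array(strtolower($tag), $known_good);
}

// A tag nobody anticipated when the lists were written:
var_dump(blacklist_allows('foobar')); // bool(true)  -- it gets through!
var_dump(whitelist_allows('foobar')); // bool(false) -- rejected by default
```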
kses
kses appears to be the de-facto solution for cleaning HTML, having found its way into applications such as WordPress and being the number one search result for “php html filter”.
Version | 0.2.2 |
---|---|
Last update | 2005-02-06 |
License | GPL |
Whitelist | Yes, user defined |
Removes foreign tags | Yes |
Makes well-formed | No |
Fixes nesting | No |
Validates attributes | Partial |
XSS safe | Probably |
Standards safe | No |
To be truthful, I didn't do as comprehensive a code survey for kses as I did for some of the other libraries. Out of all the classes I've reviewed so far, kses was definitely the hardest to understand.
kses's modus operandi is splitting up HTML with a monster regexp and then validating each section with kses_split2(). It suffers from the same problems as PHP Input Filter: no well-formedness checks, leading to rampant runaway tags (and no standards-compliance). WordPress, the primary user of kses today, had to implement its own custom tag-balancing code to fix this problem: don't use this library without some equivalent!
Its whitelist syntax, however, is the most complex of all these libraries, so I'm going to take some time to argue why this particular implementation is bad. The author of this library was thoughtful enough to provide some basic constraint checks on attributes, like maxlen and maxval. Barring the fact that there simply aren't enough checks, and that they are all lumped together in one function, we must wonder whether the user will go through the trouble of specifying the maximum length of a title attribute.
I have my opinions about inherent human laziness, but perhaps WordPress's default filterset is the most telling example:
```php
$allowedposttags = array( /* formatted and trimmed */
    'hr' => array(
        'align'   => array(),
        'noshade' => array(),
        'size'    => array(),
        'width'   => array(),
    ),
);
```
Hmm... do I see a blatant lack of attribute constraints? Conclusion: if the user can get away with not doing work, they will! The biggest problem in all these whitelist filters is that they forgot to supply the whitelist. The whitelist is just as important as the code that uses it to filter HTML.
htmLawed
htmLawed is kses on steroids. After looking at HTML Purifier and deciding that it was too slow for him, Santosh Patnaik went ahead and rewrote the kses engine with more features.
Version | 1.1.9.1 |
---|---|
Last update | 2009-02-26 |
License | GPL |
Whitelist | Yes, but blacklist is default |
Removes foreign tags | Yes, user defined |
Makes well-formed | Yes, user defined |
Fixes nesting | Partial |
Validates attributes | Partial |
XSS safe | Probably |
Standards safe | No |
htmLawed improves standards-compliance, but it is not fully standards-compliant; there are a number of cases which the author has explicitly stated he will not fix. There are issues with content models in table and ruby, and with tags that must have content in them.
Let's, for a moment, imagine that htmLawed is XSS-safe when safe is on. Even then, it still is not XSS-safe out of the box: you have to turn on htmLawed's security features! This is by design. Sane defaults are important, because for every person who does read the documentation, there is another one who doesn't (and is misled by claims that “htmLawed is a single-file PHP software that makes input text secure”), and is surprised at some behavior. Software must be safe by default; the user can then relax any security restrictions.
I also disagree with some of the choices with regard to which elements are “safe”. form is XSS-safe, but it is certainly not phishing-safe: forms can be used to spoof system dialogs on that person's domain. These should not be allowed in safe mode.
Safe HTML Checker
Safe HTML Checker is (to my knowledge) the first attempt to make a filter that also outputs standards-compliant XHTML. It wasn't even released or licensed officially, but we'll let that slide: a 4th place search result must have done something right.
Version | in-house |
---|---|
Last update | 2003-09-15 |
License | undefined |
Whitelist | Yes (bare-bones) |
Removes foreign tags | Yes |
Makes well-formed | Yes |
Fixes nesting | Almost |
Validates attributes | Partial |
XSS safe | Yes |
Standards safe | Almost |
Indeed, it is quite a well-written piece of code. It demonstrates knowledge of inline versus block elements, thus very nearly getting nesting correct (the only exception is an unimplemented SGML exclusion for <a> tags, and that's easy to fix).
Unfortunately, part of the reason why it works so well is that it's extremely restrictive. No styling, no tables, very few attributes. Perfectly appropriate for blog comments, but then again, there's always BBCode. This probably means that Safe HTML Checker has a different goal than HTML Purifier.
The XML parser is also quite strict. Accidentally missed a < sign? The parser will complain with the cryptic message “XHTML is not well-formed”. The solution is not as simple as switching to a more permissive parser: Safe HTML Checker relies on the fact that the parser has matched up the tags for it.
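That strictness is inherent to PHP's Expat-based XML parser functions, which reject the entire input on the first well-formedness error (a sketch; Safe HTML Checker's own error message differs):

```php
<?php
$parser = xml_parser_create();

// A fragment with a mismatched tag: not well-formed XML.
$ok = xml_parse($parser, '<div><p>missing a close tag</div>', true);
var_dump($ok); // int(0) -- the parse fails outright

// Report why it failed, then clean up.
echo xml_error_string(xml_get_error_code($parser)), "\n";
xml_parser_free($parser);
```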
HTML Purifier
Version | 4.15.0 |
---|---|
Last update | 2022-09-18 |
License | LGPL |
Whitelist | Yes |
Removes foreign tags | Yes |
Makes well-formed | Yes |
Fixes nesting | Yes |
Validates attributes | Yes |
XSS safe | Yes |
Standards safe | Yes |
That table should say it all, but I'll add a few more features:
UTF-8 aware | Yes |
---|---|
Object-Oriented | Yes |
Validates CSS | Yes |
Tables | Yes |
PHP 5 only | Yes |
E_STRICT compliant | Yes |
Can auto-paragraph | Yes |
Extensible | Yes |
Unit tested | Yes |
This is not to say that HTML Purifier doesn't have problems of its own. It's big (while the others usually fit in one file, this one requires a huge include list), and it's missing features. But even with these deficiencies, HTML Purifier is far better than the other libraries.
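If you do adopt it, the whitelist is adjustable through configuration directives; a minimal sketch (HTML.Allowed is a real directive, though the particular allowed set shown here is just an example):

```php
<?php
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Restrict output to a handful of elements and attributes;
// everything else is removed or has its tags stripped.
$config->set('HTML.Allowed', 'p,b,i,ul,li,a[href]');

$purifier = new HTMLPurifier($config);
echo $purifier->purify('<p align="center">Hi <u>there</u></p>');
// The align attribute and the <u> tags are removed; the text survives.
```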