HTML Purifier Sucks

...needless to say, I don't think I'll bother investigating further!
— Stormrider on SitePoint Forums

Contrary to what this comparison page suggests, HTML Purifier sucks. It swallows oceans, it drinks blood, and it is more effective than your dust-busting Hoover 3000. Why does it suck? How can we make it un-sucky?

This document is currently under construction.

Bloat

As of version 2.1.3, HTML Purifier's library folder contains 164 files in 30 folders, weighing in at about 696 kilobytes. For comparison, the CodeIgniter web application framework contains 147 files and 29 folders, and weighs 902 kilobytes.

These back-of-a-napkin statistics are very telling about HTML Purifier's internal architecture: object-oriented, one class per file, and small components, taken to the extreme. That architecture also works against HTML Purifier when it comes to performance. For most input strings, the memory footprint from this library's source code is higher than the memory used actually processing the HTML (four megabytes, last I checked).
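
The file, folder and size counts quoted above are easy to reproduce for any source tree. What follows is only a minimal sketch in Python, not the method used to produce the original figures: the function name library_stats is made up for illustration, the root folder itself is not counted as a folder, symlinks are skipped, and a kilobyte is assumed to be 1024 bytes.

```python
import os


def library_stats(root):
    """Return (file_count, folder_count, total_bytes) for the tree under root.

    Assumptions: the root folder itself is not counted as a folder,
    and symbolic links are skipped rather than followed.
    """
    file_count = 0
    folder_count = 0
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Every subdirectory listed here is visited exactly once by os.walk.
        folder_count += len(dirnames)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            file_count += 1
            total_bytes += os.path.getsize(path)
    return file_count, folder_count, total_bytes


def to_kilobytes(total_bytes):
    """Convert bytes to whole kilobytes, assuming 1 KB = 1024 bytes."""
    return round(total_bytes / 1024)
```

Pointing library_stats at the library folder of a checkout and passing the third value through to_kilobytes gives numbers comparable to the ones above, though they will differ between versions and depending on whether the root folder is counted.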

Performance

HTML Purifier is extremely slow. Various benchmarks have shown HTML Purifier to be an order of magnitude slower than comparable solutions.

Whitespace

The Stormrider quote at the very beginning of this document refers to one very specific problem: whitespace.

Data loss

When DOMLex is being used, it is trivially easy to nuke the contents of a document by inserting a </div> tag near the beginning.