A.C. van Rijn
performance
June 22, 2007 10:51AM

Does using something like HTML Purifier reduce performance? I'm curious how much additional overhead it introduces compared to other solutions. Thanks!

Re: performance
June 22, 2007 02:34PM

As the developer of HTML Purifier I probably shouldn't be saying this, but HTML Purifier likely takes more processing power than the other, inferior solutions out there, for one simple reason: it's comprehensive.

That being said, it's still pretty fast, and it has been getting consistently faster (HTML Purifier 2.0 is leaps and bounds faster than its predecessors due to caching). At some point I'll run some benchmarks (or, for neutrality's sake, maybe someone else will run them and publish the results). I too would like to see some solid numbers.

An evaluation of HTMLPurifier performance
July 07, 2007 07:00PM

Purifiers

1. HTML Purifier 2.0.1 (default config)

2. A custom function based on Mediawiki(1.10)/includes/sanitizer.php that simply balances tags and escapes, with htmlspecialchars(), any '<' and '>' that are not part of XHTML tags.

System

* Windows XP SP2 / Pentium 4 / 1.8 GHz / 1.2 GB RAM

* Local Apache 2.2 + mod_php / PHP 5.2.1 / Firefox 2

Method

Restart Apache -> time custom purifier [call script through browser 5 times] -> restart -> time HTMLPurifier [call script through browser 5 times]

Test script

* Simple script

* Get input -> function call to purify -> echo purified HTML

* Microtime from purifier function call to immediately echoed output

* Get peak memory usage

Input

* Body content of HTML file: http://www.w3.org/TR/xhtml1/

* 71 kb

* 72,685 characters; 2092 of them '>'

* 63 markup errors detected when file is uploaded for checking at http://validator.w3.org

RESULT

Custom purifier:

* 0.3381, 1.1416, 0.4707, 0.9563 and 0.6372 seconds; average 0.7088

* peak memory usage 631 kb

* -80 '>' (loss) and +612 (gain) characters after purification

* 112 markup errors

HTMLPurifier:

* 1.7105, 1.9529, 1.3612, 1.7437 and 2.1334 seconds; average 1.7803

* peak memory usage 6,558 kb

* -30 '>' and -1542 characters after purification

* 0 markup errors
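
For anyone who wants to re-derive the summary figures, here is a minimal sketch that recomputes the averages from the five listed timings for each purifier and expresses HTML Purifier's cost relative to the custom function. The variable names are mine, not from the test script; the numbers are copied from the results above.

```python
from statistics import mean

# Timings in seconds, copied from the results above (five browser calls each).
custom_times = [0.3381, 1.1416, 0.4707, 0.9563, 0.6372]
purifier_times = [1.7105, 1.9529, 1.3612, 1.7437, 2.1334]

# Peak memory in kb, copied from the results above.
custom_mem_kb = 631
purifier_mem_kb = 6558

custom_avg = mean(custom_times)
purifier_avg = mean(purifier_times)

# How many times slower / heavier HTML Purifier was in this one test.
time_ratio = purifier_avg / custom_avg
mem_ratio = purifier_mem_kb / custom_mem_kb

print(f"custom average:        {custom_avg:.4f} s")
print(f"HTML Purifier average: {purifier_avg:.4f} s")
print(f"time ratio:            {time_ratio:.2f}x")
print(f"memory ratio:          {mem_ratio:.2f}x")
```

With only five runs per purifier and a wide spread between runs, these ratios are rough indications for this one input, not general figures.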

Re: performance
July 08, 2007 10:32AM

While that's an interesting test you've done, it's like comparing apples to oranges. The custom purifier leaves 112 markup errors, while HTML Purifier performs perfectly. The high memory usage, however, is slightly disturbing, and I'll need to look into that.

Re: performance
July 08, 2007 01:52PM

When there is no other apple around, a comparison with an orange can be better than no comparison or one with a potato ;)

Seriously, the evaluation was to check memory and time consumed by HTML Purifier, and not its already-proven efficacy. The other script is what one would call a positive control in an experiment.

If HTML Purifier takes, say, 7 MB and 2 seconds, to process 70 kb, a busy site might not want to use it.

Re: performance
July 09, 2007 08:29AM

I completely understand. HTML Purifier must be made faster.

Re: performance
July 11, 2007 12:17PM

Some more time checks, for whatever they are worth, using the almost valid XHTML file from W3C and truncated or randomly mutated (to ASCII 32-255) versions of it. Same setup as earlier, but this time only the time to purify is clocked.

71 kb

1.0122, 1.0244, 1.0101

48 kb

0.7097, 0.7211, 0.7144

24 kb

0.4210, 0.4341, 0.4151

71 kb, every 20th character mutated

0.1769, 0.1539, 0.1585

71 kb, every 10th character mutated

0.7165, 0.7111, 0.6935

1 character string

0.0209, 0.0344, 0.0307

Re: performance
July 11, 2007 12:34PM

Quick note: table syntax is supported on these forums. Here's the data reformatted, with some extra interesting number crunching:

Size (kb)   Trial 1 (sec)   Trial 2 (sec)   Trial 3 (sec)   Average (sec)   Ratio (sec/kb)
71          1.0122          1.0244          1.0101          1.0156          0.0143
48          0.7097          0.7211          0.7144          0.7151          0.0149
24          0.4210          0.4341          0.4151          0.4235          0.0176
0.001       0.0209          0.0344          0.0307          0.0287          n/a

The one-character string gives a good idea of what the initialization overhead for HTML Purifier is, and it appears that HTML Purifier is O(n) with regard to performance, i.e. the time taken to process HTML rises linearly with the size of the input document.
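
One way to test the linear-scaling reading is a least-squares line through the averages: the slope approximates the per-kb cost and the intercept the fixed start-up overhead. The sketch below is only illustrative. The `fit_line` helper is my own, the data points are the size and average columns from the table above, and four points are far too few to settle the question.

```python
# (size in kb, average seconds) pairs taken from the table above.
points = [(71, 1.0156), (48, 0.7151), (24, 0.4235), (0.001, 0.0287)]


def fit_line(pairs):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(points)
print(f"estimated cost per kb:  {slope:.4f} s")
print(f"estimated fixed overhead: {intercept:.4f} s")

# Compare each measured average with what the fitted line predicts.
for size_kb, measured in points:
    predicted = slope * size_kb + intercept
    print(f"{size_kb:>7} kb  measured {measured:.4f} s  predicted {predicted:.4f} s")
```

If the residuals stay small relative to the measured times, that is consistent with roughly linear growth over this size range; it says nothing about much larger inputs or about the mutated inputs, which are not included here.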

lavallefils
Re: performance
August 17, 2007 01:28AM

Just wondering if it is worth porting HTMLPurifier as a PHP extension and if so how much performance improvement this can bring.

Re: performance
August 17, 2007 03:37PM

It would certainly give significant performance increases, but it's not an easy task.
