
High memory usage -- include() for HTMLPurifier; the tokens array

Posted by patnaik 
August 10, 2007 03:30PM

Here is some information on HTMLPurifier's performance, gathered using the 'standalone' version on an almost clean HTML string of ~65k characters, ~2k of which were '<' (PHP 5.2.1 / WinXP, 1.8 GHz). Perhaps this can help improve the speed and memory usage of the software. (Note that the standalone version seems to be noticeably faster than the regular version, mainly because of the reduced number of require_once() calls.)

Memory usage increased by ~2.8 MB immediately after the "include('HTMLPurifier.standalone.php')" call, and by a couple of hundred KB after instantiation and again after purification. A script containing just the include() call incurred a peak memory usage of 3 MB. Why is that? When the code in the included file (of ~0.5 MB) was commented out, peak memory usage was just 70 KB or so.
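For anyone who wants to repeat this kind of measurement, here is a minimal sketch of reading memory before and after a piece of work with memory_get_usage() and memory_get_peak_usage(). The helper name measure_memory() is hypothetical, the workload is a stand-in so the script runs without the library, and closures assume a PHP version newer than the 5.2.1 used above.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): reports how much
// memory a piece of work adds while its result is still alive.
function measure_memory(callable $work): array
{
    $before = memory_get_usage();
    $result = $work();              // keep the result so it is counted
    $after  = memory_get_usage();

    return [
        'delta_bytes' => $after - $before,
        'peak_bytes'  => memory_get_peak_usage(),
        'result'      => $result,
    ];
}

// Stand-in workload so the sketch runs on its own.
$report = measure_memory(function () {
    return range(1, 50000);
});

printf(
    "delta: %d bytes, peak: %d bytes\n",
    $report['delta_bytes'],
    $report['peak_bytes']
);
```

To measure the include itself, the closure body would be replaced with the include('HTMLPurifier.standalone.php') call from the post above; the numbers will depend on the PHP version and platform.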

Memory usage after the call to function tokenizeHTML (Lexer/DOMLex.php, class HTMLPurifier_Lexer_DOMLex) increased from 3.2 MB to 5.5 MB. Also, profiling with Xdebug showed that HTMLPurifier_Lexer_DOMLex->TokenizeDOM, called ~2500 times, took >80% of the purification time of ~1 second. That TokenizeDOM plays a big role is expected. The array of tokens generated in the test case had ~3.5k elements and was responsible for the 3.2-to-5.5 MB increase (such increases were found to be proportional to the length of the input string). The tokens array is multi-dimensional and contains objects and named keys. Could memory usage be reduced by not using objects and named keys in the array?
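The question about objects and named keys can be checked empirically. The sketch below builds ~3.5k token-like entries three ways and prints the memory each takes. SketchToken and memory_of() are hypothetical and much simpler than the library's real token classes, so this only illustrates the measuring technique; the ranking can differ between PHP versions.

```php
<?php
// Hypothetical token class; HTML Purifier's real token classes differ.
class SketchToken
{
    public $name;
    public $attr;

    public function __construct($name, $attr)
    {
        $this->name = $name;
        $this->attr = $attr;
    }
}

// Returns the bytes held by whatever $build() returns.
function memory_of(callable $build): int
{
    $before = memory_get_usage();
    $data   = $build();
    $used   = memory_get_usage() - $before;
    unset($data);                   // free before the next measurement
    return $used;
}

$count = 3500;                      // roughly the token count in the post

$asObjects = memory_of(function () use ($count) {
    $tokens = [];
    for ($i = 0; $i < $count; $i++) {
        $tokens[] = new SketchToken('p' . $i, ['class' => 'c' . $i]);
    }
    return $tokens;
});

$asNamedArrays = memory_of(function () use ($count) {
    $tokens = [];
    for ($i = 0; $i < $count; $i++) {
        $tokens[] = ['name' => 'p' . $i, 'attr' => ['class' => 'c' . $i]];
    }
    return $tokens;
});

$asPlainArrays = memory_of(function () use ($count) {
    $tokens = [];
    for ($i = 0; $i < $count; $i++) {
        $tokens[] = ['p' . $i, ['c' . $i]];
    }
    return $tokens;
});

printf("objects:      %d bytes\n", $asObjects);
printf("named arrays: %d bytes\n", $asNamedArrays);
printf("plain arrays: %d bytes\n", $asPlainArrays);
```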

Re: High memory usage -- include() for HTMLPurifier; the tokens array
August 11, 2007 01:13AM

These results mostly mirror the ones I got in another memory usage survey I did with XDebug.

The code's implicit memory footprint is largely unavoidable; it's due to HTML Purifier's modular, extensible OOP design. (The alternative is to rewrite the library as procedural code :-P)

Replacing objects with arrays may help reduce the footprint, although it would require massive rewrites of almost every class (it probably can be automated, though). Also, it should be noted that the DOM extension lies about its own memory usage: it didn't show up (at least in my tests) until a node was actually accessed.

Re: High memory usage -- include() for HTMLPurifier; the tokens array
August 11, 2007 10:35AM
A script with code for just the include() call incurred a peak memory usage of 3 MB.

I am not very familiar with OOP, but why the 3 MB usage when there is no object instantiation and the class is not used? Further, Xdebug shows that HTMLPurifier_ConfigSchema::define, HTMLPurifier_ConfigSchema::instance, etc., are called many dozens of times. Why is that code executed when the one-line script contains nothing but the include()?

Re: High memory usage -- include() for HTMLPurifier; the tokens array
August 11, 2007 11:37AM

PHP still needs to store the class definitions in memory. Also, the ConfigSchema calls are used to provide a decentralized but strict configuration system; I've already optimized them by removing all safeguards when a certain constant is not defined.

If you have any other ideas why simple inclusion is taking 3 MB, please tell me.

This is simply unacceptable and a dealbreaker for sure.. I was going to use this script on my server but due to these high memory usages it simply is not an option.

You may indeed want to convert it to a procedural format, I know you said that in jest, but I guess there would be a tremendous gain in performance if you did it. Plus, once this library is finished, all it needs is some bug fixes over time, so I don't think it would be very hard to do that in a procedural way as well..

Or I guess a php (c) extension would be even better.. Yes, indeed :)

Re: High memory usage -- include() for HTMLPurifier; the tokens array
February 13, 2008 12:37PM

Considering the efficacy of HTMLPurifier (see, for example, this thread) its resource-intensiveness may be worth it if the website doesn't get a high number of visitors. If you have to look for something else, give htmLawed a shot.

Yes well, htmLawed is probably inferior :) I'd like to see tests done with all these libraries.. :)

But still, I think that a PHP extension is a better fit for a project such as this.

Re: High memory usage -- include() for HTMLPurifier; the tokens array
February 13, 2008 01:59PM
This is simply unacceptable and a dealbreaker for sure.. I was going to use this script on my server but due to these high memory usages it simply is not an option.

I would be careful before saying that. The CPU usage is too high to run on every page view for busy websites, but if you cache the results of HTML Purifier, it doesn't make a difference. In fact, for any high-performance website, regardless of the filtering library, I would strongly recommend you cache the results. I've done some work with Wikipedia (poster child of high-performance PHP), and they gain a lot of performance from caching with Squid proxies.
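As a rough illustration of caching purified output, here is a minimal file-based sketch keyed on a hash of the input. cached_purify() and the stand-in filter are hypothetical; the stand-in uses strip_tags() only so the script runs without the library, and it is not a safe sanitizer.

```php
<?php
// Hypothetical wrapper: $purify is any callable that cleans HTML.
function cached_purify(string $dirtyHtml, callable $purify, string $cacheDir): string
{
    $file = $cacheDir . DIRECTORY_SEPARATOR
          . 'purified_' . sha1($dirtyHtml) . '.html';

    if (is_file($file)) {
        return file_get_contents($file);    // cache hit: skip the filter
    }

    $clean = $purify($dirtyHtml);           // cache miss: do the slow work once
    file_put_contents($file, $clean, LOCK_EX);
    return $clean;
}

// Stand-in filter (NOT a safe sanitizer) that counts how often it runs.
$calls = 0;
$fakePurify = function ($html) use (&$calls) {
    $calls++;
    return strip_tags($html, '<b>');
};

$cacheDir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'purify_cache_' . uniqid();
mkdir($cacheDir);

$input  = '<b>hi</b><script>x</script>';
$first  = cached_purify($input, $fakePurify, $cacheDir);
$second = cached_purify($input, $fakePurify, $cacheDir);

printf("filter ran %d time(s) for 2 requests\n", $calls);
```

With the real library, the callable would be array($purifier, 'purify') for an HTMLPurifier instance. The cache key here covers only the input, so a real setup would presumably also need to include the configuration in the key, or clear the cache when the configuration changes.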

Also, 5 MB is not much memory usage at all (especially with current hardware specs). The instance of Firefox I'm running right now, for example, is using 50 MB (and can easily increase to 100 MB with extensive usage). The PHP documentation project's build scripts often exceed 1 GB of memory usage.

You may indeed want to convert it to a procedural format, I know you said that in jest, but I guess there would be a tremendous gain in performance if you did it. Plus, once this library is finished, all it needs is some bug fixes over time, so I don't think it would be very hard to do that in a procedural way as well..

Actually, that's not the case. If you take a look at the TODO list, there is so much more that needs to be done. Implementing HTML 5, especially, will benefit from the OOP format. Object-oriented programming can be overkill in some cases, but it certainly isn't here.

That all being said, I do agree that HTML Purifier takes up more memory/CPU than I would like. If you have any concrete suggestions that don't require rewriting the library, I'm all ears. Also, I've recently landed in the trunk a bunch of changes that give HTML Purifier autoload support (so only classes you need are loaded) and remove all ConfigSchema calls.
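To show the general idea behind autoloading (load a class file only when the class is first used), here is a minimal sketch with spl_autoload_register(). It assumes a convention where underscores in the class name map to directories; sketch_class_to_path() is hypothetical and this is not the library's actual autoloader.

```php
<?php
// Assumed convention: underscores in the class name map to directories,
// e.g. HTMLPurifier_Lexer_DOMLex -> HTMLPurifier/Lexer/DOMLex.php
function sketch_class_to_path(string $class): string
{
    return str_replace('_', DIRECTORY_SEPARATOR, $class) . '.php';
}

spl_autoload_register(function ($class) {
    // Only handle the library's own prefix; leave other classes alone.
    if (strpos($class, 'HTMLPurifier') !== 0) {
        return;
    }

    $file = __DIR__ . DIRECTORY_SEPARATOR . sketch_class_to_path($class);
    if (is_file($file)) {
        require $file;              // loaded on first use only
    }
});

echo sketch_class_to_path('HTMLPurifier_Lexer_DOMLex'), "\n";
```

The point of the design is that a request which never touches a given class never pays the cost of parsing and storing its definition.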

Or I guess a php (c) extension would be even better.. Yes, indeed :)

It probably would be written in C++, for OOP. ;-)

http://wafful.org/

Just found this, if you scroll all the way to the bottom, it seems like they are working on an Apache 2 input filter :) Maybe you should contact them :)

Re: High memory usage -- include() for HTMLPurifier; the tokens array
February 14, 2008 01:30PM

The approach they're using, which is filtering everything without any regard to context, is not a good idea.

Btw, didn't want to make a new thread for this.. but in the standalone version I think you should at least remove the comments as well.. Reason people download that is usually for performance so PHP having to read fewer bytes (comments) would surely help.

Also those config schemas in there seem to be unnecessary (the explanation..) again unnecessary text.

Re: High memory usage -- include() for HTMLPurifier; the tokens array
February 15, 2008 04:52PM
I think you should at least remove the comments as well.. Reason people download that is usually for performance so PHP having to read fewer bytes (comments) would surely help.

By a similar token, one should also remove the whitespace from the program! I am loath to do things like that, because these sorts of micro-optimizations don't amount to much performance gain, and I would also like users to have some clue about what went wrong if there's an error from HTML Purifier.

If you can produce a benchmark that demonstrates a clear, statistically significant performance boost from such a procedure, I'll implement it.
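A starting point for such a benchmark could look like the sketch below. It uses PHP's built-in php_strip_whitespace() to produce a comment-free copy of a source file, then compares sizes and times token_get_all() on both versions as a rough proxy for lexing cost. The sample file and time_lexing() are hypothetical stand-ins; a real benchmark would time the actual include of both builds in separate processes, over many runs.

```php
<?php
// Write a small commented sample file (stand-in for the standalone build).
$original = tempnam(sys_get_temp_dir(), 'orig');
file_put_contents($original, <<<'PHP'
<?php
/**
 * A docblock that explains the function at length.
 */
function sketch_add($a, $b)
{
    // inline comment
    return $a + $b;
}
PHP
);

$originalSource = file_get_contents($original);
$strippedSource = php_strip_whitespace($original); // comments and extra whitespace removed

// Rough proxy only: time spent lexing the source, not a full include.
function time_lexing(string $source, int $runs): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        token_get_all($source);
    }
    return microtime(true) - $start;
}

printf("original: %d bytes, stripped: %d bytes\n",
    strlen($originalSource), strlen($strippedSource));
printf("lexing 1000x: original %.4fs, stripped %.4fs\n",
    time_lexing($originalSource, 1000), time_lexing($strippedSource, 1000));

unlink($original);
```

Whether any difference is statistically significant would have to be judged from repeated runs, and an opcode cache (if one is in use) would likely hide most of it.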

Also those config schemas in there seem to be unnecessary (the explanation..) again unnecessary text.

Correct! This will be fixed in 3.1.0.
