Welcome! » Log In » Create A New Profile

Ouch! the resources! :-(

Posted by Denis 
Ouch! the resources! :-(
May 05, 2008 08:40PM

It's a neat library, but... page loads went from a nifty 650kB memory usage / 25ms to a whopping 2.8MB / 200ms+, or 1.8MB / 100ms+ once cached. even filtering an empty string requires that amount of server resources. :-(

based on my tests, including the autoloader is not resource hungry. creating a mere instance of HTMLPurifier is worth about 600kB of memory. purifying an empty string requires 600kB more. I'm already worried at the idea of filtering 20kB long blog posts... :P

say, err... while peeking at your source code, a few things came to mind. so in case you're interested, here goes:

1. the use of autoload could possibly be optimized to reduce the memory print

this is probably a bad example, but, for instance, HTMLPurifier's constructor goes something like:

        $this->config = HTMLPurifier_Config::create($config);
        
        $this->strategy     = new HTMLPurifier_Strategy_Core();
        $this->generator    = new HTMLPurifier_Generator();

and then HTMLPurifier_Strategy_Core's constructor goes on with:

        $this->strategies[] = new HTMLPurifier_Strategy_RemoveForeignElements();
        $this->strategies[] = new HTMLPurifier_Strategy_MakeWellFormed();
        $this->strategies[] = new HTMLPurifier_Strategy_FixNesting();
        $this->strategies[] = new HTMLPurifier_Strategy_ValidateAttributes();

... merely creating an instance of the class involves including zillions of files.

I'm quite certain we don't need a memory footprint of 600kB to process an empty string. :P

2. speaking of which...

public function purify($html, $config = null) {

the first line should probably be:

if ( !$html ) return $html;

and the second one could possibly be:

if ( !$this->has_any_html($html) ) return $html;

3. it would be sweet to have an option that leaves the autoload order untouched. please let me try to load my own classes first, without needing to rewrite your auto.php and autoload.php files. ;-)

better: if spl_autoload_functions() is returning anything and the __autoload() function does not exist, you're pretty safe to assume the coder who wrote the code knew what he was doing... if __autoload() exists and spl_autoload_functions() returns something as well, you're safe to assume someone included a library which, like yours, compensates inept coders by doing things "the right way" before letting the core application nuke the entire thing.

4. you're serializing/unserializing the config cache. would it not make more sense to use var_export($var, true) instead?

on top of the performance gain, there's the occasional issue with serialization and server encoding settings, e.g.:

http://trac.wordpress.org/ticket/6532#comment:5

I'm sure I'll find a few more suggestions as I use the library. it's really neat. I wish it was BSD licensed... ;-)

D.

Re: Ouch! the resources! :-(
May 05, 2008 10:22PM

here's a feature request... something like:

$valid_tags = str_replace(array("\n", "\r", "\t", ' '), '', $valid_tags);

that way, you can configure this with a cleaner looking variable.

D.

Re: Ouch! the resources! :-(
May 05, 2008 10:51PM

here's a bug:

upon using this:

		$config = HTMLPurifier_Config::createDefault();
		$config->set('Cache', 'SerializerPath', mc_config);
		$config->set('HTML', 'TidyLevel', 'none');
		$config->set('HTML', 'Trusted', true);
		$config->set('URI', 'Base', mc_home);
		$config->set('URI', 'MakeAbsolute', true);

with:

$purifier->purify(&#039;<b style="color: red;">test</b>&#039;);

I get a fatal error:

HTMLPurifier_Bootstrap::require(HTMLPurifier/AttrTransform/ScriptRequired.php) [function.HTMLPurifier-Bootstrap-require]: failed to open stream: No such file or directory in /path/to/htmlpurifier/HTMLPurifier/Bootstrap.php on line 40

It's related to HTML.Trusted as far as I can tell.

Re: Ouch! the resources! :-(
May 05, 2008 11:02PM

I'm very interested! Thank you for your comments!

the use of autoload could possibly be optimized to reduce the memory print

Having one file per class, I agree, isn't the most conducive to good performance. However, it's a choice we consciously made, and one that can be worked around: with the standalone library all the classes are in one file.

I'm quite certain we don't need a memory footprint of 600kB to process an empty string. :P

You come to a very important point, which is with regards to lazy-loading and HTML Purifier. HTML Purifier takes a very aggressive approach when it comes to loading its memory intensive components, namely the HTML definitions. PHP's objects are somewhat bulky with regards to memory.

In short, I really don't know how to fix this. Sure, for degenerate cases like empty strings, I can handle it, but other that it's difficult.

if ( !$this->has_any_html($html) ) return $html;

That's an optimization probably worth doing. I'll put that on my TODO list.

One thing to be careful about, however, is interaction of that and %HTML.Parent; let's suppose it's set to table (I've never heard of that happening before), plaintext is improper in such a context.

it would be sweet to have an option that leaves the autoload order untouched. please let me try to load my own classes first, without needing to rewrite your auto.php and autoload.php files. ;-)

As far as I know, I do give you that option! You just can't include the HTMLPurifier.autoload.php file or use our HTMLPurifier_Bootstrap::registerAutoload() function. Our autoload is array('HTMLPurifier_Bootstrap', 'autoload'); all you need is HTMLPurifier/Bootstrap.php included and you're good to go to do whatever you want.

you're pretty safe to assume the coder who wrote the code knew what he was doing

Not necessarily. Some autoloaders blithely attempt to include a file whether or not it exists or not and end up triggering a plethora of errors. It is quite important that HTMLPurifier's autoloader is called first.

if __autoload() exists and spl_autoload_functions() returns something as well, you're safe to assume someone included a library which, like yours, compensates inept coders by doing things "the right way" before letting the core application nuke the entire thing.

I'm not sure, however, what the different behavior would be. See above, I guess.

you're serializing/unserializing the config cache. would it not make more sense to use var_export($var, true) instead?

Actually, it's been shown that unserialize/serialize is faster than PHP files. If you think about it, it makes sense; when you include a PHP file, you're using the whole PHP interpreter, whereas the serialized format is a small subset.

there's the occasional issue with serialization and server encoding settings

I've never had anyone complain to me about this. It sounds like it only pertains to UTF-8 characters, however, which should not pose a problem. I'll investigate further though.

I wish it was BSD licensed... ;-)

Unfortunately, LGPL is as free as it's going to get. :-)

$valid_tags = str_replace(array("\n", "\r", "\t", ' '), '', $valid_tags);

I'm trying to figure out where the $valid_tags variable comes from. To you mean the configuration directive %HTML.Allowed? If so, I think I can modify the algorithm slightly to support arbitrary spaces and tabs (newlines are already supported).

HTMLPurifier_Bootstrap::require(HTMLPurifier/AttrTransform/ScriptRequired.php) [function.HTMLPurifier-Bootstrap-require]: failed to open stream: No such file or directory in /path/to/htmlpurifier/HTMLPurifier/Bootstrap.php on line 40

Ooh, this is a known bug. Crack open the HTMLPurifier/HTMLModule/Scripting.php file, and find the HTMLPurifier_AttrTransform_ScriptRequired class, and then move it to the "missing" file. This will be fixed in 3.1.0; it's a bug that was exposed when we switch to autoloading.

If you have anymore performance optimization suggestions, I'd be more than happy to hear them. Some of the areas where I think we can make great progress is reducing the size of HTML Purifier's internal data-structures, and combining multipass algorithms into a single algorithm.

Re: Ouch! the resources! :-(
May 05, 2008 11:18PM

BTW, have you looked into your caching options yet?

Re: Ouch! the resources! :-(
May 06, 2008 06:32AM
I've never had anyone complain to me about this. It sounds like it only pertains to UTF-8 characters, however, which should not pose a problem. I'll investigate further though.

it happened to a bunch of my customers with WordPress. I *think* it started to occur when the WP devs added this line:

if (function_exists(&#039;mb_internal_encoding&#039;)) {
	if (!@mb_internal_encoding(get_option(&#039;blog_charset&#039;)))
		mb_internal_encoding(&#039;UTF-8&#039;);
}

and/or this one:

		if ( !empty($this->charset) && version_compare(mysql_get_server_info($this->dbh), &#039;4.1.0&#039;, &#039;>=&#039;) )
 			$this->query("SET NAMES &#039;$this->charset&#039;");

historically, WP was handling utf-8 data using functions and db tables were assuming everything was iso latin. frequently, the server would run as if things were in english (php's default) and the db would run as if things were in sweedish (mysql's default). they changed the behavior: the mb_internal_encoding() call, a few alter table statements, a set names statement, and there are upcoming collate statements in queries.

the result is filthy: serialized strings with high ascii characters end up with erroneous string sizes due to the change, and fail to unserialize properly.

anyway, as you rightly point out, your use-case is safe -- it's all low ascii characters.

Not necessarily. Some autoloaders blithely attempt to include a file whether or not it exists or not and end up triggering a plethora of errors. It is quite important that HTMLPurifier's autoloader is called first.

correct. but in this case, the __autoload() function is used, rather than its spl equivalent. the spl autoloader works on a stack, as in: "try it, and fail elegantly". a coder who used it is -- imo -- bound to understand this. whereas a coder who uses __autoload() instead is almost certainly nuking the app with an require statement (or worse: require_once).

thus, I would guess that if the stack returns anything and no __autoload() function exists, you're safe to append to the stack without further worries. and that if any __autoload() function exists, you must prepend the stack.

all you need is HTMLPurifier/Bootstrap.php included and you're good to go to do whatever you want.

actually, there's a bit more to it. you also need to set the include path. ;-)

I'm trying to figure out where the $valid_tags variable comes from. To you mean the configuration directive %HTML.Allowed?

correct. it was one of my own variables. purifier was complaining about unsupported          p tags. :-)

If you have anymore performance optimization suggestions, I'd be more than happy to hear them.

yes, one. I didn't post it yesterday, as I wasn't sure it was made much of a difference. but the presentation on caching you linked to regarding using serialized data prove me right on the point.

you're using set_include_path(), which is slower than using absolute paths. use this instead:

@define(&#039;HTMLPurifierPath&#039;, dirname(__FILE__)); // where relevant, with @ in case user defined it already
include HTMLPurifierPath . &#039;/path/to/file.php&#039;;
have you looked into your caching options yet?

of course. ;-)

by the way, you make the folder 777, but the files remain 644. I then need to clean things up using sudo.

last point... here's an extended color name reference:

http://www.semiologic.com/resources/web-design/html-color-names/

D.

Re: Ouch! the resources! :-(
May 06, 2008 07:00AM

just wondering...:

var_dump(get_class_methods(&#039;HTMLPurifier_ConfigDef_Directive&#039;));
// returns __construct() and that&#039;s it...

could these ~100 object instances, and similar, be simple arrays instead?

Maybe it might be worth replacing them with some kind of "service"? e.g.:

class HTMLPurifier_ConfigDef_Service {
  protected $directives = array();

  function get($directive, $key = null) {
    return isset($key) ? $this->directives[$directive][$key] : $this->directives[$directive];
  }

  function set($directive, $mixed1, $mixed2 = null) {
    if ( isset($mixed2) )
      $this->directives[$directive][$mixed1] = $mixed2;
    else
      $this->directives[$directive] = $mixed1;
  }
}

This approach might also be worth exploring, if you prefer to be handling object variables all over:

class HTMLPurifier_ConfigDef_Service {
  protected $directives = array();

  function __get($directive) {
    return $this->directives[$directive];
  }

  function __set($directive, $value) {
    $value = (object) $value;
    $this->directives[$directive] = $value;
  }
}

$service = new HTMLPurifier_ConfigDef_Service;
$directive = array(&#039;key&#039; => &#039;value&#039;);
$service->directive = $directive;
var_dump($service, $service->directive, $service->directive->key);

D.

Re: Ouch! the resources! :-(
May 06, 2008 07:33AM

by the way... LGPL means I can use the library in BSD licensed software without polluting the entire code base with GPL, right?

D.

Re: Ouch! the resources! :-(
May 12, 2008 09:54PM

Apologies for the late response.

Yes, LGPL means you can use it in BSD licensed software.

I'm looking into your other suggestions and should be implementing them in time for the 3.1.0 release.

could these ~100 object instances, and similar, be simple arrays instead?

In theory, the object should take up the same amount of memory as the arrays, possibly less because each possible key is named. I like objects because they make documentation explicit. Do you have figures on either of these two arguments?

the result is filthy: serialized strings with high ascii characters end up with erroneous string sizes due to the change, and fail to unserialize properly.

mb_internal_encoding means that the Wordpress devs were using mb_strlen when they shouldn't have been. I'm very careful about character encoding in HTML Purifier, and we don't use the mb extension at all, so things should be good.

"try it, and fail elegantly". a coder who used it is -- imo -- bound to understand this.

This is an assumption I'm not comfortable with making, especially since:

  1. I've seen naughty SPL autoloaders out there in the wild
  2. An improper autoloader can sort-of work with HTML Purifier's, except for a few edge-cases, and thus go undetected
  3. Prepending can't hurt, except it costs a little more in terms of runtime. And if that's the problem, they should be using HTMLPurifier.includes.php anyway.

In short, you can do it the other way with a little bit of extra code, and the default doesn't break anything even when the person is doing it right, so I'm keeping things the way they are.

actually, there's a bit more to it. you also need to set the include path. ;-)

Oops; that's right. ;-)

you're using set_include_path(), which is slower than using absolute paths. use this instead:

The research in this area is murky, but you may be right. include_path, however, gives users the flexibility to have HTMLPurifier_* classes in other directory hierarchies. Once again, for performance, use HTMLPurifier.includes.php.

by the way, you make the folder 777, but the files remain 644. I then need to clean things up using sudo.

That's something I need to look into. Unfortunately, I don't know how to do Windows/Unix agnostic testing in this respect.

last point... here's an extended color name reference:

At some point I'll incorporate these in. I suppose they also need to be made case-insensitive.

Re: Ouch! the resources! :-(
May 12, 2008 10:24PM

It turns out, I can't implement the short-circuit optimization. Empty strings are problematic because it means that a lot of context variables, i.e. out-of-band information, never get setup, even though an application may expect them. Non-HTML strings are problematic because they can be transformed into HTML via Injectors or Filters.

HTML Purifier is too powerful for its own good, isn't it. :-/

Re: Ouch! the resources! :-(
May 12, 2008 11:10PM
by the way, you make the folder 777, but the files remain 644. I then need to clean things up using sudo.

Looked into it. It looks like the folder should actually be 755 (if possible; may not be if PHP is running under a different user or as nobody—777 is guaranteed to work but may be insecure in multi-user systems), so I've fixed that by applying a 022 umask before making any directories. And as far as I can tell, 644 is the correct behavior. What should the files actually be chmod'ed as?

Sorry, you do not have permission to post/reply in this forum.