|
excessive writing of serializer cache December 30, 2010 02:10AM |
Registered: 2 years ago Posts: 4 |
Hi,
I don't know a lot about how HTMLPurifier is working, however I am having a performance problem with the way it writes to its cache file. It seems that every time the HTMLPurifier_Generator is called, it calls $config->getHTMLDefinition which in turn ends up writing out its cache file to disk.
=> lib/htmlpurifier/HTMLPurifier.php: HTMLPurifier->purify() => lib/htmlpurifier/HTMLPurifier/Generator.php: HTMLPurifier_Generator->__construct() => lib/htmlpurifier/HTMLPurifier/Config.php: HTMLPurifier_Config->getHTMLDefinition() => lib/htmlpurifier/HTMLPurifier/Config.php: HTMLPurifier_Config->getDefinition() => lib/htmlpurifier/HTMLPurifier/DefinitionCache/Decorator/Cleanup.php: HTMLPurifier_DefinitionCache_Decorator_Cleanup->set() => lib/htmlpurifier/HTMLPurifier/DefinitionCache/Decorator.php: HTMLPurifier_DefinitionCache_Decorator->set() => lib/htmlpurifier/HTMLPurifier/DefinitionCache/Serializer.php: HTMLPurifier_DefinitionCache_Serializer->set() => lib/htmlpurifier/HTMLPurifier/DefinitionCache/Serializer.php: HTMLPurifier_DefinitionCache_Serializer->_write()
This happens *every* time the HTML Purifier is constructed, even when the cache file is already there. The cache file is being written out many times per second needlessly since it contains no modifications. Now since I am doing this on a clustered file system, it causes severe performance issues when multiple nodes in the cluster are all trying to write to the same file at the same time. It causes big delays due to having to obtain a write lock. To verify this, check your cache file (eg. HTML/4.2.0,bf0d135e38cd2bca306340d8d6a55127,1.ser) and the last modified date will always be the current minute if your site has active traffic.
So I did this small patch to use $cache->add instead of $cache->set which makes it avoid writing the cache if it already exists. So I would like to know - is this patch OK or is there a reason it needs to be a set() call?
Here's the patch:
diff --git a/lib/htmlpurifier/HTMLPurifier/Config.php b/lib/htmlpurifier/HTMLPurifier/Config.php
index 54d4085..4dfc09b 100644
--- a/lib/htmlpurifier/HTMLPurifier/Config.php
+++ b/lib/htmlpurifier/HTMLPurifier/Config.php
@@ -350,7 +350,7 @@ class HTMLPurifier_Config
if (!empty($this->definitions[$type])) {
if (!$this->definitions[$type]->setup) {
$this->definitions[$type]->setup($this);
- $cache->set($this->definitions[$type], $this);
+ $cache->add($this->definitions[$type], $this);
}
return $this->definitions[$type];
}
@@ -390,7 +390,7 @@ class HTMLPurifier_Config
$this->definitions[$type]->setup($this);
$this->lock = null;
// save in cache
- $cache->set($this->definitions[$type], $this);
+ $cache->add($this->definitions[$type], $this);
return $this->definitions[$type];
}
Can you please let me know if there's anything wrong with running this patch to avoid the filesystem contention.
Thanks
Ash
|
Re: excessive writing of serializer cache December 30, 2010 11:54AM |
Admin Registered: 6 years ago Posts: 2,640 |
Excessive writing is indeed incorrect, and I think this code is not tested on this account (this would be a good set of unit tests to add.)
I think your diagnosis, however, may be slightly incorrect. In particular, what appears to be happening here is somehow, non-setup (the setup member variable is false) versions of definitions are being written to the cache, and then when it gets read out, we set it up and then attempt to write the new version (which is set up.) Can you give me the code you're using to invoke HTML Purifier? If you could also post an example of the serialized file that would be useful.
(Nota bene: I haven't set up a test case yet on account of being on vacation, I will do so soon.)
|
Re: excessive writing of serializer cache December 30, 2010 03:37PM |
Admin Registered: 6 years ago Posts: 2,640 |
|
Re: excessive writing of serializer cache December 30, 2010 07:00PM |
Admin Registered: 6 years ago Posts: 2,640 |
Fix is pretty big, I've stuck it on another branch for now.
http://repo.or.cz/w/htmlpurifier.git/commit/f3d050c517aab64a24f077ee6ea9009381db9de4
|
Re: excessive writing of serializer cache December 30, 2010 07:06PM |
Registered: 2 years ago Posts: 4 |
Hi Ambush Commander,
Here is a cut down version of the code I am running. When I run this script it writes the cache every time:
<?php
require_once('library/HTMLPurifier.safe-includes.php');
function purify_html($text) {
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.DefinitionID', 'moodlehtml');
$config->set('HTML.DefinitionRev', 1);
$config->set('Cache.SerializerPath', '.');
$config->set('Core.NormalizeNewlines', false);
$config->set('Core.ConvertDocumentToFragment', true);
$config->set('Core.Encoding', 'UTF-8');
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
$config->set('URI.AllowedSchemes', array('http'=>true, 'https'=>true, 'ftp'=>true, 'irc'=>true, 'nntp'=>true, 'news'=>true, 'rtsp'=>true, 'teamspeak'=>true, 'gopher'=>true, 'mms'=>true));
$config->set('Attr.AllowedFrameTargets', array('_blank'));
$def = $config->getHTMLDefinition(true);
$def->addElement('nolink', 'Block', 'Flow', array());
$def->addElement('tex', 'Inline', 'Inline', array());
$def->addElement('algebra', 'Inline', 'Inline', array());
$def->addElement('lang', 'Block', 'Flow', array(), array('lang'=>'CDATA'));
$def->addAttribute('span', 'xxxlang', 'CDATA');
$purifier = new HTMLPurifier($config);
return $purifier->purify($text);
}
purify_html("test\n");
?>
I put a debug line in HTMLPurifier_DefinitionCache_Serializer::_write to print whenever its writing the cache:
ashley@mace:~/dev/htmlpurifier-4.2.0$ php test.php Writing! ashley@mace:~/dev/htmlpurifier-4.2.0$ php test.php Writing! ashley@mace:~/dev/htmlpurifier-4.2.0$ php test.php Writing! ashley@mace:~/dev/htmlpurifier-4.2.0$ php test.php Writing!
|
Re: excessive writing of serializer cache December 30, 2010 07:11PM |
Admin Registered: 6 years ago Posts: 2,640 |
|
Re: excessive writing of serializer cache December 30, 2010 08:05PM |
Registered: 2 years ago Posts: 4 |
Thanks very much Edward for the quick work on that patch, it's definately appreciated. I'm testing it out, and it's running *much* snappier now with those two fixes.
The Customize doc you linked me to does not appear to be updated yet with the new API, but I got the new docs from your git branch.
So in my example code, should it just be changed to:
if ($def = $config->maybeGetRawHTMLDefinition();
$def->addElement('nolink', 'Block', 'Flow', array());
$def->addElement('tex', 'Inline', 'Inline', array());
$def->addElement('algebra', 'Inline', 'Inline', array());
$def->addElement('lang', 'Block', 'Flow', array(), array('lang'=>'CDATA'));
$def->addAttribute('span', 'xxxlang', 'CDATA');
}
and the rest can stay the same?
I will let you know once we've done some more load testing and will see if the problem has gone away.
Any idea when this will make it into a core release?
Many thanks
Ashley Holman
|
Re: excessive writing of serializer cache December 31, 2010 12:45AM |
Registered: 2 years ago Posts: 4 |
|
Re: excessive writing of serializer cache December 31, 2010 04:11AM |
Admin Registered: 6 years ago Posts: 2,640 |