Welcome! » Log In » Create A New Profile

excessive writing of serializer cache

Posted by ajh 
ajh
excessive writing of serializer cache
December 30, 2010 02:10AM

Hi,

I don't know a lot about how HTMLPurifier is working, however I am having a performance problem with the way it writes to its cache file. It seems that every time the HTMLPurifier_Generator is called, it calls $config->getHTMLDefinition which in turn ends up writing out its cache file to disk.

=> lib/htmlpurifier/HTMLPurifier.php: HTMLPurifier->purify()
=> lib/htmlpurifier/HTMLPurifier/Generator.php: HTMLPurifier_Generator->__construct()
=> lib/htmlpurifier/HTMLPurifier/Config.php: HTMLPurifier_Config->getHTMLDefinition()
=> lib/htmlpurifier/HTMLPurifier/Config.php: HTMLPurifier_Config->getDefinition()
=> lib/htmlpurifier/HTMLPurifier/DefinitionCache/Decorator/Cleanup.php: HTMLPurifier_DefinitionCache_Decorator_Cleanup->set()
=> lib/htmlpurifier/HTMLPurifier/DefinitionCache/Decorator.php: HTMLPurifier_DefinitionCache_Decorator->set()
=> lib/htmlpurifier/HTMLPurifier/DefinitionCache/Serializer.php: HTMLPurifier_DefinitionCache_Serializer->set()
=> lib/htmlpurifier/HTMLPurifier/DefinitionCache/Serializer.php: HTMLPurifier_DefinitionCache_Serializer->_write()

This happens *every* time the HTML Purifier is constructed, even when the cache file is already there. The cache file is being written out many times per second needlessly since it contains no modifications. Now since I am doing this on a clustered file system, it causes severe performance issues when multiple nodes in the cluster are all trying to write to the same file at the same time. It causes big delays due to having to obtain a write lock. To verify this, check your cache file (eg. HTML/4.2.0,bf0d135e38cd2bca306340d8d6a55127,1.ser) and the last modified date will always be the current minute if your site has active traffic.

So I did this small patch to use $cache->add instead of $cache->set which makes it avoid writing the cache if it already exists. So I would like to know - is this patch OK or is there a reason it needs to be a set() call?

Here's the patch:

diff --git a/lib/htmlpurifier/HTMLPurifier/Config.php b/lib/htmlpurifier/HTMLPurifier/Config.php
index 54d4085..4dfc09b 100644
--- a/lib/htmlpurifier/HTMLPurifier/Config.php
+++ b/lib/htmlpurifier/HTMLPurifier/Config.php
@@ -350,7 +350,7 @@ class HTMLPurifier_Config
             if (!empty($this->definitions[$type])) {
                 if (!$this->definitions[$type]->setup) {
                     $this->definitions[$type]->setup($this);
-                    $cache->set($this->definitions[$type], $this);
+                    $cache->add($this->definitions[$type], $this);
                 }
                 return $this->definitions[$type];
             }
@@ -390,7 +390,7 @@ class HTMLPurifier_Config
         $this->definitions[$type]->setup($this);
         $this->lock = null;
         // save in cache
-        $cache->set($this->definitions[$type], $this);
+        $cache->add($this->definitions[$type], $this);
         return $this->definitions[$type];
     }

Can you please let me know if there's anything wrong with running this patch to avoid the filesystem contention.

Thanks

Ash

Re: excessive writing of serializer cache
December 30, 2010 11:54AM

Excessive writing is indeed incorrect, and I think this code is not tested on this account (this would be a good set of unit tests to add.)

I think your diagnosis, however, may be slightly incorrect. In particular, what appears to be happening here is somehow, non-setup (the setup member variable is false) versions of definitions are being written to the cache, and then when it gets read out, we set it up and then attempt to write the new version (which is set up.) Can you give me the code you're using to invoke HTML Purifier? If you could also post an example of the serialized file that would be useful.

(Nota bene: I haven't set up a test case yet on account of being on vacation, I will do so soon.)

Re: excessive writing of serializer cache
December 30, 2010 03:37PM

Ah, ok, I think I understand where the problem is. I assume you are using the instructions from Customize! and pulling out a raw definition: in that case, the logic doesn't really work correctly. This needs a doc update and some careful thinking about what the desired semantics are.

Re: excessive writing of serializer cache
December 30, 2010 07:00PM

Fix is pretty big, I've stuck it on another branch for now.

http://repo.or.cz/w/htmlpurifier.git/commit/f3d050c517aab64a24f077ee6ea9009381db9de4

ajh
Re: excessive writing of serializer cache
December 30, 2010 07:06PM

Hi Ambush Commander,

Here is a cut down version of the code I am running. When I run this script it writes the cache every time:

<?php

require_once('library/HTMLPurifier.safe-includes.php');

function purify_html($text) {
  $config = HTMLPurifier_Config::createDefault();
  $config->set('HTML.DefinitionID', 'moodlehtml');
  $config->set('HTML.DefinitionRev', 1);
  $config->set('Cache.SerializerPath', '.');
  $config->set('Core.NormalizeNewlines', false);
  $config->set('Core.ConvertDocumentToFragment', true);
  $config->set('Core.Encoding', 'UTF-8');
  $config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
  $config->set('URI.AllowedSchemes', array('http'=>true, 'https'=>true, 'ftp'=>true, 'irc'=>true, 'nntp'=>true, 'news'=>true, 'rtsp'=>true, 'teamspeak'=>true, 'gopher'=>true, 'mms'=>true));
  $config->set('Attr.AllowedFrameTargets', array('_blank'));

  $def = $config->getHTMLDefinition(true);
  $def->addElement('nolink', 'Block', 'Flow', array());
  $def->addElement('tex', 'Inline', 'Inline', array());
  $def->addElement('algebra', 'Inline', 'Inline', array());
  $def->addElement('lang', 'Block', 'Flow', array(), array('lang'=>'CDATA'));
  $def->addAttribute('span', 'xxxlang', 'CDATA');

  $purifier = new HTMLPurifier($config);
  return $purifier->purify($text);
}

purify_html("test\n");

?>

I put a debug line in HTMLPurifier_DefinitionCache_Serializer::_write to print whenever its writing the cache:

ashley@mace:~/dev/htmlpurifier-4.2.0$ php test.php 
Writing!
ashley@mace:~/dev/htmlpurifier-4.2.0$ php test.php 
Writing!
ashley@mace:~/dev/htmlpurifier-4.2.0$ php test.php 
Writing!
ashley@mace:~/dev/htmlpurifier-4.2.0$ php test.php 
Writing!
Re: excessive writing of serializer cache
December 30, 2010 07:11PM

Yeah, the raw call was what I was looking for. See if you can try the tree from the link.

ajh
Re: excessive writing of serializer cache
December 30, 2010 08:05PM

Thanks very much Edward for the quick work on that patch, it's definately appreciated. I'm testing it out, and it's running *much* snappier now with those two fixes.

The Customize doc you linked me to does not appear to be updated yet with the new API, but I got the new docs from your git branch.

So in my example code, should it just be changed to:

  if ($def = $config->maybeGetRawHTMLDefinition();
    $def->addElement('nolink', 'Block', 'Flow', array());
    $def->addElement('tex', 'Inline', 'Inline', array());
    $def->addElement('algebra', 'Inline', 'Inline', array());
    $def->addElement('lang', 'Block', 'Flow', array(), array('lang'=>'CDATA'));
    $def->addAttribute('span', 'xxxlang', 'CDATA');
  }

and the rest can stay the same?

I will let you know once we've done some more load testing and will see if the problem has gone away.

Any idea when this will make it into a core release?

Many thanks

Ashley Holman

ajh
Re: excessive writing of serializer cache
December 31, 2010 12:45AM

Just confirming - we did some JMeter load tests and the problem has definitely gone away. The caching is working as it should now.

Could you please confirm when this might make it into a future release?

Thanks again.

Ash

Re: excessive writing of serializer cache
December 31, 2010 04:11AM

Yep, that example code is right.

This will go into 4.3.0, which will release as soon as I clear up a few outstanding bugs and then find the cycles to do a release.

Author:
Your Email:

Subject:

HTML input is enabled. Make sure you escape all HTML and angled brackets with &lt; and &gt;.

Auto-paragraphing is enabled. Double newlines will be converted to paragraphs; for single newlines, use the pre tag.

Allowed tags: a, abbr, acronym, b, blockquote, caption, cite, code, dd, del, dfn, div, dl, dt, em, i, ins, kbd, li, ol, p, pre, s, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, var.

For inputting literal code such as HTML and PHP for display, use CDATA tags to auto-escape your angled brackets, and pre to preserve newlines:

<pre><![CDATA[
Place code here
]]></pre>

Power users, you can hide this notice with:

.htmlpurifier-help {display:none;}

Message: