Welcome! » Log In » Create A New Profile

HTMLPurifier_Strategy_MakeWellFormed is slow with big array of tokens

Posted by Michael Gusev 

Hi there.

We met problem with big portions of broken html. If we have a lot of not closed tags then HTMLPurifier_Strategy_MakeWellFormed works really slow. Problem here in array_splice:

private function insertBefore($token) {
        array_splice($this->tokens, $this->t, 0, array($token));
}

private function remove() {
        array_splice($this->tokens, $this->t, 1);
}

if $this->tokens is array with 20k nodes then array_splice works really slow.

I have patch where I replaced array_splice by double linked list. It works up to 40 times faster then array_splice.

Could anyone direct me where I should send the patch or create PR?

Thank you.

diff --git a/library/HTMLPurifier/Array.php b/library/HTMLPurifier/Array.php
new file mode 100644
index 0000000000000000000000000000000000000000..4a40fa527104609d7e56cb4057516a7adeb97cd0
--- /dev/null
+++ b/library/HTMLPurifier/Array.php
@@ -0,0 +1,184 @@
+<?php
+
+class HTMLPurifier_Array implements ArrayAccess
+{
+    /**
+     * @param HTMLPurifier_ArrayNode
+     */
+    public $head = null;
+
+    /**
+     * @var int
+     */
+    protected $count = 0;
+
+    /**
+     * @var int
+     */
+    protected $offset = 0;
+
+    /**
+     * @var HTMLPurifier_ArrayNode
+     */
+    protected $offsetItem = null;
+
+
+    public function __construct(array $array = array())
+    {
+        /**
+         * @var HTMLPurifier_ArrayNode $temp
+         */
+        $temp = null;
+        $i = 0;
+
+        foreach ($array as &$v) {
+            $item = new HTMLPurifier_ArrayNode($v);
+
+            if ($this->head == null) {
+                $this->head = &$item;
+            }
+            if ($temp instanceof HTMLPurifier_ArrayNode) {
+                $item->prev = &$temp;
+                $temp->next = &$item;
+            }
+            unset($temp);
+            $temp = &$item;
+
+            $i ++;
+
+            unset($item, $v);
+        }
+        $this->count = $i;
+        $this->offset = 0;
+        $this->offsetItem = &$this->head;
+    }
+
+    protected function findIndex($offset)
+    {
+        if ($this->head == null) {
+            return array(
+                &#039;correct&#039; => false,
+                &#039;value&#039; => null
+            );
+        }
+
+        $current = &$this->head;
+        $index = 0;
+
+        if ($this->offset <= $offset && $this->offsetItem instanceof HTMLPurifier_ArrayNode) {
+            $current = &$this->offsetItem;
+            $index = $this->offset;
+        }
+
+        while ($current->next instanceof HTMLPurifier_ArrayNode && $index != $offset) {
+            $current = &$current->next;
+            $index ++;
+        }
+
+        if ($index == $offset) {
+            $this->offset = $offset;
+            $this->offsetItem = &$current;
+            return array(
+                &#039;correct&#039; => true,
+                &#039;value&#039; => &$current
+            );
+        }
+
+        return array(
+            &#039;correct&#039; => false,
+            &#039;value&#039; => &$current
+        );
+    }
+
+    public function insertBefore($offset, $value)
+    {
+        $result = $this->findIndex($offset);
+
+        $this->count ++;
+        $item = new HTMLPurifier_ArrayNode($value);
+        if ($result[&#039;correct&#039;] == false) {
+            if ($result[&#039;value&#039;] instanceof HTMLPurifier_ArrayNode) {
+                $result[&#039;value&#039;]->next = &$item;
+                $item->prev = &$result[&#039;value&#039;];
+            }
+        } else {
+            if ($result[&#039;value&#039;] instanceof HTMLPurifier_ArrayNode) {
+                $item->prev = &$result[&#039;value&#039;]->prev;
+                $item->next = &$result[&#039;value&#039;];
+            }
+            if ($item->prev instanceof HTMLPurifier_ArrayNode) {
+                $item->prev->next = &$item;
+            }
+            if ($result[&#039;value&#039;] instanceof HTMLPurifier_ArrayNode) {
+                $result[&#039;value&#039;]->prev = &$item;
+            }
+        }
+        if ($offset == 0) {
+            $this->head = &$item;
+        }
+        if ($offset <= $this->offset && $this->offsetItem instanceof HTMLPurifier_ArrayNode) {
+            $this->offsetItem = &$this->offsetItem->prev;
+        }
+    }
+
+    public function remove($offset)
+    {
+        $result = $this->findIndex($offset);
+
+        if ($result[&#039;correct&#039;]) {
+            $this->count --;
+            $item = $result[&#039;value&#039;];
+            $item->prev->next = &$result[&#039;value&#039;]->next;
+            $item->next->prev = &$result[&#039;value&#039;]->prev;
+            if ($offset == 0) {
+                $this->head = &$item->next;
+            }
+            if ($offset < $this->offset) {
+                $this->offset --;
+            } elseif ($offset == $this->offset) {
+                $this->offsetItem = &$item->next;
+            }
+        }
+    }
+
+    public function getArray()
+    {
+        $return = array();
+        $head = $this->head;
+
+        while ($head instanceof HTMLPurifier_ArrayNode) {
+            $return[] = $head->value;
+            $head = &$head->next;
+        }
+
+        return $return;
+    }
+
+    public function offsetExists($offset)
+    {
+        return $offset >= 0 && $offset < $this->count;
+    }
+
+    public function offsetGet($offset)
+    {
+        $result = $this->findIndex($offset);
+        if ($result[&#039;correct&#039;]) {
+            return $result[&#039;value&#039;]->value;
+        }
+
+        return null;
+    }
+
+    public function offsetSet($offset, $value)
+    {
+        $result = $this->findIndex($offset);
+        if ($result[&#039;correct&#039;]) {
+            $result[&#039;value&#039;]->value = &$value;
+        }
+    }
+
+    public function offsetUnset($offset)
+    {
+        $this->remove($offset);
+    }
+}
diff --git a/library/HTMLPurifier/ArrayNode.php b/library/HTMLPurifier/ArrayNode.php
new file mode 100644
index 0000000000000000000000000000000000000000..1871059b67fa786427a27616919f19f1ab2c95af
--- /dev/null
+++ b/library/HTMLPurifier/ArrayNode.php
@@ -0,0 +1,24 @@
+<?php
+
+class HTMLPurifier_ArrayNode
+{
+    public function __construct(&$value)
+    {
+        $this->value = &$value;
+    }
+
+    /**
+     * @var HTMLPurifier_ArrayNode
+     */
+    public $prev = null;
+
+    /**
+     * @var HTMLPurifier_ArrayNode
+     */
+    public $next = null;
+
+    /**
+     * @var mixed
+     */
+    public $value = null;
+}
diff --git a/library/HTMLPurifier/Strategy/MakeWellFormed.php b/library/HTMLPurifier/Strategy/MakeWellFormed.php
index c7aa1bb86d4885c6f6ce4bf9ac11ba889f36dc11..6aa92f313a70aaaf2ddc6f5cb15d400e0c8ad1d5 100644
--- a/library/HTMLPurifier/Strategy/MakeWellFormed.php
+++ b/library/HTMLPurifier/Strategy/MakeWellFormed.php
@@ -45,7 +45,7 @@ class HTMLPurifier_Strategy_MakeWellFormed extends HTMLPurifier_Strategy
     protected $context;
 
     public function execute($tokens, $config, $context) {
-
+        $tokens = new HTMLPurifier_Array($tokens);
         $definition = $config->getHTMLDefinition();
 
         // local variables
@@ -453,7 +453,7 @@ class HTMLPurifier_Strategy_MakeWellFormed extends HTMLPurifier_Strategy
         $context->destroy(&#039;CurrentToken&#039;);
 
         unset($this->injectors, $this->stack, $this->tokens, $this->t);
-        return $tokens;
+        return $tokens->getArray();
     }
 
     /**
@@ -508,7 +508,7 @@ class HTMLPurifier_Strategy_MakeWellFormed extends HTMLPurifier_Strategy
      * this token.  You must reprocess after this.
      */
     private function insertBefore($token) {
-        array_splice($this->tokens, $this->t, 0, array($token));
+        $this->tokens->insertBefore($this->t, $token);
     }
 
     /**
@@ -516,7 +516,7 @@ class HTMLPurifier_Strategy_MakeWellFormed extends HTMLPurifier_Strategy
      * occupied space.  You must reprocess after this.
      */
     private function remove() {
-        array_splice($this->tokens, $this->t, 1);
+        $this->tokens->remove($this->t);
     }
 
     /**
diff --git a/tests/HTMLPurifier/ArrayTest.php b/tests/HTMLPurifier/ArrayTest.php
new file mode 100644
index 0000000000000000000000000000000000000000..63a7bbbc8b8df02d6bdf45a71171f3ca9c53c054
--- /dev/null
+++ b/tests/HTMLPurifier/ArrayTest.php
@@ -0,0 +1,212 @@
+<?php
+
+class HTMLPurifier_ArrayTest extends UnitTestCase
+{
+    /**
+     * Data provider for the rest of tests
+     * @return array
+     */
+    public function getData()
+    {
+        return array(
+            array(array()),
+            array(array(1, 2, 3, 4))
+        );
+    }
+
+    /**
+     * Testing of initialization of properties of HTMLPurifier_Array
+     *
+     * @dataProvider getData
+     */
+    public function testConstruct($array)
+    {
+        $object = new HTMLPurifier_ArrayMock($array);
+
+        $this->assertEquals(0, $object->getOffset());
+        $this->assertEquals($object->getHead(), $object->getOffsetItem());
+        $this->assertEquals(count($array), $object->getCount());
+        $this->assertEquals($array, $object->getArray());
+    }
+
+    /**
+     * Testing of offset & offsetItem properties while seeking/removing/inserting
+     */
+    public function testFindIndex()
+    {
+        $array = array(1, 2, 3, 4, 5);
+        $object = new HTMLPurifier_ArrayMock($array);
+        for ($i = 0; $i < $object->getCount(); $i ++) {
+            $object[$i];
+            $this->assertEquals($i, $object->getOffset());
+            $this->assertEquals($array[$i], $object->getOffsetItem()->value);
+        }
+
+        $object[2];
+        $this->assertEquals(2, $object->getOffset());
+        $this->assertEquals(3, $object->getOffsetItem()->value);
+        $object->remove(2);
+        $this->assertEquals(2, $object->getOffset());
+        $this->assertEquals(4, $object->getOffsetItem()->value);
+
+        $object[1];
+        $this->assertEquals(1, $object->getOffset());
+        $this->assertEquals(2, $object->getOffsetItem()->value);
+        $object->insertBefore(1, &#039;a&#039;);
+        $this->assertEquals(1, $object->getOffset());
+        $this->assertEquals(&#039;a&#039;, $object->getOffsetItem()->value);
+    }
+
+    /**
+     * Testing that behavior of insertBefore the same as array_splice
+     *
+     * @dataProvider getData
+     */
+    public function testInsertBefore($array)
+    {
+        $object = new HTMLPurifier_ArrayMock($array);
+
+        $index = 0;
+        array_splice($array, $index, 0, array(&#039;a&#039;));
+        $object->insertBefore($index, &#039;a&#039;);
+        $this->assertEquals($array, $object->getArray());
+
+        $index = 2;
+        array_splice($array, $index, 0, array(&#039;a&#039;));
+        $object->insertBefore($index, &#039;a&#039;);
+        $this->assertEquals($array, $object->getArray());
+
+        $index = count($array) * 2;
+        array_splice($array, $index, 0, array(&#039;a&#039;));
+        $object->insertBefore($index, &#039;a&#039;);
+        $this->assertEquals($array, $object->getArray());
+    }
+
+    /**
+     * Testing that behavior of remove the same as array_splice
+     *
+     * @dataProvider getData
+     */
+    public function testRemove($array)
+    {
+        $object = new HTMLPurifier_ArrayMock($array);
+
+        $index = 0;
+        array_splice($array, $index, 1);
+        $object->remove($index);
+        $this->assertEquals($array, $object->getArray());
+
+        $index = 2;
+        array_splice($array, $index, 1);
+        $object->remove($index);
+        $this->assertEquals($array, $object->getArray());
+
+        $index = count($array) * 2;
+        array_splice($array, $index, 1);
+        $object->remove($index);
+        $this->assertEquals($array, $object->getArray());
+    }
+
+    /**
+     * Testing that object returns original array
+     *
+     * @dataProvider getData
+     */
+    public function testGetArray($array)
+    {
+        $object = new HTMLPurifier_ArrayMock($array);
+        $this->assertEquals($array, $object->getArray());
+    }
+
+    /**
+     * Testing ArrayAccess interface
+     *
+     * @dataProvider getData
+     */
+    public function testOffsetExists($array)
+    {
+        $object = new HTMLPurifier_ArrayMock($array);
+        $this->assertEquals(isset($array[0]), isset($object[0]));
+    }
+
+    /**
+     * Testing ArrayAccess interface
+     */
+    public function testOffsetGet()
+    {
+        $array = array(1, 2, 3);
+        $object = new HTMLPurifier_ArrayMock($array);
+        foreach ($array as $k => $v) {
+            $this->assertEquals($v, $object[$k]);
+        }
+    }
+
+    /**
+     * Testing ArrayAccess interface
+     */
+    public function testOffsetSet()
+    {
+        $array = array(1, 2, 3);
+        $object = new HTMLPurifier_ArrayMock($array);
+        foreach ($array as $k => $v) {
+            $v = $v * 2;
+            $object[$k] = $v;
+            $this->assertEquals($v, $object[$k]);
+        }
+    }
+
+    /**
+     * Testing ArrayAccess interface
+     * There is one difference: keys are updated as well, they are started from 0
+     */
+    public function testOffsetUnset()
+    {
+        $object = new HTMLPurifier_ArrayMock(array(1, 2, 3, 4));
+        unset($object[1]);
+        $this->assertEquals(array(1, 3, 4), $object->getArray());
+        unset($object[0]);
+        $this->assertEquals(array(3, 4), $object->getArray());
+        unset($object[1]);
+        $this->assertEquals(array(3), $object->getArray());
+        unset($object[0]);
+        $this->assertEquals(array(), $object->getArray());
+    }
+}
+
+/**
+ * Mock for some protected properties of HTMLPurifier_Array
+ */
+class HTMLPurifier_ArrayMock extends HTMLPurifier_Array
+{
+    /**
+     * @return HTMLPurifier_ArrayNode|null
+     */
+    public function getHead()
+    {
+        return $this->head;
+    }
+
+    /**
+     * @return int
+     */
+    public function getOffset()
+    {
+        return $this->offset;
+    }
+
+    /**
+     * @return int
+     */
+    public function getCount()
+    {
+        return $this->count;
+    }
+
+    /**
+     * @return HTMLPurifier_ArrayNode|null
+     */
+    public function getOffsetItem()
+    {
+        return $this->offsetItem;
+    }
+}
Re: HTMLPurifier_Strategy_MakeWellFormed is slow with big array of tokens
April 16, 2013 04:41PM

Hi. We can't use it because we need to push elements in middle of array (insertBefore method). SQLDoubleLinkedList doesn't have that ability.

Re: HTMLPurifier_Strategy_MakeWellFormed is slow with big array of tokens
April 17, 2013 03:31AM

Ugh, looks like that's true. What a travesty. Would it be possible to rename HTMLPurifier_Array to something that is more accurate?

Hi. No problem. Could you propose name for that class?

is HTMLPurifier_SplDoublyLinkedList okay for you?

Re: HTMLPurifier_Strategy_MakeWellFormed is slow with big array of tokens
April 23, 2013 07:51PM

Drop SPL, since it's not really SPL.

Some other code review comments: this code uses references (&) quite a bit. In PHP5, object identity semantics work the way you'd expect them to, so as long as the entities being passed around are objects, they should get shared automatically. I'd expect most of the ampersands to not be necessary.

Hi. About links between objects you are right. But there is another link. I need link not to object but to variable.

<?php

class a
{
    public $a = 0;
}

$o1 = new a();
$o1->a = 1;
$o2 = $o1;
$o1 = new a();
$o1->a = 2;

// $o1->a = 2;
// $o2->a = 1; // It&#039;s what I don&#039;t need.

$o1 = new a();
$o1->a = 1;
$o2 = &$o1;
$o1 = new a();
$o1->a = 2;

// $o1->a = 2;
// $o2->a = 2; // It&#039;s what I need.

In other words I need $o2 be link to $o1, not to object of $o1. So if $o1 becomes string or int or another object then $o2 will be also string or int or that object.

Re: HTMLPurifier_Strategy_MakeWellFormed is slow with big array of tokens
April 25, 2013 05:11PM

OK. It will take me a little bit of time to CR this properly; sorry about the delay.

Sure. Feel free to touch me any time if you have any questions about patch.

Are there any news/updates?

Re: HTMLPurifier_Strategy_MakeWellFormed is slow with big array of tokens
May 15, 2013 10:59AM

I think this code does a quite few suboptimal things. In particular, the notion of a cursor into the list is built into the array, whereas a more standard design is to have an external iterator object, or even just manipulate the list nodes directly. Similarly, findIndex iterates over the list, which is absolutely wacky! If I have a node, I get efficient O(1) access before and after. Don't expose the stupid interface!

For that task movement of cursor has only one direction: forward. In that case I skipped part of backward movement. findIndex doesn't go over the list. It goes from $this->offset. In our case $this->offset is current or prev element so we don't spend a lot of time to find element only one/two ++ operations.

There is only one problem: if we need lower element than $this->offset then findIndex starts from 0. But I couldn't find that usage in HTMLPurifier_Strategy_MakeWellFormed class. I can add -- iteration instead of ++ iteration from 0 if you wish. In that case to get prev element we should spend one/two -- operations instead of go over the list from 0 by ++ operation.

Re: HTMLPurifier_Strategy_MakeWellFormed is slow with big array of tokens
May 18, 2013 12:23PM

I guess my point is, if you are planning on not using this iterating functionality, don't give an interface that talks about numeric indexes.

Looking at what the code is doing right now, it all hinges around the $this->t variable, which is a numeric offset. Replace that instead with the current node, and all of the functions you need should magically resolve themselves. Don’t slavishly adhere to the array_splice interface!

Hi.

So I'm little bit confused.

Do you want me to implement Iterator interface? or do you want me to move double linked list inside of HTMLPurifier_Strategy_MakeWellFormed class instead of separate class and to use $this->t as pointer?

I think the first one is better solution and it's easy to implement than to add DLL as additional methods of HTMLPurifier_Strategy_MakeWellFormed class.

Re: HTMLPurifier_Strategy_MakeWellFormed is slow with big array of tokens
May 22, 2013 10:46PM

What I would suggest is implementing the doubly linked list much in the same way you would have implemented it in C. So do not implement an iterator, keep the classes for the nodes of the linked list but drop the "wrapper" class for the entire linked list. MakeWellFormed has a pointer to the first element of the list which gets updated as we traverse.

It's probably worth considering what the overhead of serializing to and from list to the linked list is; perhaps we should just always be in linked list. I'm sorry to be so picky about this but it is a core code of HTML Purifier so it needs to be held to a high standard.

Hi.

I got you point of view.

What I would suggest is implementing the doubly linked list much in the same way
you would have implemented it in C.

Could you show me example of that code? Because I googled and found a lot of different realizations of DLL in C.

I would like to propose to use http://php.net/manual/en/class.spldoublylinkedlist.php interface. Just to have ability to remove our class when SplDoublyLinkedList will have all requested functionality.

So do not implement an iterator, keep the classes for the nodes of the linked list
but drop the "wrapper" class for the entire linked list.

We can't drop wrapper because HTMLPurifier_Array is realization of DLL in PHP. We need method to create DLL from array, method to create array from DLL and pointer to the first element. In that case if we remove HTMLPurifier_Array we have to add those methods and pointer to another place. (My proposal in the end of that post).

MakeWellFormed has a pointer to the first element of the list which gets updated as we traverse.

Do you want me to update code of execute method to manipulate with $tokens etc. as object instead of as array? It make sense for me since $tokens is object of DLL.

It&#039;s probably worth considering what the overhead of serializing to and from list
to the linked list is; perhaps we should just always be in linked list. I&#039;m sorry to
be so picky about this but it is a core code of HTML Purifier so it needs to be
held to a high standard.

I agree with that. I can create new patch and update $tokens to be HTMLPurifier_Array object instead of php array. It should make HTMLPurifier more faster than proposed solution.

So my proposal is: To rename HTMLPurifier_Array to HTMLPurifier_ SplDoublyLinkedList and to implement SplDoublyLinkedList methods with our new methods as "add" and "remove". To update core code to make $tokens to be HTMLPurifier_ SplDoublyLinkedList object instead of array.

What are you thinking about?

Re: HTMLPurifier_Strategy_MakeWellFormed is slow with big array of tokens
June 04, 2013 03:47AM
Could you show me example of that code? Because I googled and found a lot of different realizations of DLL in C.

Right, this is more of a philosophical thing. In C, the way I tend to see a doubly linked list implemented is as a single struct, "node", which has three fields: next, prev, and data. Then, in whatever structure contains the linked list, you also have a head and tail pointer. That's it: it's very simple. (For this reason, I'm not keen on using the SPL interface if we have to reimplement it ourselves.)

In that case if we remove HTMLPurifier_Array we have to add those methods and pointer to another place.

Yep, nothing wrong with that.

The quadratic behaviour is not limited to MakeWellFormed. FixNesting is doing array_splices too.

So I think it makes sense to just use a custom linked list for the tokens from the very beginning. I'm not sure SplDoublyLinkedList is going anywhere. They recently added an .add() method, but there's still no remove as far as I can tell. With a custom implementation it's important to break cycles explicit in destructors by setting references to null, though, as the garbage collector can't handle them.

But if I remove the call to execute strategies from purify(), there's still quadratic behaviour in input tokenization (tokenizeHTML). With strategy execution and output generation removed, the following test script takes ~1s on my machine with 10000 tags and ~4s to execute with 20000 tags.

<?
    require_once("htmlpurifier-4.5.0/library/HTMLPurifier.auto.php");

    $config = HTMLPurifier_Config::createDefault();
    $in = array();
    for ($i = 0; $i < 20000; $i++) {
      $in[] = "<p>". $i . "</p>";
    }
    $input = implode($in);

    $htmlpurifier = new HTMLPurifier($config);
    $htmlpurifier->purify($input);
?>

I spent a little bit of time poking at the tokenizer code, and it seems the quadratic stuff is caused by array_shift in tokenizeDOM in DOMLex.php. With a completely flat DOM with 20000 tags, the code is going through a 20000-element array by popping from the front, and front-popping PHP arrays is O(N) per operation according to a recent comment (at the bottom) here:

http://php.net/manual/en/function.array-shift.php

Re: HTMLPurifier_Strategy_MakeWellFormed is slow with big array of tokens
September 17, 2013 03:49AM

Nice catch!

diff --git a/library/HTMLPurifier.includes.php b/library/HTMLPurifier.includes.php
index 18cb001..f18db99 100644
--- a/library/HTMLPurifier.includes.php
+++ b/library/HTMLPurifier.includes.php
@@ -57,6 +57,7 @@ require &#039;HTMLPurifier/Lexer.php&#039;;
 require &#039;HTMLPurifier/PercentEncoder.php&#039;;
 require &#039;HTMLPurifier/PropertyList.php&#039;;
 require &#039;HTMLPurifier/PropertyListIterator.php&#039;;
+require &#039;HTMLPurifier/Queue.php&#039;;
 require &#039;HTMLPurifier/Strategy.php&#039;;
 require &#039;HTMLPurifier/StringHash.php&#039;;
 require &#039;HTMLPurifier/StringHashParser.php&#039;;
diff --git a/library/HTMLPurifier.safe-includes.php b/library/HTMLPurifier.safe-includes.php
index e23a81a..b984755 100644
--- a/library/HTMLPurifier.safe-includes.php
+++ b/library/HTMLPurifier.safe-includes.php
@@ -51,6 +51,7 @@ require_once $__dir . &#039;/HTMLPurifier/Lexer.php&#039;;
 require_once $__dir . &#039;/HTMLPurifier/PercentEncoder.php&#039;;
 require_once $__dir . &#039;/HTMLPurifier/PropertyList.php&#039;;
 require_once $__dir . &#039;/HTMLPurifier/PropertyListIterator.php&#039;;
+require_once $__dir . &#039;/HTMLPurifier/Queue.php&#039;;
 require_once $__dir . &#039;/HTMLPurifier/Strategy.php&#039;;
 require_once $__dir . &#039;/HTMLPurifier/StringHash.php&#039;;
 require_once $__dir . &#039;/HTMLPurifier/StringHashParser.php&#039;;
diff --git a/library/HTMLPurifier/Lexer/DOMLex.php b/library/HTMLPurifier/Lexer/DOMLex.php
index c07ef4c..7207544 100644
--- a/library/HTMLPurifier/Lexer/DOMLex.php
+++ b/library/HTMLPurifier/Lexer/DOMLex.php
@@ -92,11 +92,11 @@ class HTMLPurifier_Lexer_DOMLex extends HTMLPurifier_Lexer
     protected function tokenizeDOM($node, &$tokens)
     {
         $level = 0;
-        $nodes = array($level => array($node));
+        $nodes = array($level => new HTMLPurifier_Queue(array($node)));
         $closingNodes = array();
         do {
-            while (!empty($nodes[$level])) {
-                $node = array_shift($nodes[$level]); // FIFO
+            while (!$nodes[$level]->isEmpty()) {
+                $node = $nodes[$level]->shift(); // FIFO
                 $collect = $level > 0 ? true : false;
                 $needEndingTag = $this->createStartNode($node, $tokens, $collect);
                 if ($needEndingTag) {
@@ -104,9 +104,9 @@ class HTMLPurifier_Lexer_DOMLex extends HTMLPurifier_Lexer
                 }
                 if ($node->childNodes && $node->childNodes->length) {
                     $level++;
-                    $nodes[$level] = array();
+                    $nodes[$level] = new HTMLPurifier_Queue();
                     foreach ($node->childNodes as $childNode) {
-                        array_push($nodes[$level], $childNode);
+                        $nodes[$level]->push($childNode);
                     }
                 }
             }
diff --git a/library/HTMLPurifier/Queue.php b/library/HTMLPurifier/Queue.php
new file mode 100644
index 0000000..f58db90
--- /dev/null
+++ b/library/HTMLPurifier/Queue.php
@@ -0,0 +1,56 @@
+<?php
+
+/**
+ * A simple array-backed queue, based off of the classic Okasaki
+ * persistent amortized queue.  The basic idea is to maintain two
+ * stacks: an input stack and an output stack.  When the output
+ * stack runs out, reverse the input stack and use it as the output
+ * stack.
+ *
+ * We don&#039;t use the SPL implementation because it&#039;s only supported
+ * on PHP 5.3 and later.
+ *
+ * Exercise: Prove that push/pop on this queue take amortized O(1) time.
+ *
+ * Exercise: Extend this queue to be a deque, while preserving amortized
+ * O(1) time.  Some care must be taken on rebalancing to avoid quadratic
+ * behaviour caused by repeatedly shuffling data from the input stack
+ * to the output stack and back.
+ */
+class HTMLPurifier_Queue {
+    private $input;
+    private $output;
+
+    public function __construct($input = array()) {
+        $this->input = $input;
+        $this->output = array();
+    }
+
+    /**
+     * Shifts an element off the front of the queue.
+     */
+    public function shift() {
+        if (empty($this->output)) {
+            $this->output = array_reverse($this->input);
+            $this->input = array();
+        }
+        if (empty($this->output)) {
+            return NULL;
+        }
+        return array_pop($this->output);
+    }
+
+    /**
+     * Pushes an element onto the front of the queue.
+     */
+    public function push($x) {
+        array_push($this->input, $x);
+    }
+
+    /**
+     * Checks if it&#039;s empty.
+     */
+    public function isEmpty() {
+        return empty($this->input) && empty($this->output);
+    }
+}

Did anyone test this patch to see if it actually improves performance? A very rough test I made does not show performance is improved ...

Re: HTMLPurifier_Strategy_MakeWellFormed is slow with big array of tokens
October 06, 2013 01:40PM

Did you test it on a large input size? I expect the new code to be slightly slower on small sizes; what we're really going for here is an asymptotic improvement.

I was aiming to purify a large html, to see how htmlpurifier performs. I'm not really profiling htmlpurifier and therefore I can't exactly repeat the test (did not save the input). However, I'm concerned with htmlpurifier performance on a server I have.

A real user input with 8530 occurrences of '<' (should be either opening or closing tags) took 259 seconds to purify. Together with normal traffic, this took my site down for a short while.

Would you be able to take a look at that input? How can I send you the file?

Re: HTMLPurifier_Strategy_MakeWellFormed is slow with big array of tokens
October 07, 2013 04:23PM

Yeah, this thread is about some quadratic behavior in other parts of HTML Purifier too. I've got someone interested in sponsoring a fix for this problem but I've been kind of swamped under recently.

That's great!

Sent you the file to the admin email.

Hi,

first, a big thank you for the wonderful HTML Purifier!

I might be running into the same problem (or into an infinite loop, not sure) - my 777 kB HTML file (ugly MS Word output) takes forever to process on HTML Purifier 4.5.0. ("Forever" means that after running on my Mac for 40 minutes, it died because it reached the PHP memory limit. On the server I killed it after an hour.)

You can download that HTML file here: http://www.strehle.de/tmp/htmlpurifier_stresstest.html

The code I'm using to test/reproduce this:

<?php

require_once &#039;HTMLPurifier.standalone.php&#039;;

$html = file_get_contents(&#039;htmlpurifier_stresstest.html&#039;);

$purifier_config = HTMLPurifier_Config::createDefault();
$purifier_config->set(&#039;Cache.SerializerPath&#039;, &#039;/tmp&#039;);
$purifier_config->autoFinalize = false;
$purifier_config->set(&#039;URI.DisableExternalResources&#039;, true);

$purifier = new HTMLPurifier($purifier_config);
echo $purifier->purify($html) . "\n";

?>

Might be an interesting test case if you can reproduce this on your side.

Kind regards, Tim Strehle

Re: HTMLPurifier_Strategy_MakeWellFormed is slow with big array of tokens
November 07, 2013 05:55PM

Hey Tim,

Another HTML Purifier user recently commissioned some improvements in this area. Please try a latest dev checkout and see if that helps you.

Hi,

wow, my test file http://www.strehle.de/tmp/htmlpurifier_stresstest.html (see above) is processed literally a 1,000 times faster with the latest development version from repo.or.cz compared to 4.5.0!

On my Mac, I ran into the memory limit after about 2 seconds, instead of after 40 minutes with version 4.5.0. And with ini_set('memory_limit', '384M') the file processes just fine now.

Thanks a lot for the hint and fixes. Are you planning on releasing a new official version soon? Otherwise I’ll just go with the current development version.

Kind regards, Tim Strehle

Author:
Your Email:

Subject:

HTML input is enabled. Make sure you escape all HTML and angled brackets with &lt; and &gt;.

Auto-paragraphing is enabled. Double newlines will be converted to paragraphs; for single newlines, use the pre tag.

Allowed tags: a, abbr, acronym, b, blockquote, caption, cite, code, dd, del, dfn, div, dl, dt, em, i, ins, kbd, li, ol, p, pre, s, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, var.

For inputting literal code such as HTML and PHP for display, use CDATA tags to auto-escape your angled brackets, and pre to preserve newlines:

<pre><![CDATA[
Place code here
]]></pre>

Power users, you can hide this notice with:

.htmlpurifier-help {display:none;}

Message: