<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel>
        <title>Forums - Internals</title>
        <description>Discussion about development and new features for HTML Purifier.</description>
        <link>http://htmlpurifier.org/phorum/list.php?5</link>
        <lastBuildDate>Sun, 26 May 2013 02:36:47 -0400</lastBuildDate>
        <generator>Phorum 5.2.18</generator>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,6982,6982#msg-6982</guid>
            <title>Minify output (1 reply)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,6982,6982#msg-6982</link>
            <description><![CDATA[<p>It might be nice to minify the output. I have a rather simple implementation for minifying XHTML 1.0:</p>

<pre>
//Combine all white space
$html = preg_replace('/\s+/us', ' ', $html);
//Strip all white space surrounding block level elements
$html =  preg_replace(
    '/\s*(&lt;[\/]{0,1}(address|blockquote|center|dir|div|dl|fieldset|form|h1|h2|h3|h4|h5|h6|hr|isindex|menu|noframes|noscript|ol|p|pre|table|ul|dd|dt|frameset|li|tbody|td|tfoot|th|thead|tr|applet|button|del|iframe|ins|map|object|script)(\s[^&gt;]+?)*?&gt;)\s*/us',
    '$1',
    $html
);
//Strip all whitespace surrounding line breaks
$html = preg_replace('/\s*&lt;br \/&gt;\s*/us', '&lt;br /&gt;', $html);
</pre>

<p>It has a few drawbacks that I personally can live with:</p>

<p>1: It does not respect a literal nbsp (one not written as an HTML entity).</p>

<p>2: It strips whitespace from "pre" content.</p>

<p>I think this should be rather fixable, though.</p>
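
<p>One possible way to address the "pre" drawback, as a rough sketch only: split the markup on "pre" blocks and collapse whitespace only in the pieces between them. The helper name minify_preserving_pre() is made up for illustration, and the sketch assumes "pre" elements are not nested.</p>

<pre>
// Hypothetical helper, not part of HTML Purifier.
function minify_preserving_pre($html) {
    // Split on &lt;pre&gt;...&lt;/pre&gt; blocks and keep them in the result
    // (PREG_SPLIT_DELIM_CAPTURE returns the captured delimiters too).
    $parts = preg_split(
        '/(&lt;pre\b[^&gt;]*&gt;.*?&lt;\/pre&gt;)/usi',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    if ($parts === false) {
        // The pattern failed (for example on invalid UTF-8); give up safely.
        return $html;
    }
    foreach ($parts as $i =&gt; $part) {
        // Odd indexes hold the captured "pre" blocks; leave them untouched.
        if ($i % 2 === 1) {
            continue;
        }
        // Even indexes hold everything else; collapse runs of white space.
        $parts[$i] = preg_replace('/\s+/u', ' ', $part);
    }
    return implode('', $parts);
}
</pre>

<p>The block-level and line-break passes from above could then be applied to the even-indexed pieces in the same loop.</p>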

<p>Edit: Line breaks, don't strip look-a-likes, less greedy regex</p>

<p>Edited 5 time(s). Last edit at 05/22/2013 08:00PM by AJenbo.</p>]]></description>
            <dc:creator>AJenbo</dc:creator>
            <category>Internals</category>
            <pubDate>Wed, 22 May 2013 18:47:41 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,6970,6970#msg-6970</guid>
            <title>Linkify bug (3 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,6970,6970#msg-6970</link>
            <description><![CDATA[<p>The Linkify directive doesn't work correctly when a URL is followed by a non-breaking space or a comma.</p>

<p>Example: <a href="http://bit.ly/11RAUaU">http://bit.ly/11RAUaU</a></p>

<p>Proposed fix:</p>

<p>In Linkify.php, line 24, change from:</p>

<pre>
$bits = preg_split('#((?:https?|ftp)://[^\s\'"&lt;&gt;()]+)#S', $token-&gt;data, -1, PREG_SPLIT_DELIM_CAPTURE);
</pre>

<p>to:</p>

<pre>
$bits = preg_split('#((?:https?|ftp)://[^\s\'",&lt;&gt;()]+)#Su', $token-&gt;data, -1, PREG_SPLIT_DELIM_CAPTURE);
</pre>]]></description>
            <dc:creator>nAS</dc:creator>
            <category>Internals</category>
            <pubDate>Tue, 21 May 2013 18:25:48 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,6967,6967#msg-6967</guid>
            <title>Invalid PHPDoc - HTMLPurifier.php(82) (no replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,6967,6967#msg-6967</link>
            <description><![CDATA[<p>Line 82 of HTMLPurifier.php is</p>

<p>     * @param $config Optional HTMLPurifier_Config object for all instances of</p>

<p>And should be:</p>

<p>     * @param $config HTMLPurifier_Config object for all instances of</p>]]></description>
            <dc:creator>laurin1</dc:creator>
            <category>Internals</category>
            <pubDate>Fri, 17 May 2013 19:39:41 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,6966,6966#msg-6966</guid>
            <title>Invalid PHPDoc - Config.php(136) (1 reply)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,6966,6966#msg-6966</link>
            <description><![CDATA[<p>Line 136 of  Config.php is</p>

<p>     * @return Default HTMLPurifier_Config object.</p>

<p>And should be:</p>

<p>     * @return HTMLPurifier_Config object.</p>]]></description>
            <dc:creator>laurin1</dc:creator>
            <category>Internals</category>
            <pubDate>Sat, 18 May 2013 11:49:16 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,6921,6921#msg-6921</guid>
            <title>Scheme parsed incorrectly (1 reply)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,6921,6921#msg-6921</link>
            <description><![CDATA[<p>The code below is parsed incorrectly:</p>

<pre>
&lt;a href="{:test:}"&gt;&lt;/a&gt;
</pre>

<p>The problem is caused by incorrect detection of the scheme in the URL.</p>

<pre>
RFC 1738
2.1. The main parts of URLs

Scheme names consist of a sequence of characters. The lower case
letters "a"--"z", digits, and the characters plus ("+"), period
("."), and hyphen ("-") are allowed. For resiliency, programs
interpreting URLs should treat upper case letters as equivalent to
lower case in scheme names (e.g., allow "HTTP" as well as "http").
</pre>
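
<p>To illustrate the rule quoted above in isolation (a small sketch; looks_like_scheme() is a made-up helper, not part of HTML Purifier): a candidate scheme is accepted only if it consists of letters, digits, "+", "." and "-". In the example URL, the text before the first colon is "{", which the current pattern accepts as a scheme.</p>

<pre>
// Hypothetical helper for illustration; not part of HTML Purifier.
function looks_like_scheme($candidate) {
    // Letters, digits, "+", "." and "-" only, per the RFC 1738 text above.
    // Upper case is accepted too, since it is treated as equivalent.
    return preg_match('/^[a-zA-Z0-9+.\-]+\z/', $candidate) === 1;
}

var_dump(looks_like_scheme('http'));  // bool(true)
var_dump(looks_like_scheme('HTTP'));  // bool(true)
var_dump(looks_like_scheme('{'));     // bool(false)
</pre>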

<p>Patch to fix it:</p>

<pre>
diff --git a/library/HTMLPurifier/URIParser.php b/library/HTMLPurifier/URIParser.php
index 7179e4ab8991077aa2ff4b3a4fca0a4eafd416e0..a7e5dd66eaab4fa6cc8b3daa3856ec450f1c35ce 100644
--- a/library/HTMLPurifier/URIParser.php
+++ b/library/HTMLPurifier/URIParser.php
@@ -30,7 +30,7 @@ class HTMLPurifier_URIParser
         // Note that ["&lt;&gt;] are an addition to the RFC's recommended
         // characters, because they represent external delimeters.
         $r_URI = '!'.
-            '(([^:/?#"&lt;&gt;]+):)?'. // 2. Scheme
+            '(([a-zA-Z0-9\.\+\-]+):)?'. // 2. Scheme
             '(//([^/?#"&lt;&gt;]*))?'. // 4. Authority
             '([^?#"&lt;&gt;]*)'.       // 5. Path
             '(\?([^#"&lt;&gt;]*))?'.   // 7. Query
</pre>]]></description>
            <dc:creator>Michael Gusev</dc:creator>
            <category>Internals</category>
            <pubDate>Tue, 16 Apr 2013 16:57:32 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,6919,6919#msg-6919</guid>
            <title>HTMLPurifier_Strategy_MakeWellFormed is slow with big array of tokens (17 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,6919,6919#msg-6919</link>
            <description><![CDATA[<p>Hi there.</p>

<p>We ran into a problem with large portions of broken HTML.
If there are a lot of unclosed tags, HTMLPurifier_Strategy_MakeWellFormed becomes really slow.
The problem is in array_splice:</p>

<pre>
private function insertBefore($token) {
        array_splice($this-&gt;tokens, $this-&gt;t, 0, array($token));
}

private function remove() {
        array_splice($this-&gt;tokens, $this-&gt;t, 1);
}
</pre>

<p>If $this-&gt;tokens is an array with 20k nodes, array_splice is really slow.</p>

<p>I have a patch in which I replaced array_splice with a doubly linked list. It works up to 40 times faster than array_splice.</p>
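
<p>Roughly, the idea looks like the following. This is a simplified sketch and not the actual patch; the TokenNode and TokenList names are made up for illustration. Inserting or removing at a known node only rewires a few pointers, whereas array_splice has to shift the rest of the array each time.</p>

<pre>
// Illustrative only: a minimal doubly linked list of tokens.
class TokenNode {
    public $token;
    public $prev = null;
    public $next = null;
    public function __construct($token) {
        $this-&gt;token = $token;
    }
}

class TokenList {
    public $head = null;
    public $tail = null;

    // Append a token at the end of the list and return its node.
    public function append($token) {
        $node = new TokenNode($token);
        if ($this-&gt;tail === null) {
            $this-&gt;head = $this-&gt;tail = $node;
        } else {
            $node-&gt;prev = $this-&gt;tail;
            $this-&gt;tail-&gt;next = $node;
            $this-&gt;tail = $node;
        }
        return $node;
    }

    // Insert a token before $at in constant time.
    public function insertBefore(TokenNode $at, $token) {
        $node = new TokenNode($token);
        $node-&gt;next = $at;
        $node-&gt;prev = $at-&gt;prev;
        if ($at-&gt;prev === null) {
            $this-&gt;head = $node;
        } else {
            $at-&gt;prev-&gt;next = $node;
        }
        $at-&gt;prev = $node;
        return $node;
    }

    // Unlink $node in constant time and return the node that followed it.
    public function remove(TokenNode $node) {
        $next = $node-&gt;next;
        if ($node-&gt;prev === null) {
            $this-&gt;head = $next;
        } else {
            $node-&gt;prev-&gt;next = $next;
        }
        if ($next === null) {
            $this-&gt;tail = $node-&gt;prev;
        } else {
            $next-&gt;prev = $node-&gt;prev;
        }
        $node-&gt;prev = $node-&gt;next = null;
        return $next;
    }

    // Flatten the list back into a plain array of tokens.
    public function toArray() {
        $out = array();
        for ($n = $this-&gt;head; $n !== null; $n = $n-&gt;next) {
            $out[] = $n-&gt;token;
        }
        return $out;
    }
}
</pre>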

<p>Could anyone tell me where I should send the patch, or whether I should create a PR?</p>

<p>Thank you.</p>]]></description>
            <dc:creator>Michael Gusev</dc:creator>
            <category>Internals</category>
            <pubDate>Thu, 23 May 2013 07:59:28 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,6887,6887#msg-6887</guid>
            <title>Config Caching for Speed (3 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,6887,6887#msg-6887</link>
            <description><![CDATA[<p>I'll preface this by saying that we're probably using HTMLPurifier in a very nonstandard way. :) One example of a place we use this is in our Reporting system. We obviously want to allow our own scripts and custom HTML and such on the page, but any user-editable text on the page (of which there is <b>much</b>) we want to purify, right? We keep one HTMLPurifier instance around with one HTMLPurifier_Config, and then purify hundreds of strings, one-by-one, as we encounter them while parsing the report's definition. Simple, right? If there's a more preferred way of doing this, I'd certainly like to hear about it. :)</p>

<p>In the meantime, and as the hard-nosed optimizer on our Engineering team, I've done some profiling of how HTMLPurifier runs in our context. Since <a href="http://htmlpurifier.org/docs/enduser-slow.html">this page</a> says I should report my findings, well...</p>

<p>So far I've optimized two things in HTMLPurifier_Config. Even though we use the same instance of the Config across all our strings (in my profiling examples, we're purifying 431 strings), each purify() call ends up calling HTMLPurifier_Config-&gt;getAll() twice. In the source code, it specifically says "This is a pretty inefficient function, avoid if you can". 863 calls to getAll() (resulting also in roughly 100k calls to php::explode) profiled at about 1.6 seconds cumulative execution. Since the Config is finalized once and never changed after, I put a quick cache on the front of the function that holds the $ret value of the function. This reduced getAll's overhead from 1.6s to about 6 milliseconds.</p>

<p>My other optimization is in the get() method. Every get() would result in several simple error checks, and then a recursive call to parent configs looking for the key in question. In my example, my 431 string purifications end up calling HTMLPurifier_Config-&gt;get() about 37k times, resulting in around 73k recursive calls and calls to -&gt;has() on PropertyList. My profiles show this around 3.2s total time. Putting a small cache on the front of the get method reduces the number of recursive calls to about 5k, and the time footprint is reduced to about 329ms.</p>
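
<p>To give an idea of the shape of the get() cache without posting the real changes, here is a minimal sketch. CachedConfig is a made-up stand-in and not the actual HTMLPurifier_Config; it assumes values are only changed through set() and that a parent config is not modified after a child has cached a value from it.</p>

<pre>
// Hypothetical stand-in, not the real HTMLPurifier_Config.
class CachedConfig {
    private $values;
    private $parent;
    private $cache = array();
    public $lookups = 0; // counts uncached lookups, for demonstration only

    public function __construct(array $values, $parent = null) {
        $this-&gt;values = $values;
        $this-&gt;parent = $parent;
    }

    // Return the cached value if there is one; otherwise do the slow lookup once.
    public function get($key) {
        if (array_key_exists($key, $this-&gt;cache)) {
            return $this-&gt;cache[$key];
        }
        $this-&gt;cache[$key] = $this-&gt;lookup($key);
        return $this-&gt;cache[$key];
    }

    // The "slow" path: check local values, then recurse into the parent.
    private function lookup($key) {
        $this-&gt;lookups++;
        if (array_key_exists($key, $this-&gt;values)) {
            return $this-&gt;values[$key];
        }
        return $this-&gt;parent !== null ? $this-&gt;parent-&gt;get($key) : null;
    }

    // Any write drops the cache so stale values are never returned.
    public function set($key, $value) {
        $this-&gt;values[$key] = $value;
        $this-&gt;cache = array();
    }
}
</pre>

<p>With this shape, repeated get() calls for the same key on an unchanged config reach the parent chain only once.</p>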

<p>If anyone requests, I'll post the changes to my code. They're a bit hacked in and perhaps not up to HTMLPurifier standards, so maybe my overly-verbose explanation is sufficient. :)</p>

<p>Enjoy!</p>]]></description>
            <dc:creator>xathien</dc:creator>
            <category>Internals</category>
            <pubDate>Fri, 22 Mar 2013 19:36:13 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,6871,6871#msg-6871</guid>
            <title>XSS against nature (1 reply)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,6871,6871#msg-6871</link>
            <description><![CDATA[<p>After years of programming, everything about XSS makes me sadder day by day.
It is understandable that nothing is completely secure (a car, a house, anything is insecure in one way or another), but security reaching these levels is beyond me.
Of course there are many reasons to protect things.
Among the more difficult problems I have seen is overflow, which is mostly related to system structure.
Code injection? Many years ago I thought the engine (PHP or whatever) could have a prefix and/or suffix for all commands, as simple as a setting in php.ini for the engine.
And of course everything from the manual to PHP editors would handle those prefixes and suffixes.
That way there would be no problem with any injection, since your assigned command prefixes would be hard to find:
fast code, no need to clean, no need for anything.
Isn't that the same as what Java p-code, the PHP interpreter, JavaScript, or any other does? If code is not compiled it is interpreted, and if it is interpreted there is a translation table, so there could be prefixes, renames, and suffixes
that I don't believe a hacker would take the time to find out.
For now, I need a way to let a visitor write code in comments (not functional code), and I cannot find how to do that with this thing, which looks like an antivirus that one ends up turning off because in the end it does not let you work.
We need to be secure to some degree, but not live in a self-sanitized jail against viruses.</p>]]></description>
            <dc:creator>AvenidaGez</dc:creator>
            <category>Internals</category>
            <pubDate>Sun, 24 Mar 2013 17:21:01 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,6844,6844#msg-6844</guid>
            <title>Bug in 'A' tag correction. (3 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,6844,6844#msg-6844</link>
            <description><![CDATA[<p>Hello, I believe I've discovered a bug (in 4.5.0) in regards to correction of improperly formatted URLs.</p>

<p>Given the following code:</p>

<pre>
&lt;a href="<a href="http://www.example.com">http://www.example.com</a>&gt;example&lt;/a&gt;
</pre>

<p>HTML Purifier will correct it to:</p>

<pre>
&lt;a href="<a href="http://www.example.com">http://www.example.com</a>"&gt;&lt;/a&gt;
</pre>

<p>It properly adds in the missing quotation mark, but then strangely, removes the text that is enclosed in the 'A' tag.  </p>

<p>Thanks for making a fine product.</p>]]></description>
            <dc:creator>Joel</dc:creator>
            <category>Internals</category>
            <pubDate>Sat, 02 Mar 2013 14:55:59 -0500</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,6683,6683#msg-6683</guid>
            <title>include path override? (1 reply)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,6683,6683#msg-6683</link>
            <description><![CDATA[<p>I'm using HTMLPurifier.standalone.php and I see it does in its code this:</p>

<pre>
set_include_path(HTMLPURIFIER_PREFIX . PATH_SEPARATOR . get_include_path());
</pre>

<p>This puts the purifier path at the beginning of the include path. However, this has a bad effect inside a big application, since it means that every time the application includes a file, PHP looks in the purifier directory first. Is this line really necessary? As far as I can see, files are loaded using HTMLPURIFIER_PREFIX anyway, so there is no reason to add anything to the include path. Is there any reason to do it that I fail to see?</p>]]></description>
            <dc:creator>smalyshev</dc:creator>
            <category>Internals</category>
            <pubDate>Wed, 31 Oct 2012 17:40:43 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,6477,6477#msg-6477</guid>
            <title>Multiple values possible for URI.Host (2 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,6477,6477#msg-6477</link>
            <description><![CDATA[<p>Hi,</p>

<p>I would like to use HTML Purifier with, for example, the "HTML.TargetBlank" option, while taking into account multiple domains in the "URI.Host" option for different site languages.</p>

<p>Are you considering developing this behavior?</p>]]></description>
            <dc:creator>daniels</dc:creator>
            <category>Internals</category>
            <pubDate>Fri, 29 Jun 2012 10:42:09 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,6405,6405#msg-6405</guid>
            <title>Extracting elements (2 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,6405,6405#msg-6405</link>
            <description><![CDATA[<p>Hello! </p>

<p>I was wondering if there is a way to extract elements from the purified text. Something like:</p>

<p>&lt;?php</p>

<p>$strPurified = $purifier-&gt;Purify($DirtyHtml);
$arrElements = HTMLPurifier::Extract($strPurified, 'a,img,b');</p>

<p>?&gt;</p>

<p>and then use something like this: </p>

<p>&lt;?php</p>

<p>$strFirstLinkInText = $arrElements['a'][0];</p>

<p>?&gt;</p>
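
<p>For what it's worth, something close to this can probably be approximated outside the library with PHP's DOM extension. Below is a rough sketch; extract_elements() is a made-up helper and not an HTML Purifier API, and it assumes the input is already purified UTF-8 HTML.</p>

<pre>
// Hypothetical helper; HTML Purifier itself has no Extract() method.
function extract_elements($html, array $tagNames) {
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration is a common workaround so that loadHTML()
    // reads the fragment as UTF-8; the div just wraps the fragment.
    $doc-&gt;loadHTML('&lt;?xml encoding="UTF-8"&gt;&lt;div&gt;' . $html . '&lt;/div&gt;');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = array();
    foreach ($tagNames as $name) {
        $found[$name] = array();
        foreach ($doc-&gt;getElementsByTagName($name) as $element) {
            // Serialize each matching element back to an HTML string.
            $found[$name][] = $doc-&gt;saveHTML($element);
        }
    }
    return $found;
}

$arrElements = extract_elements($strPurified, array('a', 'img', 'b'));
</pre>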

<p>Wouldn't that be a great addition? Since HTMLPurifier is already able to completely tear apart HTML and rejoin it, this would be useful for implementing, on the server side, functionality that we would normally not want done on the client side. </p>

<p>Regards,
Vaibhav</p>]]></description>
            <dc:creator>Vaibhav Kaushal</dc:creator>
            <category>Internals</category>
            <pubDate>Mon, 30 Jul 2012 08:56:26 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,6294,6294#msg-6294</guid>
            <title>DisplayRemoteLinkURI injector (9 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,6294,6294#msg-6294</link>
            <description><![CDATA[<p>I've created a new injector named AutoFilter.DisplayRemoteLinkURI and explained how to add it at the link below. It works like DisplayLinkURI but only for remote URLs; local URLs stay the same. I couldn't figure out how to use the DisableExternal URIFilter inside it, so I wrote a temporary function to check whether a link is remote or local. </p>

<p><a href="http://stackoverflow.com/a/9804323/1262700">http://stackoverflow.com/a/9804323/1262700</a></p>

<p>Just wanted to inform you. If it's hacking the core and not allowed, I'd remove that.</p>]]></description>
            <dc:creator>tpaksu</dc:creator>
            <category>Internals</category>
            <pubDate>Sat, 24 Mar 2012 23:20:27 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,6233,6233#msg-6233</guid>
            <title>URI.DisableResources not loading (2 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,6233,6233#msg-6233</link>
            <description><![CDATA[<p>I've been trying to get the URI.DisableResources directive to work, but setting the config directive to true didn't seem to actually change anything. Digging into the code (which I'm very unfamiliar with as this is my first time working with HTMLPurifier), it seems to me that the URI.DisableResources filter is not being loaded into the HTMLPurifier_URIDefinition class.</p>

<pre>
&lt;?php
class HTMLPurifier_URIDefinition extends HTMLPurifier_Definition
{
// ...
    public function __construct() {
        $this-&gt;registerFilter(new HTMLPurifier_URIFilter_DisableExternal());
        $this-&gt;registerFilter(new HTMLPurifier_URIFilter_DisableExternalResources());
        $this-&gt;registerFilter(new HTMLPurifier_URIFilter_HostBlacklist());
        $this-&gt;registerFilter(new HTMLPurifier_URIFilter_SafeIframe());
        $this-&gt;registerFilter(new HTMLPurifier_URIFilter_MakeAbsolute());
        $this-&gt;registerFilter(new HTMLPurifier_URIFilter_Munge());
    }
</pre>

<p>Shouldn't there be a line in there for URIFilter_DisableResources? </p>

<p><code>$this-&gt;registerFilter(new HTMLPurifier_URIFilter_DisableResources());</code></p>

<p>-Brent</p>]]></description>
            <dc:creator>Brent C</dc:creator>
            <category>Internals</category>
            <pubDate>Fri, 02 Mar 2012 13:26:34 -0500</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,5520,5520#msg-5520</guid>
            <title>Deprecating trigger_error calls in the Tarot cards? (8 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,5520,5520#msg-5520</link>
            <description><![CDATA[<p>In some environments setting up your own PHP error_handler using <b>set_error_handler()</b> is not an option.  Have you considered converting all the trigger_error() calls to throw exceptions?</p>

<p>There are 51 locations that have trigger_error calls and 23 that throw *Exception in 4.2.0.</p>

<p>Is that a 5.x change? Or something that could be rolled into <s>4.3.x</s> 4.4.0?</p>]]></description>
            <dc:creator>Yzmir Ramirez</dc:creator>
            <category>Internals</category>
            <pubDate>Mon, 11 Apr 2011 15:05:16 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,5320,5320#msg-5320</guid>
            <title>AutoFormat.LinkifyEmail Option (3 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,5320,5320#msg-5320</link>
            <description><![CDATA[<p>I've added an option to automatically linkify e-mail addresses. Here's the commit message:</p>

<pre>
    Add AutoFormat.LinkifyEmail Option.
    
    If enabled, automatically provide links to e-mail addresses.
    For example,
      Send an e-mail to user@example.com.
    is transformed to
      Send an e-mail to &lt;a href="mailto:user@example.com"&gt;user@example.com&lt;/a&gt;.
    
    Signed-off-by: Bradley M. Froehle &lt;brad.froehle@gmail.com&gt;
</pre>

<p>The code is available in the 'linkify-email' branch of <a href="http://repo.or.cz/w/htmlpurifier/bfroehle.git.">http://repo.or.cz/w/htmlpurifier/bfroehle.git.</a></p>]]></description>
            <dc:creator>bfroehle</dc:creator>
            <category>Internals</category>
            <pubDate>Tue, 15 Feb 2011 13:07:40 -0500</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,5319,5319#msg-5319</guid>
            <title>HTML.SafeIframe Option (26 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,5319,5319#msg-5319</link>
            <description><![CDATA[<p>I've attempted to extend HTMLPurifier to allow (some) iframes.  Here's the commit message:</p>

<blockquote>
<p>Add new HTML.SafeIframe and URI.IframeHostWhitelist options.</p>

<p>Many online video providers (YouTube, Vimeo) and other web applications
(Google Maps, Google Calendar, etc) provide embed code in iframe format.
This introduces two new settings:
</p><ul><li><p> <code>HTML.SafeIframe</code> / bool, default: FALSE</p>

<p>    Whether or not to display iframe content.</p></li>
<li><p> <code>URI.IframeHostWhitelist</code> / list, default: array()</p>

<p>    A list of whitelisted hosts for iframe content.  Iframes are allowed
    only if the host explicitly matches an element of this array.</p></li>
</ul></blockquote>

<p>The code is available in the 'iframe' branch of <a href="http://repo.or.cz/w/htmlpurifier/bfroehle.git.">http://repo.or.cz/w/htmlpurifier/bfroehle.git.</a></p>]]></description>
            <dc:creator>bfroehle</dc:creator>
            <category>Internals</category>
            <pubDate>Mon, 26 Dec 2011 08:48:11 -0500</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,5203,5203#msg-5203</guid>
            <title>Cache directory permissions setting (8 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,5203,5203#msg-5203</link>
            <description><![CDATA[<p>Hello,</p>

<p>At present the cache directory permissions are hardcoded to 0755; sometimes it is useful to specify custom permissions (executing PHP from two different user accounts, phpunit, safe mode, etc.).</p>

<p>I have hacked the standard HTML Purifier to use custom permissions for the Moodle project - <a href="https://github.com/moodle/custom-htmlpurifier/compare/882ffed9babb9ddc20bfb0979b14bb52d64c96c4...MOODLE_20_STABLE">https://github.com/moodle/custom-htmlpurifier/compare/882ffed9babb9ddc20bfb0979b14bb52d64c96c4...MOODLE_20_STABLE</a></p>

<p>Is this interesting? Could I do anything to get something similar included in HTML Purifier?</p>

<p>Thanks for all your hard work on this library!</p>

<p>Petr</p>]]></description>
            <dc:creator>skodak</dc:creator>
            <category>Internals</category>
            <pubDate>Thu, 13 Jan 2011 21:33:33 -0500</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,5188,5188#msg-5188</guid>
            <title>Iterative traversal of DOM [patch] (7 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,5188,5188#msg-5188</link>
            <description><![CDATA[<p>Hi,</p>

<p>Since there are some deep DOMs, you can hit the maximum nesting level limit in tokenizeDOM (we've experienced this even with a maximum nesting level of 300). So here is an iterative version of the same function with a simple queue/dequeue approach. I was going to upload a .patch file, but no file upload is permitted. </p>

<p>NOTE: createStartNode() and createEndNode() are helper functions (actually parts of the old body of tokenizeDOM) and do not exist in the original code. You have to replace the original tokenizeDOM() function with these 3 functions:</p>

<pre>
/**
 * Iterative function that tokenizes a node, putting it into an accumulator.
 * To iterate is human, to recurse divine - L. Peter Deutsch
 * @param $node     DOMNode to be tokenized.
 * @param $tokens   Array-list of already tokenized tokens.
 * @return Tokens of node appended to previously passed tokens.
 */
protected function tokenizeDOM($node, &amp;$tokens) {

	$level = 0;
	$nodes = array($level =&gt; array($node));
	$closingNodes = array();
	do {
		while (!empty($nodes[$level])) {
			$node = array_shift($nodes[$level]); // FIFO
			$collect = $level &gt; 0; // the root (level 0) is not collected
			$needEndingTag = $this-&gt;createStartNode($node, $tokens, $collect);
			if ($needEndingTag) {
				$closingNodes[$level][] = $node;
			}
			if ($node-&gt;childNodes &amp;&amp; $node-&gt;childNodes-&gt;length) {
				$level++;
				$nodes[$level] = array();
				foreach ($node-&gt;childNodes as $childNode) {
					array_push($nodes[$level], $childNode);
				}
			}
		}
		$level--;
		if ($level &amp;&amp; isset($closingNodes[$level])) {
			while($node = array_pop($closingNodes[$level])) {
				$this-&gt;createEndNode($node, $tokens);
			}
		}
	} while ($level &gt; 0);
}

/**
 * @param $node  DOMNode to be tokenized.
 * @param $tokens   Array-list of already tokenized tokens.
 * @param $collect  Says whether or not start and end tokens are collected; set to
 *					false for the root node because it's the implicit DIV
 *					tag you're dealing with.
 * @return bool Whether the token needs an end token.
 */
protected function createStartNode($node, &amp;$tokens, $collect)
{
	// intercept non element nodes. WE MUST catch all of them,
	// but we're not getting the character reference nodes because
	// those should have been preprocessed
	if ($node-&gt;nodeType === XML_TEXT_NODE) {
		$tokens[] = $this-&gt;factory-&gt;createText($node-&gt;data);
		return false;
	} elseif ($node-&gt;nodeType === XML_CDATA_SECTION_NODE) {
		// undo libxml's special treatment of &lt;script&gt; and &lt;style&gt; tags
		$last = end($tokens);
		$data = $node-&gt;data;
		// (note $node-&gt;tagname is already normalized)
		if ($last instanceof HTMLPurifier_Token_Start &amp;&amp; ($last-&gt;name == 'script' || $last-&gt;name == 'style')) {
			$new_data = trim($data);
			if (substr($new_data, 0, 4) === '&lt;!--') {
				$data = substr($new_data, 4);
				if (substr($data, -3) === '--&gt;') {
					$data = substr($data, 0, -3);
				} else {
					// Highly suspicious! Not sure what to do...
				}
			}
		}
		$tokens[] = $this-&gt;factory-&gt;createText($this-&gt;parseData($data));
		return false;
	} elseif ($node-&gt;nodeType === XML_COMMENT_NODE) {
		// this code is only invoked for comments in script/style in versions
		// of libxml pre-2.6.28 (regular comments, of course, are still
		// handled regularly)
		$tokens[] = $this-&gt;factory-&gt;createComment($node-&gt;data);
		return false;
	} elseif (
		// not-well tested: there may be other nodes we have to grab
		$node-&gt;nodeType !== XML_ELEMENT_NODE
	) {
		return false;
	}

	$attr = $node-&gt;hasAttributes() ? $this-&gt;transformAttrToAssoc($node-&gt;attributes) : array();

	// We still have to make sure that the element actually IS empty
	if (!$node-&gt;childNodes-&gt;length) {
		if ($collect) {
			$tokens[] = $this-&gt;factory-&gt;createEmpty($node-&gt;tagName, $attr);
		}
		return false;
	} else {
		if ($collect) {
			$tokens[] = $this-&gt;factory-&gt;createStart(
				$tag_name = $node-&gt;tagName, // somehow, it gets dropped
				$attr
			);
		}
		return true;
	}
}

protected function createEndNode($node, &amp;$tokens)
{
	$tokens[] = $this-&gt;factory-&gt;createEnd($node-&gt;tagName);
}
</pre>

<p>Maxim Krizhanovsky</p>

<p>Favit Network</p>

<p><a href="http://favit.com">favit.com</a></p>]]></description>
            <dc:creator>Darhazer</dc:creator>
            <category>Internals</category>
            <pubDate>Wed, 19 Jan 2011 17:10:38 -0500</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,5164,5164#msg-5164</guid>
            <title>excessive writing of serializer cache (8 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,5164,5164#msg-5164</link>
            <description><![CDATA[<p>Hi,</p>

<p>I don't know a lot about how HTMLPurifier is working, however I am having a performance problem with the way it writes to its cache file.  It seems that every time the HTMLPurifier_Generator is called, it calls $config-&gt;getHTMLDefinition which in turn ends up writing out its cache file to disk.</p>

<pre>
=&gt; lib/htmlpurifier/HTMLPurifier.php: HTMLPurifier-&gt;purify()
=&gt; lib/htmlpurifier/HTMLPurifier/Generator.php: HTMLPurifier_Generator-&gt;__construct()
=&gt; lib/htmlpurifier/HTMLPurifier/Config.php: HTMLPurifier_Config-&gt;getHTMLDefinition()
=&gt; lib/htmlpurifier/HTMLPurifier/Config.php: HTMLPurifier_Config-&gt;getDefinition()
=&gt; lib/htmlpurifier/HTMLPurifier/DefinitionCache/Decorator/Cleanup.php: HTMLPurifier_DefinitionCache_Decorator_Cleanup-&gt;set()
=&gt; lib/htmlpurifier/HTMLPurifier/DefinitionCache/Decorator.php: HTMLPurifier_DefinitionCache_Decorator-&gt;set()
=&gt; lib/htmlpurifier/HTMLPurifier/DefinitionCache/Serializer.php: HTMLPurifier_DefinitionCache_Serializer-&gt;set()
=&gt; lib/htmlpurifier/HTMLPurifier/DefinitionCache/Serializer.php: HTMLPurifier_DefinitionCache_Serializer-&gt;_write()
</pre>

<p>This happens *every* time the HTML Purifier is constructed, even when the cache file is already there.  The cache file is being written out many times per second needlessly since it contains no modifications.  Now since I am doing this on a clustered file system, it causes severe performance issues when multiple nodes in the cluster are all trying to write to the same file at the same time.  It causes big delays due to having to obtain a write lock.  To verify this, check your cache file (eg. HTML/4.2.0,bf0d135e38cd2bca306340d8d6a55127,1.ser) and the last modified date will always be the current minute if your site has active traffic.</p>

<p>So I did this small patch to use $cache-&gt;add instead of $cache-&gt;set which makes it avoid writing the cache if it already exists.  So I would like to know - is this patch OK or is there a reason it needs to be a set() call?</p>

<p>Here's the patch:</p>

<pre>
diff --git a/lib/htmlpurifier/HTMLPurifier/Config.php b/lib/htmlpurifier/HTMLPurifier/Config.php
index 54d4085..4dfc09b 100644
--- a/lib/htmlpurifier/HTMLPurifier/Config.php
+++ b/lib/htmlpurifier/HTMLPurifier/Config.php
@@ -350,7 +350,7 @@ class HTMLPurifier_Config
             if (!empty($this-&gt;definitions[$type])) {
                 if (!$this-&gt;definitions[$type]-&gt;setup) {
                     $this-&gt;definitions[$type]-&gt;setup($this);
-                    $cache-&gt;set($this-&gt;definitions[$type], $this);
+                    $cache-&gt;add($this-&gt;definitions[$type], $this);
                 }
                 return $this-&gt;definitions[$type];
             }
@@ -390,7 +390,7 @@ class HTMLPurifier_Config
         $this-&gt;definitions[$type]-&gt;setup($this);
         $this-&gt;lock = null;
         // save in cache
-        $cache-&gt;set($this-&gt;definitions[$type], $this);
+        $cache-&gt;add($this-&gt;definitions[$type], $this);
         return $this-&gt;definitions[$type];
     }
</pre>

<p>Can you please let me know if there's anything wrong with running this patch to avoid the filesystem contention.</p>

<p>Thanks</p>

<p>Ash</p>]]></description>
            <dc:creator>ajh</dc:creator>
            <category>Internals</category>
            <pubDate>Fri, 31 Dec 2010 04:11:14 -0500</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,5117,5117#msg-5117</guid>
            <title>scripts should not be stripped out (3 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,5117,5117#msg-5117</link>
            <description><![CDATA[<p>I want all the scripts present in the user-contributed input to be saved in a file. Only the scripts should be saved; the rest of the HTML should be purified and echoed back to the client.
How can this be done?</p>]]></description>
            <dc:creator>sharath</dc:creator>
            <category>Internals</category>
            <pubDate>Thu, 02 Dec 2010 06:34:54 -0500</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,5056,5056#msg-5056</guid>
            <title>mbstring.func_overload enabled (3 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,5056,5056#msg-5056</link>
            <description><![CDATA[<p>Intro:
We use HTML Purifier with the EGroupware project.
Right now we use version 4.1.1 (for our latest release and developer tree).</p>

<p>We encountered problems with URIs containing umlauts or special characters when using UTF-8 and mbstring.func_overload = 7.

URIs were truncated by as many characters as there were umlauts in the URI in question.
The problem did not occur when mbstring.func_overload was switched off.</p>

<p>
We figured out that even encoding the umlauts could not solve the problem, because HTML Purifier,
due to a UTF-8 specific feature, automatically resolves all entities (according to the docs).
The workaround mentioned in the configuration docs, Core.EscapeNonASCIICharacters, did
not resolve the issue either.</p>

<p>Since we need mbstring.func_overload we would like to suggest a change to</p>

<p>File:
library/HTMLPurifier/PercentEncoder.php</p>

<p>Function:
encode</p>

<pre>
Index: library/HTMLPurifier/PercentEncoder.php
===================================================================
--- library/HTMLPurifier/PercentEncoder.php	(Revision 32902)
+++ library/HTMLPurifier/PercentEncoder.php	(Arbeitskopie)
@@ -49,13 +49,15 @@
      */
     public function encode($string) {
         $ret = '';
+        //error_log(__METHOD__.__LINE__.$string);
         for ($i = 0, $c = strlen($string); $i &lt; $c; $i++) {
-            if ($string[$i] !== '%' &amp;&amp; !isset($this-&gt;preserve[$int = ord($string[$i])]) ) {
+            if (substr($string,$i,1) !== '%' &amp;&amp; !isset($this-&gt;preserve[$int = ord(substr($string,$i,1))]) ) {
                 $ret .= '%' . sprintf('%02X', $int);
             } else {
-                $ret .= $string[$i];
+                $ret .= substr($string,$i,1);
             }
         }
+        //error_log(__METHOD__.__LINE__.$ret);
         return $ret;
     }
</pre>
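
<p>The underlying mismatch is that, with the overload active, strlen() counts characters while $string[$i] indexes bytes, so the loop stops one byte short for every two-byte umlaut. Note also that substr() then returns a whole character, so ord(substr(...)) would, as far as I can tell and assuming a UTF-8 internal encoding, see only the first byte of a multibyte character. A byte-wise variant that does not depend on strlen() or substr() at all could be sketched like this; the function name percentEncodeBytes and the $preserve table built here are hypothetical, not part of the library:</p>

<pre>
&lt;?php
// Hypothetical byte-wise encoder; $preserve maps byte values to true,
// mirroring the role of $this-&gt;preserve in the patch above.
function percentEncodeBytes($string, array $preserve)
{
    $ret = '';
    // unpack('C*') yields one integer per byte, independent of strlen()/substr().
    $bytes = ($string === '') ? array() : unpack('C*', $string);
    foreach ($bytes as $int) {
        if ($int !== 0x25 &amp;&amp; !isset($preserve[$int])) { // 0x25 is '%'
            $ret .= '%' . sprintf('%02X', $int);
        } else {
            $ret .= chr($int);
        }
    }
    return $ret;
}

// Preserve unreserved ASCII: digits, letters, and - . _ ~
$preserve = array();
foreach (array_merge(range(48, 57), range(65, 90), range(97, 122), array(45, 46, 95, 126)) as $b) {
    $preserve[$b] = true;
}
echo percentEncodeBytes("m\xC3\xBCller", $preserve), "\n"; // prints m%C3%BCller
</pre>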

<p>Best regards
Leithoff, Klaus</p>]]></description>
            <dc:creator>Klaus Leithoff</dc:creator>
            <category>Internals</category>
            <pubDate>Tue, 09 Nov 2010 11:30:10 -0500</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,4948,4948#msg-4948</guid>
            <title>New configurable option - disable CSS filters (9 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,4948,4948#msg-4948</link>
            <description><![CDATA[<p>Hello, would it be possible to create a new configurable option, something like:</p>

<p>CSS.disable = boolean</p>

<p>I need HTML Purifier to leave the style attribute of an element untouched and keep it as it is. The only solution I have found is to customize the file HTMLPurifier/AttrDef/CSS.php as follows:</p>

<pre>
class HTMLPurifier_AttrDef_CSS extends HTMLPurifier_AttrDef
{

    public function validate($css, $config, $context) {

      return $css;

      .........      
</pre>

<p>It would be much better if I didn't have to make this custom change and could use a configuration option instead.</p>]]></description>
            <dc:creator>Thimble</dc:creator>
            <category>Internals</category>
            <pubDate>Fri, 12 Nov 2010 13:45:55 -0500</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,4886,4886#msg-4886</guid>
            <title>Content Lost using Filter.ExtractStyleBlocks with large and/or many style blocks (11 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,4886,4886#msg-4886</link>
            <description><![CDATA[<p>HTMLPurifier_Filter_ExtractStyleBlocks:43
</p>

<pre>
$html = preg_replace_callback('#&lt;style(?:\s.*)?&gt;(.+)&lt;/style&gt;#isU', array($this, 'styleCallback'), $html);
</pre>

<p>When there is a lot of style content, preg_replace_callback returns null.</p>

<p>Also, if there are several style blocks, this regular expression captures all the content between two style blocks:</p>

<pre>
&lt;style&gt;
...
&lt;/style&gt;
Content placed here is lost !!
&lt;style&gt;
...
&lt;/style&gt;
</pre>
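
<p>Both symptoms can be checked outside HTML Purifier. The sketch below is not the library's code, and the function name extractStyleBlocks is hypothetical. It marks the quantifiers lazy explicitly and leaves out the U modifier (under U, a trailing ? makes a quantifier greedy again), and it tests for a null return so that a PCRE failure, such as an exceeded backtrack limit, does not silently discard the whole document.</p>

<pre>
&lt;?php
// Hypothetical helper: removes &lt;style&gt; blocks and collects their contents.
function extractStyleBlocks($html, array &amp;$styles)
{
    $result = preg_replace_callback(
        '#&lt;style(?:\s[^&gt;]*)?&gt;(.*?)&lt;/style&gt;#is',
        function ($m) use (&amp;$styles) {
            $styles[] = $m[1];
            return '';
        },
        $html
    );
    if ($result === null) {
        // preg_* functions return null on errors such as PREG_BACKTRACK_LIMIT_ERROR.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $result;
}

$styles = array();
$html = "&lt;style&gt;a{}&lt;/style&gt;Content placed here&lt;style type=\"text/css\"&gt;b{}&lt;/style&gt;";
echo extractStyleBlocks($html, $styles), "\n"; // prints Content placed here
</pre>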

<p>I fixed both problems with lazy matching in the regular expression.
</p>

<pre>
#&lt;style(?:\s.*)?&gt;(.+?)&lt;/style&gt;#isU
</pre>]]></description>
            <dc:creator>rickdt</dc:creator>
            <category>Internals</category>
            <pubDate>Mon, 13 Sep 2010 16:06:27 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,4820,4820#msg-4820</guid>
            <title>Extending few classes (7 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,4820,4820#msg-4820</link>
            <description><![CDATA[<p>Hello everyone,</p>

<p>Moodle uses a modified version of HTML Purifier; you can see a diff between the Moodle and vanilla versions below.
This is a bit problematic for me, as it forces me (in Debian, where I'm a maintainer) to use the version bundled with Moodle, and I'd prefer to use your version, which is already packaged for Debian.
The adjustments that Moodle made are necessary; to re-implement them I would need to extend a few classes:
* HTMLPurifier_AttrDef_Lang
* HTMLPurifier_HTMLModule_Text
* HTMLPurifier_HTMLModule_XMLCommonAttributes</p>

<p>Could you suggest a way to make HTML Purifier use the extended classes (how could I inject them)? Of course, I would like to do it without modifying or patching the HTML Purifier code.</p>

<p>cheers,
Tomek</p>

<pre>
diff -ru vanilla/HTMLPurifier/AttrDef/Lang.php moodle/HTMLPurifier/AttrDef/Lang.php
--- vanilla/HTMLPurifier/AttrDef/Lang.php	2010-06-01 04:22:39.000000000 +0100
+++ moodle/HTMLPurifier/AttrDef/Lang.php	2010-05-22 01:04:23.000000000 +0100
@@ -9,6 +9,10 @@
 
     public function validate($string, $config, $context) {
 
+// moodle change - we use special lang strings unfortunatelly
+        return preg_replace('/[^0-9a-zA-Z_-]/', '', $string);
+// moodle change end
+
         $string = trim($string);
         if (!$string) return false;
 
diff -ru vanilla/HTMLPurifier/HTMLModule/Text.php moodle/HTMLPurifier/HTMLModule/Text.php
--- vanilla/HTMLPurifier/HTMLModule/Text.php	2010-06-01 04:22:39.000000000 +0100
+++ moodle/HTMLPurifier/HTMLModule/Text.php	2010-05-22 01:04:23.000000000 +0100
@@ -45,6 +45,13 @@
         $this-&gt;addElement('span', 'Inline', 'Inline', 'Common');
         $this-&gt;addElement('br',   'Inline', 'Empty',  'Core');
 
+        // Moodle specific elements - start
+        $this-&gt;addElement('nolink',  'Inline', 'Flow');
+        $this-&gt;addElement('tex',     'Inline', 'Flow');
+        $this-&gt;addElement('algebra', 'Inline', 'Flow');
+        $this-&gt;addElement('lang',    'Inline', 'Flow', 'I18N');
+        // Moodle specific elements - end
+        
         // Block Phrasal --------------------------------------------------
         $this-&gt;addElement('address',     'Block', 'Inline', 'Common');
         $this-&gt;addElement('blockquote',  'Block', 'Optional: Heading | Block | List', 'Common', array('cite' =&gt; 'URI') );
diff -ru vanilla/HTMLPurifier/HTMLModule/XMLCommonAttributes.php moodle/HTMLPurifier/HTMLModule/XMLCommonAttributes.php
--- vanilla/HTMLPurifier/HTMLModule/XMLCommonAttributes.php	2010-06-01 04:22:39.000000000 +0100
+++ moodle/HTMLPurifier/HTMLModule/XMLCommonAttributes.php	2010-05-22 01:04:23.000000000 +0100
@@ -5,9 +5,11 @@
     public $name = 'XMLCommonAttributes';
 
     public $attr_collections = array(
+/* moodle comment - xml:lang breaks our multilang
         'Lang' =&gt; array(
             'xml:lang' =&gt; 'LanguageCode',
         )
+*/
     );
 }
 
diff -ru vanilla/HTMLPurifier/Lexer.php moodle/HTMLPurifier/Lexer.php
--- vanilla/HTMLPurifier/Lexer.php	2010-06-01 04:22:39.000000000 +0100
+++ moodle/HTMLPurifier/Lexer.php	2010-07-06 01:04:03.000000000 +0100
@@ -252,8 +252,10 @@
     public function normalize($html, $config, $context) {
 
         // normalize newlines to \n
-        $html = str_replace("\r\n", "\n", $html);
-        $html = str_replace("\r", "\n", $html);
+        if ($config-&gt;get('Output.Newline')!=="\n") {
+            $html = str_replace("\r\n", "\n", $html);
+            $html = str_replace("\r", "\n", $html);
+        }
 
         if ($config-&gt;get('HTML.Trusted')) {
             // escape convoluted CDATA
</pre>]]></description>
            <dc:creator>Tomasz Muras</dc:creator>
            <category>Internals</category>
            <pubDate>Sat, 11 Sep 2010 02:56:43 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,4692,4692#msg-4692</guid>
            <title>Stop stripping tags that are outside of the body.. (1 reply)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,4692,4692#msg-4692</link>
            <description><![CDATA[<p>We are trying to figure out how to make sure html, head, meta, style, title, aren't stripped.</p>

<p>In an effort to stop them from being removed by HTMLPurifier we used the addElement method.  We seem to be able to get them all to work EXCEPT HEAD.</p>

<pre>
                        $oDef = $oConfig-&gt;getHTMLDefinition(true);

                        $oDef-&gt;addElement(
                                'style', // name
                                false, // content set
                                'Optional: #PCDATA', // allowed children
                                'Common', // attribute collection
                                array( // attributes
                                        'type' =&gt; 'CDATA',
                                ));

                        $oDef-&gt;addElement(
                                'title', // name
                                false, // content set
                                'Optional: #PCDATA', // allowed children
                                'I18N', // attribute collection
                                array( // attributes
                                ));

                        $oDef-&gt;addElement(
                                'meta', // name
                                false, // content set
                                'Empty', // allowed children
                                'I18N', // attribute collection
                                array( // attributes
                                        'http-equiv' =&gt; 'CDATA',
                                        'name' =&gt; 'CDATA',
                                        'content' =&gt; 'CDATA',
                                        'scheme' =&gt; 'CDATA',
                                ));

                        $oDef-&gt;addElement(
                                'head', // name
                                false, // content set
                                'Optional: Flow | #PCDATA | title | style | meta', // allowed children
                                'Common', // attribute collection
                                array( // attributes
                                ));

                        $oDef-&gt;addElement(
                                'body', // name
                                false, // content set
                                'Optional: Flow | #PCDATA | Inline', // allowed children
                                'Common', // attribute collection
                                array( // attributes
                                ));

                        $html = $oDef-&gt;addElement(
                                'html',  // name
                                false, // content set
                                'Optional: Flow | #PCDATA | head | body | title | style | meta', // allowed children
                                'Common', // attribute collection
                                array( // attributes
#                                       'action*' =&gt; 'URI',
#                                       'method' =&gt; 'Enum#get|post',
#                                       'name' =&gt; 'ID'
                                ));
                        $html-&gt;excludes = array('html'=&gt;true);

</pre>

<p>You may ask why we have title, style, meta inside the html.  It's because the HEAD isn't working yet.  So in the meantime we put them there since they seem to render even though they are children of HEAD.</p>]]></description>
            <dc:creator>Chris Altman</dc:creator>
            <category>Internals</category>
            <pubDate>Fri, 18 Jun 2010 08:52:35 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,4672,4672#msg-4672</guid>
            <title>Ruleset validation. (3 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,4672,4672#msg-4672</link>
            <description><![CDATA[<p>HTMLPurifier does not do a good job of validating its configuration and handling unexpected values gracefully. In some cases, HTMLPurifier can terminate abruptly if its configuration is not set properly.</p>

<p>Would it be possible to provide an API to validate added rules? I didn't see a bug tracker to log the request, so I apologize if this is the wrong place.</p>

<p>Drak</p>]]></description>
            <dc:creator>Drak</dc:creator>
            <category>Internals</category>
            <pubDate>Sat, 19 Jun 2010 21:27:07 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,4640,4640#msg-4640</guid>
            <title>Non Latin Domains (6 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,4640,4640#msg-4640</link>
            <description><![CDATA[<p>Hello,
I searched the docs and the forum, but I didn't find any information about non-Latin domains. I tested with several such domains and they were all removed by HTML Purifier (tested with 4.0.0, 4.1.0, 4.1.1).</p>

<p><a href="http://non-latin-domain">Test</a> results in <a href="">Test</a></p>

<p>Does the library support non-Latin domains, or is this feature on your to-do list?
Any ideas for a quick fix for the issue?</p>

<p>Thanks!</p>

<p>Georgi</p>]]></description>
            <dc:creator>Georgi</dc:creator>
            <category>Internals</category>
            <pubDate>Fri, 06 Jan 2012 12:05:49 -0500</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,4616,4616#msg-4616</guid>
            <title>Should Purifier do a double run? (3 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,4616,4616#msg-4616</link>
            <description><![CDATA[<p>With HTML Purifier set to remove empty elements and to remove spans without attributes, this input:</p>

<p>&lt;pre&gt;&lt;![CDATA[</p>

<p>&lt;p&gt;&lt;span style=&quot;font-family: &quot;&gt;
&lt;p align=&quot;left&quot;&gt;Installation and Testing of the Electrical &amp; Instrumentation&lt;/p&gt;
&lt;p align=&quot;left&quot;&gt;Works&lt;/p&gt;
&lt;p align=&quot;left&quot;&gt;&amp;bull; Installation of Primary &amp; Secondary Containment&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt; &lt;/p&gt;</p>

<p>]]&gt;&lt;/pre&gt;</p>

<p>Produces the following purified output:</p>

<p>&lt;pre&gt;&lt;![CDATA[</p>

<p>&lt;p&gt;
&lt;/p&gt;&lt;p align=&quot;left&quot;&gt;Installation and Testing of the Electrical &amp;amp; Instrumentation&lt;/p&gt;
&lt;p align=&quot;left&quot;&gt;Works&lt;/p&gt;
&lt;p align=&quot;left&quot;&gt;• Installation of Primary &amp;amp; Secondary Containment&lt;/p&gt;</p>

<p>
]]&gt;&lt;/pre&gt;</p>

<p>The purified output is valid, of course, but it still contains an empty element.</p>

<p>If you run that through again, it's further purified, to remove the empty paragraph:</p>

<p>&lt;pre&gt;&lt;![CDATA[</p>

<p>&lt;p align=&quot;left&quot;&gt;Installation and Testing of the Electrical &amp;amp; Instrumentation&lt;/p&gt;
&lt;p align=&quot;left&quot;&gt;Works&lt;/p&gt;
&lt;p align=&quot;left&quot;&gt;• Installation of Primary &amp;amp; Secondary Containment&lt;/p&gt;</p>

<p>]]&gt;&lt;/pre&gt;</p>

<p>Should Purifier run on a loop, repeatedly purifying until no changes are made to the HTML string?</p>
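
<p>Such a loop could be sketched as below. The function name purifyUntilStable and the $maxPasses guard are hypothetical, and the callable stands in for whatever performs one purification pass (for example a closure around $purifier-&gt;purify()); nothing here is part of HTML Purifier itself.</p>

<pre>
&lt;?php
// Hypothetical helper: re-runs one purification pass until the output stops changing.
function purifyUntilStable($purify, $html, $maxPasses = 5)
{
    for ($pass = 0; $pass &lt; $maxPasses; $pass++) {
        $next = call_user_func($purify, $html);
        if ($next === $html) {
            break; // fixed point reached
        }
        $html = $next;
    }
    return $html;
}

// Stand-in for a single pass: removes one empty paragraph per call.
$onePass = function ($html) {
    return preg_replace('#&lt;p&gt;\s*&lt;/p&gt;#', '', $html, 1);
};
echo purifyUntilStable($onePass, '&lt;div&gt;&lt;p&gt;&lt;p&gt;&lt;/p&gt;&lt;/p&gt;text&lt;/div&gt;'), "\n"; // prints &lt;div&gt;text&lt;/div&gt;
</pre>

<p>Each extra pass costs a full parse, so the pass limit keeps the worst case bounded.</p>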

<p>TRiG.</p>]]></description>
            <dc:creator>TRiG</dc:creator>
            <category>Internals</category>
            <pubDate>Wed, 26 May 2010 13:39:14 -0400</pubDate>
        </item>
        <item>
            <guid>http://htmlpurifier.org/phorum/read.php?5,4591,4591#msg-4591</guid>
            <title>Custom filters on DOM level (5 replies)</title>
            <link>http://htmlpurifier.org/phorum/read.php?5,4591,4591#msg-4591</link>
            <description><![CDATA[<p>If I understand correctly (please say if I don't), HTML Purifier parses the input into a DOM tree with DOMDocument. Most of the manipulations operate on this DOM tree before it is turned back into an HTML string.</p>

<p>Custom filters (subclasses of HTMLPurifier_Filter), on the other hand, operate on HTML strings with regular expressions.</p>

<p>Would it be possible to add an interface for custom filters that operate directly on the DOM tree? I imagine this could be more powerful than regex filtering.</p>

<p>Parsing is quite an expensive operation (right?), so I guess it's a good idea to do it only once.</p>

<p>See also
<a href="http://drupal.org/node/808868">http://drupal.org/node/808868</a></p>]]></description>
            <dc:creator>donquixote</dc:creator>
            <category>Internals</category>
            <pubDate>Tue, 25 May 2010 16:46:30 -0400</pubDate>
        </item>
    </channel>
</rss>
