
HTMLPurifier and <code> tags

Posted by dale 
HTMLPurifier and <code> tags
October 28, 2011 07:52PM

I have a forum whose users have spent 10 years learning to use <code>code here</code> to enter code, so using the pre + CDATA method isn't possible. It's also not possible to use jQuery to process the submission and dynamically convert the code into pre + CDATA, since CDATA isn't part of the DOM, just XML.

Does HTML Purifier have any functionality to help with that? Obviously it can't be done on the way out of HTML Purifier, since the stuff inside the code block is going to be blocked by the purifier, and I would really like to avoid doing manual string processing, for all the same reasons that HTML Purifier exists.

Cheers Dale

Re: HTMLPurifier and <code> tags
October 28, 2011 10:34PM

While manual text processing is suboptimal, it is probably the right thing here. As long as you do it before HTML Purifier it won't cause correctness problems.
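
A minimal sketch of that kind of pre-processing, assuming the code is wrapped in literal <code>...</code> tags and that escaping the contents is all that is needed. The helper name escape_code_blocks is made up for illustration; it is not part of HTML Purifier.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): escape whatever sits
// between <code> and </code> so the purifier sees text, not markup.
function escape_code_blocks(string $html): string
{
    $result = preg_replace_callback(
        '#<code>(.*?)</code>#is',
        function (array $m): string {
            // ENT_NOQUOTES: only &, < and > need escaping in element content.
            $flags = ENT_NOQUOTES | ENT_SUBSTITUTE;
            return '<code>' . htmlspecialchars($m[1], $flags, 'UTF-8') . '</code>';
        },
        $html
    );
    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

$prepared = escape_code_blocks('Try <code>if (a < b) { echo "<b>hi</b>"; }</code> now');
echo $prepared, "\n";
// Then, with an already configured purifier: $clean = $purifier->purify($prepared);
```

One known limitation of this sketch: a literal </code> inside the snippet ends the block early, so anything after it is left for the purifier to handle as ordinary markup.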

Re: HTMLPurifier and <code> tags
October 29, 2011 12:53AM

ok cool thanks

Two last questions while I am here. First, is there any functionality in place to autocorrect links in HTML, so that

<a href="google.com">google</a>

becomes the correct link? The other one is:

I have modified the YouTube filter to match plain YouTube links, such as http://www.youtube.com/watch?v=oHg5SJYRHA0, and auto-embed them. However, when I enable Linkify, it catches the link before the YouTube filter and turns it into a link before I can match it in the YouTube filter. Is there any way to enforce the order of those filters? I can't auto-embed links, because if the user has deliberately created a link then I don't want to embed it.

Thanks for the speedy response

Re: HTMLPurifier and <code> tags
October 29, 2011 01:37PM

Hmm, both of those are kind of tricky. We have no heuristics for fixing up links ("google.com" is a perfectly valid file name!). As for YouTube, can you run the translation code before Linkify?

Re: HTMLPurifier and <code> tags
February 17, 2012 07:08AM

is there any functionality in place to autocorrect links in html, so

<a href="google.com">google</a>

becomes the correct link

Out of the box, HTML Purifier doesn't offer anything (see Ambush Commander's answer), but if you can think of a way to make the transformation as you envision it, you can write an attribute transformation class for yourself:

class HTMLPurifier_AttrTransform_URLHeuristic extends HTMLPurifier_AttrTransform
{
    public function transform($attr, $config, $context)
    {
        // this is a (sloppy) example implementation
        // you probably want to code something better!
        if (isset($attr['href']) && !preg_match('#^[a-z][a-z0-9+.\-]*:#i', $attr['href'])) {
            // no URI scheme at all, so a relative link of some sort
            $tldHeuristics = array('.com', '.org', '.net'); // etc
            foreach ($tldHeuristics as $tld) {
                if (strpos($attr['href'], $tld) !== false) {
                    $attr['href'] = 'http://' . $attr['href'];
                    break; // prefix only once, even if several TLDs match
                }
            }
        }
        return $attr;
    }
}


// more configuration stuff up here
$htmlDef = $htmlPurifierConfiguration->getHTMLDefinition(true);
$anchor  = $htmlDef->addBlankElement('a');
$anchor->attr_transform_pre[] = new HTMLPurifier_AttrTransform_URLHeuristic();
// purify down here

Hope that helps.

(Edited for formatting after a forum glitch.)

Re: HTMLPurifier and <code> tags
February 19, 2012 01:18PM


I think I have a similar question to the OP, but I don't understand what he is asking about.

I have discovered HTML Purifier and would like to use it for my web site as I am building a wiki-style web site for game programming. I am using GeSHi to format and highlight my source code syntax and am having trouble integrating HTML Purifier with GeSHi.

I am using this line: echo preg_replace_callback('/\[(code|note|url)=?([^\]]*)\]([\w\W]*?)\[\/\1\]/', 'format_code', $contents_parent_id[$j]['contents']);

... which calls the "format_code" function, which has GeSHi formatting within. Any code between [code] and [/code] comes out highlighted/colour coded. However, when I then run the string through HTML Purifier, the "<" and ">" are removed, which ruins my source code. It is the same if I run HTML Purifier before I run GeSHi.

Is it possible to force HTML Purifier to ignore "<", ">" and any instance where it is not being used for HTML markup? In programming the less than, greater than and bit-wise shift operators are important and I can't have them removed from the source code.

Hope someone can help



Re: HTMLPurifier and <code> tags
February 19, 2012 01:52PM

I think the easiest way to proceed here is to run GeSHi on the document first, and make sure it outputs VALID HTML. Then HTML Purifier will not remove tags.
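
As a rough illustration of what "valid HTML" means for the callback: the sketch below stands in for the real format_code function, which is not shown in this thread, and uses htmlspecialchars() where GeSHi would do the highlighting. The <pre class="..."> output format and the handling of [note] and [url] are assumptions made only for the example.

```php
<?php
// Stand-in for the poster's format_code(); GeSHi is replaced here by
// htmlspecialchars() so the example has no dependencies.
function format_code(array $m): string
{
    // $m[1] = tag name, $m[2] = optional value after "=", $m[3] = body
    if ($m[1] === 'code') {
        // Keep only letters and digits so the class name stays harmless.
        $lang = preg_replace('/[^a-z0-9]/i', '', $m[2]);
        $body = htmlspecialchars($m[3], ENT_NOQUOTES | ENT_SUBSTITUTE, 'UTF-8');
        return '<pre class="' . $lang . '">' . $body . '</pre>';
    }
    return $m[0]; // [note] and [url] are left untouched in this sketch
}

$pattern = '/\[(code|note|url)=?([^\]]*)\]([\w\W]*?)\[\/\1\]/';
$input   = '[code=cpp]cout << "Hello World" << endl;[/code]';
$html    = preg_replace_callback($pattern, 'format_code', $input);
echo $html, "\n";
// Then, with an already configured purifier: $clean = $purifier->purify($html);
```

Because every "<" in the source code reaches the purifier as &lt;, there is nothing left for it to mistake for a broken tag.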

Re: HTMLPurifier and <code> tags
February 19, 2012 04:56PM


Thank you for responding. I've tried running HTML Purifier both before and after GeSHi to no avail. However, I have come to a conclusion, of sorts. Before, I was running these two lines as follows:

$contents_parent_id[$j]['contents'] = $purifier->purify($contents_parent_id[$j]['contents'])

preg_replace_callback('/\[(code|note|url)=?([^\]]*)\]([\w\W]*?)\[\/\1\]/', 'format_code', $contents_parent_id[$j]['contents'])

... which was not working. However, after some experimenting I was able to get a good working outcome, but still not perfect. By combining the above code like so:

$contents_parent_id[$j]['contents'] = $purifier->purify(preg_replace_callback('/\[(code|note|url)=?([^\]]*)\]([\w\W]*?)\[\/\1\]/', 'format_code', $contents_parent_id[$j]['contents']));

... I am able to get a "purified" outcome with fully highlighted code, but there's a little problem I don't know how to overcome. The outcomes I am getting are along the lines of (example):

 #include  int main() { printf("Hello world"); cout << "Hello World" << endl; }

Note the circumflexed A being printed within the source. I've done some Googling and apparently "the non-breaking spaces are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character"


"That'd be encoding to UTF-8 then, not ISO-8859-1. The non-breaking space character is byte 0xA0 in ISO-8859-1; when encoded to UTF-8 it'd be 0xC2,0xA0, which, if you (incorrectly) view it as ISO-8859-1 comes out as "Â ". That includes a trailing nbsp which you might not be noticing; if that byte isn't there, then something else has mauled your document and we need to see further up to find out what.

What's the regexp, how does the templating work? There would seem to be a proper HTML parser involved somewhere if your &nbsp; strings are (correctly) being turned into U+00A0 NON-BREAKING SPACE characters. If so, you could just process your template natively in the DOM, and ask it to serialise using the ASCII encoding to keep non-ASCII characters as character references. That would also stop you having to do regex post-processing on the HTML itself, which is always a highly dodgy business."
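
The byte arithmetic in that quoted explanation can be checked directly. A small sketch, assuming PHP's mbstring extension is available; the variable names are only for illustration:

```php
<?php
// U+00A0 (non-breaking space) is the single byte A0 in ISO-8859-1 ...
$latin1Nbsp = "\xA0";

// ... and the two bytes C2 A0 once encoded as UTF-8.
$utf8Nbsp = mb_convert_encoding($latin1Nbsp, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8Nbsp), "\n"; // c2a0

// Misreading those UTF-8 bytes as ISO-8859-1 yields two characters:
// C2 becomes U+00C2 (the circumflexed A) and A0 becomes a non-breaking space.
$misread = mb_convert_encoding($utf8Nbsp, 'UTF-8', 'ISO-8859-1');
echo bin2hex($misread), "\n"; // c382c2a0
```

So a page whose bytes are UTF-8 but which is displayed as ISO-8859-1 will show exactly this stray character in front of every non-breaking space.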

I realise at this point the problem may very well be with some other part of my code. However, I would first like to ask if this could be something generated by HTML Purifier?

Thank you again for your time and apologies for the long message.

Re: HTMLPurifier and <code> tags
February 19, 2012 05:05PM

Okay, the above problem seems to have suddenly disappeared and the code is now working flawlessly. Wonder what that was?

Well, anyway, thanks very much for your time Ambush Commander. Perhaps this thread will serve to help somebody in the future wanting to use GeSHi and HTML Purifier together, the method I offered above seems to do the trick quite nicely.

