
&lang purified to %E2%8C%A9

Posted by mfb 
June 23, 2008 01:25AM

HTML Purifier 3.1.1 is converting this

<a href="http://www.yesh-din.org/site/index.php?page=report&lang=en">test link</a>

to

<a href="http://www.yesh-din.org/site/index.php?page=report%E2%8C%A9=en">test link</a>

Seems like it might be a bug in HTML Purifier?

By the way, yes I know it's incorrect HTML.. that's why I'm using HTML Purifier ;)

Re: &lang purified to %E2%8C%A9
June 23, 2008 10:37AM

&lang; is a valid entity reference (it's a left angle bracket, U+2329), so HTML Purifier interprets it as such, even when the trailing semicolon is missing.

Firefox, in this case, parses the "special five" entities as entities even when they are missing the ending semicolon, but it does not parse the other named entities that way. It might be a good idea for HTML Purifier to do the same, although that may be difficult given the way HTML Purifier currently processes entities (it basically does a simple search and replace).
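To make that rule concrete, here is a rough sketch of a decoder that accepts a missing semicolon only for the special entities. It is illustrative only: the function name decodeLeniently, the toy $extra table, and the reading of "special five" as amp, lt, gt, quot and apos are all assumptions, and none of this is Firefox's or HTML Purifier's actual code.

```php
<?php
// Hypothetical helper for illustration; not part of HTML Purifier.
function decodeLeniently(string $text, array $extra): string
{
    // Assumed meaning of the "special five".
    $special = array('amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"', 'apos' => "'");

    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*)(;?)/',
        function (array $m) use ($special, $extra) {
            if (isset($special[$m[1]])) {
                return $special[$m[1]];   // semicolon is optional here
            }
            if ($m[2] === ';' && isset($extra[$m[1]])) {
                return $extra[$m[1]];     // other entities need the semicolon
            }
            return $m[0];                 // anything else stays as written
        },
        $text
    );
}

$extra = array('lang' => "\u{2329}");     // toy table with a single entry
echo decodeLeniently('index.php?page=report&lang=en', $extra), "\n"; // left unchanged
```

With a rule like this, "&lang=en" would survive because "lang" is not one of the special names and has no semicolon.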

mfb
Re: &lang purified to %E2%8C%A9
June 23, 2008 01:06PM

Given how common "&lang=" and other "&xxx=" patterns are in URLs, it seems like a good idea to just convert & to &amp; in these cases.
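A minimal sketch of that suggestion, assuming it runs on the raw markup before purification. The helper name escapeQueryAmpersands is made up for this example and is not an HTML Purifier API.

```php
<?php
// Hypothetical helper: escape "&" when it starts something that looks
// like a query-string parameter ("&name=").
function escapeQueryAmpersands(string $html): string
{
    // The lookahead requires a name directly followed by "=", so properly
    // terminated entities such as "&amp;" or "&lang;" are left untouched.
    return preg_replace('/&(?=[A-Za-z_][A-Za-z0-9_]*=)/', '&amp;', $html);
}

$in = '<a href="http://www.yesh-din.org/site/index.php?page=report&lang=en">test link</a>';
echo escapeQueryAmpersands($in), "\n";
// <a href="http://www.yesh-din.org/site/index.php?page=report&amp;lang=en">test link</a>
```

This only covers the "&xxx=" shape; an unterminated entity name that is not followed by "=" would still be decoded.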

Re: &lang purified to %E2%8C%A9
June 24, 2008 10:46PM

As I said, the better behavior is obvious, but how to implement it without slowing HTML Purifier to a crawl is difficult to say.

Let me elaborate on what HTML Purifier currently does, and a possible way to fix this.

Before performing any parsing, HTMLPurifier_Lexer will call its normalize() function. From this function, substituteNonSpecialEntities() is called, which uses a cool regex that matches anything that looks like an entity, compares its value against our named entity table (or converts it if it's numeric), and replaces it with the real character if it is matched. Then the parsing begins.
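The effect of that pass can be reproduced with a much simplified stand-in. The table, the regex and the function name below are assumptions for illustration, not HTML Purifier's real code; the point is only that the substitution runs over the whole string with no idea of where it is.

```php
<?php
// Toy entity table; the real one is far larger.
$entityTable = array('lang' => "\u{2329}", 'rang' => "\u{232A}", 'copy' => "\u{00A9}");

// Hypothetical stand-in for the entity substitution pass.
function substituteEntitiesNaively(string $html, array $table): string
{
    // The trailing semicolon is optional, and nothing here knows whether
    // the match sits inside an attribute value or in ordinary text.
    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);?/',
        function (array $m) use ($table) {
            return isset($table[$m[1]]) ? $table[$m[1]] : $m[0];
        },
        $html
    );
}

// "&lang" inside the href is replaced by U+2329 before any parsing happens.
echo substituteEntitiesNaively('<a href="index.php?page=report&lang=en">test link</a>', $entityTable), "\n";

// URL-encoding that character gives the sequence from the original report.
echo rawurlencode("\u{2329}"), "\n"; // %E2%8C%A9
```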

During the regex substitution, HTML Purifier has no way of knowing if it is inside or outside an attribute, and I'm not going to add heuristics for this. The only way of doing this is moving entity parsing inside the lexing process. This is what most HTML parsers do; however, since we've implemented ours in PHP, we've taken lots of shortcuts to keep things fast.

It might be possible to move entity parsing into the parser, but I don't know how much of a performance difference this will cause.
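For illustration, entity decoding that is driven by the parser could look roughly like the sketch below. The function name, the $inAttribute flag and the rule "inside an attribute, require the semicolon" are assumptions chosen for the example; this is not how HTML Purifier's lexer is written, and it says nothing about the performance cost.

```php
<?php
// Hypothetical decoder that a lexer would call per token, once it
// already knows whether the token is an attribute value or text.
function decodeEntitiesInContext(string $value, bool $inAttribute, array $table): string
{
    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*)(;?)/',
        function (array $m) use ($inAttribute, $table) {
            if (!isset($table[$m[1]])) {
                return $m[0];         // unknown name: leave as written
            }
            if ($inAttribute && $m[2] !== ';') {
                return $m[0];         // attribute value: require the semicolon
            }
            return $table[$m[1]];
        },
        $value
    );
}

$table = array('lang' => "\u{2329}"); // toy table with a single entry

// Attribute value: the unterminated "&lang" is kept, so the URL survives.
echo decodeEntitiesInContext('index.php?page=report&lang=en', true, $table), "\n";

// Text content: both forms are decoded.
echo decodeEntitiesInContext('a &lang b &lang; c', false, $table), "\n";
```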

dkrnl
Re: &lang purified to %E2%8C%A9
February 25, 2016 11:07PM

Hot fix before purify:


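// Named entities PHP knows about, with "&" and ";" stripped.
$names = array_map(function ($entity) {
    return preg_quote(trim($entity, "&;"), "~");
}, get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, "UTF-8"));

// Match an entity name or decimal reference that is not followed by ";".
// Also rejecting a following letter or digit keeps "&notin;" and "&#123;" intact.
$pattern = "~&(" . implode("|", $names) . "|#\\d+)(?![A-Za-z0-9;])~";

// Escape the ampersand so HTML Purifier no longer sees an entity.
$string = preg_replace($pattern, "&amp;\\1", $string);

$string = $purifier->purify($string, $config);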

Re: &lang purified to %E2%8C%A9
March 07, 2017 08:35PM

This is fixed in HEAD now.
