
Purifier strips valid content when html is badly malformed

Posted by exien 
Purifier strips valid content when html is badly malformed
November 01, 2012 06:31PM

HTML Purifier deletes valid content from a string when the HTML is very badly malformed.

To reproduce this, paste the following HTML into the HTML Purifier demo.

<DIV><PRE style="WORD-WRAP: break-word" id=PreBody><PRE><PRE style="WORD-WRAP: break-word" id=PreBody><PRE><FONT size=2 face=Arial>

Missing content!!

<B>1) Details:</B>   

By default, HTML Purifier may remove some of your spacing indentation. Turn on CollectErrors or experimental features order to fully preserve whitespace.

<B>2) Instruction Instruction Instruction:</B>   
  
By default, HTML Purifier may remove some of your spacing indentation. Turn on CollectErrors or experimental features order to fully preserve whitespace.

<B>3) Instruction Instruction Instruction:</B> 

More text. Instruction Instruction Instruction.
<B>Acceptance:</B>

More missing content.

</FONT>

The result is simply:

<div><pre></pre><pre></pre><pre></pre><pre></pre></div>

I'm glad HTML Purifier is aggressively cleaning up this mess, but perhaps too much text was removed?

Re: Purifier strips valid content when html is badly malformed
November 03, 2012 05:33PM

Hmm, can you minimize this test-case?

Re: Purifier strips valid content when html is badly malformed
November 07, 2012 01:50PM

Here you go:

<PRE><FONT>
Missing content!!
</FONT>
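
For anyone who wants to run the minimized test-case outside the demo, here is a minimal sketch. The `require_once` path and the use of the default configuration are assumptions; the demo's own settings may differ.

```php
<?php
// Assumption: HTML Purifier's autoloader lives at this path; adjust it to
// your own layout (for example, Composer's vendor/autoload.php).
require_once 'library/HTMLPurifier.auto.php';

// Default configuration (assumed; the online demo may use other settings).
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);

// The minimized test-case from this post.
$dirty = "<PRE><FONT>\nMissing content!!\n</FONT>";

// Print the purified markup so it can be compared against the input.
echo $purifier->purify($dirty), "\n";
```
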

Re: Purifier strips valid content when html is badly malformed
November 07, 2012 08:37PM

OK, I know what the bug is. The fix is going to be a little fiddly though...

Re: Purifier strips valid content when html is badly malformed
November 15, 2012 04:48PM

Hi, I trust you'll update this ticket when the fix is ready?

Re: Purifier strips valid content when html is badly malformed
November 15, 2012 08:04PM

Yeah, though no promises on the time frame.

I have the same problem. Six years and no bug fix?

Also, not only HTML tags are stripped, but strings that are not HTML tags as well. Here is an example: "my test string contain a true return invalid tags escape but using it consumes too much RAM if I have to escape a lot of fields.

I've found a solution to the previous problem I posted (non-HTML tags getting stripped):

'Core.LexerImpl' => 'DirectLex'

This will escape the tag. For example:

"my string start a <tag that is not an html valid tag"

Without 'Core.LexerImpl' => 'DirectLex' it results in:
"my string start a 

With 'Core.LexerImpl' => 'DirectLex' it results in:
"my string start a &lt;tag that is not an html valid tag"
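
The `'Core.LexerImpl' => 'DirectLex'` setting above could be applied through the configuration object roughly like this. It is a sketch only; the `require_once` path is an assumption about where the library is installed.

```php
<?php
// Assumption: adjust this path to wherever HTML Purifier is installed.
require_once 'library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// Switch from the default lexer to DirectLex, as suggested in this post.
$config->set('Core.LexerImpl', 'DirectLex');

$purifier = new HTMLPurifier($config);

// The example string from this post, containing a stray "<tag".
echo $purifier->purify('my string start a <tag that is not an html valid tag'), "\n";
```
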

Re: Purifier strips valid content when html is badly malformed
February 17, 2013 06:33PM

OK, I have discovered a workaround for this bug: set %HTML.TidyLevel to heavy. E.g. as seen here.

It is actually pretty hard to fix this properly in HTML Purifier proper, because child element validation proceeds top-down, with special-casing for element removal; transforming elements into another set of tags violates a bunch of invariants as far as this algorithm is concerned. So as an additional workaround, I'm adding a new directive, %HTML.EnableExcludes, which can be toggled off in order to disable excludes removal; this will also achieve the desired effect, although at the cost of permitting some invalid documents.

Re: Purifier strips valid content when html is badly malformed
February 17, 2013 06:47PM

Erm, it has been renamed to %Core.DisableExcludes
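
As a rough sketch, the two workarounds mentioned in this thread (%HTML.TidyLevel set to heavy, or the renamed %Core.DisableExcludes) might be configured as follows. The `require_once` path is an assumption, and %Core.DisableExcludes needs a build that already includes the directive.

```php
<?php
// Assumption: adjust this path to wherever HTML Purifier is installed.
require_once 'library/HTMLPurifier.auto.php';

// Workaround 1: raise the tidy level to "heavy".
$tidyConfig = HTMLPurifier_Config::createDefault();
$tidyConfig->set('HTML.TidyLevel', 'heavy');

// Workaround 2: turn off excludes removal. As noted above, this may let
// some invalid documents through.
$noExcludesConfig = HTMLPurifier_Config::createDefault();
$noExcludesConfig->set('Core.DisableExcludes', true);

// The minimized test-case posted earlier in this thread.
$dirty = "<PRE><FONT>\nMissing content!!\n</FONT>";

foreach (array($tidyConfig, $noExcludesConfig) as $config) {
    $purifier = new HTMLPurifier($config);
    echo $purifier->purify($dirty), "\n";
}
```
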

Re: Purifier strips valid content when html is badly malformed
February 21, 2013 03:12PM

Thanks Edward, that does it. Really appreciate you fixing this one.

Note: The changelog still refers to the new tag as EnableExcludes: http://repo.or.cz/w/htmlpurifier.git/blob/d516e2f8de435ea78cc6152abc425d3ff2c4d289:/NEWS

Appreciate your follow up comment that it's now called Core.DisableExcludes.

Re: Purifier strips valid content when html is badly malformed
February 21, 2013 05:07PM

Oooops. Sorry about that! :o)

Re: Purifier strips valid content when html is badly malformed
February 21, 2013 05:42PM

I've fixed it in HEAD, but I've decided it's not worth the trouble to roll a new release, so I've just updated the website.
