
Content Lost using Filter.ExtractStyleBlocks with large and/or many style block

Posted by rickdt 
September 13, 2010 11:33AM

HTMLPurifier_Filter_ExtractStyleBlocks:43

$html = preg_replace_callback('#<style(?:\s.*)?>(.+)</style>#isU', array($this, 'styleCallback'), $html);

When there is a lot of style content, preg_replace_callback returns null.

Also, if there are several style blocks, this regular expression swallows all the content between two style blocks.

<style>
...
</style>
Content placed here is lost !!
<style>
...
</style>

I fixed both problems with lazy matching in the regular expression.

#<style(?:\s.*)?>(.+?)</style>#isU
Re: Content Lost using Filter.ExtractStyleBlocks with large and/or many style block
September 13, 2010 01:25PM

Good catch.

Re: Content Lost using Filter.ExtractStyleBlocks with large and/or many style block
September 13, 2010 01:31PM

Actually, we already turn on lazy matching with the U flag. What version of PHP are you using? Maybe we should stop using that flag and specify it explicitly.
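
For anyone following along, here is a small sketch of how the U flag inverts greediness for the two patterns in this thread. The sample markup and variable names are made up for illustration; only the patterns themselves come from the posts above.

```php
<?php
// Hypothetical sample input: two style blocks with content in between.
$html = "<style>a{}</style>KEEP<style>b{}</style>";

// Pattern as shipped: U inverts greediness, so (.+) is lazy here.
$lazy = '#<style(?:\s.*)?>(.+)</style>#isU';
// Pattern with the extra ?: under U, (.+?) becomes greedy.
$greedy = '#<style(?:\s.*)?>(.+?)</style>#isU';

$withLazy = preg_replace($lazy, '', $html);
$withGreedy = preg_replace($greedy, '', $html);

var_dump($withLazy);   // string(4) "KEEP": content between the blocks survives
var_dump($withGreedy); // string(0) "": content between the blocks is swallowed
```

So with U already set, adding ? to the quantifier is what would make the content between two blocks disappear, not what would prevent it.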

Re: Content Lost using Filter.ExtractStyleBlocks with large and/or many style block
September 13, 2010 02:05PM

Hi,

I use PHP 5.3.3.

Sorry, I didn't know about the # notation. I always use /pattern/options.

So, when I added the ? in the regular expression with the U flag, I changed it to greedy mode which is wrong.

I was misled by the fact that it fixed the problem of preg_replace_callback returning null.

This is probably not a problem with HTMLPurifier. I'll post back if I find any information useful for the HTMLPurifier project.

Best regards!

Re: Content Lost using Filter.ExtractStyleBlocks with large and/or many style block
September 13, 2010 02:41PM

I found the source of the problem.

The style content was large enough to exceed PHP's pcre.backtrack_limit.

P.S.: The content comes from an email composed in Microsoft Word, with a lot of crappy CSS.

A workaround for me is to set pcre.backtrack_limit to a higher value before running HTMLPurifier.

ini_set('pcre.backtrack_limit', '300000');
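
To tell this failure apart from other causes of a null return, preg_last_error() can be checked right after the call. The following is only a sketch: the helper name stripStyleBlocks and the generated input are made up for illustration, and the pattern is the one quoted at the top of this thread.

```php
<?php
// Hypothetical helper: run the thread's pattern under a given backtrack limit
// and return both the result and the PCRE error code.
function stripStyleBlocks($html, $backtrackLimit)
{
    $previous = ini_get('pcre.backtrack_limit');
    ini_set('pcre.backtrack_limit', (string) $backtrackLimit);

    $result = preg_replace_callback(
        '#<style(?:\s.*)?>(.+)</style>#isU',
        function ($matches) { return ''; },
        $html
    );
    $error = preg_last_error();

    // Restore the previous limit so the rest of the script is unaffected.
    ini_set('pcre.backtrack_limit', $previous);
    return array($result, $error);
}

// Made-up input standing in for a Word-generated style block.
$html = '<style>' . str_repeat("p { margin: 0 }\n", 1000) . '</style><p>body</p>';

list($result, $error) = stripStyleBlocks($html, 300000);
if ($result === null && $error === PREG_BACKTRACK_LIMIT_ERROR) {
    echo "Backtrack limit hit, raise pcre.backtrack_limit\n";
} else {
    echo $result, "\n";
}
```

How large the input must be before the limit is hit depends on the PHP and PCRE versions and on whether the PCRE JIT is enabled, so the threshold is best found by testing on the target server.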

I guess the parsing could be optimized in HTMLPurifier, but I don't know if it's worth it.

This might help somebody else.

Re: Content Lost using Filter.ExtractStyleBlocks with large and/or many style block
September 13, 2010 02:45PM

That's fascinating. We could unroll the regular expression into a normal parse phase, but I have a better idea: what happens if you replace the . in style with [^>]?

Re: Content Lost using Filter.ExtractStyleBlocks with large and/or many style block
September 13, 2010 03:01PM

I've got the same problem with:

#<style(?:\s.*)?>([^<]+)</style>#isU

However, if I increase the memory limit, this regex does not crash.

I may be wrong, but I think this would not work with HTML comments inside the style element, which are common, especially in content generated by MS Word.
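
A quick sketch of that concern, using a made-up sample where the CSS is wrapped in an HTML comment. The variable names are illustrative only; the two patterns are the original one and the [^<]+ variant from this post.

```php
<?php
// Made-up sample: a style block wrapped in an HTML comment, as Word tends to emit.
$html = "<style><!--\np { color: red }\n--></style><p>body</p>";

$dotPattern  = '#<style(?:\s.*)?>(.+)</style>#isU';     // original capture group
$noLtPattern = '#<style(?:\s.*)?>([^<]+)</style>#isU';  // capture group that excludes <

$dotMatches  = preg_match($dotPattern, $html, $m1);
$noLtMatches = preg_match($noLtPattern, $html, $m2);

var_dump($dotMatches);  // int(1): the block is found
var_dump($noLtMatches); // int(0): the leading <!-- stops [^<]+ from matching
```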

Re: Content Lost using Filter.ExtractStyleBlocks with large and/or many style block
September 13, 2010 03:21PM

erm, wrong one: I mean the . inside style's attributes. Could you post some sample code for me to test with too?
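
For reference, a sketch of what that change might look like, with the . in the attribute part replaced by [^>]. The sample markup and variable names are made up; this only shows that the modified pattern still extracts both blocks on a small input, not that it avoids the backtrack limit on a large one.

```php
<?php
// Pattern with the attribute part changed from . to [^>].
$pattern = '#<style(?:\s[^>]*)?>(.+)</style>#isU';

// Made-up sample with an attribute on the first block.
$html = '<style type="text/css">a{}</style>KEEP<style>b{}</style>';

$blocks = array();
$clean = preg_replace_callback(
    $pattern,
    function ($matches) use (&$blocks) {
        $blocks[] = $matches[1]; // collect the CSS of each block
        return '';               // strip the block from the document
    },
    $html
);

var_dump($clean);  // string(4) "KEEP"
var_dump($blocks); // two entries: "a{}" and "b{}"
```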

Re: Content Lost using Filter.ExtractStyleBlocks with large and/or many style block
September 13, 2010 03:36PM

Where can I upload the test case? I don't think this forum will accept a 4000-line post.

Re: Content Lost using Filter.ExtractStyleBlocks with large and/or many style block
September 13, 2010 03:41PM

Upload it somewhere, and link to it in a forum message (maybe add some text to defeat the spam filter).

Re: Content Lost using Filter.ExtractStyleBlocks with large and/or many style block
September 13, 2010 03:59PM

https: slash slash secure.kronos-web.com/apps/dropbox/?f=37458725ac690747bbc5eb78d0214416

Just click the download link (the text will be in french)

Re: Content Lost using Filter.ExtractStyleBlocks with large and/or many style block
September 13, 2010 04:06PM

Thanks! I'll look at it when I get a chance.
