
HTML Purifier performance is a big problem.

Posted by tsecheng 
September 07, 2013 09:28AM
I use HTML Purifier to develop my app, but when I process a 1.33MB HTML file, my server's CPU usage hits 100% and it crashes.
For performance, I run strip_tags first, which leaves 512KB of content, but HTML Purifier still uses 100% CPU for 20 seconds.
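
A minimal sketch of the strip_tags pre-pass described above, assuming a plain PHP script with no HTML Purifier involved. The helper name prefilter_html, the tag list, and the sample string are hypothetical and only illustrate the idea. Note that strip_tags only drops tags that are not listed; it leaves the attributes on the kept tags untouched, so it reduces the input size but is not a sanitizer on its own.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): drop every tag that is
// not in $allowedTags before handing the string to a real sanitizer.
function prefilter_html($html, $allowedTags)
{
    // strip_tags() removes the disallowed tags but keeps their inner text.
    return strip_tags($html, $allowedTags);
}

// Illustrative allow-list in the format strip_tags() expects.
$allowed = '<a><p><img><ul><ol><li><table><tr><td><br><div><span>';

// Made-up sample resembling markup pasted from a word processor.
$sample = '<p>Hello <font face="Arial">Word</font> <b>text</b></p>';

$reduced = prefilter_html($sample, $allowed);

// Report how much the pre-pass shrank the input.
printf("%d bytes -> %d bytes\n", strlen($sample), strlen($reduced));
echo $reduced, "\n";
```

The reduced string would then be passed to the sanitizer in place of the original content.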

I think that in real usage, where users paste from MS Word, a string larger than 512KB is normal.

In the end I gave up and chose htmLawed. By the way, my HTML Purifier version is 4.5.0.

require_once 'lib/htmlpurifier/library/HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
// Whitelist of elements, attributes, and CSS properties to keep.
$allow_tags = 'iframe,a,p,img,ul,ol,li,table[style],thead,tbody,tr,th,td,br,div,blockquote,h1,h2,h3,h4,h5,h6,span';
$config->set('HTML.Allowed', $allow_tags);
$config->set('HTML.AllowedAttributes', '*.id,*.style,*.class,*.width,*.height,*.align,*.valign,*.bgcolor,*.colspan,*.rowspan, a.target,a.title,a.href, img.src,img.alt,img.title, table.border,table.cellspacing,table.cellpadding,table.summary, span.class, iframe.src,iframe.frameborder');
$config->set('CSS.AllowedProperties', array('text-decoration' => true,'font-family' => true,'font-size' => true,'text-align' => true,'padding-left' => true,'padding-right' => true,'padding-top' => true,'padding-bottom' => true,'color' => true,'background-color' => true, 'width'=>true, 'height'=>true,'float'=>true));
$config->set('AutoFormat.RemoveEmpty', true);
// Allow iframes whose src matches the regexp below.
$config->set('HTML.SafeIframe', true);
$config->set('URI.SafeIframeRegexp', '%^(http:)?(https:)?(//)?([\w\d-_]+).([\w\d-_]+)%');
$config->set('Filter.YouTube', true);
$config->set('HTML.TidyLevel', 'light');

// Extend the raw HTML definition so that <a target="..."> survives purification.
$def = $config->getHTMLDefinition(true);
// '_parent' replaces the original '_target', which is not a valid target keyword.
$def->addAttribute('a', 'target', new HTMLPurifier_AttrDef_Enum(array('_blank','_self','_parent','_top')));
$purifier = new HTMLPurifier($config);
// $content holds the HTML to be cleaned (loaded elsewhere in the script).
$clean_html = $purifier->purify($content);

-----------------------------------
I tested this with a script. The PHP version is 5.5.3.
[root@x4 tmp]# date;php purifier.php ;date
Sat Sep  7 21:06:25 CST 2013
^C
[root@x4 tmp]# date
Sat Sep  7 21:13:59 CST 2013
After nearly 8 minutes it was still running at 100% CPU usage, so I interrupted it.
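
For finer-grained numbers than wrapping the run in date, the measurement could be sketched inside PHP itself. This is only an illustration: time_call is a hypothetical helper, and the workload below is a stand-in string operation rather than the actual purify() call from the script.

```php
<?php
// Hypothetical helper (not from the original script): run a callable and
// report elapsed wall-clock time and peak memory alongside its result.
function time_call($fn)
{
    $start = microtime(true);
    $result = call_user_func($fn);
    return array(
        'result'  => $result,
        'seconds' => microtime(true) - $start,
        'peak_mb' => memory_get_peak_usage(true) / 1048576,
    );
}

// Stand-in workload. In the real test this closure would wrap the
// $purifier->purify($content) call instead.
$stats = time_call(function () {
    return strlen(str_repeat('<p>x</p>', 100000));
});

printf(
    "%.3f s, %.1f MB peak, result %d\n",
    $stats['seconds'],
    $stats['peak_mb'],
    $stats['result']
);
```

Running the same harness on inputs of increasing size would show whether the run time grows roughly linearly or much faster than the input.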

Content file:
https://docs.google.com/file/d/0B7Vlj1awSjKjRGVDdHJvMUFGcGM/edit?usp=sharing
Re: HTML Purifier performance is big problem.
September 07, 2013 01:38PM