Memory exhausted when parsing large files

Posted by georgekaz 
November 23, 2016 11:41AM

Hi,

I have to validate some quite large HTML files and find that htmlpurifier consumes all the allowed memory, i.e. PHP Fatal error: Allowed memory size of xxxxx bytes exhausted.

I wondered if there's a way to mitigate this with configuration options. I get the data from a news supplier, so I can't control its length or content. I'm happy to supply a sample file.

I'm running the simple test below. Sample.html is 417k and the memory limit is set to 128M. The test only stops failing once the limit is raised to 192M, which seems like a lot considering the size of the file, and the content is often larger.

<?php
require_once 'library/HTMLPurifier.auto.php';

$dirty_html = file_get_contents('Sample.html');
$purifier = new HTMLPurifier();

$clean_html = $purifier->purify($dirty_html);
?>
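To put an actual number on the test above instead of raising the limit until it passes, something like this minimal sketch could be used. measure_peak_memory() is a hypothetical helper, not part of HTML Purifier; it only relies on PHP's built-in memory_get_peak_usage().

```php
<?php
// Hypothetical helper (not part of HTML Purifier): runs $fn and reports
// the process-wide peak memory afterwards, plus how much that peak grew
// while $fn was running.
function measure_peak_memory(callable $fn): array
{
    $before = memory_get_peak_usage(true); // bytes allocated from the system
    $result = $fn();
    $after = memory_get_peak_usage(true);

    return [
        'result' => $result,
        'peak_bytes' => $after,
        'peak_growth_bytes' => $after - $before,
    ];
}

// Assumed usage with the test above (needs HTML Purifier loaded):
// $stats = measure_peak_memory(function () use ($purifier, $dirty_html) {
//     return $purifier->purify($dirty_html);
// });
// printf("Peak: %.1f MB\n", $stats['peak_bytes'] / 1048576);
```

The peak is for the whole process, so the growth figure is the more useful one for comparing different inputs or options.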

Thanks

Re: Memory exhausted when parsing large files
November 23, 2016 12:15PM

Yes, HTML Purifier is not the most memory-friendly software, and there probably aren't many non-code changes you could make to improve it.

If you can post the HTML, there might just be some quadratic factor in the code that we could identify and fix.
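In the meantime, one stopgap on the PHP side (a workaround, not a fix for the memory use itself) would be to raise the limit only around the purify call. A minimal sketch, where with_memory_limit() is a hypothetical helper and the 256M value is just an assumption chosen to clear the 192M reported above:

```php
<?php
// Hypothetical helper: temporarily sets memory_limit while $fn runs,
// then restores the previous value even if $fn throws.
function with_memory_limit(string $limit, callable $fn)
{
    // ini_set() returns the old value on success, false on failure.
    $previous = ini_set('memory_limit', $limit);
    if ($previous === false) {
        throw new RuntimeException("Could not set memory_limit to $limit");
    }

    try {
        return $fn();
    } finally {
        ini_set('memory_limit', $previous);
    }
}

// Assumed usage (needs HTML Purifier loaded):
// $clean_html = with_memory_limit('256M', function () use ($purifier, $dirty_html) {
//     return $purifier->purify($dirty_html);
// });
```

This keeps the rest of the application on the normal limit, but the purify call will still need the same amount of memory.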

Re: Memory exhausted when parsing large files
November 24, 2016 08:54AM

Hi,

Thanks for the reply. You can download a tgz of the html sample here: https://we.tl/KvF1nb1nc4

It's too big to paste here so I'm sharing it through wetransfer. It'll be there for 7 days.

Don't judge me on the HTML please. I don't write it; it comes from a news distribution source and we have to embed it into our pages. I'm mostly using htmlpurifier when loading that page to make sure there's no XSS nastiness in what we display. I think there's value in seeing why so much memory is consumed anyway.

I've tried some options too but it hasn't helped:

$config->set('HTML.Doctype', 'HTML 4.01 Transitional');
$config->set('Cache.DefinitionImpl', null);
$config->set('HTML.TidyLevel', 'none');
$config->set('Output.FixInnerHTML', false);
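For completeness, here is roughly how those options are wired in, as a sketch. The $options array is only for illustration; the class_exists() guard is there so the snippet does nothing if the library isn't loaded.

```php
<?php
// The four options listed above, collected in one place.
$options = [
    'HTML.Doctype' => 'HTML 4.01 Transitional',
    'Cache.DefinitionImpl' => null,  // disables the definition cache
    'HTML.TidyLevel' => 'none',
    'Output.FixInnerHTML' => false,
];

// Only runs when HTML Purifier has been loaded, e.g. via
// require_once 'library/HTMLPurifier.auto.php';
if (class_exists('HTMLPurifier_Config')) {
    $config = HTMLPurifier_Config::createDefault();
    foreach ($options as $key => $value) {
        $config->set($key, $value);
    }
    // The config has to be passed to the purifier to take effect.
    $purifier = new HTMLPurifier($config);
}
```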

Thanks George
