
Memory exhausted when parsing large files

Posted by georgekaz 
November 23, 2016 11:41AM


I have to validate some quite large HTML files and find that HTML Purifier consumes all of the allowed memory, i.e. PHP Fatal error: Allowed memory size of xxxxx bytes exhausted.

I wondered if there's a way to mitigate this with configuration options. I get the data from a news supplier, so I can't control its length or content. I'm happy to supply a sample file.

I'm running the simple test below. Sample.html is 417k and the memory limit is set to 128M. The script only stops failing once the limit is raised to 192M, which seems like a lot considering the size of the file, and the content is often larger.

<?php
require_once 'library/HTMLPurifier.auto.php';

$dirty_html = file_get_contents("Sample.html");
$purifier = new HTMLPurifier();

$clean_html = $purifier->purify($dirty_html);
?>
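To put a number on the memory the test above needs, rather than finding it by trial and error with the limit, something like the following sketch might help. The helper name `measure_peak_memory` is hypothetical and not part of HTML Purifier; it only relies on PHP's built-in `memory_get_usage()` and `memory_get_peak_usage()`. A stand-in workload is used here so the snippet runs on its own.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): runs a callable and
// reports PHP's memory usage before the call and the peak afterwards.
// memory_get_peak_usage() is the peak since the script started, so this
// is most meaningful when run in a fresh script.
function measure_peak_memory(callable $fn): array
{
    $start = memory_get_usage();
    $result = $fn();
    $peak = memory_get_peak_usage();

    return [
        'result'      => $result,
        'start_bytes' => $start,
        'peak_bytes'  => $peak,
    ];
}

// Stand-in workload so the sketch is self-contained.
$report = measure_peak_memory(function () {
    return str_repeat('<p>x</p>', 50000);
});

printf("peak: %.1f MB\n", $report['peak_bytes'] / 1048576);
```

For the real measurement, the stand-in closure would be swapped for one that calls `$purifier->purify($dirty_html)`, with the memory limit temporarily set high enough for the call to complete.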


Re: Memory exhausted when parsing large files
November 23, 2016 12:15PM

Yes, HTML Purifier is not the most memory-friendly software, and there are probably not many non-code changes you could make to improve it.

If you can post the HTML, there might just be some quadratic factor in the code that we could identify and fix.
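In the meantime, one possible stopgap (not a fix for the underlying memory use) is to raise PHP's memory limit only around the expensive call and restore it afterwards. This is a minimal sketch: `with_memory_limit` is a hypothetical helper, the `256M` value is an arbitrary assumption, and it presumes the host allows `memory_limit` to be changed at runtime through `ini_set()`.

```php
<?php
// Hypothetical helper: temporarily raises memory_limit for one callable,
// then restores the previous value even if the callable throws.
function with_memory_limit(string $limit, callable $fn)
{
    $previous = ini_get('memory_limit');
    ini_set('memory_limit', $limit);
    try {
        return $fn();
    } finally {
        ini_set('memory_limit', $previous);
    }
}

// Stand-in callable; it just reports the limit in effect during the call.
$original = ini_get('memory_limit');
$seen = with_memory_limit('256M', function () {
    return ini_get('memory_limit');
});

echo "limit during call: $seen\n";
echo "limit afterwards: " . ini_get('memory_limit') . "\n";
```

The stand-in closure would be replaced by one that runs the purifier on the large document, so the rest of the request keeps the normal limit.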

Re: Memory exhausted when parsing large files
November 24, 2016 08:54AM


Thanks for the reply. You can download a tgz of the HTML sample here: https://we.tl/KvF1nb1nc4

It's too big to paste here, so I'm sharing it through WeTransfer. It'll be there for 7 days.

Please don't judge me on the HTML. I don't write it; it comes from a news distribution source and we have to embed it into our pages. I'm mostly using HTML Purifier when loading that page to ensure there's no XSS nastiness in what we are displaying. I think there's value in seeing why so much memory is consumed anyway.

I've tried some options too but it hasn't helped:

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Doctype', 'HTML 4.01 Transitional');
$config->set('Cache.DefinitionImpl', null);
$config->set('HTML.TidyLevel', 'none');
$purifier = new HTMLPurifier($config);

Thanks, George
