Jörg Ludwig
Bug: cuts off html after 8 kbyte with special chars
June 29, 2011 11:14AM

We use HTML Purifier to clean up HTML mails from customers before displaying them. Under certain circumstances an ISO-8859-1 HTML string is cut off in the middle. The following script reproduces the problem:

require_once "HTMLPurifier.auto.php";

$in = "&#8364;".str_repeat(".", 50000);

$cfg = HTMLPurifier_Config::createDefault();
$cfg->set("Core.Encoding", "iso-8859-1");
$purifier = new HTMLPurifier($cfg);
$out = $purifier->purify($in);

echo "in: ".strlen($in)."\n";
echo "out: ".strlen($out)."\n";
echo $out;

Output:

in: 50007
out: 8159
................... [...]

Expected Output:

in: 50007
out: 50007
[Euro symbol]............ [...]

The problem does not occur with encoding set to UTF-8. Unfortunately we cannot just convert the encoding as the encoding is also declared in the HTML header of the input string.

Re: Bug: cuts off html after 8 kbyte with special chars
June 29, 2011 10:36PM

This may be a bug in your version of iconv, in which it cuts off your text after an invalid character. I'm not really sure how to work around this. What happens if you pass the HTML through iconv with the parameters ISO-8859-1 and UTF-8//IGNORE?
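
For anyone trying this test, a minimal sketch of such an iconv call might look like the following. The sample string and the variable names are made up for illustration; only the two charset parameters come from the question above.

```php
<?php
// Hypothetical input: an ISO-8859-1 byte string (0xE9 is "é" in ISO-8859-1).
$html = "<p>caf\xE9</p>";

// Convert to UTF-8. //IGNORE asks iconv to drop characters it cannot
// represent in the target charset instead of failing.
$utf8 = iconv('ISO-8859-1', 'UTF-8//IGNORE', $html);

if ($utf8 === false) {
    echo "iconv failed\n";
} else {
    // Comparing lengths shows whether anything was cut off.
    echo strlen($html), " bytes in, ", strlen($utf8), " bytes out\n";
}
```

If the output is much shorter than the input, the truncation happens in iconv itself rather than in HTML Purifier.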

Jörg Ludwig
Re: Bug: cuts off html after 8 kbyte with special chars
July 04, 2011 10:31AM

Thank you for your quick reply! I am not sure what you mean by passing the HTML through iconv. Both $in and $out are pure US-ASCII. There is nothing to convert.

Re: Bug: cuts off html after 8 kbyte with special chars
July 04, 2011 10:48AM

What happens if you set %Core.EscapeNonASCIICharacters to true?
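
As a sketch, assuming the same setup as the reproduction script in the first post, the directive would be set like this:

```php
<?php
require_once "HTMLPurifier.auto.php";

$cfg = HTMLPurifier_Config::createDefault();
$cfg->set("Core.Encoding", "iso-8859-1");
// Emit non-ASCII characters as numeric entities, so the final
// conversion back to ISO-8859-1 only has to handle ASCII.
$cfg->set("Core.EscapeNonASCIICharacters", true);
$purifier = new HTMLPurifier($cfg);
```

Note that with this setting every non-ASCII character in the output becomes an entity, including those ISO-8859-1 could represent directly.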

Jörg Ludwig
Re: Bug: cuts off html after 8 kbyte with special chars
July 08, 2011 07:03AM

This workaround works fine! Thank you very much for your help!

Re: Bug: cuts off html after 8 kbyte with special chars
July 08, 2011 08:05AM

OK, so it is an iconv bug. We should probably detect and work around this.
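
One way such a detection could look is sketched below. This is only an illustration, not necessarily what HTML Purifier does; the function name `iconv_ignore_is_broken` is hypothetical, and the probe assumes the truncation shows up once the input exceeds roughly 8 KB, as in the report above.

```php
<?php
// Hypothetical helper, not part of HTML Purifier's API.
function iconv_ignore_is_broken()
{
    // One character that ASCII cannot represent (Greek alpha in UTF-8),
    // followed by enough padding to pass the ~8 KB point.
    $padding = str_repeat('a', 9000);
    $probe = "\xCE\xB1" . $padding;

    // @ suppresses the notice some iconv builds raise for illegal characters.
    $result = @iconv('UTF-8', 'ASCII//IGNORE', $probe);

    // A healthy iconv drops only the alpha and returns the padding intact.
    // A truncated string or a false return both count as broken.
    return $result !== $padding;
}

echo iconv_ignore_is_broken() ? "iconv //IGNORE misbehaves\n" : "iconv //IGNORE looks fine\n";
```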

Re: Bug: cuts off html after 8 kbyte with special chars
December 18, 2011 01:47PM
Re: Bug: cuts off html after 8 kbyte with special chars
December 25, 2011 09:58AM

I've added a workaround for this bug in master. Unfortunately, this will be broken in PHP 5.4 if a closely related bug in PHP isn't fixed.
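
The thread does not describe the workaround itself. One conceivable approach, sketched here under the assumption that the truncation only affects inputs longer than about 8 KB, is to convert a UTF-8 string in smaller pieces. The function `chunked_iconv` and its chunk size are hypothetical and are not taken from the HTML Purifier source.

```php
<?php
// Hypothetical helper: converts a UTF-8 string piece by piece so that
// no single iconv call sees more than $max bytes.
function chunked_iconv($to, $text, $max = 8000)
{
    $result = '';
    $len = strlen($text);
    $pos = 0;
    while ($pos < $len) {
        $end = min($pos + $max, $len);
        // Back up while the cut would land on a UTF-8 continuation byte
        // (10xxxxxx), so a multibyte character is never split.
        while ($end < $len && $end > $pos && (ord($text[$end]) & 0xC0) === 0x80) {
            $end--;
        }
        // Guard against malformed input made only of continuation bytes.
        if ($end === $pos) {
            $end = min($pos + $max, $len);
        }
        $piece = @iconv('UTF-8', $to, substr($text, $pos, $end - $pos));
        if ($piece === false) {
            return false; // iconv refused the piece outright
        }
        $result .= $piece;
        $pos = $end;
    }
    return $result;
}

$out = chunked_iconv('ISO-8859-1//IGNORE', "caf\xC3\xA9" . str_repeat('.', 50000));
echo strlen($out), "\n";
```

A sketch like this still depends on //IGNORE behaving sanely within each piece, which is exactly what a related PHP bug could break.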

Re: Bug: cuts off html after 8 kbyte with special chars
February 17, 2012 05:42AM

I just submitted two upstream bugs on this issue:

http:// sources.redhat <dot> com/bugzilla/show_bug.cgi?id=13518

http:// sources.redhat <dot> com/bugzilla/show_bug.cgi?id=13517

Just want to add to this list since I was just reading through the related bugs and ended up searching for it, myself:

http:// sources.redhat <dot> com/bugzilla/show_bug.cgi?id=13541

That's the follow-up to Bug 13518. Maybe this'll save someone the search! :)

(Spaces and <dot>s so Akismet doesn't spam-file me.)

(Edit: Fixed formatting after an HTML encoding bug ravaged the forum ^-^)

Edited 1 time(s). Last edit at 07/30/2012 01:51PM by pinkgothic.
