Bug with automatic image alt tag

Posted by bfarber 
September 21, 2015 11:13PM

I believe I've encountered a bug in HTML Purifier with regard to the automatic setting of the image alt attribute. I wasn't sure where to report it, but I did want to raise the issue.

Scenario: say we have to clean the following text (which is valid):

<p><img src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Tea_leaves_steeping_in_a_zhong_ÄŤaj_05.jpg/150px-Tea_leaves_steeping_in_a_zhong_ÄŤaj_05.jpg" border="0" class="linked-image" /></p>

When HTML Purifier processes this, it sees there is no alt attribute. Assuming you haven't defined a specific one using Attr.DefaultImageAlt, the filename is used to create one automatically in HTMLPurifier_AttrTransform_ImgRequired. The problem is that this class takes the first 40 characters of the basename of the image URL, and the cut appears to be made by bytes rather than by characters, so it ends up splitting a multibyte UTF-8 character. Because that's hard to see when copying text, here's a screenshot:

http://content.screencast.com/users/bfarber/folders/Jing/media/a3ab4386-adc7-46fd-8166-9c4b97724aa4/2015-09-21_2310.png

Fast forwarding from here, HTML Purifier then runs the alt text through htmlspecialchars(). Because the multibyte sequence was broken earlier, htmlspecialchars() throws a warning:

htmlspecialchars() [function.htmlspecialchars]: Invalid multibyte sequence in argument
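To reproduce this outside HTML Purifier, here's a minimal sketch. It assumes the truncation is a plain byte-based substr() (I'm not quoting the library source here) and that the mbstring extension is loaded for the validity check. The variable names are mine, and the "ÄŤ" in the filename is written as escaped UTF-8 bytes so the result doesn't depend on how the script file is saved.

```php
<?php
// Basename of the image URL from the example above.
// "\xC3\x84" is "Ä" and "\xC5\xA4" is "Ť" in UTF-8.
$basename = "150px-Tea_leaves_steeping_in_a_zhong_\xC3\x84\xC5\xA4aj_05.jpg";

// Assumed behaviour: a byte-based cut at 40.
// Bytes 1-37 are ASCII, "Ä" takes bytes 38-39, and "Ť" would need
// bytes 40-41, so only its first byte survives the cut.
$alt = substr($basename, 0, 40);

var_dump(strlen($alt));                      // int(40)
var_dump(mb_check_encoding($alt, 'UTF-8'));  // bool(false)

// On PHP 5.4 and later, htmlspecialchars() returns an empty string for
// invalid input unless ENT_SUBSTITUTE or ENT_IGNORE is passed.
$escaped = htmlspecialchars($alt, ENT_QUOTES, 'UTF-8');
var_dump($escaped);                          // string(0) ""
```

Depending on the PHP version you either get the warning quoted above or, on newer versions, an empty alt attribute.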

I've worked around this for my own purposes for now by simply grabbing 75 chars of the basename() instead of 40, but that only moves the goalposts: a long enough filename will still get cut mid-character.
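If the truncation is kept at all, cutting on a character boundary would avoid the broken sequence. A rough sketch, assuming the mbstring extension is available; truncate_alt() is a hypothetical helper of mine, not part of HTML Purifier:

```php
<?php
/**
 * Hypothetical helper: cut $text to at most $maxBytes bytes of UTF-8
 * without splitting a multibyte character.
 */
function truncate_alt($text, $maxBytes = 40)
{
    // mb_strcut() counts bytes but backs up to the nearest character
    // boundary instead of cutting through a multibyte sequence.
    return mb_strcut($text, 0, $maxBytes, 'UTF-8');
}

// Same basename as in the example URL; "\xC3\x84\xC5\xA4" is "ÄŤ" in UTF-8.
$basename = "150px-Tea_leaves_steeping_in_a_zhong_\xC3\x84\xC5\xA4aj_05.jpg";
$alt = truncate_alt($basename);

var_dump(mb_check_encoding($alt, 'UTF-8'));  // bool(true)
var_dump(strlen($alt));                      // int(39), "Ť" is dropped whole
```

If the limit is meant in characters rather than bytes, mb_substr($text, 0, 40, 'UTF-8') would be the equivalent call.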

Re: Bug with automatic image alt tag
September 29, 2015 06:32PM

You're absolutely right. I think I'll just remove the truncating code.
