
Update comparison page, optimize size/performance

Posted by johncasper 
January 28, 2015 09:20AM

When was the last time you updated the comparison page?

http://htmlpurifier.org/comparison

I don't think that page has changed since I started using HTML Purifier.

About four years ago, you said that you would get around to adding WasHTML to the list:

http://trac.roundcube.net/ticket/1484584

On a related note, HTML Purifier is large, slow, and a memory hog. I generally end up caching its output to avoid calling it regularly. Is there *anything* you can do about this? I say related because repeatedly dropping a 1MB library that eats CPU and RAM for breakfast onto servers for a single task (HTML sanitization) is rather excessive, which seems to be the point behind the Roundcube devs' decision to use WasHTML. Like you, I put security first. However, remaining "sane" comes in at a close second. When a single third-party library makes up well over 50% of a project's code (in some cases, 90%), that's insane. There has got to be a way to make a lighter-weight version of HTML Purifier: for example, an online builder that generates a customized build with just the bits a person actually needs, a conversion of HTML Purifier to a PECL extension, or a completely revamped project.
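For anyone wondering what I mean by caching the output, here is a rough sketch of the idea in Python rather than PHP. Everything in it is an assumption for illustration: `expensive_sanitize` is a hypothetical stand-in for the slow sanitizer call (it simply escapes everything), and `SanitizeCache` is a made-up name, not part of HTML Purifier.

```python
import hashlib
import html


def expensive_sanitize(dirty: str) -> str:
    # Hypothetical stand-in for a slow sanitizer; it just escapes all markup.
    return html.escape(dirty)


class SanitizeCache:
    """Memoize sanitizer output, keyed by a hash of the dirty input."""

    def __init__(self, sanitize):
        self._sanitize = sanitize
        self._store = {}
        self.misses = 0  # how many times the slow path actually ran

    def clean(self, dirty: str) -> str:
        key = hashlib.sha256(dirty.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._sanitize(dirty)
        return self._store[key]


cache = SanitizeCache(expensive_sanitize)
first = cache.clean("<b>hello</b>")
second = cache.clean("<b>hello</b>")  # served from the cache
```

The second call never reaches the sanitizer. A real cache would also need to be cleared whenever the sanitizer's configuration or version changes, since the cached output would otherwise be stale.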

Re: Update comparison page, optimize size/performance
January 28, 2015 04:44PM

According to the Git logs, 2010. 8)

As for performance problems and code size, I think there are two problems. First, in 2015, I don't think many people care much about making sure the HTML they output is "valid HTML", so I suspect that many people could skip the FixNesting pass and not lose much sleep over it. Skipping it is a minor security problem, however: when tags are nested in a strange way, it increases the possibility that a browser will interpret them in a strange way, and it's always possible that someone could add a tag to the specification which is insecure depending on the context it appears in (I don't think this has happened). I made a very conservative choice when starting the project (partly because "standards-compliant" is a nice tagline).

The second problem is that we really ought not to convert the DOM into a token stream. This was an artifact of not wanting to assume people had the dom extension installed, and of ensuring we could work around bugs in the underlying libxml (if there were any, which there have been). In 2015 we should probably be able to just assume it is installed; however, there remains the problem of the auto-formatters (injectors), which were written assuming a token stream. Switching the representation here would break all existing auto-formatters. That's not a problem for in-tree formatters, but if there are people who wrote their own, this would be a strictly backwards-incompatible change.
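To make the DOM versus token stream distinction concrete, here is a minimal sketch in Python (not HTML Purifier's actual code). It assumes a well-formed fragment parsed with the standard library's `xml.dom.minidom`; `to_tokens` and `uppercase_text_injector` are hypothetical names invented for the example.

```python
from xml.dom import minidom


def to_tokens(node):
    """Flatten a DOM subtree into a list of (kind, value) tokens."""
    tokens = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            tokens.append(("text", child.data))
        elif child.nodeType == child.ELEMENT_NODE:
            tokens.append(("start", child.tagName))
            tokens.extend(to_tokens(child))
            tokens.append(("end", child.tagName))
    return tokens


def uppercase_text_injector(tokens):
    # A toy "injector": it only understands the flat token list,
    # so it would have to be rewritten to operate on a tree.
    return [(kind, value.upper()) if kind == "text" else (kind, value)
            for kind, value in tokens]


doc = minidom.parseString("<div><p>Hi <b>there</b></p></div>")
tokens = to_tokens(doc)
shouted = uppercase_text_injector(tokens)
```

The flattening step is an extra copy of the whole document, and anything written against the flat list (like the toy injector here) depends on that representation, which is why changing it is a compatibility break for third-party code.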

Really, what would be best for HTML Purifier would be getting ported to a C-compatible language (not C of course, because that's really unsafe :-), which would help memory and CPU usage and make it usable from other languages. But I don't have the time to do that.
