
"Compare" article: What is the premise?

donlimon
"Compare" article: What is the premise?
April 23, 2009 06:57PM

I read the "compare" article at http://htmlpurifier.org/comparison.html and some questions were left unanswered.

1) I guess if I use "htmlentities" or "htmlspecialchars", it will be safe in any case.

2) If I use a DOM sub-tree parser (like htmltidy is, afaik), then remove all tag attributes and flatten all non-whitelisted tags, then send every text node in the DOM tree through htmlspecialchars, then rebuild the HTML, it will be safe - right?

3) If I use a bbcode or wiki markup parser to identify the wiki tokens, then clean the remaining text fragments with htmlspecialchars, then build the HTML, it should be safe - correct?

4) As an extension to 2), I could allow some whitelisted attribute names and attribute values (using a regex filter for URLs, for instance). It could still be safe, right?

For the alternatives discussed in the article:

- Some of them have parameters the admin can set. As I understand it, the article looks at the default configuration and then says it's not safe, or breaks. Does this mean these tools could do the job with proper configuration?

- When talking about wiki markup or bbcode, the article argues that too many things are filtered out, and some valuable formatting possibilities become impossible. I guess that's why Wikipedia allows a subset of HTML. Again I can ask: can you make these systems safe by being more strict about the available HTML tags?

Given that htmlspecialchars actually is a solution in terms of safety, it obviously has the drawback of making any formatting impossible. On the other hand, not even htmlpurifier allows complete freedom of formatting.

Thus, the question should not be reduced to "how safe is one or another method", but rather: how much freedom does a given tool allow, while still producing safe output?

This is an aspect that I miss in the article, and I would appreciate seeing it added. It would help me make my point in discussions :)

Thanks, donlimon

---------

OFFTOPIC: I appreciate that I don't have to create an account for this forum!!!

donlimon
Re: "Compare" article: What is the premise?
April 23, 2009 07:07PM

Damn, the downside of not having an account is that I can't edit my post. Obviously, the second section is meant as a numbered list, the third as a list with dashes. Maybe I would have tweaked the post title, too (but I can live with it).

For the idea of DOM parsing + htmlspecialchars on the fragments, I noticed that this will double-encode already encoded things: & will become &amp;, and &amp; will become &amp;amp;.

Still I have the feeling that I'm not totally wrong with this idea.

Re: "Compare" article: What is the premise?
April 24, 2009 01:47PM
For the idea of DOM parsing + htmlspecialchars on the fragments, I noticed that this will double-encode already encoded things: & will become &amp;, and &amp; will become &amp;amp;.

If you're using PHP 5 (which you should be), then you can solve that by setting the $double_encode parameter of htmlspecialchars() to false.
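A minimal sketch of that fix, assuming PHP 5.2.3+ (where the $double_encode parameter was added):

```php
<?php
// With $double_encode = false, htmlspecialchars() leaves existing
// entities alone instead of re-encoding their ampersands.
$input = 'Fish & Chips &amp; more';

// Default behaviour: the "&" in "&amp;" gets encoded again.
$doubled = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
// "Fish &amp; Chips &amp;amp; more"

// Double encoding disabled: existing entities survive,
// and a bare & is still encoded.
$clean = htmlspecialchars($input, ENT_QUOTES, 'UTF-8', false);
// "Fish &amp; Chips &amp; more"
```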

Re: "Compare" article: What is the premise?
April 24, 2009 02:33PM
I guess if I use "htmlentities" or "htmlspecialchars", it will be safe in any case.

In an HTML text context, yes. There are some caveats when you are putting data inside of an attribute (namely multibyte attacks and quoting).
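The quoting part of that caveat is easy to trip over; a small made-up example:

```php
<?php
// In an attribute context, htmlspecialchars() is only safe if quotes are
// encoded too. The default ENT_COMPAT encodes " but lets ' through, so a
// single-quoted attribute can still be broken out of.
$value = "x' onmouseover='alert(1)";

// The ' survives -- dangerous inside title='...':
$bad = htmlspecialchars($value, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES encodes both quote styles; always quote the attribute as well:
$good = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
echo "<a title='" . $good . "'>link</a>";
```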

If I use a DOM sub-tree parser (like htmltidy is, afaik), then remove all tag attributes and flatten all non-whitelisted tags, then send every text node in the DOM tree through htmlspecialchars, then rebuild the HTML, it will be safe - right?

You don't actually want to do that. A more recommended route is to use the dom extension (powered by libxml), and then do what you described. This is generally correct, but be careful with the whitelists.
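To make that route concrete, here is a rough sketch of such a whitelist filter using the dom extension (my own toy code, not HTML Purifier's implementation; saveHTML($node) needs PHP 5.3.6+):

```php
<?php
// Parse with DOMDocument, strip every attribute, and flatten any element
// not on a tag whitelist. Text nodes come back escaped because
// DOMDocument::saveHTML() re-encodes them on output.
function sanitize_fragment($html, array $whitelist = array('p', 'b', 'i', 'em', 'strong'))
{
    $doc = new DOMDocument();
    // Suppress warnings about malformed markup; wrap the fragment so we
    // have a known container to serialize from.
    @$doc->loadHTML('<div>' . $html . '</div>');
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    $walk = function (DOMNode $node) use (&$walk, $whitelist) {
        // Iterate over a static copy, since we mutate the tree as we go.
        foreach (iterator_to_array($node->childNodes) as $child) {
            if ($child instanceof DOMElement) {
                $walk($child);
                // Drop every attribute unconditionally.
                while ($child->attributes->length > 0) {
                    $child->removeAttributeNode($child->attributes->item(0));
                }
                if (!in_array(strtolower($child->tagName), $whitelist, true)) {
                    // Flatten: replace the element with its children.
                    while ($child->firstChild) {
                        $node->insertBefore($child->firstChild, $child);
                    }
                    $node->removeChild($child);
                }
            }
        }
    };
    $walk($wrapper);

    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Note that flattening a `<script>` element leaves its text content behind as plain (escaped) text, which is harmless but ugly; a real filter would want to drop some elements entirely.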

If I use a bbcode or wiki markup parser to identify the wiki tokens, then clean the remaining text fragments with htmlspecialchars, then build the html, then it should be safe - correct?

It depends on the wiki/bbcode parser.

As an extension to 2), I could allow some whitelisted attribute names and attribute values (using a regex filter for urls, for instance). It could still be safe, right?

Now you are in dragon territory (and this is why you want to use HTML Purifier). A regex filter for URLs has a lot of workarounds; in theory yes, but it gets a lot harder to do in practice.
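To illustrate the dragons with a toy example (mine, not from the article): a blacklist-style regex is trivially bypassed, while a tight whitelist rejects legitimate URLs:

```php
<?php
// A blacklist check looks reasonable but is trivially bypassed.
function naive_is_safe($url)
{
    // The scheme match is case-sensitive -- that is the bug.
    return !preg_match('/^javascript:/', $url);
}

$blocked  = naive_is_safe('javascript:alert(1)'); // false: blocked, as intended
$bypassed = naive_is_safe('JavaScript:alert(1)'); // true: browsers treat schemes
                                                  // case-insensitively, so this slips by

// The whitelist direction is safer, but quickly becomes too strict:
function strict_is_safe($url)
{
    return (bool) preg_match('#^https?://[A-Za-z0-9./_-]+$#', $url);
}

$plain = strict_is_safe('http://example.com/page');     // true
$query = strict_is_safe('http://example.com/?q=a&b=c'); // false: rejects any
                                                        // query string at all
```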

Some of them have parameters the admin can set. As I understand it, the article looks at the default configuration and then says it's not safe, or breaks. Does this mean these tools could do the job with proper configuration?

Yes. I might be philosophically opposed to the techniques some of the tools use, but they work "well enough" in the real world.

When talking about wiki markup or bbcode, the article argues that too many things are filtered out, and some valuable formatting possibilities become impossible. I guess that's why Wikipedia allows a subset of HTML. Again I can ask: can you make these systems safe by being more strict about the available HTML tags?

Wikipedia is strict about available HTML tags, and it's decently safe. The parser code they have, though, is a hideous mess.

Thus, the question should not be reduced to "how safe is one or another method", but rather: how much freedom does a given tool allow, while still producing safe output?

HTML Purifier's philosophy is "As much as is safe, and not one bit more!" Embedded videos and flash occupy a very funny zone, where they are not strictly safe but sort of can be cajoled into being safe, but also have a number of compatibility measures that extend beyond just sanitization.

Damn, the downside of not having an account is that I can't edit my post. Obviously, the second section is meant as a numbered list, the third as a list with dashes. Maybe I would have tweaked the post title, too (but I can live with it).

This is not a jab against you, but I very specifically require users to preview before posting, and thus am constantly befuddled by ill-formatted posts :-( It's like I'm being annoying and not actually helping.

donlimon
Re: "Compare" article: What is the premise?
April 24, 2009 05:17PM
If I use a DOM sub-tree parser (like htmltidy is, afaik), then remove all tag attributes and flatten all non-whitelisted tags, then send every text node in the DOM tree through htmlspecialchars, then rebuild the HTML, it will be safe - right?

You don't actually want to do that. A more recommended route is to use the dom extension (powered by libxml), and then do what you described. This is generally correct, but be careful with the whitelists.

My assumption was that the DOMDocument XML parser would not accept broken input, while tidy does? That's what stuck in my memory from the last time I used it. DOMDocument certainly has the nicer DOM manipulation API (though still not as nice as the JavaScript DOM).

donlimon
Re: "Compare" article: What is the premise?
April 24, 2009 05:28PM
As an extension to 2), I could allow some whitelisted attribute names and attribute values (using a regex filter for urls, for instance). It could still be safe, right?

Now you are in dragon territory (and this is why you want to use HTML Purifier). A regex filter for URLs has a lot of workarounds; in theory yes, but it gets a lot harder to do in practice.

But again, my simple stupid regex would not necessarily be unsafe; it's just that one regex would be too strict, while another would be unsafe - right?

So if I restricted the possible URLs to a very basic format, I could write a safe regex for that - I guess the difficult part is handling funky special chars in the GET query string?

Re: "Compare" article: What is the premise?
April 24, 2009 09:31PM
My assumption was that the DOMDocument XML parser would not accept broken input, while tidy does? That's what stuck in my memory from the last time I used it. DOMDocument certainly has the nicer DOM manipulation API (though still not as nice as the JavaScript DOM).

Use loadHTML, and DOMDocument will use its HTML parser, which can handle some pathological cases. Once we finish the html5lib implementation for PHP 5, you will get even better parsing.
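A quick way to see the difference (my own test case):

```php
<?php
// DOMDocument's HTML parser (libxml) recovers from broken markup
// that a strict XML parser rejects outright.
$broken = '<p>unclosed <b>tags<p>and stray </i> closers';

$doc = new DOMDocument();
// loadHTML() emits warnings on malformed input but still builds a tree;
// @ (or libxml_use_internal_errors(true)) silences the noise.
@$doc->loadHTML($broken);
echo $doc->getElementsByTagName('p')->length; // libxml recovered <p> elements

// The same input through the XML parser simply fails:
$xml = new DOMDocument();
$ok = @$xml->loadXML($broken);
var_dump($ok); // bool(false)
```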

But again, my simple stupid regex would not necessarily be unsafe; it's just that one regex would be too strict, while another would be unsafe - right?

So if I restricted the possible URLs to a very basic format, I could write a safe regex for that - I guess the difficult part is handling funky special chars in the GET query string?

You should read RFC 3986. And then come back here. :-) URLs are surprisingly complicated.
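A few parse_url() calls hint at why (all of these are syntactically valid per RFC 3986):

```php
<?php
// parse_url() picks out components that a "very basic format" regex
// would mishandle.
var_dump(parse_url('http://user:pw@example.com:8080/a/b?q=1#frag'));
// userinfo, port, path, query and fragment are all separate components

var_dump(parse_url('//example.com/path'));
// a scheme-relative reference: no scheme at all, still valid
// (PHP 5.4.7+ parses the host correctly here)

var_dump(parse_url('http://example.com/%6A%61vascript'));
// percent-encoding means "funky special chars" can hide anywhere in the
// URL, not only in the query string
```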

donlimon
Re: "Compare" article: What is the premise?
July 10, 2009 12:15PM

Hey! I had a look at the code, and as I understand it, htmlpurifier does in fact use DOMDocument and work on the DOM tree. This had not been obvious to me before.

Maybe it would help to explain that on the "comparison" page. Think of people who want to implement their own DOMDocument-based algorithm. If they are told that htmlpurifier does the same thing, only more reliably and with more features, they don't need to be afraid of losing something when switching to htmlpurifier.

Re: "Compare" article: What is the premise?
July 10, 2009 12:32PM
Maybe it would help to explain that on the "comparison" page. Think of people who want to implement their own DOMDocument-based algorithm. If they are told that htmlpurifier does the same thing, only more reliably and with more features, they don't need to be afraid of losing something when switching to htmlpurifier.

The thing about this, though, is we don't output a DOMDocument (we output a string), so you have to reparse it when you're done. This is something we hope to fix eventually.
