Posted by Osama 
June 02, 2008 08:38AM

Does HTML Purifier remove HTML comments (i.e. <!-- ... -->)? And if yes, is there a way to get around this? I think it is important in some cases. For example, WordPress lets users add a "read more ..." link by saving the word "more" inside a comment along with the content, and then renders it at runtime.

Thank you.

June 02, 2008 01:17PM

HTML Purifier removes comment tags. If you want to get around this, consider writing a filter that replaces the comments with, say, a br tag with a special class attribute.

It might be possible to have HTML Purifier allow comment tags that match certain text patterns, but something like that would require patching the core code.
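
For anyone wanting to try the placeholder workaround suggested above, here is a rough sketch of the idea. It is written in Python purely to show the text transformation; it is not HTML Purifier's filter API, and the names `protect_comments`, `restore_comments`, `ALLOWED`, and the `preserved-comment` class are all made up for illustration. It assumes the purifier is configured to keep `br` elements with `class` and `title` attributes, and that it emits them in the same attribute order; if it does not, the restore pattern would need adjusting.

```python
import html
import re

# Hypothetical allow-list of comment bodies worth preserving.
ALLOWED = ("more",)

# Matches a whole HTML comment, capturing its body (non-greedy, across lines).
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
# Matches the placeholder written by protect_comments().
PLACEHOLDER_RE = re.compile(r'<br class="preserved-comment" title="([^"]*)" ?/?>')


def protect_comments(markup, allowed=ALLOWED):
    """Swap allow-listed comments for a <br> placeholder before purifying."""
    def swap(match):
        body = match.group(1).strip()
        if body not in allowed:
            return match.group(0)  # left as-is; the purifier will strip it
        escaped = html.escape(body, quote=True)
        return '<br class="preserved-comment" title="%s" />' % escaped
    return COMMENT_RE.sub(swap, markup)


def restore_comments(markup, allowed=ALLOWED):
    """Turn placeholders back into comments after purifying.

    The allow-list is checked again here, because a user could submit a
    forged placeholder directly; anything unexpected is dropped.
    """
    def swap(match):
        body = html.unescape(match.group(1))
        return "<!--%s-->" % body if body in allowed else ""
    return PLACEHOLDER_RE.sub(swap, markup)
```

The intended order is `protect_comments`, then the purifier, then `restore_comments`. Checking the allow-list on both sides matters: restoring arbitrary placeholder text would reopen the hole that stripping comments closes.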

Waldo Jaquith
June 20, 2008 03:09PM

That's too bad -- that's a deal-breaker for me, at least in my immediate purpose for it. (This is such a useful program that I'm bound to find other purposes for it.) I'm quite happy to hack in some new functionality, at least in principle, but your response comes across as closer to a warning not to do that than anything else. :)

So I'll just go on record here and say that it would be really, really helpful for HTML Purifier to be able to ignore HTML comments. For any program that uses comments for core functionality (like WordPress, as mentioned), HTML Purifier would cripple that, making it a no-go for those purposes.

June 20, 2008 09:56PM

Alright, so here's the rationale:

Allowing comments verbatim is asking for vulnerability, because while comments are not supposed to affect execution in any way, in practice they do. The most prominent case is IE conditional comments, but as you noted, WordPress also uses comments for these sorts of things.

In all probability, allowing only certain comments, such as the page-break comment from WordPress, would be a workable and safe solution, and I can see what I can do to get that into the core.

Waldo Jaquith
June 25, 2008 10:20PM

I'm admittedly new to HTML Purifier, so I can't say whether this fits in with the philosophy of HTML Purifier, but couldn't an option to retain comments be provided, with the well-labeled caveat that it's a potential security hole? On three of the four sites on which I intend to deploy HTML Purifier, I'm going to use it to clean up my own content, so there simply aren't any security concerns. On the fourth, I'm with you: I'd be nuts to let people post HTML comments, because no good could possibly come of allowing them.

By way of illustration, I'm using HTML Purifier on a project for the publication I work for, Virginia Quarterly Review. I'm finishing up a project to OCR the first 50 years of our publication (1925-1975), using HTML comments to demarcate and number each page break, to make it possible to match up GIFs of each page with the actual text of that page. The HTML generated by the OCR software (Abbyy FineReader) is far from perfect. So I'm using HTML Purifier to clean it up. But I can't, because it would strip out my HTML comments. I'm trying to come up with some sort of a goofy workaround to allow the comments to survive HTML Purifier and then be restored as comments, and I imagine that'll work out. It's a kludge, which is a shame, since HTML Purifier seems to be the precise opposite of a kludge. :)

June 25, 2008 10:22PM

Waldo Jaquith, I can certainly implement comments when %HTML.Trusted is true. In fact, let me do that right now.

June 25, 2008 11:16PM

Implemented in dba3ed7. You can grab a diff from there, or git clone master and take it out for a spin.

Waldo Jaquith
July 10, 2008 11:36AM

Yee-haw. :) Thanks for that, Edward--that "trusted" flag seems like the perfect solution. (To this and to a number of other hypothetical scenarios that I can come up with.) I appreciate it!

April 24, 2017 11:27AM

Thank you!

