Help me understand the proper use of html purifier :)

Posted by Dingo2 
Help me understand the proper use of html purifier :)
December 22, 2012 07:36AM

I apologize if a topic like this has come up too many times, but I find it is better to ask stupid questions than to make stupid mistakes.

I started learning PHP about a year ago (sporadically), and a few months ago I decided to create an application that I use to learn important PHP concepts and, eventually, security.

For the framework I chose CodeIgniter, and for now I feel really comfortable with it. About two weeks ago I started learning about security in more detail (from learning how to program with security in mind to reading up on projects such as HTML Purifier) and about how I can protect my application.

While CodeIgniter has me covered on session encoding, SQL injection (via Active Record), and CSRF, I didn't really understand how to protect my app from XSS. I don't think I understand XSS properly, or when some cheaper technique would be enough to protect user form input.

So, for the sake of clarification, I would be grateful if somebody could explain the use of HTML Purifier with an example.

The example: I have a simple form which contains the following:

- An input field (name) which should only allow alpha (UTF-8) symbols (so no characters like ', etc.).
- A simple textarea which doesn't allow any HTML (it strips the tags); on output it uses preg_replace to make paragraphs out of \n etc. (on edit it just returns the text).

So, for an input field which only accepts alpha characters (filtered with preg_match("/^([a-z])+$/i", $str)), is that check enough, or should I use HTML Purifier on top of it?

Should I use purifier on a textarea which already uses strip_tags(), or should I use purifier instead (and how should I configure it to properly strip tags)?

I guess that for form elements such as dropdowns and radio buttons, using purifier makes no sense?

I read in the documentation (at least the parts that I could understand; to be honest, I am lacking in knowledge) that in order to speed up HTML Purifier, one should purify on input and not on output. So the question is: when should I purify on output?

Other than on form submission, should I purify anywhere else?

Can someone point out the classic mistakes that people make when using HTML Purifier, and some common-knowledge guidelines for it?

Re: Help me understand the proper use of html purifier :)
December 22, 2012 10:59AM

I think the most important mental model to have when attempting to avoid XSS attacks is understanding the types of the data you have. Languages like PHP don’t do a very good job of helping you keep this distinction. Here are slides for a presentation I did a while back which cover the topic: http://mit.edu/~ezyang/Public/iap/intro-to-was.html

In a nutshell: string is not a type. A string is either HTML, or plaintext, or maybe even a number. HTML Purifier is a function of type ``HTML -> HTML``; if your input isn’t HTML, then don’t use HTML Purifier on it. In contrast, ``htmlspecialchars`` has type ``PlainText -> HTML`` and is appropriate for those situations. Don’t use ``strip_tags`` on plain text forms; you want users to be able to write things like <foo@example.com> and not have it be scrubbed out; it’s perfectly safe if you escape.
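
A minimal sketch of the two types in PHP, assuming UTF-8 input; the variable names are made up for illustration. It shows htmlspecialchars keeping the address visible and safe, while strip_tags silently drops it:

```php
<?php
// Plain text typed by a user: the angle brackets are part of the text, not markup.
$plainText = 'Mail me at <foo@example.com> please';

// PlainText -> HTML: escape the special characters so the browser shows them literally.
$asHtml = htmlspecialchars($plainText, ENT_QUOTES, 'UTF-8');

// strip_tags() treats <foo@example.com> as a tag and removes it.
$stripped = strip_tags($plainText);

echo $asHtml, "\n";   // the address is still there, written with &lt; and &gt;
echo $stripped, "\n"; // the address is gone
```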

Re: Help me understand the proper use of html purifier :)
December 22, 2012 11:08AM
> The example: I have a simple form which contains the following - An input field (name) which should only allow alpha (UTF-8) symbols (so no characters like ', etc.). - A simple textarea which doesn't allow any HTML (it strips the tags); on output it uses preg_replace to make paragraphs out of \n etc. (on edit it just returns the text).
> So, for an input field which only accepts alpha characters (filtered with preg_match("/^([a-z])+$/i", $str)), is that check enough, or should I use HTML Purifier on top of it?

No, preg_match is enough, but maybe also use htmlspecialchars() on output. You may also want to look into PHP's filter_var() and filter_input().
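
A rough sketch of that advice, not a definitive implementation. The helper names are hypothetical, and the pattern uses \p{L} with the /u modifier on the assumption that "alpha (UTF-8)" means letters in any script; the pattern from the question, /^([a-z])+$/i, only accepts ASCII letters.

```php
<?php
// Hypothetical helper: accept only letters, in any script, from a UTF-8 string.
function is_alpha_name(string $name): bool
{
    // \p{L} matches any Unicode letter; /u makes PCRE treat the subject as UTF-8.
    // \z anchors at the very end, so a trailing newline is rejected too.
    return preg_match('/^\p{L}+\z/u', $name) === 1;
}

// The same rule through filter_var(), which returns the value on success
// and false on failure.
function filter_alpha_name(string $name)
{
    return filter_var($name, FILTER_VALIDATE_REGEXP, [
        'options' => ['regexp' => '/^\p{L}+\z/u'],
    ]);
}

$name = 'Émile';
if (is_alpha_name($name)) {
    // Still escape on output: validation and escaping are separate jobs.
    echo htmlspecialchars($name, ENT_QUOTES, 'UTF-8'), "\n";
}
```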

> Should I use purifier on a textarea which already uses strip_tags(), or should I use purifier instead (and how should I configure it to properly strip tags)?

No, only use HTML Purifier for HTML content, never for plain text. Using strip_tags() could give you problems depending on the context and use. Instead, use htmlspecialchars(), which prevents any markup from being interpreted by the browser and makes it display as text. If you escape properly, you're fine.
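
For the textarea case, one possible shape is to escape first and only then add paragraph markup. This is a sketch under assumptions: the function name is hypothetical, the input is UTF-8, and a blank line is taken to mean a paragraph break.

```php
<?php
// Hypothetical helper: turn plain text from a textarea into safe HTML paragraphs.
function plaintext_to_paragraphs(string $text): string
{
    // Normalize line endings, then split on blank lines.
    $text = str_replace(["\r\n", "\r"], "\n", trim($text));
    $blocks = preg_split('/\n{2,}/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $html = [];
    foreach ($blocks as $block) {
        // Escape first (PlainText -> HTML), then add markup of our own.
        $escaped = htmlspecialchars($block, ENT_QUOTES, 'UTF-8');
        // nl2br() keeps single newlines visible as line breaks.
        $html[] = '<p>' . nl2br($escaped) . '</p>';
    }
    return implode("\n", $html);
}

echo plaintext_to_paragraphs("Hello <b>world</b>\n\nSecond paragraph"), "\n";
```

The stored value stays plain text, so the edit form can show it again unchanged (escaped with htmlspecialchars inside the textarea).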

> I guess that for form elements such as dropdowns and radio buttons, using purifier makes no sense?

No, but you should still use validation/filtering and type casting, such as (int)$var where you expect integer values such as 1 or 0 for radio buttons or checkboxes, because $_POST and $_GET values can be manipulated. Also read up on the dangers of $_REQUEST, $_GET, etc., and of global variables.
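
A small sketch of that kind of validation, with hypothetical helper and field names. It uses filter_var() with a range instead of a bare (int) cast, so that a value like "1abc" is rejected rather than quietly turned into 1; that choice is an assumption, not something stated above.

```php
<?php
// Hypothetical helper for a radio button or checkbox that must be 0 or 1.
// Returns the integer, or null when the submitted value is missing or invalid.
function read_flag(array $post, string $field): ?int
{
    $value = filter_var($post[$field] ?? null, FILTER_VALIDATE_INT, [
        'options' => ['min_range' => 0, 'max_range' => 1],
    ]);
    return $value === false ? null : $value;
}

// Hypothetical helper for a dropdown: accept only values the form actually offered.
function read_choice(array $post, string $field, array $allowed): ?string
{
    $value = $post[$field] ?? null;
    return is_string($value) && in_array($value, $allowed, true) ? $value : null;
}

// In a real request these would be called with $_POST.
$fakePost = ['newsletter' => '1', 'color' => 'purple'];
var_dump(read_flag($fakePost, 'newsletter'));
var_dump(read_choice($fakePost, 'color', ['red', 'green', 'blue']));
```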

Edited 2 time(s). Last edit at 12/22/2012 11:15AM by vaughan.

Re: Help me understand the proper use of html purifier :)
December 22, 2012 01:52PM

Thank you for the response. So for plain text, htmlspecialchars is enough; what about htmlentities?

Re: Help me understand the proper use of html purifier :)
December 22, 2012 01:55PM

I see. CodeIgniter has validation rules for numeric values, so I can use those as well. Thank you for your response.

Re: Help me understand the proper use of html purifier :)
December 22, 2012 01:57PM

By the way, can you give me an example of where HTML Purifier should be used, in other words a case of the ``HTML -> HTML`` type, either from a real site or a theoretical one?

Re: Help me understand the proper use of html purifier :)
December 22, 2012 09:18PM

htmlentities also transforms non-ASCII characters into entities; if your website is being served in UTF-8 (like it ought to be in the 21st century), then that extra conversion really is irrelevant.
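
To make the difference concrete, here is a short comparison on a sample string (the string itself is just an illustration), assuming the script file and the page are both UTF-8:

```php
<?php
$text = 'café & <b>';

// Escapes only the characters that are special in HTML; "é" is left alone.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Additionally turns every character that has a named entity into that entity.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";  // café &amp; &lt;b&gt;
echo $entities, "\n"; // caf&eacute; &amp; &lt;b&gt;
```

Both results display identically in a browser on a UTF-8 page.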

Re: Help me understand the proper use of html purifier :)
December 22, 2012 09:20PM

Simple example: you have some users who want to be able to post with formatting, images, etc. OK: you use HTML Purifier to clean it up.
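
A hedged sketch of that case. It assumes the HTML Purifier library is already loaded (for example through Composer's autoloader or HTMLPurifier.auto.php); the helper name and the list of allowed tags are illustrative only.

```php
<?php
// Assumes the HTML Purifier library has already been loaded, e.g. through
// Composer's autoloader or by requiring HTMLPurifier.auto.php.

// Hypothetical helper: clean user-submitted HTML, keeping a small set of tags.
function purify_user_html(string $dirtyHtml): string
{
    if (!class_exists('HTMLPurifier')) {
        throw new RuntimeException('HTML Purifier is not loaded');
    }
    $config = HTMLPurifier_Config::createDefault();
    // The whitelist is an example only; choose the tags your site needs.
    $config->set('HTML.Allowed', 'p,b,i,em,strong,a[href],ul,ol,li,img[src|alt]');
    $purifier = new HTMLPurifier($config);

    // HTML -> HTML: the result is printed as-is, without htmlspecialchars().
    return $purifier->purify($dirtyHtml);
}
```

It would be called on the submitted post body, and only on fields that are meant to contain HTML; plain-text fields keep going through htmlspecialchars.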

Re: Help me understand the proper use of html purifier :)
December 23, 2012 02:49AM

I think I finally get it, thank you for taking the time to reply :)

Re: Help me understand the proper use of html purifier :)
April 15, 2013 07:01AM

Apologies for reviving this slightly old thread, but it seems to be an apt place to continue this discussion.

Getting your head to understand the correct concepts about coding web applications securely seems to be quite a tricky task for many novice and slightly beyond novice programmers in PHP (and I include myself in the latter category: not an absolute beginner, but I wouldn't claim to be "expert"). This is unfortunately not helped by the vast number of web tutorials (and even published books) out there that only demonstrate insecure or out-of-date coding practices (eg, mysql_ functions rather than mysqli_ or PDO).

If, as a novice PHP coder, you do start to dig a little deeper you do find out about potential problems such as SQL injection, HTML injection, XSS, bad and good ways to sanitise your data (something that any programmer should at least be almost automatically aware of the need to do, even if they don't yet know how to go about it properly).

It looks as though, as part of the learning process, many of us stumble across the (excellent - thank you so much, Edward!) HTML Purifier, and see it as part of the solution to our potential problems. The question of when to use (and when not to use) HTML Purifier seems to come up quite frequently on the forums (it should probably be an FAQ).

One topic that seems to come up particularly often is whether to use HTML Purifier to remove all HTML from a string, since a user may have entered HTML (maliciously or otherwise) in input that you require to be plain text only; it seems to be a very tempting way to do just that. The official answer seems to be "Don't", and to use strip_tags and htmlspecialchars instead (although strip_tags comes in for strong criticism on the comparison web page?).

If Edward or any of the other more knowledgeable coders here would be able to write a little FAQ entry or tutorial on this topic in an easy-to-understand format, I am sure that would help many people take a step further towards enlightenment. (I have read the presentation referred to earlier in the thread, for which I am grateful, but this particular question about stripping all HTML from a string seems to come up often enough that perhaps something addressing this particular point in detail would be very useful.)

And, thanks again for all your hard work on HTML Purifier.
