
Would HTML Purifier be overkill for this kind of usage?

Posted by Stefano 

Hello, I'm coding a simple PHP script. I'm not very experienced and I can't say for sure that I really need HTML Purifier, so I would like to ask for your advice.

These are some specifications for the script that I'm writing:

- I'm using UTF-8. The database, server headers and meta tags are properly set up.

- The document type is XHTML 1.0 Strict.

- I'm using PHP 5.2.5

- I'm using prepared statements with bound parameters so I don't have to worry about SQL injection.

- I'm only accepting input from the user through POST requests.

What I would like to prevent is any possible security problem, like XSS, caused by malicious input.

I generally filter everything the user submits with a few regexes that allow only a very restricted subset of the ASCII character set. In these cases I feel confident enough that no problems should arise.

What I'm concerned about is accepting input like usernames and comments. I would like the user to be able to insert every character without restrictions. Fortunately, I don't need to preserve any HTML syntax or things like that. The input doesn't need to be parsed in any way. I suppose this makes things simpler.

To avoid malicious input I'm using something like this:

/* Strip all non-printable characters (0x00-0x1F and 0x7F). For comments I use a variant of this regex that keeps CR, LF and TAB */
$userInput = preg_replace('%[\x00-\x1F\x7F]%', '', $userInput);
/* Convert angled brackets to &lt; and &gt; */
$userInput = preg_replace('%<%', '&lt;', $userInput);
$userInput = preg_replace('%>%', '&gt;', $userInput);

Would this be enough to prevent security breaches? I suppose that getting rid of the angled brackets will prevent the injection of HTML code and avoiding non printable characters (apart from CR, LF and TAB when needed) is enough for my simple needs.

I would love to hear the opinion of more experienced people on this matter. The reason I'm not simply using HTML Purifier is that, powerful and configurable as it is, I'm afraid I could set some configuration parameters wrong and shoot myself in the foot.

Thanks for your attention!

Re: Would HTML Purifier be overkill for this kind of usage?
April 13, 2008 04:53PM

HTML Purifier would be overkill for this task.

Would this be enough to prevent security breaches?

The code is pretty good (probably enough to prevent XSS), but it's not complete. The biggest problems I can see are that ampersands aren't encoded properly, and you're not checking for UTF-8 well-formedness. Also, the code is a little inefficient.

You can probably use this:

function escapeHTML($string) {
    $string = HTMLPurifier_Encoder::cleanUTF8($string);
    $string = htmlspecialchars($string, ENT_COMPAT, 'UTF-8');
    return $string;
}

where HTMLPurifier_Encoder::cleanUTF8 is a function that checks for UTF-8 validity and non-SGML codepoints.
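
To illustrate the ampersand point, here is a minimal sketch comparing bracket-only replacement with htmlspecialchars. The helper names escapeBracketsOnly and escapeForHtml and the sample string are made up for this illustration; only htmlspecialchars and str_replace are real PHP functions, and the UTF-8 check is left out on purpose.

```php
<?php
// Hypothetical helper mirroring the bracket-only approach from the question.
function escapeBracketsOnly($s)
{
    return str_replace(array('<', '>'), array('&lt;', '&gt;'), $s);
}

// Hypothetical helper: let htmlspecialchars encode &, <, > and double quotes.
function escapeForHtml($s)
{
    return htmlspecialchars($s, ENT_COMPAT, 'UTF-8');
}

// The user literally typed the entity text "&lt;b&gt;".
$input = 'Tom & Jerry wrote "&lt;b&gt;"';

echo escapeBracketsOnly($input), "\n";
// Tom & Jerry wrote "&lt;b&gt;"   (a browser renders this as <b>, not as the typed text)

echo escapeForHtml($input), "\n";
// Tom &amp; Jerry wrote &quot;&amp;lt;b&amp;gt;&quot;
```

With bracket-only replacement, entity-like text typed by the user is passed through unchanged, so what is displayed no longer matches what was entered.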

Thank you very much for the thoughtful reply.

As of now I want to focus on the UTF-8 well-formedness that you pointed out.

I had a look at the source of the function you mentioned. Even though I can't really tell what's going on for now, I'm sure I'll be able to learn a lot from it when I'm more experienced, especially since it's so well commented.

Anyway, I'll be interacting exclusively with MySQL on the backend, and I read that it only supports the BMP Unicode subset, meaning that at most 3 bytes of UTF-8 encoding will be needed for each character.
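
As a rough illustration of the 3-byte limit: in well-formed UTF-8, only characters outside the BMP (U+10000 and above) use a 4-byte sequence, and those sequences start with a lead byte in the range 0xF0-0xF4. The helper name hasNonBmpChars is hypothetical, and the sketch assumes the input has already been checked for well-formedness.

```php
<?php
// Hypothetical helper: true if a (well-formed) UTF-8 string contains a
// character outside the BMP, i.e. one encoded as a 4-byte sequence.
function hasNonBmpChars($s)
{
    // No /u modifier, so the pattern works on raw bytes. In valid UTF-8,
    // bytes 0xF0-0xF4 only occur as the first byte of a 4-byte sequence.
    return preg_match('/[\xF0-\xF4]/', $s) === 1;
}

$euro = "\xE2\x82\xAC";     // U+20AC EURO SIGN, 3 bytes, inside the BMP
$clef = "\xF0\x9D\x84\x9E"; // U+1D11E MUSICAL SYMBOL G CLEF, 4 bytes

var_dump(hasNonBmpChars("price: 5 " . $euro)); // bool(false)
var_dump(hasNonBmpChars("clef: " . $clef));    // bool(true)
```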

While searching for more information I stumbled upon the following code to check for the correctness of UTF-8 input:

/*. string .*/ function utf8_bmp_filter(/*. string .*/ $s)
/*.
    DOC Filter and sanify UTF-8 BMP string

    Only valid UTF-8 bytes encoding the Unicode Basic Multilingual Plane
    subset (codes from 0x0000 up to 0xFFFF) are passed. Any other code or
    sequence is dropped. See RFC 3629 par. 4 for details.
.*/
{
    # Any continuation byte:
    $T = "[\x80-\xBF]";

    # Byte ranges in the alternatives below follow RFC 3629 par. 4.
    return preg_replace("/("

        # Unicode range 0x0000-0x007F (ASCII charset):
        ."[\\x00-\x7F]"

        # Unicode range 0x0080-0x07FF:
        ."|[\xC2-\xDF]$T"

        # Unicode range 0x0800-0xD7FF, 0xE000-0xFFFF:
        ."|\xE0[\xA0-\xBF]$T|[\xE1-\xEC]$T$T|\xED[\x80-\x9F]$T|[\xEE-\xEF]$T$T"

        # Invalid/unsupported multi-byte sequence:
        .")|(.)/",

        "\$1", $s);
}

Again, I'm afraid I can't understand all the effects that this code will have on the input. Still, I was wondering whether I should use the above code, which may be enough for my needs, or stick with HTMLPurifier_Encoder::cleanUTF8, which seems more complete.

In the end I will be using something like this:

/* Get rid of all the control characters. */
/* Use this to keep CR, LF and TAB */
/* $userInput = preg_replace('/[\\000-\\010\\013\\014\\016-\\037\\177]/', '', $userInput); */
$userInput = preg_replace('/[\\000-\\037\\177]/', '', $userInput);

/* Choose one between the utf8_bmp_filter function above and HTMLPurifier_Encoder::cleanUTF8 */
/* If HTMLPurifier_Encoder::cleanUTF8 is chosen the above preg_replace won't be necessary, */
/* if I understand the source correctly */

/* End with replacing 'delicate' characters with their HTML entities */
/* before storing them in the database. */
/* This is the only step I can figure out by myself :) */
/* I'll decide on which characters I really need to sanitize and then do some benchmarks with */
/* htmlentities, htmlspecialchars and preg_replace */

I would really like to know whether the preg_replace and utf8_bmp_filter combination mentioned above would be as good as HTMLPurifier_Encoder::cleanUTF8 for my needs, or whether the latter is still the more complete solution.

This topic seems really tricky for inexperienced PHP users like me, so your opinion on the matter would be truly invaluable. Thanks for your time!

Re: Would HTML Purifier be overkill for this kind of usage?
April 22, 2008 01:43PM

The regexp makes me squeamish, although it might work. I'd have to run it against a test-suite to be sure. I can definitely vouch for HTMLPurifier_Encoder::cleanUTF8, but not necessarily for that regexp (and the regexp is almost certainly slower). It should be easy to change cleanUTF8 to only allow characters in the BMP; in fact, that sounds like something I should do myself.
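
A small test-suite of the kind mentioned here might look like the sketch below. Everything in it is an assumption made for illustration: bmpFilterCandidate is a stand-in filter built from the RFC 3629 byte ranges (not HTML Purifier's code), runCases is a hypothetical harness, and the handful of cases is far from exhaustive.

```php
<?php
// Stand-in filter under test: keeps well-formed UTF-8 sequences for
// U+0000-U+FFFF (surrogates excluded) and drops every other byte.
function bmpFilterCandidate($s)
{
    $t = '[\x80-\xBF]'; // continuation byte
    $valid = '[\x00-\x7F]'
        . '|[\xC2-\xDF]' . $t
        . '|\xE0[\xA0-\xBF]' . $t
        . '|[\xE1-\xEC\xEE\xEF]' . $t . $t
        . '|\xED[\x80-\x9F]' . $t;
    // /s lets the fallback (.) match any single byte, including newline.
    return preg_replace('/(' . $valid . ')|(.)/s', '$1', $s);
}

// label => array(input, expected output)
$cases = array(
    'ascii'          => array('plain ASCII', 'plain ASCII'),
    'two-byte'       => array("caf\xC3\xA9", "caf\xC3\xA9"),
    'three-byte'     => array("\xE2\x82\xAC", "\xE2\x82\xAC"),
    'overlong slash' => array("\xC0\xAF", ''),
    'surrogate'      => array("\xED\xA0\x80", ''),
    'four-byte'      => array("a\xF0\x9D\x84\x9Eb", 'ab'),
    'truncated'      => array("a\xE2\x82", 'a'),
);

// Hypothetical harness: returns the labels of all failing cases.
function runCases($filter, $cases)
{
    $failures = array();
    foreach ($cases as $label => $case) {
        if (call_user_func($filter, $case[0]) !== $case[1]) {
            $failures[] = $label;
        }
    }
    return $failures;
}

$failures = runCases('bmpFilterCandidate', $cases);
echo $failures ? 'failed: ' . implode(', ', $failures) : 'all cases passed', "\n";
```

Because runCases takes the filter as a callback, the same table could be pointed at any other candidate function for comparison.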

Your pseudocode looks correct.
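
The steps from the pseudocode in the previous post could be sketched roughly as follows. This is only an illustration: sanitizeUserText is a hypothetical name, and the preg_match('//u', ...) call is swapped in as a simple PCRE-based well-formedness check that rejects bad input instead of cleaning it, so it is not equivalent to either utf8_bmp_filter or HTMLPurifier_Encoder::cleanUTF8.

```php
<?php
// Hypothetical helper combining the three steps. Returns the escaped
// string, or false when the input is not well-formed UTF-8.
function sanitizeUserText($userInput, $keepWhitespace = false)
{
    // Step 1: strip control characters, optionally keeping TAB, LF and CR.
    $pattern = $keepWhitespace
        ? '/[\000-\010\013\014\016-\037\177]/'
        : '/[\000-\037\177]/';
    $userInput = preg_replace($pattern, '', $userInput);

    // Step 2: stand-in validity check. With the /u modifier PCRE refuses
    // to scan a subject that is not valid UTF-8, so preg_match returns false.
    if (preg_match('//u', $userInput) !== 1) {
        return false;
    }

    // Step 3: encode &, <, > and double quotes.
    return htmlspecialchars($userInput, ENT_COMPAT, 'UTF-8');
}

var_dump(sanitizeUserText("a\x00b<c>"));          // string(11) "ab&lt;c&gt;"
var_dump(sanitizeUserText("line1\nline2", true)); // newline is kept
var_dump(sanitizeUserText("bad\xC3"));            // bool(false)
```

Rejecting malformed input outright is a design choice of this sketch; a cleaning function would instead drop the offending bytes and keep the rest.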
