Posted by matta 
November 21, 2008 02:58PM

I'm trying to implement TinyMCE for public use, and consequently need to make sure the HTML isn't malicious. I'm trying to use HTML Purifier to do this. The problem I've reached is that if there are multiple spaces in the user-entered text, they are stored as &nbsp; in the code, as they should be. However, after I purify the code, all of the &nbsp; entities get replaced with Â characters, and these display when rendered. Is this an encoding problem? I've been googling and browsing around for a day now and haven't seen anyone else with the same problem... it's curious.

Thanks. --Matta

I have 2 files: edit.php (with TinyMCE posting to save.php) and save.php (which loads the purifier, cleans the input, and then inserts it into a db).

In save.php

$dirty = stripslashes($_POST['content']);

require_once 'library/HTMLPurifier.auto.php';
$purifier = new HTMLPurifier();
$clean = $purifier->purify($dirty);

$content = mysql_real_escape_string($clean);

// $content then gets inserted into the db

Re: &nbsp; being replaced by Â
November 21, 2008 03:26PM

Yepp... it was encoding, I had the database encoding as UTF-8 correctly but my editor page was configured differently.

Answer: http://htmlpurifier.org/docs/enduser-utf8.html

Added this line to my .php file...

header('Content-Type:text/html; charset=UTF-8');

Before any output, like the link instructs to.


Re: &nbsp; being replaced by Â
November 21, 2008 04:30PM

well... after testing it, the above fix made the code render correctly but the html output after being purified is actually...

<span>S P  A   C    E     S</span>

instead of what it should be...

<span>S P&nbsp; A&nbsp;&nbsp; C&nbsp;&nbsp;&nbsp; E&nbsp;&nbsp;&nbsp;&nbsp; S</span>

...consequently, when the file is loaded back into the editor, all multiple spaces are lost.

Any fix for this? I tried saving 2 outputs to the db, 1 straight from the editor and 1 after being run through the purifier but it didn't seem to work either.


It seems this thread is now my new blog.

Re: &nbsp; being replaced by Â
November 22, 2008 12:25AM

The whitespaces you see are actually non-breaking spaces, which were converted into UTF-8 form. I'm not sure why they're being converted back into normal spaces in the editor. Would you care to look more carefully at the output with urlencode()?
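A minimal sketch of that check, assuming $clean stands in for the purifier output (the bytes are written out by hand here so it runs without the library). urlencode() percent-encodes every non-alphanumeric byte, so a UTF-8 non-breaking space shows up as %C2%A0 while an ordinary space shows up as +.

```php
<?php
// Assumption: $clean is a stand-in for what $purifier->purify() returned.
// "\xC2\xA0" is U+00A0 (non-breaking space) encoded as UTF-8.
$clean = "<span>P\xC2\xA0 A</span>";

// Percent-encode the string so invisible bytes become readable.
$encoded = urlencode($clean);
echo $encoded, "\n";

// True when the non-breaking space survived as raw UTF-8 bytes.
$hasNbsp = strpos($encoded, '%C2%A0') !== false;
var_dump($hasNbsp);
```

If %C2%A0 is present in the output, the non-breaking spaces are still there and only the entity notation is gone.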

If you want a really quick fix for these problems, you can turn on %Core.EscapeNonASCIICharacters

Re: &nbsp; being replaced by Â
December 29, 2008 07:30AM

I have exactly the same problem.

Strange characters appear, like: Â', 'Ã

I read the topic above. The browser encoding is UTF-8 and the meta tag is UTF-8, and after reading this topic I added header('Content-Type:text/html; charset=UTF-8'); at the top of my index file, before session start. But the problem still occurs.

I worked around this problem by adding the following code:

$array = array('Â', 'Â'); $str = str_replace($array, "", $str);

But how do I really solve this problem then? How do I use %Core.EscapeNonASCIICharacters with something like $config-> etc.? Or isn't that the solution?



Re: &nbsp; being replaced by Â
December 29, 2008 07:35AM

Even now, when I pasted my example code above, the A characters are displayed wrongly. The first should be an A with a circumflex (the rooftop symbol) and the second an A with a tilde (the curly symbol) with a , right after it.

Re: &nbsp; being replaced by Â
December 29, 2008 11:03AM

The next conclusion is that your database is configured improperly. Have you used SET NAMES to ensure UTF-8 transfer of data? What version of MySQL are you using?

It would also be great if you could tell me the binary content of the data coming into the application, and going out of the application. What HTML Purifier configuration are you using?

Re: &nbsp; being replaced by Â
December 29, 2008 05:34PM

I should have been more clear on when it happens.

The characters appear before the data is submitted to the database, so that's not where the problem lies.

It appears when some data is entered (in the TinyMCE editor) and then sent to a preview page. Between those two steps it is filtered through HTML Purifier. When I comment out the HTML Purifier lines, the problem does not occur.

This is the configuration I use:

require_once 'library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

$config->set('Cache', 'DefinitionImpl', null);

$config->set('HTML', 'Allowed', 'p,b,i,u, strike, ul, li, ol, br, a[href]');

$purifier = new HTMLPurifier($config);

$str = $purifier->purify($str);

I don't have any knowledge of SET NAMES or binary data.



Re: &nbsp; being replaced by Â
December 29, 2008 05:50PM

On preview, is the encoding of the page also UTF-8?

Re: &nbsp; being replaced by Â
December 29, 2008 06:35PM

I did some testing. When I remove the line:

header('Content-Type:text/html; charset=UTF-8');

then the encoding is Western European (ISO)

But when I add it, yes it becomes UTF-8

I did some more testing:

When I press Enter in the editor and send it to the preview page WITHOUT running it through HTML Purifier, in the HTML it becomes <p>&nbsp;</p>

BUT when running it through HTML Purifier it becomes <p>&Acirc;&nbsp;</p>



Re: &nbsp; being replaced by Â
December 29, 2008 07:00PM

Ah. What other functions are being run on the HTML? Are you running htmlentities() by any chance?

Re: &nbsp; being replaced by Â
December 29, 2008 07:37PM


in this order:

$str = FilterXSStekst($str);

$str = htmlentities($str, ENT_QUOTES);

If I don't, it would mess up the preview page.

Re: &nbsp; being replaced by Â
December 29, 2008 07:40PM

Use htmlspecialchars() instead.
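A small sketch of why this helps, assuming the input is a UTF-8 non-breaking space inside a paragraph. The charsets are passed explicitly here so the result does not depend on the PHP version's default. htmlentities() told that the text is ISO-8859-1 reads the two UTF-8 bytes as two separate characters and turns each into an entity, while htmlspecialchars() only touches &, <, >, and quotes and leaves the other bytes alone.

```php
<?php
// "\xC2\xA0" is the UTF-8 encoding of a non-breaking space.
$html = "<p>\xC2\xA0</p>";

// Treats each byte as a Latin-1 character: 0xC2 is "Â", 0xA0 is nbsp.
$wrong = htmlentities($html, ENT_QUOTES, 'ISO-8859-1');

// Escapes only the HTML special characters; the UTF-8 bytes pass through.
$right = htmlspecialchars($html, ENT_QUOTES, 'UTF-8');

echo $wrong, "\n"; // &lt;p&gt;&Acirc;&nbsp;&lt;/p&gt;
echo urlencode($right), "\n"; // %26lt%3Bp%26gt%3B%C2%A0%26lt%3B%2Fp%26gt%3B
```

The first result is the &Acirc;&nbsp; sequence reported in the preview page above.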

Re: &nbsp; being replaced by Â
December 29, 2008 11:30PM

Ok, that works!

Now the problem shifts to the database. The strange characters appear there again.

The table is formatted as utf8_general_ci.......

Re: &nbsp; being replaced by Â
December 29, 2008 11:32PM

Read the database section in "UTF-8: The Secret of Character Encoding" (the document linked earlier in this thread).

Re: &nbsp; being replaced by Â
December 30, 2008 06:42PM

I am not sure what you are referring to in there.

Are you referring to binary fields or something else?



Re: &nbsp; being replaced by Â
December 30, 2008 10:40PM

I'm referring to the encoding of the fields in your database. Either they are improperly set up as Latin-1 (they need to be changed to UTF-8) or your connection encoding is Latin-1, in which case you need SET NAMES. Or possibly both.
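To illustrate the connection case without a database, here is a sketch that simulates it. latin1ToUtf8() is a hypothetical helper written for this example, not part of PHP or MySQL; it performs the Latin-1 to UTF-8 conversion that gets applied to your data when one side believes the bytes are Latin-1 although they are already UTF-8.

```php
<?php
// Hypothetical helper: convert a byte string from ISO-8859-1 to UTF-8.
// Bytes below 0x80 are unchanged; every other byte becomes two bytes.
function latin1ToUtf8(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        $out .= ($b < 0x80)
            ? chr($b)
            : (chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F)));
    }
    return $out;
}

// A UTF-8 non-breaking space, as HTML Purifier outputs it.
$nbsp = "\xC2\xA0";

// What ends up stored when those bytes are mistaken for Latin-1.
$stored = latin1ToUtf8($nbsp);

echo bin2hex($nbsp), "\n";   // c2a0
echo bin2hex($stored), "\n"; // c382c2a0, i.e. "Â" followed by a non-breaking space
```

The stored value now really contains an Â, which is why it shows up in the table even though the column itself is UTF-8.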

Re: &nbsp; being replaced by Â
January 02, 2009 04:29PM

htmlspecialchars() defaults to ISO-8859-1, as does htmlentities()

so in both those functions, if you are using UTF-8, you need to set the character encoding parameter as well (note that it is a quoted string)

htmlspecialchars($str, ENT_QUOTES, 'UTF-8');
htmlentities($str, ENT_QUOTES, 'UTF-8');

we had some problems with multilingual setups because of this, and it took a bit of work to narrow down, as it didn't show up the same way for everyone; it depended on the server config.

also note: as of PHP 5.2.3 both functions have an extra parameter, double_encode, as well.

When double_encode is turned off PHP will not encode existing html entities. The default is to convert everything.

htmlspecialchars( string $string [, int $quote_style [, string $charset [, bool $double_encode ]]] )
htmlentities( string $string [, int $quote_style [, string $charset [, bool $double_encode ]]] )
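A quick sketch of what double_encode changes, using a made-up input string that already contains one entity and one bare ampersand:

```php
<?php
// Made-up sample input: one existing entity and one raw ampersand.
$input = 'Fish &amp; chips & peas';

// Default (double_encode = true): the existing entity is escaped again.
$default = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// double_encode = false: existing entities are left as they are.
$noDouble = htmlspecialchars($input, ENT_QUOTES, 'UTF-8', false);

echo $default, "\n";  // Fish &amp;amp; chips &amp; peas
echo $noDouble, "\n"; // Fish &amp; chips &amp; peas
```

Running already-escaped text through the default several times is how strings like &amp;amp;amp; pile up.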

as for

How do I use %Core.EscapeNonASCIICharacters with something like $config-> etc.? Or isn't that the solution?

use it the same as other configs:

$config->set('Core', 'EscapeNonASCIICharacters', true);

Re: &nbsp; being replaced by Â
February 26, 2009 10:08PM

I've been getting the same error.

PHP is PHP Version 5.2.8

MySQL 5.1.30: Database/Tables is set to UTF-8, example of a field:

MY_TEXT text utf8_general_ci

HTML PAGE:

<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />

Purifier:

$config->set('Core', 'Encoding', 'UTF-8');

Before I display the data in the textarea I do:

$MY_TEXT = htmlspecialchars($MY_TEXT, ENT_QUOTES, 'UTF-8')

Actually everything looks good... but when looking into the database via phpMyAdmin I see &nbsp; being replaced by Â

Re: &nbsp; being replaced by Â
February 26, 2009 10:51PM

Have you used SET NAMES?

Re: &nbsp; being replaced by Â
February 27, 2009 12:37AM


$dbx = mysql_connect('localhost', username, password) or die('Could not connect to the database server'.mysql_error());
mysql_select_db(database,$dbx) or die ("select xxxx failed: ".mysql_error());
mysql_query('set names utf8');

Re: &nbsp; being replaced by Â
August 27, 2013 02:02AM


