mbstring.func_overload enabled

Posted by Klaus Leithoff 
November 09, 2010 09:51AM

Intro: we use HTML Purifier in the EGroupware project. Right now we use version 4.1.1 (for our latest release and developer tree).

We encountered problems with URIs containing umlauts or special characters when using UTF-8 and mbstring.func_overload = 7: URIs were chopped off by a number of characters equal to the number of umlauts in the URI in question. The problem did not occur when mbstring.func_overload was switched off.
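The truncation described above can be reproduced without changing php.ini. This is a minimal sketch, assuming the mbstring extension is loaded: it calls mb_strlen() explicitly where mbstring.func_overload = 7 would silently substitute it for strlen(). The function name and the sample URI are made up for illustration and are not part of HTML Purifier.

```php
<?php
// Hypothetical helper, not part of HTML Purifier: copies a string byte by byte,
// but takes the loop bound from mb_strlen(), which is what strlen() effectively
// becomes when mbstring.func_overload = 7 is active.
function copyWithCharCountBound(string $s): string
{
    $out = '';
    $count = mb_strlen($s, 'UTF-8');   // number of characters, not bytes
    for ($i = 0; $i < $count; $i++) {
        $out .= $s[$i];                // $s[$i] always addresses a single byte
    }
    return $out;
}

$uri = "\u{D6}l/M\u{E4}rz";            // "Öl/März": 7 characters, 9 bytes in UTF-8

echo strlen($uri), "\n";               // 9 (byte length)
echo mb_strlen($uri, 'UTF-8'), "\n";   // 7 (character length)
echo copyWithCharCountBound($uri), "\n"; // "Öl/Mä": the last 2 bytes are lost
```

Each umlaut occupies two bytes, so the character count falls short of the byte count by one per umlaut, and the loop stops that many bytes early.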

We found that even encoding the umlauts did not solve the problem, because HTML Purifier, due to a UTF-8 specific feature, automatically resolves all entities (according to the docs). The workaround mentioned in the configuration docs, Core.EscapeNonASCIICharacters, did not resolve the issue either.

Since we need mbstring.func_overload we would like to suggest a change to

File: library/HTMLPurifier/PercentEncoder.php

Function: encode

Index: library/HTMLPurifier/PercentEncoder.php
===================================================================
--- library/HTMLPurifier/PercentEncoder.php	(Revision 32902)
+++ library/HTMLPurifier/PercentEncoder.php	(working copy)
@@ -49,13 +49,15 @@
      */
     public function encode($string) {
         $ret = '';
+        //error_log(__METHOD__.__LINE__.$string);
         for ($i = 0, $c = strlen($string); $i < $c; $i++) {
-            if ($string[$i] !== '%' && !isset($this->preserve[$int = ord($string[$i])]) ) {
+            if (substr($string,$i,1) !== '%' && !isset($this->preserve[$int = ord(substr($string,$i,1))]) ) {
                 $ret .= '%' . sprintf('%02X', $int);
             } else {
-                $ret .= $string[$i];
+                $ret .= substr($string,$i,1);
             }
         }
+        //error_log(__METHOD__.__LINE__.$ret);
         return $ret;
     }

Best regards Leithoff, Klaus
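
To see what the patched loop does once substr() returns whole characters, here is a rough re-creation with the overloads written out explicitly. It assumes that mb_strlen() and mb_substr() behave the way strlen() and substr() do under mbstring.func_overload = 7, and that ord() is not overloaded and keeps reading only the first byte of its argument. The function name and the simplified $preserve table are hypothetical, not HTML Purifier's own.

```php
<?php
// Hypothetical re-creation of the patched loop with the overloads spelled out:
// mb_strlen()/mb_substr() stand in for strlen()/substr() under func_overload = 7.
// $preserve is a simplified stand-in for the encoder's preserve table.
function encodeWithCharSemantics(string $string, array $preserve): string
{
    $ret = '';
    for ($i = 0, $c = mb_strlen($string, 'UTF-8'); $i < $c; $i++) {
        $chr = mb_substr($string, $i, 1, 'UTF-8');  // one whole character
        $int = ord($chr);                           // ord() sees only its first byte
        if ($chr !== '%' && !isset($preserve[$int])) {
            $ret .= '%' . sprintf('%02X', $int);
        } else {
            $ret .= $chr;
        }
    }
    return $ret;
}

// Preserve ASCII letters only, keyed by byte value (an assumption for this demo).
$preserve = array_fill_keys(array_merge(range(65, 90), range(97, 122)), true);

echo encodeWithCharSemantics("\u{D6}l", $preserve), "\n"; // %C3l
echo rawurlencode("\u{D6}l"), "\n";                       // %C3%96l
```

Under these assumptions the string is no longer truncated, but only the first byte of each multi-byte character is encoded, so "Ö" comes out as %C3 rather than %C3%96.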

Re: mbstring.func_overload enabled
November 09, 2010 10:07AM

It is my belief that if you attempt to use mbstring.func_overload with HTML Purifier you will just generally lose hard. That feature totally changes the semantics of every string function ever, and HTML Purifier relies on 8-bit semantics for its correctness.

Ralf Becker
Re: mbstring.func_overload enabled
November 09, 2010 11:22AM

Sorry to disagree.

That total change is what gives you a correct handling of multibyte chars in PHP ;-)

Simple example:

$str = "Öl";

with mbstring.func_overload = 7:

strlen($str) === 2
for($i = 0; $i < strlen($str); ++$i) {
  $chr = substr($str,$i,1);
}

gives "Ö" and "l" for $chr

with mbstring.func_overload = 0:

strlen($str) === 3
for($i = 0; $i < strlen($str); ++$i) {
  $chr = $str[$i];
}

gives chr(195), chr(150) and "l" for $chr

With the given string "Öl" you have no problem with your code, but it's easy to find a multibyte char where the second byte is e.g. the equivalent of a "&" ...

I agree that our fix does not make HTML Purifier fully compatible with mbstring.func_overload=7 and UTF-8 encoding; it just improves what's currently there.

Anyway, thanks for HTML Purifier :-)

Ralf

EGroupware project administrator

Re: mbstring.func_overload enabled
November 09, 2010 11:30AM

That is correct. However, if you've written your string operations with byte-by-byte semantics in mind, it's actually fairly easy due to the design of UTF-8 to write operations that work quickly and safely with international text. This is the approach HTML Purifier takes, and we have some fairly low level encoding and decoding subroutines which rely on byte semantics.
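
The design property referred to here is presumably that every byte of a multi-byte UTF-8 sequence has its high bit set, so none of them can collide with an ASCII byte such as "&" or "%". The sketch below illustrates that, together with a byte-wise percent-encoder that does not call strlen() or substr() at all. It is not HTML Purifier's actual implementation: the function name and the $preserve table are hypothetical, and unpack('C*') is used only because it returns raw byte values.

```php
<?php
// Hypothetical helper, not HTML Purifier's actual code: percent-encodes every
// byte that is neither '%' (0x25) nor listed in $preserve. unpack('C*') yields
// raw byte values, so the loop does not rely on strlen() or substr().
function percentEncodeBytes(string $string, array $preserve): string
{
    $ret = '';
    foreach (unpack('C*', $string) ?: [] as $byte) {
        if ($byte === 0x25 || isset($preserve[$byte])) {
            $ret .= chr($byte);                  // keep the byte as it is
        } else {
            $ret .= sprintf('%%%02X', $byte);    // e.g. 0xC3 becomes "%C3"
        }
    }
    return $ret;
}

// Preserve ASCII letters only, keyed by byte value (an assumption for this demo).
$preserve = array_fill_keys(array_merge(range(65, 90), range(97, 122)), true);

echo percentEncodeBytes("\u{D6}l", $preserve), "\n";   // %C3%96l

// Bytes of "Ö" (U+00D6) and "€" (U+20AC): all are 0x80 or above.
foreach (unpack('C*', "\u{D6}\u{20AC}") as $byte) {
    printf('%02X ', $byte);                            // C3 96 E2 82 AC
}
echo "\n";
```

Because lead and continuation bytes never fall in the ASCII range, scanning a UTF-8 string one byte at a time for ASCII delimiters cannot land in the middle of a character by mistake.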
