Source for file Encoder.php
* A UTF-8 specific character encoder that handles cleaning and transforming.
* @note All functions in this class should be static.
* Constructor throws fatal error if you attempt to instantiate class
private function __construct() {
trigger_error('Cannot instantiate encoder, call methods statically', E_USER_ERROR);
* Error-handler that mutes errors, alternative to shut-up operator.
private static function muteErrorHandler() {}
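The empty handler above works because PHP treats an error as handled whenever a user error handler returns anything other than false. A minimal sketch of that pattern outside the class follows; `callMuted` is a hypothetical helper that does not exist in Encoder.php, and wrapping the call in try/finally is an assumption about how one might want to restore the previous handler.

```php
<?php
// Hypothetical helper, not part of Encoder.php: run $fn with PHP errors
// routed to a no-op handler, then restore the previous handler.
function callMuted(callable $fn)
{
    // A handler that returns nothing (null) tells PHP the error was handled.
    set_error_handler(function () {});
    try {
        return $fn();
    } finally {
        // Always undo set_error_handler(), even if $fn throws.
        restore_error_handler();
    }
}

$result = callMuted(function () {
    trigger_error('this warning is swallowed', E_USER_WARNING);
    return 'done';
});
echo $result, "\n";
```

Unlike the shut-up operator (`@`), this approach mutes everything raised while the handler is installed, so it should wrap as little code as possible.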
* Cleans a UTF-8 string for well-formedness and SGML validity
* It will parse according to UTF-8 and return a valid UTF8 string, with
* non-SGML codepoints excluded.
* @note Just for reference, the non-SGML code points are 0 to 31 and
* 127 to 159, inclusive. However, we allow code points 9, 10
* and 13, which are the tab, line feed and carriage return
* respectively. 128 and above the code points map to multibyte
* @note Fallback code adapted from utf8ToUnicode by Henri Sivonen and
* hsivonen@iki.fi at <http://iki.fi/hsivonen/php-utf8/> under the
* LGPL license. Notes on what changed are inside, but in general,
* the original code transformed UTF-8 text into an array of integer
* Unicode codepoints. Understandably, transforming that back to
* a string would be somewhat expensive, so the function was modded to
* directly operate on the string. However, this discourages code
* reuse, and the logic enumerated here would be useful for any
* function that needs to be able to understand UTF-8 characters.
* As of right now, only smart lossless character encoding converters
* would need that, and I'm probably not going to implement them.
* Once again, PHP 6 should solve all our problems.
public static function cleanUTF8($str, $force_php = false) {
// UTF-8 validity is checked since PHP 4.3.5
// This is an optimization: if the string is already valid UTF-8, no
// need to do PHP stuff. 99% of the time, this will be the case.
// The regexp matches the XML char production, as well as excluding
// non-SGML codepoints U+007F to U+009F
if (preg_match('/^[\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/Du', $str)) {
$mState = 0; // cached expected number of octets after the current octet
// until the beginning of the next UTF8 character sequence
$mUcs4 = 0; // cached Unicode character
$mBytes = 1; // cached expected number of octets in the current sequence
// original code involved an $out that was an array of Unicode
// codepoints. Instead of having to convert back into UTF-8, we've
// decided to directly append valid UTF-8 characters onto a string
// $out once they're done. $char accumulates raw bytes, while $mUcs4
// turns into the Unicode code point, so there's some redundancy.
for($i = 0; $i < $len; $i++) {
$char .= $str[$i]; // append byte to char
// When mState is zero we expect either a US-ASCII character
// or a multi-octet sequence.
if (0 == (0x80 & ($in))) {
// US-ASCII, pass straight through.
if (($in <= 31 || $in == 127) &&
!($in == 9 || $in == 13 || $in == 10) // save \r\t\n
// control characters, remove
} elseif (0xC0 == (0xE0 & ($in))) {
// First octet of 2 octet sequence
$mUcs4 = ($mUcs4 & 0x1F) << 6;
} elseif (0xE0 == (0xF0 & ($in))) {
// First octet of 3 octet sequence
$mUcs4 = ($mUcs4 & 0x0F) << 12;
} elseif (0xF0 == (0xF8 & ($in))) {
// First octet of 4 octet sequence
$mUcs4 = ($mUcs4 & 0x07) << 18;
} elseif (0xF8 == (0xFC & ($in))) {
// First octet of 5 octet sequence.
// This is illegal because the encoded codepoint must be
// (a) not the shortest form or
// (b) outside the Unicode range of 0-0x10FFFF.
// Rather than trying to resynchronize, we will carry on
// until the end of the sequence and let the later error
// handling code catch it.
$mUcs4 = ($mUcs4 & 0x03) << 24;
} elseif (0xFC == (0xFE & ($in))) {
// First octet of 6 octet sequence, see comments for 5
$mUcs4 = ($mUcs4 & 1) << 30;
// Current octet is neither in the US-ASCII range nor a
// legal first octet of a multi-octet sequence.
// When mState is non-zero, we expect a continuation of the
if (0x80 == (0xC0 & ($in))) {
$shift = ($mState - 1) * 6;
$tmp = ($tmp & 0x0000003F) << $shift;
// End of the multi-octet sequence. mUcs4 now contains
// the final Unicode codepoint to be output
// Check for illegal sequences and codepoints.
// From Unicode 3.1, non-shortest form is illegal
if (((2 == $mBytes) && ($mUcs4 < 0x0080)) ||
((3 == $mBytes) && ($mUcs4 < 0x0800)) ||
((4 == $mBytes) && ($mUcs4 < 0x10000)) ||
// From Unicode 3.2, surrogate characters = illegal
(($mUcs4 & 0xFFFFF800) == 0xD800) ||
// Codepoints outside the Unicode range are illegal
} elseif (0xFEFF != $mUcs4 && // omit BOM
// check for valid Char unicode codepoints
(0x20 <= $mUcs4 && 0x7E >= $mUcs4) ||
// 7F-9F is not strictly prohibited by XML,
// but it is non-SGML, and thus we don't allow it
(0xA0 <= $mUcs4 && 0xD7FF >= $mUcs4) ||
(0x10000 <= $mUcs4 && 0x10FFFF >= $mUcs4)
// initialize UTF8 cache (reset)
// ((0xC0 & (*in) != 0x80) && (mState != 0))
// Incomplete multi-octet sequence.
// used to result in complete fail, but we'll reset
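The fast path at the top of cleanUTF8() can be tried in isolation. The sketch below reuses the character class from the listing; `looksLikeCleanUtf8` is a hypothetical name, and treating a `false` result from `preg_match()` as "not clean" is an assumption about how a caller would want to use it, not something the listing shows.

```php
<?php
// Hypothetical name; mirrors the character class used in cleanUTF8().
function looksLikeCleanUtf8(string $str): bool
{
    $pattern = '/^[\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}'
             . '\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/Du';
    // preg_match() returns 1 on a match, 0 on no match, and false when the
    // subject is not valid UTF-8 (a consequence of the /u modifier).
    return preg_match($pattern, $str) === 1;
}

var_dump(looksLikeCleanUtf8("Caf\xC3\xA9\n")); // bool(true): valid, tab/LF/CR allowed
var_dump(looksLikeCleanUtf8("bell\x07"));      // bool(false): C0 control character
var_dump(looksLikeCleanUtf8("\xC2\x85"));      // bool(false): U+0085 is non-SGML
var_dump(looksLikeCleanUtf8("\xC3\x28"));      // bool(false): malformed UTF-8
```

Only strings that fail this check would need the byte-by-byte pass.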
* Translates a Unicode codepoint into its corresponding UTF-8 character.
* @note Based on Feyd's function at
* <http://forums.devnetwork.net/viewtopic.php?p=191404#191404>,
* which is in public domain.
* @note While we're going to do code point parsing anyway, a good
* optimization would be to refuse to translate code points that
* are non-SGML characters. However, this could lead to duplication.
* @note This is very similar to the unichr function in
* maintenance/generate-entity-file.php (although this is superior,
* due to its sanity checks).
// +----------+----------+----------+----------+
// | 33222222 | 22221111 | 111111   |          |
// | 10987654 | 32109876 | 54321098 | 76543210 | bit
// +----------+----------+----------+----------+
// |          |          |          | 0xxxxxxx | 1 byte 0x00000000..0x0000007F
// |          |          | 110yyyyy | 10xxxxxx | 2 byte 0x00000080..0x000007FF
// |          | 1110zzzz | 10yyyyyy | 10xxxxxx | 3 byte 0x00000800..0x0000FFFF
// | 11110www | 10wwzzzz | 10yyyyyy | 10xxxxxx | 4 byte 0x00010000..0x0010FFFF
// +----------+----------+----------+----------+
// | 00000000 | 00011111 | 11111111 | 11111111 | Theoretical upper limit of legal scalars: 2097151 (0x001FFFFF)
// | 00000000 | 00010000 | 11111111 | 11111111 | Defined upper limit of legal scalar codes
// +----------+----------+----------+----------+
public static function unichr($code) {
if($code > 1114111 or $code < 0 or
($code >= 55296 and $code <= 57343) ) {
// bits are set outside the "valid" range as defined
// regular ASCII character
$y = (($code & 2047) >> 6) | 192;
$y = (($code & 4032) >> 6) | 128;
$z = (($code >> 12) & 15) | 224;
$z = (($code >> 12) & 63) | 128;
$w = (($code >> 18) & 7) | 240;
// set up the actual character
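The bit layout in the table above can be exercised with a small stand-alone encoder. This is a sketch written for illustration rather than the body of unichr(): `codepointToUtf8` is a hypothetical name, and returning an empty string for rejected code points is an assumption. The range and surrogate checks match the decimal constants in the listing (1114111 = 0x10FFFF, 55296..57343 = 0xD800..0xDFFF).

```php
<?php
// Hypothetical stand-alone function; returns '' for code points that are
// out of range or inside the surrogate block (an assumed convention).
function codepointToUtf8(int $code): string
{
    if ($code < 0 || $code > 0x10FFFF || ($code >= 0xD800 && $code <= 0xDFFF)) {
        return '';
    }
    if ($code < 0x80) {          // 1 byte: 0xxxxxxx
        return chr($code);
    }
    if ($code < 0x800) {         // 2 bytes: 110yyyyy 10xxxxxx
        return chr(0xC0 | ($code >> 6))
             . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {       // 3 bytes: 1110zzzz 10yyyyyy 10xxxxxx
        return chr(0xE0 | ($code >> 12))
             . chr(0x80 | (($code >> 6) & 0x3F))
             . chr(0x80 | ($code & 0x3F));
    }
    // 4 bytes: 11110www 10wwzzzz 10yyyyyy 10xxxxxx
    return chr(0xF0 | ($code >> 18))
         . chr(0x80 | (($code >> 12) & 0x3F))
         . chr(0x80 | (($code >> 6) & 0x3F))
         . chr(0x80 | ($code & 0x3F));
}

echo bin2hex(codepointToUtf8(0x20AC)), "\n"; // e282ac (EURO SIGN)
```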
* Converts a string to UTF-8 based on configuration.
$encoding = $config->get('Core', 'Encoding');
if ($encoding === 'utf-8') return $str;
if ($iconv === null) $iconv = function_exists('iconv');
set_error_handler(array('HTMLPurifier_Encoder', 'muteErrorHandler'));
if ($iconv && !$config->get('Test', 'ForceNoIconv')) {
$str = iconv($encoding, 'utf-8//IGNORE', $str);
// If the string is bjorked by Shift_JIS or a similar encoding
// that doesn't support all of ASCII, convert the naughty
// characters to their true byte-wise ASCII/UTF-8 equivalents.
} elseif ($encoding === 'iso-8859-1') {
* Converts a string from UTF-8 based on configuration.
* @note Currently, this is a lossy conversion, with inexpressible
* characters being omitted.
$encoding = $config->get('Core', 'Encoding');
if ($encoding === 'utf-8') return $str;
if ($iconv === null) $iconv = function_exists('iconv');
if ($escape = $config->get('Core', 'EscapeNonASCIICharacters')) {
$str = HTMLPurifier_Encoder::convertToASCIIDumbLossless($str);
set_error_handler(array('HTMLPurifier_Encoder', 'muteErrorHandler'));
if ($iconv && !$config->get('Test', 'ForceNoIconv')) {
// Undo our previous fix in convertToUTF8, otherwise iconv will barf
if (!$escape && !empty($ascii_fix)) {
foreach ($ascii_fix as $utf8 => $native) $clear_fix[$utf8] = '';
$str = strtr($str, $clear_fix);
$str = iconv('utf-8', $encoding . '//IGNORE', $str);
} elseif ($encoding === 'iso-8859-1') {
* Lossless (character-wise) conversion of HTML to ASCII
* @param $str UTF-8 string to be converted to ASCII
* @returns ASCII encoded string with non-ASCII character entity-ized
* @warning Adapted from MediaWiki, claiming fair use: this is a common
* algorithm. If you disagree with this license fudgery,
* @note Uses decimal numeric entities since they are best supported.
* @note This is a DUMB function: it has no concept of keeping
* character entities that the projected character encoding
* can allow. We could possibly implement a smart version
* but that would require it to also know which Unicode
* codepoints the charset supported (not an easy task).
* @note Sort of with cleanUTF8() but it assumes that $str is
for( $i = 0; $i < $len; $i++ ) {
$bytevalue = ord( $str[$i] );
if( $bytevalue <= 0x7F ) { //0xxx xxxx
$result .= chr( $bytevalue );
} elseif( $bytevalue <= 0xBF ) { //10xx xxxx
$working = $working << 6;
$working += ($bytevalue & 0x3F);
$result .= "&#" . $working . ";";
} elseif( $bytevalue <= 0xDF ) { //110x xxxx
$working = $bytevalue & 0x1F;
} elseif( $bytevalue <= 0xEF ) { //1110 xxxx
$working = $bytevalue & 0x0F;
$working = $bytevalue & 0x07;
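The byte-class dispatch above can be sketched as a complete function. This is an illustrative reconstruction, not the listing's code: `utf8ToAsciiEntities` and the `$remaining` counter are hypothetical names, and the counter is one assumed way of deciding when a multi-byte character is complete. Like the original, it assumes well-formed UTF-8 input and emits decimal numeric entities.

```php
<?php
// Hypothetical stand-alone function; assumes $str is well-formed UTF-8.
function utf8ToAsciiEntities(string $str): string
{
    $result = '';
    $working = 0;   // code point being assembled
    $remaining = 0; // continuation bytes still expected
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($str[$i]);
        if ($byte <= 0x7F) {            // 0xxx xxxx: plain ASCII
            $result .= chr($byte);
        } elseif ($byte <= 0xBF) {      // 10xx xxxx: continuation byte
            $working = ($working << 6) | ($byte & 0x3F);
            if (--$remaining === 0) {
                // Last byte of the character: emit a decimal entity.
                $result .= '&#' . $working . ';';
            }
        } elseif ($byte <= 0xDF) {      // 110x xxxx: 2-byte lead
            $working = $byte & 0x1F;
            $remaining = 1;
        } elseif ($byte <= 0xEF) {      // 1110 xxxx: 3-byte lead
            $working = $byte & 0x0F;
            $remaining = 2;
        } else {                        // 1111 0xxx: 4-byte lead
            $working = $byte & 0x07;
            $remaining = 3;
        }
    }
    return $result;
}

echo utf8ToAsciiEntities("caf\xC3\xA9 \xE2\x82\xAC"), "\n"; // caf&#233; &#8364;
```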
* This expensive function tests whether or not a given character
* encoding supports ASCII. 7/8-bit encodings like Shift_JIS will
* fail this test, and require special processing. Variable width
* encodings shouldn't ever fail.
* @param string $encoding Encoding name to test, as per iconv format
* @param bool $bypass Whether or not to bypass the precompiled arrays.
* @return Array of UTF-8 characters to their corresponding ASCII,
* which can be used to "undo" any overzealous iconv action.
static $encodings = array();
if (isset($encodings[$encoding])) return $encodings[$encoding];
return array("\xC2\xA5" => '\\', "\xE2\x80\xBE" => '~');
return array("\xE2\x82\xA9" => '\\');
if (strpos($lenc, 'iso-8859-') === 0) return array();
set_error_handler(array('HTMLPurifier_Encoder', 'muteErrorHandler'));
if (iconv('UTF-8', $encoding, 'a') === false) return false;
for ($i = 0x20; $i <= 0x7E; $i++) { // all printable ASCII chars
if (iconv('UTF-8', "$encoding//IGNORE", $c) === '') {
// Reverse engineer: what's the UTF-8 equiv of this byte
// sequence? This assumes that there's no variable width
// encoding that doesn't support ASCII.
$ret[iconv($encoding, 'UTF-8//IGNORE', $c)] = $c;
$encodings[$encoding] = $ret;
Documentation generated on Thu, 19 Jun 2008 18:49:08 -0400 by phpDocumentor 1.4.2