HTMLPurifier 4.4.0
HTMLPurifier_Encoder Class Reference

A UTF-8 specific character encoder that handles cleaning and transforming.

Static Public Member Functions

static muteErrorHandler ()
 Error-handler that mutes errors, alternative to shut-up operator.
static unsafeIconv ($in, $out, $text)
 iconv wrapper which mutes errors, but doesn't work around bugs.
static iconv ($in, $out, $text, $max_chunk_size=8000)
 iconv wrapper which mutes errors and works around bugs.
static cleanUTF8 ($str, $force_php=false)
 Cleans a UTF-8 string for well-formedness and SGML validity.
static unichr ($code)
 Translates a Unicode codepoint into its corresponding UTF-8 character.
static iconvAvailable ()
static convertToUTF8 ($str, $config, $context)
 Converts a string to UTF-8 based on configuration.
static convertFromUTF8 ($str, $config, $context)
 Converts a string from UTF-8 based on configuration.
static convertToASCIIDumbLossless ($str)
 Lossless (character-wise) conversion of HTML to ASCII.
static testIconvTruncateBug ()
 glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly.
static testEncodingSupportsASCII ($encoding, $bypass=false)
 This expensive function tests whether or not a given character encoding supports ASCII.

Public Attributes

const ICONV_OK = 0
 No bugs detected in iconv.
const ICONV_TRUNCATES = 1
 Iconv truncates output if converting from UTF-8 to another character set with //IGNORE, and a non-encodable character is found.
const ICONV_UNUSABLE = 2
 Iconv does not support //IGNORE, making it unusable for transcoding purposes.

Private Member Functions

 __construct ()
 Constructor throws fatal error if you attempt to instantiate class.

Detailed Description

A UTF-8 specific character encoder that handles cleaning and transforming.

Note:
All functions in this class should be static.

Definition at line 7 of file Encoder.php.


Constructor & Destructor Documentation

HTMLPurifier_Encoder::__construct ( ) [private]

Constructor throws fatal error if you attempt to instantiate class.

Definition at line 13 of file Encoder.php.

HTMLPurifier_Encoder::__construct ( ) [private]

Constructor throws fatal error if you attempt to instantiate class.

Definition at line 3029 of file HTMLPurifier.standalone.php.


Member Function Documentation

static HTMLPurifier_Encoder::cleanUTF8 ( str,
force_php = false 
) [static]

Cleans a UTF-8 string for well-formedness and SGML validity.

It will parse according to UTF-8 and return a valid UTF8 string, with non-SGML codepoints excluded.

Note:
Just for reference, the non-SGML code points are 0 to 31 and 127 to 159, inclusive. However, we allow code points 9, 10 and 13, which are the tab, line feed and carriage return respectively. Code points 128 and above map to multibyte UTF-8 representations.
Fallback code adapted from utf8ToUnicode by Henri Sivonen (hsivonen@iki.fi), available at <http://iki.fi/hsivonen/php-utf8/> under the LGPL license. Notes on what changed are inside, but in general, the original code transformed UTF-8 text into an array of integer Unicode codepoints. Understandably, transforming that back to a string would be somewhat expensive, so the function was modified to operate directly on the string. However, this discourages code reuse, and the logic enumerated here would be useful for any function that needs to understand UTF-8 characters. As of right now, only smart lossless character encoding converters would need that, and I'm probably not going to implement them. Once again, PHP 6 should solve all our problems.

Definition at line 109 of file Encoder.php.

Referenced by HTMLPurifier_Printer::escape(), HTMLPurifier_AttrDef::expandCSSEscape(), and HTMLPurifier_Lexer::normalize().
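
The code point rule in the note above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name sketch_strip_non_sgml is hypothetical, it relies on PCRE's /u modifier rather than the fallback parser described above, and it returns an empty string for ill-formed input as a simplifying assumption (the documented function is described as returning a valid string instead).

```php
<?php
// Hypothetical sketch: drop non-SGML code points from well-formed UTF-8.
function sketch_strip_non_sgml($str)
{
    // An empty pattern with /u matches only if the subject is valid UTF-8.
    // Assumption: ill-formed input is rejected outright in this sketch.
    if (preg_match('//u', $str) !== 1) {
        return '';
    }
    // Remove code points 0-31 except tab (9), line feed (10) and
    // carriage return (13), plus 127-159.
    return preg_replace(
        '/[\x{0}-\x{8}\x{B}\x{C}\x{E}-\x{1F}\x{7F}-\x{9F}]/u',
        '',
        $str
    );
}
```

Because the pattern is compiled in UTF-8 mode, the range 127 to 159 is matched as code points, so the bytes of multibyte characters such as U+00E9 are left untouched.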

static HTMLPurifier_Encoder::cleanUTF8 ( str,
force_php = false 
) [static]

Cleans a UTF-8 string for well-formedness and SGML validity.

It will parse according to UTF-8 and return a valid UTF8 string, with non-SGML codepoints excluded.

Note:
Just for reference, the non-SGML code points are 0 to 31 and 127 to 159, inclusive. However, we allow code points 9, 10 and 13, which are the tab, line feed and carriage return respectively. Code points 128 and above map to multibyte UTF-8 representations.
Fallback code adapted from utf8ToUnicode by Henri Sivonen (hsivonen@iki.fi), available at <http://iki.fi/hsivonen/php-utf8/> under the LGPL license. Notes on what changed are inside, but in general, the original code transformed UTF-8 text into an array of integer Unicode codepoints. Understandably, transforming that back to a string would be somewhat expensive, so the function was modified to operate directly on the string. However, this discourages code reuse, and the logic enumerated here would be useful for any function that needs to understand UTF-8 characters. As of right now, only smart lossless character encoding converters would need that, and I'm probably not going to implement them. Once again, PHP 6 should solve all our problems.

Definition at line 3125 of file HTMLPurifier.standalone.php.

static HTMLPurifier_Encoder::convertFromUTF8 ( str,
config,
context 
) [static]

Converts a string from UTF-8 based on configuration.

Note:
Currently, this is a lossy conversion, with inexpressible characters being omitted.

Definition at line 366 of file Encoder.php.

References $config, convertToASCIIDumbLossless(), iconv(), iconvAvailable(), and testEncodingSupportsASCII().

Referenced by HTMLPurifier::purify().

static HTMLPurifier_Encoder::convertFromUTF8 ( str,
config,
context 
) [static]

Converts a string from UTF-8 based on configuration.

Note:
Currently, this is a lossy conversion, with inexpressible characters being omitted.

Definition at line 3382 of file HTMLPurifier.standalone.php.

References $config, convertToASCIIDumbLossless(), iconv(), iconvAvailable(), and testEncodingSupportsASCII().

static HTMLPurifier_Encoder::convertToASCIIDumbLossless ( str) [static]

Lossless (character-wise) conversion of HTML to ASCII.

Parameters:
$str UTF-8 string to be converted to ASCII
Returns:
ASCII-encoded string with non-ASCII characters entity-ized
Warning:
Adapted from MediaWiki, claiming fair use: this is a common algorithm. If you disagree with this license fudgery, implement it yourself.
Note:
Uses decimal numeric entities since they are best supported.
This is a DUMB function: it has no concept of keeping character entities that the target character encoding can allow. We could possibly implement a smart version, but that would require it to also know which Unicode codepoints the charset supports (not an easy task).
Similar to cleanUTF8(), but it assumes that $str is well-formed UTF-8.

Definition at line 413 of file Encoder.php.

Referenced by convertFromUTF8().
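
The character-wise conversion described above can be sketched roughly as follows. This is an illustrative sketch, not the library's code: the name sketch_utf8_to_ascii_entities is hypothetical, and like the documented function it assumes the input is already well-formed UTF-8. Every non-ASCII character becomes a decimal numeric entity, while ASCII bytes pass through unchanged.

```php
<?php
// Hypothetical sketch; assumes $str is already well-formed UTF-8.
function sketch_utf8_to_ascii_entities($str)
{
    $out = '';
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($str[$i]);
        if ($byte < 0x80) {
            // Plain ASCII: copy through unchanged.
            $out .= $str[$i];
            continue;
        }
        // The lead byte determines how many continuation bytes follow.
        if ($byte >= 0xF0) {
            $code = $byte & 0x07;
            $extra = 3;
        } elseif ($byte >= 0xE0) {
            $code = $byte & 0x0F;
            $extra = 2;
        } else {
            $code = $byte & 0x1F;
            $extra = 1;
        }
        // Fold in six payload bits from each continuation byte.
        for ($j = 0; $j < $extra && $i + 1 < $len; $j++) {
            $i++;
            $code = ($code << 6) | (ord($str[$i]) & 0x3F);
        }
        // Emit a decimal numeric character reference.
        $out .= '&#' . $code . ';';
    }
    return $out;
}
```

The sketch is "dumb" in the same sense as the note above: it entity-izes every non-ASCII character regardless of whether the output encoding could have represented it directly.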

static HTMLPurifier_Encoder::convertToASCIIDumbLossless ( str) [static]

Lossless (character-wise) conversion of HTML to ASCII.

Parameters:
$str UTF-8 string to be converted to ASCII
Returns:
ASCII-encoded string with non-ASCII characters entity-ized
Warning:
Adapted from MediaWiki, claiming fair use: this is a common algorithm. If you disagree with this license fudgery, implement it yourself.
Note:
Uses decimal numeric entities since they are best supported.
This is a DUMB function: it has no concept of keeping character entities that the target character encoding can allow. We could possibly implement a smart version, but that would require it to also know which Unicode codepoints the charset supports (not an easy task).
Similar to cleanUTF8(), but it assumes that $str is well-formed UTF-8.

Definition at line 3429 of file HTMLPurifier.standalone.php.

static HTMLPurifier_Encoder::convertToUTF8 ( str,
config,
context 
) [static]

Converts a string to UTF-8 based on configuration.

Definition at line 3352 of file HTMLPurifier.standalone.php.

References $config, iconvAvailable(), and unsafeIconv().

static HTMLPurifier_Encoder::convertToUTF8 ( str,
config,
context 
) [static]

Converts a string to UTF-8 based on configuration.

Definition at line 336 of file Encoder.php.

References $config, iconvAvailable(), and unsafeIconv().

Referenced by HTMLPurifier::purify().

static HTMLPurifier_Encoder::iconv ( in,
out,
text,
max_chunk_size = 8000 
) [static]

iconv wrapper which mutes errors and works around bugs.

Definition at line 35 of file Encoder.php.

References testIconvTruncateBug(), and unsafeIconv().

Referenced by convertFromUTF8(), and unsafeIconv().

static HTMLPurifier_Encoder::iconv ( in,
out,
text,
max_chunk_size = 8000 
) [static]

iconv wrapper which mutes errors and works around bugs.

Definition at line 3051 of file HTMLPurifier.standalone.php.

References testIconvTruncateBug(), and unsafeIconv().

static HTMLPurifier_Encoder::iconvAvailable ( ) [static]

Definition at line 3341 of file HTMLPurifier.standalone.php.

References ICONV_UNUSABLE, and testIconvTruncateBug().

static HTMLPurifier_Encoder::iconvAvailable ( ) [static]

Definition at line 325 of file Encoder.php.

References ICONV_UNUSABLE, and testIconvTruncateBug().

Referenced by convertFromUTF8(), and convertToUTF8().

static HTMLPurifier_Encoder::muteErrorHandler ( ) [static]

Error-handler that mutes errors, alternative to shut-up operator.

Definition at line 3036 of file HTMLPurifier.standalone.php.

static HTMLPurifier_Encoder::muteErrorHandler ( ) [static]

Error-handler that mutes errors, alternative to shut-up operator.

Definition at line 20 of file Encoder.php.
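
As a rough illustration of muting errors with a handler instead of the shut-up operator (@), the sketch below installs a no-op handler around a call and restores the previous handler afterwards. The names sketch_mute_error_handler and sketch_call_muted are hypothetical; how the library itself installs and removes its handler is not shown in this reference.

```php
<?php
// Hypothetical no-op handler. Returning anything other than false tells
// PHP the error was handled, so the standard handler is bypassed.
function sketch_mute_error_handler()
{
    return true;
}

// Hypothetical wrapper: run $callback with warnings and notices muted.
function sketch_call_muted($callback, array $args = array())
{
    set_error_handler('sketch_mute_error_handler');
    try {
        return call_user_func_array($callback, $args);
    } finally {
        // Always put the previous handler back, even on exceptions.
        restore_error_handler();
    }
}
```

Unlike the @ operator, this approach keeps the muting in one named place and leaves the caller's own error handler intact once the call returns.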

static HTMLPurifier_Encoder::testEncodingSupportsASCII ( encoding,
bypass = false 
) [static]

This expensive function tests whether or not a given character encoding supports ASCII.

7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail.

Parameters:
string $encoding Encoding name to test, as per iconv format
bool $bypass Whether or not to bypass the precompiled arrays.
Returns:
Array of UTF-8 characters to their corresponding ASCII, which can be used to "undo" any overzealous iconv action.

Definition at line 498 of file Encoder.php.

References unsafeIconv().

Referenced by convertFromUTF8().

static HTMLPurifier_Encoder::testEncodingSupportsASCII ( encoding,
bypass = false 
) [static]

This expensive function tests whether or not a given character encoding supports ASCII.

7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail.

Parameters:
string $encoding Encoding name to test, as per iconv format
bool $bypass Whether or not to bypass the precompiled arrays.
Returns:
Array of UTF-8 characters to their corresponding ASCII, which can be used to "undo" any overzealous iconv action.

Definition at line 3514 of file HTMLPurifier.standalone.php.

References unsafeIconv().

static HTMLPurifier_Encoder::testIconvTruncateBug ( ) [static]

glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly.

In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable.

Returns:
Error code indicating severity of bug.

Definition at line 469 of file Encoder.php.

References ICONV_OK, ICONV_TRUNCATES, ICONV_UNUSABLE, and unsafeIconv().

Referenced by iconv(), and iconvAvailable().
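
The chunking workaround described above might look roughly like this. It is a sketch under assumptions, not the library's code: sketch_chunk_utf8 is a hypothetical name, the input is assumed to be well-formed UTF-8, and chunk size is measured in bytes. Each chunk would then be converted separately and the results concatenated; the conversion call itself is left out so the example does not depend on the iconv extension.

```php
<?php
// Hypothetical sketch: split UTF-8 text into chunks of at most
// $max_chunk_size bytes without cutting through a multibyte character.
function sketch_chunk_utf8($text, $max_chunk_size = 8000)
{
    $chunks = array();
    $len = strlen($text);
    $start = 0;
    while ($start < $len) {
        $end = min($start + $max_chunk_size, $len);
        // Back up while the byte at the cut is a continuation byte
        // (10xxxxxx), so the cut lands on a character boundary.
        while ($end < $len && $end > $start
            && (ord($text[$end]) & 0xC0) === 0x80) {
            $end--;
        }
        if ($end === $start) {
            // The limit is smaller than one character: take it whole.
            $end = $start + 1;
            while ($end < $len && (ord($text[$end]) & 0xC0) === 0x80) {
                $end++;
            }
        }
        $chunks[] = substr($text, $start, $end - $start);
        $start = $end;
    }
    return $chunks;
}
```

Cutting only at character boundaries matters because a converter handed half of a multibyte sequence would see it as an illegal sequence, which is the very failure the workaround is trying to avoid.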

static HTMLPurifier_Encoder::testIconvTruncateBug ( ) [static]

glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly.

In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable.

Returns:
Error code indicating severity of bug.

Definition at line 3485 of file HTMLPurifier.standalone.php.

References ICONV_OK, ICONV_TRUNCATES, ICONV_UNUSABLE, and unsafeIconv().

static HTMLPurifier_Encoder::unichr ( code) [static]

Translates a Unicode codepoint into its corresponding UTF-8 character.

Note:
Based on Feyd's function at <http://forums.devnetwork.net/viewtopic.php?p=191404#191404>, which is in public domain.
While we're going to do code point parsing anyway, a good optimization would be to refuse to translate code points that are non-SGML characters. However, this could lead to duplication.
This is very similar to the unichr function in maintenance/generate-entity-file.php (although this is superior, due to its sanity checks).

Definition at line 288 of file Encoder.php.

Referenced by HTMLPurifier_AttrDef::expandCSSEscape(), and HTMLPurifier_EntityParser::nonSpecialEntityCallback().
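
For readers unfamiliar with the translation, a minimal sketch of encoding a code point as UTF-8 follows. The function name sketch_codepoint_to_utf8 is hypothetical, and returning an empty string for out-of-range values and surrogates is an assumption of this sketch; the reference above only says the real function performs sanity checks.

```php
<?php
// Hypothetical sketch: encode one Unicode code point as UTF-8 bytes.
function sketch_codepoint_to_utf8($code)
{
    // Assumption: reject values outside Unicode and UTF-16 surrogates.
    if ($code < 0 || $code > 0x10FFFF
        || ($code >= 0xD800 && $code <= 0xDFFF)) {
        return '';
    }
    if ($code < 0x80) {
        // One byte: 0xxxxxxx
        return chr($code);
    }
    if ($code < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($code >> 6))
            . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($code >> 12))
            . chr(0x80 | (($code >> 6) & 0x3F))
            . chr(0x80 | ($code & 0x3F));
    }
    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($code >> 18))
        . chr(0x80 | (($code >> 12) & 0x3F))
        . chr(0x80 | (($code >> 6) & 0x3F))
        . chr(0x80 | ($code & 0x3F));
}
```

For example, code point 0x20AC (the euro sign) falls in the three-byte range and encodes to the bytes E2 82 AC.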

static HTMLPurifier_Encoder::unichr ( code) [static]

Translates a Unicode codepoint into its corresponding UTF-8 character.

Note:
Based on Feyd's function at <http://forums.devnetwork.net/viewtopic.php?p=191404#191404>, which is in public domain.
While we're going to do code point parsing anyway, a good optimization would be to refuse to translate code points that are non-SGML characters. However, this could lead to duplication.
This is very similar to the unichr function in maintenance/generate-entity-file.php (although this is superior, due to its sanity checks).

Definition at line 3304 of file HTMLPurifier.standalone.php.

static HTMLPurifier_Encoder::unsafeIconv ( in,
out,
text 
) [static]

iconv wrapper which mutes errors, but doesn't work around bugs.

Definition at line 3041 of file HTMLPurifier.standalone.php.

References iconv().

static HTMLPurifier_Encoder::unsafeIconv ( in,
out,
text 
) [static]

iconv wrapper which mutes errors, but doesn't work around bugs.

Definition at line 25 of file Encoder.php.

References iconv().

Referenced by convertToUTF8(), iconv(), testEncodingSupportsASCII(), and testIconvTruncateBug().


Member Data Documentation

const HTMLPurifier_Encoder::ICONV_OK = 0

No bugs detected in iconv.

Definition at line 445 of file Encoder.php.

Referenced by testIconvTruncateBug().

const HTMLPurifier_Encoder::ICONV_TRUNCATES = 1

Iconv truncates output if converting from UTF-8 to another character set with //IGNORE, and a non-encodable character is found.

Definition at line 449 of file Encoder.php.

Referenced by testIconvTruncateBug().

const HTMLPurifier_Encoder::ICONV_UNUSABLE = 2

Iconv does not support //IGNORE, making it unusable for transcoding purposes.

Definition at line 453 of file Encoder.php.

Referenced by iconvAvailable(), and testIconvTruncateBug().


The documentation for this class was generated from the following files: