A UTF-8 specific character encoder that handles cleaning and transforming.
static | muteErrorHandler () | Error-handler that mutes errors, alternative to shut-up operator. |
static | unsafeIconv ($in, $out, $text) | iconv wrapper which mutes errors, but doesn't work around bugs. |
static | iconv ($in, $out, $text, $max_chunk_size=8000) | iconv wrapper which mutes errors and works around bugs. |
static | cleanUTF8 ($str, $force_php=false) | Cleans a UTF-8 string for well-formedness and SGML validity. |
static | unichr ($code) | Translates a Unicode codepoint into its corresponding UTF-8 character. |
static | iconvAvailable () | |
static | convertToUTF8 ($str, $config, $context) | Convert a string to UTF-8 based on configuration. |
static | convertFromUTF8 ($str, $config, $context) | Converts a string from UTF-8 based on configuration. |
static | convertToASCIIDumbLossless ($str) | Lossless (character-wise) conversion of HTML to ASCII. |
static | testIconvTruncateBug () | glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly. |
static | testEncodingSupportsASCII ($encoding, $bypass=false) | This expensive function tests whether or not a given character encoding supports ASCII. |

const | ICONV_OK = 0 | No bugs detected in iconv. |
const | ICONV_TRUNCATES = 1 | Iconv truncates output if converting from UTF-8 to another character set with //IGNORE, and a non-encodable character is found. |
const | ICONV_UNUSABLE = 2 | Iconv does not support //IGNORE, making it unusable for transcoding purposes. |

private | __construct () | Constructor throws fatal error if you attempt to instantiate class. |

A UTF-8 specific character encoder that handles cleaning and transforming.
- Note
- All functions in this class should be static.
Definition at line 7 of file Encoder.php.
◆ __construct()
private HTMLPurifier_Encoder::__construct()
Constructor throws fatal error if you attempt to instantiate class.
Definition at line 13 of file Encoder.php.
trigger_error('Cannot instantiate encoder, call methods statically', E_USER_ERROR);
◆ cleanUTF8()
static HTMLPurifier_Encoder::cleanUTF8($str, $force_php = false)
Cleans a UTF-8 string for well-formedness and SGML validity.
It parses the input as UTF-8 and returns a valid UTF-8 string with non-SGML code points removed.
- Parameters
-
string | $str | The string to clean |
bool | $force_php | |
- Returns
- string
- Note
- For reference, the non-SGML code points are 0 to 31 and 127 to 159, inclusive. Of these, code points 9, 10 and 13 (tab, line feed and carriage return, respectively) are allowed. Code points 128 and above map to multibyte UTF-8 representations.
-
Fallback code adapted from utf8ToUnicode by Henri Sivonen (hsivonen@iki.fi), available at http://iki.fi/hsivonen/php-utf8/ under the LGPL license. Notes on what changed are inside, but in general, the original code transformed UTF-8 text into an array of integer Unicode code points. Transforming that array back into a string would be somewhat expensive, so the function was modified to operate directly on the string. However, this discourages code reuse, and the logic enumerated here would be useful for any function that needs to understand UTF-8 characters. As of right now, only smart lossless character encoding converters would need that, and I'm probably not going to implement them. Once again, PHP 6 should solve all our problems.
Definition at line 127 of file Encoder.php.
References $in, and $out.
Referenced by HTMLPurifier_Printer\escape(), HTMLPurifier_AttrDef\expandCSSEscape(), and HTMLPurifier_Lexer\normalize().
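The code-point ranges in the note above can be expressed as a single PCRE character class. The sketch below only tests whether a string is already clean; it does not repair anything. The function name `isCleanUtf8` is hypothetical and not part of HTML Purifier.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): returns true when $str is
// well-formed UTF-8 made up only of SGML-valid code points.
function isCleanUtf8(string $str): bool
{
    // Allowed: tab, LF, CR, printable ASCII, and U+00A0 upward, skipping
    // surrogates (U+D800-U+DFFF) and U+FFFE/U+FFFF. With the /u modifier,
    // preg_match() returns false for malformed UTF-8, treated as unclean here.
    $pattern = '/^[\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}'
             . '\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/Du';
    return preg_match($pattern, $str) === 1;
}

var_dump(isCleanUtf8("caf\xC3\xA9\n")); // well-formed, allowed code points
var_dump(isCleanUtf8("a\x00b"));        // contains NUL (code point 0)
var_dump(isCleanUtf8("\xC3\x28"));      // malformed UTF-8 sequence
```

A check like this is cheap compared with a byte-by-byte scan, so it can serve as a fast path before any slower repair logic runs.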
'/^[\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/Du',
for ($i = 0; $i < $len; $i++) {
if (0 == (0x80 & ($in))) {
if (($in <= 31 || $in == 127) &&
} elseif (0xC0 == (0xE0 & ($in))) {
$mUcs4 = ($mUcs4 & 0x1F) << 6;
} elseif (0xE0 == (0xF0 & ($in))) {
$mUcs4 = ($mUcs4 & 0x0F) << 12;
} elseif (0xF0 == (0xF8 & ($in))) {
$mUcs4 = ($mUcs4 & 0x07) << 18;
} elseif (0xF8 == (0xFC & ($in))) {
$mUcs4 = ($mUcs4 & 0x03) << 24;
} elseif (0xFC == (0xFE & ($in))) {
$mUcs4 = ($mUcs4 & 1) << 30;
if (0x80 == (0xC0 & ($in))) {
$shift = ($mState - 1) * 6;
$tmp = ($tmp & 0x0000003F) << $shift;
if (0 == --$mState) {
if (((2 == $mBytes) && ($mUcs4 < 0x0080)) ||
((3 == $mBytes) && ($mUcs4 < 0x0800)) ||
((4 == $mBytes) && ($mUcs4 < 0x10000)) ||
(($mUcs4 & 0xFFFFF800) == 0xD800) ||
} elseif (0xFEFF != $mUcs4 &&
(0x20 <= $mUcs4 && 0x7E >= $mUcs4) ||
(0xA0 <= $mUcs4 && 0xD7FF >= $mUcs4) ||
(0x10000 <= $mUcs4 && 0x10FFFF >= $mUcs4)
◆ convertFromUTF8()
static HTMLPurifier_Encoder::convertFromUTF8($str, $config, $context)
Converts a string from UTF-8 based on configuration.
- Parameters
-
- Returns
- string
- Note
- Currently, this is a lossy conversion: characters that cannot be expressed in the target encoding are omitted.
Definition at line 420 of file Encoder.php.
References $config, and array.
Referenced by HTMLPurifier\purify().
$encoding = $config->get('Core.Encoding');
if ($escape = $config->get('Core.EscapeNonASCIICharacters')) {
$str = self::convertToASCIIDumbLossless($str);
if ($encoding === 'utf-8') {
static $iconv = null;
if ($iconv === null) {
$iconv = self::iconvAvailable();
if ($iconv && !$config->get('Test.ForceNoIconv')) {
$ascii_fix = self::testEncodingSupportsASCII($encoding);
if (!$escape && !empty($ascii_fix)) {
$clear_fix = array();
foreach ($ascii_fix as $utf8 => $native) {
$clear_fix[$utf8] = '';
$str = strtr($str, $clear_fix);
$str = strtr($str, array_flip($ascii_fix));
$str = self::iconv('utf-8', $encoding . '//IGNORE', $str);
} elseif ($encoding === 'iso-8859-1') {
$str = utf8_decode($str);
trigger_error('Encoding not supported', E_USER_ERROR);
◆ convertToASCIIDumbLossless()
static HTMLPurifier_Encoder::convertToASCIIDumbLossless($str)
Lossless (character-wise) conversion of HTML to ASCII.
- Parameters
-
string | $str | UTF-8 string to be converted to ASCII |
- Returns
- string ASCII-encoded string with non-ASCII characters converted to numeric entities
- Warning
- Adapted from MediaWiki, claiming fair use: this is a common algorithm. If you disagree with this license fudgery, implement it yourself.
- Note
- Uses decimal numeric entities since they are best supported.
-
This is a DUMB function: it has no concept of keeping character entities that the projected character encoding can allow. We could possibly implement a smart version but that would require it to also know which Unicode codepoints the charset supported (not an easy task).
-
Overlaps somewhat with cleanUTF8(), but it assumes that $str is already well-formed UTF-8.
Definition at line 474 of file Encoder.php.
References $result.
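A minimal sketch of the byte-wise approach described above, assuming the input is already well-formed UTF-8: each multibyte sequence is accumulated into a code point and emitted as a decimal numeric entity. The function name `toAsciiEntities` is hypothetical, and the body is an illustration rather than the library's exact source.

```php
<?php
// Hypothetical stand-in for illustration; assumes $str is well-formed UTF-8.
function toAsciiEntities(string $str): string
{
    $result = '';
    $working = 0;    // code point being assembled
    $bytesleft = 0;  // continuation bytes still expected
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $bytevalue = ord($str[$i]);
        if ($bytevalue <= 0x7F) {          // 0xxxxxxx: plain ASCII, copy through
            $result .= $str[$i];
            $bytesleft = 0;
        } elseif ($bytevalue <= 0xBF) {    // 10xxxxxx: continuation byte
            $working = ($working << 6) + ($bytevalue & 0x3F);
            $bytesleft--;
            if ($bytesleft <= 0) {
                $result .= '&#' . $working . ';';
            }
        } elseif ($bytevalue <= 0xDF) {    // 110xxxxx: start of 2-byte sequence
            $working = $bytevalue & 0x1F;
            $bytesleft = 1;
        } elseif ($bytevalue <= 0xEF) {    // 1110xxxx: start of 3-byte sequence
            $working = $bytevalue & 0x0F;
            $bytesleft = 2;
        } else {                           // 11110xxx: start of 4-byte sequence
            $working = $bytevalue & 0x07;
            $bytesleft = 3;
        }
    }
    return $result;
}

echo toAsciiEntities("caf\xC3\xA9"), "\n"; // prints "caf&#233;"
```

Because every non-ASCII character becomes an entity regardless of the target encoding, the output is safe for any ASCII-compatible charset, at the cost of larger output.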
for ($i = 0; $i < $len; $i++) {
$bytevalue = ord($str[$i]);
if ($bytevalue <= 0x7F) {
} elseif ($bytevalue <= 0xBF) {
$working = $working << 6;
$working += ($bytevalue & 0x3F);
if ($bytesleft <= 0) {
$result .= "&#" . $working . ";";
} elseif ($bytevalue <= 0xDF) {
$working = $bytevalue & 0x1F;
} elseif ($bytevalue <= 0xEF) {
$working = $bytevalue & 0x0F;
$working = $bytevalue & 0x07;
◆ convertToUTF8()
static HTMLPurifier_Encoder::convertToUTF8($str, $config, $context)
Convert a string to UTF-8 based on configuration.
- Parameters
-
- Returns
- string
Definition at line 372 of file Encoder.php.
References $config, and testIconvTruncateBug().
Referenced by HTMLPurifier\purify().
$encoding = $config->get('Core.Encoding');
if ($encoding === 'utf-8') {
static $iconv = null;
if ($iconv === null) {
$iconv = self::iconvAvailable();
if ($iconv && !$config->get('Test.ForceNoIconv')) {
$str = self::unsafeIconv($encoding, 'utf-8//IGNORE', $str);
if ($str === false) {
trigger_error('Invalid encoding ' . $encoding, E_USER_ERROR);
$str = strtr($str, self::testEncodingSupportsASCII($encoding));
} elseif ($encoding === 'iso-8859-1') {
$str = utf8_encode($str);
if ($bug == self::ICONV_OK) {
trigger_error('Encoding not supported, please install iconv', E_USER_ERROR);
'You have a buggy version of iconv, see https://bugs.php.net/bug.php?id=48147 ' .
'and http://sourceware.org/bugzilla/show_bug.cgi?id=13541',
◆ iconv()
static HTMLPurifier_Encoder::iconv($in, $out, $text, $max_chunk_size = 8000)
iconv wrapper which mutes errors and works around bugs.
- Parameters
-
string | $in | Input encoding |
string | $out | Output encoding |
string | $text | The text to convert |
int | $max_chunk_size | |
- Returns
- string
Definition at line 48 of file Encoder.php.
References $code, $in, $out, $r, and $text.
Referenced by unsafeIconv().
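When the truncation workaround is needed, the input has to be cut into chunks of at most `$max_chunk_size` bytes without splitting a multibyte UTF-8 character. A rough sketch of that boundary search follows; `splitUtf8Chunks` is a hypothetical name, and falling back to a hard cut on malformed input is an assumption of the sketch, not documented behaviour.

```php
<?php
// Hypothetical helper: split UTF-8 text into chunks of at most $max bytes,
// never cutting inside a multibyte character. Requires $max >= 4.
function splitUtf8Chunks(string $text, int $max = 8000): array
{
    if ($max < 4) {
        throw new InvalidArgumentException('max chunk size is too small');
    }
    $chunks = [];
    $len = strlen($text);
    $i = 0;
    while ($i < $len) {
        if ($i + $max >= $len) {
            $chunks[] = substr($text, $i); // remainder fits in one chunk
            break;
        }
        // Step back (at most 3 bytes) until the byte just past the chunk is
        // not a continuation byte (10xxxxxx), i.e. it starts a new character.
        $size = $max; // sketch assumption: hard cut if the input is malformed
        for ($k = 0; $k < 4; $k++) {
            if (0x80 != (0xC0 & ord($text[$i + $max - $k]))) {
                $size = $max - $k;
                break;
            }
        }
        $chunks[] = substr($text, $i, $size);
        $i += $size;
    }
    return $chunks;
}

// Tiny chunk size, chosen only to make the boundaries visible.
print_r(splitUtf8Chunks("a\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80b", 4));
```

A UTF-8 character is at most four bytes long, so looking back three bytes from the proposed cut is enough to find a character boundary; this is also why a chunk size below 4 cannot work.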
$code = self::testIconvTruncateBug();
if ($code == self::ICONV_OK) {
} elseif ($code == self::ICONV_TRUNCATES) {
if ($max_chunk_size < 4) {
trigger_error('max_chunk_size is too small', E_USER_WARNING);
if (($c = strlen($text)) <= $max_chunk_size) {
if ($i + $max_chunk_size >= $c) {
if (0x80 != (0xC0 & ord($text[$i + $max_chunk_size]))) {
$chunk_size = $max_chunk_size;
} elseif (0x80 != (0xC0 & ord($text[$i + $max_chunk_size - 1]))) {
$chunk_size = $max_chunk_size - 1;
} elseif (0x80 != (0xC0 & ord($text[$i + $max_chunk_size - 2]))) {
$chunk_size = $max_chunk_size - 2;
} elseif (0x80 != (0xC0 & ord($text[$i + $max_chunk_size - 3]))) {
$chunk_size = $max_chunk_size - 3;
$chunk = substr($text, $i, $chunk_size);
$r .= self::unsafeIconv($in, $out, $chunk);
◆ iconvAvailable()
static HTMLPurifier_Encoder::iconvAvailable()
- Returns
- bool
Definition at line 356 of file Encoder.php.
static $iconv = null;
if ($iconv === null) {
$iconv = function_exists('iconv') && self::testIconvTruncateBug() != self::ICONV_UNUSABLE;
◆ muteErrorHandler()
static HTMLPurifier_Encoder::muteErrorHandler()
Error-handler that mutes errors, alternative to shut-up operator.
Definition at line 21 of file Encoder.php.
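The muting pattern amounts to installing a do-nothing error handler around a call and restoring the previous handler afterwards. The sketch below shows the idea with a hypothetical `callMuted` wrapper; it is not the library's code.

```php
<?php
// Hypothetical wrapper: run $fn with PHP warnings and notices suppressed by a
// temporary error handler, instead of the @ (shut-up) operator.
function callMuted(callable $fn)
{
    // A handler that returns true tells PHP the error has been handled.
    set_error_handler(function () {
        return true;
    });
    try {
        return $fn();
    } finally {
        restore_error_handler(); // always put the previous handler back
    }
}

$value = callMuted(function () {
    trigger_error('noisy warning', E_USER_WARNING); // swallowed by the handler
    return 42;
});
```

Restoring the handler in a `finally` block keeps the suppression scoped to the single call, even if the callback throws.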
◆ testEncodingSupportsASCII()
static HTMLPurifier_Encoder::testEncodingSupportsASCII($encoding, $bypass = false)
This expensive function tests whether or not a given character encoding supports ASCII.
7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail.
- Parameters
-
string | $encoding | Encoding name to test, as per iconv format |
bool | $bypass | Whether or not to bypass the precompiled arrays. |
- Returns
- Array of UTF-8 characters to their corresponding ASCII, which can be used to "undo" any overzealous iconv action.
Definition at line 565 of file Encoder.php.
References $r, $ret, and array.
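The returned array is keyed by the UTF-8 character that iconv produced and maps to the ASCII character that was intended, so it can be passed straight to `strtr()`. A small sketch follows, using a yen-sign and overline mapping as sample data; which encoding such a mapping belongs to is not stated here, so treat it purely as an example.

```php
<?php
// Sample map in the shape described above: UTF-8 character => ASCII character.
$asciiFix = array(
    "\xC2\xA5" => '\\',     // U+00A5 YEN SIGN -> backslash
    "\xE2\x80\xBE" => '~',  // U+203E OVERLINE -> tilde
);

// After converting *to* UTF-8, undo the substitution.
$converted = "C:\xC2\xA5temp\xE2\x80\xBE1";
$fixed = strtr($converted, $asciiFix);

// Before converting *from* UTF-8, apply the map in reverse.
$restored = strtr($fixed, array_flip($asciiFix));
```

Here `$fixed` holds the plain ASCII path and `$restored` is byte-for-byte identical to `$converted`.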
static $encodings = array();
if (isset($encodings[$encoding])) {
return $encodings[$encoding];
$lenc = strtolower($encoding);
return array("\xC2\xA5" => '\\', "\xE2\x80\xBE" => '~');
return array("\xE2\x82\xA9" => '\\');
if (strpos($lenc, 'iso-8859-') === 0) {
if (self::unsafeIconv('UTF-8', $encoding, 'a') === false) {
for ($i = 0x20; $i <= 0x7E; $i++) {
$r = self::unsafeIconv('UTF-8', "$encoding//IGNORE", $c);
($r === $c && self::unsafeIconv($encoding, 'UTF-8//IGNORE', $r) !== $c)
$ret[self::unsafeIconv($encoding, 'UTF-8//IGNORE', $c)] = $c;
$encodings[$encoding] = $ret;
◆ testIconvTruncateBug()
static HTMLPurifier_Encoder::testIconvTruncateBug()
glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly.
In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable.
- Returns
- int Error code indicating severity of bug.
Definition at line 531 of file Encoder.php.
References $code, and $r.
Referenced by convertToUTF8().
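The probe converts a string made of one non-ASCII character followed by 9000 ASCII characters and classifies iconv by the length of what comes back. A sketch of that classification as a pure function follows; `classifyIconvProbe` and the `PROBE_*` constants are hypothetical names that mirror the class constants, and treating a `false` result as "unusable" is an inference from the description above.

```php
<?php
// Hypothetical mirrors of ICONV_OK, ICONV_TRUNCATES and ICONV_UNUSABLE.
const PROBE_OK = 0;
const PROBE_TRUNCATES = 1;
const PROBE_UNUSABLE = 2;

// Classify the result of converting "\xCE\xB1" . str_repeat('a', 9000)
// from UTF-8 to ASCII with //IGNORE. $r is the converted string, or false.
function classifyIconvProbe($r): int
{
    if ($r === false) {
        return PROBE_UNUSABLE;   // assumption: the conversion failed outright
    }
    $c = strlen($r);
    if ($c < 9000) {
        return PROBE_TRUNCATES;  // output was cut short
    }
    if ($c > 9000) {
        throw new RuntimeException('unexpectedly long iconv output');
    }
    return PROBE_OK;             // the alpha was dropped, all 9000 "a"s kept
}

// Synthetic inputs standing in for real iconv results:
var_dump(classifyIconvProbe(str_repeat('a', 9000))); // int(0)
var_dump(classifyIconvProbe(str_repeat('a', 8190))); // int(1)
var_dump(classifyIconvProbe(false));                 // int(2)
```

Keeping the classification separate from the actual iconv call makes the three outcomes easy to exercise without depending on the local iconv build.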
if ($code === null) {
$r = self::unsafeIconv('utf-8', 'ascii//IGNORE', "\xCE\xB1" . str_repeat('a', 9000));
$code = self::ICONV_UNUSABLE;
} elseif (($c = strlen($r)) < 9000) {
$code = self::ICONV_TRUNCATES;
} elseif ($c > 9000) {
'Your copy of iconv is extremely buggy. Please notify HTML Purifier maintainers: ' .
'include your iconv version as per phpversion()',
$code = self::ICONV_OK;
◆ unichr()
static HTMLPurifier_Encoder::unichr($code)
Translates a Unicode codepoint into its corresponding UTF-8 character.
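For illustration, a code point can be turned into UTF-8 bytes with a few shifts and masks. This is a sketch under the standard UTF-8 encoding rules, not the library's implementation; `codepointToUtf8` is a hypothetical name, and returning an empty string for surrogates and out-of-range values is an assumption of the sketch.

```php
<?php
// Hypothetical helper: encode a Unicode code point as a UTF-8 byte string.
function codepointToUtf8(int $code): string
{
    // Sketch assumption: invalid input yields an empty string.
    if ($code < 0 || $code > 0x10FFFF || ($code >= 0xD800 && $code <= 0xDFFF)) {
        return '';
    }
    if ($code < 0x80) {            // 1 byte: 0xxxxxxx
        return chr($code);
    }
    if ($code < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($code >> 6))
             . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($code >> 12))
             . chr(0x80 | (($code >> 6) & 0x3F))
             . chr(0x80 | ($code & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($code >> 18))
         . chr(0x80 | (($code >> 12) & 0x3F))
         . chr(0x80 | (($code >> 6) & 0x3F))
         . chr(0x80 | ($code & 0x3F));
}

echo bin2hex(codepointToUtf8(0x20AC)), "\n"; // prints "e282ac" (the euro sign)
```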
◆ unsafeIconv()
static HTMLPurifier_Encoder::unsafeIconv($in, $out, $text)
iconv wrapper which mutes errors, but doesn't work around bugs.
- Parameters
-
string | $in | Input encoding |
string | $out | Output encoding |
string | $text | The text to convert |
- Returns
- string
Definition at line 32 of file Encoder.php.
References $in, $out, $r, $text, array, and iconv().
set_error_handler(array('HTMLPurifier_Encoder', 'muteErrorHandler'));
restore_error_handler();
◆ ICONV_OK
const HTMLPurifier_Encoder::ICONV_OK = 0
No bugs detected in iconv.
Definition at line 507 of file Encoder.php.
◆ ICONV_TRUNCATES
const HTMLPurifier_Encoder::ICONV_TRUNCATES = 1
Iconv truncates output if converting from UTF-8 to another character set with //IGNORE, and a non-encodable character is found.
Definition at line 511 of file Encoder.php.
◆ ICONV_UNUSABLE
const HTMLPurifier_Encoder::ICONV_UNUSABLE = 2
Iconv does not support //IGNORE, making it unusable for transcoding purposes.
Definition at line 515 of file Encoder.php.
The documentation for this class was generated from the following file:
- libs/composer/vendor/ezyang/htmlpurifier/library/HTMLPurifier/Encoder.php