ILIAS  release_5-4 Revision v5.4.26-12-gabc799a52e6
HTMLPurifier_Encoder Class Reference

A UTF-8 specific character encoder that handles cleaning and transforming.

Static Public Member Functions

static muteErrorHandler ()
 Error-handler that mutes errors, alternative to shut-up operator.

static unsafeIconv ($in, $out, $text)
 iconv wrapper which mutes errors, but doesn't work around bugs.

static iconv ($in, $out, $text, $max_chunk_size=8000)
 iconv wrapper which mutes errors and works around bugs.

static cleanUTF8 ($str, $force_php=false)
 Cleans a UTF-8 string for well-formedness and SGML validity.

static unichr ($code)
 Translates a Unicode codepoint into its corresponding UTF-8 character.

static iconvAvailable ()

static convertToUTF8 ($str, $config, $context)
 Convert a string to UTF-8 based on configuration.

static convertFromUTF8 ($str, $config, $context)
 Converts a string from UTF-8 based on configuration.

static convertToASCIIDumbLossless ($str)
 Lossless (character-wise) conversion of HTML to ASCII.

static testIconvTruncateBug ()
 glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly.

static testEncodingSupportsASCII ($encoding, $bypass=false)
 This expensive function tests whether or not a given character encoding supports ASCII.
 

Data Fields

const ICONV_OK = 0
 No bugs detected in iconv.

const ICONV_TRUNCATES = 1
 Iconv truncates output if converting from UTF-8 to another character set with //IGNORE, and a non-encodable character is found.

const ICONV_UNUSABLE = 2
 Iconv does not support //IGNORE, making it unusable for transcoding purposes.
 

Private Member Functions

 __construct ()
 Constructor throws fatal error if you attempt to instantiate class.
 

Detailed Description

A UTF-8 specific character encoder that handles cleaning and transforming.

Note
All functions in this class should be static.

Definition at line 7 of file Encoder.php.

Constructor & Destructor Documentation

◆ __construct()

HTMLPurifier_Encoder::__construct ( )
private

Constructor throws fatal error if you attempt to instantiate class.

Definition at line 13 of file Encoder.php.

14     {
15         trigger_error('Cannot instantiate encoder, call methods statically', E_USER_ERROR);
16     }

Member Function Documentation

◆ cleanUTF8()

static HTMLPurifier_Encoder::cleanUTF8 (   $str,
  $force_php = false 
)
static

Cleans a UTF-8 string for well-formedness and SGML validity.

It parses the input as UTF-8 and returns a valid UTF-8 string with non-SGML code points excluded.

Specifically, it will permit the code points U+0009, U+000A, U+000D, U+0020 to U+007E, U+00A0 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF. Source: https://www.w3.org/TR/REC-xml/#NT-Char

Arguably this function should be modernized to the HTML5 set of allowed characters (https://www.w3.org/TR/html5/syntax.html#preprocessing-the-input-stream), which simultaneously expands and restricts the set of allowed characters.

Parameters
string $str The string to clean
bool $force_php
Returns
string
Note
Just for reference, the non-SGML code points are 0 to 31 and 127 to 159, inclusive. However, we allow code points 9, 10 and 13, which are the tab, line feed and carriage return respectively. Code points 128 and above map to multibyte UTF-8 representations.
Fallback code adapted from utf8ToUnicode by Henri Sivonen (hsivonen@iki.fi), available at http://iki.fi/hsivonen/php-utf8/ under the LGPL license. Notes on what changed are inside, but in general, the original code transformed UTF-8 text into an array of integer Unicode codepoints. Understandably, transforming that back to a string would be somewhat expensive, so the function was modified to operate directly on the string. However, this discourages code reuse, and the logic enumerated here would be useful for any function that needs to be able to understand UTF-8 characters. As of right now, only smart lossless character encoding converters would need that, and I'm probably not going to implement them.
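
The fast path of cleanUTF8() is a single regular-expression check against the permitted code points. A minimal standalone sketch of that check follows; the function name sketch_is_clean_utf8 is hypothetical and not part of HTMLPurifier_Encoder, and the sketch covers only the allow-list test, not the byte-level fallback.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): reports whether a string
// would pass the fast-path check and be returned unchanged.
function sketch_is_clean_utf8(string $str): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8; a
    // malformed subject makes preg_match() return false instead of 1.
    // The /D modifier stops `$` from matching before a trailing newline.
    return preg_match(
        '/^[\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/Du',
        $str
    ) === 1;
}
```

A string for which this returns false is the case where the byte-by-byte fallback in the source listing does the actual cleaning.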

Definition at line 134 of file Encoder.php.

References $i, $in, and $out.

Referenced by HTMLPurifier_Printer\escape(), HTMLPurifier_AttrDef\expandCSSEscape(), and HTMLPurifier_Lexer\normalize().

135     {
136         // UTF-8 validity is checked since PHP 4.3.5
137         // This is an optimization: if the string is already valid UTF-8, no
138         // need to do PHP stuff. 99% of the time, this will be the case.
139         if (preg_match(
140             '/^[\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/Du',
141             $str
142         )) {
143             return $str;
144         }
145
146         $mState = 0; // cached expected number of octets after the current octet
147                      // until the beginning of the next UTF8 character sequence
148         $mUcs4 = 0; // cached Unicode character
149         $mBytes = 1; // cached expected number of octets in the current sequence
150
151         // original code involved an $out that was an array of Unicode
152         // codepoints. Instead of having to convert back into UTF-8, we've
153         // decided to directly append valid UTF-8 characters onto a string
154         // $out once they're done. $char accumulates raw bytes, while $mUcs4
155         // turns into the Unicode code point, so there's some redundancy.
156
157         $out = '';
158         $char = '';
159
160         $len = strlen($str);
161         for ($i = 0; $i < $len; $i++) {
162             $in = ord($str[$i]);
163             $char .= $str[$i]; // append byte to char
164             if (0 == $mState) {
165                 // When mState is zero we expect either a US-ASCII character
166                 // or a multi-octet sequence.
167                 if (0 == (0x80 & ($in))) {
168                     // US-ASCII, pass straight through.
169                     if (($in <= 31 || $in == 127) &&
170                         !($in == 9 || $in == 13 || $in == 10) // save \r\t\n
171                     ) {
172                         // control characters, remove
173                     } else {
174                         $out .= $char;
175                     }
176                     // reset
177                     $char = '';
178                     $mBytes = 1;
179                 } elseif (0xC0 == (0xE0 & ($in))) {
180                     // First octet of 2 octet sequence
181                     $mUcs4 = ($in);
182                     $mUcs4 = ($mUcs4 & 0x1F) << 6;
183                     $mState = 1;
184                     $mBytes = 2;
185                 } elseif (0xE0 == (0xF0 & ($in))) {
186                     // First octet of 3 octet sequence
187                     $mUcs4 = ($in);
188                     $mUcs4 = ($mUcs4 & 0x0F) << 12;
189                     $mState = 2;
190                     $mBytes = 3;
191                 } elseif (0xF0 == (0xF8 & ($in))) {
192                     // First octet of 4 octet sequence
193                     $mUcs4 = ($in);
194                     $mUcs4 = ($mUcs4 & 0x07) << 18;
195                     $mState = 3;
196                     $mBytes = 4;
197                 } elseif (0xF8 == (0xFC & ($in))) {
198                     // First octet of 5 octet sequence.
199                     //
200                     // This is illegal because the encoded codepoint must be
201                     // either:
202                     // (a) not the shortest form or
203                     // (b) outside the Unicode range of 0-0x10FFFF.
204                     // Rather than trying to resynchronize, we will carry on
205                     // until the end of the sequence and let the later error
206                     // handling code catch it.
207                     $mUcs4 = ($in);
208                     $mUcs4 = ($mUcs4 & 0x03) << 24;
209                     $mState = 4;
210                     $mBytes = 5;
211                 } elseif (0xFC == (0xFE & ($in))) {
212                     // First octet of 6 octet sequence, see comments for 5
213                     // octet sequence.
214                     $mUcs4 = ($in);
215                     $mUcs4 = ($mUcs4 & 1) << 30;
216                     $mState = 5;
217                     $mBytes = 6;
218                 } else {
219                     // Current octet is neither in the US-ASCII range nor a
220                     // legal first octet of a multi-octet sequence.
221                     $mState = 0;
222                     $mUcs4 = 0;
223                     $mBytes = 1;
224                     $char = '';
225                 }
226             } else {
227                 // When mState is non-zero, we expect a continuation of the
228                 // multi-octet sequence
229                 if (0x80 == (0xC0 & ($in))) {
230                     // Legal continuation.
231                     $shift = ($mState - 1) * 6;
232                     $tmp = $in;
233                     $tmp = ($tmp & 0x0000003F) << $shift;
234                     $mUcs4 |= $tmp;
235
236                     if (0 == --$mState) {
237                         // End of the multi-octet sequence. mUcs4 now contains
238                         // the final Unicode codepoint to be output
239
240                         // Check for illegal sequences and codepoints.
241
242                         // From Unicode 3.1, non-shortest form is illegal
243                         if (((2 == $mBytes) && ($mUcs4 < 0x0080)) ||
244                             ((3 == $mBytes) && ($mUcs4 < 0x0800)) ||
245                             ((4 == $mBytes) && ($mUcs4 < 0x10000)) ||
246                             (4 < $mBytes) ||
247                             // From Unicode 3.2, surrogate characters = illegal
248                             (($mUcs4 & 0xFFFFF800) == 0xD800) ||
249                             // Codepoints outside the Unicode range are illegal
250                             ($mUcs4 > 0x10FFFF)
251                         ) {
252
253                         } elseif (0xFEFF != $mUcs4 && // omit BOM
254                             // check for valid Char unicode codepoints
255                             (
256                                 0x9 == $mUcs4 ||
257                                 0xA == $mUcs4 ||
258                                 0xD == $mUcs4 ||
259                                 (0x20 <= $mUcs4 && 0x7E >= $mUcs4) ||
260                                 // 7F-9F is not strictly prohibited by XML,
261                                 // but it is non-SGML, and thus we don't allow it
262                                 (0xA0 <= $mUcs4 && 0xD7FF >= $mUcs4) ||
263                                 (0xE000 <= $mUcs4 && 0xFFFD >= $mUcs4) ||
264                                 (0x10000 <= $mUcs4 && 0x10FFFF >= $mUcs4)
265                             )
266                         ) {
267                             $out .= $char;
268                         }
269                         // initialize UTF8 cache (reset)
270                         $mState = 0;
271                         $mUcs4 = 0;
272                         $mBytes = 1;
273                         $char = '';
274                     }
275                 } else {
276                     // ((0xC0 & (*in) != 0x80) && (mState != 0))
277                     // Incomplete multi-octet sequence.
278                     // used to result in complete fail, but we'll reset
279                     $mState = 0;
280                     $mUcs4 = 0;
281                     $mBytes = 1;
282                     $char ='';
283                 }
284             }
285         }
286         return $out;
287     }

◆ convertFromUTF8()

static HTMLPurifier_Encoder::convertFromUTF8 (   $str,
  $config,
  $context 
)
static

Converts a string from UTF-8 based on configuration.

Parameters
string $str The string to convert
HTMLPurifier_Config $config
HTMLPurifier_Context $context
Returns
string
Note
Currently, this is a lossy conversion: characters that cannot be expressed in the target encoding are omitted.

Definition at line 426 of file Encoder.php.

References $config.

Referenced by HTMLPurifier\purify().

427     {
428         $encoding = $config->get('Core.Encoding');
429         if ($escape = $config->get('Core.EscapeNonASCIICharacters')) {
430             $str = self::convertToASCIIDumbLossless($str);
431         }
432         if ($encoding === 'utf-8') {
433             return $str;
434         }
435         static $iconv = null;
436         if ($iconv === null) {
437             $iconv = self::iconvAvailable();
438         }
439         if ($iconv && !$config->get('Test.ForceNoIconv')) {
440             // Undo our previous fix in convertToUTF8, otherwise iconv will barf
441             $ascii_fix = self::testEncodingSupportsASCII($encoding);
442             if (!$escape && !empty($ascii_fix)) {
443                 $clear_fix = array();
444                 foreach ($ascii_fix as $utf8 => $native) {
445                     $clear_fix[$utf8] = '';
446                 }
447                 $str = strtr($str, $clear_fix);
448             }
449             $str = strtr($str, array_flip($ascii_fix));
450             // Normal stuff
451             $str = self::iconv('utf-8', $encoding . '//IGNORE', $str);
452             return $str;
453         } elseif ($encoding === 'iso-8859-1') {
454             $str = utf8_decode($str);
455             return $str;
456         }
457         trigger_error('Encoding not supported', E_USER_ERROR);
458         // You might be tempted to assume that the ASCII representation
459         // might be OK, however, this is *not* universally true over all
460         // encodings. So we take the conservative route here, rather
461         // than forcibly turn on %Core.EscapeNonASCIICharacters
462     }

◆ convertToASCIIDumbLossless()

static HTMLPurifier_Encoder::convertToASCIIDumbLossless (   $str)
static

Lossless (character-wise) conversion of HTML to ASCII.

Parameters
string $str UTF-8 string to be converted to ASCII
Returns
string ASCII-encoded string with non-ASCII characters replaced by numeric entities
Warning
Adapted from MediaWiki, claiming fair use: this is a common algorithm. If you disagree with this license fudgery, implement it yourself.
Note
Uses decimal numeric entities since they are best supported.
This is a DUMB function: it has no concept of keeping character entities that the projected character encoding can allow. We could possibly implement a smart version but that would require it to also know which Unicode codepoints the charset supported (not an easy task).
Its decoding loop resembles that of cleanUTF8(), but it assumes that $str is well-formed UTF-8.
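
To illustrate the kind of output this conversion produces, here is a minimal standalone sketch that replaces every non-ASCII character with a decimal numeric entity. It is not the library's implementation: the names sketch_utf8_ord and sketch_entityize are hypothetical, and like the original it assumes well-formed UTF-8 input.

```php
<?php
// Hypothetical helper: decode one well-formed UTF-8 character (1-4 bytes)
// into its Unicode code point.
function sketch_utf8_ord(string $ch): int
{
    $b = array_values(unpack('C*', $ch)); // byte values, 0-indexed
    switch (count($b)) {
        case 1:
            return $b[0];
        case 2:
            return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:
            return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default:
            return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
}

// Hypothetical helper: ASCII passes through, everything else becomes &#NNN;
function sketch_entityize(string $utf8): string
{
    return (string) preg_replace_callback(
        '/[^\x00-\x7F]/u', // with /u, one match is one whole code point
        function (array $m): string {
            return '&#' . sketch_utf8_ord($m[0]) . ';';
        },
        $utf8
    );
}
```

For example, sketch_entityize("caf\xC3\xA9") yields "caf&#233;", because U+00E9 is 233 in decimal.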

Definition at line 480 of file Encoder.php.

References $i, and $result.

481     {
482         $bytesleft = 0;
483         $result = '';
484         $working = 0;
485         $len = strlen($str);
486         for ($i = 0; $i < $len; $i++) {
487             $bytevalue = ord($str[$i]);
488             if ($bytevalue <= 0x7F) { //0xxx xxxx
489                 $result .= chr($bytevalue);
490                 $bytesleft = 0;
491             } elseif ($bytevalue <= 0xBF) { //10xx xxxx
492                 $working = $working << 6;
493                 $working += ($bytevalue & 0x3F);
494                 $bytesleft--;
495                 if ($bytesleft <= 0) {
496                     $result .= "&#" . $working . ";";
497                 }
498             } elseif ($bytevalue <= 0xDF) { //110x xxxx
499                 $working = $bytevalue & 0x1F;
500                 $bytesleft = 1;
501             } elseif ($bytevalue <= 0xEF) { //1110 xxxx
502                 $working = $bytevalue & 0x0F;
503                 $bytesleft = 2;
504             } else { //1111 0xxx
505                 $working = $bytevalue & 0x07;
506                 $bytesleft = 3;
507             }
508         }
509         return $result;
510     }

◆ convertToUTF8()

static HTMLPurifier_Encoder::convertToUTF8 (   $str,
  $config,
  $context 
)
static

Convert a string to UTF-8 based on configuration.

Parameters
string $str The string to convert
HTMLPurifier_Config $config
HTMLPurifier_Context $context
Returns
string
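
The order in which this function picks a conversion route can be summarized in a small decision function. The sketch below is only an illustration of that order as it appears in the source listing; the name sketch_to_utf8_strategy and its string return values are hypothetical, and the $iconvUsable flag stands in for the combined iconvAvailable() and Test.ForceNoIconv check.

```php
<?php
// Hypothetical helper: names the route convertToUTF8() would take, without
// performing any conversion. Encoding names are compared in lowercase, as
// in the listing.
function sketch_to_utf8_strategy(string $encoding, bool $iconvUsable): string
{
    if ($encoding === 'utf-8') {
        return 'passthrough';  // input is returned unchanged
    }
    if ($iconvUsable) {
        return 'iconv';        // unsafeIconv() plus the ASCII fix-up
    }
    if ($encoding === 'iso-8859-1') {
        return 'utf8_encode';  // built-in fallback for Latin-1 only
    }
    return 'unsupported';      // the listing triggers E_USER_ERROR here
}
```

Without a usable iconv, only UTF-8 and ISO-8859-1 input can be handled; every other encoding ends in the error branch.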

Definition at line 378 of file Encoder.php.

References $config, and testIconvTruncateBug().

Referenced by HTMLPurifier\purify().

379     {
380         $encoding = $config->get('Core.Encoding');
381         if ($encoding === 'utf-8') {
382             return $str;
383         }
384         static $iconv = null;
385         if ($iconv === null) {
386             $iconv = self::iconvAvailable();
387         }
388         if ($iconv && !$config->get('Test.ForceNoIconv')) {
389             // unaffected by bugs, since UTF-8 support all characters
390             $str = self::unsafeIconv($encoding, 'utf-8//IGNORE', $str);
391             if ($str === false) {
392                 // $encoding is not a valid encoding
393                 trigger_error('Invalid encoding ' . $encoding, E_USER_ERROR);
394                 return '';
395             }
396             // If the string is bjorked by Shift_JIS or a similar encoding
397             // that doesn't support all of ASCII, convert the naughty
398             // characters to their true byte-wise ASCII/UTF-8 equivalents.
399             $str = strtr($str, self::testEncodingSupportsASCII($encoding));
400             return $str;
401         } elseif ($encoding === 'iso-8859-1') {
402             $str = utf8_encode($str);
403             return $str;
404         }
405         $bug = HTMLPurifier_Encoder::testIconvTruncateBug();
406         if ($bug == self::ICONV_OK) {
407             trigger_error('Encoding not supported, please install iconv', E_USER_ERROR);
408         } else {
409             trigger_error(
410                 'You have a buggy version of iconv, see https://bugs.php.net/bug.php?id=48147 ' .
411                 'and http://sourceware.org/bugzilla/show_bug.cgi?id=13541',
412                 E_USER_ERROR
413             );
414         }
415     }

◆ iconv()

static HTMLPurifier_Encoder::iconv (   $in,
  $out,
  $text,
  $max_chunk_size = 8000 
)
static

iconv wrapper which mutes errors and works around bugs.

Parameters
string $in Input encoding
string $out Output encoding
string $text The text to convert
int $max_chunk_size
Returns
string
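
The workaround for the truncation bug depends on cutting UTF-8 input into chunks without splitting a multibyte character. The following standalone sketch shows that boundary search on its own, under the same assumptions as the source: the chunk limit is at least 4 bytes and the input is UTF-8. The name sketch_split_utf8 is hypothetical, and unlike the library method it throws instead of returning false.

```php
<?php
// Hypothetical helper: split $text into chunks of at most $max bytes, each
// ending on a UTF-8 character boundary.
function sketch_split_utf8(string $text, int $max): array
{
    if ($max < 4) {
        throw new InvalidArgumentException('max chunk size must be at least 4');
    }
    $chunks = array();
    $len = strlen($text);
    $i = 0;
    while ($len - $i > $max) {
        $size = null;
        // A chunk may end at offset $i + $max - $back only if the byte
        // there is NOT a continuation byte (10xxxxxx), i.e. it starts a
        // new character. Try backing off by 0 to 3 bytes.
        for ($back = 0; $back < 4; $back++) {
            if ((ord($text[$i + $max - $back]) & 0xC0) !== 0x80) {
                $size = $max - $back;
                break;
            }
        }
        if ($size === null) {
            throw new RuntimeException('input does not look like UTF-8');
        }
        $chunks[] = substr($text, $i, $size);
        $i += $size;
    }
    $chunks[] = substr($text, $i); // remainder fits in one chunk
    return $chunks;
}
```

Each chunk could then be converted separately and the results concatenated, which is what the loop in the source listing does with unsafeIconv().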

Definition at line 48 of file Encoder.php.

References $c, $code, $i, $in, $out, $r, and $text.

Referenced by convertFromUTF8().

49     {
50         $code = self::testIconvTruncateBug();
51         if ($code == self::ICONV_OK) {
52             return self::unsafeIconv($in, $out, $text);
53         } elseif ($code == self::ICONV_TRUNCATES) {
54             // we can only work around this if the input character set
55             // is utf-8
56             if ($in == 'utf-8') {
57                 if ($max_chunk_size < 4) {
58                     trigger_error('max_chunk_size is too small', E_USER_WARNING);
59                     return false;
60                 }
61                 // split into 8000 byte chunks, but be careful to handle
62                 // multibyte boundaries properly
63                 if (($c = strlen($text)) <= $max_chunk_size) {
64                     return self::unsafeIconv($in, $out, $text);
65                 }
66                 $r = '';
67                 $i = 0;
68                 while (true) {
69                     if ($i + $max_chunk_size >= $c) {
70                         $r .= self::unsafeIconv($in, $out, substr($text, $i));
71                         break;
72                     }
73                     // wibble the boundary
74                     if (0x80 != (0xC0 & ord($text[$i + $max_chunk_size]))) {
75                         $chunk_size = $max_chunk_size;
76                     } elseif (0x80 != (0xC0 & ord($text[$i + $max_chunk_size - 1]))) {
77                         $chunk_size = $max_chunk_size - 1;
78                     } elseif (0x80 != (0xC0 & ord($text[$i + $max_chunk_size - 2]))) {
79                         $chunk_size = $max_chunk_size - 2;
80                     } elseif (0x80 != (0xC0 & ord($text[$i + $max_chunk_size - 3]))) {
81                         $chunk_size = $max_chunk_size - 3;
82                     } else {
83                         return false; // rather confusing UTF-8...
84                     }
85                     $chunk = substr($text, $i, $chunk_size); // substr doesn't mind overlong lengths
86                     $r .= self::unsafeIconv($in, $out, $chunk);
87                     $i += $chunk_size;
88                 }
89                 return $r;
90             } else {
91                 return false;
92             }
93         } else {
94             return false;
95         }
96     }

◆ iconvAvailable()

static HTMLPurifier_Encoder::iconvAvailable ( )
static
Returns
bool

Definition at line 362 of file Encoder.php.

363     {
364         static $iconv = null;
365         if ($iconv === null) {
366             $iconv = function_exists('iconv') && self::testIconvTruncateBug() != self::ICONV_UNUSABLE;
367         }
368         return $iconv;
369     }

◆ muteErrorHandler()

static HTMLPurifier_Encoder::muteErrorHandler ( )
static

Error-handler that mutes errors, alternative to shut-up operator.
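
The pattern behind this handler is to install a do-nothing error handler around a single call and restore the previous handler afterwards, as unsafeIconv() does. A minimal standalone sketch of that pattern follows; the name sketch_muted is hypothetical, and it uses a closure and try/finally where the library uses a static method and a plain restore call.

```php
<?php
// Hypothetical helper: run $fn with PHP warnings and notices swallowed,
// then restore whatever error handler was active before.
function sketch_muted(callable $fn)
{
    // Returning true tells PHP the error was handled, so nothing is printed.
    set_error_handler(function () {
        return true;
    });
    try {
        return $fn();
    } finally {
        restore_error_handler();
    }
}
```

Unlike the shut-up operator, this keeps the suppression explicit and scoped to one call.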

Definition at line 21 of file Encoder.php.

22     {
23     }

◆ testEncodingSupportsASCII()

static HTMLPurifier_Encoder::testEncodingSupportsASCII (   $encoding,
  $bypass = false 
)
static

This expensive function tests whether or not a given character encoding supports ASCII.

7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail.

Parameters
string $encoding Encoding name to test, as per iconv format
bool $bypass Whether or not to bypass the precompiled arrays.
Returns
Array mapping UTF-8 characters to their corresponding ASCII characters, which can be used to "undo" any overzealous iconv action; false if iconv cannot convert to the given encoding.
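
The returned map is meant to be passed to strtr(). The sketch below uses the precompiled map from the 'shift_jis' case of this function's source; the sample string is a made-up example of what a Shift_JIS-to-UTF-8 conversion might yield, not real library output.

```php
<?php
// Map as returned for 'shift_jis': UTF-8 character => intended ASCII character.
$asciiFix = array("\xC2\xA5" => '\\', "\xE2\x80\xBE" => '~');

// Hypothetical conversion result in which the yen sign (U+00A5) and the
// overline (U+203E) stand where a backslash and a tilde were meant.
$converted = "C:\xC2\xA5temp \xE2\x80\xBE user";

// Direction used by convertToUTF8(): restore the ASCII characters.
$restored = strtr($converted, $asciiFix);

// Direction used by convertFromUTF8(): flip the map before converting back.
$roundTrip = strtr($restored, array_flip($asciiFix));
```

Here $restored holds the text with a real backslash and tilde, and $roundTrip equals $converted again.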

Definition at line 571 of file Encoder.php.

References $c, $i, $r, and $ret.

572     {
573         // All calls to iconv here are unsafe, proof by case analysis:
574         // If ICONV_OK, no difference.
575         // If ICONV_TRUNCATE, all calls involve one character inputs,
576         // so bug is not triggered.
577         // If ICONV_UNUSABLE, this call is irrelevant
578         static $encodings = array();
579         if (!$bypass) {
580             if (isset($encodings[$encoding])) {
581                 return $encodings[$encoding];
582             }
583             $lenc = strtolower($encoding);
584             switch ($lenc) {
585                 case 'shift_jis':
586                     return array("\xC2\xA5" => '\\', "\xE2\x80\xBE" => '~');
587                 case 'johab':
588                     return array("\xE2\x82\xA9" => '\\');
589             }
590             if (strpos($lenc, 'iso-8859-') === 0) {
591                 return array();
592             }
593         }
594         $ret = array();
595         if (self::unsafeIconv('UTF-8', $encoding, 'a') === false) {
596             return false;
597         }
598         for ($i = 0x20; $i <= 0x7E; $i++) { // all printable ASCII chars
599             $c = chr($i); // UTF-8 char
600             $r = self::unsafeIconv('UTF-8', "$encoding//IGNORE", $c); // initial conversion
601             if ($r === '' ||
602                 // This line is needed for iconv implementations that do not
603                 // omit characters that do not exist in the target character set
604                 ($r === $c && self::unsafeIconv($encoding, 'UTF-8//IGNORE', $r) !== $c)
605             ) {
606                 // Reverse engineer: what's the UTF-8 equiv of this byte
607                 // sequence? This assumes that there's no variable width
608                 // encoding that doesn't support ASCII.
609                 $ret[self::unsafeIconv($encoding, 'UTF-8//IGNORE', $c)] = $c;
610             }
611         }
612         $encodings[$encoding] = $ret;
613         return $ret;
614     }

◆ testIconvTruncateBug()

static HTMLPurifier_Encoder::testIconvTruncateBug ( )
static

glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly.

In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable.

Returns
int Error code indicating severity of bug.
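
The probe converts a two-byte Greek alpha followed by 9000 'a' characters to ASCII with //IGNORE and classifies iconv by the length of the result. The sketch below isolates that classification step so it can be read without iconv installed. The constant and function names are hypothetical stand-ins for the class constants, and the sketch throws where the library triggers E_USER_ERROR.

```php
<?php
// Hypothetical stand-ins for HTMLPurifier_Encoder::ICONV_* constants.
const SKETCH_ICONV_OK = 0;
const SKETCH_ICONV_TRUNCATES = 1;
const SKETCH_ICONV_UNUSABLE = 2;

// Classify the probe result. $result is what the conversion returned
// (string or false); $expected is the number of bytes that should survive
// once the unencodable character has been ignored.
function sketch_classify_probe($result, int $expected = 9000): int
{
    if ($result === false) {
        return SKETCH_ICONV_UNUSABLE;   // //IGNORE not honoured at all
    }
    $len = strlen($result);
    if ($len < $expected) {
        return SKETCH_ICONV_TRUNCATES;  // output cut short after the bad character
    }
    if ($len > $expected) {
        throw new UnexpectedValueException('iconv returned more bytes than expected');
    }
    return SKETCH_ICONV_OK;
}
```

A healthy iconv drops only the alpha and returns exactly 9000 bytes, which maps to the OK code.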

Definition at line 537 of file Encoder.php.

References $c, $code, and $r.

Referenced by convertToUTF8(), iconv(), and iconvAvailable().

538     {
539         static $code = null;
540         if ($code === null) {
541             // better not use iconv, otherwise infinite loop!
542             $r = self::unsafeIconv('utf-8', 'ascii//IGNORE', "\xCE\xB1" . str_repeat('a', 9000));
543             if ($r === false) {
544                 $code = self::ICONV_UNUSABLE;
545             } elseif (($c = strlen($r)) < 9000) {
546                 $code = self::ICONV_TRUNCATES;
547             } elseif ($c > 9000) {
548                 trigger_error(
549                     'Your copy of iconv is extremely buggy. Please notify HTML Purifier maintainers: ' .
550                     'include your iconv version as per phpversion()',
551                     E_USER_ERROR
552                 );
553             } else {
554                 $code = self::ICONV_OK;
555             }
556         }
557         return $code;
558     }

◆ unichr()

static HTMLPurifier_Encoder::unichr (   $code)
static

Translates a Unicode codepoint into its corresponding UTF-8 character.

Note
Based on Feyd's function at http://forums.devnetwork.net/viewtopic.php?p=191404#191404, which is in public domain.
While we're going to do code point parsing anyway, a good optimization would be to refuse to translate code points that are non-SGML characters. However, this could lead to duplication.
This is very similar to the unichr function in maintenance/generate-entity-file.php (although this is superior, due to its sanity checks).
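
For readers who want the encoding rule in compact form, here is a standalone sketch of code point to UTF-8 translation with the same range checks as the source (nothing above U+10FFFF, nothing negative, no surrogates). The name sketch_unichr is hypothetical; the bit layout is standard UTF-8 rather than a copy of the library's variable-by-variable approach.

```php
<?php
// Hypothetical helper: encode a Unicode code point as a UTF-8 string, or
// return '' for values that are not valid scalar values.
function sketch_unichr(int $code): string
{
    if ($code < 0 || $code > 0x10FFFF || ($code >= 0xD800 && $code <= 0xDFFF)) {
        return '';
    }
    if ($code < 0x80) {        // 1 byte: 0xxxxxxx
        return chr($code);
    }
    if ($code < 0x800) {       // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($code >> 6))
            . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($code >> 12))
            . chr(0x80 | (($code >> 6) & 0x3F))
            . chr(0x80 | ($code & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($code >> 18))
        . chr(0x80 | (($code >> 12) & 0x3F))
        . chr(0x80 | (($code >> 6) & 0x3F))
        . chr(0x80 | ($code & 0x3F));
}
```

For instance, sketch_unichr(0x20AC) produces the three bytes E2 82 AC, the UTF-8 form of the euro sign.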

Definition at line 315 of file Encoder.php.

References $code, $ret, $w, $x, and $y.

Referenced by HTMLPurifier_EntityParser\entityCallback(), HTMLPurifier_AttrDef\expandCSSEscape(), and HTMLPurifier_EntityParser\nonSpecialEntityCallback().

316     {
317         if ($code > 1114111 or $code < 0 or
318           ($code >= 55296 and $code <= 57343) ) {
319             // bits are set outside the "valid" range as defined
320             // by UNICODE 4.1.0
321             return '';
322         }
323
324         $x = $y = $z = $w = 0;
325         if ($code < 128) {
326             // regular ASCII character
327             $x = $code;
328         } else {
329             // set up bits for UTF-8
330             $x = ($code & 63) | 128;
331             if ($code < 2048) {
332                 $y = (($code & 2047) >> 6) | 192;
333             } else {
334                 $y = (($code & 4032) >> 6) | 128;
335                 if ($code < 65536) {
336                     $z = (($code >> 12) & 15) | 224;
337                 } else {
338                     $z = (($code >> 12) & 63) | 128;
339                     $w = (($code >> 18) & 7) | 240;
340                 }
341             }
342         }
343         // set up the actual character
344         $ret = '';
345         if ($w) {
346             $ret .= chr($w);
347         }
348         if ($z) {
349             $ret .= chr($z);
350         }
351         if ($y) {
352             $ret .= chr($y);
353         }
354         $ret .= chr($x);
355
356         return $ret;
357     }

◆ unsafeIconv()

static HTMLPurifier_Encoder::unsafeIconv (   $in,
  $out,
  $text 
)
static

iconv wrapper which mutes errors, but doesn't work around bugs.

Parameters
string $in Input encoding
string $out Output encoding
string $text The text to convert
Returns
string

Definition at line 32 of file Encoder.php.

References $in, $out, $r, $text, and iconv().

33     {
34         set_error_handler(array('HTMLPurifier_Encoder', 'muteErrorHandler'));
35         $r = iconv($in, $out, $text);
36         restore_error_handler();
37         return $r;
38     }

Field Documentation

◆ ICONV_OK

const HTMLPurifier_Encoder::ICONV_OK = 0

No bugs detected in iconv.

Definition at line 513 of file Encoder.php.

◆ ICONV_TRUNCATES

const HTMLPurifier_Encoder::ICONV_TRUNCATES = 1

Iconv truncates output if converting from UTF-8 to another character set with //IGNORE, and a non-encodable character is found.

Definition at line 517 of file Encoder.php.

◆ ICONV_UNUSABLE

const HTMLPurifier_Encoder::ICONV_UNUSABLE = 2

Iconv does not support //IGNORE, making it unusable for transcoding purposes.

Definition at line 521 of file Encoder.php.


The documentation for this class was generated from the following file: