ILIAS  release_8 Revision v8.19
Sanitizer Class Reference

Static Public Member Functions

static removeHTMLtags ($text, $processCallback=null, $args=array())
 Cleans up HTML, removes dangerous tags and attributes, and removes HTML comments.
 
static removeHTMLcomments ($text)
 Remove '<!--', '-->', and everything between.
 
static validateTagAttributes ($attribs, $element)
 Take an array of attribute names and values and normalize or discard illegal values for the given element type.
 
static checkCss ($value)
 Pick apart some CSS and check it for forbidden or unsafe structures.
 
static fixTagAttributes ($text, $element)
 Take a tag soup fragment listing an HTML element's attributes and normalize it to well-formed XML, discarding unwanted attributes.
 
static encodeAttribute ($text)
 Encode an attribute value for HTML output.
 
static safeEncodeAttribute ($text)
 Encode an attribute value for HTML tags, with extra armoring against further wiki processing.
 
static escapeId ($id)
 Given a value, escape it so that it can be used in an id attribute and return it. This does not validate the value, however (see the first link in the detailed description).
 
static escapeClass ($class)
 Given a value, escape it so that it can be used as a CSS class and return it.
 
static decodeTagAttributes ($text)
 Return an associative array of attribute names and values from a partial tag string.
 
static normalizeCharReferences ($text)
 Ensure that any entities and character references are legal for XML and XHTML specifically.
 
static normalizeCharReferencesCallback ($matches)
 
static normalizeEntity ($name)
 If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD, return the named entity reference as is.
 
static decCharReference ($codepoint)
 
static hexCharReference ($codepoint)
 
static decodeCharReferences ($text)
 Decode any character references, numeric or named entities, in the text and return a UTF-8 string.
 
static decodeCharReferencesCallback ($matches)
 
static decodeChar ($codepoint)
 Return UTF-8 string for a codepoint if that is a valid character reference, otherwise U+FFFD REPLACEMENT CHARACTER.
 
static decodeEntity ($name)
 If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD, return the UTF-8 encoding of that character.
 
static attributeWhitelist ($element)
 Fetch the whitelist of acceptable attributes for a given element name.
 
static setupAttributeWhitelist ()
 
static stripAllTags ($text)
 Take a fragment of (potentially invalid) HTML and return a version with any tags removed, encoded as plain text.
 
static hackDocType ()
 Hack up a private DOCTYPE with HTML's standard entity declarations.
 
static cleanUrl ($url, $hostname=true)
 

Static Private Member Functions

static armorLinksCallback ($matches)
 Regex replace callback for armoring links against further processing.
 
static getTagAttributeCallback ($set)
 Pick the appropriate attribute value from a match set from the MW_ATTRIBS_REGEX matches.
 
static normalizeAttributeValue ($text)
 Normalize whitespace and character references in an XML source-encoded text for an attribute value.
 
static normalizeWhitespace ($text)
 
static validateCodepoint ($codepoint)
 Returns true if a given Unicode codepoint is a valid character in XML.
 

Detailed Description

Definition at line 355 of file Sanitizer.php.

Member Function Documentation

◆ armorLinksCallback()

static Sanitizer::armorLinksCallback (   $matches)
static private

Regex replace callback for armoring links against further processing.

Parameters
array $matches
Returns
string

Definition at line 822 of file Sanitizer.php.

827  {

◆ attributeWhitelist()

static Sanitizer::attributeWhitelist (   $element)
static

Fetch the whitelist of acceptable attributes for a given element name.

Parameters
string $element
Returns
array

Definition at line 1114 of file Sanitizer.php.

Referenced by hexCharReference().

1119  {
1120  static $list;
1121  if (!isset($list)) {
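
The listing above is cut off, but it suggests a lookup table that is built once and cached in a function-level static variable. A minimal sketch of that pattern follows. The builder `sketchSetupWhitelist()` and its two-entry table are hypothetical stand-ins for `setupAttributeWhitelist()`, and returning an empty array for unknown elements is an assumption.

```php
<?php
// Hypothetical stand-in for Sanitizer::setupAttributeWhitelist();
// the real table covers many more elements.
function sketchSetupWhitelist(): array
{
    $common = array('id', 'class', 'lang', 'dir', 'title', 'style');
    return array(
        'div' => array_merge($common, array('align')),
        'em'  => $common,
    );
}

function sketchAttributeWhitelist(string $element): array
{
    // Built on the first call only, then reused by later calls.
    static $list;
    if (!isset($list)) {
        $list = sketchSetupWhitelist();
    }
    // Assumption: elements without an entry get no allowed attributes.
    return $list[$element] ?? array();
}
```

Caching in a static variable avoids rebuilding the table for every tag that is sanitized.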

◆ checkCss()

static Sanitizer::checkCss (   $value)
static

Pick apart some CSS and check it for forbidden or unsafe structures.

Returns a sanitized string, or false if it was just too evil.

Currently forbidden: URL references (url(...)), 'expression', and protocol prefixes matched by the pattern tps*:// (for example http://, https://, ftp://).

Parameters
string $value
Returns
mixed

Definition at line 644 of file Sanitizer.php.

649  {
650  $stripped = Sanitizer::decodeCharReferences($value);
651 
652  // Remove any comments; IE gets token splitting wrong
653  $stripped = StringUtils::delimiterReplace('/*', '*/', ' ', $stripped);
654 
655  $value = $stripped;
656 
657  // ... and continue checks
658  $stripped = preg_replace_callback(
659  '!\\\\([0-9A-Fa-f]{1,6})[ \\n\\r\\t\\f]?!',
660  function ($hit) {
661  return codepointToUtf8(hexdec($hit[1]));
662  },
663  $stripped
664  );
665  $stripped = str_replace('\\', '', $stripped);
666  if (preg_match(
667  '/(?:expression|tps*:\/\/|url\\s*\().*/is',
668  $stripped
669  )) {
670  # haxx0r
671  return false;
codepointToUtf8($codepoint)
Definition: Sanitizer.php:327
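
The check can be tried in isolation with the sketch below. It is not the ILIAS implementation: the comment stripping uses a plain regular expression instead of `StringUtils::delimiterReplace()`, the initial `decodeCharReferences()` step is left out, and `sketchCssUtf8()` is a hypothetical stand-in for `codepointToUtf8()`. Returning the comment-stripped value on success is an assumption, since the listing is truncated.

```php
<?php
// Hypothetical stand-in for codepointToUtf8().
function sketchCssUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

function sketchCheckCss(string $value)
{
    // Replace CSS comments with a space (simplified comment stripping).
    $stripped = preg_replace('!/\*.*?\*/!s', ' ', $value);
    $value = $stripped;

    // Decode CSS hex escapes so that obfuscated keywords become visible.
    $stripped = preg_replace_callback(
        '!\\\\([0-9A-Fa-f]{1,6})[ \n\r\t\f]?!',
        function (array $hit): string {
            return sketchCssUtf8((int) hexdec($hit[1]));
        },
        $stripped
    );
    // Drop any remaining backslashes before matching.
    $stripped = str_replace('\\', '', $stripped);
    if (preg_match('/(?:expression|tps*:\/\/|url\s*\().*/is', $stripped)) {
        return false;
    }
    return $value;
}
```

With this sketch, `sketchCheckCss('color: red')` returns the input unchanged, while `sketchCheckCss('background: u\72l(x)')` returns `false` because the escape `\72` decodes to `r` and reveals `url(`.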

◆ cleanUrl()

static Sanitizer::cleanUrl (   $url,
  $hostname = true 
)
static

NOTE: The original preg_replace/e implicitly adds a backslash before double quotes. This could be a bug, but the behaviour is mimicked 1:1 for now.

Definition at line 1310 of file Sanitizer.php.

References $rest, $url, and decodeCharReferences().

1315  {
1316  # Normalize any HTML entities in input. They will be
1317  # re-escaped by makeExternalLink().
1318 
1319  $url = Sanitizer::decodeCharReferences($url);
1320 
1321  # Escape any control characters introduced by the above step
1322  $url = preg_replace_callback(
1323  '/[\][<>"\\x00-\\x20\\x7F]/',
1324  function ($hit) {
1325  if ($hit[0] === '"') {
1331  return urlencode('\\"');
1332  } else {
1333  return urlencode($hit[0]);
1334  }
1335  },
1336  $url
1337  );
1338 
1339  # Validate hostname portion
1340  $matches = array();
1341  if (preg_match('!^([^:]+:)(//[^/]+)?(.*)$!iD', $url, $matches)) {
1342  list( /* $whole */, $protocol, $host, $rest) = $matches;
1343 
1344  // Characters that will be ignored in IDNs.
1345  // http://tools.ietf.org/html/3454#section-3.1
1346  // Strip them before further processing so blacklists and such work.
1347  $strip = "/
1348  \\s| # general whitespace
1349  \xc2\xad| # 00ad SOFT HYPHEN
1350  \xe1\xa0\x86| # 1806 MONGOLIAN TODO SOFT HYPHEN
1351  \xe2\x80\x8b| # 200b ZERO WIDTH SPACE
1352  \xe2\x81\xa0| # 2060 WORD JOINER
1353  \xef\xbb\xbf| # feff ZERO WIDTH NO-BREAK SPACE
1354  \xcd\x8f| # 034f COMBINING GRAPHEME JOINER
1355  \xe1\xa0\x8b| # 180b MONGOLIAN FREE VARIATION SELECTOR ONE
1356  \xe1\xa0\x8c| # 180c MONGOLIAN FREE VARIATION SELECTOR TWO
1357  \xe1\xa0\x8d| # 180d MONGOLIAN FREE VARIATION SELECTOR THREE
1358  \xe2\x80\x8c| # 200c ZERO WIDTH NON-JOINER
1359  \xe2\x80\x8d| # 200d ZERO WIDTH JOINER
1360  [\xef\xb8\x80-\xef\xb8\x8f] # fe00-fe0f VARIATION SELECTOR-1-16
1361  /xuD";
1362 
1363  $host = preg_replace($strip, '', $host);
1364 
1365  // @fixme: validate hostnames here
1366 
1367  return $protocol . $host . $rest;
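
The hostname handling can be illustrated on its own. The sketch below reuses the splitting pattern from the listing but only a shortened list of ignorable characters, and it assumes the input is valid UTF-8. The name `sketchCleanHost()` is hypothetical, and the control-character escaping step is omitted.

```php
<?php
function sketchCleanHost(string $url): string
{
    $matches = array();
    if (!preg_match('!^([^:]+:)(//[^/]+)?(.*)$!iD', $url, $matches)) {
        // No protocol part found: return the input unchanged.
        return $url;
    }
    list(, $protocol, $host, $rest) = $matches;

    // Shortened list: whitespace, SOFT HYPHEN, ZERO WIDTH SPACE,
    // ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER, WORD JOINER.
    $strip = "/\\s|\xc2\xad|\xe2\x80\x8b|\xe2\x80\x8c|\xe2\x80\x8d|\xe2\x81\xa0/u";
    $host = preg_replace($strip, '', $host);

    return $protocol . $host . $rest;
}
```

Stripping these characters before any further checks means that a host such as `exam` + SOFT HYPHEN + `ple.org` is compared in the same form a browser would resolve.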

◆ decCharReference()

static Sanitizer::decCharReference (   $codepoint)
static

Definition at line 997 of file Sanitizer.php.

References validateCodepoint().

1002  {
1003  $point = intval($codepoint);
1004  if (Sanitizer::validateCodepoint($point)) {
1005  return sprintf('&#%d;', $point);
static validateCodepoint($codepoint)
Returns true if a given Unicode codepoint is a valid character in XML.
Definition: Sanitizer.php:1022

◆ decodeChar()

static Sanitizer::decodeChar (   $codepoint)
static

Return UTF-8 string for a codepoint if that is a valid character reference, otherwise U+FFFD REPLACEMENT CHARACTER.

Parameters
int $codepoint
Returns
string

Definition at line 1076 of file Sanitizer.php.

Referenced by hexCharReference().

1081  {
1082  if (Sanitizer::validateCodepoint($codepoint)) {
1083  return codepointToUtf8($codepoint);

◆ decodeCharReferences()

static Sanitizer::decodeCharReferences (   $text)
static

Decode any character references, numeric or named entities, in the text and return a UTF-8 string.

Parameters
string $text
Returns
string

Definition at line 1041 of file Sanitizer.php.

Referenced by cleanUrl(), Title\escapeFragmentForURL(), hexCharReference(), and Title\newFromText().

1046  {
1047  return preg_replace_callback(
const MW_CHAR_REFS_REGEX
Regular expression to match various types of character references in Sanitizer::normalizeCharReferenc...
Definition: Sanitizer.php:30

◆ decodeCharReferencesCallback()

static Sanitizer::decodeCharReferencesCallback (   $matches)
static
Parameters
array $matches
Returns
string

Definition at line 1054 of file Sanitizer.php.

Referenced by hexCharReference().

1059  {
1060  if ($matches[1] != '') {
1061  return Sanitizer::decodeEntity($matches[1]);
1062  } elseif ($matches[2] != '') {
1063  return Sanitizer::decodeChar(intval($matches[2]));
1064  } elseif ($matches[3] != '') {
1065  return Sanitizer::decodeChar(hexdec($matches[3]));
1066  } elseif ($matches[4] != '') {
1067  return Sanitizer::decodeChar(hexdec($matches[4]));
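
The dispatch above can be sketched end to end as follows. The pattern is an assumed simplification of `MW_CHAR_REFS_REGEX`, the entity table is a hypothetical four-entry excerpt, and the validity check follows the Char production of XML 1.0 as a stand-in for `validateCodepoint()`. All function names are illustrative.

```php
<?php
// Hypothetical, heavily shortened entity table (name => codepoint).
function sketchEntities(): array
{
    return array('amp' => 38, 'lt' => 60, 'gt' => 62, 'eacute' => 233);
}

// Stand-in for codepointToUtf8().
function sketchUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

function sketchDecodeChar(int $cp): string
{
    // Char production of XML 1.0.
    $valid = $cp === 0x9 || $cp === 0xA || $cp === 0xD
        || ($cp >= 0x20 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
    return $valid ? sketchUtf8($cp) : "\xEF\xBF\xBD"; // U+FFFD otherwise
}

function sketchDecodeCharReferences(string $text): string
{
    // Groups: 1 = named, 2 = decimal, 3 = hex after &#x, 4 = hex after &#X.
    $regex = '/&([A-Za-z0-9]+);|&#([0-9]+);|&#x([0-9A-Fa-f]+);|&#X([0-9A-Fa-f]+);/';
    return preg_replace_callback($regex, function (array $m): string {
        $entities = sketchEntities();
        if ($m[1] !== '') {
            // Unknown names are returned as written.
            return array_key_exists($m[1], $entities)
                ? sketchUtf8($entities[$m[1]])
                : $m[0];
        }
        if ($m[2] !== '') {
            return sketchDecodeChar((int) $m[2]);
        }
        return sketchDecodeChar((int) hexdec($m[3] !== '' ? $m[3] : $m[4]));
    }, $text);
}
```

The callback tests the groups in ascending order because unmatched trailing groups are absent from the match array, so a later index is only read once the earlier ones are known to be empty.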

◆ decodeEntity()

static Sanitizer::decodeEntity (   $name)
static

If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD, return the UTF-8 encoding of that character.

Otherwise, returns the pseudo-entity source (e.g. &foo;).

Parameters
string $name
Returns
string

Definition at line 1093 of file Sanitizer.php.

Referenced by hexCharReference().

1098  {
1099  global $wgHtmlEntities, $wgHtmlEntityAliases;
1100 
1101  if (isset($wgHtmlEntityAliases[$name])) {
1102  $name = $wgHtmlEntityAliases[$name];
1103  }
1104  if (isset($wgHtmlEntities[$name])) {
1105  return codepointToUtf8($wgHtmlEntities[$name]);
global $wgHtmlEntities
List of all named character entities defined in HTML 4.01 http://www.w3.org/TR/html4/sgml/entities.html.
Definition: Sanitizer.php:63
global $wgHtmlEntityAliases
Character entity aliases accepted by MediaWiki.
Definition: Sanitizer.php:321

◆ decodeTagAttributes()

static Sanitizer::decodeTagAttributes (   $text)
static

Return an associative array of attribute names and values from a partial tag string.

Attribute names are forced to lowercase, character references are decoded to UTF-8 text.

Parameters
string $text
Returns
array

Definition at line 835 of file Sanitizer.php.

840  {
841  $attribs = array();
842 
843  if (trim($text) == '') {
844  return $attribs;
845  }
846 
847  $pairs = array();
848  if (!preg_match_all(
849  MW_ATTRIBS_REGEX,
850  $text,
851  $pairs,
852  PREG_SET_ORDER
853  )) {
854  return $attribs;
855  }
856 
857  foreach ($pairs as $set) {
858  $attribute = strtolower($set[1]);
859  $value = Sanitizer::getTagAttributeCallback($set);
860 
861  // Normalize whitespace
862  $value = preg_replace('/[\t\r\n ]+/', ' ', $value);
863  $value = trim($value);
864 
865  // Decode character references
const MW_ATTRIBS_REGEX
Definition: Sanitizer.php:44
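
A self-contained sketch of the parsing steps is given below. `MW_ATTRIBS_REGEX` is not shown on this page, so the pattern here is an assumed simplification with only three value groups, and PHP's built-in `html_entity_decode()` stands in for `decodeCharReferences()`. The value selection mirrors the highest-index-first test used by `getTagAttributeCallback()`.

```php
<?php
function sketchDecodeTagAttributes(string $text): array
{
    $attribs = array();
    if (trim($text) === '') {
        return $attribs;
    }
    // Assumed pattern: name, then an optional double-quoted (2),
    // single-quoted (3) or unquoted (4) value.
    $regex = '/([A-Za-z0-9_:-]+)(?:\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+)))?/';
    $pairs = array();
    if (!preg_match_all($regex, $text, $pairs, PREG_SET_ORDER)) {
        return $attribs;
    }
    foreach ($pairs as $set) {
        $attribute = strtolower($set[1]);
        if (isset($set[4])) {
            $value = $set[4];   // unquoted
        } elseif (isset($set[3])) {
            $value = $set[3];   // single-quoted
        } elseif (isset($set[2])) {
            $value = $set[2];   // double-quoted
        } else {
            $value = $set[1];   // no value: reuse the attribute name
        }
        // Normalize whitespace, then decode character references.
        $value = trim(preg_replace('/[\t\r\n ]+/', ' ', $value));
        $attribs[$attribute] = html_entity_decode($value, ENT_QUOTES, 'UTF-8');
    }
    return $attribs;
}
```

For the input `class="a  b" ID='x' checked` this sketch yields `class => "a b"`, `id => "x"` and `checked => "checked"`.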

◆ encodeAttribute()

static Sanitizer::encodeAttribute (   $text)
static

Encode an attribute value for HTML output.

Parameters
$text
Returns
HTML-encoded text fragment

Definition at line 718 of file Sanitizer.php.

723  {
724  $encValue = htmlspecialchars($text);
725 
726  // Whitespace is normalized during attribute decoding,
727  // so if we've been passed non-spaces we must encode them
728  // ahead of time or they won't be preserved.
729  $encValue = strtr($encValue, array(
730  "\n" => '&#10;',
731  "\r" => '&#13;',
732  "\t" => '&#9;',
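
A runnable sketch of the encoding step, for experimentation. The function name is hypothetical, and the `ENT_QUOTES` flag and `UTF-8` charset are passed explicitly as an assumption so that the result does not depend on the defaults of the running PHP version.

```php
<?php
function sketchEncodeAttribute(string $text): string
{
    $encValue = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Encode whitespace that attribute decoding would otherwise collapse.
    return strtr($encValue, array(
        "\n" => '&#10;',
        "\r" => '&#13;',
        "\t" => '&#9;',
    ));
}
```

Encoding the control characters as numeric references keeps a value such as a multi-line title intact after a later decode and whitespace-normalization pass.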

◆ escapeClass()

static Sanitizer::escapeClass (   $class)
static

Given a value, escape it so that it can be used as a CSS class and return it.

Todo:
For extra validity, input should be validated UTF-8.
See also
http://www.w3.org/TR/CSS21/syndata.html Valid characters/format
Parameters
string $class
Returns
string

Definition at line 806 of file Sanitizer.php.

811  {
812  // Convert ugly stuff to underscores and kill underscores in ugly places
813  return rtrim(preg_replace(
814  array('/(^[0-9\\-])|[\\x00-\\x20!"#$%&\'()*+,.\\/:;<=>?@[\\]^`{|}~]|\\xC2\\xA0/','/_+/'),
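
The listing stops after the pattern array. The sketch below completes the call under the assumption, suggested by the comment in the listing, that both patterns are replaced by an underscore and that trailing underscores are trimmed. The function name is hypothetical.

```php
<?php
function sketchEscapeClass(string $class): string
{
    // First pattern: a leading digit or hyphen, control and punctuation
    // characters, and the UTF-8 bytes of a no-break space.
    // Second pattern: runs of underscores, collapsed into one.
    return rtrim(preg_replace(
        array('/(^[0-9\\-])|[\\x00-\\x20!"#$%&\'()*+,.\\/:;<=>?@[\\]^`{|}~]|\\xC2\\xA0/', '/_+/'),
        '_',
        $class
    ), '_');
}
```

Under these assumptions `my class!` becomes `my_class` and `1abc` becomes `_abc`.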

◆ escapeId()

static Sanitizer::escapeId (   $id)
static

Given a value, escape it so that it can be used in an id attribute and return it. This does not validate the value, however (see the first link under "See also").

See also
http://www.w3.org/TR/html401/types.html#type-name Valid characters in the id and name attributes
http://www.w3.org/TR/html401/struct/links.html#h-12.2.3 Anchors with the id attribute
Parameters
string $id
Returns
string

Definition at line 783 of file Sanitizer.php.

788  {
789  static $replace = array(
790  '%3A' => ':',
791  '%' => '.'
792  );
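
Only the replacement table survives in the listing. The sketch below shows one way the table could be applied, assuming that spaces are first turned into underscores and the value is URL-encoded; the `decodeCharReferences()` step that the real method may perform is left out. The function name is hypothetical.

```php
<?php
function sketchEscapeId(string $id): string
{
    static $replace = array(
        '%3A' => ':',
        '%'   => '.',
    );

    // Assumption: underscore for spaces, then percent-encoding.
    $id = urlencode(strtr($id, ' ', '_'));

    // Restore colons and turn the remaining percent signs into dots.
    return str_replace(array_keys($replace), array_values($replace), $id);
}
```

With these assumptions `Foo bar: baz` becomes `Foo_bar:_baz`, and a non-ASCII byte sequence such as UTF-8 `é` ends up as `.C3.A9`.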
793 

◆ fixTagAttributes()

static Sanitizer::fixTagAttributes (   $text,
  $element 
)
static

Take a tag soup fragment listing an HTML element's attributes and normalize it to well-formed XML, discarding unwanted attributes.

Output is safe for further wikitext processing, with escaping of values that could trigger problems.

  • Normalizes attribute names to lowercase
  • Discards attributes not on a whitelist for the given element
  • Turns broken or invalid entities into plaintext
  • Double-quotes all attribute values
  • Attributes without values are given the attribute name as their value
  • Duplicate attributes are discarded
  • Unsafe style attributes are discarded
  • Prepends space if there are attributes.
Parameters
string $text
string $element
Returns
string

Definition at line 692 of file Sanitizer.php.

697  {
698  if (trim($text) == '') {
699  return '';
700  }
701 
702  $stripped = Sanitizer::validateTagAttributes(
703  Sanitizer::decodeTagAttributes($text),
704  $element
705  );
706 
707  $attribs = array();
708  foreach ($stripped as $attribute => $value) {
709  $encAttribute = htmlspecialchars($attribute);
710  $encValue = Sanitizer::safeEncodeAttribute($value);
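
The filtering and re-serialization described above can be sketched as follows. This version starts from an already decoded attribute array and a caller-supplied list of allowed names, both hypothetical simplifications, and it keeps the first occurrence of a repeated attribute, which is an assumption.

```php
<?php
function sketchFixTagAttributes(array $attribs, array $allowed): string
{
    $out = array();
    foreach ($attribs as $name => $value) {
        $name = strtolower((string) $name);
        // Discard attributes that are not allowed, and repeats.
        if (!in_array($name, $allowed, true) || isset($out[$name])) {
            continue;
        }
        // Always emit name="value" with a double-quoted, encoded value.
        $out[$name] = htmlspecialchars($name, ENT_QUOTES, 'UTF-8')
            . '="' . htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8') . '"';
    }
    // Prepend a space only if something is left to emit.
    return count($out) ? ' ' . implode(' ', $out) : '';
}
```

The leading space lets the caller append the result directly after the element name when rebuilding the tag.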
711 
static validateTagAttributes($attribs, $element)
Take an array of attribute names and values and normalize or discard illegal values for the given ele...
Definition: Sanitizer.php:606

◆ getTagAttributeCallback()

static Sanitizer::getTagAttributeCallback (   $set)
static private

Pick the appropriate attribute value from a match set from the MW_ATTRIBS_REGEX matches.

Parameters
array $set
Returns
string

Definition at line 875 of file Sanitizer.php.

880  {
881  if (isset($set[6])) {
882  # Illegal #XXXXXX color with no quotes.
883  return $set[6];
884  } elseif (isset($set[5])) {
885  # No quotes.
886  return $set[5];
887  } elseif (isset($set[4])) {
888  # Single-quoted
889  return $set[4];
890  } elseif (isset($set[3])) {
891  # Double-quoted
892  return $set[3];
893  } elseif (!isset($set[2])) {
894  # In XHTML, attributes must have a value.
895  # For 'reduced' form, return explicitly the attribute name here.
896  return $set[1];

◆ hackDocType()

static Sanitizer::hackDocType ( )
static

Hack up a private DOCTYPE with HTML's standard entity declarations.

PHP 4 seemed to know these if you gave it an HTML doctype, but PHP 5.1 doesn't.

Use for passing XHTML fragments to PHP's XML parsing functions

Returns
string

Definition at line 1299 of file Sanitizer.php.

Referenced by hexCharReference().

1304  {
1305  global $wgHtmlEntities;
1306  $out = "<!DOCTYPE html [\n";
1307  foreach ($wgHtmlEntities as $entity => $codepoint) {
1308  $out .= "<!ENTITY $entity \"&#$codepoint;\">";
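
A sketch of the string this method builds, using a caller-supplied entity table instead of the global `$wgHtmlEntities`. The closing `]>` line is an assumption, because the listing ends inside the loop.

```php
<?php
function sketchHackDocType(array $entities): string
{
    $out = "<!DOCTYPE html [\n";
    foreach ($entities as $entity => $codepoint) {
        // One internal entity declaration per known name.
        $out .= "<!ENTITY $entity \"&#$codepoint;\">";
    }
    // Assumption: the internal subset is closed here.
    $out .= "]>\n";
    return $out;
}
```

Prepending such a DOCTYPE to an XHTML fragment gives an XML parser the declarations it needs for names like `nbsp`.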

◆ hexCharReference()

static Sanitizer::hexCharReference (   $codepoint)
static

Definition at line 1007 of file Sanitizer.php.

References $name, $out, $wgHtmlEntities, $wgHtmlEntityAliases, attributeWhitelist(), codepointToUtf8(), decodeChar(), decodeCharReferences(), decodeCharReferencesCallback(), decodeEntity(), hackDocType(), ILIAS\FileDelivery\http(), MW_CHAR_REFS_REGEX, setupAttributeWhitelist(), ILIAS\UI\examples\MainControls\SystemInfo\simple(), stripAllTags(), and validateCodepoint().

1012  {
1013  $point = hexdec($codepoint);
1014  if (Sanitizer::validateCodepoint($point)) {
1015  return sprintf('&#x%x;', $point);

◆ normalizeAttributeValue()

static Sanitizer::normalizeAttributeValue (   $text)
static private

Normalize whitespace and character references in an XML source-encoded text for an attribute value.

See http://www.w3.org/TR/REC-xml/#AVNormalize for background, but note that we're not returning the value, but are returning XML source fragments that will be slapped into output.

Parameters
string $text
Returns
string

Definition at line 910 of file Sanitizer.php.

915  {
916  return str_replace(
917  '"',
918  '&quot;',
919  self::normalizeWhitespace(

◆ normalizeCharReferences()

static Sanitizer::normalizeCharReferences (   $text)
static

Ensure that any entities and character references are legal for XML and XHTML specifically.

Any stray bits will be &-escaped to result in a valid text fragment.

  a. Any named char refs must be known in XHTML.
  b. Any numeric char refs must be legal chars, not invalid or forbidden.
  c. Use &#x, not &#X.
  d. Fix or reject non-valid attributes.

Parameters
string $text
Returns
string

Definition at line 944 of file Sanitizer.php.

949  {
950  return preg_replace_callback(

◆ normalizeCharReferencesCallback()

static Sanitizer::normalizeCharReferencesCallback (   $matches)
static
Parameters
array $matches
Returns
string

Definition at line 956 of file Sanitizer.php.

961  {
962  $ret = null;
963  if ($matches[1] != '') {
964  $ret = Sanitizer::normalizeEntity($matches[1]);
965  } elseif ($matches[2] != '') {
966  $ret = Sanitizer::decCharReference($matches[2]);
967  } elseif ($matches[3] != '') {
968  $ret = Sanitizer::hexCharReference($matches[3]);
969  } elseif ($matches[4] != '') {
970  $ret = Sanitizer::hexCharReference($matches[4]);
971  }
972  if (is_null($ret)) {
973  return htmlspecialchars($matches[0]);
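
The normalization can be sketched in one function. The pattern is an assumed simplification of `MW_CHAR_REFS_REGEX` with a fifth group for stray ampersands, the list of known names is a hypothetical excerpt, and the XML 1.0 Char production stands in for `validateCodepoint()`.

```php
<?php
function sketchNormalizeCharReferences(string $text): string
{
    // Hypothetical, shortened list of known entity names.
    $known = array('amp', 'lt', 'gt', 'quot', 'nbsp');

    // XML 1.0 Char production.
    $valid = function (int $cp): bool {
        return $cp === 0x9 || $cp === 0xA || $cp === 0xD
            || ($cp >= 0x20 && $cp <= 0xD7FF)
            || ($cp >= 0xE000 && $cp <= 0xFFFD)
            || ($cp >= 0x10000 && $cp <= 0x10FFFF);
    };

    // Groups: 1 = named, 2 = decimal, 3 = &#x hex, 4 = &#X hex, 5 = stray "&".
    $regex = '/&([A-Za-z0-9]+);|&#([0-9]+);|&#x([0-9A-Fa-f]+);|&#X([0-9A-Fa-f]+);|(&)/';

    return preg_replace_callback($regex, function (array $m) use ($known, $valid): string {
        $ret = null;
        if ($m[1] !== '') {
            $ret = in_array($m[1], $known, true) ? "&{$m[1]};" : null;
        } elseif ($m[2] !== '') {
            $cp = (int) $m[2];
            $ret = $valid($cp) ? sprintf('&#%d;', $cp) : null;
        } elseif ($m[3] !== '' || $m[4] !== '') {
            $cp = (int) hexdec($m[3] !== '' ? $m[3] : $m[4]);
            $ret = $valid($cp) ? sprintf('&#x%x;', $cp) : null;
        }
        // Anything that was not recognised is escaped and shown as text.
        return $ret ?? htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
    }, $text);
}
```

For the input `a & b &foo; &amp; &#X41; &#0;` this sketch returns `a &amp; b &amp;foo; &amp; &#x41; &amp;#0;`.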

◆ normalizeEntity()

static Sanitizer::normalizeEntity (   $name)
static

If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD, return the named entity reference as is.

If the entity is a MediaWiki-specific alias, returns the HTML equivalent. Otherwise, returns HTML-escaped text of pseudo-entity source (eg &foo;)

Parameters
string $name
Returns
string

Definition at line 985 of file Sanitizer.php.

990  {
991  global $wgHtmlEntities, $wgHtmlEntityAliases;
992  if (isset($wgHtmlEntityAliases[$name])) {
993  return "&{$wgHtmlEntityAliases[$name]};";
994  } elseif (isset($wgHtmlEntities[$name])) {
995  return "&$name;";

◆ normalizeWhitespace()

static Sanitizer::normalizeWhitespace (   $text)
static private

Definition at line 921 of file Sanitizer.php.

926  {
927  return preg_replace(
928  '/\r\n|[\x20\x0d\x0a\x09]/',

◆ removeHTMLcomments()

static Sanitizer::removeHTMLcomments (   $text)
static

Remove '<!--', '-->', and everything between.

To avoid leaving blank lines, when a comment is both preceded and followed by a newline (ignoring spaces), trim leading and trailing spaces and one of the newlines.

Parameters
string $text
Returns
string

Definition at line 556 of file Sanitizer.php.

561  {
562  wfProfileIn(__METHOD__);
563  while (($start = strpos($text, '<!--')) !== false) {
564  $end = strpos($text, '-->', $start + 4);
565  if ($end === false) {
566  # Unterminated comment; bail out
567  break;
568  }
569 
570  $end += 3;
571 
572  # Trim space and newline if the comment is both
573  # preceded and followed by a newline
574  $spaceStart = max($start - 1, 0);
575  $spaceLen = $end - $spaceStart;
576  while (substr($text, $spaceStart, 1) === ' ' && $spaceStart > 0) {
577  $spaceStart--;
578  $spaceLen++;
579  }
580  while (substr($text, $spaceStart + $spaceLen, 1) === ' ') {
581  $spaceLen++;
582  }
583  if (substr($text, $spaceStart, 1) === "\n" and substr($text, $spaceStart + $spaceLen, 1) === "\n") {
584  # Remove the comment, leading and trailing
585  # spaces, and leave only one newline.
586  $text = substr_replace($text, "\n", $spaceStart, $spaceLen + 1);
587  } else {
588  # Remove just the comment.
589  $text = substr_replace($text, '', $start, $end - $start);
590  }
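
A standalone version of the loop, for experimentation. It follows the listing but leaves out the `wfProfileIn()` call, and the final `return $text` is an assumption because the listing is truncated. The function name is hypothetical.

```php
<?php
function sketchRemoveHtmlComments(string $text): string
{
    while (($start = strpos($text, '<!--')) !== false) {
        $end = strpos($text, '-->', $start + 4);
        if ($end === false) {
            break; // unterminated comment: leave the rest untouched
        }
        $end += 3;

        // Extend the span over spaces on both sides of the comment.
        $spaceStart = max($start - 1, 0);
        $spaceLen = $end - $spaceStart;
        while (substr($text, $spaceStart, 1) === ' ' && $spaceStart > 0) {
            $spaceStart--;
            $spaceLen++;
        }
        while (substr($text, $spaceStart + $spaceLen, 1) === ' ') {
            $spaceLen++;
        }
        if (substr($text, $spaceStart, 1) === "\n"
            && substr($text, $spaceStart + $spaceLen, 1) === "\n") {
            // Comment on a line of its own: keep a single newline.
            $text = substr_replace($text, "\n", $spaceStart, $spaceLen + 1);
        } else {
            // Remove just the comment.
            $text = substr_replace($text, '', $start, $end - $start);
        }
    }
    return $text;
}
```

A comment that sits alone on a line, as in `"a\n <!-- c --> \nb"`, collapses to `"a\nb"`, whereas an inline comment in `"x<!-- c -->y"` simply disappears and leaves `"xy"`.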

◆ removeHTMLtags()

static Sanitizer::removeHTMLtags (   $text,
  $processCallback = null,
  $args = array() 
)
static

Cleans up HTML, removes dangerous tags and attributes, and removes HTML comments.

Parameters
string $text
callback $processCallback to do any variable or parameter replacements in HTML attribute values
array $args for the processing callback
Returns
string

Definition at line 366 of file Sanitizer.php.

371  {
372  global $wgUseTidy;
373 
374  static $htmlpairs, $htmlsingle, $htmlsingleonly, $htmlnest, $tabletags,
375  $htmllist, $listtags, $htmlsingleallowed, $htmlelements, $staticInitialised;
376 
377  wfProfileIn(__METHOD__);
378 
379  if (!$staticInitialised) {
380  $htmlpairs = array( # Tags that must be closed
381  'b', 'del', 'i', 'ins', 'u', 'font', 'big', 'small', 'sub', 'sup', 'h1',
382  'h2', 'h3', 'h4', 'h5', 'h6', 'cite', 'code', 'em', 's',
383  'strike', 'strong', 'tt', 'var', 'div', 'center',
384  'blockquote', 'ol', 'ul', 'dl', 'table', 'caption', 'pre',
385  'ruby', 'rt' , 'rb' , 'rp', 'p', 'span', 'u'
386  );
387  $htmlsingle = array(
388  'br', 'hr', 'li', 'dt', 'dd'
389  );
390  $htmlsingleonly = array( # Elements that cannot have close tags
391  'br', 'hr'
392  );
393  $htmlnest = array( # Tags that can be nested--??
394  'table', 'tr', 'td', 'th', 'div', 'blockquote', 'ol', 'ul',
395  'dl', 'font', 'big', 'small', 'sub', 'sup', 'span'
396  );
397  $tabletags = array( # Can only appear inside table, we will close them
398  'td', 'th', 'tr',
399  );
400  $htmllist = array( # Tags used by list
401  'ul','ol',
402  );
403  $listtags = array( # Tags that can appear in a list
404  'li',
405  );
406 
407  $htmlsingleallowed = array_merge($htmlsingle, $tabletags);
408  $htmlelements = array_merge($htmlsingle, $htmlpairs, $htmlnest);
409 
410  # Convert them all to hashtables for faster lookup
411  $vars = array( 'htmlpairs', 'htmlsingle', 'htmlsingleonly', 'htmlnest', 'tabletags',
412  'htmllist', 'listtags', 'htmlsingleallowed', 'htmlelements' );
413  foreach ($vars as $var) {
414  $$var = array_flip($$var);
415  }
416  $staticInitialised = true;
417  }
418 
419  # Remove HTML comments
420  $text = Sanitizer::removeHTMLcomments($text);
421  $bits = explode('<', $text);
422  $text = str_replace('>', '&gt;', array_shift($bits));
423  if (!$wgUseTidy) {
424  $tagstack = $tablestack = array();
425  foreach ($bits as $x) {
426  $regs = array();
427  if (preg_match('!^(/?)(\\w+)([^>]*?)(/{0,1}>)([^<]*)$!', $x, $regs)) {
428  list( /* $qbar */, $slash, $t, $params, $brace, $rest) = $regs;
429  } else {
430  $slash = $t = $params = $brace = $rest = null;
431  }
432 
433  $badtag = 0 ;
434  if (isset($htmlelements[$t = strtolower($t)])) {
435  # Check our stack
436  if ($slash) {
437  # Closing a tag...
438  if (isset($htmlsingleonly[$t])) {
439  $badtag = 1;
440  } elseif (($ot = @array_pop($tagstack)) != $t) {
441  if (isset($htmlsingleallowed[$ot])) {
442  # Pop all elements with an optional close tag
443  # and see if we find a match below them
444  $optstack = array();
445  $optstack[] = $ot;
446  while ((($ot = @array_pop($tagstack)) != $t) &&
447  isset($htmlsingleallowed[$ot])) {
448  $optstack[] = $ot;
449  }
450  if ($t != $ot) {
451  # No match. Push the optional elements back again
452  $badtag = 1;
453  while ($ot = @array_pop($optstack)) {
454  $tagstack[] = $ot;
455  }
456  }
457  } else {
458  @array_push($tagstack, $ot);
459  # <li> can be nested in <ul> or <ol>, skip those cases:
460  if (!(isset($htmllist[$ot]) && isset($listtags[$t]))) {
461  $badtag = 1;
462  }
463  }
464  } else {
465  if ($t == 'table') {
466  $tagstack = array_pop($tablestack);
467  }
468  }
469  $newparams = '';
470  } else {
471  # Keep track for later
472  if (isset($tabletags[$t]) &&
473  !in_array('table', $tagstack)) {
474  $badtag = 1;
475  } elseif (in_array($t, $tagstack) &&
476  !isset($htmlnest [$t ])) {
477  $badtag = 1 ;
478  # Is it a self closed htmlpair ? (bug 5487)
479  } elseif ($brace == '/>' &&
480  isset($htmlpairs[$t])) {
481  $badtag = 1;
482  } elseif (isset($htmlsingleonly[$t])) {
483  # Hack to force empty tag for uncloseable elements
484  $brace = '/>';
485  } elseif (isset($htmlsingle[$t])) {
486  # Hack to not close $htmlsingle tags
487  $brace = null;
488  } elseif (isset($tabletags[$t])
489  && in_array($t, $tagstack)) {
490  // New table tag but forgot to close the previous one
491  $text .= "</$t>";
492  } else {
493  if ($t == 'table') {
494  $tablestack[] = $tagstack;
495  $tagstack = array();
496  }
497  $tagstack[] = $t;
498  }
499 
500  # Replace any variables or template parameters with
501  # plaintext results.
502  if (is_callable($processCallback)) {
503  call_user_func_array($processCallback, array( &$params, $args ));
504  }
505 
506  # Strip non-approved attributes from the tag
507  $newparams = Sanitizer::fixTagAttributes($params, $t);
508  }
509  if (!$badtag) {
510  $rest = str_replace('>', '&gt;', $rest);
511  $close = ($brace == '/>' && !$slash) ? ' /' : '';
512  $text .= "<$slash$t$newparams$close>$rest";
513  continue;
514  }
515  }
516  $text .= '&lt;' . str_replace('>', '&gt;', $x);
517  }
518  # Close off any remaining tags
519  while (is_array($tagstack) && ($t = array_pop($tagstack))) {
520  $text .= "</$t>\n";
521  if ($t == 'table') {
522  $tagstack = array_pop($tablestack);
523  }
524  }
525  } else {
526  # this might be possible using tidy itself
527  foreach ($bits as $x) {
528  preg_match(
529  '/^(\\/?)(\\w+)([^>]*?)(\\/{0,1}>)([^<]*)$/',
530  $x,
531  $regs
532  );
533  @list( /* $qbar */, $slash, $t, $params, $brace, $rest) = $regs;
534  if (isset($htmlelements[$t = strtolower($t)])) {
535  if (is_callable($processCallback)) {
536  call_user_func_array($processCallback, array( &$params, $args ));
537  }
538  $newparams = Sanitizer::fixTagAttributes($params, $t);
539  $rest = str_replace('>', '&gt;', $rest);
540  $text .= "<$slash$t$newparams$brace$rest";
541  } else {
542  $text .= '&lt;' . str_replace('>', '&gt;', $x);
543  }
544  }
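
The core idea of the listing, splitting the input at every `<` and deciding per fragment whether it is an allowed tag, can be sketched in a few lines. This is a deliberately reduced version: it performs no tag-stack balancing, drops all attributes instead of calling `fixTagAttributes()`, and takes the list of allowed elements as a hypothetical second parameter.

```php
<?php
function sketchRemoveHtmlTags(string $text, array $allowed): string
{
    $allowed = array_flip($allowed); // hash lookup, as in the listing
    $bits = explode('<', $text);
    $out = str_replace('>', '&gt;', array_shift($bits));

    foreach ($bits as $x) {
        $regs = array();
        if (preg_match('!^(/?)(\w+)([^>]*?)(/?>)([^<]*)$!', $x, $regs)) {
            list(, $slash, $t, , $brace, $rest) = $regs;
            $t = strtolower($t);
            if (isset($allowed[$t])) {
                // Allowed element: keep the tag, escape the text after it.
                $out .= "<$slash$t$brace" . str_replace('>', '&gt;', $rest);
                continue;
            }
        }
        // Unknown or malformed tag: emit it as visible text.
        $out .= '&lt;' . str_replace('>', '&gt;', $x);
    }
    return $out;
}
```

Called with `array('b')` as the allowed list, the input `<b>bold</b><script>x</script> 1 > 0` comes back as `<b>bold</b>&lt;script&gt;x&lt;/script&gt; 1 &gt; 0`.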

◆ safeEncodeAttribute()

static Sanitizer::safeEncodeAttribute (   $text)
static

Encode an attribute value for HTML tags, with extra armoring against further wiki processing.

Parameters
$text
Returns
HTML-encoded text fragment

Definition at line 740 of file Sanitizer.php.

745  {
746  $encValue = Sanitizer::encodeAttribute($text);
747 
748  # Templates and links may be expanded in later parsing,
749  # creating invalid or dangerous output. Suppress this.
750  $encValue = strtr($encValue, array(
751  '<' => '&lt;', // This should never happen,
752  '>' => '&gt;', // we've received invalid input
753  '"' => '&quot;', // which should have been escaped.
754  '{' => '&#123;',
755  '[' => '&#91;',
756  "''" => '&#39;&#39;',
757  'ISBN' => '&#73;SBN',
758  'RFC' => '&#82;FC',
759  'PMID' => '&#80;MID',
760  '|' => '&#124;',
761  '__' => '&#95;_',
762  ));
763 
764  # Stupid hack
765  $encValue = preg_replace_callback(
766  '/(' . wfUrlProtocols() . ')/',
767  array( 'Sanitizer', 'armorLinksCallback' ),
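
The armoring table can be tried out with the sketch below. It uses a shortened table, passes `ENT_COMPAT` explicitly (an assumption, so that single quotes survive for the `''` rule), and omits the whitespace encoding and the URL-protocol callback. The function name is hypothetical.

```php
<?php
function sketchSafeEncodeAttribute(string $text): string
{
    $encValue = htmlspecialchars($text, ENT_COMPAT, 'UTF-8');

    // Replace sequences that later wikitext parsing would treat as markup.
    return strtr($encValue, array(
        '{'    => '&#123;',
        '['    => '&#91;',
        "''"   => '&#39;&#39;',
        'ISBN' => '&#73;SBN',
        '|'    => '&#124;',
        '__'   => '&#95;_',
    ));
}
```

Because `strtr()` does not rescan its own output, the inserted numeric references are not processed a second time.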

◆ setupAttributeWhitelist()

static Sanitizer::setupAttributeWhitelist ( )
static
Todo:
Document it a bit
Returns
array

Definition at line 1128 of file Sanitizer.php.

Referenced by hexCharReference().

1133  {
1134  $common = array( 'id', 'class', 'lang', 'dir', 'title', 'style' );
1135  $block = array_merge($common, array( 'align' ));
1136  $tablealign = array( 'align', 'char', 'charoff', 'valign' );
1137  $tablecell = array( 'abbr',
1138  'axis',
1139  'headers',
1140  'scope',
1141  'rowspan',
1142  'colspan',
1143  'nowrap', # deprecated
1144  'width', # deprecated
1145  'height', # deprecated
1146  'bgcolor' # deprecated
1147  );
1148 
1149  # Numbers refer to sections in HTML 4.01 standard describing the element.
1150  # See: http://www.w3.org/TR/html4/
1151  $whitelist = array(
1152  # 7.5.4
1153  'div' => $block,
1154  'center' => $common, # deprecated
1155  'span' => $block, # ??
1156 
1157  # 7.5.5
1158  'h1' => $block,
1159  'h2' => $block,
1160  'h3' => $block,
1161  'h4' => $block,
1162  'h5' => $block,
1163  'h6' => $block,
1164 
1165  # 7.5.6
1166  # address
1167 
1168  # 8.2.4
1169  # bdo
1170 
1171  # 9.2.1
1172  'em' => $common,
1173  'strong' => $common,
1174  'cite' => $common,
1175  # dfn
1176  'code' => $common,
1177  # samp
1178  # kbd
1179  'var' => $common,
1180  # abbr
1181  # acronym
1182 
1183  # 9.2.2
1184  'blockquote' => array_merge($common, array( 'cite' )),
1185  # q
1186 
1187  # 9.2.3
1188  'sub' => $common,
1189  'sup' => $common,
1190 
1191  # 9.3.1
1192  'p' => $block,
1193 
1194  # 9.3.2
1195  'br' => array( 'id', 'class', 'title', 'style', 'clear' ),
1196 
1197  # 9.3.4
1198  'pre' => array_merge($common, array( 'width' )),
1199 
1200  # 9.4
1201  'ins' => array_merge($common, array( 'cite', 'datetime' )),
1202  'del' => array_merge($common, array( 'cite', 'datetime' )),
1203 
1204  # 10.2
1205  'ul' => array_merge($common, array( 'type' )),
1206  'ol' => array_merge($common, array( 'type', 'start' )),
1207  'li' => array_merge($common, array( 'type', 'value' )),
1208 
1209  # 10.3
1210  'dl' => $common,
1211  'dd' => $common,
1212  'dt' => $common,
1213 
1214  # 11.2.1
1215  'table' => array_merge(
1216  $common,
1217  array( 'summary', 'width', 'border', 'frame',
1218  'rules', 'cellspacing', 'cellpadding',
1219  'align', 'bgcolor',
1220  )
1221  ),
1222 
1223  # 11.2.2
1224  'caption' => array_merge($common, array( 'align' )),
1225 
1226  # 11.2.3
1227  'thead' => array_merge($common, $tablealign),
1228  'tfoot' => array_merge($common, $tablealign),
1229  'tbody' => array_merge($common, $tablealign),
1230 
1231  # 11.2.4
1232  'colgroup' => array_merge($common, array( 'span', 'width' ), $tablealign),
1233  'col' => array_merge($common, array( 'span', 'width' ), $tablealign),
1234 
1235  # 11.2.5
1236  'tr' => array_merge($common, array( 'bgcolor' ), $tablealign),
1237 
1238  # 11.2.6
1239  'td' => array_merge($common, $tablecell, $tablealign),
1240  'th' => array_merge($common, $tablecell, $tablealign),
1241 
1242  # 15.2.1
1243  'tt' => $common,
1244  'b' => $common,
1245  'i' => $common,
1246  'big' => $common,
1247  'small' => $common,
1248  'strike' => $common,
1249  's' => $common,
1250  'u' => $common,
1251 
1252  # 15.2.2
1253  'font' => array_merge($common, array( 'size', 'color', 'face' )),
1254  # basefont
1255 
1256  # 15.3
1257  'hr' => array_merge($common, array( 'noshade', 'size', 'width' )),
1258 
1259  # XHTML Ruby annotation text module, simple ruby only.
1260  # http://www.w3c.org/TR/ruby/
1261  'ruby' => $common,
1262  # rbc
1263  # rtc
1264  'rb' => $common,
1265  'rt' => $common, #array_merge( $common, array( 'rbspan' ) ),
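
The way the whitelist is composed from shared attribute groups can be illustrated on a much smaller table. This is a sketch under assumptions: `buildDemoWhitelist()` and `isAttributeAllowed()` are hypothetical names, and only three of the elements from the listing are reproduced.

```php
<?php
// Hypothetical reduced whitelist, built the same way as in the listing:
// shared groups are defined once and merged per element.
function buildDemoWhitelist(): array
{
    $common = array('id', 'class', 'lang', 'dir', 'title', 'style');
    $block  = array_merge($common, array('align'));

    return array(
        'div' => $block,
        'em'  => $common,
        'br'  => array('id', 'class', 'title', 'style', 'clear'),
    );
}

// Hypothetical lookup helper: true if the attribute is listed for the element.
function isAttributeAllowed(string $element, string $attribute): bool
{
    $whitelist = buildDemoWhitelist();
    return isset($whitelist[$element])
        && in_array($attribute, $whitelist[$element], true);
}

var_dump(isAttributeAllowed('div', 'align')); // bool(true)
var_dump(isAttributeAllowed('em', 'align'));  // bool(false)
```

Elements that are absent from the table, such as `script`, have no allowed attributes at all in this sketch.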

◆ stripAllTags()

static Sanitizer::stripAllTags (   $text)
static

Take a fragment of (potentially invalid) HTML and return a version with any tags removed, encoded as plain text.

Warning: as of 1.10, the return value must be escaped again before it is included literally in HTML output.

Parameters
string $text HTML fragment
Returns
string

Definition at line 1277 of file Sanitizer.php.

Referenced by hexCharReference().

1282  {
1283  # Actual <tags>
1284  $text = StringUtils::delimiterReplace('<', '>', '', $text);
1285 
1286  # Normalize &entities and whitespace
1287  $text = self::decodeCharReferences($text);
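
The warning above matters because decoding character references can reintroduce `<` and `&` into the result. The following sketch shows the effect. It is an approximation, not the method's implementation: `stripAllTagsSketch()` is a hypothetical name, and `preg_replace()` and `html_entity_decode()` stand in for `StringUtils::delimiterReplace()` and `decodeCharReferences()`. The whitespace collapsing is likewise an assumption based on the comment in the listing.

```php
<?php
// Hypothetical approximation of the two steps shown in the listing.
function stripAllTagsSketch(string $text): string
{
    // Stand-in for StringUtils::delimiterReplace('<', '>', '', $text).
    $text = (string) preg_replace('/<[^>]*>/', '', $text);

    // Stand-in for decodeCharReferences(): turn entities into characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // Assumed whitespace normalization: collapse runs and trim the ends.
    return trim((string) preg_replace('/\s+/', ' ', $text));
}

$plain = stripAllTagsSketch('<b>Tom &amp; Jerry</b> &lt;3');
echo $plain, "\n";                                       // Tom & Jerry <3
echo htmlspecialchars($plain, ENT_QUOTES, 'UTF-8'), "\n"; // Tom &amp; Jerry &lt;3
```

The first line of output is plain text and is not safe to embed in HTML as it stands; the second line shows the extra escaping step the caller is expected to apply.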

◆ validateCodepoint()

static Sanitizer::validateCodepoint (   $codepoint)
static private

Returns true if a given Unicode codepoint is a valid character in XML.

Parameters
int $codepoint
Returns
bool

Definition at line 1022 of file Sanitizer.php.

Referenced by decCharReference(), and hexCharReference().

1027  {
1028  return ($codepoint == 0x09)
1029  || ($codepoint == 0x0a)
1030  || ($codepoint == 0x0d)
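
The listing above is cut off after the first three comparisons. For orientation, the sketch below checks a codepoint against the `Char` production of the XML 1.0 specification (tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). `isValidXmlCodepoint()` is a hypothetical name, and whether the remainder of the method matches these ranges exactly is an assumption.

```php
<?php
// Hypothetical check based on the XML 1.0 "Char" production; the ranges
// beyond the first three comparisons are not taken from the listing.
function isValidXmlCodepoint(int $codepoint): bool
{
    return $codepoint === 0x09
        || $codepoint === 0x0a
        || $codepoint === 0x0d
        || ($codepoint >= 0x20 && $codepoint <= 0xd7ff)      // excludes surrogates
        || ($codepoint >= 0xe000 && $codepoint <= 0xfffd)    // excludes U+FFFE, U+FFFF
        || ($codepoint >= 0x10000 && $codepoint <= 0x10ffff);
}

var_dump(isValidXmlCodepoint(0x41));   // bool(true)
var_dump(isValidXmlCodepoint(0x0b));   // bool(false)
var_dump(isValidXmlCodepoint(0xd800)); // bool(false)
```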

◆ validateTagAttributes()

static Sanitizer::validateTagAttributes ( $attribs, $element )
static

Take an array of attribute names and values and normalize or discard illegal values for the given element type.

  • Discards attributes that are not on the whitelist for the given element
  • Discards unsafe style attributes
Parameters
array $attribs
string $element
Returns
array
Todo:

Check for legal values where the DTD limits things.

Check for unique id attribute :P

Definition at line 606 of file Sanitizer.php.

610  public static function validateTagAttributes($attribs, $element)
611  {
612  $whitelist = array_flip(Sanitizer::attributeWhitelist($element));
613  $out = array();
614  foreach ($attribs as $attribute => $value) {
615  if (!isset($whitelist[$attribute])) {
616  continue;
617  }
618  # Strip javascript "expression" from stylesheets.
619  # http://msdn.microsoft.com/workshop/author/dhtml/overview/recalc.asp
620  if ($attribute == 'style') {
621  $value = Sanitizer::checkCss($value);
622  if ($value === false) {
623  # haxx0r
624  continue;
625  }
626  }
627 
628  if ($attribute === 'id') {
629  $value = Sanitizer::escapeId($value);
630  }
631 
632  // If this attribute was previously set, override it.
633  // Output should only have one attribute of each name.
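
The filtering loop in the listing can be exercised without the rest of the class. The sketch below is a simplified illustration: `filterAttributesSketch()` and `demoCheckCss()` are hypothetical names, the allowed list is passed in directly instead of coming from `attributeWhitelist()`, the `id` escaping step is left out, and the final assignment into `$out` is an assumption because the listing stops before it. The CSS check is a crude stand-in that only rejects values containing `expression`.

```php
<?php
// Hypothetical stand-in for checkCss(): returns false for suspicious CSS,
// otherwise the value unchanged. The real check is more thorough.
function demoCheckCss(string $css)
{
    return stripos($css, 'expression') !== false ? false : $css;
}

// Hypothetical reduced version of the loop in the listing.
function filterAttributesSketch(array $attribs, array $allowed): array
{
    $whitelist = array_flip($allowed); // attribute name => index, for isset()
    $out = array();
    foreach ($attribs as $attribute => $value) {
        if (!isset($whitelist[$attribute])) {
            continue; // not allowed for this element
        }
        if ($attribute === 'style') {
            $value = demoCheckCss($value);
            if ($value === false) {
                continue; // unsafe style: drop the whole attribute
            }
        }
        // Assumed final step: later values override earlier ones by name.
        $out[$attribute] = $value;
    }
    return $out;
}

$allowed = array('id', 'class', 'lang', 'dir', 'title', 'style');
print_r(filterAttributesSketch(
    array(
        'class'   => 'note',
        'onclick' => 'evil()',
        'style'   => 'width: expression(alert(1))',
    ),
    $allowed
));
```

With this input only `class` survives: `onclick` is not on the list, and the `style` value is rejected by the stand-in CSS check.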

The documentation for this class was generated from the following file: