ILIAS  release_5-3 Revision v5.3.23-19-g915713cf615
htmlfilter.php File Reference

Functions

 tln_tagprint ($tagname, $attary, $tagtype)
 htmlfilter.inc: This set of functions allows you to filter HTML in order to remove any malicious tags from it.
 
 tln_casenormalize (&$val)
 A small helper function to use with array_walk.
 
 tln_skipspace ($body, $offset)
 This function skips any whitespace from the current position within a string and to the next non-whitespace value.
 
 tln_findnxstr ($body, $offset, $needle)
 This function looks for the next character within a string.
 
 tln_findnxreg ($body, $offset, $reg)
 This function takes a PCRE-style regexp and tries to match it within the string.
 
 tln_getnxtag ($body, $offset)
 This function looks for the next tag.
 
 tln_deent (&$attvalue, $regex, $hex=false)
 Translates entities into literal values so they can be checked.
 
 tln_defang (&$attvalue)
 This function checks attribute values for entity-encoded values and returns them translated into 8-bit strings so we can run checks on them.
 
 tln_unspace (&$attvalue)
 Kill any tabs, newlines, or carriage returns.
 
 tln_fixatts ( $tagname, $attary, $rm_attnames, $bad_attvals, $add_attr_to_tag, $trans_image_path, $block_external_images)
 This function runs various checks against the attributes.
 
 tln_fixurl ($attname, &$attvalue, $trans_image_path, $block_external_images)
 
 tln_fixstyle ($body, $pos, $trans_image_path, $block_external_images)
 
 tln_body2div ($attary, $trans_image_path)
 
 tln_sanitize ( $body, $tag_list, $rm_tags_with_content, $self_closing_tags, $force_tag_closing, $rm_attnames, $bad_attvals, $add_attr_to_tag, $trans_image_path, $block_external_images)
 
 HTMLFilter ($body, $trans_image_path, $block_external_images=false)
 

Function Documentation

◆ HTMLFilter()

HTMLFilter (   $body,
  $trans_image_path,
  $block_external_images = false 
)

Definition at line 1013 of file htmlfilter.php.

References tln_sanitize().

1014 {
1015 
1016  $tag_list = array(
1017  false,
1018  "object",
1019  "meta",
1020  "html",
1021  "head",
1022  "base",
1023  "link",
1024  "frame",
1025  "iframe",
1026  "plaintext",
1027  "marquee"
1028  );
1029 
1030  $rm_tags_with_content = array(
1031  "script",
1032  "applet",
1033  "embed",
1034  "title",
1035  "frameset",
1036  "xmp",
1037  "xml"
1038  );
1039 
1040  $self_closing_tags = array(
1041  "img",
1042  "br",
1043  "hr",
1044  "input",
1045  "outbind"
1046  );
1047 
1048  $force_tag_closing = true;
1049 
1050  $rm_attnames = array(
1051  "/.*/" =>
1052  array(
1053  // "/target/i",
1054  "/^on.*/i",
1055  "/^dynsrc/i",
1056  "/^data.*/i",
1057  "/^lowsrc.*/i"
1058  )
1059  );
1060 
1061  $bad_attvals = array(
1062  "/.*/" =>
1063  array(
1064  "/^src|background/i" =>
1065  array(
1066  array(
1067  '/^([\'"])\s*\S+script\s*:.*([\'"])/si',
1068  '/^([\'"])\s*mocha\s*:*.*([\'"])/si',
1069  '/^([\'"])\s*about\s*:.*([\'"])/si'
1070  ),
1071  array(
1072  "\\1$trans_image_path\\2",
1073  "\\1$trans_image_path\\2",
1074  "\\1$trans_image_path\\2"
1075  )
1076  ),
1077  "/^href|action/i" =>
1078  array(
1079  array(
1080  '/^([\'"])\s*\S+script\s*:.*([\'"])/si',
1081  '/^([\'"])\s*mocha\s*:*.*([\'"])/si',
1082  '/^([\'"])\s*about\s*:.*([\'"])/si'
1083  ),
1084  array(
1085  "\\1#\\1",
1086  "\\1#\\1",
1087  "\\1#\\1"
1088  )
1089  ),
1090  "/^style/i" =>
1091  array(
1092  array(
1093  "/\/\*.*\*\//",
1094  "/expression/i",
1095  "/binding/i",
1096  "/behaviou*r/i",
1097  "/include-source/i",
1098  '/position\s*:/i',
1099  '/(\\\\)?u(\\\\)?r(\\\\)?l(\\\\)?/i',
1100  '/url\s*\(\s*([\'"])\s*\S+script\s*:.*([\'"])\s*\)/si',
1101  '/url\s*\(\s*([\'"])\s*mocha\s*:.*([\'"])\s*\)/si',
1102  '/url\s*\(\s*([\'"])\s*about\s*:.*([\'"])\s*\)/si',
1103  '/(.*)\s*:\s*url\s*\(\s*([\'"]*)\s*\S+script\s*:.*([\'"]*)\s*\)/si'
1104  ),
1105  array(
1106  "",
1107  "idiocy",
1108  "idiocy",
1109  "idiocy",
1110  "idiocy",
1111  "idiocy",
1112  "url",
1113  "url(\\1#\\1)",
1114  "url(\\1#\\1)",
1115  "url(\\1#\\1)",
1116  "\\1:url(\\2#\\3)"
1117  )
1118  )
1119  )
1120  );
1121 
1122  if ($block_external_images) {
1123  array_push(
1124  $bad_attvals['/.*/']['/^src|background/i'][0],
1125  '/^([\'\"])\s*https*:.*([\'\"])/si'
1126  );
1127  array_push(
1128  $bad_attvals['/.*/']['/^src|background/i'][1],
1129  "\\1$trans_image_path\\1"
1130  );
1131  array_push(
1132  $bad_attvals['/.*/']['/^style/i'][0],
1133  '/url\(([\'\"])\s*https*:.*([\'\"])\)/si'
1134  );
1135  array_push(
1136  $bad_attvals['/.*/']['/^style/i'][1],
1137  "url(\\1$trans_image_path\\1)"
1138  );
1139  }
1140 
1141  $add_attr_to_tag = array(
1142  "/^a$/i" =>
1143  array('target' => '"_blank"')
1144  );
1145 
1146  $trusted = tln_sanitize(
1147  $body,
1148  $tag_list,
1149  $rm_tags_with_content,
1150  $self_closing_tags,
1151  $force_tag_closing,
1152  $rm_attnames,
1153  $bad_attvals,
1154  $add_attr_to_tag,
1155  $trans_image_path,
1156  $block_external_images
1157  );
1158  return $trusted;
1159 }
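
The $bad_attvals table in the listing above pairs an array of match patterns with an array of replacements, which tln_fixatts() hands to preg_replace(). As a minimal sketch of how a single pair behaves, the following applies one of the "/^src|background/i" rules to two quoted attribute values. The path 'blank.png' and the sample values are assumptions made for illustration only.

```php
<?php
// Hypothetical placeholder path; HTMLFilter() receives this as $trans_image_path.
$trans_image_path = 'blank.png';

// One match/replacement pair taken from the "/^src|background/i" rules.
$valmatch = array('/^([\'"])\s*\S+script\s*:.*([\'"])/si');
$valrepl  = array("\\1$trans_image_path\\2");

// Attribute values reach the rules with their quotes still attached.
$unsafe = '"javascript:alert(1)"';
$safe   = '"photo.png"';

$unsafe_result = preg_replace($valmatch, $valrepl, $unsafe);
$safe_result   = preg_replace($valmatch, $valrepl, $safe);

echo $unsafe_result, "\n"; // "blank.png"
echo $safe_result, "\n";   // "photo.png" (unchanged)
```

Because the replacement string splices the path directly after the \1 back-reference, a path that begins with a digit would be read as part of the reference number; this sketch sidesteps that by using a path that starts with a letter.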

◆ tln_body2div()

tln_body2div (   $attary,
  $trans_image_path 
)

Definition at line 791 of file htmlfilter.php.

Referenced by tln_sanitize().

792 {
793  $divattary = array('class' => "'bodyclass'");
794  $text = '#000000';
795  $has_bgc_stl = $has_txt_stl = false;
796  $styledef = '';
797  if (is_array($attary) && sizeof($attary) > 0){
798  foreach ($attary as $attname=>$attvalue){
799  $quotchar = substr($attvalue, 0, 1);
800  $attvalue = str_replace($quotchar, "", $attvalue);
801  switch ($attname){
802  case 'background':
803  $styledef .= "background-image: url('$trans_image_path'); ";
804  break;
805  case 'bgcolor':
806  $has_bgc_stl = true;
807  $styledef .= "background-color: $attvalue; ";
808  break;
809  case 'text':
810  $has_txt_stl = true;
811  $styledef .= "color: $attvalue; ";
812  break;
813  }
814  }
815  // Outlook defines a white bgcolor and no text color. This can lead to
816  // white text on a white bg with certain themes.
817  if ($has_bgc_stl && !$has_txt_stl) {
818  $styledef .= "color: $text; ";
819  }
820  if (strlen($styledef) > 0){
821  $divattary["style"] = "\"$styledef\"";
822  }
823  }
824  return $divattary;
825 }
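
To make the attribute-to-style mapping above concrete, here is a small sketch of the bgcolor/text handling, including the default text color added when only a background color is present. The helper name sketch_body_style() is hypothetical and only mirrors the relevant part of tln_body2div(); the background-image case is left out.

```php
<?php
// Hypothetical helper mirroring the style-building part of tln_body2div().
function sketch_body_style(array $attary, $default_text = '#000000')
{
    $style = '';
    $has_bg = $has_text = false;
    foreach ($attary as $name => $value) {
        // Strip the surrounding quote character, as the original does.
        $value = str_replace(substr($value, 0, 1), '', $value);
        if ($name === 'bgcolor') {
            $has_bg = true;
            $style .= "background-color: $value; ";
        } elseif ($name === 'text') {
            $has_text = true;
            $style .= "color: $value; ";
        }
    }
    // A background color without a text color gets the default text color.
    if ($has_bg && !$has_text) {
        $style .= "color: $default_text; ";
    }
    return $style;
}

$style = sketch_body_style(array('bgcolor' => '"#ffffff"'));
echo $style, "\n";
```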

◆ tln_casenormalize()

tln_casenormalize ( &$val )

A small helper function to use with array_walk.

Modifies a by-ref value and makes it lowercase.

Parameters
  string $val: a value passed by-ref.
Returns
void, since it modifies a by-ref value.

Definition at line 69 of file htmlfilter.php.

70 {
71  $val = strtolower($val);
72 }
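
A short sketch of how such a by-reference callback is used with array_walk(), as tln_sanitize() does for its tag lists. The function sketch_casenormalize() is a hypothetical stand-in with the same body as tln_casenormalize(), and the sample tag names are made up.

```php
<?php
// Hypothetical stand-in with the same body as tln_casenormalize().
function sketch_casenormalize(&$val)
{
    $val = strtolower($val);
}

$tags = array('IMG', 'Br', 'hr');
// array_walk() passes each element by reference, so the array is changed in place.
array_walk($tags, 'sketch_casenormalize');
print_r($tags);
```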

◆ tln_deent()

tln_deent (   &$attvalue,
  $regex,
  $hex = false 
)

Translates entities into literal values so they can be checked.

Parameters
  string $attvalue: the by-ref value to check.
  string $regex: the regular expression to check against.
  boolean $hex: whether the entities are hexadecimal.
Returns
boolean True or False depending on whether there were matches.

Definition at line 439 of file htmlfilter.php.

Referenced by tln_defang().

440 {
441  preg_match_all($regex, $attvalue, $matches);
442  if (is_array($matches) && sizeof($matches[0]) > 0) {
443  $repl = array();
444  for ($i = 0; $i < sizeof($matches[0]); $i++) {
445  $numval = $matches[1][$i];
446  if ($hex) {
447  $numval = hexdec($numval);
448  }
449  $repl[$matches[0][$i]] = chr($numval);
450  }
451  $attvalue = strtr($attvalue, $repl);
452  return true;
453  } else {
454  return false;
455  }
456 }

◆ tln_defang()

tln_defang ( &$attvalue )

This function checks attribute values for entity-encoded values and returns them translated into 8-bit strings so we can run checks on them.

Parameters
  string $attvalue: a string to run entity check against.

Skip this if there aren't ampersands or backslashes.

Definition at line 465 of file htmlfilter.php.

References tln_deent().

Referenced by tln_fixatts(), and tln_fixstyle().

466 {
470  if (strpos($attvalue, '&') === false
471  && strpos($attvalue, '\\') === false
472  ) {
473  return;
474  }
475  do {
476  $m = false;
477  $m = $m || tln_deent($attvalue, '/\&#0*(\d+);*/s');
478  $m = $m || tln_deent($attvalue, '/\&#x0*((\d|[a-f])+);*/si', true);
479  $m = $m || tln_deent($attvalue, '/\\\\(\d+)/s', true);
480  } while ($m == true);
481  $attvalue = stripslashes($attvalue);
482 }
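
The reason for decoding entities before running checks is that a value such as "&#106;avascript:" hides the word "javascript" from the pattern rules. The sketch below illustrates the idea with a hypothetical helper, sketch_decode_entities(), built on preg_replace_callback(). It is not the original implementation: it makes a single pass over decimal and hexadecimal entities, whereas tln_defang() loops until nothing more matches and also handles backslash escapes.

```php
<?php
// Hypothetical helper: decodes decimal and hexadecimal numeric entities
// into single bytes, similar in spirit to tln_deent().
function sketch_decode_entities($value)
{
    $value = preg_replace_callback(
        '/&#0*(\d+);*/',
        function ($m) {
            return chr((int) $m[1]);
        },
        $value
    );
    return preg_replace_callback(
        '/&#x0*([0-9a-f]+);*/i',
        function ($m) {
            return chr(hexdec($m[1]));
        },
        $value
    );
}

$decoded = sketch_decode_entities('&#106;&#x61;vascript:alert(1)');
echo $decoded, "\n"; // javascript:alert(1)
```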

◆ tln_findnxreg()

tln_findnxreg (   $body,
  $offset,
  $reg 
)

This function takes a PCRE-style regexp and tries to match it within the string.

Parameters
  string $body: the string to look for needle in.
  integer $offset: start looking from here.
  string $reg: a PCRE-style regex to match.
Returns
array|boolean Returns a false if no matches found, or an array with the following members:
  • integer with the location of the match within $body
  • string with whatever content between offset and the match
  • string with whatever it is we matched

Definition at line 127 of file htmlfilter.php.

Referenced by tln_getnxtag().

128 {
129  $matches = array();
130  $retarr = array();
131  $preg_rule = '%^(.*?)(' . $reg . ')%s';
132  preg_match($preg_rule, substr($body, $offset), $matches);
133  if (!isset($matches[0]) || !$matches[0]) {
134  $retarr = false;
135  } else {
136  $retarr[0] = $offset + strlen($matches[1]);
137  $retarr[1] = $matches[1];
138  $retarr[2] = $matches[2];
139  }
140  return $retarr;
141 }
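
The three-member return value is easiest to see on a concrete input. The sketch below reproduces the logic of the listing under a hypothetical name, sketch_findnxreg(), and searches for the end of a tag name using the same character class that tln_getnxtag() passes in. The sample strings are made up.

```php
<?php
// Hypothetical stand-in reproducing the logic of tln_findnxreg().
function sketch_findnxreg($body, $offset, $reg)
{
    $matches = array();
    // Lazily capture everything before the first match of $reg.
    preg_match('%^(.*?)(' . $reg . ')%s', substr($body, $offset), $matches);
    if (!isset($matches[0]) || !$matches[0]) {
        return false;
    }
    return array($offset + strlen($matches[1]), $matches[1], $matches[2]);
}

// Start just after the "<" and look for the first non-name character.
$found = sketch_findnxreg('<img src=x>', 1, '[^\w\-_]');
print_r($found);  // position 4, content 'img', matched ' '

$none = sketch_findnxreg('abc', 0, '[^\w\-_]');
var_dump($none);  // bool(false)
```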

◆ tln_findnxstr()

tln_findnxstr (   $body,
  $offset,
  $needle 
)

This function looks for the next character within a string.

It's really just a glorified "strpos", except it catches the failures nicely.

Parameters
  string $body: the string to look for needle in.
  integer $offset: start looking from this position.
  string $needle: the character/string to look for.
Returns
integer location of the next occurrence of the needle, or strlen($body) if needle wasn't found.

Definition at line 105 of file htmlfilter.php.

Referenced by tln_getnxtag().

106 {
107  $pos = strpos($body, $needle, $offset);
108  if ($pos === false) {
109  $pos = strlen($body);
110  }
111  return $pos;
112 }
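
Returning strlen($body) instead of false lets callers treat "not found" as "runs to the end of the string" without a separate check. A minimal sketch, using a hypothetical stand-in named sketch_findnxstr() and a made-up input string:

```php
<?php
// Hypothetical stand-in with the same logic as tln_findnxstr().
function sketch_findnxstr($body, $offset, $needle)
{
    $pos = strpos($body, $needle, $offset);
    return ($pos === false) ? strlen($body) : $pos;
}

$body = 'text <b>bold</b>';
$first   = sketch_findnxstr($body, 0, '<'); // 5: position of the first "<"
$missing = sketch_findnxstr($body, 0, '#'); // 16: strlen($body), needle absent
echo $first, ' ', $missing, "\n";
```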

◆ tln_fixatts()

tln_fixatts (   $tagname,
  $attary,
  $rm_attnames,
  $bad_attvals,
  $add_attr_to_tag,
  $trans_image_path,
  $block_external_images 
)

This function runs various checks against the attributes.

Parameters
  string $tagname: string with the name of the tag.
  array $attary: array with all tag attributes.
  array $rm_attnames: see description for tln_sanitize
  array $bad_attvals: see description for tln_sanitize
  array $add_attr_to_tag: see description for tln_sanitize
  string $trans_image_path
  boolean $block_external_images
Returns
array with modified attributes.

See if this attribute should be removed.

Remove any backslashes, entities, or extraneous whitespace.

Now let's run checks on the attvalues. I don't expect anyone to comprehend this. If you do, get in touch with me so I can drive to where you live and shake your hand personally. :)

There are two arrays in valary. First is matches. Second one is replacements

See if we need to append any attributes to this tag.

Definition at line 514 of file htmlfilter.php.

References tln_defang(), tln_fixurl(), and tln_unspace().

Referenced by tln_sanitize().

522  {
523  foreach($attary as $attname => $attvalue) {
527  foreach ($rm_attnames as $matchtag => $matchattrs) {
528  if (preg_match($matchtag, $tagname)) {
529  foreach ($matchattrs as $matchattr) {
530  if (preg_match($matchattr, $attname)) {
531  unset($attary[$attname]);
532  continue 3; // attribute removed: skip the value checks and go to the next attribute
533  }
534  }
535  }
536  }
540  $oldattvalue = $attvalue;
541  tln_defang($attvalue);
542  if ($attname == 'style' && $attvalue !== $oldattvalue) {
543  $attvalue = "idiocy";
544  $attary[$attname] = $attvalue;
545  }
546  tln_unspace($attvalue);
547 
554  foreach ($bad_attvals as $matchtag => $matchattrs) {
555  if (preg_match($matchtag, $tagname)) {
556  foreach ($matchattrs as $matchattr => $valary) {
557  if (preg_match($matchattr, $attname)) {
563  list($valmatch, $valrepl) = $valary;
564  $newvalue = preg_replace($valmatch, $valrepl, $attvalue);
565  if ($newvalue != $attvalue) {
566  $attary[$attname] = $newvalue;
567  $attvalue = $newvalue;
568  }
569  }
570  }
571  }
572  }
573  if ($attname == 'style') {
574  if (preg_match('/[\0-\37\200-\377]+/', $attvalue)) {
575  $attary[$attname] = '"disallowed character"';
576  }
577  preg_match_all("/url\s*\((.+)\)/si", $attvalue, $aMatch);
578  if (count($aMatch)) {
579  foreach($aMatch[1] as $sMatch) {
580  $urlvalue = $sMatch;
581  tln_fixurl($attname, $urlvalue, $trans_image_path, $block_external_images);
582  $attary[$attname] = str_replace($sMatch, $urlvalue, $attvalue);
583  }
584  }
585  }
586  }
590  foreach ($add_attr_to_tag as $matchtag => $addattary) {
591  if (preg_match($matchtag, $tagname)) {
592  $attary = array_merge($attary, $addattary);
593  }
594  }
595  return $attary;
596 }
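
The first step of the listing, removing attributes by name, can be sketched in isolation. The helper sketch_remove_attrs() below is hypothetical; it follows the $rm_attnames layout used by HTMLFilter() (tag pattern mapped to a list of attribute-name patterns) but leaves out the value checks and the attribute appending that tln_fixatts() also performs.

```php
<?php
// Hypothetical helper: drops attributes whose names match any pattern
// registered for a matching tag, following the $rm_attnames layout.
function sketch_remove_attrs($tagname, array $attary, array $rm_attnames)
{
    foreach ($rm_attnames as $matchtag => $matchattrs) {
        if (!preg_match($matchtag, $tagname)) {
            continue;
        }
        foreach ($matchattrs as $matchattr) {
            foreach (array_keys($attary) as $attname) {
                if (preg_match($matchattr, $attname)) {
                    unset($attary[$attname]);
                }
            }
        }
    }
    return $attary;
}

$rm_attnames = array(
    '/.*/' => array('/^on.*/i', '/^dynsrc/i', '/^data.*/i', '/^lowsrc.*/i'),
);
$attary = array('href' => '"#"', 'onclick' => '"steal()"', 'data-id' => '"7"');
$kept = sketch_remove_attrs('a', $attary, $rm_attnames);
print_r($kept); // only href remains
```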

◆ tln_fixstyle()

tln_fixstyle (   $body,
  $pos,
  $trans_image_path,
  $block_external_images 
)

First look for a general BODY style declaration, which would be like so: body {background: blah-blah}, and change it to .bodyclass so we can just assign it to a <div>.

Fix url('blah') declarations.

Remove any backslashes, entities, and extraneous whitespace.

Definition at line 666 of file htmlfilter.php.

References tln_defang(), tln_fixurl(), and tln_unspace().

Referenced by tln_sanitize().

667 {
668  // workaround for </style> in between comments
669  $content = '';
670  $sToken = '';
671  $bSucces = false;
672  $bEndTag = false;
673  for ($i=$pos,$iCount=strlen($body);$i<$iCount;++$i) {
674  $char = $body[$i];
675  switch ($char) {
676  case '<':
677  $sToken = $char;
678  break;
679  case '/':
680  if ($sToken == '<') {
681  $sToken .= $char;
682  $bEndTag = true;
683  } else {
684  $content .= $char;
685  }
686  break;
687  case '>':
688  if ($bEndTag) {
689  $sToken .= $char;
690  if (preg_match('/<\/\s*style\s*>/i',$sToken,$aMatch)) {
691  $newpos = $i + 1;
692  $bSucces = true;
693  break 2;
694  } else {
695  $content .= $sToken;
696  }
697  $bEndTag = false;
698  } else {
699  $content .= $char;
700  }
701  break;
702  case '!':
703  if ($sToken == '<') {
704  // possible comment
705  if (isset($body[$i+2]) && substr($body,$i,3) == '!--') {
706  $i = strpos($body,'-->',$i+3);
707  if ($i === false) { // no end comment
708  $i = strlen($body);
709  }
710  $sToken = '';
711  }
712  } else {
713  $content .= $char;
714  }
715  break;
716  default:
717  if ($bEndTag) {
718  $sToken .= $char;
719  } else {
720  $content .= $char;
721  }
722  break;
723  }
724  }
725  if ($bSucces == FALSE){
726  return array(FALSE, strlen($body));
727  }
728 
729 
730 
737  $content = preg_replace("|body(\s*\{.*?\})|si", ".bodyclass\\1", $content);
738 
742  // $content = preg_replace("|url\s*\(\s*([\'\"])\s*\S+script\s*:.*?([\'\"])\s*\)|si",
743  // "url(\\1$trans_image_path\\2)", $content);
744 
745  // first check for 8bit sequences and disallowed control characters
746  if (preg_match('/[\16-\37\200-\377]+/',$content)) {
747  $content = '<!-- style block removed by html filter due to presence of 8bit characters -->';
748  return array($content, $newpos);
749  }
750 
751  // remove @import line
752  $content = preg_replace("/^\s*(@import.*)$/mi","\n<!-- @import rules forbidden -->\n",$content);
753 
754  $content = preg_replace("/(\\\\)?u(\\\\)?r(\\\\)?l(\\\\)?/i", 'url', $content);
755  preg_match_all("/url\s*\((.+)\)/si",$content,$aMatch);
756  if (count($aMatch)) {
757  $aValue = $aReplace = array();
758  foreach($aMatch[1] as $sMatch) {
759  // url value
760  $urlvalue = $sMatch;
761  tln_fixurl('style',$urlvalue, $trans_image_path, $block_external_images);
762  $aValue[] = $sMatch;
763  $aReplace[] = $urlvalue;
764  }
765  $content = str_replace($aValue,$aReplace,$content);
766  }
767 
771  $contentTemp = $content;
772  tln_defang($contentTemp);
773  tln_unspace($contentTemp);
774 
775  $match = array('/\/\*.*\*\//',
776  '/expression/i',
777  '/behaviou*r/i',
778  '/binding/i',
779  '/include-source/i',
780  '/javascript/i',
781  '/script/i',
782  '/position/i');
783  $replace = array('','idiocy', 'idiocy', 'idiocy', 'idiocy', 'idiocy', 'idiocy', '');
784  $contentNew = preg_replace($match, $replace, $contentTemp);
785  if ($contentNew !== $contentTemp) {
786  $content = $contentNew;
787  }
788  return array($content, $newpos);
789 }
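
Two of the rewrites in the listing are plain preg_replace() calls and can be tried on their own: retargeting body rules at .bodyclass and replacing @import lines. The sketch below uses the same patterns on a made-up stylesheet string; it does not cover the url() fixing or the keyword replacement that tln_fixstyle() also performs.

```php
<?php
// Made-up stylesheet content for illustration.
$css = "@import url(evil.css);\nbody { color: red }";

// Retarget rules for body at the .bodyclass wrapper.
$css = preg_replace('|body(\s*\{.*?\})|si', '.bodyclass\\1', $css);

// Replace @import lines with a marker comment.
$css = preg_replace('/^\s*(@import.*)$/mi', "\n<!-- @import rules forbidden -->\n", $css);

echo $css, "\n";
```

After both calls the string no longer contains the @import rule, and the body rule reads ".bodyclass { color: red }".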

◆ tln_fixurl()

tln_fixurl (   $attname,
  &$attvalue,
  $trans_image_path,
  $block_external_images 
)

Replace empty src values with the blank image. src is only used for frames, images, and image inputs. Doing a replace should not stop them from working as they should; however, it will stop IE from being kicked off when the src of an img tag is not set.

Definition at line 598 of file htmlfilter.php.

Referenced by tln_fixatts(), and tln_fixstyle().

599 {
600  $sQuote = '"';
601  $attvalue = trim($attvalue);
602  if ($attvalue && ($attvalue[0] =='"'|| $attvalue[0] == "'")) {
603  // remove the double quotes
604  $sQuote = $attvalue[0];
605  $attvalue = trim(substr($attvalue,1,-1));
606  }
607 
614  if ($attvalue == '') {
615  $attvalue = $sQuote . $trans_image_path . $sQuote;
616  } else {
617  // first, disallow 8 bit characters and control characters
618  if (preg_match('/[\0-\37\200-\377]+/',$attvalue)) {
619  switch ($attname) {
620  case 'href':
621  $attvalue = $sQuote . 'http://invalid-stuff-detected.example.com' . $sQuote;
622  break;
623  default:
624  $attvalue = $sQuote . $trans_image_path . $sQuote;
625  break;
626  }
627  } else {
628  $aUrl = parse_url($attvalue);
629  if (isset($aUrl['scheme'])) {
630  switch(strtolower($aUrl['scheme'])) {
631  case 'mailto':
632  case 'http':
633  case 'https':
634  case 'ftp':
635  if ($attname != 'href') {
636  if ($block_external_images == true) {
637  $attvalue = $sQuote . $trans_image_path . $sQuote;
638  } else {
639  if (!isset($aUrl['path'])) {
640  $attvalue = $sQuote . $trans_image_path . $sQuote;
641  }
642  }
643  } else {
644  $attvalue = $sQuote . $attvalue . $sQuote;
645  }
646  break;
647  case 'outbind':
648  $attvalue = $sQuote . $attvalue . $sQuote;
649  break;
650  case 'cid':
651  $attvalue = $sQuote . $attvalue . $sQuote;
652  break;
653  default:
654  $attvalue = $sQuote . $trans_image_path . $sQuote;
655  break;
656  }
657  } else {
658  if (!isset($aUrl['path']) || $aUrl['path'] != $trans_image_path) {
659  $attvalue = $sQuote . $trans_image_path . $sQuote;
660  }
661  }
662  }
663  }
664 }
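
The core of the listing is a scheme allow-list built on parse_url(). The sketch below isolates that check in a hypothetical helper, sketch_scheme_allowed(). It only reports whether the scheme is one the listing lets through; it does not reproduce the further rules for non-href attributes, external-image blocking, or scheme-less paths.

```php
<?php
// Hypothetical helper: reports whether a URL's scheme is one of those that
// tln_fixurl() lets through; URLs without a scheme are reported as null.
function sketch_scheme_allowed($url)
{
    $allowed = array('mailto', 'http', 'https', 'ftp', 'outbind', 'cid');
    $parts = parse_url($url);
    if (!is_array($parts) || !isset($parts['scheme'])) {
        return null;
    }
    return in_array(strtolower($parts['scheme']), $allowed, true);
}

$https    = sketch_scheme_allowed('HTTPS://example.com/a.png'); // true
$script   = sketch_scheme_allowed('javascript:alert(1)');       // false
$relative = sketch_scheme_allowed('images/a.png');              // null
var_dump($https, $script, $relative);
```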

◆ tln_getnxtag()

tln_getnxtag (   $body,
  $offset 
)

This function looks for the next tag.

Parameters
  string $body: string where to look for the next tag.
  integer $offset: start looking from here.
Returns
array|boolean false if no more tags exist in the body, or an array with the following members:
  • string with the name of the tag
  • array with attributes and their values
  • integer with tag type (1, 2, or 3)
  • integer where the tag starts (starting "<")
  • integer where the tag ends (ending ">")
The first three members will be false if the tag is invalid.

We are here:
  blah blah <tag attribute="value">
  ----------^

There are 3 kinds of tags:

  1. Opening tag, e.g.: <a href="blah">
  2. Closing tag, e.g.: </a>
  3. XHTML-style content-less tag, e.g.: <img src="blah"/>

A comment or an SGML declaration.

Assume tagtype 1 for now. If it's type 3, we'll switch values later.

Look for the next character that is not a word character, "-", or "_" ([^\w\-_]), which will indicate the end of the tag name.

$match can be any of these:
  • '>' indicating the end of the tag entirely.
  • whitespace indicating the end of the tag name.
  • '/' indicating that this is a type-3 XHTML tag.

Whatever else we find there indicates an invalid tag.

This is an XHTML-style tag with a closing / at the end, like so: <img src="blah"/>. Check if it's followed by the closing bracket. If not, then this tag is invalid.

Check if it's whitespace.

This is an invalid tag! Look for the next closing ">".

At this point we're here:
  <tagname attribute="blah">
  --------^

At this point we loop in order to find all attributes.

Non-closed tag.

See if we arrived at a ">" or "/>", which means that we reached the end of the tag.

Yep. So we did.

There are several types of attributes, with optional [:space:] between members:
  • Type 1: attrname[:space:]=[:space:]'CDATA'
  • Type 2: attrname[:space:]=[:space:]"CDATA"
  • Type 3: attrname[:space:]=[:space:]CDATA
  • Type 4: attrname

We leave types 1 and 2 the same. For type 3 we check for '"' and convert it to "&quot;" if needed, then wrap the value in double quotes. Type 4 we convert into attrname="yes".

Looks like body ended before the end of tag.

We arrived at the end of the attribute name. Several things are possible here:
  • '>' means the end of the tag, and this is attribute type 4.
  • '/' if followed by '>' means the same thing as above.
  • whitespace means a lot of things: look at what it's followed by.
  • anything else means the attribute is invalid.

This is an XHTML-style tag with a closing / at the end, like so: <img src="blah"/>. Check if it's followed by the closing bracket. If not, then this tag is invalid.

Skip whitespace and see what we arrive at.

Two things are valid here:
  • '=' means this is attribute type 1, 2, or 3.
  • a word character (\w) means this was attribute type 4.
Anything else we ignore and re-loop. End of tag and invalid stuff will be caught by our checks at the beginning of the loop.

Here are 3 possibilities:
  • "'" means attribute type 1.
  • '"' means attribute type 2.
  • everything else is the content of attribute type 3.

These are hateful. Look for whitespace or ">".

If it's ">" it will be caught at the top.

That was attribute type 4.

An illegal character. Find next '>' and return.

The fact that we got here indicates that the tag end was never found. Return invalid tag indication so it gets stripped.

Definition at line 157 of file htmlfilter.php.

References tln_findnxreg(), tln_findnxstr(), and tln_skipspace().

Referenced by tln_sanitize().

158 {
159  if ($offset > strlen($body)) {
160  return false;
161  }
162  $lt = tln_findnxstr($body, $offset, '<');
163  if ($lt == strlen($body)) {
164  return false;
165  }
171  $pos = tln_skipspace($body, $lt + 1);
172  if ($pos >= strlen($body)) {
173  return array(false, false, false, $lt, strlen($body));
174  }
184  switch (substr($body, $pos, 1)) {
185  case '/':
186  $tagtype = 2;
187  $pos++;
188  break;
189  case '!':
193  if (substr($body, $pos + 1, 2) == '--') {
194  $gt = strpos($body, '-->', $pos);
195  if ($gt === false) {
196  $gt = strlen($body);
197  } else {
198  $gt += 2;
199  }
200  return array(false, false, false, $lt, $gt);
201  } else {
202  $gt = tln_findnxstr($body, $pos, '>');
203  return array(false, false, false, $lt, $gt);
204  }
205  break;
206  default:
211  $tagtype = 1;
212  break;
213  }
214 
218  $regary = tln_findnxreg($body, $pos, '[^\w\-_]');
219  if ($regary == false) {
220  return array(false, false, false, $lt, strlen($body));
221  }
222  list($pos, $tagname, $match) = $regary;
223  $tagname = strtolower($tagname);
224 
233  switch ($match) {
234  case '/':
240  if (substr($body, $pos, 2) == '/>') {
241  $pos++;
242  $tagtype = 3;
243  } else {
244  $gt = tln_findnxstr($body, $pos, '>');
245  $retary = array(false, false, false, $lt, $gt);
246  return $retary;
247  }
248  //intentional fall-through
249  case '>':
250  return array($tagname, false, $tagtype, $lt, $pos);
251  break;
252  default:
256  if (!preg_match('/\s/', $match)) {
260  $gt = tln_findnxstr($body, $lt, '>');
261  return array(false, false, false, $lt, $gt);
262  }
263  break;
264  }
265 
273  $attary = array();
274 
275  while ($pos <= strlen($body)) {
276  $pos = tln_skipspace($body, $pos);
277  if ($pos == strlen($body)) {
281  return array(false, false, false, $lt, $pos);
282  }
287  $matches = array();
288  if (preg_match('%^(\s*)(>|/>)%s', substr($body, $pos), $matches)) {
292  $pos += strlen($matches[1]);
293  if ($matches[2] == '/>') {
294  $tagtype = 3;
295  $pos++;
296  }
297  return array($tagname, $attary, $tagtype, $lt, $pos);
298  }
299 
317  $regary = tln_findnxreg($body, $pos, '[^\w\-_]');
318  if ($regary == false) {
322  return array(false, false, false, $lt, strlen($body));
323  }
324  list($pos, $attname, $match) = $regary;
325  $attname = strtolower($attname);
334  switch ($match) {
335  case '/':
341  if (substr($body, $pos, 2) == '/>') {
342  $pos++;
343  $tagtype = 3;
344  } else {
345  $gt = tln_findnxstr($body, $pos, '>');
346  $retary = array(false, false, false, $lt, $gt);
347  return $retary;
348  }
349  //intentional fall-through
350  case '>':
351  $attary[$attname] = '"yes"';
352  return array($tagname, $attary, $tagtype, $lt, $pos);
353  break;
354  default:
358  $pos = tln_skipspace($body, $pos);
359  $char = substr($body, $pos, 1);
368  if ($char == '=') {
369  $pos++;
370  $pos = tln_skipspace($body, $pos);
377  $quot = substr($body, $pos, 1);
378  if ($quot == '\'') {
379  $regary = tln_findnxreg($body, $pos + 1, '\'');
380  if ($regary == false) {
381  return array(false, false, false, $lt, strlen($body));
382  }
383  list($pos, $attval, $match) = $regary;
384  $pos++;
385  $attary[$attname] = '\'' . $attval . '\'';
386  } elseif ($quot == '"') {
387  $regary = tln_findnxreg($body, $pos + 1, '\"');
388  if ($regary == false) {
389  return array(false, false, false, $lt, strlen($body));
390  }
391  list($pos, $attval, $match) = $regary;
392  $pos++;
393  $attary[$attname] = '"' . $attval . '"';
394  } else {
398  $regary = tln_findnxreg($body, $pos, '[\s>]');
399  if ($regary == false) {
400  return array(false, false, false, $lt, strlen($body));
401  }
402  list($pos, $attval, $match) = $regary;
406  $attval = preg_replace('/\"/s', '&quot;', $attval);
407  $attary[$attname] = '"' . $attval . '"';
408  }
409  } elseif (preg_match('|[\w/>]|', $char)) {
413  $attary[$attname] = '"yes"';
414  } else {
418  $gt = tln_findnxstr($body, $pos, '>');
419  return array(false, false, false, $lt, $gt);
420  }
421  break;
422  }
423  }
428  return array(false, false, false, $lt, strlen($body));
429 }
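
The tag type codes used throughout the listing (1 for an opening tag, 2 for a closing tag, 3 for an XHTML-style content-less tag) can be illustrated with a much simpler classifier. The helper sketch_tag_type() below is hypothetical and assumes it is given one complete, well-formed tag string; unlike tln_getnxtag() it does no attribute parsing or validity checking.

```php
<?php
// Hypothetical helper: returns the tag type code for one complete tag string.
function sketch_tag_type($tag)
{
    if (preg_match('%^<\s*/%', $tag)) {
        return 2; // closing tag: "/" directly after "<"
    }
    if (preg_match('%/>$%', $tag)) {
        return 3; // XHTML-style content-less tag: ends in "/>"
    }
    return 1;     // opening tag
}

$opening = sketch_tag_type('<a href="blah">');
$closing = sketch_tag_type('</a>');
$empty   = sketch_tag_type('<img src="blah"/>');
echo "$opening $closing $empty\n"; // 1 2 3
```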

◆ tln_sanitize()

tln_sanitize (   $body,
  $tag_list,
  $rm_tags_with_content,
  $self_closing_tags,
  $force_tag_closing,
  $rm_attnames,
  $bad_attvals,
  $add_attr_to_tag,
  $trans_image_path,
  $block_external_images 
)
Parameters
  string $body: the HTML you wish to filter
  array $tag_list: see description above
  array $rm_tags_with_content: see description above
  array $self_closing_tags: see description above
  boolean $force_tag_closing: see description above
  array $rm_attnames: see description above
  array $bad_attvals: see description above
  array $add_attr_to_tag: see description above
  string $trans_image_path
  boolean $block_external_images
Returns
string Sanitized html safe to show on your pages.

Normalize rm_tags and rm_tags_with_content.

See if tag_list is a list of tags to remove or a list of tags to allow: false means remove these tags, true means allow only these tags.

Take care of netscape's stupid javascript entities like &{alert('boo')};

Take care of <style>

Got to the end of tag we needed to remove.

$rm_tags_with_content

See if this is a self-closing type and change tagtype appropriately.

See if we should skip this tag and any content inside it.

Convert body into div.

This is where we run other checks.

Definition at line 842 of file htmlfilter.php.

References tln_body2div(), tln_fixatts(), tln_fixstyle(), tln_getnxtag(), and tln_tagprint().

Referenced by HTMLFilter().

853  {
857  $rm_tags = array_shift($tag_list);
858  @array_walk($tag_list, 'tln_casenormalize');
859  @array_walk($rm_tags_with_content, 'tln_casenormalize');
860  @array_walk($self_closing_tags, 'tln_casenormalize');
866  $curpos = 0;
867  $open_tags = array();
868  $trusted = "<!-- begin tln_sanitized html -->\n";
869  $skip_content = false;
874  $body = preg_replace('/&(\{.*?\};)/si', '&amp;\\1', $body);
875  while (($curtag = tln_getnxtag($body, $curpos)) != false) {
876  list($tagname, $attary, $tagtype, $lt, $gt) = $curtag;
877  $free_content = substr($body, $curpos, $lt-$curpos);
881  if ($tagname == "style" && $tagtype == 1){
882  list($free_content, $curpos) =
883  tln_fixstyle($body, $gt+1, $trans_image_path, $block_external_images);
884  if ($free_content != FALSE){
885  if ( !empty($attary) ) {
886  $attary = tln_fixatts($tagname,
887  $attary,
888  $rm_attnames,
889  $bad_attvals,
890  $add_attr_to_tag,
891  $trans_image_path,
892  $block_external_images
893  );
894  }
895  $trusted .= tln_tagprint($tagname, $attary, $tagtype);
896  $trusted .= $free_content;
897  $trusted .= tln_tagprint($tagname, null, 2);
898  }
899  continue;
900  }
901  if ($skip_content == false){
902  $trusted .= $free_content;
903  }
904  if ($tagname != false) {
905  if ($tagtype == 2) {
906  if ($skip_content == $tagname) {
910  $tagname = false;
911  $skip_content = false;
912  } else {
913  if ($skip_content == false) {
914  if ($tagname == "body") {
915  $tagname = "div";
916  }
917  if (isset($open_tags[$tagname]) &&
918  $open_tags[$tagname] > 0
919  ) {
920  $open_tags[$tagname]--;
921  } else {
922  $tagname = false;
923  }
924  }
925  }
926  } else {
930  if ($skip_content == false) {
935  if ($tagtype == 1
936  && in_array($tagname, $self_closing_tags)
937  ) {
938  $tagtype = 3;
939  }
944  if ($tagtype == 1
945  && in_array($tagname, $rm_tags_with_content)
946  ) {
947  $skip_content = $tagname;
948  } else {
949  if (($rm_tags == false
950  && in_array($tagname, $tag_list)) ||
951  ($rm_tags == true
952  && !in_array($tagname, $tag_list))
953  ) {
954  $tagname = false;
955  } else {
959  if ($tagname == "body"){
960  $tagname = "div";
961  $attary = tln_body2div($attary, $trans_image_path);
962  }
963  if ($tagtype == 1) {
964  if (isset($open_tags{$tagname})) {
965  $open_tags{$tagname}++;
966  } else {
967  $open_tags{$tagname} = 1;
968  }
969  }
973  if (is_array($attary) && sizeof($attary) > 0) {
974  $attary = tln_fixatts(
975  $tagname,
976  $attary,
977  $rm_attnames,
978  $bad_attvals,
979  $add_attr_to_tag,
980  $trans_image_path,
981  $block_external_images
982  );
983  }
984  }
985  }
986  }
987  }
988  if ($tagname != false && $skip_content == false) {
989  $trusted .= tln_tagprint($tagname, $attary, $tagtype);
990  }
991  }
992  $curpos = $gt + 1;
993  }
994  $trusted .= substr($body, $curpos, strlen($body) - $curpos);
995  if ($force_tag_closing == true) {
996  foreach ($open_tags as $tagname => $opentimes) {
997  while ($opentimes > 0) {
998  $trusted .= '</' . $tagname . '>';
999  $opentimes--;
1000  }
1001  }
1002  $trusted .= "\n";
1003  }
1004  $trusted .= "<!-- end tln_sanitized html -->\n";
1005  return $trusted;
1006 }
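The open-tag bookkeeping in the listing above can be illustrated in isolation. The following is a minimal sketch, not code from htmlfilter.php: the helper names `track_tag_sketch` and `close_open_tags_sketch` are hypothetical, and the tag types are assumed to be 1 for an opening tag and 2 for a closing tag, as the listing suggests.

```php
<?php
// Hypothetical helpers; they only mirror the $open_tags counter logic.

// Records an opening tag, or consumes a closing tag if one is pending.
// Returns false for a stray closing tag, which the filter would drop.
function track_tag_sketch(array &$open_tags, string $tagname, int $tagtype): bool
{
    if ($tagtype === 1) {
        $open_tags[$tagname] = ($open_tags[$tagname] ?? 0) + 1;
        return true;
    }
    if ($tagtype === 2) {
        if (($open_tags[$tagname] ?? 0) > 0) {
            $open_tags[$tagname]--;
            return true;
        }
        return false;
    }
    return true;
}

// Emits one closing tag per still-open occurrence, in the order in
// which each tag name was first opened.
function close_open_tags_sketch(array $open_tags): string
{
    $out = '';
    foreach ($open_tags as $tagname => $opentimes) {
        $out .= str_repeat('</' . $tagname . '>', max(0, $opentimes));
    }
    return $out;
}

$open = [];
track_tag_sketch($open, 'div', 1);          // <div>
track_tag_sketch($open, 'b', 1);            // <b>
track_tag_sketch($open, 'b', 2);            // </b>
$stray = track_tag_sketch($open, 'i', 2);   // </i> with no matching <i>
echo close_open_tags_sketch($open), "\n";   // </div>
```

Because the counters are keyed by tag name and walked in insertion order, tags left open as `<div><b>` would be closed as `</div></b>` rather than in reverse nesting order; the sketch reproduces that behaviour.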

◆ tln_skipspace()

tln_skipspace (   $body,
  $offset 
)

This function skips any whitespace starting at the given position within a string and returns the position of the next non-whitespace character.

Parameters
string   $body    the string
integer  $offset  the offset within the string where we should start looking for the next non-whitespace character.
Returns
integer the location within the $body where the next non-whitespace char is located.

Definition at line 84 of file htmlfilter.php.

Referenced by tln_getnxtag().

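function tln_skipspace($body, $offset)
{
    // Match the (possibly empty) run of whitespace starting at $offset.
    preg_match('/^(\s*)/s', substr($body, $offset), $matches);
    // $matches[1] is a string, so test it with strlen(); sizeof() on a
    // string throws a TypeError as of PHP 8.
    if (isset($matches[1]) && strlen($matches[1])) {
        $count = strlen($matches[1]);
        $offset += $count;
    }
    return $offset;
}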
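The same offset arithmetic can be expressed without a regular expression. The sketch below is a hypothetical alternative, not the library's code: `skipspace_sketch` is an invented name, and it assumes that only ASCII whitespace (space, tab, CR, LF, form feed, vertical tab) needs to be skipped.

```php
<?php
// Hypothetical alternative to tln_skipspace(); not part of htmlfilter.php.
function skipspace_sketch(string $body, int $offset): int
{
    // strspn() counts the leading run of mask characters starting at
    // $offset, so no substring copy is needed.
    return $offset + strspn($body, " \t\r\n\f\v", $offset);
}

$body = "<a   href=\"x\">";
$pos = skipspace_sketch($body, 2);   // start just after "<a"
echo $pos, "\n";                      // 5
echo $body[$pos], "\n";               // h
```

If the character at `$offset` is not whitespace, the offset is returned unchanged, which matches the behaviour described for tln_skipspace.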

◆ tln_tagprint()

tln_tagprint (   $tagname,
  $attary,
  $tagtype 
)

htmlfilter.inc

This set of functions allows you to filter html in order to remove any malicious tags from it.

Useful in cases when you need to filter user input for any cross-site-scripting attempts.

Copyright (C) 2002-2004 by Duke University

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA

Author
Konstantin Riabitsev <icon@linux.duke.edu>
Jim Jagielski <jim@jaguNET.com / jimjag@gmail.com>
Version
1.1 ($Date$)

This function returns the final tag out of the tag name, an array of attributes, and the type of the tag. This function is called by tln_sanitize internally.

Parameters
string   $tagname  the name of the tag.
array    $attary   the array of attributes and their values
integer  $tagtype  the type of the tag: 2 produces a closing tag, 3 a self-closing tag, and any other value an opening tag (tln_sanitize passes 1).
Returns
string A string with the final tag representation.

Definition at line 41 of file htmlfilter.php.


Referenced by tln_sanitize().

function tln_tagprint($tagname, $attary, $tagtype)
{
    if ($tagtype == 2) {
        // Closing tag: attributes are ignored.
        $fulltag = '</' . $tagname . '>';
    } else {
        $fulltag = '<' . $tagname;
        if (is_array($attary) && sizeof($attary)) {
            $atts = array();
            foreach($attary as $attname => $attvalue) {
                array_push($atts, "$attname=$attvalue");
            }
            $fulltag .= ' ' . join(' ', $atts);
        }
        if ($tagtype == 3) {
            // Self-closing tag.
            $fulltag .= ' /';
        }
        $fulltag .= '>';
    }
    return $fulltag;
}
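To make the three tag types concrete, here is a compact sketch of the same assembly logic under a hypothetical name, `tagprint_sketch`. It assumes that attribute values already carry their surrounding quotes, which the `"$attname=$attvalue"` concatenation in the listing implies, since no quoting is added at this stage.

```php
<?php
// Hypothetical re-statement of the tag assembly; not the library function.
function tagprint_sketch(string $tagname, ?array $attary, int $tagtype): string
{
    if ($tagtype === 2) {
        return '</' . $tagname . '>';          // closing tag, no attributes
    }
    $fulltag = '<' . $tagname;
    foreach ($attary ?? [] as $attname => $attvalue) {
        // Values are emitted verbatim, so they must already be quoted.
        $fulltag .= ' ' . $attname . '=' . $attvalue;
    }
    if ($tagtype === 3) {
        $fulltag .= ' /';                      // self-closing tag
    }
    return $fulltag . '>';
}

echo tagprint_sketch('a', ['href' => '"https://example.org/"'], 1), "\n";
// <a href="https://example.org/">
echo tagprint_sketch('a', null, 2), "\n";
// </a>
echo tagprint_sketch('br', [], 3), "\n";
// <br />
```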

◆ tln_unspace()

tln_unspace ( & $attvalue)

Kill any tabs, newlines, carriage returns, NUL bytes, and spaces. The value is passed by reference and modified in place.

Our friends the makers of the browser with 95% market share decided that it'd be funny to make "java[tab]script" be just as good as "javascript".

Parameters
string  $attvalue  The attribute value before extraneous spaces are removed.

Definition at line 491 of file htmlfilter.php.


Referenced by tln_fixatts(), and tln_fixstyle().

function tln_unspace(&$attvalue)
{
    // Only rewrite the value if it contains one of the stripped characters.
    if (strcspn($attvalue, "\t\r\n\0 ") != strlen($attvalue)) {
        $attvalue = str_replace(
            array("\t", "\r", "\n", "\0", " "),
            array('', '', '', '', ''),
            $attvalue
        );
    }
}
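The "java[tab]script" case can be reproduced with a short sketch. The function name `unspace_sketch` is hypothetical; it applies the same character set as the listing above but skips the `strcspn()` pre-check, which only avoids an unnecessary rewrite.

```php
<?php
// Hypothetical equivalent of the stripping step; not the library function.
function unspace_sketch(string &$attvalue): void
{
    // Remove tab, CR, LF, NUL and space; the caller's variable is updated
    // because the parameter is taken by reference.
    $attvalue = str_replace(["\t", "\r", "\n", "\0", " "], '', $attvalue);
}

$value = "java\tscript:alert(1)";
unspace_sketch($value);
echo $value, "\n";   // javascript:alert(1)
```

Note that ordinary spaces are removed as well, so a value such as `a b` becomes `ab`. The result is therefore suited to checking a value against a blocklist rather than to being written back out unchanged.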