ILIAS  release_5-1 Revision 5.0.0-5477-g43f3e3fab5f
htmlfilter.php File Reference


Functions

 tln_tagprint ($tagname, $attary, $tagtype)
 This function returns the final tag out of the tag name, an array of attributes, and the type of the tag.
 
 tln_casenormalize (&$val)
 A small helper function to use with array_walk.
 
 tln_skipspace ($body, $offset)
 This function skips any whitespace from the current position within a string and to the next non-whitespace value.
 
 tln_findnxstr ($body, $offset, $needle)
 This function looks for the next character within a string.
 
 tln_findnxreg ($body, $offset, $reg)
 This function takes a PCRE-style regexp and tries to match it within the string.
 
 tln_getnxtag ($body, $offset)
 This function looks for the next tag.
 
 tln_deent (&$attvalue, $regex, $hex=false)
 Translates entities into literal values so they can be checked.
 
 tln_defang (&$attvalue)
 This function checks attribute values for entity-encoded values and returns them translated into 8-bit strings so we can run checks on them.
 
 tln_unspace (&$attvalue)
 Kill any tabs, newlines, or carriage returns.
 
 tln_fixatts ( $tagname, $attary, $rm_attnames, $bad_attvals, $add_attr_to_tag, $trans_image_path, $block_external_images)
 This function runs various checks against the attributes.
 
 tln_fixurl ($attname, &$attvalue, $trans_image_path, $block_external_images)
 
 tln_fixstyle ($body, $pos, $trans_image_path, $block_external_images)
 
 tln_body2div ($attary, $trans_image_path)
 
 tln_sanitize ( $body, $tag_list, $rm_tags_with_content, $self_closing_tags, $force_tag_closing, $rm_attnames, $bad_attvals, $add_attr_to_tag, $trans_image_path, $block_external_images)
 
 HTMLFilter ($body, $trans_image_path, $block_external_images=false)
 

Function Documentation

◆ HTMLFilter()

HTMLFilter (   $body,
  $trans_image_path,
  $block_external_images = false 
)

Definition at line 1013 of file htmlfilter.php.

1014{
1015
1016 $tag_list = array(
1017 false,
1018 "object",
1019 "meta",
1020 "html",
1021 "head",
1022 "base",
1023 "link",
1024 "frame",
1025 "iframe",
1026 "plaintext",
1027 "marquee"
1028 );
1029
1030 $rm_tags_with_content = array(
1031 "script",
1032 "applet",
1033 "embed",
1034 "title",
1035 "frameset",
1036 "xmp",
1037 "xml"
1038 );
1039
1040 $self_closing_tags = array(
1041 "img",
1042 "br",
1043 "hr",
1044 "input",
1045 "outbind"
1046 );
1047
1048 $force_tag_closing = true;
1049
1050 $rm_attnames = array(
1051 "/.*/" =>
1052 array(
1053 // "/target/i",
1054 "/^on.*/i",
1055 "/^dynsrc/i",
1056 "/^data.*/i",
1057 "/^lowsrc.*/i"
1058 )
1059 );
1060
1061 $bad_attvals = array(
1062 "/.*/" =>
1063 array(
1064 "/^src|background/i" =>
1065 array(
1066 array(
1067 '/^([\'"])\s*\S+script\s*:.*([\'"])/si',
1068 '/^([\'"])\s*mocha\s*:*.*([\'"])/si',
1069 '/^([\'"])\s*about\s*:.*([\'"])/si'
1070 ),
1071 array(
1072 "\\1$trans_image_path\\2",
1073 "\\1$trans_image_path\\2",
1074 "\\1$trans_image_path\\2"
1075 )
1076 ),
1077 "/^href|action/i" =>
1078 array(
1079 array(
1080 '/^([\'"])\s*\S+script\s*:.*([\'"])/si',
1081 '/^([\'"])\s*mocha\s*:*.*([\'"])/si',
1082 '/^([\'"])\s*about\s*:.*([\'"])/si'
1083 ),
1084 array(
1085 "\\1#\\1",
1086 "\\1#\\1",
1087 "\\1#\\1"
1088 )
1089 ),
1090 "/^style/i" =>
1091 array(
1092 array(
1093 "/\/\*.*\*\//",
1094 "/expression/i",
1095 "/binding/i",
1096 "/behaviou*r/i",
1097 "/include-source/i",
1098 '/position\s*:/i',
1099 '/(\\\\)?u(\\\\)?r(\\\\)?l(\\\\)?/i',
1100 '/url\s*\(\s*([\'"])\s*\S+script\s*:.*([\'"])\s*\)/si',
1101 '/url\s*\(\s*([\'"])\s*mocha\s*:.*([\'"])\s*\)/si',
1102 '/url\s*\(\s*([\'"])\s*about\s*:.*([\'"])\s*\)/si',
1103 '/(.*)\s*:\s*url\s*\(\s*([\'"]*)\s*\S+script\s*:.*([\'"]*)\s*\)/si'
1104 ),
1105 array(
1106 "",
1107 "idiocy",
1108 "idiocy",
1109 "idiocy",
1110 "idiocy",
1111 "idiocy",
1112 "url",
1113 "url(\\1#\\1)",
1114 "url(\\1#\\1)",
1115 "url(\\1#\\1)",
1116 "\\1:url(\\2#\\3)"
1117 )
1118 )
1119 )
1120 );
1121
1122 if ($block_external_images) {
1123 array_push(
1124 $bad_attvals{'/.*/'}{'/^src|background/i'}[0],
1125 '/^([\'\"])\s*https*:.*([\'\"])/si'
1126 );
1127 array_push(
1128 $bad_attvals{'/.*/'}{'/^src|background/i'}[1],
1129 "\\1$trans_image_path\\1"
1130 );
1131 array_push(
1132 $bad_attvals{'/.*/'}{'/^style/i'}[0],
1133 '/url\(([\'\"])\s*https*:.*([\'\"])\)/si'
1134 );
1135 array_push(
1136 $bad_attvals{'/.*/'}{'/^style/i'}[1],
1137 "url(\\1$trans_image_path\\1)"
1138 );
1139 }
1140
1141 $add_attr_to_tag = array(
1142 "/^a$/i" =>
1143 array('target' => '"_blank"')
1144 );
1145
1146 $trusted = tln_sanitize(
1147 $body,
1148 $tag_list,
1149 $rm_tags_with_content,
1150 $self_closing_tags,
1151 $force_tag_closing,
1152 $rm_attnames,
1153 $bad_attvals,
1154 $add_attr_to_tag,
1155 $trans_image_path,
1156 $block_external_images
1157 );
1158 return $trusted;
1159}

References tln_sanitize().
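
Each entry of $bad_attvals above pairs an array of match patterns with an array of replacements, which tln_fixatts() hands to preg_replace(). The following is a minimal sketch of that pairing in isolation, using one of the src patterns from the listing. The path 'blank.png' is a hypothetical placeholder for whatever the caller passes as $trans_image_path.

```php
<?php
// Hypothetical placeholder for $trans_image_path.
$blank = 'blank.png';

// Parallel arrays: pattern N is replaced by replacement N.
$patterns = array('/^([\'"])\s*\S+script\s*:.*([\'"])/si');
$replacements = array("\\1$blank\\2");

// A script URL in a quoted attribute value is swapped for the blank image,
// keeping the original quote characters.
$result = preg_replace($patterns, $replacements, '"javascript:alert(1)"');

// A harmless value does not match and passes through unchanged.
$untouched = preg_replace($patterns, $replacements, '"photo.jpg"');
```

Note that a replacement path beginning with a digit would be read as part of the backreference number, so this sketch assumes the path starts with a non-digit.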


◆ tln_body2div()

tln_body2div (   $attary,
  $trans_image_path 
)

Definition at line 791 of file htmlfilter.php.

792{
793 $divattary = array('class' => "'bodyclass'");
794 $text = '#000000';
795 $has_bgc_stl = $has_txt_stl = false;
796 $styledef = '';
797 if (is_array($attary) && sizeof($attary) > 0){
798 foreach ($attary as $attname=>$attvalue){
799 $quotchar = substr($attvalue, 0, 1);
800 $attvalue = str_replace($quotchar, "", $attvalue);
801 switch ($attname){
802 case 'background':
803 $styledef .= "background-image: url('$trans_image_path'); ";
804 break;
805 case 'bgcolor':
806 $has_bgc_stl = true;
807 $styledef .= "background-color: $attvalue; ";
808 break;
809 case 'text':
810 $has_txt_stl = true;
811 $styledef .= "color: $attvalue; ";
812 break;
813 }
814 }
815 // Outlook defines a white bgcolor and no text color. This can lead to
816 // white text on a white bg with certain themes.
817 if ($has_bgc_stl && !$has_txt_stl) {
818 $styledef .= "color: $text; ";
819 }
820 if (strlen($styledef) > 0){
821 $divattary{"style"} = "\"$styledef\"";
822 }
823 }
824 return $divattary;
825}

References $text.

Referenced by tln_sanitize().


◆ tln_casenormalize()

tln_casenormalize (&$val)

A small helper function to use with array_walk.

Modifies a by-ref value and makes it lowercase.

Parameters
string $val  a value passed by-ref.
Returns
void since it modifies a by-ref value.

Definition at line 69 of file htmlfilter.php.

70{
71 $val = strtolower($val);
72}
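
As a rough illustration of how a by-reference callback of this shape works with array_walk, consider the sketch below. The name lowercase_in_place is a hypothetical stand-in and not part of htmlfilter.php.

```php
<?php
// Hypothetical stand-in for tln_casenormalize(): lowercases a value in place.
function lowercase_in_place(&$val)
{
    $val = strtolower($val);
}

$tags = array('SCRIPT', 'Img', 'br');

// array_walk passes each element by reference, so the array itself is modified.
array_walk($tags, 'lowercase_in_place');
```

After the call, $tags holds the lowercase names, which is how tln_sanitize() normalizes its tag lists before comparing them with parsed tag names.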

◆ tln_deent()

tln_deent ( &$attvalue,
  $regex,
  $hex = false 
)

Translates entities into literal values so they can be checked.

Parameters
string $attvalue  the by-ref value to check.
string $regex  the regular expression to check against.
boolean $hex  whether the entities are hexadecimal.
Returns
boolean True or False depending on whether there were matches.

Definition at line 439 of file htmlfilter.php.

440{
441 preg_match_all($regex, $attvalue, $matches);
442 if (is_array($matches) && sizeof($matches[0]) > 0) {
443 $repl = array();
444 for ($i = 0; $i < sizeof($matches[0]); $i++) {
445 $numval = $matches[1][$i];
446 if ($hex) {
447 $numval = hexdec($numval);
448 }
449 $repl{$matches[0][$i]} = chr($numval);
450 }
451 $attvalue = strtr($attvalue, $repl);
452 return true;
453 } else {
454 return false;
455 }
456}
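
The decoding step can be tried on its own with a small re-statement of the same idea. This is a sketch under stated assumptions, not the file's code: the function name decode_numeric_entities_sketch and the sample inputs are hypothetical, and the hex pattern is a simplified variant of the one used by tln_defang().

```php
<?php
// Hypothetical re-statement of the technique: replace every numeric entity
// matched by $regex with the single character it encodes.
function decode_numeric_entities_sketch(&$value, $regex, $hex = false)
{
    if (!preg_match_all($regex, $value, $matches) || count($matches[0]) === 0) {
        return false; // nothing to decode
    }
    $repl = array();
    foreach ($matches[0] as $i => $entity) {
        $num = $matches[1][$i];
        $repl[$entity] = chr($hex ? hexdec($num) : (int) $num);
    }
    $value = strtr($value, $repl);
    return true;
}

// Decimal entity: &#106; encodes "j".
$decimal = '&#106;avascript:';
decode_numeric_entities_sketch($decimal, '/&#0*(\d+);*/s');

// Hexadecimal entity: &#x6A; also encodes "j".
$hex = '&#x6A;avascript:';
decode_numeric_entities_sketch($hex, '/&#x0*([0-9a-f]+);*/si', true);
```

Both inputs end up as plain text that later pattern checks can match directly.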

Referenced by tln_defang().


◆ tln_defang()

tln_defang (&$attvalue)

This function checks attribute values for entity-encoded values and returns them translated into 8-bit strings so we can run checks on them.

Parameters
string $attvalue  A string to run entity check against.

Skip this if there aren't ampersands or backslashes.

Definition at line 465 of file htmlfilter.php.

466{
470 if (strpos($attvalue, '&') === false
471 && strpos($attvalue, '\\') === false
472 ) {
473 return;
474 }
475 do {
476 $m = false;
477 $m = $m || tln_deent($attvalue, '/\&#0*(\d+);*/s');
478 $m = $m || tln_deent($attvalue, '/\&#x0*((\d|[a-f])+);*/si', true);
479 $m = $m || tln_deent($attvalue, '/\\\\(\d+)/s', true);
480 } while ($m == true);
481 $attvalue = stripslashes($attvalue);
482}

References tln_deent().

Referenced by tln_fixatts(), and tln_fixstyle().
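
The do/while loop in tln_defang() repeats decoding until a pass finds nothing more, which matters when an entity is itself entity-encoded. The sketch below shows that fixed-point idea for decimal entities only. It uses preg_replace_callback rather than tln_deent(), and the function name is hypothetical.

```php
<?php
// Hypothetical sketch: decode decimal entities repeatedly until the value
// stops changing, so doubly encoded input is fully unwrapped.
function decode_until_stable_sketch($value)
{
    do {
        $before = $value;
        $value = preg_replace_callback(
            '/&#0*(\d+);*/',
            function ($m) {
                return chr((int) $m[1]);
            },
            $value
        );
    } while ($value !== $before);
    return $value;
}

// "&#38;" is "&", so the first pass yields "&#106;avascript"
// and the second pass yields "javascript".
$decoded = decode_until_stable_sketch('&#38;#106;avascript');
```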


◆ tln_findnxreg()

tln_findnxreg (   $body,
  $offset,
  $reg 
)

This function takes a PCRE-style regexp and tries to match it within the string.

Parameters
string $body  The string to look for needle in.
integer $offset  Start looking from here.
string $reg  A PCRE-style regex to match.
Returns
array|boolean Returns a false if no matches found, or an array with the following members:
  • integer with the location of the match within $body
  • string with whatever content between offset and the match
  • string with whatever it is we matched

Definition at line 127 of file htmlfilter.php.

128{
129 $matches = array();
130 $retarr = array();
131 $preg_rule = '%^(.*?)(' . $reg . ')%s';
132 preg_match($preg_rule, substr($body, $offset), $matches);
133 if (!isset($matches[0]) || !$matches[0]) {
134 $retarr = false;
135 } else {
136 $retarr[0] = $offset + strlen($matches[1]);
137 $retarr[1] = $matches[1];
138 $retarr[2] = $matches[2];
139 }
140 return $retarr;
141}
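
To make the three-member return value concrete, here is a small sketch that follows the same approach. The function name find_next_regex_sketch and the sample input are hypothetical.

```php
<?php
// Hypothetical re-statement of the lookup: find the first match of $reg at or
// after $offset and report its position, the text before it, and the match.
function find_next_regex_sketch($body, $offset, $reg)
{
    $rule = '%^(.*?)(' . $reg . ')%s';
    if (!preg_match($rule, substr($body, $offset), $m) || $m[0] === '') {
        return false;
    }
    return array($offset + strlen($m[1]), $m[1], $m[2]);
}

// Starting just after "<", the first character that cannot be part of a tag
// name is the space at position 4; "img" is the text skipped on the way there.
$result = find_next_regex_sketch('<img src=x>', 1, '[^\w\-_]');
```

This is the shape tln_getnxtag() relies on when it splits a tag into its name and the character that ended the name.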

Referenced by tln_getnxtag().


◆ tln_findnxstr()

tln_findnxstr (   $body,
  $offset,
  $needle 
)

This function looks for the next character within a string.

It's really just a glorified "strpos", except it catches the failures nicely.

Parameters
string $body  The string to look for needle in.
integer $offset  Start looking from this position.
string $needle  The character/string to look for.
Returns
integer location of the next occurrence of the needle, or strlen($body) if needle wasn't found.

Definition at line 105 of file htmlfilter.php.

106{
107 $pos = strpos($body, $needle, $offset);
108 if ($pos === false) {
109 $pos = strlen($body);
110 }
111 return $pos;
112}
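
The difference from a bare strpos() call is easiest to see side by side. A minimal sketch, with a hypothetical function name:

```php
<?php
// Hypothetical stand-in: like strpos(), but returns strlen($body)
// instead of false when the needle is missing.
function find_next_string_sketch($body, $offset, $needle)
{
    $pos = strpos($body, $needle, $offset);
    return $pos === false ? strlen($body) : $pos;
}

$found = find_next_string_sketch('a <b> c', 0, '<');
$missing = find_next_string_sketch('no tags here', 0, '<');
```

Because a miss yields the string length rather than false, callers can compare the result against strlen($body) without a strict type check.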

Referenced by tln_getnxtag().


◆ tln_fixatts()

tln_fixatts (   $tagname,
  $attary,
  $rm_attnames,
  $bad_attvals,
  $add_attr_to_tag,
  $trans_image_path,
  $block_external_images 
)

This function runs various checks against the attributes.

Parameters
string $tagname  String with the name of the tag.
array $attary  Array with all tag attributes.
array $rm_attnames  See description for tln_sanitize
array $bad_attvals  See description for tln_sanitize
array $add_attr_to_tag  See description for tln_sanitize
string $trans_image_path
boolean $block_external_images
Returns
array with modified attributes.

See if this attribute should be removed.

Remove any backslashes, entities, or extraneous whitespace.

Now let's run checks on the attvalues. I don't expect anyone to comprehend this. If you do, get in touch with me so I can drive to where you live and shake your hand personally. :)

There are two arrays in valary: the first holds the match patterns, the second the replacements.

See if we need to append any attributes to this tag.

Definition at line 514 of file htmlfilter.php.

522 {
523 while (list($attname, $attvalue) = each($attary)) {
527 foreach ($rm_attnames as $matchtag => $matchattrs) {
528 if (preg_match($matchtag, $tagname)) {
529 foreach ($matchattrs as $matchattr) {
530 if (preg_match($matchattr, $attname)) {
531 unset($attary{$attname});
532 continue;
533 }
534 }
535 }
536 }
540 $oldattvalue = $attvalue;
541 tln_defang($attvalue);
542 if ($attname == 'style' && $attvalue !== $oldattvalue) {
543 $attvalue = "idiocy";
544 $attary{$attname} = $attvalue;
545 }
546 tln_unspace($attvalue);
547
554 foreach ($bad_attvals as $matchtag => $matchattrs) {
555 if (preg_match($matchtag, $tagname)) {
556 foreach ($matchattrs as $matchattr => $valary) {
557 if (preg_match($matchattr, $attname)) {
563 list($valmatch, $valrepl) = $valary;
564 $newvalue = preg_replace($valmatch, $valrepl, $attvalue);
565 if ($newvalue != $attvalue) {
566 $attary{$attname} = $newvalue;
567 $attvalue = $newvalue;
568 }
569 }
570 }
571 }
572 }
573 if ($attname == 'style') {
574 if (preg_match('/[\0-\37\200-\377]+/', $attvalue)) {
575 $attary{$attname} = '"disallowed character"';
576 }
577 preg_match_all("/url\s*\((.+)\)/si", $attvalue, $aMatch);
578 if (count($aMatch)) {
579 foreach($aMatch[1] as $sMatch) {
580 $urlvalue = $sMatch;
581 tln_fixurl($attname, $urlvalue, $trans_image_path, $block_external_images);
582 $attary{$attname} = str_replace($sMatch, $urlvalue, $attvalue);
583 }
584 }
585 }
586 }
590 foreach ($add_attr_to_tag as $matchtag => $addattary) {
591 if (preg_match($matchtag, $tagname)) {
592 $attary = array_merge($attary, $addattary);
593 }
594 }
595 return $attary;
596}

References tln_defang(), tln_fixurl(), and tln_unspace().

Referenced by tln_sanitize().


◆ tln_fixstyle()

tln_fixstyle (   $body,
  $pos,
  $trans_image_path,
  $block_external_images 
)

First look for general BODY style declaration, which would be like so: body {background: blah-blah} and change it to .bodyclass so we can just assign it to a <div>.

Fix url('blah') declarations.
Remove any backslashes, entities, and extraneous whitespace.

Definition at line 666 of file htmlfilter.php.

667{
668 // workaround for </style> in between comments
669 $content = '';
670 $sToken = '';
671 $bSucces = false;
672 $bEndTag = false;
673 for ($i=$pos,$iCount=strlen($body);$i<$iCount;++$i) {
674 $char = $body{$i};
675 switch ($char) {
676 case '<':
677 $sToken = $char;
678 break;
679 case '/':
680 if ($sToken == '<') {
681 $sToken .= $char;
682 $bEndTag = true;
683 } else {
684 $content .= $char;
685 }
686 break;
687 case '>':
688 if ($bEndTag) {
689 $sToken .= $char;
690 if (preg_match('/<\/\s*style\s*>/i',$sToken,$aMatch)) {
691 $newpos = $i + 1;
692 $bSucces = true;
693 break 2;
694 } else {
695 $content .= $sToken;
696 }
697 $bEndTag = false;
698 } else {
699 $content .= $char;
700 }
701 break;
702 case '!':
703 if ($sToken == '<') {
704 // possible comment
705 if (isset($body{$i+2}) && substr($body,$i,3) == '!--') {
706 $i = strpos($body,'-->',$i+3);
707 if ($i === false) { // no end comment
708 $i = strlen($body);
709 }
710 $sToken = '';
711 }
712 } else {
713 $content .= $char;
714 }
715 break;
716 default:
717 if ($bEndTag) {
718 $sToken .= $char;
719 } else {
720 $content .= $char;
721 }
722 break;
723 }
724 }
725 if ($bSucces == FALSE){
726 return array(FALSE, strlen($body));
727 }
728
729
730
737 $content = preg_replace("|body(\s*\{.*?\})|si", ".bodyclass\\1", $content);
738
742 // $content = preg_replace("|url\s*\(\s*([\'\"])\s*\S+script\s*:.*?([\'\"])\s*\)|si",
743 // "url(\\1$trans_image_path\\2)", $content);
744
745 // first check for 8bit sequences and disallowed control characters
746 if (preg_match('/[\16-\37\200-\377]+/',$content)) {
747 $content = '<!-- style block removed by html filter due to presence of 8bit characters -->';
748 return array($content, $newpos);
749 }
750
751 // remove @import line
752 $content = preg_replace("/^\s*(@import.*)$/mi","\n<!-- @import rules forbidden -->\n",$content);
753
754 $content = preg_replace("/(\\\\)?u(\\\\)?r(\\\\)?l(\\\\)?/i", 'url', $content);
755 preg_match_all("/url\s*\((.+)\)/si",$content,$aMatch);
756 if (count($aMatch)) {
757 $aValue = $aReplace = array();
758 foreach($aMatch[1] as $sMatch) {
759 // url value
760 $urlvalue = $sMatch;
761 tln_fixurl('style',$urlvalue, $trans_image_path, $block_external_images);
762 $aValue[] = $sMatch;
763 $aReplace[] = $urlvalue;
764 }
765 $content = str_replace($aValue,$aReplace,$content);
766 }
767
771 $contentTemp = $content;
772 tln_defang($contentTemp);
773 tln_unspace($contentTemp);
774
775 $match = array('/\/\*.*\*\//',
776 '/expression/i',
777 '/behaviou*r/i',
778 '/binding/i',
779 '/include-source/i',
780 '/javascript/i',
781 '/script/i',
782 '/position/i');
783 $replace = array('','idiocy', 'idiocy', 'idiocy', 'idiocy', 'idiocy', 'idiocy', '');
784 $contentNew = preg_replace($match, $replace, $contentTemp);
785 if ($contentNew !== $contentTemp) {
786 $content = $contentNew;
787 }
788 return array($content, $newpos);
789}
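
The first rewrite step, renaming a body selector to .bodyclass, can be tried in isolation. The sketch below applies the same pattern as line 737 of the listing to a hypothetical stylesheet.

```php
<?php
// Hypothetical sample stylesheet.
$css = "body { background: #fff }\np { margin: 0 }";

// Rename the body rule so it applies to the <div class="bodyclass">
// that replaces the <body> tag; other rules are left alone.
$rewritten = preg_replace('|body(\s*\{.*?\})|si', '.bodyclass\1', $css);
```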

References tln_defang(), tln_fixurl(), and tln_unspace().

Referenced by tln_sanitize().


◆ tln_fixurl()

tln_fixurl (   $attname,
  &$attvalue,
  $trans_image_path,
  $block_external_images 
)

Replace empty src attributes with the blank image. src is only used for frames, images, and image inputs, so the replacement should not affect how they work. It does, however, stop IE from being kicked off when the src of an img tag is not set.

Definition at line 598 of file htmlfilter.php.

599{
600 $sQuote = '"';
601 $attvalue = trim($attvalue);
602 if ($attvalue && ($attvalue[0] =='"'|| $attvalue[0] == "'")) {
603 // remove the double quotes
604 $sQuote = $attvalue[0];
605 $attvalue = trim(substr($attvalue,1,-1));
606 }
607
614 if ($attvalue == '') {
615 $attvalue = $sQuote . $trans_image_path . $sQuote;
616 } else {
617 // first, disallow 8 bit characters and control characters
618 if (preg_match('/[\0-\37\200-\377]+/',$attvalue)) {
619 switch ($attname) {
620 case 'href':
621 $attvalue = $sQuote . 'http://invalid-stuff-detected.example.com' . $sQuote;
622 break;
623 default:
624 $attvalue = $sQuote . $trans_image_path . $sQuote;
625 break;
626 }
627 } else {
628 $aUrl = parse_url($attvalue);
629 if (isset($aUrl['scheme'])) {
630 switch(strtolower($aUrl['scheme'])) {
631 case 'mailto':
632 case 'http':
633 case 'https':
634 case 'ftp':
635 if ($attname != 'href') {
636 if ($block_external_images == true) {
637 $attvalue = $sQuote . $trans_image_path . $sQuote;
638 } else {
639 if (!isset($aUrl['path'])) {
640 $attvalue = $sQuote . $trans_image_path . $sQuote;
641 }
642 }
643 } else {
644 $attvalue = $sQuote . $attvalue . $sQuote;
645 }
646 break;
647 case 'outbind':
648 $attvalue = $sQuote . $attvalue . $sQuote;
649 break;
650 case 'cid':
651 $attvalue = $sQuote . $attvalue . $sQuote;
652 break;
653 default:
654 $attvalue = $sQuote . $trans_image_path . $sQuote;
655 break;
656 }
657 } else {
658 if (!isset($aUrl['path']) || $aUrl['path'] != $trans_image_path) {
659 $attvalue = $sQuote . $trans_image_path . $sQuote;
660 }
661 }
662 }
663 }
664}
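
The core of the listing is an allow list of URL schemes checked through parse_url(). A condensed sketch of just that check follows. The function name and the exact list are assumptions for illustration, and the real function additionally distinguishes href from other attributes.

```php
<?php
// Hypothetical helper: true when the URL carries a scheme from a small allow list.
function scheme_allowed_sketch($url, array $allowed = array('http', 'https', 'ftp', 'mailto', 'cid'))
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'])) {
        return false; // relative URL or unparseable input
    }
    // Schemes are case-insensitive, so compare in lowercase.
    return in_array(strtolower($parts['scheme']), $allowed, true);
}

$ok = scheme_allowed_sketch('HTTPS://example.com/a.png');
$script = scheme_allowed_sketch('javascript:alert(1)');
$relative = scheme_allowed_sketch('images/a.png');
```

Checking against a list of known-good schemes, rather than a list of known-bad ones, means an unfamiliar scheme is rejected by default.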

Referenced by tln_fixatts(), and tln_fixstyle().


◆ tln_getnxtag()

tln_getnxtag (   $body,
  $offset 
)

This function looks for the next tag.

Parameters
string $body  String where to look for the next tag.
integer $offset  Start looking from here.
Returns
array|boolean false if no more tags exist in the body, or an array with the following members:
  • string with the name of the tag
  • array with attributes and their values
  • integer with tag type (1, 2, or 3)
  • integer where the tag starts (starting "<")
  • integer where the tag ends (ending ">")
The first three members will be false if the tag is invalid.

We are here:
blah blah <tag attribute="value">
----------^

There are 3 kinds of tags:

  1. Opening tag, e.g.: <a href="blah">
  2. Closing tag, e.g.: </a>
  3. XHTML-style content-less tag, e.g.: <img src="blah" />

A comment or an SGML declaration.

Assume tagtype 1 for now. If it's type 3, we'll switch values later.

Look for next [\W-_], which will indicate the end of the tag name.

$match can be any of these:
  • '>' indicating the end of the tag entirely.
  • '\s' indicating the end of the tag name.
  • '/' indicating that this is a type-3 XHTML tag.

Whatever else we find there indicates an invalid tag.

This is an xhtml-style tag with a closing / at the end, like so: <img src="blah" />. Check if it's followed by the closing bracket. If not, then this tag is invalid.

Check if it's whitespace

This is an invalid tag! Look for the next closing ">".

At this point we're here:
<tagname attribute='blah'>
--------^

At this point we loop in order to find all attributes.

Non-closed tag.

See if we arrived at a ">" or "/>", which means that we reached the end of the tag.

Yep. So we did.

There are several types of attributes, with optional [:space:] between members:
  • Type 1: attrname[:space:]=[:space:]'CDATA'
  • Type 2: attrname[:space:]=[:space:]"CDATA"
  • Type 3: attrname[:space:]=[:space:]CDATA
  • Type 4: attrname

We leave types 1 and 2 the same. For type 3 we check for '"' and convert it to "&quot;" if needed, then wrap the value in double quotes. Type 4 we convert into: attrname="yes".
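
The normalization rules for the four attribute types can be summarized in a few lines. This is only a sketch of the rules as stated: the function name is hypothetical, and it takes an already isolated name and raw value rather than scanning the tag text as tln_getnxtag() does.

```php
<?php
// Hypothetical helper: normalize one attribute according to the four types.
// $rawValue is null for a bare attribute name (type 4).
function normalize_attribute_sketch($name, $rawValue = null)
{
    if ($rawValue === null) {
        return array($name, '"yes"'); // type 4: attrname becomes attrname="yes"
    }
    $first = substr($rawValue, 0, 1);
    if ($first === '"' || $first === "'") {
        return array($name, $rawValue); // types 1 and 2: already quoted, keep as is
    }
    // Type 3: escape embedded double quotes, then wrap in double quotes.
    return array($name, '"' . str_replace('"', '&quot;', $rawValue) . '"');
}

$bare = normalize_attribute_sketch('checked');
$unquoted = normalize_attribute_sketch('width', '100');
$quoted = normalize_attribute_sketch('alt', "'hi'");
```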

Looks like body ended before the end of tag.

We arrived at the end of the attribute name. Several things are possible here:
  • '>' means the end of the tag, and this is attribute type 4.
  • '/' if followed by '>' means the same thing as above.
  • '\s' means a lot of things; look at what it's followed by.
  • Anything else means the attribute is invalid.

This is an xhtml-style tag with a closing / at the end, like so: <img src="blah" />. Check if it's followed by the closing bracket. If not, then this tag is invalid.

Skip whitespace and see what we arrive at.

Two things are valid here:
  • '=' means this is attribute type 1, 2, or 3.
  • \w means this was attribute type 4.
Anything else we ignore and re-loop. End of tag and invalid stuff will be caught by our checks at the beginning of the loop.

Here are 3 possibilities:
  • "'" attribute type 1
  • '"' attribute type 2
  • everything else is the content of attribute type 3

These are hateful. Look for \s, or >.

If it's ">" it will be caught at the top.

That was attribute type 4.

An illegal character. Find next '>' and return.

The fact that we got here indicates that the tag end was never found. Return invalid tag indication so it gets stripped.

Definition at line 157 of file htmlfilter.php.

158{
159 if ($offset > strlen($body)) {
160 return false;
161 }
162 $lt = tln_findnxstr($body, $offset, '<');
163 if ($lt == strlen($body)) {
164 return false;
165 }
171 $pos = tln_skipspace($body, $lt + 1);
172 if ($pos >= strlen($body)) {
173 return array(false, false, false, $lt, strlen($body));
174 }
184 switch (substr($body, $pos, 1)) {
185 case '/':
186 $tagtype = 2;
187 $pos++;
188 break;
189 case '!':
193 if (substr($body, $pos + 1, 2) == '--') {
194 $gt = strpos($body, '-->', $pos);
195 if ($gt === false) {
196 $gt = strlen($body);
197 } else {
198 $gt += 2;
199 }
200 return array(false, false, false, $lt, $gt);
201 } else {
202 $gt = tln_findnxstr($body, $pos, '>');
203 return array(false, false, false, $lt, $gt);
204 }
205 break;
206 default:
211 $tagtype = 1;
212 break;
213 }
214
218 $regary = tln_findnxreg($body, $pos, '[^\w\-_]');
219 if ($regary == false) {
220 return array(false, false, false, $lt, strlen($body));
221 }
222 list($pos, $tagname, $match) = $regary;
223 $tagname = strtolower($tagname);
224
233 switch ($match) {
234 case '/':
240 if (substr($body, $pos, 2) == '/>') {
241 $pos++;
242 $tagtype = 3;
243 } else {
244 $gt = tln_findnxstr($body, $pos, '>');
245 $retary = array(false, false, false, $lt, $gt);
246 return $retary;
247 }
248 //intentional fall-through
249 case '>':
250 return array($tagname, false, $tagtype, $lt, $pos);
251 break;
252 default:
256 if (!preg_match('/\s/', $match)) {
260 $gt = tln_findnxstr($body, $lt, '>');
261 return array(false, false, false, $lt, $gt);
262 }
263 break;
264 }
265
273 $attary = array();
274
275 while ($pos <= strlen($body)) {
276 $pos = tln_skipspace($body, $pos);
277 if ($pos == strlen($body)) {
281 return array(false, false, false, $lt, $pos);
282 }
287 $matches = array();
288 if (preg_match('%^(\s*)(>|/>)%s', substr($body, $pos), $matches)) {
292 $pos += strlen($matches[1]);
293 if ($matches[2] == '/>') {
294 $tagtype = 3;
295 $pos++;
296 }
297 return array($tagname, $attary, $tagtype, $lt, $pos);
298 }
299
317 $regary = tln_findnxreg($body, $pos, '[^\w\-_]');
318 if ($regary == false) {
322 return array(false, false, false, $lt, strlen($body));
323 }
324 list($pos, $attname, $match) = $regary;
325 $attname = strtolower($attname);
334 switch ($match) {
335 case '/':
341 if (substr($body, $pos, 2) == '/>') {
342 $pos++;
343 $tagtype = 3;
344 } else {
345 $gt = tln_findnxstr($body, $pos, '>');
346 $retary = array(false, false, false, $lt, $gt);
347 return $retary;
348 }
349 //intentional fall-through
350 case '>':
351 $attary{$attname} = '"yes"';
352 return array($tagname, $attary, $tagtype, $lt, $pos);
353 break;
354 default:
358 $pos = tln_skipspace($body, $pos);
359 $char = substr($body, $pos, 1);
368 if ($char == '=') {
369 $pos++;
370 $pos = tln_skipspace($body, $pos);
377 $quot = substr($body, $pos, 1);
378 if ($quot == '\'') {
379 $regary = tln_findnxreg($body, $pos + 1, '\'');
380 if ($regary == false) {
381 return array(false, false, false, $lt, strlen($body));
382 }
383 list($pos, $attval, $match) = $regary;
384 $pos++;
385 $attary{$attname} = '\'' . $attval . '\'';
386 } elseif ($quot == '"') {
387 $regary = tln_findnxreg($body, $pos + 1, '\"');
388 if ($regary == false) {
389 return array(false, false, false, $lt, strlen($body));
390 }
391 list($pos, $attval, $match) = $regary;
392 $pos++;
393 $attary{$attname} = '"' . $attval . '"';
394 } else {
398 $regary = tln_findnxreg($body, $pos, '[\s>]');
399 if ($regary == false) {
400 return array(false, false, false, $lt, strlen($body));
401 }
402 list($pos, $attval, $match) = $regary;
406 $attval = preg_replace('/\"/s', '&quot;', $attval);
407 $attary{$attname} = '"' . $attval . '"';
408 }
409 } elseif (preg_match('|[\w/>]|', $char)) {
413 $attary{$attname} = '"yes"';
414 } else {
418 $gt = tln_findnxstr($body, $pos, '>');
419 return array(false, false, false, $lt, $gt);
420 }
421 break;
422 }
423 }
428 return array(false, false, false, $lt, strlen($body));
429}

References tln_findnxreg(), tln_findnxstr(), and tln_skipspace().

Referenced by tln_sanitize().


◆ tln_sanitize()

tln_sanitize (   $body,
  $tag_list,
  $rm_tags_with_content,
  $self_closing_tags,
  $force_tag_closing,
  $rm_attnames,
  $bad_attvals,
  $add_attr_to_tag,
  $trans_image_path,
  $block_external_images 
)
Parameters
string $body  The HTML you wish to filter
array $tag_list  see description above
array $rm_tags_with_content  see description above
array $self_closing_tags  see description above
boolean $force_tag_closing  see description above
array $rm_attnames  see description above
array $bad_attvals  see description above
array $add_attr_to_tag  see description above
string $trans_image_path
boolean $block_external_images
Returns
string Sanitized html safe to show on your pages.

Normalize rm_tags and rm_tags_with_content.

See if tag_list is a list of tags to remove or tags to allow: false means remove these tags, true means allow these tags.

Take care of netscape's stupid javascript entities like &{alert('boo')};

Take care of <style>

Got to the end of tag we needed to remove.

$rm_tags_with_content

See if this is a self-closing type and change tagtype appropriately.

See if we should skip this tag and any content inside it.

Convert body into div.

This is where we run other checks.

Definition at line 842 of file htmlfilter.php.

853 {
857 $rm_tags = array_shift($tag_list);
858 @array_walk($tag_list, 'tln_casenormalize');
859 @array_walk($rm_tags_with_content, 'tln_casenormalize');
860 @array_walk($self_closing_tags, 'tln_casenormalize');
866 $curpos = 0;
867 $open_tags = array();
868 $trusted = "<!-- begin tln_sanitized html -->\n";
869 $skip_content = false;
874 $body = preg_replace('/&(\{.*?\};)/si', '&amp;\\1', $body);
875 while (($curtag = tln_getnxtag($body, $curpos)) != false) {
876 list($tagname, $attary, $tagtype, $lt, $gt) = $curtag;
877 $free_content = substr($body, $curpos, $lt-$curpos);
881 if ($tagname == "style" && $tagtype == 1){
882 list($free_content, $curpos) =
883 tln_fixstyle($body, $gt+1, $trans_image_path, $block_external_images);
884 if ($free_content != FALSE){
885 if ( !empty($attary) ) {
886 $attary = tln_fixatts($tagname,
887 $attary,
888 $rm_attnames,
889 $bad_attvals,
890 $add_attr_to_tag,
891 $trans_image_path,
892 $block_external_images
893 );
894 }
895 $trusted .= tln_tagprint($tagname, $attary, $tagtype);
896 $trusted .= $free_content;
897 $trusted .= tln_tagprint($tagname, null, 2);
898 }
899 continue;
900 }
901 if ($skip_content == false){
902 $trusted .= $free_content;
903 }
904 if ($tagname != false) {
905 if ($tagtype == 2) {
906 if ($skip_content == $tagname) {
910 $tagname = false;
911 $skip_content = false;
912 } else {
913 if ($skip_content == false) {
914 if ($tagname == "body") {
915 $tagname = "div";
916 }
917 if (isset($open_tags{$tagname}) &&
918 $open_tags{$tagname} > 0
919 ) {
920 $open_tags{$tagname}--;
921 } else {
922 $tagname = false;
923 }
924 }
925 }
926 } else {
930 if ($skip_content == false) {
935 if ($tagtype == 1
936 && in_array($tagname, $self_closing_tags)
937 ) {
938 $tagtype = 3;
939 }
944 if ($tagtype == 1
945 && in_array($tagname, $rm_tags_with_content)
946 ) {
947 $skip_content = $tagname;
948 } else {
949 if (($rm_tags == false
950 && in_array($tagname, $tag_list)) ||
951 ($rm_tags == true
952 && !in_array($tagname, $tag_list))
953 ) {
954 $tagname = false;
955 } else {
959 if ($tagname == "body"){
960 $tagname = "div";
961 $attary = tln_body2div($attary, $trans_image_path);
962 }
963 if ($tagtype == 1) {
964 if (isset($open_tags{$tagname})) {
965 $open_tags{$tagname}++;
966 } else {
967 $open_tags{$tagname} = 1;
968 }
969 }
973 if (is_array($attary) && sizeof($attary) > 0) {
974 $attary = tln_fixatts(
975 $tagname,
976 $attary,
977 $rm_attnames,
978 $bad_attvals,
979 $add_attr_to_tag,
980 $trans_image_path,
981 $block_external_images
982 );
983 }
984 }
985 }
986 }
987 }
988 if ($tagname != false && $skip_content == false) {
989 $trusted .= tln_tagprint($tagname, $attary, $tagtype);
990 }
991 }
992 $curpos = $gt + 1;
993 }
994 $trusted .= substr($body, $curpos, strlen($body) - $curpos);
995 if ($force_tag_closing == true) {
996 foreach ($open_tags as $tagname => $opentimes) {
997 while ($opentimes > 0) {
998 $trusted .= '</' . $tagname . '>';
999 $opentimes--;
1000 }
1001 }
1002 $trusted .= "\n";
1003 }
1004 $trusted .= "<!-- end tln_sanitized html -->\n";
1005 return $trusted;
1006}

References tln_body2div(), tln_fixatts(), tln_fixstyle(), tln_getnxtag(), and tln_tagprint().

Referenced by HTMLFilter().
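
When $force_tag_closing is true, the function appends one closing tag for every opening tag still counted in $open_tags. A minimal sketch of that final step, with hypothetical counter values and str_repeat() in place of the inner while loop:

```php
<?php
// Hypothetical state at the end of parsing: two unclosed <div> and one <b>.
$open_tags = array('div' => 2, 'b' => 1);

$closing = '';
foreach ($open_tags as $tagname => $opentimes) {
    // Emit as many closing tags as there are unmatched opening tags.
    $closing .= str_repeat('</' . $tagname . '>', $opentimes);
}
```

The tags are closed in the order their names were first seen, not in reverse nesting order, so the output is balanced by count only.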


◆ tln_skipspace()

tln_skipspace (   $body,
  $offset 
)

This function skips any whitespace from the current position within a string and to the next non-whitespace value.

Parameters
string $body  the string
integer $offset  the offset within the string where we should start looking for the next non-whitespace character.
Returns
integer the location within the $body where the next non-whitespace char is located.

Definition at line 84 of file htmlfilter.php.

85{
86 preg_match('/^(\s*)/s', substr($body, $offset), $matches);
87 if (strlen($matches[1])) {
88 $count = strlen($matches[1]);
89 $offset += $count;
90 }
91 return $offset;
92}
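
The same result can be obtained without a regular expression. The sketch below swaps in strspn(), which counts a leading run of given characters, as an alternative technique; it is not what the file does. The function name is hypothetical, and the character list is an assumption meant to approximate PCRE's \s.

```php
<?php
// Hypothetical alternative: advance $offset past any run of whitespace.
function skip_space_sketch($body, $offset)
{
    // strspn counts how many consecutive characters from $offset
    // belong to the given set.
    return $offset + strspn($body, " \t\r\n\f\v", $offset);
}

// Positions 1 to 4 hold whitespace, so the next non-whitespace
// character ("a") is at position 5.
$pos = skip_space_sketch("<  \n a>", 1);
```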

Referenced by tln_getnxtag().


◆ tln_tagprint()

tln_tagprint (   $tagname,
  $attary,
  $tagtype 
)

This function returns the final tag out of the tag name, an array of attributes, and the type of the tag.

htmlfilter.inc

This set of functions allows you to filter html in order to remove any malicious tags from it. Useful in cases when you need to filter user input for any cross-site-scripting attempts.

Copyright (C) 2002-2004 by Duke University

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA

@Author Konstantin Riabitsev <icon@linux.duke.edu>
@Author Jim Jagielski <jim@jaguNET.com / jimjag@gmail.com>
@Version 1.1 ($Date$)

This function is called by tln_sanitize internally.

Parameters
string $tagname  the name of the tag.
array $attary  the array of attributes and their values
integer $tagtype  The type of the tag (see in comments).
Returns
string A string with the final tag representation.

Definition at line 41 of file htmlfilter.php.

42{
43 if ($tagtype == 2) {
44 $fulltag = '</' . $tagname . '>';
45 } else {
46 $fulltag = '<' . $tagname;
47 if (is_array($attary) && sizeof($attary)) {
48 $atts = array();
49 while (list($attname, $attvalue) = each($attary)) {
50 array_push($atts, "$attname=$attvalue");
51 }
52 $fulltag .= ' ' . join(' ', $atts);
53 }
54 if ($tagtype == 3) {
55 $fulltag .= ' /';
56 }
57 $fulltag .= '>';
58 }
59 return $fulltag;
60}
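
The three tag types produce three output shapes. The following sketch rebuilds the same logic with foreach so that it can be run on its own; the function name print_tag_sketch is hypothetical.

```php
<?php
// Hypothetical re-statement: build the textual form of a tag.
// $tagtype: 1 = opening, 2 = closing, 3 = self-closing.
function print_tag_sketch($tagname, $attary, $tagtype)
{
    if ($tagtype == 2) {
        return '</' . $tagname . '>';
    }
    $fulltag = '<' . $tagname;
    if (is_array($attary)) {
        foreach ($attary as $attname => $attvalue) {
            // Values are expected to carry their own quotes already.
            $fulltag .= ' ' . $attname . '=' . $attvalue;
        }
    }
    return $fulltag . ($tagtype == 3 ? ' />' : '>');
}

$img = print_tag_sketch('img', array('src' => '"x.png"'), 3);
$close = print_tag_sketch('a', null, 2);
$open = print_tag_sketch('p', array(), 1);
```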

Referenced by tln_sanitize().


◆ tln_unspace()

tln_unspace (&$attvalue)

Kill any tabs, newlines, or carriage returns.

Our friends the makers of the browser with 95% market value decided that it'd be funny to make "java[tab]script" be just as good as "javascript".

Parameters
string $attvalue  The attribute value before extraneous spaces removed.

Definition at line 491 of file htmlfilter.php.

492{
493 if (strcspn($attvalue, "\t\r\n\0 ") != strlen($attvalue)) {
494 $attvalue = str_replace(
495 array("\t", "\r", "\n", "\0", " "),
496 array('', '', '', '', ''),
497 $attvalue
498 );
499 }
500}
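
The effect is easy to see on the "java[tab]script" case mentioned above. A short sketch with a hypothetical input value:

```php
<?php
// Hypothetical attribute value with a tab and a newline hidden in the scheme.
$value = "java\tscr\nipt:alert(1)";

// Remove tabs, carriage returns, newlines, NUL bytes, and spaces,
// the same set of characters the listing strips.
$clean = str_replace(array("\t", "\r", "\n", "\0", ' '), '', $value);
```

Once the whitespace is gone, the value matches the script-URL patterns in $bad_attvals. Note that ordinary spaces are removed as well, so the result is meant for checking rather than for display.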

Referenced by tln_fixatts(), and tln_fixstyle().
