ILIAS  Release_4_2_x_branch Revision 61807
HTMLPurifier_Lexer_PEARSax3 Class Reference

Proof-of-concept lexer that uses the PEAR package XML_HTMLSax3 to parse HTML.

Public Member Functions

 tokenizeHTML ($string, $config, $context)
 Lexes an HTML string into tokens.
 openHandler (&$parser, $name, $attrs, $closed)
 Open tag event handler, interface is defined by PEAR package.
 closeHandler (&$parser, $name)
 Close tag event handler, interface is defined by PEAR package.
 dataHandler (&$parser, $data)
 Data event handler, interface is defined by PEAR package.
 escapeHandler (&$parser, $data)
 Escaped text handler, interface is defined by PEAR package.
 muteStrictErrorHandler ($errno, $errstr, $errfile=null, $errline=null, $errcontext=null)
 An error handler that mutes strict errors.
- Public Member Functions inherited from HTMLPurifier_Lexer
 __construct ()
 parseData ($string)
 Parses special entities into the proper characters.
 normalize ($html, $config, $context)
 Takes a piece of HTML and normalizes it by converting entities, fixing encoding, extracting bits, and other good stuff.
 extractBody ($html)
 Takes a string of HTML (fragment or document) and returns the content.

Protected Attributes

 $tokens = array()
 Internal accumulator array for SAX parsers.
 $last_token_was_empty
- Protected Attributes inherited from HTMLPurifier_Lexer
 $_special_entity2str
 Most common entity to raw value conversion table for special entities.

Private Attributes

 $parent_handler
 $stack = array()

Additional Inherited Members

- Static Public Member Functions inherited from HTMLPurifier_Lexer
static create ($config)
 Retrieves or sets the default Lexer as a Prototype Factory.
- Data Fields inherited from HTMLPurifier_Lexer
 $tracksLineNumbers = false
 Whether or not this lexer implements line-number/column-number tracking.
- Static Protected Member Functions inherited from HTMLPurifier_Lexer
static escapeCDATA ($string)
 Translates CDATA sections into regular sections (through escaping).
static escapeCommentedCDATA ($string)
 Special CDATA case that is especially convoluted for <script>
static removeIEConditional ($string)
 Special Internet Explorer conditional comments should be removed.
static CDATACallback ($matches)
 Callback function for escapeCDATA() that does the work.

Detailed Description

Proof-of-concept lexer that uses the PEAR package XML_HTMLSax3 to parse HTML.

PEAR, not surprisingly, also has a SAX parser for HTML. I don't know much about its implementation, but it's fairly well written. However, that abstraction comes at a price: performance. You also need to have the package installed, and if its API changes, our adapter might break. I'm not sure whether it is UTF-8 aware, but it does have trouble parsing entities everywhere, in both text and attributes.

Personally, I don't recommend using the PEAR class, and the defaults don't use it. The unit tests do run against the SAX parser too, but how it handles poorly formed HTML is up to it.

Todo:
Generalize so that XML_HTMLSax is also supported.
Warning
Entity-resolution inside attributes is broken.

Definition at line 22 of file PEARSax3.php.

Member Function Documentation

HTMLPurifier_Lexer_PEARSax3::closeHandler (&$parser, $name)

Close tag event handler, interface is defined by PEAR package.

Definition at line 81 of file PEARSax3.php.

{
    // HTMLSax3 seems to always send empty tags an extra close tag
    // check and ignore if you see it:
    // [TESTME] to make sure it doesn't overreach
    if ($this->last_token_was_empty) {
        $this->last_token_was_empty = false;
        return true;
    }
    $this->tokens[] = new HTMLPurifier_Token_End($name);
    if (!empty($this->stack)) array_pop($this->stack);
    return true;
}

HTMLPurifier_Lexer_PEARSax3::dataHandler (&$parser, $data)

Data event handler, interface is defined by PEAR package.

Definition at line 97 of file PEARSax3.php.

References $data.

{
    $this->last_token_was_empty = false;
    $this->tokens[] = new HTMLPurifier_Token_Text($data);
    return true;
}

HTMLPurifier_Lexer_PEARSax3::escapeHandler (&$parser, $data)

Escaped text handler, interface is defined by PEAR package.

Definition at line 106 of file PEARSax3.php.

References $data.

{
    if (strpos($data, '--') === 0) {
        // remove trailing and leading double-dashes
        $data = substr($data, 2);
        if (strlen($data) >= 2 && substr($data, -2) == "--") {
            $data = substr($data, 0, -2);
        }
        if (isset($this->stack[sizeof($this->stack) - 1]) &&
            $this->stack[sizeof($this->stack) - 1] == "style") {
            $this->tokens[] = new HTMLPurifier_Token_Text($data);
        } else {
            $this->tokens[] = new HTMLPurifier_Token_Comment($data);
        }
        $this->last_token_was_empty = false;
    }
    // CDATA is handled elsewhere, but if it was handled here:
    //if (strpos($data, '[CDATA[') === 0) {
    //    $this->tokens[] = new HTMLPurifier_Token_Text(
    //        substr($data, 7, strlen($data) - 9) );
    //}
    return true;
}

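The dash-stripping step in escapeHandler() can be tried in isolation. The following is a minimal sketch, assuming the escape payload arrives with its surrounding double dashes as the comment in the listing implies; `strip_comment_dashes` is a hypothetical helper name and not part of HTML Purifier or XML_HTMLSax3.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): mirrors the dash-stripping
// that escapeHandler() applies to a single escape payload.
function strip_comment_dashes($data)
{
    // Payloads that do not start with "--" are not treated as comments.
    if (strpos($data, '--') !== 0) {
        return null;
    }
    // Drop the leading double dash.
    $data = substr($data, 2);
    // Drop the trailing double dash, if one is present.
    if (strlen($data) >= 2 && substr($data, -2) === '--') {
        $data = substr($data, 0, -2);
    }
    return $data;
}
```

In the lexer itself, the stripped text becomes a Text token when the innermost open element is style and a Comment token otherwise.
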
HTMLPurifier_Lexer_PEARSax3::muteStrictErrorHandler ($errno, $errstr, $errfile=null, $errline=null, $errcontext=null)

An error handler that mutes strict errors.

Definition at line 132 of file PEARSax3.php.

{
    if ($errno == E_STRICT) return;
    return call_user_func($this->parent_handler, $errno, $errstr, $errfile, $errline, $errcontext);
}

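The handler above relies on set_error_handler() returning the previously installed handler so that everything except the muted level can be forwarded. A minimal, self-contained sketch of that chaining pattern follows. `MutingHandlerSketch` is a hypothetical class, it mutes E_USER_NOTICE instead of E_STRICT so the behaviour can be triggered with trigger_error(), it omits `$errcontext` (PHP 8 no longer passes it), and it adds a guard for the case where no previous handler exists, which the listing above does not have.

```php
<?php
// Hypothetical stand-in for the lexer: mutes one error level and forwards
// every other error to whatever handler was active before install().
class MutingHandlerSketch
{
    private $previous;   // handler returned by set_error_handler(), or null
    private $mutedLevel; // error level to swallow

    public function __construct($mutedLevel)
    {
        $this->mutedLevel = $mutedLevel;
    }

    public function install()
    {
        // set_error_handler() returns the previous handler, or null if none.
        $this->previous = set_error_handler(array($this, 'handle'));
    }

    public function uninstall()
    {
        restore_error_handler();
    }

    public function handle($errno, $errstr, $errfile = null, $errline = null)
    {
        if ($errno == $this->mutedLevel) {
            return true; // swallow the muted level
        }
        if ($this->previous === null) {
            return false; // assumption: defer to PHP's built-in handler
        }
        return call_user_func($this->previous, $errno, $errstr, $errfile, $errline);
    }
}

// Outer handler that records what reaches it.
$seen = array();
set_error_handler(function ($errno, $errstr) use (&$seen) {
    $seen[] = $errstr;
    return true;
});

$sketch = new MutingHandlerSketch(E_USER_NOTICE);
$sketch->install();
trigger_error('muted', E_USER_NOTICE);      // swallowed by the sketch
trigger_error('forwarded', E_USER_WARNING); // passed on to the outer handler
$sketch->uninstall();
restore_error_handler();
```
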
HTMLPurifier_Lexer_PEARSax3::openHandler (&$parser, $name, $attrs, $closed)

Open tag event handler, interface is defined by PEAR package.

Definition at line 63 of file PEARSax3.php.

References HTMLPurifier_Lexer\parseData().

{
    // entities are not resolved in attrs
    foreach ($attrs as $key => $attr) {
        $attrs[$key] = $this->parseData($attr);
    }
    if ($closed) {
        $this->tokens[] = new HTMLPurifier_Token_Empty($name, $attrs);
        $this->last_token_was_empty = true;
    } else {
        $this->tokens[] = new HTMLPurifier_Token_Start($name, $attrs);
    }
    $this->stack[] = $name;
    return true;
}
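
Taken together, openHandler(), dataHandler(), and closeHandler() drop the extra close event that HTMLSax3 seems to send after an empty tag. The sketch below replays a hand-written event sequence to show the effect. It is illustrative only: `EmptyTagSketch` is a hypothetical class, plain strings stand in for HTMLPurifier_Token objects, and the event order is an assumption based on the comment in closeHandler().

```php
<?php
// Hypothetical reduction of the lexer's empty-tag bookkeeping.
class EmptyTagSketch
{
    public $tokens = array();
    private $lastWasEmpty = false;

    public function open($name, $closed)
    {
        if ($closed) {
            $this->tokens[] = "empty:$name";
            $this->lastWasEmpty = true; // expect a redundant close event next
        } else {
            $this->tokens[] = "start:$name";
        }
    }

    public function data($text)
    {
        $this->lastWasEmpty = false;
        $this->tokens[] = "text:$text";
    }

    public function close($name)
    {
        if ($this->lastWasEmpty) {
            // Redundant close for an empty tag: ignore it and clear the flag.
            $this->lastWasEmpty = false;
            return;
        }
        $this->tokens[] = "end:$name";
    }
}

// Assumed event order for "<p><br />hi</p>".
$s = new EmptyTagSketch();
$s->open('p', false);
$s->open('br', true);
$s->close('br'); // the extra close event; produces no token
$s->data('hi');
$s->close('p');
```
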

HTMLPurifier_Lexer_PEARSax3::tokenizeHTML ($string, $config, $context)

Lexes an HTML string into tokens.

Parameters
$string  String HTML.
Returns
HTMLPurifier_Token array representation of HTML.

Reimplemented from HTMLPurifier_Lexer.

Definition at line 34 of file PEARSax3.php.

References $config, $tokens, and HTMLPurifier_Lexer\normalize().

{
    $this->tokens = array();
    $this->last_token_was_empty = false;
    $string = $this->normalize($string, $config, $context);
    $this->parent_handler = set_error_handler(array($this, 'muteStrictErrorHandler'));
    $parser = new XML_HTMLSax3();
    $parser->set_object($this);
    $parser->set_element_handler('openHandler','closeHandler');
    $parser->set_data_handler('dataHandler');
    $parser->set_escape_handler('escapeHandler');
    // doesn't seem to work correctly for attributes
    $parser->set_option('XML_OPTION_ENTITIES_PARSED', 1);
    $parser->parse($string);
    restore_error_handler();
    return $this->tokens;
}

Field Documentation

HTMLPurifier_Lexer_PEARSax3::$last_token_was_empty
protected

Definition at line 29 of file PEARSax3.php.

HTMLPurifier_Lexer_PEARSax3::$parent_handler
private

Definition at line 31 of file PEARSax3.php.

HTMLPurifier_Lexer_PEARSax3::$stack = array()
private

Definition at line 32 of file PEARSax3.php.

HTMLPurifier_Lexer_PEARSax3::$tokens = array()
protected

Internal accumulator array for SAX parsers.

Definition at line 28 of file PEARSax3.php.

Referenced by tokenizeHTML().


The documentation for this class was generated from the following file: