ILIAS
eassessment Revision 61809
HTMLPurifier_Lexer_PEARSax3 Class Reference
Proof-of-concept lexer that uses the PEAR package XML_HTMLSax3 to parse HTML.
Public Member Functions
tokenizeHTML ($string, $config, $context)
    Lexes an HTML string into tokens.
openHandler (&$parser, $name, $attrs, $closed)
    Open tag event handler, interface is defined by PEAR package.
closeHandler (&$parser, $name)
    Close tag event handler, interface is defined by PEAR package.
dataHandler (&$parser, $data)
    Data event handler, interface is defined by PEAR package.
escapeHandler (&$parser, $data)
    Escaped text handler, interface is defined by PEAR package.
Public Member Functions inherited from HTMLPurifier_Lexer
__construct ()
parseData ($string)
    Parses special entities into the proper characters.
normalize ($html, $config, $context)
    Takes a piece of HTML and normalizes it by converting entities, fixing encoding, extracting bits, and other good stuff.
extractBody ($html)
    Takes a string of HTML (fragment or document) and returns the content.
Protected Attributes
$tokens = array()
    Internal accumulator array for SAX parsers.
Protected Attributes inherited from HTMLPurifier_Lexer
$_special_entity2str
    Most common entity to raw value conversion table for special entities.
Additional Inherited Members
Static Public Member Functions inherited from HTMLPurifier_Lexer
static create ($config)
    Retrieves or sets the default Lexer as a Prototype Factory.
Data Fields inherited from HTMLPurifier_Lexer
$tracksLineNumbers = false
    Whether or not this lexer implements line-number/column-number tracking.
Static Protected Member Functions inherited from HTMLPurifier_Lexer
static escapeCDATA ($string)
    Translates CDATA sections into regular sections (through escaping).
static escapeCommentedCDATA ($string)
    Special CDATA case that is especially convoluted for <script>.
static CDATACallback ($matches)
    Callback function for escapeCDATA() that does the work.
Detailed Description
Proof-of-concept lexer that uses the PEAR package XML_HTMLSax3 to parse HTML.
PEAR, not surprisingly, also has a SAX parser for HTML. I don't know much about its implementation, but it's fairly well written. However, that abstraction comes at a price: performance. You also need to have the package installed, and if its API changes, it might break our adapter. I'm not sure whether it's UTF-8 aware, and it has some trouble parsing entities, in both text and attributes.
Personally, I don't recommend using the PEAR class, and the defaults don't use it. The unit tests do run against the SAX parser too, but how it handles poorly formed HTML is up to it.
Definition at line 22 of file PEARSax3.php.
Member Function Documentation
HTMLPurifier_Lexer_PEARSax3::closeHandler (&$parser, $name)
Close tag event handler, interface is defined by PEAR package.
Definition at line 70 of file PEARSax3.php.
References $name.
HTMLPurifier_Lexer_PEARSax3::dataHandler (&$parser, $data)
Data event handler, interface is defined by PEAR package.
Definition at line 84 of file PEARSax3.php.
References $data.
HTMLPurifier_Lexer_PEARSax3::escapeHandler (&$parser, $data)
Escaped text handler, interface is defined by PEAR package.
Definition at line 92 of file PEARSax3.php.
References $data.
HTMLPurifier_Lexer_PEARSax3::openHandler (&$parser, $name, $attrs, $closed)
Open tag event handler, interface is defined by PEAR package.
Definition at line 54 of file PEARSax3.php.
References $name, and HTMLPurifier_Lexer\parseData().
HTMLPurifier_Lexer_PEARSax3::tokenizeHTML ($string, $config, $context)
Lexes an HTML string into tokens.
Parameters
    $string    String HTML.
Reimplemented from HTMLPurifier_Lexer.
Definition at line 30 of file PEARSax3.php.
References $config, $tokens, and HTMLPurifier_Lexer\normalize().
Field Documentation
HTMLPurifier_Lexer_PEARSax3::$tokens = array() [protected]
Internal accumulator array for SAX parsers.
Definition at line 28 of file PEARSax3.php.
Referenced by tokenizeHTML().
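The interplay between the event handlers and the $tokens accumulator can be illustrated with a small, self-contained sketch. It is not the code in PEARSax3.php: SketchToken and SketchSaxAdapter are hypothetical stand-ins, the events are replayed from an array instead of coming from XML_HTMLSax3, and html_entity_decode() is assumed here as a rough substitute for parseData(). Only the handler names and parameter lists are taken from this page.

```php
<?php
// Hypothetical stand-in for the library's token objects.
class SketchToken
{
    public $type;   // 'start', 'empty', 'end', or 'text'
    public $name;   // tag name, or null for text tokens
    public $attrs;  // attribute map for start/empty tags
    public $data;   // character data for text tokens

    public function __construct($type, $name = null, $attrs = array(), $data = null)
    {
        $this->type  = $type;
        $this->name  = $name;
        $this->attrs = $attrs;
        $this->data  = $data;
    }
}

// Hypothetical adapter: handler names and parameters mirror the ones
// documented on this page, but the bodies are illustrative only.
class SketchSaxAdapter
{
    protected $tokens = array();

    // Replays pre-recorded events. A real adapter would register its
    // handlers with the SAX parser and let the parser drive the callbacks.
    public function tokenize(array $events)
    {
        $this->tokens = array(); // reset the accumulator on every run
        $parser = null;          // placeholder for the parser handle
        foreach ($events as $e) {
            switch ($e[0]) {
                case 'open':
                    $this->openHandler($parser, $e[1], $e[2], $e[3]);
                    break;
                case 'close':
                    $this->closeHandler($parser, $e[1]);
                    break;
                case 'data':
                    $this->dataHandler($parser, $e[1]);
                    break;
            }
        }
        return $this->tokens;
    }

    public function openHandler(&$parser, $name, $attrs, $closed)
    {
        // Assumption: decode entities in attribute values with
        // html_entity_decode() in place of the inherited parseData().
        foreach ($attrs as $key => $value) {
            $attrs[$key] = html_entity_decode($value, ENT_QUOTES, 'UTF-8');
        }
        $this->tokens[] = new SketchToken($closed ? 'empty' : 'start', $name, $attrs);
        return true;
    }

    public function closeHandler(&$parser, $name)
    {
        $this->tokens[] = new SketchToken('end', $name);
        return true;
    }

    public function dataHandler(&$parser, $data)
    {
        $this->tokens[] = new SketchToken('text', null, array(), $data);
        return true;
    }
}

// Usage: replay the events a SAX parser might emit for <a title="Q&amp;A">link</a>.
$adapter = new SketchSaxAdapter();
$tokens = $adapter->tokenize(array(
    array('open', 'a', array('title' => 'Q&amp;A'), false),
    array('data', 'link'),
    array('close', 'a'),
));
foreach ($tokens as $t) {
    echo $t->type, "\n"; // prints start, text, end on separate lines
}
```

Resetting the accumulator at the start of each run keeps one adapter instance reusable across documents, which is presumably why $tokens is an instance attribute rather than a local variable: the callbacks have no other channel for returning results to the caller.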