ILIAS
Release_4_2_x_branch Revision 61807
Proof-of-concept lexer that uses the PEAR package XML_HTMLSax3 to parse HTML.
Public Member Functions
tokenizeHTML ($string, $config, $context)
Lexes an HTML string into tokens.
openHandler (&$parser, $name, $attrs, $closed)
Open tag event handler, interface is defined by PEAR package.
closeHandler (&$parser, $name)
Close tag event handler, interface is defined by PEAR package.
dataHandler (&$parser, $data)
Data event handler, interface is defined by PEAR package.
escapeHandler (&$parser, $data)
Escaped text handler, interface is defined by PEAR package.
muteStrictErrorHandler ($errno, $errstr, $errfile=null, $errline=null, $errcontext=null)
An error handler that mutes strict errors.

Public Member Functions inherited from HTMLPurifier_Lexer
__construct ()
parseData ($string)
Parses special entities into the proper characters.
normalize ($html, $config, $context)
Takes a piece of HTML and normalizes it by converting entities, fixing encoding, extracting bits, and other good stuff.
extractBody ($html)
Takes a string of HTML (fragment or document) and returns the content.

Protected Attributes
$tokens = array()
Internal accumulator array for SAX parsers.
$last_token_was_empty

Protected Attributes inherited from HTMLPurifier_Lexer
$_special_entity2str
Most common entity to raw value conversion table for special entities.

Private Attributes
$parent_handler
$stack = array()

Additional Inherited Members

Static Public Member Functions inherited from HTMLPurifier_Lexer
static create ($config)
Retrieves or sets the default Lexer as a Prototype Factory.

Data Fields inherited from HTMLPurifier_Lexer
$tracksLineNumbers = false
Whether or not this lexer implements line-number/column-number tracking.

Static Protected Member Functions inherited from HTMLPurifier_Lexer
static escapeCDATA ($string)
Translates CDATA sections into regular sections (through escaping).
static escapeCommentedCDATA ($string)
Special CDATA case that is especially convoluted for <script>
static removeIEConditional ($string)
Special Internet Explorer conditional comments should be removed.
static CDATACallback ($matches)
Callback function for escapeCDATA() that does the work.

Proof-of-concept lexer that uses the PEAR package XML_HTMLSax3 to parse HTML.
PEAR, not surprisingly, also has a SAX parser for HTML. I don't know very much about its implementation, but it's fairly well written. However, that abstraction comes at a price: performance. You need to have the package installed, and if its API changes, it might break our adapter. I'm not sure whether it's UTF-8 aware, and it has some trouble parsing entities, both in text and in attributes.
Personally, I don't recommend using the PEAR class, and the defaults don't use it. The unit tests also run against the SAX parser, but how it handles poorly formed HTML is up to it.
Definition at line 22 of file PEARSax3.php.
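The handler methods listed above follow the usual SAX accumulator pattern: the parser calls one method per event, and each call appends a token to an internal array. A minimal stand-alone sketch of that pattern, not the code in PEARSax3.php: the class name SketchTokenCollector and the plain-array token shape are hypothetical, $closed is assumed to flag a self-closed tag, and the handlers are driven by hand instead of by XML_HTMLSax3. The parser argument is taken by value here so that null can be passed.

```php
<?php
// Hypothetical sketch of a SAX-style token accumulator.
// Tokens are plain arrays here, not HTMLPurifier_Token objects.
class SketchTokenCollector
{
    // Accumulates tokens in the order the events arrive.
    public $tokens = array();

    // Open tag event: $closed is assumed to mean the tag was self-closed.
    public function openHandler($parser, $name, $attrs, $closed)
    {
        $type = $closed ? 'empty' : 'start';
        $this->tokens[] = array('type' => $type, 'name' => $name, 'attr' => $attrs);
        return true;
    }

    // Close tag event.
    public function closeHandler($parser, $name)
    {
        $this->tokens[] = array('type' => 'end', 'name' => $name);
        return true;
    }

    // Character data event.
    public function dataHandler($parser, $data)
    {
        $this->tokens[] = array('type' => 'text', 'data' => $data);
        return true;
    }
}

// Fire the events by hand, in the order a SAX parser
// would plausibly report them for: <p>Hi<br /></p>
$collector = new SketchTokenCollector();
$collector->openHandler(null, 'p', array(), false);
$collector->dataHandler(null, 'Hi');
$collector->openHandler(null, 'br', array(), true);
$collector->closeHandler(null, 'p');
```

After these four calls, $collector->tokens holds a start token, a text token, an empty token, and an end token, in that order.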
HTMLPurifier_Lexer_PEARSax3::closeHandler (&$parser, $name)
Close tag event handler, interface is defined by PEAR package.
Definition at line 81 of file PEARSax3.php.
HTMLPurifier_Lexer_PEARSax3::dataHandler (&$parser, $data)
Data event handler, interface is defined by PEAR package.
Definition at line 97 of file PEARSax3.php.
References $data.
HTMLPurifier_Lexer_PEARSax3::escapeHandler (&$parser, $data)
Escaped text handler, interface is defined by PEAR package.
Definition at line 106 of file PEARSax3.php.
References $data.
HTMLPurifier_Lexer_PEARSax3::muteStrictErrorHandler ($errno, $errstr, $errfile = null, $errline = null, $errcontext = null)
An error handler that mutes strict errors.
Definition at line 132 of file PEARSax3.php.
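One common way to mute a single error level is to wrap the previously installed handler and forward everything else to it. The sketch below illustrates that general technique under stated assumptions; it is not the body of muteStrictErrorHandler(). The function makeMutingHandler and the constant MUTED_LEVEL are hypothetical names, and the level is written as the number 2048 (the value of E_STRICT) because newer PHP versions deprecate that constant.

```php
<?php
// Hypothetical sketch: wrap a parent error handler and swallow one level.
const MUTED_LEVEL = 2048; // numeric value of E_STRICT

function makeMutingHandler($parent, $mutedLevel = MUTED_LEVEL)
{
    return function ($errno, $errstr, $errfile = null, $errline = null) use ($parent, $mutedLevel) {
        if ($errno === $mutedLevel) {
            return true; // swallow the muted level
        }
        if ($parent === null) {
            return false; // no parent: let PHP's built-in handler run
        }
        // Forward every other error to the previous handler.
        return call_user_func($parent, $errno, $errstr, $errfile, $errline);
    };
}

// Usage: a parent handler that records the messages it receives.
$seen = array();
$parent = function ($errno, $errstr) use (&$seen) {
    $seen[] = $errstr;
    return true;
};

$handler = makeMutingHandler($parent);
$handler(MUTED_LEVEL, 'strict notice');   // muted, never reaches $parent
$handler(E_USER_WARNING, 'real warning'); // forwarded to $parent
```

In real use, such a handler would typically be installed with set_error_handler() before the noisy call and removed with restore_error_handler() afterwards.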
HTMLPurifier_Lexer_PEARSax3::openHandler (&$parser, $name, $attrs, $closed)
Open tag event handler, interface is defined by PEAR package.
Definition at line 63 of file PEARSax3.php.
References HTMLPurifier_Lexer\parseData().
HTMLPurifier_Lexer_PEARSax3::tokenizeHTML ($string, $config, $context)
Lexes an HTML string into tokens.
Parameters
$string: String HTML.
Reimplemented from HTMLPurifier_Lexer.
Definition at line 34 of file PEARSax3.php.
References $config, $tokens, and HTMLPurifier_Lexer\normalize().
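HTMLPurifier_Lexer_PEARSax3::$last_token_was_empty [protected]
Definition at line 29 of file PEARSax3.php.
HTMLPurifier_Lexer_PEARSax3::$parent_handler [private]
Definition at line 31 of file PEARSax3.php.
HTMLPurifier_Lexer_PEARSax3::$stack [private]
Definition at line 32 of file PEARSax3.php.
HTMLPurifier_Lexer_PEARSax3::$tokens [protected]
Internal accumulator array for SAX parsers.
Definition at line 28 of file PEARSax3.php.
Referenced by tokenizeHTML().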