Parser that uses PHP 5's DOM extension (part of the core).

Public Member Functions
	__construct ()
	tokenizeHTML ($html, $config, $context)
	muteErrorHandler ($errno, $errstr)
	An error handler that mutes all errors.
	callbackUndoCommentSubst ($matches)
	Callback function for undoing escaping of stray angled brackets in comments.
	callbackArmorCommentEntities ($matches)
	Callback function that entity-izes ampersands in comments so that callbackUndoCommentSubst doesn't clobber them.
Public Member Functions inherited from HTMLPurifier_Lexer
	parseData ($string)
	Parses special entities into the proper characters.
	normalize ($html, $config, $context)
	Takes a piece of HTML and normalizes it by converting entities, fixing encoding, extracting bits, and other good stuff.
	extractBody ($html)
	Takes a string of HTML (fragment or document) and returns the content.

Protected Member Functions
	tokenizeDOM ($node, &$tokens)
	Iterative function that tokenizes a node, putting it into an accumulator.
	createStartNode ($node, &$tokens, $collect)
	createEndNode ($node, &$tokens)
	transformAttrToAssoc ($node_map)
	Converts a DOMNamedNodeMap of DOMAttr objects into an assoc array.
	wrapHTML ($html, $config, $context)
	Wraps an HTML fragment in the necessary HTML.

Private Attributes
	$factory
	HTMLPurifier_TokenFactory

Additional Inherited Members
Static Public Member Functions inherited from HTMLPurifier_Lexer
static	create ($config)
	Retrieves or sets the default Lexer as a Prototype Factory.
Data Fields inherited from HTMLPurifier_Lexer
	$tracksLineNumbers = false
	Whether or not this lexer implements line-number/column-number tracking.
Static Protected Member Functions inherited from HTMLPurifier_Lexer
static	escapeCDATA ($string)
	Translates CDATA sections into regular sections (through escaping).
static	escapeCommentedCDATA ($string)
	Special CDATA case that is especially convoluted for <script>
static	removeIEConditional ($string)
	Special Internet Explorer conditional comments should be removed.
static	CDATACallback ($matches)
	Callback function for escapeCDATA() that does the work.
Protected Attributes inherited from HTMLPurifier_Lexer
	$_special_entity2str
	Most common entity to raw value conversion table for special entities.

Detailed Description

Parser that uses PHP 5's DOM extension (part of the core).

In PHP 5, the DOM XML extension was revamped into DOM and added to the core. It gives us a forgiving HTML parser, which we use to transform the HTML into a DOM, and then into tokens. It is blazingly fast (for large documents, it performs twenty times faster than HTMLPurifier_Lexer_DirectLex), and is the default choice for PHP 5.

Note: Any empty elements will have empty tokens associated with them, even if this is prohibited by the spec. This cannot be fixed until the spec comes into play.

Note: PHP's DOM extension does not actually parse any entities; we use our own function to do that.

Warning: DOM tends to drop whitespace, which may wreak havoc on indenting. If this is a huge problem, because the HTML is hand-edited and you cannot set up a parser cache that stores the output of HTML Purifier while keeping the original HTML lying around, you may want to run Tidy on the resulting output or use HTMLPurifier_Lexer_DirectLex.

Definition at line 27 of file DOMLex.php.

Constructor & Destructor Documentation

HTMLPurifier_Lexer_DOMLex::__construct ( )

Reimplemented from HTMLPurifier_Lexer.

Definition at line 35 of file DOMLex.php.

    {
        // setup the factory
        parent::__construct();
        $this->factory = new HTMLPurifier_TokenFactory();
    }

Member Function Documentation

HTMLPurifier_Lexer_DOMLex::callbackArmorCommentEntities ( $matches )

Callback function that entity-izes ampersands in comments so that callbackUndoCommentSubst doesn't clobber them.

Parameters

array $matches

Returns: string

Definition at line 244 of file DOMLex.php.

    {
        return '<!--' . str_replace('&', '&amp;', $matches[1]) . $matches[2];
    }

HTMLPurifier_Lexer_DOMLex::callbackUndoCommentSubst ( $matches )

Callback function for undoing escaping of stray angled brackets in comments.

Parameters

array $matches

Returns: string

Definition at line 233 of file DOMLex.php.

    {
        return '<!--' . strtr($matches[1], array('&amp;' => '&', '&lt;' => '<')) . $matches[2];
    }

HTMLPurifier_Lexer_DOMLex::createEndNode ( $node, & $tokens )

protected

Parameters

DOMNode	$node
HTMLPurifier_Token[]	$tokens

Definition at line 191 of file DOMLex.php.

Referenced by tokenizeDOM().

    {
        $tokens[] = $this->factory->createEnd($node->tagName);
    }

HTMLPurifier_Lexer_DOMLex::createStartNode ( $node, & $tokens, $collect )

protected

Parameters

DOMNode	$node	DOMNode to be tokenized.
HTMLPurifier_Token[]	$tokens	Array-list of already tokenized tokens.
bool	$collect	Whether start and end tokens are collected. Set to false at the first level, because the node being handled there is the implicit DIV wrapper.

Returns: bool Whether the token needs an end token.

Todo: data and tagName properties don't seem to exist in DOMNode?

Definition at line 131 of file DOMLex.php.

References HTMLPurifier_Lexer\parseData(), and transformAttrToAssoc().

Referenced by tokenizeDOM().

    {
        // intercept non element nodes. WE MUST catch all of them,
        // but we're not getting the character reference nodes because
        // those should have been preprocessed
        if ($node->nodeType === XML_TEXT_NODE) {
            $tokens[] = $this->factory->createText($node->data);
            return false;
        } elseif ($node->nodeType === XML_CDATA_SECTION_NODE) {
            // undo libxml's special treatment of <script> and <style> tags
            $last = end($tokens);
            $data = $node->data;
            // (note $node->tagname is already normalized)
            if ($last instanceof HTMLPurifier_Token_Start && ($last->name == 'script' || $last->name == 'style')) {
                $new_data = trim($data);
                if (substr($new_data, 0, 4) === '<!--') {
                    $data = substr($new_data, 4);
                    if (substr($data, -3) === '-->') {
                        $data = substr($data, 0, -3);
                    } else {
                        // Highly suspicious! Not sure what to do...
                    }
                }
            }
            $tokens[] = $this->factory->createText($this->parseData($data));
            return false;
        } elseif ($node->nodeType === XML_COMMENT_NODE) {
            // this code is only invoked for comments in script/style in versions
            // of libxml pre-2.6.28 (regular comments, of course, are still
            // handled regularly)
            $tokens[] = $this->factory->createComment($node->data);
            return false;
        } elseif ($node->nodeType !== XML_ELEMENT_NODE) {
            // not-well tested: there may be other nodes we have to grab
            return false;
        }
        $attr = $node->hasAttributes() ? $this->transformAttrToAssoc($node->attributes) : array();
        // We still have to make sure that the element actually IS empty
        if (!$node->childNodes->length) {
            if ($collect) {
                $tokens[] = $this->factory->createEmpty($node->tagName, $attr);
            }
            return false;
        } else {
            if ($collect) {
                $tokens[] = $this->factory->createStart(
                    $tag_name = $node->tagName, // somehow, it gets dropped
                    $attr
                );
            }
            return true;
        }
    }
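
The CDATA branch above is the least obvious part of this method. As a minimal sketch, the stripping of a comment wrapper from `<script>` or `<style>` content can be isolated as follows. `stripCommentWrapper()` is a hypothetical helper name; in the method itself the result is additionally passed through parseData() before a text token is created.

```php
<?php
// Hypothetical helper mirroring the CDATA branch of createStartNode().
function stripCommentWrapper(string $data): string
{
    $trimmed = trim($data);
    if (substr($trimmed, 0, 4) !== '<!--') {
        return $data; // no wrapper: the original data is kept untouched
    }
    $data = substr($trimmed, 4);
    if (substr($data, -3) === '-->') {
        $data = substr($data, 0, -3);
    }
    // If the closing "-->" is missing, only the opener is removed.
    return $data;
}

$wrapped = stripCommentWrapper("\n<!--\nalert(1);\n-->\n");
$bare = stripCommentWrapper("  alert(1);  ");
$openOnly = stripCommentWrapper("<!-- open only");
```

Note that surrounding whitespace is only discarded when a leading `<!--` is actually found.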

HTMLPurifier_Lexer_DOMLex::muteErrorHandler ( $errno, $errstr )

An error handler that mutes all errors.

Parameters

int	$errno
string	$errstr

Definition at line 223 of file DOMLex.php.

{

}
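
An empty handler mutes errors because PHP treats an error as handled unless the handler returns false. A minimal sketch of the install/restore pattern, assuming a hypothetical wrapper `withMutedErrors()` that is not part of the class:

```php
<?php
// Hypothetical wrapper: run $fn with all PHP errors swallowed, then restore
// whatever handler was active before.
function withMutedErrors(callable $fn)
{
    set_error_handler(function ($errno, $errstr) {
        // intentionally empty, like muteErrorHandler()
    });
    try {
        return $fn();
    } finally {
        restore_error_handler();
    }
}

ob_start();
$result = withMutedErrors(function () {
    trigger_error('noisy markup', E_USER_WARNING);
    return 'done';
});
$printed = ob_get_clean(); // anything PHP would have displayed for the warning
```

The `finally` block is a design choice of this sketch; it guarantees the previous handler comes back even if the callback throws.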

HTMLPurifier_Lexer_DOMLex::tokenizeDOM ( $node, & $tokens )

protected

Iterative function that tokenizes a node, putting it into an accumulator.

To iterate is human, to recurse divine - L. Peter Deutsch

Parameters

DOMNode	$node	DOMNode to be tokenized.
HTMLPurifier_Token[]	$tokens	Array-list of already tokenized tokens.

Returns: HTMLPurifier_Token of node appended to previously passed tokens.

Definition at line 92 of file DOMLex.php.

References createEndNode(), and createStartNode().

Referenced by HTMLPurifier_Lexer_PH5P\tokenizeHTML(), and tokenizeHTML().

    {
        $level = 0;
        $nodes = array($level => new HTMLPurifier_Queue(array($node)));
        $closingNodes = array();
        do {
            while (!$nodes[$level]->isEmpty()) {
                $node = $nodes[$level]->shift(); // FIFO
                $collect = $level > 0 ? true : false;
                $needEndingTag = $this->createStartNode($node, $tokens, $collect);
                if ($needEndingTag) {
                    $closingNodes[$level][] = $node;
                }
                if ($node->childNodes && $node->childNodes->length) {
                    $level++;
                    $nodes[$level] = new HTMLPurifier_Queue();
                    foreach ($node->childNodes as $childNode) {
                        $nodes[$level]->push($childNode);
                    }
                }
            }
            $level--;
            if ($level && isset($closingNodes[$level])) {
                while ($node = array_pop($closingNodes[$level])) {
                    $this->createEndNode($node, $tokens);
                }
            }
        } while ($level > 0);
    }
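
To make the level-by-level traversal concrete, here is a simplified sketch of the same loop. It is an illustration under stated assumptions rather than the class's code: plain PHP arrays stand in for HTMLPurifier_Queue, tokens are plain strings instead of token objects, only text and element nodes are handled, and `sketchTokenize()` is a hypothetical name.

```php
<?php
// Simplified stand-in for tokenizeDOM() combined with createStartNode().
function sketchTokenize(DOMNode $root): array
{
    $tokens = array();
    $level = 0;
    $nodes = array($level => array($root));
    $closing = array();
    do {
        while (!empty($nodes[$level])) {
            $node = array_shift($nodes[$level]); // FIFO
            $collect = $level > 0;               // the root wrapper emits nothing
            $hasChildren = $node->childNodes !== null
                && $node->childNodes->length > 0;
            if ($node->nodeType === XML_TEXT_NODE) {
                $tokens[] = 'text:' . $node->data;
            } elseif ($node->nodeType === XML_ELEMENT_NODE) {
                if ($collect) {
                    $tokens[] = ($hasChildren ? 'start:' : 'empty:') . $node->tagName;
                }
                if ($hasChildren) {
                    $closing[$level][] = $node; // needs an end token later
                }
            }
            if ($hasChildren) {
                // Descend: queue the children one level deeper.
                $level++;
                $nodes[$level] = array();
                foreach ($node->childNodes as $child) {
                    $nodes[$level][] = $child;
                }
            }
        }
        // Level exhausted: step back up and close what was opened there.
        $level--;
        if ($level && isset($closing[$level])) {
            while ($node = array_pop($closing[$level])) {
                $tokens[] = 'end:' . $node->tagName;
            }
        }
    } while ($level > 0);
    return $tokens;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><div><p>Hi <b>there</b></p><br></div></body></html>');
$tokens = sketchTokenize($doc->getElementsByTagName('div')->item(0));
```

Because end tokens are only emitted when `$level` is non-zero, the wrapper `<div>` passed in as the root never produces a start or end token of its own.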

HTMLPurifier_Lexer_DOMLex::tokenizeHTML ( $html, $config, $context )

Parameters

string	$html
HTMLPurifier_Config	$config
HTMLPurifier_Context	$context

Returns: HTMLPurifier_Token[]

Reimplemented from HTMLPurifier_Lexer.

Reimplemented in HTMLPurifier_Lexer_PH5P.

Definition at line 48 of file DOMLex.php.

References $comment, HTMLPurifier_Lexer\normalize(), tokenizeDOM(), and wrapHTML().

    {
        $html = $this->normalize($html, $config, $context);
        // attempt to armor stray angled brackets that cannot possibly
        // form tags and thus are probably being used as emoticons
        if ($config->get('Core.AggressivelyFixLt')) {
            $char = '[^a-z!\/]';
            $comment = "/<!--(.*?)(-->|\z)/is";
            $html = preg_replace_callback($comment, array($this, 'callbackArmorCommentEntities'), $html);
            do {
                $old = $html;
                $html = preg_replace("/<($char)/i", '&lt;\\1', $html);
            } while ($html !== $old);
            $html = preg_replace_callback($comment, array($this, 'callbackUndoCommentSubst'), $html); // fix comments
        }
        // preprocess html, essential for UTF-8
        $html = $this->wrapHTML($html, $config, $context);
        $doc = new DOMDocument();
        $doc->encoding = 'UTF-8'; // theoretically, the above has this covered
        set_error_handler(array($this, 'muteErrorHandler'));
        $doc->loadHTML($html);
        restore_error_handler();
        $tokens = array();
        $this->tokenizeDOM(
            $doc->getElementsByTagName('html')->item(0)-> // <html>
            getElementsByTagName('body')->item(0)-> //   <body>
            getElementsByTagName('div')->item(0), //     <div>
            $tokens
        );
        return $tokens;
    }
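
The second half of this method (wrap, parse leniently, locate the wrapper `<div>`) can be tried without HTML Purifier. The sketch below assumes only PHP's DOM extension; the fragment and the hand-written wrapper string are illustrative, and the inline closure plays the role of muteErrorHandler().

```php
<?php
$fragment = '<p>plain <b>bold</p>'; // deliberately unbalanced markup
$wrapped = '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
    . '</head><body><div>' . $fragment . '</div></body></html>';

$doc = new DOMDocument();
set_error_handler(function () {}); // swallow parser warnings
$doc->loadHTML($wrapped);
restore_error_handler();

// Same lookup chain as above: <html> -> <body> -> first <div>.
$div = $doc->getElementsByTagName('html')->item(0)
    ->getElementsByTagName('body')->item(0)
    ->getElementsByTagName('div')->item(0);

$firstChildName = $div->firstChild->nodeName;
$text = $div->textContent;
```

The parser repairs the unclosed `<b>` on its own, which is the "forgiving" behaviour the lexer relies on.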

HTMLPurifier_Lexer_DOMLex::transformAttrToAssoc ( $node_map )

protected

Converts a DOMNamedNodeMap of DOMAttr objects into an assoc array.

Parameters

DOMNamedNodeMap $node_map DOMNamedNodeMap of DOMAttr objects.

Returns: array Associative array of attributes.

Definition at line 203 of file DOMLex.php.

Referenced by createStartNode().

    {
        // NamedNodeMap is documented very well, so we're using undocumented
        // features, namely, the fact that it implements Iterator and
        // has a ->length attribute
        if ($node_map->length === 0) {
            return array();
        }
        $array = array();
        foreach ($node_map as $attr) {
            $array[$attr->name] = $attr->value;
        }
        return $array;
    }
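
A small self-contained illustration of the same conversion, assuming only PHP's DOM extension; the markup and variable names are made up for the example.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><a href="/x" title="T">link</a></body></html>');
$a = $doc->getElementsByTagName('a')->item(0);

// DOMNamedNodeMap can be iterated with foreach and exposes ->length.
$count = $a->attributes->length;
$attrs = array();
foreach ($a->attributes as $attr) {
    $attrs[$attr->name] = $attr->value;
}
```

The resulting array is keyed by attribute name, in document order.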

HTMLPurifier_Lexer_DOMLex::wrapHTML ( $html, $config, $context )

protected

Wraps an HTML fragment in the necessary HTML.

Parameters

string	$html
HTMLPurifier_Config	$config
HTMLPurifier_Context	$context

Returns: string

Definition at line 256 of file DOMLex.php.

References $ret.

Referenced by HTMLPurifier_Lexer_PH5P\tokenizeHTML(), and tokenizeHTML().

    {
        $def = $config->getDefinition('HTML');
        $ret = '';
        if (!empty($def->doctype->dtdPublic) || !empty($def->doctype->dtdSystem)) {
            $ret .= '<!DOCTYPE html ';
            if (!empty($def->doctype->dtdPublic)) {
                $ret .= 'PUBLIC "' . $def->doctype->dtdPublic . '" ';
            }
            if (!empty($def->doctype->dtdSystem)) {
                $ret .= '"' . $def->doctype->dtdSystem . '" ';
            }
            $ret .= '>';
        }
        $ret .= '<html><head>';
        $ret .= '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />';
        // No protection if $html contains a stray </div>!
        $ret .= '</head><body><div>' . $html . '</div></body></html>';
        return $ret;
    }
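
The string that wrapHTML() builds, and the stray `</div>` caveat from the code comment, can be explored with a standalone sketch. `wrapFragment()` is a hypothetical function: it takes the doctype identifiers as plain arguments instead of reading them from the HTML definition, and the XHTML 1.0 Strict identifiers are just sample values.

```php
<?php
// Hypothetical standalone version of wrapHTML().
function wrapFragment(string $html, string $dtdPublic = '', string $dtdSystem = ''): string
{
    $ret = '';
    if ($dtdPublic !== '' || $dtdSystem !== '') {
        $ret .= '<!DOCTYPE html ';
        if ($dtdPublic !== '') {
            $ret .= 'PUBLIC "' . $dtdPublic . '" ';
        }
        if ($dtdSystem !== '') {
            $ret .= '"' . $dtdSystem . '" ';
        }
        $ret .= '>';
    }
    $ret .= '<html><head>';
    $ret .= '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />';
    $ret .= '</head><body><div>' . $html . '</div></body></html>';
    return $ret;
}

$plain = wrapFragment('<b>hi</b>');
$strict = wrapFragment(
    '<b>hi</b>',
    '-//W3C//DTD XHTML 1.0 Strict//EN',
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'
);

// A stray </div> closes the wrapper early, so "lost" lands outside it.
$doc = new DOMDocument();
set_error_handler(function () {}); // swallow parser warnings
$doc->loadHTML(wrapFragment('kept</div>lost'));
restore_error_handler();
$wrapperText = $doc->getElementsByTagName('div')->item(0)->textContent;
```

Since tokenizeHTML() only walks the first `<div>`, anything after a stray `</div>` in the input would not reach the token stream.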

Field Documentation

HTMLPurifier_Lexer_DOMLex::$factory

private

HTMLPurifier_TokenFactory

Definition at line 33 of file DOMLex.php.

The documentation for this class was generated from the following file:

Services/Html/HtmlPurifier/library/HTMLPurifier/Lexer/DOMLex.php
