Welcome! » Log In » Create A New Profile

Iterative traversal of DOM [patch]

Posted by Darhazer 
Iterative traversal of DOM [patch]
January 05, 2011 05:07AM

Hi,

Since there are some deep DOMs you can hit the maximum nesting level limit in tokenizeDOM (we've experienced this even with maximum nesting level of 300). So here is an iterative version of the same function with simple quoue/dequoue approach. I was going to upload a .patch file, but no file upload is permitted.

NOTE: createStartNode() and createEndNode() are helper functions (actually parts of the old body of tokenizeDOM) and do not exists in the original code. You have to replace the original tokenizeDOM() function with these 3 functions:

/**
 * Iterative function that tokenizes a node, putting it into an accumulator.
 * To iterate is human, to recurse divine - L. Peter Deutsch
 * @param $node     DOMNode to be tokenized.
 * @param $tokens   Array-list of already tokenized tokens.
 * @returns Tokens of node appended to previously passed tokens.
 */
protected function tokenizeDOM($node, &$tokens) {

	$level = 0;
	$nodes = array($level => array($node));
	$closingNodes = array();
	do {
		while (!empty($nodes[$level])) {
			$node = array_shift($nodes[$level]); // FIFO
			$collect = $level > 0 ? true : false;
			$needEndingTag = $this->createStartNode($node, $tokens, $collect);
			if ($needEndingTag) {
				$closingNodes[$level][] = $node;
			}
			if ($node->childNodes && $node->childNodes->length) {
				$level++;
				$nodes[$level] = array();
				foreach ($node->childNodes as $childNode) {
					array_push($nodes[$level], $childNode);
				}
			}
		}
		$level--;
		if ($level && isset($closingNodes[$level])) {
			while($node = array_pop($closingNodes[$level])) {
				$this->createEndNode($node, $tokens);
			}
		}
	} while ($level > 0);
}

/**
 * @param $node  DOMNode to be tokenized.
 * @param $tokens   Array-list of already tokenized tokens.
 * @param $collect  Says whether or start and close are collected, set to
 *					false at first recursion because it's the implicit DIV
 *					tag you're dealing with.
 * @returns bool if the token needs an endtoken
 */
protected function createStartNode($node, &$tokens, $collect)
{
	// intercept non element nodes. WE MUST catch all of them,
	// but we're not getting the character reference nodes because
	// those should have been preprocessed
	if ($node->nodeType === XML_TEXT_NODE) {
		$tokens[] = $this->factory->createText($node->data);
		return false;
	} elseif ($node->nodeType === XML_CDATA_SECTION_NODE) {
		// undo libxml&#039;s special treatment of <script> and <style> tags
		$last = end($tokens);
		$data = $node->data;
		// (note $node->tagname is already normalized)
		if ($last instanceof HTMLPurifier_Token_Start && ($last->name == &#039;script&#039; || $last->name == &#039;style&#039;)) {
			$new_data = trim($data);
			if (substr($new_data, 0, 4) === &#039;<!--&#039;) {
				$data = substr($new_data, 4);
				if (substr($data, -3) === &#039;-->&#039;) {
					$data = substr($data, 0, -3);
				} else {
					// Highly suspicious! Not sure what to do...
				}
			}
		}
		$tokens[] = $this->factory->createText($this->parseData($data));
		return false;
	} elseif ($node->nodeType === XML_COMMENT_NODE) {
		// this is code is only invoked for comments in script/style in versions
		// of libxml pre-2.6.28 (regular comments, of course, are still
		// handled regularly)
		$tokens[] = $this->factory->createComment($node->data);
		return false;
	} elseif (
		// not-well tested: there may be other nodes we have to grab
		$node->nodeType !== XML_ELEMENT_NODE
	) {
		return false;
	}

	$attr = $node->hasAttributes() ? $this->transformAttrToAssoc($node->attributes) : array();

	// We still have to make sure that the element actually IS empty
	if (!$node->childNodes->length) {
		if ($collect) {
			$tokens[] = $this->factory->createEmpty($node->tagName, $attr);
		}
		return false;
	} else {
		if ($collect) {
			$tokens[] = $this->factory->createStart(
				$tag_name = $node->tagName, // somehow, it get&#039;s dropped
				$attr
			);
		}
		return true;
	}
}

protected function createEndNode($node, &$tokens)
{
	$tokens[] = $this->factory->createEnd($node->tagName);
}

Maxim Krizhanovsky

Favit Network

favit.com

Re: Iterative traversal of DOM [patch]
January 05, 2011 03:29PM

Uh, what you've posted isn't really a patch 8)

But the sentiment is in the right place, so I'll paste that code in and give it a run by the regression suite.

Re: Iterative traversal of DOM [patch]
January 05, 2011 03:40PM

Great news, it passes all the tests. I'm going to do a code audit now.

Re: Iterative traversal of DOM [patch]
January 05, 2011 04:06PM

The scheme used for iteration is a bit unusual (IIRC, DOM publishes nextNode, firstNode and parentNode pointers, so you should technically be able to do this by just following the pointers and not bothering with maintaining a manual stack.) Did you try that method and decide it wasn't workable?

Re: Iterative traversal of DOM [patch]
January 09, 2011 01:55PM

Hi,

I've posted the code, not the contents of a .patch file, maybe the later was wiser.

Thank you for looking into this code mofication, although I've setuped tests as well while modifyin the code, I'm glad it passed all your tests.

Nope, I didn't try this with internal nodes, but it sound reasonable. I'll give it a try some time next week and I'll post an update.

Kind regards,

Maxim Krizhanovsky

Favit Network

Re: Iterative traversal of DOM [patch]
January 09, 2011 02:00PM

Ah, ok. Your code works, and it's not as hairy as some of the iterative algorithms in Strategy, so I can't complain too much. Let me know if you want to rework the code, or if you don't feel like doing so and I'll just use your new implementation as is. :-)

Re: Iterative traversal of DOM [patch]
January 12, 2011 04:04AM

You can use the implementation as is. If I come up with better solution, I'll let you know. It's interesting to measure performance as well (and that's the main point I'll try with internal pointers, but I can't provide any realistic timeframe)

Best regards,

Maxim Krizhanovsky

Re: Iterative traversal of DOM [patch]
January 19, 2011 05:10PM

Committed to my local branch, will be pushed soon.

Author:
Your Email:

Subject:

HTML input is enabled. Make sure you escape all HTML and angled brackets with &lt; and &gt;.

Auto-paragraphing is enabled. Double newlines will be converted to paragraphs; for single newlines, use the pre tag.

Allowed tags: a, abbr, acronym, b, blockquote, caption, cite, code, dd, del, dfn, div, dl, dt, em, i, ins, kbd, li, ol, p, pre, s, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, var.

For inputting literal code such as HTML and PHP for display, use CDATA tags to auto-escape your angled brackets, and pre to preserve newlines:

<pre><![CDATA[
Place code here
]]></pre>

Power users, you can hide this notice with:

.htmlpurifier-help {display:none;}

Message: