
Missing $node->tagName

Posted by mckelvey 
September 08, 2017 03:40PM

Hiya!

I was working with an instance of Craft CMS, which incorporates HTML Purifier (standalone) as part of its save validation on rich text, and ran into an issue where $node->tagName did not exist in `createStartNode`. Given the PHP docs on DOMElement and the code itself (v. 4.9.3, btw), this seemed impossible, but obviously it wasn’t, as I later found the little @todo in the function docblock.

While I was seeing this on one instance, I was not seeing it on another. My best guess is that the affected server was running an older version of libxml (2.7.6), whereas the working instances were running libxml 2.9.1 and 2.9.2. Additionally, the newer libxml versions were paired with PHP 5 and the older libxml with PHP 7.
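For anyone comparing servers the same way, the versions PHP itself reports can be printed with a few lines. This is only a minimal sketch: `libxmlOlderThan` is a hypothetical helper name, and the 2.9.0 threshold is an illustrative assumption based on the versions above, not a confirmed cutoff for the behavior.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): true when the libxml
// version PHP reports is older than the given dotted version string.
function libxmlOlderThan($version)
{
    return version_compare(LIBXML_DOTTED_VERSION, $version, '<');
}

// Print the two versions that differed between the servers.
printf("PHP %s, libxml %s\n", PHP_VERSION, LIBXML_DOTTED_VERSION);

// Assumption: 2.9.0 is used only because the working servers ran 2.9.x.
if (libxmlOlderThan('2.9.0')) {
    echo "libxml is older than 2.9.0\n";
}
```

Running this on each server gives a quick side-by-side of the PHP and libxml pairing.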

At any rate, swapping out libxml to truly test this was not a timely option, so I handled the issue myself and have included the result below. I also had to deal with a missing $node->data on DOMText.

https://gist.github.com/mckelvey/3820ff4a1052325d032f85d24c2363b1

    /* replaces lines 18985-19057 of HTMLPurifier.standalone.php v.4.9.3 */
    
    /**
     * @param DOMNode $node
     */
    protected function getTagName($node)
    {
        if (property_exists($node, 'tagName')) {
            return $node->tagName;
        } else if (property_exists($node, 'nodeName')) {
            return $node->nodeName;
        } else if (property_exists($node, 'localName')) {
            return $node->localName;
        }
        return null;
    }

    /**
     * @param DOMNode $node
     */
    protected function getData($node)
    {
        if (property_exists($node, 'data')) {
            return $node->data;
        } else if (property_exists($node, 'nodeValue')) {
            return $node->nodeValue;
        } else if (property_exists($node, 'textContent')) {
            return $node->textContent;
        }
        return null;
    }


    /**
     * @param DOMNode $node DOMNode to be tokenized.
     * @param HTMLPurifier_Token[] $tokens   Array-list of already tokenized tokens.
     * @param bool $collect  Says whether or not start and close are collected, set to
     *                    false at first recursion because it's the implicit DIV
     *                    tag you're dealing with.
     * @param HTMLPurifier_Config $config
     * @return bool if the token needs an endtoken
     * @todo data and tagName properties don't seem to exist in DOMNode?
     */
    protected function createStartNode($node, &$tokens, $collect, $config)
    {
        // intercept non element nodes. WE MUST catch all of them,
        // but we're not getting the character reference nodes because
        // those should have been preprocessed
        if ($node->nodeType === XML_TEXT_NODE) {
            $data = $this->getData($node); // Handle variable data property
            if ($data !== null) {
                $tokens[] = $this->factory->createText($data);
            }
            return false;
        } elseif ($node->nodeType === XML_CDATA_SECTION_NODE) {
            // undo libxml's special treatment of <script> and <style> tags
            $last = end($tokens);
            $data = $node->data;
            // (note $node->tagname is already normalized)
            if ($last instanceof HTMLPurifier_Token_Start && ($last->name == 'script' || $last->name == 'style')) {
                $new_data = trim($data);
                if (substr($new_data, 0, 4) === '<!--') {
                    $data = substr($new_data, 4);
                    if (substr($data, -3) === '-->') {
                        $data = substr($data, 0, -3);
                    } else {
                        // Highly suspicious! Not sure what to do...
                    }
                }
            }
            $tokens[] = $this->factory->createText($this->parseText($data, $config));
            return false;
        } elseif ($node->nodeType === XML_COMMENT_NODE) {
            // this code is only invoked for comments in script/style in versions
            // of libxml pre-2.6.28 (regular comments, of course, are still
            // handled regularly)
            $tokens[] = $this->factory->createComment($node->data);
            return false;
        } elseif ($node->nodeType !== XML_ELEMENT_NODE) {
            // not-well tested: there may be other nodes we have to grab
            return false;
        }
        $attr = $node->hasAttributes() ? $this->transformAttrToAssoc($node->attributes) : array();
        $tag_name = $this->getTagName($node); // Handle variable tagName property
        if (empty($tag_name)) {
            return (bool) $node->childNodes->length;
        }
        // We still have to make sure that the element actually IS empty
        if (!$node->childNodes->length) {
            if ($collect) {
                $tokens[] = $this->factory->createEmpty($tag_name, $attr);
            }
            return false;
        } else {
            if ($collect) {
                $tokens[] = $this->factory->createStart($tag_name, $attr);
            }
            return true;
        }
    }
    
    /**
     * @param DOMNode $node
     * @param HTMLPurifier_Token[] $tokens
     */
    protected function createEndNode($node, &$tokens)
    {
        $tag_name = $this->getTagName($node); // Handle variable tagName property
        $tokens[] = $this->factory->createEnd($tag_name);
    }
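
To show what the fallbacks rely on, here is a minimal standalone sketch that can be run outside the lexer. `firstAvailableProperty` is a hypothetical name used only for this illustration, and the sample markup is made up. It uses the same `property_exists` check as the methods above and falls through the same property lists.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): return the value of the
// first property in $names that exists on $node, or null if none does.
function firstAvailableProperty($node, array $names)
{
    foreach ($names as $name) {
        if (property_exists($node, $name)) {
            return $node->$name;
        }
    }
    return null;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Hello</p></body></html>');

$element = $doc->getElementsByTagName('p')->item(0); // DOMElement
$text = $element->firstChild;                        // DOMText

// Same fallback order as getTagName() and getData() above.
$tagName = firstAvailableProperty($element, array('tagName', 'nodeName', 'localName'));
$data = firstAvailableProperty($text, array('data', 'nodeValue', 'textContent'));

echo $tagName, "\n"; // prints "p"
echo $data, "\n";    // prints "Hello"
```

For an element node, `nodeName` carries the same value as `tagName`, and for a text node, `nodeValue` carries the same value as `data`, which is why they work as substitutes when the first choice is missing.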

Re: Missing $node->tagName
September 08, 2017 09:57PM

Thanks. Would you mind opening a GitHub PR with your change?

Re: Missing $node->tagName
September 09, 2017 12:36AM

Happy to. I hadn’t seen the repo, but just found it and will issue a PR.

Thanks!

David
