Remove tags without attribute

Posted by xkobal 
Remove tags without attribute
October 22, 2008 06:40AM

Hello,

I would like to know if there is a way to remove tags that have no attributes.

For example: I want to keep

<span class="myClass">

but I don't want

<span>

to avoid useless span tags in my content.

Is there a way to do that?

Best regards.

Re: Remove tags without attribute
October 22, 2008 12:27PM

Who says span tags can't have utility without attributes? The CSS selector "a span" certainly doesn't think so!

More seriously, I'm curious to know why you think this would be necessary.

Re: Remove tags without attribute
October 22, 2008 12:33PM

Nobody says it's necessary to remove spans without attributes except me :D

Let me explain.

I'm creating a back-office CMS for my company with the YUI Rich Text Editor, and I use HTML Purifier for the back-end cleaning of the HTML.

Everything works very well except that the YUI editor leaves some

<span>

in the content that I don't want to keep. These spans serve no purpose!

I hope you understand my problem.

Re: Remove tags without attribute
October 22, 2008 12:35PM

Aight. So a feature like this wouldn't be terribly high priority for me, but I'll put it on my TODO list. Would you be interested in helping make a patch? I would be most certainly willing to help and give a general game plan for implementing this.

Re: Remove tags without attribute
October 22, 2008 12:41PM

I will certainly need help: I have to implement this feature, and I'm relatively new to HTML Purifier, so I'll need to dive into the code to see how to write a good implementation.

Can you point me in the right direction?

Re: Remove tags without attribute
October 22, 2008 12:50PM

Ok. So since we're attempting to remove a tag, either an Injector or something hooking into RemoveForeignElements would be appropriate. We're going to go with an injector, since RemoveForeignElements is dumb, and we're modifying the DOM.

In order to use injectors properly, you will need to download a development version of HTML Purifier; you can find instructions here for doing so. If you need help using Git, do not hesitate to ask.

Once you've got that set up, ping me here and I'll give you the game plan.

Re: Remove tags without attribute
October 22, 2008 04:18PM

OK, I'm going to download the development version tomorrow.

Don't be surprised if I sometimes take hours to respond: I am in Paris, and we have a 9-hour time difference!

I'll post again once I'm set up with your version and the injector.

Thanks for your help.

Re: Remove tags without attribute
October 23, 2008 09:51AM

Hello,

I'm having a problem syncing with Git. Here is the error:

git clone git://repo.or.cz/htmlpurifier.git

repo.or.cz[0: 62.24.64.27]: errno=No error fatal: unable to connect a socket (No error)

I am on Windows Vista. Do you know what is wrong?

Re: Remove tags without attribute
October 23, 2008 11:03AM

Git support on Windows has been a rather iffy thing in the past... Which version for Windows did you download? Is your back-end server on Windows too?

Re: Remove tags without attribute
October 23, 2008 11:08AM

No, my server runs Linux, but my PHP source code is on my computer.

The Git version for Windows is the latest one from Google: git version 1.5.6.1.1071.g76fb

Re: Remove tags without attribute
October 23, 2008 12:02PM

I assume you mean this project: http://code.google.com/p/msysgit/

They may be interested in knowing about your error, as it's a problem with Git. As they mention on that page, the official support is currently under the Cygwin version, but that may or may not improve the issue.

Also, do you have a PHP environment on your Vista box? The reason I ask is that to run unit tests, you will need PHP command-line support. Otherwise, you might want to run Git on the server and do your HTML Purifier development on the server. Or install a VMware server and run an instance of Linux there. :)

(Offtopic - Ha! As I'm working on an injector myself, I noticed that the forum used the AutoFormat.Linkify Injector on the link above. Nice!)

Re: Remove tags without attribute
October 23, 2008 12:14PM

I am going to check, and maybe install Cygwin.

I have the full PHP environment on my desktop, including the command line.

Re: Remove tags without attribute
October 23, 2008 01:31PM

I'm inclined to ascribe this error to temporary network problems. Can you try again right now?

Anyway, installing Cygwin, if you don't mind basically installing an OS inside your OS, is a pretty awesome idea, especially if you've had some experience with Unix command line. It Unixfies your system, and makes lots of things so much easier. I use the Cygwin version of Git, even though I have msysgit installed myself.

Re: Remove tags without attribute
April 07, 2009 08:23AM

I would also find this feature useful. My situation is cleaning up HTML pasted from Word. This works pretty well but often leaves behind tags with no attributes.

I may also get some time to work on this, but not for a few weeks (fortunately my dev environment is Linux based so I shouldn't have the problems with git)

Re: Remove tags without attribute
April 07, 2009 10:19AM

Let me know when you're setting up a Git repository for this, and I can help out.

Re: Remove tags without attribute
July 07, 2009 12:06PM

Ok, I've checked out the latest source from Git. I've had a quick look through the source for Injector and its subclasses, and also at MakeWellFormed.

It looks like I need to implement the handleElement method in Injector. In that method, I can see if the current token is a span with no attributes. If this is so, I need to remove the current token (by setting it to false), then use forwardUntilEndToken to get to the closing span tag and delete that, then rewind back to the token after the open span that was deleted.

Does that sound about right?
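
For readers following along, the plan above can be sketched outside HTML Purifier. This is only an illustration under stated assumptions: the array-based token model and the helper names (startTag, endTag, text, removeBareSpans, render) are hypothetical and are not part of the Injector API, and the input is assumed to be well formed.

```php
<?php
// Hypothetical, simplified token model (NOT HTML Purifier's token classes):
// each token is an array whose 'type' is 'start', 'end' or 'text'.
function startTag(string $name, array $attr = []): array
{
    return ['type' => 'start', 'name' => $name, 'attr' => $attr];
}

function endTag(string $name): array
{
    return ['type' => 'end', 'name' => $name];
}

function text(string $data): array
{
    return ['type' => 'text', 'data' => $data];
}

// Drop every attribute-less <span> together with its matching </span>.
function removeBareSpans(array $tokens): array
{
    $drop = []; // indexes of tokens to delete
    $n = count($tokens);

    foreach ($tokens as $i => $token) {
        $isBareSpan = $token['type'] === 'start'
            && $token['name'] === 'span'
            && empty($token['attr']);
        if (!$isBareSpan) {
            continue;
        }

        // Walk forward to the matching end tag, tracking nesting depth.
        $depth = 0;
        for ($j = $i + 1; $j < $n; $j++) {
            if ($tokens[$j]['type'] === 'start') {
                $depth++;
            } elseif ($tokens[$j]['type'] === 'end') {
                if ($depth === 0) {
                    break;
                }
                $depth--;
            }
        }

        // Only delete the pair if we really landed on a closing span.
        if ($j < $n && $tokens[$j]['name'] === 'span') {
            $drop[$i] = true;
            $drop[$j] = true;
        }
    }

    return array_values(array_diff_key($tokens, $drop));
}

// Serialize the simplified tokens back to markup (no escaping; demo only).
function render(array $tokens): string
{
    $html = '';
    foreach ($tokens as $t) {
        if ($t['type'] === 'text') {
            $html .= $t['data'];
        } elseif ($t['type'] === 'end') {
            $html .= '</' . $t['name'] . '>';
        } else {
            $html .= '<' . $t['name'];
            foreach ($t['attr'] as $k => $v) {
                $html .= ' ' . $k . '="' . $v . '"';
            }
            $html .= '>';
        }
    }
    return $html;
}

$tokens = [
    startTag('p'), text('a '), startTag('span'), text('b '),
    startTag('em'), text('c'), endTag('em'), text(' d'),
    endTag('span'), text(' e'), endTag('p'),
];
echo render(removeBareSpans($tokens)), "\n"; // <p>a b <em>c</em> d e</p>
```

Because only the start and end tags are deleted and the content tokens are left in place, nested bare spans are each handled on their own turn.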

Re: Remove tags without attribute
July 07, 2009 12:17PM

That sounds about right. It might be a little tricky, so make sure you have some tests!

Re: Remove tags without attribute
July 08, 2009 07:27AM

Ok, I've made some progress. The following works:

<p>a <span>b <em>c</em> d</span> e</p>
gets turned into:
<p>a b <em>c</em> d e</p>

but when spans are nested, only the outer one is removed. e.g.:

<span>one<span>two<span>three</span></span></span>
becomes:
one<span>two<span>three</span></span>

My code so far is this:

/**
 * Injector that removes spans with no attributes
 */
class HTMLPurifier_Injector_RemoveSpansWithoutAttributes extends HTMLPurifier_Injector
{
    public $name = 'RemoveSpansWithoutAttributes';
    public $needed = array('span');

    public function handleElement(&$token) {
        if (!$token instanceof HTMLPurifier_Token_Start) return;
        if ($token->name !== 'span' || !empty($token->attr))
            return;
        
        $nesting = 0;
        $spanContentTokens = array();
        while ($this->forwardUntilEndToken(&$i, &$current, &$nesting)) 
            $spanContentTokens[] = $current;
        
        // Delete the number of tokens we've iterated over, plus the current token
        $tokensToDelete = $i - $this->inputIndex + 1;
        
        $token = array_merge(array($tokensToDelete), $spanContentTokens);
        $this->rewind($this->inputIndex - 1);
    }
}

So, it looks like the rewind function isn't doing anything here. What do I need to do to make the contents of the span be reprocessed?

Re: Remove tags without attribute
July 08, 2009 09:21AM

Ok, I think I see the problem. Any tokens that are deleted get their 'skip' flag set, and aren't sent back to the injector. So because I am removing and then adding back the tokens inside the span, they don't get sent back to my injector again.

Re: Remove tags without attribute
July 08, 2009 10:34AM

Yeah, don't do that; just remove the tokens you want to remove.

Re: Remove tags without attribute
July 08, 2009 12:31PM

Ok, I have it working now. However, I'm having a problem with one of my tests. I want to make sure that a span still gets removed if it had attributes, but they were all removed because they were not allowed. In my setup() function in my testcase, I'm doing the following:

$this->config->set('HTML.Allowed', 'span[class],div,p,strong,em');

However, this seems to have no effect - attributes other than class are still allowed. My testcase extends HTMLPurifier_InjectorHarness.

Re: Remove tags without attribute
July 08, 2009 12:46PM

Good point. In this case, there's a simple fix, although the generalized case is a little more complicated.

Make class a required attribute on span (you can overload it or patch the source), and HTML Purifier will perform attribute validation early.

To deal with the general case, you'll have to run AttrValidator on any incoming tokens, since attributes get validated after we think the tree is ok.

Re: Remove tags without attribute
July 08, 2009 12:46PM

As a footnote, you should set the "already validated attributes" flag so that we don't duplicate work. Check the source that deals with required attributes to see what this is.
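
To make the ordering issue concrete, here is a minimal standalone sketch of "validate attributes first, then decide whether the span is bare". It does not use HTML Purifier: the allow-list constant, the array token shape and both function names are assumptions for illustration, and the flag key simply mirrors the idea of an "already validated" marker.

```php
<?php
// Hypothetical allow-list; HTML Purifier would derive this from its
// HTML definition rather than from a constant.
const ALLOWED_SPAN_ATTRS = ['class'];

// Strip disallowed attributes and record that validation already ran,
// so a later validation pass could skip this token.
function validateSpanAttrs(array $token): array
{
    $token['attr'] = array_intersect_key(
        $token['attr'],
        array_flip(ALLOWED_SPAN_ATTRS)
    );
    $token['armor']['ValidateAttributes'] = true;
    return $token;
}

// A span is removable only once validation has left it without attributes.
function isRemovableSpan(array $token): bool
{
    return $token['name'] === 'span' && empty($token['attr']);
}

// <span snorkel buzzer="emu">: every attribute is disallowed.
$token = validateSpanAttrs([
    'type' => 'start',
    'name' => 'span',
    'attr' => ['snorkel' => 'snorkel', 'buzzer' => 'emu'],
]);
var_dump(isRemovableSpan($token)); // bool(true)
```

Without the validation step, the span above would still appear to have attributes at the moment the injector inspects it, which is the behaviour the failing test exposed.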

Re: Remove tags without attribute
July 09, 2009 05:20AM

Ok, here's my work so far:

From 392d1b8b9ec2e4f83c7251ea909e699cee74f16d Mon Sep 17 00:00:00 2001
From: Paul Stone <patches@pdjs.co.uk>
Date: Thu, 9 Jul 2009 10:12:10 +0100
Subject: [PATCH] Created Injector to remove spans without any attributes

---
 .../Injector/RemoveSpansWithoutAttributes.php      |   56 +++++++++++
 .../Injector/RemoveSpansWithoutAttributesTest.php  |  100 ++++++++++++++++++++
 2 files changed, 156 insertions(+), 0 deletions(-)
 create mode 100755 library/HTMLPurifier/Injector/RemoveSpansWithoutAttributes.php
 create mode 100755 tests/HTMLPurifier/Injector/RemoveSpansWithoutAttributesTest.php

diff --git a/library/HTMLPurifier/Injector/RemoveSpansWithoutAttributes.php b/library/HTMLPurifier/Injector/RemoveSpansWithoutAttributes.php
new file mode 100755
index 0000000..191d85b
--- /dev/null
+++ b/library/HTMLPurifier/Injector/RemoveSpansWithoutAttributes.php
@@ -0,0 +1,56 @@
+<?php
+
+/**
+ * Injector that removes spans with no attributes
+ */
+class HTMLPurifier_Injector_RemoveSpansWithoutAttributes extends HTMLPurifier_Injector
+{
+    public $name = 'RemoveSpansWithoutAttributes';
+    public $needed = array('span');
+
+    private $attrValidator;
+    
+    /**
+     * Used by AttrValidator
+     */
+    private $config;
+    private $context;
+    
+    public function prepare($config, $context) {
+        $this->attrValidator = new HTMLPurifier_AttrValidator();
+        $this->config = $config;
+        $this->context = $context;
+        return parent::prepare($config, $context);
+    }
+    
+    public function handleElement(&$token) {
+        if ($token->name !== 'span' || !$token instanceof HTMLPurifier_Token_Start) 
+            return;
+
+        // We need to validate the attributes now since this doesn't normally
+        // happen until after MakeWellFormed. If all the attributes are removed
+        // the span needs to be removed too.
+        $this->attrValidator->validateToken($token, $this->config, $this->context);
+        $token->armor['ValidateAttributes'] = true;
+        
+        if (!empty($token->attr))
+            return;
+        
+        $nesting = 0;
+        $spanContentTokens = array();
+        while ($this->forwardUntilEndToken(&$i, &$current, &$nesting)) {}
+        
+        // Mark closing span tag for deletion
+        $current->markForDeletion = true;  
+        
+        // Delete open span tag
+        $token = false;
+    }
+    
+    public function handleEnd(&$token) {
+        if ($token->markForDeletion)
+            $token = false;
+    }
+}
+
+// vim: et sw=4 sts=4
diff --git a/tests/HTMLPurifier/Injector/RemoveSpansWithoutAttributesTest.php b/tests/HTMLPurifier/Injector/RemoveSpansWithoutAttributesTest.php
new file mode 100755
index 0000000..8230b4e
--- /dev/null
+++ b/tests/HTMLPurifier/Injector/RemoveSpansWithoutAttributesTest.php
@@ -0,0 +1,100 @@
+<?php
+
+class HTMLPurifier_Injector_RemoveSpansWithoutAttributesTest extends HTMLPurifier_InjectorHarness
+{
+    function setup() {
+        parent::setup();
+        $this->config->set('HTML.Allowed', 'span[class],div,p,strong,em');
+        $this->config->set('AutoFormat.Custom', 'RemoveSpansWithoutAttributes');
+    }
+
+    function testSingleSpan() {
+        $this->assertResult(
+            '<span>foo</span>',
+            'foo'
+        );
+    }
+
+    function testSingleSpanWithAttributes() {
+        $this->assertResult(
+            '<span class="bar">foo</span>',
+            '<span class="bar">foo</span>'
+        );
+    }
+    
+    function testSingleNestedSpan() {
+        $this->assertResult(
+            '<p><span>foo</span></p>',
+            '<p>foo</p>'
+        );
+    }    
+    
+    function testSingleNestedSpanWithAttributes() {
+        $this->assertResult(
+            '<p><span class="bar">foo</span></p>',
+            '<p><span class="bar">foo</span></p>'
+        );
+    }
+    
+        
+    function testSpanWithChildren() {
+        $this->assertResult(
+            '<span>foo <strong>bar</strong> <em>baz</em></span>',
+            'foo <strong>bar</strong> <em>baz</em>'
+        );
+    }
+    
+    function testSpanWithSiblings() {
+        $this->assertResult(
+            '<p>before <span>inside</span> <strong>after</strong></p>',
+            '<p>before inside <strong>after</strong></p>'
+        );
+    }
+    
+    function testNestedSpanWithSiblingsAndChildren()
+    {
+        $this->assertResult(
+            '<p>a <span>b <em>c</em> d</span> e</p>',
+            '<p>a b <em>c</em> d e</p>'
+        );
+    }
+    
+    function testNestedSpansWithoutAttributes() {
+        $this->assertResult(
+            '<span>one<span>two<span>three</span></span></span>',
+            'onetwothree'
+        );
+    }
+    
+    function testDeeplyNestedSpan() {
+        $this->assertResult(
+            '<div><div><div><span class="a">a <span>b</span> c</span></div></div></div>',
+            '<div><div><div><span class="a">a b c</span></div></div></div>'
+        );
+    }
+    
+    function testSpanWithInvalidAttributes() {
+        $this->assertResult(
+            '<p><span snorkel buzzer="emu">foo</span></p>',
+            '<p>foo</p>'
+        );
+    }
+    
+    function testNestedAlternateSpans() {
+        $this->assertResult(
+'<span>a <span class="x">b <span>c <span class="y">d <span>e <span class="z">f
+</span></span></span></span></span></span>',
+'a <span class="x">b c <span class="y">d e <span class="z">f
+</span></span></span>'
+        );
+    }
+    
+    function testSpanWithSomeInvalidAttributes() {
+        $this->assertResult(
+            '<p><span buzzer="emu" class="bar">foo</span></p>',
+            '<p><span class="bar">foo</span></p>'
+        );
+    }
+}
+
+// vim: et sw=4 sts=4
-- 
1.5.4.3

The following points are still outstanding:

  • I currently set markForDeletion on the end token in handleElement then check it in handleEnd. Is there a better way to do this? I tried storing an array of end elements to delete, then used in_array in handleEnd, but this caused an infinite loop somewhere.
  • Do I need more tests?
  • Should I add this into the config schema under AutoFormat.RemoveSpansWithoutAttributes?

Re: Remove tags without attribute
July 09, 2009 11:26AM

I currently set markForDeletion on the end token in handleElement then check it in handleEnd. Is there a better way to do this? I tried storing an array of end elements to delete, then used in_array in handleEnd, but this caused an infinite loop somewhere.

No, I think that's fine. You have a very straightforward and easy to understand implementation. You should double-check, though, and make sure $current is actually the end token you're looking for.

Do I need more tests?

Tests look great!

Should I add this into the config schema under AutoFormat.RemoveSpansWithoutAttributes ?

Yep. Do that (and the $current double-check), I'll commit this into my repository.

Thanks for the great patch!
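
For anyone wanting to try the feature once it is available, enabling it from application code would presumably look something like the sketch below. It assumes HTML Purifier is installed, that the directive ships under the name used in this thread, and that the include path (a placeholder here) is adjusted to your setup.

```php
<?php
// Path is an assumption; point it at wherever HTML Purifier is installed.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Same allow-list as the test case in this thread.
$config->set('HTML.Allowed', 'span[class],div,p,strong,em');
// Directive name as proposed in this thread.
$config->set('AutoFormat.RemoveSpansWithoutAttributes', true);

$purifier = new HTMLPurifier($config);
echo $purifier->purify('<p>a <span>b</span> <span class="x">c</span></p>');
```

With the directive left at its default of false, the bare span should be kept as before.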

Re: Remove tags without attribute
July 09, 2009 12:11PM

Ok, this patch should apply on top of the last one:

From 68a6c9a4d2cb3da9867b26f1be336831bd5e03e9 Mon Sep 17 00:00:00 2001
From: Paul Stone <patches@pdjs.co.uk>
Date: Thu, 9 Jul 2009 17:03:36 +0100
Subject: [PATCH] Add an extra safety check in RemoveSpansWithoutAttributes and add a config options for it.

---
 .../AutoFormat.RemoveSpansWithoutAttributes.txt    |   12 ++++++++++++
 .../Injector/RemoveSpansWithoutAttributes.php      |   13 +++++++------
 .../Injector/RemoveSpansWithoutAttributesTest.php  |    2 +-
 3 files changed, 20 insertions(+), 7 deletions(-)
 create mode 100755 library/HTMLPurifier/ConfigSchema/schema/AutoFormat.RemoveSpansWithoutAttributes.txt

diff --git a/library/HTMLPurifier/ConfigSchema/schema/AutoFormat.RemoveSpansWithoutAttributes.txt b/library/HTMLPurifier/ConfigSchema/schema/AutoFormat.RemoveSpansWithoutAttributes.txt
new file mode 100755
index 0000000..026baf5
--- /dev/null
+++ b/library/HTMLPurifier/ConfigSchema/schema/AutoFormat.RemoveSpansWithoutAttributes.txt
@@ -0,0 +1,12 @@
+AutoFormat.RemoveSpansWithoutAttributes
+TYPE: bool
+VERSION: 4.0.1
+DEFAULT: false
+--DESCRIPTION--
+
+<p>
+  This directive causes <code>span</code> tags without any attributes
+  to be removed. It will also remove spans that had all attributes
+  removed during processing.
+</p>
+--# vim: et sw=4 sts=4
diff --git a/library/HTMLPurifier/Injector/RemoveSpansWithoutAttributes.php b/library/HTMLPurifier/Injector/RemoveSpansWithoutAttributes.php
index 191d85b..2cef00e 100755
--- a/library/HTMLPurifier/Injector/RemoveSpansWithoutAttributes.php
+++ b/library/HTMLPurifier/Injector/RemoveSpansWithoutAttributes.php
@@ -39,12 +39,13 @@ class HTMLPurifier_Injector_RemoveSpansWithoutAttributes extends HTMLPurifier_In
         $nesting = 0;
         $spanContentTokens = array();
         while ($this->forwardUntilEndToken(&$i, &$current, &$nesting)) {}
-        
-        // Mark closing span tag for deletion
-        $current->markForDeletion = true;  
-        
-        // Delete open span tag
-        $token = false;
+
+        if ($current instanceof HTMLPurifier_Token_End && $current->name === 'span') {
+            // Mark closing span tag for deletion
+            $current->markForDeletion = true;  
+            // Delete open span tag
+            $token = false;
+        }
     }
     
     public function handleEnd(&$token) {
diff --git a/tests/HTMLPurifier/Injector/RemoveSpansWithoutAttributesTest.php b/tests/HTMLPurifier/Injector/RemoveSpansWithoutAttributesTest.php
index 8230b4e..aed94d8 100755
--- a/tests/HTMLPurifier/Injector/RemoveSpansWithoutAttributesTest.php
+++ b/tests/HTMLPurifier/Injector/RemoveSpansWithoutAttributesTest.php
@@ -5,7 +5,7 @@ class HTMLPurifier_Injector_RemoveSpansWithoutAttributesTest extends HTMLPurifie
     function setup() {
         parent::setup();
         $this->config->set('HTML.Allowed', 'span[class],div,p,strong,em');
-        $this->config->set('AutoFormat.Custom', 'RemoveSpansWithoutAttributes');
+        $this->config->set('AutoFormat.RemoveSpansWithoutAttributes', true);
     }
 
     function testSingleSpan() {
-- 
1.5.4.3

I just wanted to check that it's OK for me to set $current->markForDeletion even though that variable isn't defined in the Token class. I figured there's not much point in adding it there, since it's only used by this one Injector.

Re: Remove tags without attribute
July 09, 2009 12:14PM

You can put it as a flag instead of a member variable; in fact, that's probably preferred. Prefix it with the name of your formatter.

Re: Remove tags without attribute
July 09, 2009 12:26PM

Ok, how's this?

From 398b36f7f7366cfe0bd03b2cb909b843636732db Mon Sep 17 00:00:00 2001
From: Paul Stone <patches@pdjs.co.uk>
Date: Thu, 9 Jul 2009 17:22:44 +0100
Subject: [PATCH] Clarify name of deletion flag

---
 .../Injector/RemoveSpansWithoutAttributes.php      |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/library/HTMLPurifier/Injector/RemoveSpansWithoutAttributes.php b/library/HTMLPurifier/Injector/RemoveSpansWithoutAttributes.php
index 2cef00e..afc114b 100755
--- a/library/HTMLPurifier/Injector/RemoveSpansWithoutAttributes.php
+++ b/library/HTMLPurifier/Injector/RemoveSpansWithoutAttributes.php
@@ -42,14 +42,14 @@ class HTMLPurifier_Injector_RemoveSpansWithoutAttributes extends HTMLPurifier_In

         if ($current instanceof HTMLPurifier_Token_End && $current->name === 'span') {
             // Mark closing span tag for deletion
-            $current->markForDeletion = true;
+            $current->RemoveSpansWithoutAttributes_markForDeletion = true;
             // Delete open span tag
             $token = false;
         }
     }

     public function handleEnd(&$token) {
-        if ($token->markForDeletion)
+        if ($token->RemoveSpansWithoutAttributes_markForDeletion)
             $token = false;
     }
 }
--
1.5.4.3

Re: Remove tags without attribute
July 15, 2009 02:24PM

I have put this into my todo list to get landed in trunk.

Re: Remove tags without attribute
August 27, 2009 08:43PM

I've committed the patch to a local topic branch; after reviewing test-cases, I'll stick it in master and publish it. Sorry about the wait!