
Space Characters Getting Converted by HTML Purifier

Posted by Supershaba 
Space Characters Getting Converted by HTML Purifier
July 04, 2011 06:05AM

I've got a problem with a module that I am writing. I have an array of names; each name follows a particular format and can contain characters as well as spaces. A name can take one of two formats:

$name[0] = "--> Psychic Barrier"; $name[1] = "    Initial Presence";

In my code, I take each line in the array and use a preg_match statement to see if it matches one of these two patterns. So I am looking either for a line that starts with two dashes and a '>', followed by a space, or for a line that starts with 4 spaces. This is my preg_match statement:

while (preg_match('/^(--> |--> |--> | {4}|    |    )(.+)$/', $name[$i], $capturedname))

The problem is that only the name starting with the arrow passes the preg_match statement; a name that starts with just the 4 spaces does not. It seems that HTML Purifier is converting the leading spaces to something else, and I am wondering what that is. It could be an encoding problem, but what are the spaces being encoded into? My document is UTF-8. Any help would be greatly appreciated. Thanks.
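
One way to answer the "what are the spaces turning into?" question is to dump the first few bytes of a line that fails to match. The sketch below is only an illustration: `leading_bytes_hex()` is a hypothetical helper name, and the second sample string assumes the spaces have become UTF-8 non-breaking spaces (bytes C2 A0), which is a guess and not something confirmed in this thread.

```php
<?php
// Hypothetical helper: return the first $count bytes of a line as hex pairs.
function leading_bytes_hex($line, $count = 8) {
    return implode(' ', str_split(bin2hex(substr($line, 0, $count)), 2));
}

// Four ordinary ASCII spaces.
$plain = "    Initial Presence";
// Assumption: four UTF-8 encoded non-breaking spaces (U+00A0).
$nbsp = "\xC2\xA0\xC2\xA0\xC2\xA0\xC2\xA0Initial Presence";

echo leading_bytes_hex($plain), "\n"; // 20 20 20 20 49 6e 69 74
echo leading_bytes_hex($nbsp), "\n";  // c2 a0 c2 a0 c2 a0 c2 a0
```

Running the same helper on a real line taken after the purifier has run would show which bytes the pattern actually needs to match.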

Re: Space Characters Getting Converted by HTML Purifier
July 04, 2011 09:28AM

I'm very confused about why you're calling HTML Purifier in the first place. HTML Purifier makes no guarantees about whitespace stability. Setting %Core.LexerImpl to 'DirectLex' might do better, but no guarantees.

Re: Space Characters Getting Converted by HTML Purifier
July 04, 2011 10:15AM

This is a Drupal module and it has to run with the HTML Purifier filter on. But when I have the HTML Purifier filter on, the spaces are getting converted to some unknown character, and that's the problem. The DirectLex part is hard to understand. So I should go to configure and, under LexerImpl, uncheck the box that says "LexerImpl Null/Disabled or Not Supported"? Is there any other option?

Re: Space Characters Getting Converted by HTML Purifier
July 04, 2011 10:47AM

Why... do you have HTML Purifier filter enabled for names?

Re: Space Characters Getting Converted by HTML Purifier
July 05, 2011 07:49AM

I am actually filtering a lot of other things, and the name is not really a person's name. The user enters a bunch of text, and I take the entire user input and split it line by line. Then I check each line to see if it matches one of two formats: the arrow plus a word, or 4 spaces plus a word. The arrow plus a word matches my preg statement just fine. But when I encounter a line that starts with 4 spaces and a word, it doesn't match.

So that's where I need to determine what the spaces are getting converted to. It only happens when I have the HTML Purifier turned on, which I need to be on for my Drupal module. Otherwise, if I just use 'ampersand+hash+160+semicolon', it matches just fine. The 'ampersand+hash+160+semicolon' is getting converted by the Purifier to something else.
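
If the entity is being decoded into the actual non-breaking space character (U+00A0), which is an assumption here and not something established in the thread, a pattern that accepts both ordinary spaces and that character could look like the sketch below. `html_entity_decode()` only stands in for the decoding step; it is not HTML Purifier itself, and the sample names are illustrative.

```php
<?php
// Assumption: by the time this filter runs, "&#160;" has been decoded to the
// UTF-8 character U+00A0. html_entity_decode() simulates that step here.
$names = array(
    '--> Psychic Barrier',
    html_entity_decode('&#160;&#160;&#160;&#160;Initial Presence', ENT_QUOTES, 'UTF-8'),
    '    Plain Spaces',
    'No prefix',
);

// The /u modifier makes PCRE treat pattern and subject as UTF-8, so
// \x{00A0} matches one non-breaking space. The arrow is accepted both
// as a literal ">" and as the escaped "&gt;".
$pattern = '/^(--(?:>|&gt;) |[ \x{00A0}]{4})(.+)$/u';

$captured = array();
foreach ($names as $line) {
    if (preg_match($pattern, $line, $m)) {
        $captured[] = $m[2]; // the name without its prefix
    }
}
print_r($captured);
```

The character class `[ \x{00A0}]` also tolerates a mix of both kinds of space, which may or may not be desirable for this input.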

Re: Space Characters Getting Converted by HTML Purifier
July 05, 2011 08:02AM

Yes, yes, but why is this at all relevant text for HTML Purifier to filter?

Re: Space Characters Getting Converted by HTML Purifier
July 05, 2011 10:22AM

HTML Purifier is one of the default filters, and it runs through the user's input, changing the spaces, before my own filter runs. I can't turn off HTML Purifier, as it is part of our site and is needed for many other tasks.

Re: Space Characters Getting Converted by HTML Purifier
July 06, 2011 08:22AM

Let me clarify as to what I am actually doing in this module. Basically, the users will be pasting the contents of auto generated text files into posts, and my filter's purpose is to convert this content into rich HTML.

However, as well as the text file contents, the user can also post other text or HTML in the same content field. So we need to HTMLPurify it.

As per the warning in HTMLPurifier on Drupal, I am running HTMLPurifier as a filter before my own custom filter (which is relatively complex).

And so that's really what I am doing here. As for the %Core.LexerImpl option that you had mentioned, I looked into that in my configurations for HTML Purifier and all I have under that is a basic check box that is checked like this:

LexerImpl Null/Disabled [ ] or Not supported

Is my HTML Purifier missing anything, as I don't see a DirectLex option anywhere?

Re: Space Characters Getting Converted by HTML Purifier
July 07, 2011 08:27AM

Then you should purify *after* your own filter runs.

(The LexerImpl thing is a bit tricky for Drupal users, due to an infelicity in the forms. You'll basically need to do the advanced configuration using PHP files; there isn't a way to web-interface it.)
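
For readers unsure what "configuration using PHP" looks like, here is a minimal sketch with the plain HTML Purifier 4.x API. It assumes the library has already been loaded (for example through its HTMLPurifier.auto.php include), and `example_directlex_config()` is a hypothetical helper name. How the Drupal module picks up such a configuration is not shown in this thread; the module's own README is the place to check for the exact file and function names.

```php
<?php
// Sketch only: assumes HTML Purifier 4.x is already loaded.
function example_directlex_config() { // hypothetical helper name
    $config = HTMLPurifier_Config::createDefault();
    // Select the DirectLex lexer implementation mentioned in this thread.
    $config->set('Core.LexerImpl', 'DirectLex');
    return $config;
}
```

The returned object would then be passed to the purifier, e.g. `new HTMLPurifier(example_directlex_config())`.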

Re: Space Characters Getting Converted by HTML Purifier
July 08, 2011 01:19AM

As you had specified, I did run my Filter ahead of the HTML Purifier. The 4 spaces are passing the preg_match statement, but before the entire html is output into the browser, the program halts and I get the following error in my browser:

Fatal error: Maximum execution time of 30 seconds exceeded in htmlpurifier/library/HTMLPurifier/Strategy/MakeWellFormed.php on line 511.

I clocked my program with just my filter running and it only takes 13 seconds to run. Nevertheless, I increased the maximum execution time to 60 seconds, to see if that made any difference, but I still get that error after 60 seconds.

I have pasted the html output that is supposed to be displayed after my filter runs, at the following link:

http://www.textbin.com/show_text.php?id=2s217

Would really appreciate it, if you could take a look at it and point out what part of that might be causing the error in the HTML Purifier.

Thanks.

Re: Space Characters Getting Converted by HTML Purifier
July 08, 2011 08:23AM

Wow, that's a lot of HTML. Does it actually take 13 seconds to generate w/o HTML Purifier?

Re: Space Characters Getting Converted by HTML Purifier
July 11, 2011 12:20AM

Yes, I clocked it, with just my filter running, and it takes about 13 seconds each time.

Re: Space Characters Getting Converted by HTML Purifier
July 11, 2011 08:14AM

I suspect a combination of a slow server and a very large amount of HTML is causing you to lose, in this case.

Re: Space Characters Getting Converted by HTML Purifier
July 11, 2011 08:30AM

I actually used the special Test Forum (http://htmlpurifier.org/phorum/list.php?4) to run my html output through the HTML Purifier on this site, and I get the same problem (maximum execution time exceeded error). As you may notice, the html output is divided into 3 parts by Booster name. There are three boosters, and each of them has a lot of divs underneath it. What I noticed was that if I run just the html output for one of the boosters (meaning 1/3 of the total html output), the HTML Purifier has no problem and runs fine. I checked with your test forum and on my site as well.

But if I enter the html output of more than 1 booster, the purifier hangs. The odd thing is that the html code for all 3 boosters is identical; it is basically a bunch of divs all styled in the same way. Going from booster 1 to booster 2 does not just double the time, so that's where I am confused. If one booster only takes so many seconds, how can running just 3 boosters take such an extremely long time and cause the Purifier to hang? I actually extended the maximum_execution_time to 150 seconds and it still hung. The total size of the html output is about 800KB; is it a problem with the Purifier not being able to handle that big a size?

Re: Space Characters Getting Converted by HTML Purifier
July 14, 2011 01:59AM

Would appreciate it if you can provide any insight as to why the purification fails when I test it in my site as well as the Test forum here.

Re: Space Characters Getting Converted by HTML Purifier
July 14, 2011 08:12AM

I'm not really sure. It's possible that the third segment of HTML is triggering an infinite loop, but I haven't investigated closely enough yet.

Re: Space Characters Getting Converted by HTML Purifier
July 18, 2011 01:13AM

I can actually put in either booster 1 or 2 or 3 by itself and they all run fine in the purifier, but it is when I enter more than 1, that causes the problem. I hope this will help when you look into the problem further. Thanks.

Re: Space Characters Getting Converted by HTML Purifier
July 22, 2011 01:30AM

Any luck with this issue? Have been stuck on this for over a month now, wish there was some way we can get to the root of the problem.

Having a similar issue. I have a simple crawler, that pulls up the html from a link, and tries to purify it. Croaks on:

[Sat Jul 23 03:42:40 2011] [error] [client 127.0.0.1] PHP 7. HTMLPurifier_Strategy_Composite->execute() /libs/htmlpurifier/HTMLPurifier.php:181

[Sat Jul 23 03:42:40 2011] [error] [client 127.0.0.1] PHP 8. HTMLPurifier_Strategy_MakeWellFormed->execute() /libs/htmlpurifier/HTMLPurifier/Strategy/Composite.php:18

[Sat Jul 23 03:42:40 2011] [error] [client 127.0.0.1] PHP 9. HTMLPurifier_Strategy_MakeWellFormed->insertBefore() /libs/htmlpurifier/HTMLPurifier/Strategy/MakeWellFormed.php:215

[Sat Jul 23 03:42:40 2011] [error] [client 127.0.0.1] PHP 10. array_splice() /libs/htmlpurifier/HTMLPurifier/Strategy/MakeWellFormed.php:511

The URL I'm trying to crawl is: http://www.reddit.com/r/technology/comments/iwu64/jawdropping_demo_of_a_lightweight_robot_that/

It's a big chunk of HTML, but it's not dying from lack of memory. Any help would be great!

Re: Space Characters Getting Converted by HTML Purifier
July 23, 2011 06:22AM

nAv: When I run that with the latest version of HTML Purifier with PHP 5.3.3, it chews for a while and then spits it out with no warnings. Can you give a full reproduction test case?

Re: Space Characters Getting Converted by HTML Purifier
July 23, 2011 06:32AM

Supershaba: I ran HTML Purifier on your input, and it didn't infinite loop. What version of PHP are you running?

Re: Space Characters Getting Converted by HTML Purifier
July 25, 2011 12:38AM

I am running PHP Version 5.2.6. It goes into an infinite loop on my site, as well as when I test it on this site, with the html for more than 1 booster.

Re: Space Characters Getting Converted by HTML Purifier
July 25, 2011 08:04AM

Can you increase the max execution time to something ludicrous, like five minutes?
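
If editing php.ini is inconvenient, the limit can usually be raised for a single request from the script itself. This is a small sketch under the assumption that the host allows the setting to be changed at runtime.

```php
<?php
// Raise the limit for this request only; 300 seconds = five minutes.
// ini_set() returns the previous value, or false if the change was refused.
$previous = ini_set('max_execution_time', '300');
if ($previous === false) {
    error_log('Could not change max_execution_time; the host may restrict it.');
}
```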

Re: Space Characters Getting Converted by HTML Purifier
July 26, 2011 08:32AM

Ready to try anything ludicrous, at this moment. I can give it a try and will let you know the outcome.

Re: Space Characters Getting Converted by HTML Purifier
July 27, 2011 09:18AM

Tried running it for 5 mins, as you said, and got the same error: Maximum execution time of 300 seconds exceeded. Really don't know what it is getting caught on.

Re: Space Characters Getting Converted by HTML Purifier
July 29, 2011 08:42AM

Try 10 minutes.

Some other things to try, since I can't reproduce:

  • Try reproducing the problem yourself on another machine, possibly the same machine. Write a PHP file that reads in an HTML file, runs HTML Purifier on it, and prints the output. Try running it from command line versus CGI. If you manage to create a self-contained test case, stick it in a zip or GitHub repo and send it over.
  • Can you tell me what happens if you set %Core.LexerImpl to 'DirectLex'?
  • What happens if you try two of three boosters? How about one booster repeated three times?
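
A standalone script along the lines of the first suggestion could look like the sketch below. The library path and the input file name are assumptions and need adjusting to the local install; `time_purify()` is a hypothetical helper that simply times whatever callback it is given.

```php
<?php
// Minimal reproduction sketch. Both paths are assumptions; adjust them.
$libraryPath = 'htmlpurifier/library/HTMLPurifier.auto.php'; // assumed location
$inputPath   = isset($argv[1]) ? $argv[1] : 'input.html';    // hypothetical file name

// Hypothetical helper: run a callback on a string, return output and seconds.
function time_purify($callback, $html) {
    $start = microtime(true);
    $output = call_user_func($callback, $html);
    return array('output' => $output, 'seconds' => microtime(true) - $start);
}

if (is_readable($libraryPath) && is_readable($inputPath)) {
    require_once $libraryPath;
    $config = HTMLPurifier_Config::createDefault();
    // Uncomment to try the alternate lexer from the second suggestion:
    // $config->set('Core.LexerImpl', 'DirectLex');
    $purifier = new HTMLPurifier($config);
    $result = time_purify(array($purifier, 'purify'), file_get_contents($inputPath));
    error_log(sprintf('purified in %.2f s', $result['seconds']));
    echo $result['output'];
} else {
    error_log('Library or input file not found; edit the paths above.');
}
```

Saving one booster, two boosters, and one booster repeated three times as separate input files and running the script on each would cover the third suggestion as well.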

Re: Space Characters Getting Converted by HTML Purifier
August 02, 2011 04:13AM

I am working on suggestion 1 at the moment, and will let you know. As far as suggestion 2 goes, as I mentioned before, I can't set the %Core.LexerImpl in my configurations for HTML Purifier. I am running Drupal version 6.22. Currently, for the configurations for HTML Purifier, all I have under that is a basic check box that is checked like this:

LexerImpl Null/Disabled [ ] or Not supported

As for suggestion 3, I tried running the same boosters in your test site, and I can run booster #1 repeated twice, but it times out if I run it repeated 3 times. For some odd reason, the other boosters time out when I repeat them more than once. So only the html for booster #1 is working at the moment. But I can only run this in your test site. I can run a booster once in my site but the HTML Purifier is causing problems in the display of images, even if I run my filter ahead of it. So, basically got problems at both ends, if I run the Purifier ahead of my filter, or after my filter. Will have to look further into the problem of why the images aren't displaying, even if I run my filter ahead of it.

Will keep you updated.

Re: Space Characters Getting Converted by HTML Purifier
August 02, 2011 10:37AM

Just found out some very good news about why the HTML Purifier was hanging with 3 boosters. It was a tiny error in my code: from booster 1 to booster 2 to booster 3, a portion of the html was accumulating, so where it had 40 products in booster 1, it had 80 in booster 2, and then 120 in booster 3. After fixing it, the file went from 870 KB to about 180 KB. I wouldn't have come across it had you not told me to try the same booster twice. I kept trying different combinations of boosters instead of the same one, until yesterday.
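
For anyone landing here with the same symptom, the sketch below shows one common way this kind of growth happens. It is a hypothetical illustration and not the actual module code: a buffer declared outside the loop is never reset, so each booster repeats the products of the ones before it.

```php
<?php
// Hypothetical illustration of the bug pattern; not the poster's code.
function render_boosters_buggy($boosters) {
    $html = '';
    $products = ''; // declared once, outside the loop
    foreach ($boosters as $name => $items) {
        foreach ($items as $item) {
            $products .= "<div>$item</div>";
        }
        // $products still holds the earlier boosters' items here.
        $html .= "<h2>$name</h2>" . $products;
    }
    return $html;
}

function render_boosters_fixed($boosters) {
    $html = '';
    foreach ($boosters as $name => $items) {
        $products = ''; // reset for every booster
        foreach ($items as $item) {
            $products .= "<div>$item</div>";
        }
        $html .= "<h2>$name</h2>" . $products;
    }
    return $html;
}

$boosters = array(
    'Booster 1' => array_fill(0, 40, 'card'),
    'Booster 2' => array_fill(0, 40, 'card'),
    'Booster 3' => array_fill(0, 40, 'card'),
);
echo substr_count(render_boosters_buggy($boosters), '<div>'), "\n"; // 240
echo substr_count(render_boosters_fixed($boosters), '<div>'), "\n"; // 120
```

The buggy version emits 40 + 80 + 120 product divs, which matches the counts described above.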

So now, with my filter running first, my module is displaying the way I want it to. I wish I could have figured out what the 4 spaces were getting turned into when the Purifier runs first. Nevertheless, I'm really happy I got it to work this way. Thanks for all your help. Really appreciate it.
