March 21, 2008 08:56PM

Hi, excellent work with HTML Purifier ;)

I would like to know if there's a way to "autocomplete" the purified HTML. That is, if the HTML provided has no doctype declaration or structural tags (html, head, body, etc.), can HTML Purifier add them?

Basically, I'm parsing emails, and since not all of them have fully standardized HTML parts and I need them to, I thought that maybe HTML Purifier could do that automagically for me.

Thanks in advance and keep up the good work ;)

PS: I thought of using this, but it didn't do the trick...

$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');
$config->set('HTML', 'TidyLevel', 'heavy');

March 21, 2008 09:49PM

I'm not sure what you mean by "autocomplete". Do you mean auto-detect a doctype, or something else?

March 22, 2008 04:03PM

It says it thinks my post is SPAM T_T

Anyways, let's say you've got something like this...


OK, if that's all the HTML you've got for a document, then it's not standardized at all. First off, it lacks the doctype declaration, html, head, body, and so on.

The thing is, there's really no way to know in advance what I'm going to get. Since I'm parsing email, it might even be plain text only, and I would still need it to be "autocompleted" to standardized HTML.

Even worse, some email clients send full HTML with headers and such, other clients send only the body HTML, and most of them tend to do whatever they feel like doing...

So I need to take whatever I get as input and output it as fully standardized HTML with headers and such.

Thanks for your reply ;)

PS: I started looking around and found out about this Tidy thing... I guess that's what I need...

March 23, 2008 10:02AM
Sorry, you can blame Akismet for that. :-(

Anyways, let's say you've got something like this...

Ah, ok. Here's what I would recommend:

Usually, when you're building a web client for email, you won't be generating standalone HTML files for the emails. Instead, the email will be inserted inside your application's page, with your own headers and footers and logos and whatnot. For this reason, Tidy is not a good choice for your task.

By default (with %Core.ConvertDocumentToFragment), if HTML Purifier detects that you passed it a full HTML document, it will extract the contents of the body tag and purify those, so you're covered in that respect.

HTML Purifier will then output HTML that can validly be placed in a div. What you can do now is insert this HTML inside your web-client page (which has its own doctype, and that is the doctype you should configure HTML Purifier with; we don't care about the source HTML's type, we just need to be lenient), or you can insert it in some dummy scaffold HTML if you need a standalone HTML file.
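The "dummy scaffold" option could look roughly like the sketch below. It is only an illustration: wrap_in_scaffold() is a hypothetical helper, not part of HTML Purifier, and the scaffold assumes the XHTML 1.0 Transitional doctype and UTF-8 mentioned earlier in this thread. A literal string stands in for the purified fragment so the sketch runs without the library.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): wraps an already-purified
// HTML fragment in a fixed XHTML 1.0 Transitional scaffold.
function wrap_in_scaffold(string $fragment, string $title = 'Email'): string
{
    // Escape the title, since it may come from an untrusted email subject.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

    // The fragment is inserted as-is; it is assumed to be purified already.
    return <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>{$safeTitle}</title>
</head>
<body>
{$fragment}
</body>
</html>
HTML;
}

// In real use, $fragment would be the return value of $purifier->purify($dirtyHtml).
$fragment = '<p>Hello <b>world</b></p>';
echo wrap_in_scaffold($fragment, 'Re: lunch & such');
```

Keeping the scaffold's doctype the same as the one HTML Purifier is configured with means the purified fragment and the surrounding document agree on which elements are allowed.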

I hope that helped.

March 24, 2008 03:04PM

Hi there! I was able to fix this with that... Now I'm hardcoding the beginning and end of the document, and in the middle I put HTMLP's output...

works great ;) thanks for your time ^^

