Allowing html and head tags
December 02, 2011 05:47AM

Hi everyone. I mainly need to filter out «script» tags, while still accepting «html», «head» and «body».

I tried to customize the configuration, but those tags are still being filtered out.

<pre><![CDATA[
$ParamsFiltres = HTMLPurifier_Config::createDefault();

$tags_ok = 'html,head,title,link,body,style,font,'.
    'span,'.
    'br,h1,h2,h3,h4,h5,h6,div,p,blockquote,address,hr,ul,ol,li,'.
    'table,caption,col,colgroup,thead,tbody,tfoot,tr,th,td,fieldset,legend,code,pre,tt,dl,dt,dd,'.
    //'article,aside,'.
    'strong,em,u,del,img,cite,abbr,acronym,big,small';
$tags_no = 'meta,script,frameset,frames,noframe,sdfield,a,'.
    'object,embed,param,iframe,form,input,select,optgroup,option,textarea,button,';

$ParamsFiltres->set('Core.Encoding', 'utf-8');
$ParamsFiltres->set('Core.ConvertDocumentToFragment', false); // I need the entire HTML document
$ParamsFiltres->set('Core.HiddenElements', array(
    'script' => true,
));
$ParamsFiltres->set('HTML.Trusted', true);
$ParamsFiltres->set('HTML.Allowed', 'html,head,body');
$ParamsFiltres->set('HTML.Parent', 'html');
$ParamsFiltres->set('HTML.AllowedElements', $tags_ok);
$ParamsFiltres->set('HTML.ForbiddenElements', $tags_no);
$ParamsFiltres->set('Filter.ExtractStyleBlocks.Escaping', true);
$ParamsFiltres->set('HTML.DefinitionID', 'backfromfrontrenderer.html renderer');
$ParamsFiltres->set('HTML.DefinitionRev', 1);

$def = $ParamsFiltres->getHTMLDefinition(true);
{
    $def->addElement('html', 'Block', 'Flow', 'Common');
    $def->addElement('head', 'Block', 'Flow', 'Common');
    $def->addElement('title', 'Inline', 'Empty', 'Common');
    $def->addElement('style', 'Block', 'Flow', 'Common');
    $def->addElement('link', 'Block', 'Empty', 'Common');
    $def->addElement('body', 'Block', 'Flow', 'Common');
}

$purifier = new HTMLPurifier($ParamsFiltres);
]]></pre>

Re: Allowing html and head tags
December 02, 2011 11:42AM

Try setting %Core.LexerImpl to DirectLex.
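In code, that suggestion might look like the sketch below. It assumes HTML Purifier is installed and reachable through its autoloader; the include path and the variable names are placeholders, not taken from the original post.

```php
<?php
// Assumption: HTML Purifier is on the include path; adjust the path as needed.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// Select the DirectLex lexer explicitly instead of letting
// HTML Purifier pick a lexer implementation automatically.
$config->set('Core.LexerImpl', 'DirectLex');

$purifier = new HTMLPurifier($config);
```

The directive has to be set before the purifier is created, alongside the other `set()` calls.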

Re: Allowing html and head tags
December 16, 2011 03:14AM

Sorry for the very late response. It works, but the doctype comes out escaped: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml1-transitional.dtd">

Re: Allowing html and head tags
February 17, 2012 05:35AM

Two things:

Sorry for the very late response. It works, but the doctype comes out escaped: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml1-transitional.dtd">

You need to deal with that manually. As you've noticed, HTML Purifier isn't set up to deal with entire HTML documents; it expects HTML fragments (HTML in the body), and getting full header support is currently effectively impossible. If you want to preserve the DOCTYPE declaration, you'll need to grab the DOCTYPE with a regex, then add it back in once HTML Purifier is done.

Be very careful with that, since if your regex ends up too greedy you may end up allowing XSS again. I'd recommend analysing what you've grabbed and trying to construct a safe DOCTYPE out of the information found, never actually reusing the input data - for example by mapping strpos() !== false occurrences to fixed strings, e.g.

// [...]
$cleanDoctype = '';
// $dirtyDoctype is what the regex grabbed
if (strpos(strtolower($dirtyDoctype), 'xhtml 1.0') !== false) {
    $cleanDoctype = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"'
         . ' "http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml1-transitional.dtd">';
}
// [...]
echo $cleanDoctype . $cleanHtml;
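The "grab the DOCTYPE with a regex" step elided above could be sketched roughly as follows. The function name splitDoctype() is hypothetical; the pattern is anchored at the start of the input and stops at the first '>', so it should not be able to swallow any markup beyond the declaration itself.

```php
<?php
/**
 * Hypothetical helper: split a leading DOCTYPE declaration off a document.
 * Returns array($dirtyDoctype, $remainingHtml); $dirtyDoctype is '' when
 * the input does not start with a DOCTYPE declaration.
 */
function splitDoctype($html)
{
    // ^\s*      only match at the very start (leading whitespace allowed)
    // [^>]*>    stop at the first '>' so the match cannot run on greedily
    if (preg_match('/^\s*(<!DOCTYPE[^>]*>)/i', $html, $m)) {
        // $m[0] is the whole match, so its length is the offset of the rest.
        return array($m[1], (string) substr($html, strlen($m[0])));
    }
    return array('', $html);
}
```

The first element is only ever inspected (as in the strpos() check above), never echoed; the second element is what gets passed to HTML Purifier.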

However, consider what you're doing: you're configuring HTML Purifier to use its default "Doctype", which is XHTML 1.0 Transitional unless %HTML.Doctype says otherwise. This defines what document structure HTML Purifier will allow. If that "Doctype" is for one language (say, XHTML) and the DOCTYPE you announce the document to be after purification (the DOCTYPE supplied by the user that you extracted and preserved with the regex) is for another (say, HTML 4.01), you can cause browser errors. You might even be opening yourself to an obscure XSS vector that way.
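One way to keep the two in step is to look up the announced declaration and the value for %HTML.Doctype from the same table. This is only a sketch: doctypePair() and the table contents are illustrative, and only two doctypes are covered.

```php
<?php
/**
 * Hypothetical helper: map whatever the regex grabbed to a fixed DOCTYPE
 * declaration plus the matching value for the %HTML.Doctype directive.
 * Returns null when nothing in the table matches.
 */
function doctypePair($dirtyDoctype)
{
    $table = array(
        'xhtml 1.0 transitional' => array(
            'directive'   => 'XHTML 1.0 Transitional',
            'declaration' => '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"'
                . ' "http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml1-transitional.dtd">',
        ),
        'html 4.01 transitional' => array(
            'directive'   => 'HTML 4.01 Transitional',
            'declaration' => '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"'
                . ' "http://www.w3.org/TR/html4/loose.dtd">',
        ),
    );

    // The input is only searched, never copied into the output.
    $haystack = strtolower($dirtyDoctype);
    foreach ($table as $marker => $pair) {
        if (strpos($haystack, $marker) !== false) {
            return $pair;
        }
    }
    return null;
}

// Usage sketch, assuming $config is the HTMLPurifier_Config being built:
// $pair = doctypePair($dirtyDoctype);
// if ($pair !== null) {
//     $config->set('HTML.Doctype', $pair['directive']);
// }
// ... purify, then echo $pair['declaration'] in front of the clean HTML.
```

Returning null for unknown input leaves it to the caller to decide on a fallback, for instance announcing no DOCTYPE at all.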

(Edit: Fixed formatting after an HTML escaping bug ravaged the forum.)

Edited 1 time(s). Last edit at 07/30/2012 12:54PM by pinkgothic.
