Problems with HTML which is generated by Word

Posted by Jochem Blok 
Jochem Blok
Problems with HTML which is generated by Word
June 12, 2008 09:56AM

Hello,

First of all I want to thank you for all your effort!

I have a question. When I use HTML Purifier on an e-mail with HTML generated by Word, the layout does not come out one-to-one. The style element used in the e-mail is stripped. To prevent this I used csstidy to extract the CSS, and then I append the clean CSS to the clean body. Cleaning up the Word HTML creates empty paragraphs. The cause is <o:p></o:p>. The CSS used by Word is o\:* {behavior:url(#default#VML);}, which prevents a blank line, or at least I don't see one. After cleaning, <o:p></o:p> becomes <p></p> and the style no longer matches, so a blank line appears.

Is it possible for HTML Purifier to leave the HTML / CSS as it is and only prevent XSS?

I can probably use tidy to clean those empty p tags, but it feels like a workaround.

Another question: I want to prevent external images from loading. With URI.Munge, all external links are replaced. I want to replace only external URIs that are images; links don't need to be replaced. Is this possible?

Regards,

Jochem

Re: Problems with HTML which is generated by Word
June 12, 2008 12:54PM
I have a question. When I use HTML Purifier on an e-mail with HTML generated by Word, the layout does not come out one-to-one. The style element used in the e-mail is stripped. To prevent this I used csstidy to extract the CSS, and then I append the clean CSS to the clean body. Cleaning up the Word HTML creates empty paragraphs. The cause is <o:p></o:p>. The CSS used by Word is o\:* {behavior:url(#default#VML);}, which prevents a blank line, or at least I don't see one. After cleaning, <o:p></o:p> becomes <p></p> and the style no longer matches, so a blank line appears.

Microsoft Word does some pretty funky stuff with its HTML. If the email had looked exactly the way it was sent, I would have been very surprised.

First off, you should be using %Filter.ExtractStyleBlocks to extract style blocks from emails. This filter performs XSS filtering on the CSS (yes, XSS can be found there too) and will generally clean up the CSS. (It needs CSS Tidy, but it seems you're already using that.)
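To make the idea of pulling style blocks out of a document concrete, here is a rough Python sketch that uses only the standard library. It is not how %Filter.ExtractStyleBlocks is implemented, and it does none of the CSS filtering mentioned above; the name `extract_style_blocks` is invented for this illustration, and a regular expression stands in for a real HTML parser.

```python
import re

# Matches a <style>...</style> element and captures its raw contents.
# Assumption: style blocks are not nested and are properly closed.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>(.*?)</style\s*>", re.IGNORECASE | re.DOTALL)


def extract_style_blocks(html):
    """Return (html_without_style_elements, list_of_css_strings).

    The CSS is returned unfiltered; it still has to be cleaned
    before it is sent back to a browser.
    """
    css_blocks = [css.strip() for css in STYLE_BLOCK.findall(html)]
    remaining_html = STYLE_BLOCK.sub("", html)
    return remaining_html, css_blocks
```

The returned CSS strings would then go through a CSS cleaner before being re-attached to the purified body.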

behavior: is not a standards-compliant CSS property, and is stripped out. So the CSS generated by Word in this case is not even relevant. As for the removal of empty elements, HTML Purifier does not currently support such a facility: in some cases, empty elements are desired, so HTML Purifier cannot blindly remove all of them.
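One possible workaround for the stray empty paragraphs is to unwrap Word's <o:p> markers before the HTML is purified, so that nothing is left to turn into <p></p>. The following is a minimal Python sketch, not part of HTML Purifier; the function name `unwrap_office_paragraphs` is hypothetical, and the regular expression only handles the plain forms of the tag seen in this thread.

```python
import re

# Matches <o:p>, </o:p> and the self-closing <o:p/> form.
# Assumption: the tag carries no attributes, as in the Word output quoted here.
O_P_TAG = re.compile(r"</?o:p\s*/?>", re.IGNORECASE)


def unwrap_office_paragraphs(html):
    """Remove the <o:p> tags themselves but keep whatever they contain.

    An empty <o:p></o:p> disappears entirely, while <o:p>&nbsp;</o:p>
    keeps its &nbsp;, so Word's intentional blank lines survive.
    """
    return O_P_TAG.sub("", html)


if __name__ == "__main__":
    print(unwrap_office_paragraphs("<p class=MsoNormal><b>Name<o:p></o:p></b></p>"))
```

Unwrapping, instead of deleting whole paragraphs, leaves the paragraphs that Word deliberately filled with &nbsp; in place.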

Cleaning up MSWord HTML is a very difficult thing to do, and requires more research. Would you mind posting the source HTML?

Re: Problems with HTML which is generated by Word
June 12, 2008 01:26PM
I want to prevent external images from loading. With URI.Munge, all external links are replaced. I want to replace only external URIs that are images; links don't need to be replaced. Is this possible?

%URI.DisableExternalResources

The functionality of replacing image tags with links to the image they represent is not implemented yet, but someone else has requested it too.
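The image-only rewriting asked about above could also be done as a separate pass outside HTML Purifier. Below is a minimal Python sketch under several assumptions: `munge_external_images`, the host `mydomain.nl`, and the template `http://mydomain.nl/image?q=%s` are all made up for illustration, and only double-quoted src attributes are handled.

```python
import re
from urllib.parse import quote, urlsplit

# Captures the part before the src value, the value itself, and the closing quote.
# Assumption: src is written with double quotes.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc\s*=\s*")([^"]*)(")', re.IGNORECASE)


def munge_external_images(html, own_host, proxy_template):
    """Rewrite the src of external images only; <a href> is never touched.

    proxy_template is a hypothetical URL pattern in which %s is replaced
    by the percent-encoded original image URI.
    """
    def replace(match):
        src = match.group(2)
        host = urlsplit(src).hostname
        # Relative URIs, cid: references and our own host are left alone.
        if host is None or host == own_host:
            return match.group(0)
        munged = proxy_template.replace("%s", quote(src, safe=""))
        return match.group(1) + munged + match.group(3)

    return IMG_SRC.sub(replace, html)
```

Because the pattern only matches img tags, ordinary links keep their original href, which is the behaviour requested in the question.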

Jochem Blok
Re: Problems with HTML which is generated by Word
June 13, 2008 03:42AM

The extraction of the CSS is done the way you proposed: using the filter ExtractStyleBlocks.

An example of a Word e-mail:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 12 (filtered medium)">
<!--[if !mso]>
<style>
v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style>
<![endif]-->
<style>
<!--
 /* Font Definitions */
 @font-face
	{font-family:"Cambria Math";
	panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
	{font-family:Calibri;
	panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
	{font-family:Tahoma;
	panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
	{font-family:Verdana;
	panose-1:2 11 6 4 3 5 4 4 2 4;}
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0cm;
	margin-bottom:.0001pt;
	font-size:10.0pt;
	font-family:"Verdana","sans-serif";}
a:link, span.MsoHyperlink
	{mso-style-priority:99;
	color:blue;
	text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
	{mso-style-priority:99;
	color:purple;
	text-decoration:underline;}
p.MsoAcetate, li.MsoAcetate, div.MsoAcetate
	{mso-style-priority:99;
	mso-style-link:"Balloon Text Char";
	margin:0cm;
	margin-bottom:.0001pt;
	font-size:8.0pt;
	font-family:"Tahoma","sans-serif";}
span.EmailStyle17
	{mso-style-type:personal-compose;
	font-family:"Verdana","sans-serif";
	color:windowtext;}
span.BalloonTextChar
	{mso-style-name:"Balloon Text Char";
	mso-style-priority:99;
	mso-style-link:"Balloon Text";
	font-family:"Tahoma","sans-serif";}
.MsoChpDefault
	{mso-style-type:export-only;}
@page Section1
	{size:612.0pt 792.0pt;
	margin:70.85pt 70.85pt 70.85pt 70.85pt;}
div.Section1
	{page:Section1;}
-->
</style>
<!--[if gte mso 9]><xml>
 <o:shapedefaults v:ext="edit" spidmax="2050" />
</xml><![endif]--><!--[if gte mso 9]><xml>
 <o:shapelayout v:ext="edit">
  <o:idmap v:ext="edit" data="1" />
 </o:shapelayout></xml><![endif]-->
</head>

<body lang=NL link=blue vlink=purple>

<div class=Section1>

<p class=MsoNormal><img width=1277 height=994 id="Picture_x0020_1"
src="cid:image001.png@01C8CBDF.5D1BAEE0"><o:p></o:p></p>

<p class=MsoNormal><o:p>&nbsp;</o:p></p>

<p class=MsoNormal><b>Name<o:p></o:p></b></p>

<p class=MsoNormal>E-mail : <a href="mailto:mail@example.com"><span
style='color:windowtext'>mail@example.com</span></a><o:p></o:p></p>

<p class=MsoNormal><o:p>&nbsp;</o:p></p>

<p class=MsoNormal><b>Company<o:p></o:p></b></p>

<p class=MsoNormal>Address 1<o:p></o:p></p>

<p class=MsoNormal>Address 2<o:p></o:p></p>

<p class=MsoNormal><o:p>&nbsp;</o:p></p>

<p class=MsoNormal>Telefoon&nbsp; : +xx xx xxx xxx xx <span style='color:black'><o:p></o:p></span></p>

<p class=MsoNormal><span lang=EN-US style='color:black'>Fax&nbsp; : +xx xx xxx xx xx<o:p></o:p></span></p>

<p class=MsoNormal><span lang=EN-US style='color:black'>Internet : </span><span
style='color:black'><a href="http://www.example.com/"><span lang=EN-US
style='color:black'>http://www.example.com</span></a></span><span
lang=EN-US style='color:black'><o:p></o:p></span></p>

<p class=MsoNormal><span lang=EN-US style='color:black'>Kamer van koophandel
xxxxxxxxx<o:p></o:p></span></p>

<p class=MsoNormal><span lang=EN-US style='color:black'><o:p>&nbsp;</o:p></span></p>

<p class=MsoNormal><span lang=EN-US style='font-size:7.5pt;color:black'>Op deze
e-mail is een disclaimer van toepassing, ga naar </span><span lang=EN-US
style='font-size:7.5pt'><a
href="http://www.example.com/disclaimer"><span
style='color:black'>www.example.com/disclaimer</span></a><br>
<span style='color:black'>A disclaimer is applicable to this email, please
refer to </span><a href="http://www.example.com/disclaimer"><span
style='color:black'>www.example.com/disclaimer</span></a><o:p></o:p></span></p>

<p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p>

</div>

</body>

</html>

Some example snippets with more information about the external resources:

Input 1:

<a onMouseOver="window.status='Klik hier om naar de Alternate site te gaan.'; return true" onMouseOut="window.status=''; return true" href="http://www.example.nl/" target="_blank" style="color:#FFFFFF;text-decoration:none">www.example.nl</a>

Preferred output 1 (stripped onmouseover and onmouseout):

<a href="http://www.example.nl/" target="_blank" style="color:#FFFFFF;text-decoration:none">www.example.nl</a>

Input 2:

<a href="http://www.example.nl" target="_blank">
<img src="http://www.example.nl/pix/newsletter/b2c/toplogo.jpg" alt="Klik hier om naar de example site te gaan." width="670" height="80" border="0" usemap="#MapMap"></a>

Preferred output 2 (Only the src of the img is changed, with %s replaced by the original src (just like ExternalResources)):

<a href="http://www.example.nl" target="_blank">
<img src="http://mydomain.nl=q=http%3A%2F%2Fwww.example.nl%2Fpix%2Fnewsletter%2Fb2c%2Ftoplogo.jpg" alt="Klik hier om naar de example site te gaan." width="670" height="80" border="0" usemap="#MapMap"></a>

Input 3:

<area shape="rect" coords="551,4,647,101" 
href="http://www.example.nl/html/includeStaticSmall.html?file=example/about/aboutus/AboutUsShopInc&tid=45275&treeName=example&Level1=Over+example&Level2=Showroom+en+afhaalbalie&" target="_blank" alt="Route">

Preferred output 3 (nothing changed):

<area shape="rect" coords="551,4,647,101" 
href="http://www.example.nl/html/includeStaticSmall.html?file=example/about/aboutus/AboutUsShopInc&tid=45275&treeName=example&Level1=Over+example&Level2=Showroom+en+afhaalbalie&" target="_blank" alt="Route">
Re: Problems with HTML which is generated by Word
June 13, 2008 03:31PM

The Word document is quite helpful and will be taken into consideration when we start tackling that problem.

For Case 1, you can enable target="_blank" using %Attr.AllowedFrameTargets. As for text-decoration:none, that's a bug and will be fixed promptly.

For Case 2, open redirects on your website constitute a security risk, so it would have to be an extension to %URI.SecureMunge. Something like this shouldn't be too hard to add, although I don't know why you'd want to do it.

For Case 3, area is not implemented yet. It is not clear whether or not this element poses a security risk; if it does not, we'll implement it some time.
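To illustrate the open-redirect concern raised for Case 2: if the munged URI carries a keyed signature, the receiving script can refuse any target that was not produced by the site itself. This is a minimal Python sketch of that general idea, not the %URI.SecureMunge implementation; the function names, the `%t` placeholder for the token, the template, and the choice of HMAC-SHA256 are all assumptions made for the example.

```python
import hashlib
import hmac
from urllib.parse import quote


def sign(uri, secret_key):
    """Return a hex HMAC-SHA256 token that ties the URI to the secret key."""
    return hmac.new(secret_key, uri.encode("utf-8"), hashlib.sha256).hexdigest()


def build_munged_uri(uri, secret_key, template):
    """Fill a hypothetical template: %t gets the token, %s the encoded URI."""
    # %t is substituted first; the hex token contains no % characters,
    # so it cannot interfere with the later %s substitution.
    with_token = template.replace("%t", sign(uri, secret_key))
    return with_token.replace("%s", quote(uri, safe=""))


def is_valid(uri, token, secret_key):
    """Check the token on the receiving side using a constant-time comparison."""
    return hmac.compare_digest(sign(uri, secret_key), token)
```

The script behind the munged URI would call `is_valid` with the decoded `q` value and the token, and only fetch or redirect when it returns True.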

Re: Problems with HTML which is generated by Word
January 21, 2018 08:58AM

I have some problems too.

<div class="MsoNormal">
<span lang="EN">Kunyit menghambat produksi melanin, sehingga bisa digunakan untuk mengobati perubahan warna kulit bahkan mampu menjaga warna kulit.<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN"><br />
</span> <span lang="EN">Cara penggunaannya, campur bubuk kunyit kecil dengan 1 sendok teh krim susu. Gosok dengan lembut pada paha bagian dalam Anda. Biarkan hingga kering sebelum membilasnya dengan air hangat. Gunakan obat ini sekali sehari.<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN"><br />
</span> <span lang="EN">Atau, campurkan 1 sampai 2 sendok teh bubuk kunyit dengan jus jeruk untuk membuat pasta. Oleskan pada paha bagian dalam gelap dan biarkan selama 15 sampai 20 menit. Gunakan air hangat untuk membersihkan pasta. Lakukan ini sekali sehari.</span><br />
<span lang="EN"><br />
</span></div>
<div class="MsoNormal">
<h3>
<span lang="EN">8. Baking Soda</span></h3>
</div>
<div class="MsoNormal">
<span lang="EN">Baking soda juga dapat mengatasi kulit gelap paha bagian dalam Anda. Sangat baik digunakan sebagai exfoliator untuk menghilangkan sel-sel kulit rusak yang telah gelap akibat sinar matahari.&nbsp;</span><br />
<span lang="EN">Cara Pengaplikasiannya:</span><br />

Is this okay for my blog? Thanks. yutips

Ilham Fahm
Re: Problems with HTML which is generated by Word
March 08, 2018 10:02AM

Of course it's okay. I recommend looking for useful information on the Stack Overflow forum. Thanks. Trulum Awet Muda
