
Truncation suggestion (Teaser)

Posted by dragonwize 
August 03, 2007 08:55PM

First, thank you for this great library.

I am using HTMLPurifier for user-submitted content, which I assume is its purpose. Hopefully developers aren't using it so they don't have to learn to write good, secure code. Anyway, many types of user-submitted content come with a teaser or summary version: a blog post, a news story, or just content placed in a sidebar; the list goes on. As you can tell, truncating an HTML string by character count can be terrible for design, not to mention any security concerns.

My suggestion: HTMLPurifier would be in the perfect position to output a safe teaser, as it is already parsing the markup.
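
To illustrate what a markup-aware teaser could look like, here is a minimal sketch in Python (not PHP, and not an HTMLPurifier feature). It counts only text characters toward the limit and closes whatever tags are still open at the cut. The names `truncate_html`, `_Truncator`, and `VOID` are assumptions made up for this example. It does no sanitizing at all, so it should only ever be fed markup that has already been purified.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class _Truncator(HTMLParser):
    """Re-emits markup until `limit` text characters have been written."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.out = []
        self.open_tags = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        # Attributes are copied through (re-escaped), not filtered.
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close anything left open inside this element so nesting stays valid.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        if len(data) > self.remaining:
            data = data[:self.remaining]
        self.remaining -= len(data)
        if self.remaining == 0:
            self.done = True
        self.out.append(escape(data))

    def result(self):
        # Close every element that was still open when the limit was hit.
        closing = "".join(f"</{t}>" for t in reversed(self.open_tags))
        return "".join(self.out) + closing


def truncate_html(html_text, limit):
    """Return html_text cut to `limit` text characters, with tags balanced."""
    parser = _Truncator(limit)
    parser.feed(html_text)
    parser.close()
    return parser.result()
```

For example, `truncate_html("<p>Hello <b>bold world</b> and more</p>", 10)` keeps ten characters of text and still closes both the `<b>` and the `<p>`. The cut here is by character, so a word can be split; backing up to a word boundary would be a natural refinement.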

Re: Truncation suggestion (Teaser)
August 03, 2007 09:25PM

Generally, I don't bother with HTML teasers. That makes creating one very easy: strip_tags(), some sort of "smart" substr, and then htmlspecialchars(). This is one case where strip_tags() has a quite legitimate use!
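
A minimal sketch of that three-step recipe, written in Python rather than PHP so it is self-contained: a small `html.parser` subclass stands in for strip_tags(), backing up to the last space is one possible "smart" substr, and `html.escape` plays the role of htmlspecialchars(). The names `plain_teaser`, `limit`, and `ellipsis` are assumptions for this example only.

```python
from html import escape
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and drops every tag (rough analogue of strip_tags)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_teaser(html_text, limit=200, ellipsis="..."):
    """Return an escaped, tag-free teaser of at most `limit` text characters."""
    parser = _TextOnly()
    parser.feed(html_text)
    parser.close()
    # Collapse the runs of whitespace that removed tags leave behind.
    text = " ".join("".join(parser.parts).split())
    if len(text) > limit:
        cut = text[:limit]
        # "Smart" cut: back up to the last space so no word is split.
        if " " in cut:
            cut = cut[:cut.rfind(" ")]
        text = cut + ellipsis
    # Escape last, so the result is safe to drop into an HTML page.
    return escape(text)
```

The order matters: truncating before escaping means an entity such as `&amp;` can never be cut in half.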

Re: Truncation suggestion (Teaser)
August 03, 2007 09:29PM

That is what I currently use. But there are times when having an HTML teaser would be really nice.

Thanks for the work either way.

DragonWize

Re: Truncation suggestion (Teaser)
August 03, 2007 10:31PM

Sometimes, I agree. However, if a document starts off with tags, you don't want them showing up in the teaser.

Re: Truncation suggestion (Teaser)
November 24, 2007 12:01AM

In the product I work on, I use the following ornery solution to trim HTML down to a guaranteeably fittable (ug) size for storage in a database column:

- Take HTML.
- magic MS Word cleaning stuff. Without this JTidy is doomed to fail.
- raw truncate to 8*n, where n is the byte size of the field in the database.
- JTidy to fix up the raw truncation into real HTML.
- Is the fixed-up string > n? No: return fixed-up string. Yes: Raw truncate to n, and run JTidy again.
- Is the new fixed-up string > n? No: return fixed-up string. Yes: Take 200 bytes off the raw truncation and run through JTidy. Repeat trimming by 200 and JTidying until you win.
- If JTidy borked at any point, return a sorry-we-failed-you string instead of the truncated HTML.

The 8*n initial limit is arbitrary, but usually works to get under n bytes in just 1 JTidy pass.
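
The loop above can be sketched roughly as follows, in Python rather than Java. It reads the checks as "return as soon as the repaired string fits in n bytes". The `repair` argument is a hypothetical stand-in for the JTidy call, the MS Word cleaning step is left out, and `fit_html`, `step`, and `failure` are names invented for this example. Unlike "repeat until you win", the sketch gives up once the cut size reaches zero.

```python
def fit_html(raw, n, repair, step=200, failure="(preview unavailable)"):
    """Return repaired HTML of at most n bytes, or `failure` if that fails.

    `repair` is any callable that turns a truncated fragment into
    well-formed HTML and raises an exception when it cannot.
    """
    data = raw.encode("utf-8")
    # Cut sizes to try in order: 8*n, then n, then n-step, n-2*step, ...
    cuts = [8 * n, n] + list(range(n - step, 0, -step))
    for cut in cuts:
        # errors="ignore" drops a multi-byte character split by the cut.
        chunk = data[:cut].decode("utf-8", errors="ignore")
        try:
            fixed = repair(chunk)
        except Exception:
            # The repair step failed: fall back to the apology string.
            return failure
        if len(fixed.encode("utf-8")) <= n:
            return fixed
    return failure
```

Because each pass only ever sees at most 8*n bytes, the cost stays bounded regardless of how large the original document is.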

As ugly as it is, the whole sequence is usually pretty fast since it only ever deals with a limited number of bytes -- I have no interest in cleaning up the original HTML document; I only want to make sure the truncated preview is pure, since that will be displayed inline with the rest of the web UI for the product.

While fast, it is susceptible to JTidy failure. I'd say about 1 time in 500, I see a sorry-we-failed-you come up. I agree with the OP that HTMLPurifier would probably be a better fit for what I want, if only it existed as a Java solution.

Re: Truncation suggestion (Teaser)
November 24, 2007 11:30AM

Hmm... are you checking for XSS attacks or anything of the like in your teasers?

Re: Truncation suggestion (Teaser)
November 24, 2007 01:25PM

re: XSS filtering

Yes, we have an internal active content filter that strips JavaScript (not perfectly, I believe) before the HTML gets to the JTidy step.
