I had to build an RSS aggregator for my job, so I needed to find (or write) a good tool to sanitize the incoming HTML. I stumbled upon HTML Purifier, and I haven't seen a better tool for the job yet.
Some of the features:
- It can turn the HTML into valid XHTML (Transitional or Strict)
- So it also balances out unclosed tags.
- Removes any code that could pose a security risk (tested against RSnake's XSS cheat sheet).
- Allows you to truncate HTML (if you don't want to show an entire post) and still end up with proper HTML!
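
To give an idea of what "truncate and still get proper HTML" involves, here is a rough sketch of the idea in Python. This is not HTML Purifier's code or API (HTML Purifier is a PHP library); the names `TruncatingParser` and `truncate_html` are made up for this illustration, and it only uses Python's standard `html.parser` module. It also does no sanitizing at all: it just cuts the text after a given number of characters and closes whatever tags are still open.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TruncatingParser(HTMLParser):
    """Copies markup to self.out until `limit` text characters are emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.out = []
        self.open_tags = []        # stack of currently open elements
        self.done = limit <= 0     # nothing to copy for a non-positive limit

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        # Valueless attributes (value is None) are written as name="".
        rendered = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{rendered} />")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return  # ignore stray closing tags
        # Close everything opened after `tag`, then `tag` itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        piece = data[: self.remaining]
        self.out.append(escape(piece, quote=False))
        self.remaining -= len(piece)
        if self.remaining == 0:
            self.done = True


def truncate_html(html, limit):
    """Keep at most `limit` characters of text and close any open tags."""
    parser = TruncatingParser(limit)
    parser.feed(html)
    parser.close()
    # Close whatever is still open, innermost element first.
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing
```

For example, `truncate_html('<p>Hello <b>brave new</b> world</p>', 9)` cuts inside the `<b>` element and returns `<p>Hello <b>bra</b></p>`. Because open tags are tracked on a stack, the same function also balances input that was never closed in the first place.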
So yeah, if you need something similar, I'd suggest you check it out.
