Two examples of escaping data are the following:
- Before you insert a value into a SQL query, using for example mysqli::real_escape_string() or PDO::quote().
- Before you insert data into your output HTML, using htmlspecialchars().
The question I'd like to ask today is, when to do this? There are two possible moments:
- Right when the data comes in. For SQL this used to be done with 'magic quotes' quite a bit in PHP-land. In general I don't see this happening a lot anymore for SQL. I do however see data encoded using htmlentities/htmlspecialchars before entering the database.
- The other way to go about it, is to only escape when you know how you're going to use it. For example, only call htmlspecialchars right before you echo() your data into your document.
I would personally argue that #2 is the best way to go about things. The first reason is that you don't know exactly how your data might be used in the future. If you pre-encoded everything using htmlentities, but at some point in the future you need the data to be used in an XML feed, you're going to be in trouble. The reason for this, is that the only valid entities in XML are &, <, >, and "e;. If you are going to need to need to output to CSV, very different rules apply. Other examples are: escaping for urls, escaping for command-line arguments, escaping for javascript and escaping for mime-headers.
In the illustrated example, this is no big disaster. A workaround would be to call htmlspecialchars_decode() or html_entity_decode() first, and then escape for your desired output. A worse case is filtering. If you have been stripping out all, or some html tags before saving it do the database, and later on your decide you wanted to show some of them anyway, that data is now lost.
Conclusion
So my argument is to store raw data. Only encode right before you know where you going to need it. If you're worried about the overhead of escaping right before output in an html page, cache the output.
Whichever route you go, make sure this is clearly documented. There's 2 ways this can go wrong:
- Escaping is done on input and output. Now you see literal &'s in your html, or quotes prepended by slashes. (\'hello\').
- Escaping is forgotten at both ends. Now you might be vulnerable to SQL injection attacks, XSS attacks or data corruption.
What do you think? I'm especially interested in the other side of the argument.

First off - mysqli::real_escape_string() and PDO::quote() don't modify what is stored in the database. It actually ensures that what you're attempting to store is stored exactly as entered and not interpreted as malformed sql or an attack vector.
1) Always use PDO::quote or bind variables. Always. Here's some sample code. It's a simple pattern, it's easy to read by future coders, and it keeps you safe:
function insertFruit($name, $color) {
$sql = 'INSERT
INTO FRUIT
(NAME, COLOR)
VALUES (:name, :color)';
$sth = $dbh->prepare($sql);
return $sth->execute(array(':name' => $name, ':color' => $color));
}
2) Use htmlentites on output so that you can change your rendering algorithm at some point in the future (say if you need to put filtering for characters from word docs on top of the htmlentities). I've learned this the hard way. This is almost always the correct answer. It's not a huge issue if all you're doing is running htmlentities, but once you do another filtering step the potential for trouble down the line increases.
I like to use a templating language that escapes by default on output without me having to think about it, but lets me output unescaped data if I explicitly want it. I use HTML_Template_Nest.
The new Filter API is also a great addition to the toolbox for preventing XSS.
Bind variables!
If you use bind variables when entering data in the database, you don't need to use mysqli::real_escape_string() or PDO::quote(). It's much simpler and it just plain works.
Hi Tim,
Completely agree with all your points. I now always use prepared statements whenever possible. I also agree that the tooling should make it easier to do this (your point about using the pear template class).
As for your very first comment: Even though the sql escaping functions don't modify what's in the database, the same principal applies. A string from an external source is being escaped so it can be safely inserted into the a target string. The target in this case is the SQL query itself, not the database.
Even though usage of magic_quotes is rare these days, I've still seen plenty of applications in the wild that just loop over every variable in _GET and _POST and escape them for MySQL as soon as the request starts.
@Gavin,
I personally hate binding variables. It just feels unnatural. I do use prepared statements, but I want to add them to the list manually when I call execute().
Evert
Hi Evert,
I would say that yes, you'd have to store raw data, and apply escaping when outputting the data.
But here is something to think about.
You don't just store "raw" data. Data has a type, and this is often more fine-grained than what the database has to offer us (CHAR, VARCHAR etc).
Let's take an example from your post: escaping for urls.
Suppose you have a column that you use to store URLs. Then, by definition, whatever you store in there should conform to the relevant standard (http://tools.ietf.org/html/rfc3986#page-11 for URIs). So in my opinion, since the "percent escaping" is a necessary ingredient of valid URIs in order to represent certain meta-characters or characters that are otherwise disallowed, there is only one right way: this has to be in place *before* storing it as URI. If not, you'd have a data type mismatch, as the intended type of the column (URI) does not match the data (unescaped illegal characters for URI). If you insist on storing "raw" data there, then apparently the data type is not URI, but something else ("unescaped URI" ? :p), which means you have to apply percent escaping whenever you want to use that data. As long as you don't store both percent-escaped and raw URIs in the same column, it remains consistent.
Of course, it's different if your application is *generating* urls from bits and pieces of data stored in several columns. Most likely, the bits and pieces have a meaning separate from URIs, it's just that the URI is assembled based on those raw pieces. In this case, the pieces from which the URI is constucted should not be stored with applied percent escaping. In this case, the pieces of data weren't part of a URI until the app started generating one, so the percent escaping should be part of the generation process. (In some cases, the pieces should be escaped separately before concatenating them to a URI, in some cases, the resultant URI can be escaped. But either way, in this case, the escaping must be done afterwards before output)
The same principle applies on higher levels: suppose you're generating an XHTML document that is to contain the URI. If it is to appear as href or src attribute value, then you have to make sure that whatever you're going to put in there is already a valid URI - so it should have been percent escaped already. But because the data type of the thing you're generating right now, i.e. an XHTML document has it's own syntax, your URI has to conform to that. For example, the URI can contain literal ampersands to separate parameters, but these will of course need to be escaped with the appropriate XML entities - just like any other "raw" data you're using to assemble the XHTML document.
So, in my opinion, the golden rule is to have a clear understanding of what type of data you intend to store, and make sure the data conforms to the constraints of that data type before storing it. Then, when using the data, you need a clear understanding of the data type(s) of the source data, and the data type of the result, and apply any transformation required to convert the data types into each other.
I hope this all makes sense .
Kind regards,
Roland.
We can take it one step further.
If you can ever forget to escape your data, you have an architectural problem which is not going to be solved by trying to remember to always escape your data. One day, you will forget to escape because your blood is low on 1,3,7-trimethylxanthine. It should *never* be necessary to call an escape function manually. Create a layer that does the job for you.
So you don't call any MySQL API functions directly when you're writing a new controller function to store a blog comment -- you call your own ExecQuery function that accepts a query and a list of arguments. The ExecQuery function then makes sure any arguments are escaped before passing it on to the database. It could do many other kinds of interesting things too, like managing transactions, processing the result set, looking in a cache, and so on.
Same thing applies when generating HTML or XML - put the *unescaped* data in a text node, so that the XML processor will take care automatically of serializing the data in the right character set and apply escaping as needed.
This is what it could like to generate a piece of XHTML in Python:
def gen_link(target, title): return elem('a', attr('href', target), text(title))
Or if you're feeling more like a ninja coder, it could maybe look like this:
def gen_link(target, title): return serialize_xml(['a', ('href', target), title])
The functions attr() and text() both know what to do with the incoming data. (serialize_xml can look at the data type of the items in the list to figure out what to do with tuples, strings and nested lists.) They can serialize incoming Unicode as UTF-8 or Windows-1252 or some weird Asian character set or whatever is needed.
If you're using some kind of templating engine, *that* is where output escaping should be done, not in the code that uses the templating engine. If the templating engine doesn't escape, that's crappy, but you can remedy it by putting a layer in front of it.
If you need to compose URI's, make a function which accepts all the parts of a URI and which knows what to escape in which manner:
gen_uri('http', 'images.google.com', 'search', {'q': 'purple penguin'}, 'something')
gives
http://images.google.com/search?q=purple+penguin#something
Putting HTML-escaped data in the database is solving the problem at the wrong layer of your code, because then you can only use the data as HTML. What if you're ever going to write some script that wants to generate some JSON or CSV or anything else? Your database will be a pain to work with. And are you sure you're never escaping your data twice?
Just my €0.02...
@Tim Symfony automatically escapes output in the templates unless explicitly told not to.
@Roland,
Very good writeup. Thanks for putting it in better words than I did.
@Evert thanks, glad to. I thought I didn't put it quite this way when I commented on another blog before, but now that I just re-read it, it turns out I could've just copied that comment or referred to that blog. See:
http://brian.moonspot.net/best-practices-escape-html
Yeah, i got confused by the 'store raw data' thing as well
: )