javascript's escape and encodeURI vs. PHP $_POST
I just stumbled upon an odd encoding issue with a web application.
Basically, data is coming into our PHP application through a Javascript's XMLHttpRequest (ajax). The data is sent as a standard form encoding (application/x-www-form-urlencoded), and picked up by PHP using the $_POST array. Any strings in form POST request are 'urlencoded', also known as Percent-encoding. As an example, this will turn a space into the often-seen %20.
Normally everything in the $_POST and $_GET arrays is already decoded, so when you're dealing with these arrays you don't really have to think about this. This time however, I was dealing with some non-latin unicode characters and for some reason they were never decoded and ended up in de database as raw url-encoded strings.
Doing a bit of research led me to the following: normally any special character is encoded as %XX, X and X being 2 hexadecimal values. These values simply represent bytes. The values I got were different altogether and took the form %uXXXX. I just assumed this was part of standard uri-encoding for unicode characters, so I was still a bit shook-up to see that PHP didn't just pick them up.
After a bit of research, I found out that the unicode representation was rejected by W3c, which is probably also why the PHP authors decided to not implement this. Javascript actually has 2 different methods to do percent-encoding, namely:
escape("☢"); // returns %u2622
encodeURI("☢"); // returns %E2%98%A2
Guess which one we were using?
Even though the %u syntax is arguably better to represent unicode characters, W3c seems to have voted against the syntax for backwards compatibility reasons. Before this happened the escape method was already adopted in javascript which in turn caused me to stumble upon this problem and write an article about it.
The more you know..
Game of life with checkboxes
I needed to kill a little bit of time, so I decided to write Conway's Game of Life in HTML.
Preventing XSS in Javascript strings
Escaping user-input in your HTML is essential for preventing worlds #1 vulnerability.
When you're embedding user input into javascript, a simple htmlspecialchars won't cut it, you'll need to make sure you're escaping other things, like \n (line endings), and \ (slashes). Google doctype has a good list of characters in need of proper escaping to prevent users breaking your javascript.
However, when I dropped the question if a simple string replacement would be good enough, the members of the Web security mailing list gave me a different answer.
When escaping or filtering output using a blacklist (such as the one published on google doctype) browser/unicode escaping bugs are not taking into consideration. Some new vulnerability might appear in the future, which would immediately open a hole in your app. For this reason its wiser to go with a much more defensive white-list approach, essentially only letting things through you know is safe.
Introducing Reform
Reform is a tool that does exactly this. Reform allows you to escape your data for a javascript, xml, html or vbscript (yes it still exists) context. It provides libraries for Java, .NET, PHP, Perl, Python, Javascript and ASP. Pretty cool!
One dislike I have is that it only considers I really small set of unicode codepoints safe, especially when dealing with non-latin languages this is going to add a great deal to the bandwidth usage and the legibility of your sourcecode. One would think there has to be more ranges considered 'safe'.
PHP example:
<?php
// Assuming the Reform class is included..
echo '<script type="text/javascript"> var myString = ', Reform::JsString($userInput), '; </script>';
?>
I made a couple of changes in the PHP version, specifically:
- Prepended the 'static' keyword to every method to make it work in PHP5's strict mode.
- Removed the UTF-8 checks, I'm in a controlled environment, mbstring is installed, and the internal encoding is utf-8.
- Added a parameter to Reform::JsString to not automatically put the string between quotes (').







