Escaping MySQL strings with no connection available

We're all being drilled over and over again to always use mysqli::escape_string, PDO::quote, or preferably prepared statements when escaping user-supplied strings for use in MySQL queries.

The downside to these methods is that they only work when there's an open connection to a server. So what if there's no connection available? In traditional Unix philosophy I'm writing an export script that doesn't execute SQL statements right to a server, but sends them to stdout. Forcing people to make a connection seems like a major inconvenience.

So what's left? Manual escaping I suppose.. The manual page for mysqli::escape_string mentions: Characters encoded are NUL (ASCII 0), \n, \r, \, ', ", and Control-Z..

  1. function dontHateMe($unescaped) {
  2. $replacements = array(
  3. "\x00"=>'\x00',
  4. "\n"=>'\n',
  5. "\r"=>'\r',
  6. "\\"=>'\\\\',
  7. "'"=>"\'",
  8. '"'=>'\"',
  9. "\x1a"=>'\x1a'
  10. );
  11. return strtr($unescaped,$replacements);
  12. }

There's a risk though.. Certain multi-byte character sets (such as BIG5 and GBK) may still allow for a security hole. You *should* be fine with UTF-8, so make sure you start your file with:

  1. SET CHARACTER SET utf8;

Still no guarantee from my side though. Tread carefully and avoid this if you can. If you have a better idea, or you feel like shouting at me for this.. let's hear it in the comments.

Special thanks to Spudley for providing me with a reasonable answer to this question.

When to escape your data

Two examples of escaping data are the following:

The question I'd like to ask today is, when to do this? There are two possible moments:

  1. Right when the data comes in. For SQL this used to be done with 'magic quotes' quite a bit in PHP-land. In general I don't see this happening a lot anymore for SQL. I do however see data encoded using htmlentities/htmlspecialchars before entering the database.
  2. The other way to go about it, is to only escape when you know how you're going to use it. For example, only call htmlspecialchars right before you echo() your data into your document.

I would personally argue that #2 is the best way to go about things. The first reason is that you don't know exactly how your data might be used in the future. If you pre-encoded everything using htmlentities, but at some point in the future you need the data to be used in an XML feed, you're going to be in trouble. The reason for this, is that the only valid entities in XML are &, <, >, and &quote;. If you are going to need to need to output to CSV, very different rules apply. Other examples are: escaping for urls, escaping for command-line arguments, escaping for javascript and escaping for mime-headers.

In the illustrated example, this is no big disaster. A workaround would be to call htmlspecialchars_decode() or html_entity_decode() first, and then escape for your desired output. A worse case is filtering. If you have been stripping out all, or some html tags before saving it do the database, and later on your decide you wanted to show some of them anyway, that data is now lost.

Conclusion

So my argument is to store raw data. Only encode right before you know where you going to need it. If you're worried about the overhead of escaping right before output in an html page, cache the output.

Whichever route you go, make sure this is clearly documented. There's 2 ways this can go wrong:

  1. Escaping is done on input and output. Now you see literal &'s in your html, or quotes prepended by slashes. (\'hello\').
  2. Escaping is forgotten at both ends. Now you might be vulnerable to SQL injection attacks, XSS attacks or data corruption.

What do you think? I'm especially interested in the other side of the argument.

Comments are back!

Thanks to a suggestion from Mohammad Lahlouh, all deleted comments are restored. Thanks!

I was able to do this with the mysql binlog, which is a binary log containing all statements executed on the server. If you face a similar problem, here are the steps to restore the DB:

  1. See if you're actually storing binlogs. For me this happened to be the case, although I never actually payed attention (My friend Jens Meijer is so kind to provide hosting for me). In my case they were stored in /var/log/mysql.
  2. Grab your latest backup. In my case it was from October 2008.
  3. Make sure you also create a new backup, just in case this process goes wrong.
  4. Restore your old back.
  5. Find the binlog that corresponds to the date of your backup. I did a simple copy of the ~300 binlogs to a different directory.
  6. Run the binlog through 'mysqlbinlog' and pipe into a new file. I had to append the --database argument, because I wasn't interested in all the other databases. The full command for me was mysqlbinlog mysql-bin.* --database evert2 > commentsbinlog.sql
  7. .
  8. Open the new file, and make sure the first queries are correct. Likely there are some updates/inserts/deletes on top of the file that were already part of your last backup.
  9. Go all the way to the bottom of the file, and remove the offending queries. In my case I had to remove the query that deleted all my comments.
  10. Apply the sql file to your database: mysql -u root -p evert2 < < commentsbinlog.sql

It's wise to assure nobody can access the DB while you are doing this. I hope this helps anyone else with this problem.

Deleted all comments by accident

I just accidentally deleted all comments made on this blog.

This blog is homegrown and hand-made. It has a very simple spamfilter that catches most spam comments. The few that make it through though, I've been deleting myself using the mysql console.

I tend to write simple queries, ending in: .. AND id = 1234, but today I forgot the id =. I had a backup from October 2008, so some old comments are back now, but I'm still pretty sad.

Lesson from today: make backups! I'm pretty good with things I do for clients, but I tend to not uphold the same standards for my own projects.

Oracle to buy Sun

Just in case you hadn't heard it, Oracle is buying Sun for $7.4 billion. I wonder what this means for the (already scattered) MySQL community. I hope they treat the community well.

Indexing geo-data 3: In practice

Since my last post I've found out that using the 'morton' number to index spatial number is also referred to as the Z-order.

To index using this order, you can use this stored function:

  1. CREATE FUNCTION getGeoMorton(lat DOUBLE, lng DOUBLE) RETURNS BIGINT UNSIGNED DETERMINISTIC
  2. BEGIN
  3.  
  4. -- 11930464 is round(maximum value of a 32bit integer / 360 degrees)
  5.  
  6. DECLARE bit, morton, pos BIGINT UNSIGNED DEFAULT 0;
  7.  
  8. SET @lat = CAST((lat + 90) * 11930464 AS UNSIGNED);
  9. SET @lng = CAST((lng + 180) * 11930464 AS UNSIGNED);
  10. SET bit = 1;
  11.  
  12. WHILE bit <= @lat || bit <= @lng DO
  13.  
  14. IF(bit & @lat) THEN SET morton = morton | ( 1 << (2 * pos + 1)); END IF;
  15. IF(bit & @lng) THEN SET morton = morton | ( 1 << (2 * pos)); END IF;
  16.  
  17. SET pos = pos + 1;
  18.  
  19. SET bit = 1 << pos;
  20.  
  21. END WHILE;
  22.  
  23. RETURN morton;
  24. END;

Some caveats

  • Since the function is using floating-point numbers, there will be rounding errors. These are generally very small, and for us well within the acceptable margin of error.
  • More significantly, this function assumes euclidean geometry (i.e.: a flat surface). The earth obviously isn't, so as you get closer to the poles you might get back results from outside your rectangle, or miss results from within.
  • It's not recommended to use this function directly in your WHERE clause. Even though the function is marked deterministic (i.e.: it will always yield the same results for the same arguments), the MySQL query optimizer currently ignored that modifier. So, set it in a temporary variable first (SET @number =) and use the variable in your where clause.
  • Don't forget dealing with queries spanning over the -180°, 180° longitude line.
  • Also store your actual longitude and latitude values in the DB. This function is not intended to be exact.
  • When you do your queries, select both on the morton number, and longitude and latitude.

A better way to do it?

In my research I've found the Hilbert Curve to be an even better algorithm, but haven't yet gone through the effort of trying to express it in SQL.

Indexing geo-data 2 : simple benchmark

After my last post, I decided to do some benchmarking. For this benchmark I used the US data from Geonames.org. I inserted all the data (1,886,420 records) and searched for a big area around new york (between 41.3665028663272, -72.41912841796875 and 40.113789191575236, -75.83038330078125). We're expecting to get 38259 records back for this query.

Test 1: Selecting on longitude, latitude

  1. SELECT SQL_NO_CACHE lat,lng FROM geotest WHERE
  2. lat < 41.3665028663272 AND
  3. lng < -72.41912841796875 AND
  4. lat > 40.113789191575236 AND
  5. lng > -75.83038330078125;

No index 1.73s. With B-Tree index on latitude 0.72s.

Test 2: Using spatial extensions and POINT field

  1. SET @rect = 'POLYGON((41.3665028663272 -72.41912841796875,41.3665028663272 -75.83038330078125,40.113789191575236 -75.83038330078125,40.113789191575236 -72.41912841796875,41.3665028663272 -72.41912841796875))';
  2. SELECT SQL_NO_CACHE astext(location) from geotest where intersects(location,GeomFromText(@rect));

Time taken without index: 9.52s. With a spatial index: 0.73s.

Test 3: Using morton number

  1. SELECT SQL_NO_CACHE lat,lng FROM geotest WHERE
  2. morton > 3667198027933142835 AND morton < 3671111582099533095 AND
  3. lat < 41.3665028663272 AND
  4. lng < -72.41912841796875 AND
  5. lat > 40.113789191575236 AND
  6. lng > -75.83038330078125;

Time taken without index: 0.78s, with index on on morton: 0.65s.

Conclusion

In the table below 'small' is around times square, 'medium' is new york city and 'large' is about 2/3rd of the US. I didn't bother doing all benchmarks for the ones I knew were slower.

methodsmallmediumlarge
plain select 1.73s
index on latitude0.72s
using point field9.52s
using point field + spatial index0.00s0.73s18.82s
using morton number0.78s
index on morton 0.00s0.65s3.23s

So it seems like using the morton number is a bit faster than using the spatial index, but there's not a huge difference considering this relatively large dataset. Using the spatial index has a number of benefits, the biggest being that you're easily able to select on much more complex queries (polygons and such). The major benefit of the morton number methodology is that it's significantly faster, especially as your dataset grows and you're able to use InnoDB, which can be much better performing if you're expecting a lot of updates.

Early update: my coworker kevin mentions the spatial queries are likely slowed down because 'astext' is called for every row. I'll have to do these again with separate lat/lng fields.

Update 2: Adding a lat and lng field and selecting on those is actually even slower (consistently 0.91s).

Update 3: With a smaller resultset both the spatial index and the morton index are both pegged at 0.00s. With a much larger resultset (big chunk of the US) I got 18.82s for the spatial index, and 3.23s for the morton index.

Indexing geo-data

Recently we started wondering what the most effective way is to index data based on Longitude and Latitude. Although we're not yet seeing performance problems, we're definitely anticipating them without an effective index. We're using MySQL for anything mission critical, so (some of) this information specifically applies to MySQL.

For many the obvious thing to do might be to add a mysql index on those two numbers:

  1. ALTER TABLE geo ADD INDEX (longitude,latitude)

The problem with how B-TREE indexes work, is that columns within the index will be used in order. Only if an exact match is found for the left-most column (longitude in this case) the latitude column is used. Since we're always selecting on a range of values, in practice this means that the latitude column in the index will in fact always be ignored.

This could be very inefficient if you're zoomed in quite a bit on for example a city on the east-coast (where I'm at). There will be a lot of matches for cities way north or south from here.

Splitting the earth up in rectangles

We figured a better way to do this is to just split up the earth in smaller rectangles. We could round the longitude and latitude numbers off to an integer and index on these.

  1. CREATE TABLE geo (
  2. longitude DOUBLE,
  3. latitude DOUBLE,
  4. idxlong SMALLINT,
  5. idxlat SMALLINT,
  6. INDEX (idxlong,idxlat);
  7. );

When inserting you'd just a idxlong = round(longitude) and you should be good to go.

The problem with this approach is that we split the earth up in 360 x-coordinates, and 180 y-coordinates. Whenever we're on a zoom-level higher than a single one of these sections, the index will not be used effectively. Furthermore, if we zoom in very deep (times square) we run the risk there's a lot of rows matching this area that will need to be evaluated. In short: the index is ineffective if you zoom to much smaller or bigger areas.

Dividing the earth up further

We could divide the earth up in 4 squares, and store that information instead. Every square could then divided up in 4 more squares, and so on.. We end up with what's called a Quadtree. To do this effectively, and not create new columns for every 'zoom level' we might need, we instead attempt to convert the longitude and latitude to a single value.

Simply put, if our X coordinate is 111111 and our Y coordinate is 000000, we want to end up with 101010101010. This is called the Morton number.

We can do this with the following (pseudo-)code:

  1. latitude = 43.63556267294633
  2. longitude = -79.4249939918518
  3.  
  4. // Since these can both be negative, we should convert them to an unsigned number
  5. // longitude goes from -180 to 180 and latitude from -90 to 90
  6.  
  7. latitude += 90;
  8. longitude += 180;
  9.  
  10. // Now we need to turn them into integers. It makes sense to fit them in a 32bit integer.
  11. // The maximum value for a 32bit integer is 4294967295
  12. // Since the numbers now go up to 360, we use round(4294967295/360) = 11930464.
  13.  
  14. latitude = (int)latitude * 11930464;
  15. longitude = (int)longitude * 11930464;
  16.  
  17. // The 'morton number'
  18. morton = 0
  19. // The current bit we're interleaving
  20. bit = 1
  21. // The position of the bit we're interleaving
  22. position = 0
  23.  
  24. while(bit <= latitude or bit <= longitude) {
  25.  
  26. if (bit & latitude) morton = morton | 1 << (2*position+1)
  27. if (bit & longitude) morton = morton | 1 << (2*position)
  28.  
  29. position += 1
  30. bit = 1 << position
  31.  
  32. }

We can now easily index on our 'morton' number.

The big flaw of using a Quadtree

This would be the most effective index if we're ever only interested in the contents of '1 square' regardless of the size. But we are in fact usually interested in a range (Everything between a top-left and bottom-right coordinate) that could cover multiple squares.

The best example is near the international dateline. Because we increased our numbers with 180, the international dateline now lies at x-coordinate 0 and 360. If we would like to SELECT items from both sides of this line, we would need to do two queries (or two ranges). This is a simple example, but the problem in fact occurs at every edge of a square. If we select from a random place in europe, and we happen to go across a square from the 3rd significant bit in our morton number, it means we will end up effectively splitting our table in 4 major segments and we'll end up with scanning a higher number of items for matches.

Solutions

If the top-left and bottom-right are close enough to each other ('close' will need to be defined), we can find out if the query could be problematic by getting the morton numbers for both and comparing the most significant bits we care about:

  1. // This example assumes both the numbers are 64 bit, and we really care about the top 16.
  2. problematic = ((morton1 ^ morton2) >> 48) != 0
  3.  
  4. // Note that the ^ operator is XOR (I had to look it up myself, because I rarely ever need it).

Another solution is to throw all of this out the window, and go with MySQL's Spatial extensions. The spatial extensions provides much more features beyond my need, so I've yet to find out if this is the best solution for myself. MySQL provides a spatial index, which is based off an R-Tree, which effectively uses overlapping rectangles. The other bonus is that things like selecting by radius (e.g.: everything in the range of x km) is possible.

Anything wrong with this logic? Do you have experience with this? I'd like to hear problems and solutions you've encountered!

Forking and MySQL connections

For some of our long-running processes we use PHP. It makes total sense from our perspective, because we can re-use all our existing business logic from our main PHP web application.

To make things more efficient, I recently started some work on using forks and have a couple of worker processes around.

This application is essentially the core of our transcoder. The parent process would retrieve new jobs from the queue and fire up a number of workers to actually transcode the file. The main problem is that the parent opens up a MySQL connection and fires off some queries. After the child process is done, it actually closes the MySQL connection regardless of if it was actually used or not.

This means I'll have to close all mysql connections before forking, and re-connecting right after. No big deal, but still at least a bit annoying.

  1. <?php
  2.  
  3. $db = new MySQLi('hostname','user','password');
  4.  
  5. if (pcntl_fork()) {
  6.  
  7. $status = 0;
  8. // parent
  9. pcntl_wait($status);
  10.  
  11. $result = $db->query('select version()');
  12. if ($db->error) echo $db->error;
  13.  
  14. } else {
  15.  
  16. // the child process does nothing and exits gracefully
  17.  
  18. }
  19.  
  20. ?>

Output:

  1. MySQL server has gone away

Solar Framework's master-slave db adapter

SolarPHP's framework now comes with SQL adapter for MySQL replicated setups since the 1.0.0alpha2 release. The adapter automatically picks master or slave databases depending on the type of query. If its a SELECT, it will go to a slave, if its a UPDATE, DELETE, etc. it will automatically go to the master.

Let me just put it out there, I think that's a really bad idea. While this will work for certain situations, it is advertised as "The Solar_Sql_Adapter_MysqlReplicated adapter handles all the connections and switching-around for you, and you don't need change a single line of application code.".

Quite frankly, I started out with a similar approach. It worked fine for our staging environment, which didn't experience a lot of load, when it went to the live machine things went bad.

Suddenly I realized there were more than couple places in the application that had to do SELECT based on information that was just INSERT'ed. After the INSERT was done, the SELECT was performed on the slave, while the data hadn't arrived yet from the master. This caused some pretty freaky errors. Hoping to fix the problem, I decided to always execute any query on the master after an data-altering statement was done. Turned out that the master got quite a bit more load than the slaves after that change.

Things got worse when there was a replication lag of say, a second. Users would modify some data by submitting a form, and when they ended up on a resulting page the data was gone!

Some of the things we looked at was using a whitelist or blacklist for queries that are always ok/not ok to go to the slave, but considering we then ended up with decoupling a logical block of code, we went for the pragmatic approach. Always pick the master or slave depending on what is required for a specific piece of functionality. Yes, it might require a bit more thinking as you're working on your database code, but the behaviour is predictable and reliable.

 1 2 Next →

About

My name is Evert, and I've been writing semi-regularly on this blog since 2006.

I'm currently available for contract work.

more info.

Subscribe

Dropbox

Dropbox is a simple cross-platform online backup and sync application. The first 2GB of space is free, and both you and me get an extra 250MB extra space if you sign up through this link.