Indexing geo-data 2 : simple benchmark

After my last post, I decided to do some benchmarking. For this benchmark I used the US data from Geonames.org. I inserted all the data (1,886,420 records) and searched for a big area around new york (between 41.3665028663272, -72.41912841796875 and 40.113789191575236, -75.83038330078125). We're expecting to get 38259 records back for this query.

Test 1: Selecting on longitude, latitude

  1. SELECT SQL_NO_CACHE lat,lng FROM geotest WHERE
  2. lat < 41.3665028663272 AND
  3. lng < -72.41912841796875 AND
  4. lat > 40.113789191575236 AND
  5. lng > -75.83038330078125;

No index 1.73s. With B-Tree index on latitude 0.72s.

Test 2: Using spatial extensions and POINT field

  1. SET @rect = 'POLYGON((41.3665028663272 -72.41912841796875,41.3665028663272 -75.83038330078125,40.113789191575236 -75.83038330078125,40.113789191575236 -72.41912841796875,41.3665028663272 -72.41912841796875))';
  2. SELECT SQL_NO_CACHE astext(location) from geotest where intersects(location,GeomFromText(@rect));

Time taken without index: 9.52s. With a spatial index: 0.73s.

Test 3: Using morton number

  1. SELECT SQL_NO_CACHE lat,lng FROM geotest WHERE
  2. morton > 3667198027933142835 AND morton < 3671111582099533095 AND
  3. lat < 41.3665028663272 AND
  4. lng < -72.41912841796875 AND
  5. lat > 40.113789191575236 AND
  6. lng > -75.83038330078125;

Time taken without index: 0.78s, with index on on morton: 0.65s.

Conclusion

In the table below 'small' is around times square, 'medium' is new york city and 'large' is about 2/3rd of the US. I didn't bother doing all benchmarks for the ones I knew were slower.

methodsmallmediumlarge
plain select 1.73s
index on latitude0.72s
using point field9.52s
using point field + spatial index0.00s0.73s18.82s
using morton number0.78s
index on morton 0.00s0.65s3.23s

So it seems like using the morton number is a bit faster than using the spatial index, but there's not a huge difference considering this relatively large dataset. Using the spatial index has a number of benefits, the biggest being that you're easily able to select on much more complex queries (polygons and such). The major benefit of the morton number methodology is that it's significantly faster, especially as your dataset grows and you're able to use InnoDB, which can be much better performing if you're expecting a lot of updates.

Early update: my coworker kevin mentions the spatial queries are likely slowed down because 'astext' is called for every row. I'll have to do these again with separate lat/lng fields.

Update 2: Adding a lat and lng field and selecting on those is actually even slower (consistently 0.91s).

Update 3: With a smaller resultset both the spatial index and the morton index are both pegged at 0.00s. With a much larger resultset (big chunk of the US) I got 18.82s for the spatial index, and 3.23s for the morton index.

PHP: Arrays vs. Objects

In a lot of cases arrays are used in PHP to store object-like information, like the results of a database query. I do this a lot too, but I kind of want to change things around to make use of VO's. I feel this makes a lot more sense, since most of the application I build are heavy OOP anyway, and I get all the added OOP benefits, like type-hinting, inheritance.. well, you know the deal.

I wanted to see what the differences would be in terms of memory consumption, so I set up the following test:

  1. <?php
  2.  
  3. // first test simple associative arrays
  4. $memory1 = xdebug_memory_usage( );
  5.  
  6. $data = array();
  7.  
  8. for($i=0;$i<1000;$i++) {
  9.  
  10. $data[] = array(
  11. 'property1' => md5(microtime()),
  12. 'property2' => md5(microtime()),
  13. 'property3' => md5(microtime()),
  14. );
  15.  
  16. }
  17.  
  18. $array = xdebug_memory_usage()-$memory1 . "\n";
  19.  
  20. // Now do the same thing, but with a class..
  21.  
  22. class Test {
  23.  
  24. public $property1;
  25. public $property2;
  26. public $property3;
  27.  
  28. }
  29.  
  30. $data = array();
  31.  
  32. $memory1 = xdebug_memory_usage( );
  33.  
  34. for($i=0;$i<1000;$i++) {
  35.  
  36. $test = new Test();
  37. $test->property1 = md5(microtime());
  38. $test->property2 = md5(microtime());
  39. $test->property3 = md5(microtime());
  40. $data[] = $test;
  41.  
  42.  
  43. }
  44.  
  45. $object = xdebug_memory_usage()-$memory1;
  46.  
  47. echo 'Arrays: ' . $array . "\n";
  48. echo 'Objects: ' . $object;
  49.  
  50. ?>

My results were

  1. Arrays: 536596
  2. Objects: 521932

I knew there was a good chance objects would take up less memory, because arrays need to store both the propertyname (or key) and value for every record, while the object only needs to store the values, because the propertynames are stored centrally in the class definition, what I didn't expect was that using arrays takes more than 20 times more memory. This is hardly an accurate formula, but it does tell you something.

Right, that was stupid.. I had my testing code wrong and I did the $data=array(); right after the second xdebug_memory_usage(). The actual conclusion here is that there's not much difference. I was hoping the objects would make a significant difference, but its minimal.

 1

About

My name is Evert, and I've been writing semi-regularly on this blog since 2006.

I'm currently available for contract work.

more info.

Subscribe

Dropbox

Dropbox is a simple cross-platform online backup and sync application. The first 2GB of space is free, and both you and me get an extra 250MB extra space if you sign up through this link.