Indexing geo-data 3: In practice

Since my last post I've found out that using the 'morton' number to index spatial number is also referred to as the Z-order.

To index using this order, you can use this stored function:

  1. CREATE FUNCTION getGeoMorton(lat DOUBLE, lng DOUBLE) RETURNS BIGINT UNSIGNED DETERMINISTIC
  2. BEGIN
  3.  
  4. -- 11930464 is round(maximum value of a 32bit integer / 360 degrees)
  5.  
  6. DECLARE bit, morton, pos BIGINT UNSIGNED DEFAULT 0;
  7.  
  8. SET @lat = CAST((lat + 90) * 11930464 AS UNSIGNED);
  9. SET @lng = CAST((lng + 180) * 11930464 AS UNSIGNED);
  10. SET bit = 1;
  11.  
  12. WHILE bit <= @lat || bit <= @lng DO
  13.  
  14. IF(bit & @lat) THEN SET morton = morton | ( 1 << (2 * pos + 1)); END IF;
  15. IF(bit & @lng) THEN SET morton = morton | ( 1 << (2 * pos)); END IF;
  16.  
  17. SET pos = pos + 1;
  18.  
  19. SET bit = 1 << pos;
  20.  
  21. END WHILE;
  22.  
  23. RETURN morton;
  24. END;

Some caveats

  • Since the function is using floating-point numbers, there will be rounding errors. These are generally very small, and for us well within the acceptable margin of error.
  • More significantly, this function assumes euclidean geometry (i.e.: a flat surface). The earth obviously isn't, so as you get closer to the poles you might get back results from outside your rectangle, or miss results from within.
  • It's not recommended to use this function directly in your WHERE clause. Even though the function is marked deterministic (i.e.: it will always yield the same results for the same arguments), the MySQL query optimizer currently ignored that modifier. So, set it in a temporary variable first (SET @number =) and use the variable in your where clause.
  • Don't forget dealing with queries spanning over the -180°, 180° longitude line.
  • Also store your actual longitude and latitude values in the DB. This function is not intended to be exact.
  • When you do your queries, select both on the morton number, and longitude and latitude.

A better way to do it?

In my research I've found the Hilbert Curve to be an even better algorithm, but haven't yet gone through the effort of trying to express it in SQL.

Indexing geo-data 2 : simple benchmark

After my last post, I decided to do some benchmarking. For this benchmark I used the US data from Geonames.org. I inserted all the data (1,886,420 records) and searched for a big area around new york (between 41.3665028663272, -72.41912841796875 and 40.113789191575236, -75.83038330078125). We're expecting to get 38259 records back for this query.

Test 1: Selecting on longitude, latitude

  1. SELECT SQL_NO_CACHE lat,lng FROM geotest WHERE
  2. lat < 41.3665028663272 AND
  3. lng < -72.41912841796875 AND
  4. lat > 40.113789191575236 AND
  5. lng > -75.83038330078125;

No index 1.73s. With B-Tree index on latitude 0.72s.

Test 2: Using spatial extensions and POINT field

  1. SET @rect = 'POLYGON((41.3665028663272 -72.41912841796875,41.3665028663272 -75.83038330078125,40.113789191575236 -75.83038330078125,40.113789191575236 -72.41912841796875,41.3665028663272 -72.41912841796875))';
  2. SELECT SQL_NO_CACHE astext(location) from geotest where intersects(location,GeomFromText(@rect));

Time taken without index: 9.52s. With a spatial index: 0.73s.

Test 3: Using morton number

  1. SELECT SQL_NO_CACHE lat,lng FROM geotest WHERE
  2. morton > 3667198027933142835 AND morton < 3671111582099533095 AND
  3. lat < 41.3665028663272 AND
  4. lng < -72.41912841796875 AND
  5. lat > 40.113789191575236 AND
  6. lng > -75.83038330078125;

Time taken without index: 0.78s, with index on on morton: 0.65s.

Conclusion

In the table below 'small' is around times square, 'medium' is new york city and 'large' is about 2/3rd of the US. I didn't bother doing all benchmarks for the ones I knew were slower.

methodsmallmediumlarge
plain select 1.73s
index on latitude0.72s
using point field9.52s
using point field + spatial index0.00s0.73s18.82s
using morton number0.78s
index on morton 0.00s0.65s3.23s

So it seems like using the morton number is a bit faster than using the spatial index, but there's not a huge difference considering this relatively large dataset. Using the spatial index has a number of benefits, the biggest being that you're easily able to select on much more complex queries (polygons and such). The major benefit of the morton number methodology is that it's significantly faster, especially as your dataset grows and you're able to use InnoDB, which can be much better performing if you're expecting a lot of updates.

Early update: my coworker kevin mentions the spatial queries are likely slowed down because 'astext' is called for every row. I'll have to do these again with separate lat/lng fields.

Update 2: Adding a lat and lng field and selecting on those is actually even slower (consistently 0.91s).

Update 3: With a smaller resultset both the spatial index and the morton index are both pegged at 0.00s. With a much larger resultset (big chunk of the US) I got 18.82s for the spatial index, and 3.23s for the morton index.

Indexing geo-data

Recently we started wondering what the most effective way is to index data based on Longitude and Latitude. Although we're not yet seeing performance problems, we're definitely anticipating them without an effective index. We're using MySQL for anything mission critical, so (some of) this information specifically applies to MySQL.

For many the obvious thing to do might be to add a mysql index on those two numbers:

  1. ALTER TABLE geo ADD INDEX (longitude,latitude)

The problem with how B-TREE indexes work, is that columns within the index will be used in order. Only if an exact match is found for the left-most column (longitude in this case) the latitude column is used. Since we're always selecting on a range of values, in practice this means that the latitude column in the index will in fact always be ignored.

This could be very inefficient if you're zoomed in quite a bit on for example a city on the east-coast (where I'm at). There will be a lot of matches for cities way north or south from here.

Splitting the earth up in rectangles

We figured a better way to do this is to just split up the earth in smaller rectangles. We could round the longitude and latitude numbers off to an integer and index on these.

  1. CREATE TABLE geo (
  2. longitude DOUBLE,
  3. latitude DOUBLE,
  4. idxlong SMALLINT,
  5. idxlat SMALLINT,
  6. INDEX (idxlong,idxlat);
  7. );

When inserting you'd just a idxlong = round(longitude) and you should be good to go.

The problem with this approach is that we split the earth up in 360 x-coordinates, and 180 y-coordinates. Whenever we're on a zoom-level higher than a single one of these sections, the index will not be used effectively. Furthermore, if we zoom in very deep (times square) we run the risk there's a lot of rows matching this area that will need to be evaluated. In short: the index is ineffective if you zoom to much smaller or bigger areas.

Dividing the earth up further

We could divide the earth up in 4 squares, and store that information instead. Every square could then divided up in 4 more squares, and so on.. We end up with what's called a Quadtree. To do this effectively, and not create new columns for every 'zoom level' we might need, we instead attempt to convert the longitude and latitude to a single value.

Simply put, if our X coordinate is 111111 and our Y coordinate is 000000, we want to end up with 101010101010. This is called the Morton number.

We can do this with the following (pseudo-)code:

  1. latitude = 43.63556267294633
  2. longitude = -79.4249939918518
  3.  
  4. // Since these can both be negative, we should convert them to an unsigned number
  5. // longitude goes from -180 to 180 and latitude from -90 to 90
  6.  
  7. latitude += 90;
  8. longitude += 180;
  9.  
  10. // Now we need to turn them into integers. It makes sense to fit them in a 32bit integer.
  11. // The maximum value for a 32bit integer is 4294967295
  12. // Since the numbers now go up to 360, we use round(4294967295/360) = 11930464.
  13.  
  14. latitude = (int)latitude * 11930464;
  15. longitude = (int)longitude * 11930464;
  16.  
  17. // The 'morton number'
  18. morton = 0
  19. // The current bit we're interleaving
  20. bit = 1
  21. // The position of the bit we're interleaving
  22. position = 0
  23.  
  24. while(bit <= latitude or bit <= longitude) {
  25.  
  26. if (bit & latitude) morton = morton | 1 << (2*position+1)
  27. if (bit & longitude) morton = morton | 1 << (2*position)
  28.  
  29. position += 1
  30. bit = 1 << position
  31.  
  32. }

We can now easily index on our 'morton' number.

The big flaw of using a Quadtree

This would be the most effective index if we're ever only interested in the contents of '1 square' regardless of the size. But we are in fact usually interested in a range (Everything between a top-left and bottom-right coordinate) that could cover multiple squares.

The best example is near the international dateline. Because we increased our numbers with 180, the international dateline now lies at x-coordinate 0 and 360. If we would like to SELECT items from both sides of this line, we would need to do two queries (or two ranges). This is a simple example, but the problem in fact occurs at every edge of a square. If we select from a random place in europe, and we happen to go across a square from the 3rd significant bit in our morton number, it means we will end up effectively splitting our table in 4 major segments and we'll end up with scanning a higher number of items for matches.

Solutions

If the top-left and bottom-right are close enough to each other ('close' will need to be defined), we can find out if the query could be problematic by getting the morton numbers for both and comparing the most significant bits we care about:

  1. // This example assumes both the numbers are 64 bit, and we really care about the top 16.
  2. problematic = ((morton1 ^ morton2) >> 48) != 0
  3.  
  4. // Note that the ^ operator is XOR (I had to look it up myself, because I rarely ever need it).

Another solution is to throw all of this out the window, and go with MySQL's Spatial extensions. The spatial extensions provides much more features beyond my need, so I've yet to find out if this is the best solution for myself. MySQL provides a spatial index, which is based off an R-Tree, which effectively uses overlapping rectangles. The other bonus is that things like selecting by radius (e.g.: everything in the range of x km) is possible.

Anything wrong with this logic? Do you have experience with this? I'd like to hear problems and solutions you've encountered!

Geo standards on the web

Location-aware web applications are rising, and I'm personally very interested in this space. Besides the obvious cool-factor, I think there are a lot of uses for location-aware information. This post is a short overview of geo-related standards and api's.

The heart of it all lies in 2 numbers, latitude and longitude.

How to get these numbers?

However you publish the GPS coordinates, the information will at one point in time come from a user. However, asking a user to enter these numbers is not the best end-user experience.

Use a map

A very common way is simply ask a user to pinpoint their location on a map. The most common two providers for these are:

Google maps is the most popular (no numbers!), but if you are looking for more flexibility, yahoo maps might be the best choice. Yahoo maps has an Actionscript API which enables you to add a lot of customizations. Justin Everett-Church has some cool examples of what's possible. Yahoo also provides a REST api, allowing you to just fetch the tiles and do all the stitching yourself.

Address to GPS translation.

This is called Geocoding. Yahoo has a very easy to use REST API to do this translation. Google also provides an API, which is also able to spit out kml as well.

Browser plugins!

W3c currently has a draft for a geo location api, which is also part of the HTML5 movement.

The user interface is nice too. You are asked by the site if you are ok with supplying your coordinates, which then get sent back to the client.

The best implementation I've seen from this is the Geode extension. I can highly recommend to install the add-on and give the demo's a shot. This extension will only work if your computer has WiFi. It makes use of Sky Hook, which has mapped locations of wifi routers across the world. It mainly seems to have done the urban areas in north america and europe.

Geode in action

I actually moved a month or two ago from 43.651904 -79.428498 to 43.645466 -79.448729. For the first month our router was still mapped to the old address, but it updated recently to reflect my new address. Creepy, but cool!

Gears also has a javascript API, which might be a bit more common. My personal experience has been that I've only been able to get information to up to 4 decimals, which placed me in a different part of town.

Firefox 3.1 will also come with the api built in HOWEVER, it serves as an empty shell for extension implementors. You'll still need to install an extension to actually provide the coordinates, which could work through WiFi mapping, Cell-tower triangulation (the method the iPhone uses) or manual input. An extension that does this through manual input of an address is 'Geolocation'.

From devices

There are a couple devices on the market that track geo information. I wrote a small overview earlier today, and the most impressive seems to be the Amod AGL3080. Its conceptually very simple and when you plug it into your machine, it will show up as an external harddrive with a CSV.

Next to that more camera-phones appear on the market that have support for embedding GPS information within the EXIF data of JPEG's. Most notably recent blackberries and the iPhone. Its reasonably easy to extract this information if it is available. PHP has an extension to read it out, but you'll need to do some conversion from the stored coordinates, as they are specified as the (now less common) degrees, minutes, seconds format.

Publishing

Clearly the easiest way to publish location information is through maps, but the concept of adding this meta-data to the documents you produce (be that html, api's, rss ..) can be much more interesting. This will allow people to re-use the data and present it somewhere else.

Geo RSS

Adding geo information to RSS is an easy one. RSS and Atom is already very widely used. Originally used for feeding blogposts, it has gotten much wider usage such as aggregating pictures, video, etc.

georss.org has some simple examples on how to do this. Flickr is already pretty big on GEO information, it confuses me why they haven't yet integrated the standard.

HTML meta tags

This will allow you add GPS information to a specific site or page There's two ways to define this:

  1. <meta name="geo.position" content="62.300626;-84.023437" />
  2. <meta name="ICBM" content="62.300626, -84.023437" />

There's no harm in including both. It's not very clear to me which one is used more, but the former seems to have more popularity in recent implementations.

Geo microformat

There is also a microformat spec for geo information. Personally, I'm not a big fan of microformats, but would like to be proven wrong in its usefulness.

Microblogging

A standard way to include GPS information for microblogging (and with microblogging, 99% reads twitter.), using a very easy format. This allows cool applications such as Twittervision and Twinkle. The latter had an interesting review on the ever offensive Something Awful.

KML

KML is also on its way to become a standard for the geo-web. KML covers much complexer drawing and is already supported by a number of apps, most prominently Google Earth and Nasa Worldwind. People who have google earth installed can be directed straight to a location on the map by serving a .kml file.

Conclusion

Well, I think the geo-aware web has a big and bright future. Having appropriate standards in place to consume and publish this information is very important. Even though a minimal amount of users will immediately benefit from for example geo information in your html head section. Allowing other developers or applications to access this information today will help pave the future for a fully geo-enabled web. I say all this realizing this blog does not provide this information, but hey.. nobody is perfect :). I can proudly say however I have some geo-aware applications in the works.

I'm very interested in your ideas of examples of interesting use of GPS information! If I missed an important or relevant API or you have a cool example, make sure to leave a comment.

/me goes off to find a restaurant near me and wonders where my friends would hang out on a cold saturday night in december.

Gps trackers - any advice?

Like half of the web development community, I recently became really interested in location-enabled stuff ;)

I really want to get a GPS tracker (for christmas!, hint hint!), there seem to be a couple out there, so I'm wondering if there's any people who have such a device and what their experiences are. Ideal features:

  • Small! I don't need to display, I just want to drop it in my backpack and forget about it.
  • Has USB, all I need is something that spits out cvs data (or similar format) on the drive which I can extract.
  • Somewhat good battery life (>24h).
  • Most importantly, charges through USB. Batteries is a no-go (for any device for that matter, I'm looking at you wii controller.)

So we have some options:

Professional mini tracker key

Pro:

  • Title of the product is completely written in uppercase, has to mean something right?
  • Logs data every second.

Con:

  • Uses AA batteries.
  • No specs on battery life.
  • No specs on how to extract the data, but has USB.
  • On the expensive side ($278.99).

TRACKING KEY LAS-1505

Pro:

  • "Ideal For Parents Who Want To See If Their Teenagers Are Speeding", seriously.. the page says that literally.

Con:

  • Runs on AA batteries.
  • Price ($206).

Amod AGL3080

Pro:

  • Looks like a more decent product.
  • Price: $69
  • Uses standard format for data (works as csv on usb drive), accessible with my scripting powers.
  • Works on a set interval or 'Push to log'.
  • Works for 15 hours (good enough).

Con:

  • AA batteries.

Zoombak ZMBK200

Pro:

  • Totally rad name!
  • A/C wall charger! (the only one so far).

Con:

  • Battery life for 150 location requests.
  • Subscription fee! (this is where I'm not even going to look for more cons).

Trackstick 2

Pro:

  • Reasonable price: $114.
  • Exports to CSV.

Con:

  • 1MB flash drive.
  • AA batteries.

Conclusion so far:

The Amod AGL3080 seems by a stretch the most impressive device, not only does it seem to have a good battery life, and plenty of storage. Just judging from the page on amazon, it quite frankly seems the most professional, despite it having the lowest price on the list.

Dear lazyweb, am I making a mistake? Are there better devices out there?

 1

About

My name is Evert, and I've been writing semi-regularly on this blog since 2006.

I'm currently available for contract work.

more info.

Subscribe

Dropbox

Dropbox is a simple cross-platform online backup and sync application. The first 2GB of space is free, and both you and me get an extra 250MB extra space if you sign up through this link.