SabreDAV 0.7
Just released 0.7 of the SabreDAV library. This release implements some of the more obscure WebDAV features, and a plugin system.
Some changes:
- Basic plugin system.
- Comes with a plugin to display html indexes for browsers.
- Sabre_DAV_FilterTree is now Sabre_DAV_Tree_Filter.
- Sabre_DAV_TemporaryFileFilter is now Sabre_DAV_Tree_TemporaryFileFilter.
Any help is greatly appreciated. I'm finding this taking up a large amount of my free time, so anyone interested in documenting, fixing bugs, writing tests, implementing additional webdav rfc's, making logo's definitely send a message to the mailing list.
Indexing geo-data 3: In practice
Since my last post I've found out that using the 'morton' number to index spatial number is also referred to as the Z-order.
To index using this order, you can use this stored function:
CREATE FUNCTION getGeoMorton(lat DOUBLE, lng DOUBLE) RETURNS BIGINT UNSIGNED DETERMINISTIC
BEGIN
-- 11930464 is round(maximum value of a 32bit integer / 360 degrees)
DECLARE bit, morton, pos BIGINT UNSIGNED DEFAULT 0;
SET @lat = CAST((lat + 90) * 11930464 AS UNSIGNED);
SET @lng = CAST((lng + 180) * 11930464 AS UNSIGNED);
SET bit = 1;
WHILE bit <= @lat || bit <= @lng DO
IF(bit & @lat) THEN SET morton = morton | ( 1 << (2 * pos + 1)); END IF;
IF(bit & @lng) THEN SET morton = morton | ( 1 << (2 * pos)); END IF;
SET pos = pos + 1;
SET bit = 1 << pos;
END WHILE;
RETURN morton;
END;
Some caveats
- Since the function is using floating-point numbers, there will be rounding errors. These are generally very small, and for us well within the acceptable margin of error.
- More significantly, this function assumes euclidean geometry (i.e.: a flat surface). The earth obviously isn't, so as you get closer to the poles you might get back results from outside your rectangle, or miss results from within.
- It's not recommended to use this function directly in your WHERE clause. Even though the function is marked deterministic (i.e.: it will always yield the same results for the same arguments), the MySQL query optimizer currently ignored that modifier. So, set it in a temporary variable first (SET @number =) and use the variable in your where clause.
- Don't forget dealing with queries spanning over the -180°, 180° longitude line.
- Also store your actual longitude and latitude values in the DB. This function is not intended to be exact.
- When you do your queries, select both on the morton number, and longitude and latitude.
A better way to do it?
In my research I've found the Hilbert Curve to be an even better algorithm, but haven't yet gone through the effort of trying to express it in SQL.
Indexing geo-data 2 : simple benchmark
After my last post, I decided to do some benchmarking. For this benchmark I used the US data from Geonames.org. I inserted all the data (1,886,420 records) and searched for a big area around new york (between 41.3665028663272, -72.41912841796875 and 40.113789191575236, -75.83038330078125). We're expecting to get 38259 records back for this query.
Test 1: Selecting on longitude, latitude
SELECT SQL_NO_CACHE lat,lng FROM geotest WHERE
lat < 41.3665028663272 AND
lng < -72.41912841796875 AND
lat > 40.113789191575236 AND
lng > -75.83038330078125;
No index 1.73s. With B-Tree index on latitude 0.72s.
Test 2: Using spatial extensions and POINT field
SET @rect = 'POLYGON((41.3665028663272 -72.41912841796875,41.3665028663272 -75.83038330078125,40.113789191575236 -75.83038330078125,40.113789191575236 -72.41912841796875,41.3665028663272 -72.41912841796875))';
SELECT SQL_NO_CACHE astext(location) from geotest where intersects(location,GeomFromText(@rect));
Time taken without index: 9.52s. With a spatial index: 0.73s.
Test 3: Using morton number
SELECT SQL_NO_CACHE lat,lng FROM geotest WHERE
morton > 3667198027933142835 AND morton < 3671111582099533095 AND
lat < 41.3665028663272 AND
lng < -72.41912841796875 AND
lat > 40.113789191575236 AND
lng > -75.83038330078125;
Time taken without index: 0.78s, with index on on morton: 0.65s.
Conclusion
In the table below 'small' is around times square, 'medium' is new york city and 'large' is about 2/3rd of the US. I didn't bother doing all benchmarks for the ones I knew were slower.
| method | small | medium | large |
|---|---|---|---|
| plain select | 1.73s | ||
| index on latitude | 0.72s | ||
| using point field | 9.52s | ||
| using point field + spatial index | 0.00s | 0.73s | 18.82s |
| using morton number | 0.78s | ||
| index on morton | 0.00s | 0.65s | 3.23s |
So it seems like using the morton number is a bit faster than using the spatial index, but there's not a huge difference considering this relatively large dataset. Using the spatial index has a number of benefits, the biggest being that you're easily able to select on much more complex queries (polygons and such). The major benefit of the morton number methodology is that it's significantly faster, especially as your dataset grows and you're able to use InnoDB, which can be much better performing if you're expecting a lot of updates.
Early update: my coworker kevin mentions the spatial queries are likely slowed down because 'astext' is called for every row. I'll have to do these again with separate lat/lng fields.
Update 2: Adding a lat and lng field and selecting on those is actually even slower (consistently 0.91s).
Update 3: With a smaller resultset both the spatial index and the morton index are both pegged at 0.00s. With a much larger resultset (big chunk of the US) I got 18.82s for the spatial index, and 3.23s for the morton index.
Indexing geo-data
Recently we started wondering what the most effective way is to index data based on Longitude and Latitude. Although we're not yet seeing performance problems, we're definitely anticipating them without an effective index. We're using MySQL for anything mission critical, so (some of) this information specifically applies to MySQL.
For many the obvious thing to do might be to add a mysql index on those two numbers:
ALTER TABLE geo ADD INDEX (longitude,latitude)
The problem with how B-TREE indexes work, is that columns within the index will be used in order. Only if an exact match is found for the left-most column (longitude in this case) the latitude column is used. Since we're always selecting on a range of values, in practice this means that the latitude column in the index will in fact always be ignored.
This could be very inefficient if you're zoomed in quite a bit on for example a city on the east-coast (where I'm at). There will be a lot of matches for cities way north or south from here.
Splitting the earth up in rectangles
We figured a better way to do this is to just split up the earth in smaller rectangles. We could round the longitude and latitude numbers off to an integer and index on these.
CREATE TABLE geo (
longitude DOUBLE,
latitude DOUBLE,
idxlong SMALLINT,
idxlat SMALLINT,
INDEX (idxlong,idxlat);
);
When inserting you'd just a idxlong = round(longitude) and you should be good to go.
The problem with this approach is that we split the earth up in 360 x-coordinates, and 180 y-coordinates. Whenever we're on a zoom-level higher than a single one of these sections, the index will not be used effectively. Furthermore, if we zoom in very deep (times square) we run the risk there's a lot of rows matching this area that will need to be evaluated. In short: the index is ineffective if you zoom to much smaller or bigger areas.
Dividing the earth up further
We could divide the earth up in 4 squares, and store that information instead. Every square could then divided up in 4 more squares, and so on.. We end up with what's called a Quadtree. To do this effectively, and not create new columns for every 'zoom level' we might need, we instead attempt to convert the longitude and latitude to a single value.
Simply put, if our X coordinate is 111111 and our Y coordinate is 000000, we want to end up with 101010101010. This is called the Morton number.
We can do this with the following (pseudo-)code:
latitude = 43.63556267294633
longitude = -79.4249939918518
// Since these can both be negative, we should convert them to an unsigned number
// longitude goes from -180 to 180 and latitude from -90 to 90
latitude += 90;
longitude += 180;
// Now we need to turn them into integers. It makes sense to fit them in a 32bit integer.
// The maximum value for a 32bit integer is 4294967295
// Since the numbers now go up to 360, we use round(4294967295/360) = 11930464.
latitude = (int)latitude * 11930464;
longitude = (int)longitude * 11930464;
// The 'morton number'
morton = 0
// The current bit we're interleaving
bit = 1
// The position of the bit we're interleaving
position = 0
while(bit <= latitude or bit <= longitude) {
if (bit & latitude) morton = morton | 1 << (2*position+1)
if (bit & longitude) morton = morton | 1 << (2*position)
position += 1
bit = 1 << position
}
We can now easily index on our 'morton' number.
The big flaw of using a Quadtree
This would be the most effective index if we're ever only interested in the contents of '1 square' regardless of the size. But we are in fact usually interested in a range (Everything between a top-left and bottom-right coordinate) that could cover multiple squares.
The best example is near the international dateline. Because we increased our numbers with 180, the international dateline now lies at x-coordinate 0 and 360. If we would like to SELECT items from both sides of this line, we would need to do two queries (or two ranges). This is a simple example, but the problem in fact occurs at every edge of a square. If we select from a random place in europe, and we happen to go across a square from the 3rd significant bit in our morton number, it means we will end up effectively splitting our table in 4 major segments and we'll end up with scanning a higher number of items for matches.
Solutions
If the top-left and bottom-right are close enough to each other ('close' will need to be defined), we can find out if the query could be problematic by getting the morton numbers for both and comparing the most significant bits we care about:
// This example assumes both the numbers are 64 bit, and we really care about the top 16.
problematic = ((morton1 ^ morton2) >> 48) != 0
// Note that the ^ operator is XOR (I had to look it up myself, because I rarely ever need it).
Another solution is to throw all of this out the window, and go with MySQL's Spatial extensions. The spatial extensions provides much more features beyond my need, so I've yet to find out if this is the best solution for myself. MySQL provides a spatial index, which is based off an R-Tree, which effectively uses overlapping rectangles. The other bonus is that things like selecting by radius (e.g.: everything in the range of x km) is possible.
Anything wrong with this logic? Do you have experience with this? I'd like to hear problems and solutions you've encountered!
Dangers of mutual dependencies
Much like most people, I try work out my class dependencies through a top-down 'waterfall'-ish approach. By attempting this, I think allows me to keep the structure very clear and understandable.

excuse my non-existent UML skills
In this example the Ingredient class (and subclasses) is Never aware of any Recipe classes, but only the other way round.
I try to apply the same model to instantiated objects and packages (groups of classes). When an object encapsulates another object, I attempt to make sure the sub-object object is not aware of the parent. When I design packages, I attempt to make sure 2 packages don't require 'each other'.
An example where this could be a problem is the following. Say, I have a 'Database' package. I want to log every database error to a 'Log' package. The 'Log' package has a couple of implementations, such as 'Log_File', 'Log_Syslog', but now I added 'Log_Database' to log any problems to the database.

Strictly speaking these two packages can no longer be separated and will always have to be used/downloaded together. As a bonus a database-error could occur while logging, resulting in an endless loop (or segmentation fault if you're using PHP).
Another example of a mutual dependency:
file1.php:
<?php
include 'file2.php';
?>
file2.php:
<?php
include 'file1.php';
?>
You get the idea ;)
However, these types of situations are sometimes simply unavoidable (that's why we have include_once). When they are needed, they should be implemented with care and consideration. The problem with the Log class could be fixed if the Log_Database is aware of errors thrown by itself, and this could be repackaged by creating by separating the Log_Database in a new package, depending on both the 'Connection' and 'Log' classes.
Editors note: this post was part of a much larger article around designing plug-in systems, but I lost inspiration and decided to post just this part.
WebDAV-related RFC's
In an attempt to better understand the WebDAV standards space, I made up a non-scientific graph of all the specs and dependencies. I'd like to get started with CalDAV, but I have a few other specs to implement before I'll be able to do that.

The next one for me on the list is ACL. Attempting to integrate these new features within the existing system so far has proven to be very challenging. The big reason is my (perhaps high) requirements on how this is supposed to work:
- It shouldn't touch the existing WebDAV system (at all), because 99% of the users will not use ACL.
- The interface & implementation should still be understandable if you are implementing ACL.
- I like the existing WebDAV class structure as it stands, so if I have to make changes in the design; it should still be easy to grasp.







