Frogmouth – Brad's various words

A look at SpatiaLite indexes

Appropriate indexes are the key to getting the most out of pretty much any database system. I’ve recently taken a deeper look at how SpatiaLite indexes work, and I thought there might be some interest in this.

Why index?

The first thing to note is that the purpose of the index is to avoid scanning large tables (or worse, huge joins of large tables). Why? Because full table scans are slow. Of course, full table scans of really small tables probably won’t be noticed. Conversely, if the results are going to be pretty much everything in the table, then using an index to find that isn’t going to help.

So if we have a table of places in various parts of the world, and an application that displays that data on a map, then when the user zooms to world scale, you might be better off turning off the display of your data until the user zooms in to something more sensible. Of course, in this case, the drawing all those things to the screen is probably going to take much more time than a full table scan, but you hopefully get the idea.

So we could index on some property (like the name, or some feature class), but SpatiaLite does its magic in the field of coordinates (typically northings and eastings, or latitude and longitude).

Importing Data

So lets look at an example. We’ll need some data, and the wonderful people at /http://www.geonames.org/ have made some available. I downloaded the “allcountries.zip” data set from http://download.geonames.org/export/dump/. It is about 200MB zipped.

Unzip allcountries.zip, and you will now have allcountries.txt, which is about 1GB of comma separated values.

Now import it into SpatiaLite, using something like:

.timer ON
BEGIN;

-- Open the original tab separate values geonames file
CREATE VIRTUAL TABLE geonamesvt USING VirtualText("allCountries.txt", "UTF-8", 0);

-- Create the real table
CREATE TABLE geonames (geonameid INT, name TEXT, asciiname TEXT, alternatenames TEXT, latitude DOUBLE, longitude DOUBLE, featureclass TEXT, featurecode TEXT, countrycode TEXT, cc2 TEXT, admin1code TEXT, admin2code TEXT, admin3code TEXT, admin4code TEXT, population LONG, elevation INT, gtopo30alt INT, timezone TEXT, modificationdate TEXT);

-- import the contents of the temporary geonames table into the real table
INSERT INTO geonames (geonameid, name, asciiname, alternatenames, latitude, longitude, featureclass, featurecode, countrycode, cc2, admin1code, admin2code, admin3code, admin4code, population, elevation, gtopo30alt, timezone, modificationdate) SELECT COL001,COL002,COL003,COL004,COL005,COL006,COL007,COL008,COL009,COL010,COL011,COL012,COL013,COL014,COL015,COL016,COL017,COL018,COL019 FROM geonamesvt;
SELECT AddGeometryColumn("geonames", "Geometry", 4326, "POINT", "XY");
SELECT CreateSpatialIndex("geonames", "Geometry");
UPDATE geonames SET Geometry=MakePoint(longitude,latitude,4326);

DROP TABLE geonamesvt;
-- COMMIT that
ANALYZE geonames;

COMMIT;

VACUUM;

If you save that SQL script to a file (say “importgeonames.sql”), you can use it as an input to spatialite on the command line:
spatialite geonames.sqlite < importgeonames.sql.

This is going to take a long time – about 25 minutes on my laptop. The output will probably be about 2GB.

Full table scan

Now we can open that using spatialite:
spatialite geonames.sqlite.

If you turn on timers, you can see how long the queries are taking. Lets do that, and then run a full table scan:

spatialite> .timer on
spatialite> select COUNT(name) from geonames;
8178593
CPU Time: user 2.088131 sys 0.900056

You might want to do that a few times to see if its fairly consistent.

So know we know we have about 8 million entries in the table, and that a full table scan takes about 2 seconds. So if we see any query taking that kind of time, and not returning millions of results, we need to find out why.

Since this a bit geographical, I thought I’d look for places around Greenwich (of Observatory fame, but there is a lot of cool stuff out there if you ever get a chance to visit). Just to make it interesting, lets find places around there that start with “Green”

spatialite> SELECT geonames.name, geonames.latitude, geonames.longitude FROM geonames WHERE geonames.latitude > 51 AND geonames.latitude < 52 AND geonames.longitude > -0.5 and geonames.longitude < 0.5 AND geonames.name LIKE "Green%";
Greenwich|51.47785|-0.01176
Greensted|51.70286|0.21586
Greenhithe|51.45026|0.28539
Greenford|51.52866|-0.35508
Greenwich|51.45|0.05
Green Parks House Psychiatric Unit|51.36866|0.05864
Greenhill|51.58342|-0.3386
Green Man Roundabout|51.57223|0.01635
Greenwich DLR Station|51.4781|-0.014
Greenhithe Railway Station|51.4506|0.27923
Greenwich Railway Station|51.4781|-0.014
Greenwich DLR Station|51.47805|-0.01574
Green End Farm|51.96373|-0.37371
Greenford Station|51.5426|-0.3463
Greenford Underground Station|51.54241|-0.34598
Green Park Underground Station|51.50667|-0.1427
Green Park|51.50413|-0.14415
Greenwich Park|51.47663|0.00225
CPU Time: user 2.092131 sys 0.984061

So that query says “give me anything located in the area bounded by 0.5W and 0.5E and 51N and 52N degrees that has a name starting with Green”.

Those results look reasonable, but clearly we’re doing a full table scan to find them. In this query, I also used the latitude and longitude column values rather than SpatiaLite’s Geometry. We could try it with Geometry:

SELECT geonames.name, geonames.latitude, geonames.longitude FROM geonames WHERE ST_Y(geonames.Geometry) > 51 AND ST_Y(geonames.Geometry) < 52 AND ST_X(geonames.Geometry) > -0.5 and ST_X(geonames.Geometry) < 0.5 AND geonames.name LIKE "Green%";
Greenwich|51.47785|-0.01176
Greensted|51.70286|0.21586
Greenhithe|51.45026|0.28539
Greenford|51.52866|-0.35508
Greenwich|51.45|0.05
Green Parks House Psychiatric Unit|51.36866|0.05864
Greenhill|51.58342|-0.3386
Green Man Roundabout|51.57223|0.01635
Greenwich DLR Station|51.4781|-0.014
Greenhithe Railway Station|51.4506|0.27923
Greenwich Railway Station|51.4781|-0.014
Greenwich DLR Station|51.47805|-0.01574
Green End Farm|51.96373|-0.37371
Greenford Station|51.5426|-0.3463
Greenford Underground Station|51.54241|-0.34598
Green Park Underground Station|51.50667|-0.1427
Green Park|51.50413|-0.14415
Greenwich Park|51.47663|0.00225
CPU Time: user 6.408401 sys 1.108069

Thats worse (and I’m a bit surprised by how much worse). I also tried some MbrContains operations, but that was still slower – perhaps a topic for further investigation and blogging.

Using an Index

In any case, we spent all that effort up front to create an index (with the SELECT CreateSpatialIndex("geonames", "Geometry"); bit of the SQL script), so lets try using it.

select geonames.name, geonames.latitude, geonames.longitude from geonames, idx_geonames_Geometry WHERE idx_geonames_Geometry.ymin > 51 AND idx_geonames_Geometry.ymax < 52.0 AND idx_geonames_Geometry.xmin > -0.5 AND idx_geonames_Geometry.xmax < 0.5 AND idx_geonames_Geometry.pkid == geonames.rowid AND geonames.name LIKE "Green%";
Greenhithe|51.45026|0.28539
Greenhithe Railway Station|51.4506|0.27923
Greenwich|51.47785|-0.01176
Greenwich DLR Station|51.4781|-0.014
Greenwich Railway Station|51.4781|-0.014
Greenwich DLR Station|51.47805|-0.01574
Green Parks House Psychiatric Unit|51.36866|0.05864
Greenwich|51.45|0.05
Greenwich Park|51.47663|0.00225
Greenford|51.52866|-0.35508
Green Park Underground Station|51.50667|-0.1427
Green Park|51.50413|-0.14415
Greensted|51.70286|0.21586
Green End Farm|51.96373|-0.37371
Greenford Station|51.5426|-0.3463
Greenford Underground Station|51.54241|-0.34598
Greenhill|51.58342|-0.3386
Green Man Roundabout|51.57223|0.01635
CPU Time: user 0.016001 sys 0.000000

OK, so that is better – a couple of orders of magnitude faster.

How does the index work?

The magic to this is the use of idx_geonames_Geometry table to find the row IDs that match the rowids in the geonames table. The spatial query in the idx_geonames_Geometry table is optimised by SQLite using the R*Tree algorithm.

The idx_geonames_Geometry table (or any other two dimensional R*Tree index) looks like:

spatialite> .schema idx_geonames_Geometry
CREATE VIRTUAL TABLE "idx_geonames_Geometry" USING rtree(
pkid, xmin, xmax, ymin, ymax);

pkid is the primary key ID value – an Integer. The other four columns define the bounding box, and are floating point values.

That VIRTUAL TABLE bit in the schema means that the data isn’t really in that table, but will be constructed by SQLite when needed. The real data is contained in what SQLite calls “shadow tables” – they have names that start with the name of the virtual table, and end in _node, _rowid and _parent.

R*Tree is optimised for doing range queries (i.e. where some database column value is in some range of values). So it can find the items (well, the row id of the items) in the table we’re really interested in, based on a bounding box [xmin...xmax, ymin...ymax] that comes from the index table.

Things to watch out for

The R*Tree doesn’t store exact minimum bounding rectangles (MBRs), but instead uses an approximation based on 32-bit floating point numbers, although the error is tiny (of the order 0.000012% according to the SQLite documentation). New versions of SQLite always round in the direction that makes sure that the location of items are included in the bounding box (i.e. xmin and ymin get rounded down, xmax and ymax get rounded up). That does what you want for the common case (geometry in an region of interest). It may not be what you want if you are doing points “not in” (e.g. excluding an inner region). It also may not be what you want if you are trying for overlap (i.e. where equal to matters). See the SQLite documentation for R*Tree if that matters to you.

The x coordinate is longitude (or easting). The y coordinate is latitude (or northing).

When selecting the bounding rectangle based on maps on the screen, make sure you get the largest rectangle. For example, for some projections, you can just use any two opposing corners. That may not work everywhere though, since as you approach the poles, the longitude range at the top of the screen isn’t the same as the longitude range at the bottom of the screen. Also, if the map can be rotated, the screen to geo conversion may not produce the greatest latitude number at the top of the screen.

Summary

So by eliminating the number of entries that we have to do something difficult on (and the SQLite LIKE isn’t that computationally intensive, compared to some of the more esoteric spatial operations that SpatiaLite can do) we can get to the “feasible set” very quickly, and put the CPU cycles to work most effectively.

GeoPackage Tools

I’ve been working on GeoPackage a bit. GeoPackage is intended to be a format for geospatial data (vectors, tiles and full rasters) that is suitable for use in offline scenarios. To help me understand the specification, I’ve been doing some simple implementation work, mainly on the tiles part. Its called GeoPackage Tools – a highly imaginative name for some tools that work with GeoPackage…

At this stage, there are three command line tools, and some supporting code, all in python. The command line tools
are:
– mb2gpkg, which converts a [MBTiles](http://mbtiles.org) package to GeoPackage
– gpkginfo, which dumps information on the content of a GeoPackage
– gpkg2img, which combines tiles in a GeoPackage to form a composite image

Note that the GeoPackage specification is still in work, and there could be a lot of changes before that specification is complete. These tools are intended to help understand the specification better in terms of implementation issues and functionality. They are not suitable for any kind of “production” use.

The source code is maintained on Launchpad, and configuration management uses bzr. You can find out more about bzr at http://bazaar.canonical.com/en/

To check out the code, just do

bzr branch lp:gpkgtools

There is no need for any installation or build. However you may need to sort out some dependencies. The main dependencies are:

* Python `>= 2.6`
* Python Image Library
* Spatialite (version 3.0.0 or later)
* pysqlite2 that supports extension loading.

Most of those should be easily available, except perhaps the last one. See instructions in the source code README.html for how to deal with this.

Here are some example:

Convert an mbtiles file into a GeoPackage:

./mb2gpkg ~/devel/samples/mbtiles/mapbox.geography-class.mbtiles geography-class.geopackage

The output of that should look something like:

DEBUG:gpkgtools.util_sqlite:Exporting MBTiles to geopackage
DEBUG:gpkgtools.util_sqlite:/home/bradh/devel/samples/mbtiles/mapbox.geography-class.mbtiles --> geography-class.geopackage
DEBUG:gpkgtools.util_sqlite:Creating manifest
DEBUG:gpkgtools.util_sqlite:Completing raster_table_metadata
DEBUG:gpkgtools.util_sqlite:Completing tile_matrix_metadata
DEBUG:gpkgtools.util_sqlite:Completing raster_format_metadata
DEBUG:gpkgtools.util_sqlite:Creating from_mbtiles_tiles
DEBUG:gpkgtools.util_sqlite:Creating from_mbtiles_tiles_rt_metadata
DEBUG:gpkgtools.util_sqlite:Copying data and metadata to geopackage at zoom level:0
DEBUG:gpkgtools.util_sqlite:Copying data and metadata to geopackage at zoom level:1
DEBUG:gpkgtools.util_sqlite:Copying data and metadata to geopackage at zoom level:2
DEBUG:gpkgtools.util_sqlite:Copying data and metadata to geopackage at zoom level:3
DEBUG:gpkgtools.util_sqlite:Copying data and metadata to geopackage at zoom level:4
DEBUG:gpkgtools.util_sqlite:Copying data and metadata to geopackage at zoom level:5
DEBUG:gpkgtools.util_sqlite:Copying data and metadata to geopackage at zoom level:6
DEBUG:gpkgtools.util_sqlite:Copying data and metadata to geopackage at zoom level:7
DEBUG:gpkgtools.util_sqlite:Copying data and metadata to geopackage at zoom level:8
DEBUG:gpkgtools.util_sqlite:Committing results
DEBUG:gpkgtools.util_sqlite:analysing database
DEBUG:gpkgtools.util_sqlite:cleaning database

Dump metadata associated with a GeoPackage

./gpkginfo geography-class.geopackage

Which should result in output like:

Title:              GeoPackage Manifest
Abstract:           Manifest containing references to data tables in this GeoPackage.
Identifier:         geography-class.geopackage.manifest
Metadata
    Abstract:       [Not provided]
    Here:
       BBox:        (-180.0, -85.05112877980659, 180.0, 85.05112877980659)
       href:        geography-class.geopackage
Reference Group
    Abstract:       This ReferenceGroup identifies the contents of a GeoPackage file converted from /home/bradh/devel/samples/mbtiles/mapbox.geography-class.mbtiles
    Identifier:     geography-class.geopackage
    # tables:       1
    TableEntry:     from_mbtiles_tiles
       Type:        tiles
       Version:     2012.05.19
       Identifier:  geography-class.geopackage.from_mbtiles_tiles
       Abstract:    One of the example maps that comes with TileMill - a bright & colorful world map that blends retro and high-tech with its folded paper texture and interactive flag tooltips.
Raster / tile entry:   from_mbtiles_tiles
   title:      Geography Class
   abstract:   One of the example maps that comes with TileMill - a bright & colorful world map that blends retro and high-tech with its folded paper texture and interactive flag tooltips.
   version:    1.0.0
   min_x:      -180.0
   min_y:      -85.0511287798
   max_x:      180.0
   max_y:      85.0511287798
   srid:       3857
   mime type:  image/jpeg
     Compression Quality:  75
     Image representation: RGB (Red, Green, Blue true colour)
     Bit Depth:            24
     Image Category:       VIS
     Image Mode:           N
     Number of bands:      3
   zoomlevel: 0
     tile size:            256 x 256
     tile matrix:          1 x 1
     pixel dimensions:     1.40625 x 0.664461943592
   zoomlevel: 1
     tile size:            256 x 256
     tile matrix:          2 x 2
     pixel dimensions:     0.703125 x 0.332230971796
   zoomlevel: 2
     tile size:            256 x 256
     tile matrix:          4 x 4
     pixel dimensions:     0.3515625 x 0.166115485898
   zoomlevel: 3
     tile size:            256 x 256
     tile matrix:          8 x 8
     pixel dimensions:     0.17578125 x 0.083057742949
   zoomlevel: 4
     tile size:            256 x 256
     tile matrix:          16 x 16
     pixel dimensions:     0.087890625 x 0.0415288714745
   zoomlevel: 5
     tile size:            256 x 256
     tile matrix:          32 x 32
     pixel dimensions:     0.0439453125 x 0.0207644357373
   zoomlevel: 6
     tile size:            256 x 256
     tile matrix:          64 x 64
     pixel dimensions:     0.02197265625 x 0.0103822178686
   zoomlevel: 7
     tile size:            256 x 256
     tile matrix:          128 x 128
     pixel dimensions:     0.010986328125 x 0.00519110893431
   zoomlevel: 8
     tile size:            256 x 256
     tile matrix:          256 x 256
     pixel dimensions:     0.0054931640625 x 0.00259555446716

Composite the zoom level three tiles into a single image

./gpkg2img geography-class.geopackage from_mbtiles_tiles 3 geog3.png

Which should product output like:

DEBUG:gpkgtools.util_sqlite:Exporting geopackage tileset to image
DEBUG:gpkgtools.util_sqlite:geography-class.geopackage, from_mbtiles_tiles --> geog3.png

You can open your new image, which should look a bit like this image

Spatialite – selecting “not in”

One of the free software projects I’m involved in is SpatiaLite, which is a set of spatial (as in geospatial) extensions to SQLite.

Most users of spatialite probably don’t even notice – its often embedded, just as sqlite is. However if you are a user who is accessing it through SQL, then occasionally that SQL bit will become a bit confusing.

One user recently asked about why the “Disjoint” query didn’t work as expected. Hopefully the explanation might help other people avoid the same confusion.

I’ll demonstrate this with the SpatiaLite command line tool:

bradh@incana:~$ spatialite :memory:
SpatiaLite version ..: 3.0.1    Supported Extensions:
        - 'VirtualShape'        [direct Shapefile access]
        - 'VirtualDbf'          [direct DBF access]
        - 'VirtualXL'           [direct XLS access]
        - 'VirtualText'         [direct CSV/TXT access]
        - 'VirtualNetwork'      [Dijkstra shortest path]
        - 'RTree'               [Spatial Index - R*Tree]
        - 'MbrCache'            [Spatial Index - MBR cache]
        - 'VirtualSpatialIndex' [R*Tree metahandler]
        - 'VirtualFDO'          [FDO-OGR interoperability]
        - 'SpatiaLite'          [Spatial SQL - OGC]
PROJ.4 version ......: Rel. 4.7.1, 23 September 2009
GEOS version ........: 3.3.2-CAPI-1.7.2
SQLite version ......: 3.7.4
Enter ".help" for instructions
spatialite> CREATE TABLE areas (AreaName text);
the SPATIAL_REF_SYS table already contains some row(s)
spatialite> CREATE TABLE points (PointName text);
spatialite> SELECT AddGeometryColumn('areas', 'Geometry', -1, "POLYGON", "XY");
1
spatialite> SELECT AddGeometryColumn('points', 'Geometry', -1, "POINT", "XY");
1

So now we have two tables (“areas” and “points”) with appropriate geometry columns. We’ll add some content:

spatialite> INSERT INTO 'areas' (AREANAME, Geometry) VALUES ("area1", GeomFromText("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))"));
spatialite> INSERT INTO 'areas' (AREANAME, Geometry) VALUES ("area2", GeomFromText("POLYGON((20 0, 30 0, 30 10, 20 10, 20 0))"));
spatialite> INSERT INTO 'areas' (AREANAME, Geometry) VALUES ("area3", GeomFromText("POLYGON((10 0, 30 0, 30 10, 10 10, 10 0))"));

creates the areas, and

spatialite> INSERT INTO 'points' (PointName, Geometry) VALUES ("pointA", GeomFromText("POINT(2 2)"));
spatialite> INSERT INTO 'points' (PointName, Geometry) VALUES ("pointB", GeomFromText("POINT(22 2)"));
spatialite> INSERT INTO 'points' (PointName, Geometry) VALUES ("pointC", GeomFromText("POINT(12 2)"));
spatialite> INSERT INTO 'points' (PointName, Geometry) VALUES ("pointD", GeomFromText("POINT(-10 -10)"));

creates some points.

Some of the points are within one or more areas, and one (“pointD”) is not. The problem is to find the points that are not within any areas.

Here is the original attempt:

spatialite> SELECT PointName, AsText(points.Geometry) FROM areas, points WHERE Disjoint(areas.Geometry, points.Geometry);
pointB|POINT(22 2)
pointC|POINT(12 2)
pointD|POINT(-10 -10)
pointA|POINT(2 2)
pointC|POINT(12 2)
pointD|POINT(-10 -10)
pointA|POINT(2 2)
pointD|POINT(-10 -10)

Well clearly we have PointD, but we also have other points, sometimes multiple times.

So what is the issue? Lets take away the WHERE clause, and look at what is happening:

spatialite> SELECT  PointName, AreaName FROM areas, points;
pointA|area1
pointB|area1
pointC|area1
pointD|area1
pointA|area2
pointB|area2
pointC|area2
pointD|area2
pointA|area3
pointB|area3
pointC|area3
pointD|area3

OK, its a simple SQL join. We are getting the product of each table.

So lets look at one area at a time:

spatialite> SELECT PointName, AsText(points.Geometry) FROM areas, points WHERE areas.AreaName is "area1" AND Disjoint(areas.Geometry, points.Geometry);
pointB|POINT(22 2)
pointC|POINT(12 2)
pointD|POINT(-10 -10)
spatialite> SELECT PointName, AsText(points.Geometry) FROM areas, points WHERE areas.AreaName is "area2" AND Disjoint(areas.Geometry, points.Geometry);
pointA|POINT(2 2)
pointC|POINT(12 2)
pointD|POINT(-10 -10)
spatialite> SELECT PointName, AsText(points.Geometry) FROM areas, points WHERE areas.AreaName is "area3" AND Disjoint(areas.Geometry, points.Geometry);
pointA|POINT(2 2)
pointD|POINT(-10 -10)

So when we do the WHERE Disjoint() part, we’re only considering whether this particular combination of point and area is disjoint (point is contained within area, for those still confused by what I meant earlier). So our original query was just the same as running the query on each area in turn.

So if that isn’t going to work, then how do we do “not in” for an arbitrary selection of areas and points?

There are probably a few ways to solve this. I came up with two.

Using Contains operation, and selecting anything missing.

Lets look at how to find any point that is contained in an area:

spatialite> SELECT points.rowid FROM areas, points WHERE Contains (areas.Geometry, points.Geometry);
1
2
2
3

So we can then just select any row in the points table that didn’t appear in the result of that “Contains”:

spatialite> SELECT PointName, AsText(points.Geometry) FROM points WHERE (points.rowid not in (SELECT points.rowid FROM areas, points WHERE Contains (areas.Geometry, points.Geometry)));
pointD|POINT(-10 -10)

Building a union area

Instead of looking at each area, we could just build a single area that is the union of each of the areas table entries, and then check if a point appears in that union area:

spatialite> SELECT PointName, AsText(points.Geometry) FROM (SELECT ST_Union(areas.Geometry) AS AllAreas FROM areas), points WHERE Disjoint(AllAreas, points.Geometry);
pointD|POINT(-10 -10)

Any more ideas?

Anyway, I hope that helps someone, sometime.