Spatialite – selecting “not in”

One of the free software projects I’m involved in is SpatiaLite, which is a set of spatial (as in geospatial) extensions to SQLite.

Most users of spatialite probably don’t even notice – its often embedded, just as sqlite is.  However if you are a user who is accessing it through SQL, then occasionally that SQL bit will become a bit confusing.

One user recently asked about why the “Disjoint” query didn’t work as expected. Hopefully the explanation might help other people avoid the same confusion.

I’ll demonstrate this with the SpatiaLite command line tool:

bradh@incana:~$ spatialite :memory:
SpatiaLite version ..: 3.0.1    Supported Extensions:
        - 'VirtualShape'        [direct Shapefile access]
        - 'VirtualDbf'          [direct DBF access]
        - 'VirtualXL'           [direct XLS access]
        - 'VirtualText'         [direct CSV/TXT access]
        - 'VirtualNetwork'      [Dijkstra shortest path]
        - 'RTree'               [Spatial Index - R*Tree]
        - 'MbrCache'            [Spatial Index - MBR cache]
        - 'VirtualSpatialIndex' [R*Tree metahandler]
        - 'VirtualFDO'          [FDO-OGR interoperability]
        - 'SpatiaLite'          [Spatial SQL - OGC]
PROJ.4 version ......: Rel. 4.7.1, 23 September 2009
GEOS version ........: 3.3.2-CAPI-1.7.2
SQLite version ......: 3.7.4
Enter ".help" for instructions
spatialite> CREATE TABLE areas (AreaName text);
the SPATIAL_REF_SYS table already contains some row(s)
spatialite> CREATE TABLE points (PointName text);
spatialite> SELECT AddGeometryColumn('areas', 'Geometry', -1, "POLYGON", "XY");
1
spatialite> SELECT AddGeometryColumn('points', 'Geometry', -1, "POINT", "XY");
1

So now we have two tables (“areas” and “points”) with appropriate geometry columns. We’ll add some content:

spatialite> INSERT INTO 'areas' (AREANAME, Geometry) VALUES ("area1", GeomFromText("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))"));
spatialite> INSERT INTO 'areas' (AREANAME, Geometry) VALUES ("area2", GeomFromText("POLYGON((20 0, 30 0, 30 10, 20 10, 20 0))"));
spatialite> INSERT INTO 'areas' (AREANAME, Geometry) VALUES ("area3", GeomFromText("POLYGON((10 0, 30 0, 30 10, 10 10, 10 0))"));

creates the areas, and

spatialite> INSERT INTO 'points' (PointName, Geometry) VALUES ("pointA", GeomFromText("POINT(2 2)"));
spatialite> INSERT INTO 'points' (PointName, Geometry) VALUES ("pointB", GeomFromText("POINT(22 2)"));
spatialite> INSERT INTO 'points' (PointName, Geometry) VALUES ("pointC", GeomFromText("POINT(12 2)"));
spatialite> INSERT INTO 'points' (PointName, Geometry) VALUES ("pointD", GeomFromText("POINT(-10 -10)"));

creates some points.

Some of the points are within one or more areas, and one (“pointD”) is not. The problem is to find the points that are not within any areas.

Here is the original attempt:

spatialite> SELECT PointName, AsText(points.Geometry) FROM areas, points WHERE Disjoint(areas.Geometry, points.Geometry);
pointB|POINT(22 2)
pointC|POINT(12 2)
pointD|POINT(-10 -10)
pointA|POINT(2 2)
pointC|POINT(12 2)
pointD|POINT(-10 -10)
pointA|POINT(2 2)
pointD|POINT(-10 -10)

Well clearly we have PointD, but we also have other points, sometimes multiple times.

So what is the issue? Lets take away the WHERE clause, and look at what is happening:

spatialite> SELECT  PointName, AreaName FROM areas, points;
pointA|area1
pointB|area1
pointC|area1
pointD|area1
pointA|area2
pointB|area2
pointC|area2
pointD|area2
pointA|area3
pointB|area3
pointC|area3
pointD|area3

OK, its a simple SQL join. We are getting the product of each table.

So lets look at one area at a time:

spatialite> SELECT PointName, AsText(points.Geometry) FROM areas, points WHERE areas.AreaName is "area1" AND Disjoint(areas.Geometry, points.Geometry);
pointB|POINT(22 2)
pointC|POINT(12 2)
pointD|POINT(-10 -10)
spatialite> SELECT PointName, AsText(points.Geometry) FROM areas, points WHERE areas.AreaName is "area2" AND Disjoint(areas.Geometry, points.Geometry);
pointA|POINT(2 2)
pointC|POINT(12 2)
pointD|POINT(-10 -10)
spatialite> SELECT PointName, AsText(points.Geometry) FROM areas, points WHERE areas.AreaName is "area3" AND Disjoint(areas.Geometry, points.Geometry);
pointA|POINT(2 2)
pointD|POINT(-10 -10)

So when we do the WHERE Disjoint() part, we’re only considering whether this particular combination of point and area is disjoint (point is contained within area, for those still confused by what I meant earlier). So our original query was just the same as running the query on each area in turn.

So if that isn’t going to work, then how do we do “not in” for an arbitrary selection of areas and points?

There are probably a few ways to solve this. I came up with two.

Using Contains operation, and selecting anything missing.

Lets look at how to find any point that is contained in an area:

spatialite> SELECT points.rowid FROM areas, points WHERE Contains (areas.Geometry, points.Geometry);
1
2
2
3

So we can then just select any row in the points table that didn’t appear in the result of that “Contains”:

spatialite> SELECT PointName, AsText(points.Geometry) FROM points WHERE (points.rowid not in (SELECT points.rowid FROM areas, points WHERE Contains (areas.Geometry, points.Geometry)));
pointD|POINT(-10 -10)

Building a union area

Instead of looking at each area, we could just build a single area that is the union of each of the areas table entries, and then check if a point appears in that union area:

spatialite> SELECT PointName, AsText(points.Geometry) FROM (SELECT ST_Union(areas.Geometry) AS AllAreas FROM areas), points WHERE Disjoint(AllAreas, points.Geometry);
pointD|POINT(-10 -10)

Any more ideas?

Anyway, I hope that helps someone, sometime.