Python Static Dictionaries in Nearest Neighbor Queries
A standard query on geospatial data is the nearest neighbor query, i.e. Select the five closest police stations from a given point. The brute force approach to this problem is joining the two tables spatially and sorting by distance limiting the result to the number requested. Of course, for very large tables, this is extremely costly. That's where spatial indexes come in. PostGIS implements GiST indexes which are a general form of index that is capable of handling any kind of data with user defined keys. In the case of a nearest neighbor query, the index is used to narrow down the number of items to perform a distance calculation against. This vastly improves the performance of a nearest neighbor query. A very effective algorithm for nearest neighbor queries can be found here. This algorithm effectively grows the search area in a smart fashion until the right number of features are captured. This reduces the overall number of distance calculations required. The user still needs to provide an initial box in which to search, the smaller the better since it will grow.
There are situations in which one would like to know all the nearest neighbors between one feature and another feature. This could be useful in diverse areas such as real estate planning (what is the best place to develop given proximity requirements to schools and grocery stores?) to disaster management (what is the most accessible and safest point between these hardest hit areas?). The nearest neighbor query now has to proceed sequentially along all geometries of one feature computing nearest neighbors to all geometries along another feature. This becomes computationally intractable as the size of the feature tables grow.
An easy way to improve this type of query is to exploit the spatial proximity of adjacent features in a table. Using a smart sequential scan, the algorithm "remembers" the nearest neighbor distance from the last feature and uses that as the initial box size for the next feature. The Python language extension provides a static dictionary that a stored procedure has access to to facilitate this kind of operation. This may be quite obvious to everyone and of course it is plainly listed in the Postgres documentation, but somehow I still managed to overlook it for the longest time while trying to solve this problem. The nearest neighbor distance can be stored and retrieved as in the following code block:
create or replace function nn(geom geometry,
featuretable text,
geomcol text,
initdist double precision, n int)
returns double precision
AS $$
curdist = initdist
key = featuretable
if SD.has_key(key):
curdist = int(SD[key])
else:
SD[key] = initdist
// perform nearest neighbor query using curdist
$$ LANGUAGE plpythonu;
Keep in mind that this is not very threadsafe. Two simultaneous nearest neighbor queries on the same table will interfere with each others' stored distance. I'd imagine that one could use some data about the query plan to store the distance uniquely, but that is beyond my Postgres skills for the moment.