Nearest point search in PostGIS

A frequently retrieving task is to find the nearest candidate to the target candidate based on speific feature.

1. Naive Solution:

The naive way to carry out the query task is to compute all distance pairs for all candidates and order them in ascending order and pick up the smallest one as the result.

-- Closest poitn to the target geometry with (-116.733:longitude, 54.828: latitude)
SELECT forecast_longitude as lon,
forecast_latitude as lat,
centroid_point as point, boundary_geo as area,
ST_Distance(
    ST_Transform('SRID=4326;POINT(-116.733 54.828)'::geometry, 2163),
    ST_Transform(centroid_point, 2163)
) AS distance
FROM services_areas
ORDER BY distance ASC
LIMIT 1;

This native approach takes line time to traverse the whole database. For a larget database and complex feature computation , it is not a reasonable approach.

Let us look at the time taken to execute the query.

For a large table of candidate features, it is not a reasonable approach.
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=3008.20..3008.20 rows=1 width=184) (actual time=53.044..53.044 rows=1 loops=1)
    -> Sort (cost=3008.20..3076.36 rows=27265 width=184) (actual time=53.043..53.043 rows=1 loops=1)
        Sort Key: (st_distance('010100002073080000FF1FDF9BB84030C19FAD11C8F06A3241'::geometry, st_transform(centroid_point, 2163)))
        Sort Method: top-N heapsort Memory: 25kB
    -> Seq Scan on services_areas (cost=0.00..2871.88 rows=27265 width=184) (actual time=0.306..43.524 rows=27265 loops=1)
Planning Time: 0.476 ms
Execution Time: 53.069 ms
(7 rows)

2.Index-based KNN

KNN algorithm is to find the k nearest point to the target, where k is an arbitary positive integer.

This index-based kNN system works by evaluating distance bweteen bounding box in PostGIS R-index tree.
Since the index is built with the bounding box of any geometry, so the distances between any geometries are actually somehow inexact.

There are two types of index-based operators measuring the distance of two bounding boxes.

  • <-> means “distance between box centers”

  • <#> means “distance between box edges”

The bounding box of one geometry points is just the point itself, so there is no approximation there. But if the bounding boxes of polygons aren’t the same as points.

The right way to get the nearest point with high-performance yet accurate approach is to pull up top-k
number (k could be a really small number if you are confident that the data is in homogeneous distribution)

-- "Closest" 100 points to target point,
EXPLAIN ANALYSE
WITH closest_candidates AS (
    SELECT
    forecast_longitude as lon,
    forecast_latitude as lat,
    centroid_point as point,
    boundary_geo as area
    FROM
    services_areas
    ORDER BY
    centroid_point <->
    ST_Transform('SRID=4326;POINT(-116.733 54.828)'::geometry, 3401)
    LIMIT 100
    )
SELECT lon,lat,point,area,
    ST_Distance(
    ST_Transform(
        'SRID=4326;POINT(-116.733 54.828)'::geometry,
        point
    ) AS distance
FROM closest_candidates
ORDER BY distance ASC
LIMIT 1;

The execution time is as follows:

 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=15.38..15.39 rows=1 width=136) (actual time=3.150..3.150 rows=1 loops=1)
   CTE closest_candidates
     ->  Limit  (cost=0.28..6.38 rows=100 width=184) (actual time=0.446..1.730 rows=100 loops=1)
           ->  Index Scan using services_areas_centroid_p_idx on services_areas  (cost=0.28..1665.38 rows=27265 width=184) (actual time=0.445..1.683 rows=100 loops=1)
                 Order By: (centroid_point <-> '0101000020490D00004D804FF20E2BFBC013EF8614392C5741'::geometry)
   ->  Sort  (cost=9.00..9.25 rows=100 width=136) (actual time=3.149..3.149 rows=1 loops=1)
         Sort Key: (st_distance('010100002073080000FF1FDF9BB84030C19FAD11C8F06A3241'::geometry, st_transform(closest_candidates.point, 2163)))
         Sort Method: top-N heapsort  Memory: 25kB
         ->  CTE Scan on closest_candidates  (cost=0.00..8.50 rows=100 width=136) (actual time=0.926..2.997 rows=100 loops=1)
 Planning Time: 1.225 ms
 Execution Time: 3.221 ms
(11 rows)

This will saves a lot of time because traversing the R-tree and heap sort for top-K is at most logarithmic time.It’s much faster compared wtih linear time.

For the details of how R-Tree works, I recommend the pdf, which shows how R-tree data structrue in detail, and how insert/search algorithm performed in R-tree.


Author: Liang Tan
Reprint policy: All articles in this blog are used except for special statements CC BY 4.0 reprint polocy. If reproduced, please indicate source Liang Tan !
  TOC