Serverless geospatial at scale: processing 635k Irish properties without a running server

Addris is property GIS intelligence for Ireland: you open a map and see what sold nearby, for how much, what has been granted planning permission, and how well an area is served by transport. Behind that map sit 635,000+ geocoded property sales, ~37,000 planning applications, ~155,000 commercial properties, and a pile of transport and services data. All of it is ingested and queried by Lambda functions. There is no application server idling between jobs.

The data, and why it's awkward

The raw sources are public but messy. The Property Price Register publishes a CSV of every residential sale — addresses as free text, no coordinates, no Eircode for older records. Planning data comes from the NPAD ArcGIS FeatureServers council by council. Transport is NTA GTFS zips. Commercial valuations come from Tailte Éireann in Irish Transverse Mercator, not WGS84. Nothing arrives map-ready. Every record needs cleaning, geocoding, and projecting into a common coordinate system before it's worth anything.
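
To make the projection step concrete, here is a minimal sketch of converting an ITM easting/northing to latitude/longitude in plain Python. It uses the published ITM (EPSG:2157) parameters and a standard inverse Transverse Mercator series, truncated at fifth order in the easting offset. It is not the Addris code: production code would more likely call a projection library such as pyproj, the function name is hypothetical, and treating ITM's ETRS89 datum as interchangeable with WGS84 at map-display precision is an assumption.

```python
import math

# Irish Transverse Mercator (EPSG:2157) parameters on the GRS80 ellipsoid.
A = 6378137.0                      # semi-major axis, metres
INV_F = 298.257222101              # inverse flattening
B = A * (1 - 1 / INV_F)            # semi-minor axis, metres
F0 = 0.99982                       # scale factor on the central meridian
LAT0 = math.radians(53.5)          # latitude of true origin
LON0 = math.radians(-8.0)          # central meridian
E0, N0 = 600000.0, 750000.0        # false easting / northing, metres


def _meridional_arc(lat: float) -> float:
    """Scaled arc length along the meridian from LAT0 to lat (radians in, metres out)."""
    n = (A - B) / (A + B)
    d, s = lat - LAT0, lat + LAT0
    return B * F0 * (
        (1 + n + 1.25 * n**2 + 1.25 * n**3) * d
        - (3 * n + 3 * n**2 + 2.625 * n**3) * math.sin(d) * math.cos(s)
        + (1.875 * n**2 + 1.875 * n**3) * math.sin(2 * d) * math.cos(2 * s)
        - (35 / 24) * n**3 * math.sin(3 * d) * math.cos(3 * s)
    )


def itm_to_wgs84(easting: float, northing: float) -> tuple[float, float]:
    """Return (lat, lon) in degrees for an ITM easting/northing in metres."""
    e2 = (A**2 - B**2) / A**2
    # Iterate to the footpoint latitude whose meridional arc matches the northing.
    lat = LAT0 + (northing - N0) / (A * F0)
    for _ in range(20):
        residual = northing - N0 - _meridional_arc(lat)
        if abs(residual) < 1e-5:
            break
        lat += residual / (A * F0)

    sin2 = math.sin(lat) ** 2
    nu = A * F0 / math.sqrt(1 - e2 * sin2)             # transverse radius of curvature
    rho = A * F0 * (1 - e2) / (1 - e2 * sin2) ** 1.5   # meridional radius of curvature
    eta2 = nu / rho - 1
    t, sec, de = math.tan(lat), 1 / math.cos(lat), easting - E0

    # Series coefficients for latitude (vii..ix) and longitude (x..xii).
    vii = t / (2 * rho * nu)
    viii = t / (24 * rho * nu**3) * (5 + 3 * t**2 + eta2 - 9 * t**2 * eta2)
    ix = t / (720 * rho * nu**5) * (61 + 90 * t**2 + 45 * t**4)
    x = sec / nu
    xi = sec / (6 * nu**3) * (nu / rho + 2 * t**2)
    xii = sec / (120 * nu**5) * (5 + 28 * t**2 + 24 * t**4)

    out_lat = lat - vii * de**2 + viii * de**4 - ix * de**6
    out_lon = LON0 + x * de - xi * de**3 + xii * de**5
    return math.degrees(out_lat), math.degrees(out_lon)
```

A quick sanity check is that the false origin (600000, 750000) maps back to 53.5°N, 8°W, and that eastings above 600000 land east of the central meridian.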

The ingestion pipeline

The pattern is the same for every source: a scheduled ingest Lambda fetches and fans out, a write Lambda consumes and upserts.

ingest-ppr         (EventBridge, Sunday 02:00 UTC)  →  SQS batches      →  write-records
ingest-planning    (weekly, NPAD ArcGIS)            →  PlanningQueue    →  write-planning
ingest-transport   (monthly, NTA GTFS zip)          →  TransportQueue   →  write-transport
ingest-valuations  (monthly, Tailte Éireann)        →  ValuationsQueue  →  write-valuations (ITM→WGS84)

SQS in the middle is the whole trick. The ingest function's only job is to download a large dataset and chop it into batches onto a queue. The write function scales out to drain that queue, and SQS gives me retries, a dead-letter queue, and natural back-pressure against the database for free. A weekly batch job that touches hundreds of thousands of rows is the textbook case for serverless: it runs for a few minutes once a week, and the rest of the time I pay nothing for compute that isn't running.
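
The fan-out half of that pattern can be sketched in a few lines of Python. This is a minimal illustration, not the Addris ingest code: `enqueue_records`, `chunked`, and the `rows_per_message` default are hypothetical, and `sqs` is assumed to be a boto3 SQS client (or anything exposing the same `send_message_batch` call). The real batch size would have to respect the SQS message size limit.

```python
import json
from typing import Any, Iterable, Iterator

SQS_MAX_ENTRIES = 10  # send_message_batch accepts at most 10 entries per call


def chunked(items: Iterable[Any], size: int) -> Iterator[list[Any]]:
    """Yield consecutive lists of at most `size` items."""
    batch: list[Any] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def enqueue_records(sqs, queue_url: str, records: Iterable[dict], rows_per_message: int = 100) -> int:
    """Pack records into messages and send them ten at a time; return messages sent."""
    sent = 0
    messages = chunked(records, rows_per_message)
    for group in chunked(messages, SQS_MAX_ENTRIES):
        entries = [
            {"Id": str(i), "MessageBody": json.dumps(rows)}
            for i, rows in enumerate(group)
        ]
        response = sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        if response.get("Failed"):
            # Fail loudly rather than silently dropping rows.
            raise RuntimeError(f"{len(response['Failed'])} SQS entries failed")
        sent += len(entries)
    return sent
```

Each message then carries one batch of rows for the write Lambda to upsert, so the ingest function never holds a database connection.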

Geocoding was the hard part

Turning "4 Main St, Skibbereen" into a coordinate is where the real engineering went. I built a three-tier resolution cascade in write-records, cheapest and most accurate first:

  1. eircodes table — a HERE-geocoded lookup of ~119,000 Eircodes. An exact key hit, no API call.
  2. geocode_cache — previously resolved addresses. Each address is geocoded once, ever.
  3. Photon — a self-hosted geocoder on a small EC2 box, called only for addresses with no Eircode and no cache hit.

Every resolution carries a confidence level (eircode, street, locality, or low) stored on the row. Records that can't be placed go in with NULL geometry and get backfilled later. The Eircode table is the irreplaceable asset here. Building it meant moving from a dead service, to OpenCage (38% coverage), to HERE (99%, because it licenses the official Capita database), and that journey was easily the most painful part of the whole project.
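
The cascade itself is simple once the data exists. Below is a minimal sketch, with plain dicts standing in for the eircodes and geocode_cache tables and a callable standing in for the Photon call. The `Resolution` type, the `normalise` key, and the Eircode normalisation are all hypothetical, and the sketch falls through to the later tiers when an Eircode is present but unknown, which may differ from the real write-records logic.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Resolution:
    lat: float
    lon: float
    confidence: str  # "eircode", "street", "locality" or "low"


def normalise(address: str) -> str:
    """Hypothetical cache key: lower-case with whitespace collapsed."""
    return " ".join(address.lower().split())


def resolve(
    address: str,
    eircode: Optional[str],
    eircodes: dict[str, tuple[float, float]],
    geocode_cache: dict[str, Resolution],
    photon: Callable[[str], Optional[Resolution]],
) -> Optional[Resolution]:
    # Tier 1: exact Eircode hit, no API call.
    if eircode:
        hit = eircodes.get(eircode.replace(" ", "").upper())
        if hit:
            return Resolution(hit[0], hit[1], "eircode")
    # Tier 2: an address already resolved on an earlier run.
    key = normalise(address)
    if key in geocode_cache:
        return geocode_cache[key]
    # Tier 3: the self-hosted geocoder, remembered so it is never asked twice.
    result = photon(address)
    if result is not None:
        geocode_cache[key] = result
    return result  # None means: store the row with NULL geometry, backfill later
```

Ordering the tiers this way means the expensive call is reached only when both lookups miss, and every successful Photon answer shrinks the set of addresses that will ever reach it again.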

PostGIS does the geometry, Redis does the speed

The store is Postgres with PostGIS on RDS. Every coordinate is geometry(Point, 4326) with a GIST spatial index on it:

-- column definition inside CREATE TABLE properties (...)
location  GEOMETRY(Point, 4326),
-- spatial index on that column
CREATE INDEX idx_properties_location ON properties USING GIST (location);

That index is why a map-viewport query is fast. Pan the map and the frontend sends a bounding box; the search-viewport Lambda runs ST_Within against that box, and search-radius runs ST_DWithin for "everything within 2km". The GIST index turns those from full-table scans into index lookups across 635k rows.
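
As a rough illustration of what those two Lambdas might send to Postgres, here is a sketch of parameterised query builders. The function names, the `id` column, and the `%s` placeholder style are assumptions, not the Addris code. One detail worth flagging: ST_DWithin on SRID 4326 geometry measures distance in degrees, so the radius sketch casts to geography to work in metres, and that cast would need its own index (or a different formulation) to stay index-backed.

```python
from typing import Any

Query = tuple[str, tuple[Any, ...]]


def viewport_query(west: float, south: float, east: float, north: float) -> Query:
    """Points inside a bounding box; ST_MakeEnvelope takes xmin, ymin, xmax, ymax, srid."""
    sql = (
        "SELECT id, ST_Y(location) AS lat, ST_X(location) AS lon "
        "FROM properties "
        "WHERE ST_Within(location, ST_MakeEnvelope(%s, %s, %s, %s, 4326))"
    )
    return sql, (west, south, east, north)


def radius_query(lat: float, lon: float, metres: float = 2000.0) -> Query:
    """Points within `metres` of a centre, measured via the geography type."""
    sql = (
        "SELECT id, ST_Y(location) AS lat, ST_X(location) AS lon "
        "FROM properties "
        "WHERE ST_DWithin(location::geography, "
        "ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography, %s)"
    )
    return sql, (lon, lat, metres)  # PostGIS points are (x, y) = (lon, lat)
```

Keeping the SQL and its parameters together makes the longitude-first argument order explicit, which is the easiest thing to get wrong when the frontend thinks in lat/lon.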

The thing nobody tells you about serverless and relational databases: Lambda will happily open ten thousand connections and melt your database. So nothing talks to RDS directly — everything goes through RDS Proxy with IAM auth, which pools connections so a fan-out of write Lambdas doesn't exhaust Postgres. On top of that, viewport results are cached in ElastiCache Redis, keyed on coordinates rounded to three decimal places, so two users looking at the same area hit cache, not the database.
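
The cache key is the part worth seeing in code. A minimal sketch, assuming the key is built from the four viewport corners and a hypothetical `viewport:` prefix:

```python
def viewport_cache_key(west: float, south: float, east: float, north: float, precision: int = 3) -> str:
    """Build a Redis key from a bounding box rounded to `precision` decimal places."""
    corners = (west, south, east, north)
    return "viewport:" + ":".join(f"{value:.{precision}f}" for value in corners)
```

Three decimal places of latitude is roughly a hundred metres, so viewports that differ only by a small pan collapse to the same key. The trade-off is that a cached result may cover a slightly different box than the one requested, which is acceptable for map pins but would not be for exact counts.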

The honest trade-offs

Serverless for batch ingestion is close to ideal: bursty, scheduled, idle most of the time, and SQS hands you reliability you'd otherwise build yourself. But it isn't free of friction. Lambda's execution limits bite on genuinely long jobs; I hit a real bug where a shared DB layer cached its IAM auth token for 13 minutes and any invocation running past 12 minutes died on a stale token. The two pieces that aren't serverless, RDS and the Photon EC2 box, are also the only two that cost money while idle and need minding. That's not an accident. Spatial queries over hundreds of thousands of geometries want a real database engine, and no amount of architectural fashion changes that. The win is putting serverless where it shines (ingestion) and a managed server where it's irreplaceable (PostGIS), rather than forcing one tool to do both.
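
For the stale-token class of bug, one defensive pattern is a cache that refreshes well before the token can expire and is consulted every time a new connection is opened. The sketch below is an illustration of that pattern, not necessarily how the Addris layer was fixed. RDS IAM auth tokens are valid for 15 minutes; the 10-minute `max_age_s`, the class name, and the injected `fetch_token` callable (which in practice would wrap boto3's `generate_db_auth_token`) are assumptions.

```python
import time
from typing import Callable, Optional


class AuthTokenCache:
    """Reuse a token until it is `max_age_s` old, then fetch a fresh one."""

    def __init__(
        self,
        fetch_token: Callable[[], str],
        max_age_s: float = 10 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_token = fetch_token
        self._max_age_s = max_age_s
        self._clock = clock
        self._token: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        now = self._clock()
        # Refresh on first use, or once the cached token has reached its maximum age.
        if self._token is None or now - self._fetched_at >= self._max_age_s:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token
```

Injecting the clock keeps the expiry logic testable without waiting ten minutes, and a monotonic clock avoids surprises if the wall clock is adjusted inside a long-lived execution environment.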