PostgreSQL – Should I create a separate table for geo-location points

database-design postgresql routing spatial

I am developing the backend for a service that resembles Uber in some respects (a simplification, just to give you the general context). Users will book rides, and in doing so they will define the pickup and drop-off locations. These locations will be addresses in a city (with the possibility of multiple cities in the future). The locations I will receive from the frontend/app will be in the form of latitude, longitude, and a Google-Maps-verified address.

I am wondering whether I should store the locations in a separate table, or whether I should embed them in whatever table needs them. For example, the bookings table will have a pickup location and a drop-off location. Should these be just references to a locations table, or should I add more columns to store the location right there? I will need at least 3 columns per location (lat, long, address string), with the possibility of more columns if we decide to break the street address into more components.
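
To make the two options concrete, here is a rough sketch of what I have in mind (table and column names are just placeholders):

    -- Option A: embed the location columns directly in the bookings table.
    CREATE TABLE bookings_embedded (
        id           bigserial PRIMARY KEY,
        pickup_lat   double precision,
        pickup_lng   double precision,
        pickup_addr  text,
        dropoff_lat  double precision,
        dropoff_lng  double precision,
        dropoff_addr text
    );

    -- Option B: keep locations in their own table and reference them.
    CREATE TABLE locations (
        id      bigserial PRIMARY KEY,
        lat     double precision,
        lng     double precision,
        address text
    );

    CREATE TABLE bookings_referenced (
        id         bigserial PRIMARY KEY,
        pickup_id  bigint REFERENCES locations (id),
        dropoff_id bigint REFERENCES locations (id)
    );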

The way I see it, the two approaches trade off space and time performance:

  • A separate table will save some space, as we avoid copying the same information multiple times. But I am not sure how much space will be saved, as we do not have enough real data yet. My ballpark estimate is that a medium-sized city has tens of thousands of unique addresses. So with 100 ride bookings there is little chance of overlap; with 100,000 ride bookings there is bound to be considerable overlap. However, space is not really a limiting factor in our design, even with millions of bookings. It does not really matter if we spend 100 MB more in our database.
  • A separate table will make processing slower. First of all, each time we get a booking request we will have to check whether the locations in the request already exist in our database. If not, we add them; otherwise we use the existing reference. With potentially tens of thousands of addresses in our database this will take some time. Is this extra time significant compared to our other time-consuming tasks? An example of another time-consuming task is a call to the Google Maps API to find routes from one location to another and get a travel time estimate. I believe this operation will take longer than searching in our database, so it might make little sense to optimise database access when we have other bottlenecks. How about other operations? Say we want to find bookings within a bounding [lat, long] box. This can become more time-consuming with a separate table, as we will need a join against the locations table, or a cumbersome two-step search where we first find the location IDs within the bounding box and then search the bookings table with these IDs (see the sketch after this list). If we have hundreds or thousands of locations in a bounding box this might not be a good way to do it. But again, I am not sure about the utility of trying to optimise this operation. I am not sure how often I will need to search by location. It seems not that often, so I will not gain/lose much one way or the other.
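
For example, with the hypothetical schemas sketched above, the bounding-box search could look something like this (coordinate values are arbitrary):

    -- Separate-table design: needs a join against locations.
    SELECT b.*
    FROM bookings_referenced AS b
    JOIN locations AS l ON l.id = b.pickup_id
    WHERE l.lat BETWEEN 42.20 AND 42.40
      AND l.lng BETWEEN -71.20 AND -71.00;

    -- Embedded design: a plain filter on the bookings table itself.
    SELECT *
    FROM bookings_embedded
    WHERE pickup_lat BETWEEN 42.20 AND 42.40
      AND pickup_lng BETWEEN -71.20 AND -71.00;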

It seems to me that there are no strong arguments based on the space-time tradeoff. Are there other aspects I should consider? For example: clarity of design, ease of development, flexibility and expandability. The separate-table approach seems to offer more structure and the opportunity for more flexibility if the model needs to change. For example, it might be beneficial to break the street address into multiple components (street, city, state). We are less likely to do this if we use location columns in other tables (we need at least 3 columns already, and breaking the address into 3 components brings this number to 5). About flexibility: in the future we might have multi-point rides, and having a locations table will help with that.
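
For instance, multi-point rides could be handled with an ordered stop list that references the locations table (again, names are just placeholders):

    -- Each booking gets an ordered list of stops, each pointing at a location.
    CREATE TABLE booking_stops (
        booking_id  bigint REFERENCES bookings_referenced (id),
        stop_order  int NOT NULL,
        location_id bigint REFERENCES locations (id),
        PRIMARY KEY (booking_id, stop_order)
    );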

I am leaning towards a separate table, but I do not see strong arguments for or against it. Are there other arguments that I am missing?

Not sure if the following details are relevant, but I will be using PostgreSQL as the RDBMS and SQLAlchemy as the ORM to query the database.

Best Answer

resembles Uber in some aspects [...] pickup and drop-off locations

So you need to use PostGIS and pgRouting.
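
Both ship as PostgreSQL extensions; a minimal setup, assuming they are already installed on the server, is just:

    CREATE EXTENSION IF NOT EXISTS postgis;
    CREATE EXTENSION IF NOT EXISTS pgrouting;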

I will need at least 3 columns per location (lat, long, address string)

No, because the long/lat pair fits into a single POINT, from which you'll have to generate your topology.
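
As a sketch (the coordinates are arbitrary), a longitude/latitude pair collapses into one value; note that ST_MakePoint takes longitude first:

    -- One point value instead of two numeric columns.
    SELECT ST_SetSRID(ST_MakePoint(-71.064, 42.283), 4326)::geography;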

I am wondering whether I should store the locations as a separate table,

You should store them with the information about the node, of course. GIS is just data, and all PostgreSQL tables are unordered heaps; you don't break apart tables into 1:1 relationships without reason. Store your locations in a nullable geog column of type geography(Point, 4326).
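
A minimal sketch of what that could look like (table and column names are illustrative); a GiST index keeps proximity and bounding-box queries fast:

    CREATE TABLE bookings (
        id              bigserial PRIMARY KEY,
        pickup_address  text,
        pickup_geog     geography(Point, 4326),   -- nullable by default
        dropoff_address text,
        dropoff_geog    geography(Point, 4326)
    );

    CREATE INDEX bookings_pickup_geog_idx  ON bookings USING gist (pickup_geog);
    CREATE INDEX bookings_dropoff_geog_idx ON bookings USING gist (dropoff_geog);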

if we decide to break the street address into more components.

You don't do that in professional GIS projects. You let an extension like address_standardizer do it for you. So you store the user input as given, and then you may cache the standardized result in a column of the stdaddr record type.
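
For example, with the address_standardizer and address_standardizer_data_us extensions you can do something like this (the sample address is arbitrary):

    CREATE EXTENSION IF NOT EXISTS address_standardizer;
    CREATE EXTENSION IF NOT EXISTS address_standardizer_data_us;

    -- Returns a stdaddr composite (house_num, name, suftype, city, state, postcode, ...).
    SELECT *
    FROM standardize_address('us_lex', 'us_gaz', 'us_rules',
                             '1 Devonshire Place, Boston, MA 02109');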

But I am not sure how much space will be saved as we do not have enough real data yet.

I think your first concern should be how best to perform this task. I don't take it you're familiar enough yet to begin optimizing. Before you start this project, I'd suggest you get up to snuff with GIS:

Learning the contents (~600 pages) will likely save you a lot of time in the future.