Mysql – Address Parts – Hierarchy vs Flat

database-designMySQL

I'm using Google to geocode addresses, which breaks down an address into geographic units like country, state, city, etc. Depending on the address being geocoded, some geographic units may or may not be returned (e.g. maybe city is not returned, but suburb will be).

I'm tempted to store the data of each broken down address in a single row of a table, like:

USA, California, San Francisco
USA, California, San Jose
USA, California, Los Angeles

Obviously, this doesn't seem normalized as data is being repeated. I still could potentially query this effectively, though (e.g. list all states for USA).

Another option would be to save the data in parent-child relationships, a tree. That would be more normalized, but potentially harder to maintain and query (I can use a library to make this a lot easier, though, so it's not much of a concern for me).

My biggest concern using a parent-child model is that the branches may not all be the same length (e.g. city is not returned, but instead suburb is after the state). If I were to query for all the nodes below the state level, I could get a mix of cities and suburbs — which might not be ideal. I could avoid this problem if I saved the data in a single row, with each geographic unit in their appropriate columns.

What's the proper way to model address parts?

Best Answer

Depending on what country you are in the rules for addresses (especially postal addresses) can get pretty dicey. For example, you're pretty safe to assume that a zip/postal code has one official city name, but both the U.S. and Canada allow for alternative city names for a postal code. I know this for a fact because I used to develop postal address validation software for North America. The non-official names are often recognized by the postal authorities and you typically have to permit their use.

Outside of U.S. / Canada the rules can get even more loose. There are not that many countries with postal authorities that standardize addresses in a fairly hierarchical structure (U.K., Germany, Australia) so in a lot of countries you can't even count on that.

The difficulty with normalizing address data is that, as you've already seen, it is somewhat hierarchical, but inconsistent across different geographies. Japan has prefectures, Canada has provinces and territories, the U.S. has states, districts, territories and commonwealths. Apart from naming conventions, in different places you may skip (or add) entire levels of geography.

You're thinking of the problem from the wrong angle.

You're instinct to be wary of redundant data is a good one. In transactional systems you should always try to start with data that's in third normal form (3NF) at least. This helps you to avoid insert, update and delete anomalies. In other words, the problem with redundant data is that it can become inconsistent when you start changing it.

In your case, however, all of your geographic data is coming from a Google API. If you change an address, I presume you're going to go back to the Google API and refresh the whole thing as a unit. You wouldn't go in and selectively change one element in an address because it had been renamed, for example.

Normalization doesn't especially help you because you aren't manipulating data elements within the address, you're always working with the address as a whole.

In your case, I would recommend leaving address elements classified in columns according to what the Google API says they are. Don't try to build trees because they will be hard to work with and the usual reasons for wanting to represent your data in a tree don't apply to your situation.

NOTES

To test you need MySQL 5.7+, and check SELECT @@GLOBAL.sql_mode or set mode to ONLY_FULL_GROUP_BY.

The query works fine,

SELECT organization, any_value(city) AS city, any_value(country) as country,
       sum(info1) as tot1, avg(info2) as avg2, count(*) as n 
FROM v1 GROUP BY organization;

Another interesting behaviour in MySQL is let user in a friendly mode, where any_value() is not necessary — so, is also possible to avoid "AS name" boring syntax. In this special (non-standard) mode, the following query have the same behaviour,

SELECT organization, city, country,
       sum(info1) as tot1, avg(info2) as avg2, count(*) as n 
FROM v1 GROUP BY organization;

Best Answer

Related Solutions

Mysql – Multi level hierarchy relationship

Mysql – Aggregate sampler function for programmer express predictions

NOTES

Related Question