PostgreSQL – Geographic Synonyms

database-design postgresql

Let's say I have a table representing principal country divisions (e.g. states):

create table principal_country_divisions (
  id int primary key,
  name text not null,
  country_code char(2)
);

insert into principal_country_divisions values (1, 'New York', 'US');

I want users to be able to find New York easily via terms such as 'New York', 'NY', or 'New York State'.

So I have a synonyms table:

create table synonyms (
  syn text,
  name text,

  primary key (syn, name)
);

insert into synonyms values 
('NY', 'New York'),
('New York State', 'New York');

What is an efficient and easy way to query this and return ONE record for New York?

In particular, they should be able to find the result for the default name 'New York' OR any synonym:

select * from principal_country_divisions where name = 'NY';
result: {1, 'New York', 'US'}

I guess I would start with something like this:

select
id,
name,
country_code
from principal_country_divisions a
where name = 'NY'
or exists (select 1 from synonyms where name = a.name and syn = 'NY')

Can I do this only using a view, or should I use a function?

Best Answer

First of all, you have an integer primary key on principal_country_divisions (renamed to country_id below). Use it. Joining on it is more efficient than joining via name for multiple reasons (smaller storage and index size, faster integer comparison, no collations involved, fixed length).

create table principal_country_divisions (
  country_id    int primary key
  ,name         text not null
  ,country_code char(2)
);

create table synonyms (
   country_id int REFERENCES principal_country_divisions (country_id)
  ,syn        text
  ,primary key (syn, country_id)
);
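
To try the revised schema, the sample rows from the question could be loaded roughly as follows. This is a minimal sketch: the values come from the question, and the id 1 for New York is carried over from there.

```sql
-- Sample data from the question, adapted to the revised schema.
insert into principal_country_divisions (country_id, name, country_code)
values (1, 'New York', 'US');

-- Synonyms now point at the integer key instead of repeating the name.
insert into synonyms (country_id, syn)
values (1, 'NY')
     , (1, 'New York State');
```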

syn needs to be the first column of the index (pk); you had that right already. The accompanying index automatically covers equality tests on synonyms.syn.
Be sure to add an index on principal_country_divisions.name:

CREATE INDEX foo ON principal_country_divisions (name);

If you'd want to match patterns, not whole strings, the job would become more complex.

Next, how can you be sure to

return ONE record for New York?

Obviously, name and syn can be the same. There is no unique constraint over both columns, and there isn't even one on syn alone, so the same synonym can point to more than one division. Otherwise your EXISTS query is a good approach, usually fast. You'd just have to avoid multiple rows. The added benefit of EXISTS would be to eliminate duplicates from synonyms alone, but those are ruled out by the pk anyway. This may be faster for the case:

SELECT DISTINCT ON (1)
       a.country_id, a.name, a.country_code
FROM   principal_country_divisions a
LEFT   JOIN synonyms               s USING (country_id)
WHERE  a.name = 'NY'
OR     s.syn  = 'NY'
-- ORDER BY 1, <more expressions to pick from peers>

As you commented, a LEFT JOIN is in order to preserve finds in name.
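
For comparison, the EXISTS query from the question could be adapted to the revised schema roughly like this. It is a sketch that assumes the tables defined above. Since EXISTS never multiplies rows, each matching division is returned at most once without DISTINCT ON.

```sql
-- EXISTS variant against the integer key (assumes the revised tables above).
SELECT a.country_id, a.name, a.country_code
FROM   principal_country_divisions a
WHERE  a.name = 'NY'
OR     EXISTS (
   SELECT 1
   FROM   synonyms s
   WHERE  s.country_id = a.country_id
   AND    s.syn = 'NY'
   );
```

With the question's sample data loaded, this should find the New York row via its synonym 'NY'.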

In case of multiple finds, you can choose what to pick by adding more ORDER BY expressions. Leading columns have to agree with DISTINCT ON, though. Details in this related answer on SO.
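
As a sketch of how extra ORDER BY expressions interact with DISTINCT ON, the following variant also returns the matched synonym and keeps one deterministic row per division. The output column matched_syn and the chosen sort order are illustrative assumptions, not part of the original query.

```sql
-- One row per division; among several joined rows for the same division,
-- ORDER BY decides which one DISTINCT ON keeps.
SELECT DISTINCT ON (a.country_id)
       a.country_id, a.name, a.country_code
     , s.syn AS matched_syn              -- illustrative extra column
FROM   principal_country_divisions a
LEFT   JOIN synonyms               s USING (country_id)
WHERE  a.name = 'New York'
OR     s.syn  = 'New York'
ORDER  BY a.country_id                   -- must lead, to agree with DISTINCT ON
        , s.syn;                         -- tie-breaker: synonym that sorts first
```

Searching for the default name 'New York' matches once per joined synonym row, so this is the case where DISTINCT ON actually has duplicates to collapse.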

Related Solutions

SQL Server – Best Practices for Handling Language

If I understand your requirements correctly....

I would just use the natural key, LanguageCode-CultureCode ("en-US," for example). It's small enough. (I'm using the entire "en-US" as the primary key to differentiate it from "en-GB," for example.)

CREATE TABLE [dbo].[Language](
    [Language] [char](2) NOT NULL,
    [Culture] [char](2) NOT NULL,
    [LanguageCode]  AS (([Language]+'-')+[Culture]) PERSISTED NOT NULL,
 CONSTRAINT [PK_Language] PRIMARY KEY CLUSTERED 
(
    [Language] ASC,
    [Culture] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY],
 CONSTRAINT [LanguageCode] UNIQUE NONCLUSTERED 
(
    [LanguageCode] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

CREATE TABLE [dbo].[Language_Text](
    [LanguageID] [varchar](5) NOT NULL,
    [LanguageCode] [varchar](5) NOT NULL,
    [LanguageName] [nvarchar](20) NULL,
 CONSTRAINT [PK_Language_Text] PRIMARY KEY CLUSTERED 
(
    [LanguageID] ASC,
    [LanguageCode] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

ALTER TABLE [dbo].[Language_Text]  WITH CHECK ADD  CONSTRAINT [FK_Language_Text_Language] FOREIGN KEY([LanguageCode])
REFERENCES [dbo].[Language] ([LanguageCode])
GO

ALTER TABLE [dbo].[Language_Text] CHECK CONSTRAINT [FK_Language_Text_Language]
GO

This should allow you to get all articles in en or sv, and you can also query for en-US or sv-SE. Presumably, although it wasn't in your example, you could also query for en-CA, fr-CA, en, fr, or CA.

Edit--I'm sorry, you're right, no search by Culture in my old code. Revamped above, sorry. Here's a sample of content:

Language:

Language  Culture  LanguageCode
en        US       en-US
sv        SE       sv-SE

Language_Text:

LanguageID  LanguageCode  LanguageName
en-US       en-US         English
en-US       sv-SE         Engelska
sv-SE       en-US         Swedish
sv-SE       sv-SE         Svenska
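
The sample content above could be loaded with statements along these lines. This is a T-SQL sketch that assumes the two tables exactly as defined earlier; the computed column LanguageCode is left out of the column list because SQL Server fills it in.

```sql
-- Parent rows first; LanguageCode is computed from Language and Culture.
INSERT INTO dbo.Language (Language, Culture)
VALUES ('en', 'US'),
       ('sv', 'SE');

-- One name per (language being named, language the name is written in).
INSERT INTO dbo.Language_Text (LanguageID, LanguageCode, LanguageName)
VALUES ('en-US', 'en-US', N'English'),
       ('en-US', 'sv-SE', N'Engelska'),
       ('sv-SE', 'en-US', N'Swedish'),
       ('sv-SE', 'sv-SE', N'Svenska');
```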

Searching by Culture (Canada):

SELECT        test.dbo.Language_Text.LanguageID, test.dbo.Language_Text.LanguageCode, test.dbo.Language_Text.LanguageName
FROM            test.dbo.Language_Text INNER JOIN
                         test.dbo.Language ON test.dbo.Language_Text.LanguageID = test.dbo.Language.LanguageCode
WHERE        (test.dbo.Language.Culture = 'CA')

Searching by Language (French):

SELECT        test.dbo.Language_Text.LanguageID, test.dbo.Language_Text.LanguageCode, test.dbo.Language_Text.LanguageName
FROM            test.dbo.Language_Text INNER JOIN
                         test.dbo.Language ON test.dbo.Language_Text.LanguageID = test.dbo.Language.LanguageCode
WHERE        (test.dbo.Language.Language = 'fr')

Searching by LanguageCode (Swedish):

SELECT LanguageName
  FROM [test].[dbo].[Language_Text]
  where (LanguageID = 'sv-SE')

PostgreSQL – Mixed Search in Multiple Columns

Pattern matching and operators

Full text search is not the right tool for pattern matching (and possibly even fuzzy, fault-tolerant input). Typically, trigram-similarity search with the % operator is the superior approach here. You need to install the additional module pg_trgm once per database:
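
A minimal sketch of that setup follows. CREATE EXTENSION is the standard installation command; the table places and its columns are hypothetical, chosen only for illustration.

```sql
-- Install the module once per database (needs sufficient privileges).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical table, for illustration only.
CREATE TABLE places (
   place_id int PRIMARY KEY
 , name     text NOT NULL
);

-- Trigram index; supports the % operator as well as LIKE / ILIKE.
CREATE INDEX places_name_trgm_idx ON places USING gin (name gin_trgm_ops);

-- Similarity search, meant to tolerate minor typos in the search term.
SELECT place_id, name, similarity(name, 'New Yrok') AS sml
FROM   places
WHERE  name % 'New Yrok'
ORDER  BY sml DESC;
```

Whether a given misspelling matches depends on the similarity threshold configured for pg_trgm, so the cut-off may need tuning for real data.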

Possibly combine that with LIKE (also supported by trigram indexes), and text search like you had in mind. Related answer where text search and similarity are combined:

PostgreSQL FTS and Trigram-similarity Query Optimization

Overview of pattern-matching functionality in Postgres:

Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

Database design

json is a very bad choice for your intended functionality. It is very inefficient to have to plow through huge columns holding everything in a semi-structured big string (json).

Consider normalizing your schema, which would result in superior performance (when done properly).

If you are hellbent on json, consider the new jsonb in the upcoming (and soon to be released) Postgres 9.4. More (last paragraph):

How to remove known elements from a JSON[] array in PostgreSQL?

And be sure to use functional indexes like:
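
A minimal sketch of such an index, assuming a hypothetical table docs with a jsonb column data and a key called name. None of these names come from the original question, and jsonb requires Postgres 9.4 or later.

```sql
-- Hypothetical table with a jsonb document column, for illustration only.
CREATE TABLE docs (
   doc_id int PRIMARY KEY
 , data   jsonb NOT NULL
);

-- Expression ("functional") index on one key extracted as text.
CREATE INDEX docs_data_name_idx ON docs ((data->>'name'));

-- A query can use the index when it repeats the indexed expression.
SELECT doc_id
FROM   docs
WHERE  data->>'name' = 'New York';
```

The extra pair of parentheses around the expression is required by CREATE INDEX syntax.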

Further optimization is likely possible. But that would require more detailed information including table definitions, cardinalities, sample data and typical queries.