Postgresql ignoring a substring in the middle of a string

postgresqlpostgresql-9.4

I may have phrased the title of this question badly. I have a documents table that has a filename, path, and file_content_type columns. I want to search against the filename column to determine if there are records with a similar name.

The problem is that a revision suffix (R00, R01, …) is always appended to the filename column. What our business rules consider to be the same file will have a different names because the version suffix will be different.

We must search for the entire filename, including the extension, but ignore the revision suffix. And I'm not sure how to go about this.

I thought about negating the version suffix, but its dynamic. Using fuzzy search or trigram doesn't seem like a sound choice either, nor using ILIKE or SIMILAR TO.

Here are some examples of what we consider to be the same file, and a different file.

Same file; only difference is the version suffix: R00 vs R01:

LLE-MET-AP-0000-PLA-COB-R00.pdf
LLE-MET-AP-0000-PLA-COB-R01.pdf

Different files; paper number is different 0000 vs 0001

MET-AP-0000-PLA-COB-R00.plt
MET-AP-0001-PLA-COB-R00.plt

Different files; extensions are different:

LLE-0000-PLA-COB-R00.dwg
LLE-0000-PLA-COB-R00.plt

There's one more complication. There may be cases where the revision suffix can take different forms:

  1. With a leading zero: R01, R02 …
  2. Without a leading zero: R1, R2 …
  3. Different separators, and lettercase: XXX-R01, XXX_R01, XXX_r1, etc…

A possible solution is to use a regular expression:

SELECT *
FROM documents
WHERE (filename ~* 'LLE-0000-PLA-COB([.-_ ]{0,}[0-9]{1,2})')

But this has a major disadvantage. I must preprocess the search string to remove any version suffix and then use it in the search. If a record has the filename value of LLE-0000-PLA-COB-R01 I have to remove the -R01 before doing a search, and this is not ideal.

I would simple like to say:

Find me any files that have this name LLE-0000-PLA-COB-R01.jpg, but ignore the R01 in your search.

What other choices do I have to approach this problem?

Best Answer

One solution is to use regexps to remove that part of the filename. Assuming the separators are known

regexp_replace(filename, '^(.*?)[-_][rR][0-9]+(\.[^.]+)$', E'\\1\\2')

This would assume separators - or _, then r or R followed by at least one number, a dot and then something but never more dots. The part with r/R+numers would be removed.

The efficiency of this depends on the amount of data and how the searches are done. It could be used in an index to speed up searching. For example

CREATE INDEX filename_name_idx ON documents (regexp_replace(filename, '^(.*?)[-_][rR][0-9]+(\.[^,]+)', E'\\1\\2'));

Then doing a search

SELECT filename FROM documents
  WHERE regexp_replace(filename,
    '^(.*?)[-_][rR][0-9]+(\.[^,]+)', E'\\1\\2') = 'LLE-MET-AP-0000-PLA-COB.pdf';

would use index (if the query optimizer deems it faster).

Do note that the regexp part has to be exactly the same in the index and the search, otherwise the index cannot be used. You can also combine it with LOWER() if case insensitive searching is required.

The same regexp can be used to remove the revision identifier from the search string also, if there is a need to search for matching files for, e.g., LLE-MET-AP-0000-PLA-COB-R00.pdf

Making a view with this would make it a lot nicer to use and the actual regexp would be in the database and not in the application layer.