I am creating a database that references images. Keywords and captions will be part of the metadata. These captions and keywords are not necessarily added to the photos with the intent of facilitating quality searches.
I would like to effectively convert the information in the captions into keywords. Obviously some of the words in the captions will not be useful (e.g. "the","he","she","of",etc.) and I would exclude them, but other words will be very useful, especially if they appear in combination with other words.
I'm trying to figure out the best way to associate these keywords and words from captions with the images. Would it be best to create a table of images (filenames, path to file, size, etc.) and a table of words, where each word has an ID, and then an associative table that references the words to the images? Or perhaps two associative tables, one to link caption words and the other to link keywords?
Is this a good approach?
Best Answer
The best solution depends on your RDBMS, but in general you shouldn't need to manually sanitize the captions because the database will have a "stopword list" or "stoplist" which is used to exclude words from a full-text index (not the same as a regular clustered/non-clustered index). You can modify that list to specify which words are excluded from the full-text index. In most databases, the list would already include prepositions, pronouns and other noise words, meaning a search for birds AND flight would return a caption of "birds in flight" (assuming it exists). Additionally, most systems can stem the indexed terms and give you the ability to search on inflections of a term, e.g. searching for inflections of "fly" will return flying, flew, flown, etc.
If the captions are the keywords, you don't need to parse keywords from them; just add a full-text index to the caption column. If you have or want to add more keywords in addition to the captions, you can index them separately and use them to augment or modify the caption search using the RDBMS' built-in numeric ranking/scoring of results - a hit on a keyword could increase the rank while a miss could do nothing or even decrease it.
If the captions have a 1:1 relationship to images (which they usually do), you don't even need a separate table for those; just store the caption in another column of the images table. Additional keywords could be normalized and stored in a separate table, which would require an intermediate join table with foreign keys from images and keywords. However, this adds complexity and you may find it better to just store a space-delimited list of words in another column in the images table.
The solution you select will depend on several factors including business needs, performance requirements and use cases, i.e. what types of searches you need to support: exact phrase, Boolean AND/OR, fuzzy matches like SOUNDEX, etc. Sometimes full-text indexing isn't even necessary. For instance, if stemming and noise word exclusion aren't important, a search using LIKE with wildcards may suffice.