MySQL – Best solution for search scenario

MySQL

I have the database structure:

|     Tag     |         | Article_Tag |         |    Article    |
---------------         ---------------         -----------------
id                      tag_id                  id
name                    article_id              title
                        type (numeric)

In a search field I'm entering various tag names, and the search must return the articles which contain those tags (or at least some of them).

The type field is a numeric value representing the weight of the tag in its article.

The query that I'm using now for solving this problem is:

SELECT article.id, article.title, SUM(article_tag.type)
FROM tag
JOIN article_tag ON tag.id = article_tag.tag_id
JOIN article ON article.id = article_tag.article_id
WHERE [tag.name LIKE ? OR] x n
GROUP BY article.id
ORDER BY SUM(article_tag.type) DESC
LIMIT ?

The [tag.name LIKE ? OR] x n means that tag.name LIKE ? OR is repeated for each of the n tags entered in the search field. (For example, the string "sailing adventure ocean" will be split and each of the three words compared against tag.name.)

The ordering by sum is required because we need some kind of "most relevant articles" feature.

The tag.name field is indexed and the DB Vendor is MySQL.
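To make the [tag.name LIKE ? OR] x n construction concrete, here is a minimal, hypothetical sketch that builds and runs the query dynamically. It uses an in-memory SQLite database in place of MySQL purely so the example is self-contained; the table and column names follow the schema above, while the sample data and the `search` function are made up for illustration.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL schema in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
    CREATE INDEX idx_tag_name ON tag(name);
    CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE article_tag (tag_id INTEGER, article_id INTEGER, type INTEGER);
    INSERT INTO tag VALUES (1, 'sailing'), (2, 'adventure'), (3, 'ocean');
    INSERT INTO article VALUES (1, 'Across the Atlantic'), (2, 'Lake fishing');
    INSERT INTO article_tag VALUES (1, 1, 5), (2, 1, 3), (3, 1, 2), (2, 2, 1);
""")

def search(terms, limit=10):
    # One "tag.name LIKE ?" per search term, OR-ed together.
    likes = " OR ".join(["tag.name LIKE ?"] * len(terms))
    sql = f"""
        SELECT article.id, article.title, SUM(article_tag.type) AS weight
        FROM tag
        JOIN article_tag ON tag.id = article_tag.tag_id
        JOIN article ON article.id = article_tag.article_id
        WHERE {likes}
        GROUP BY article.id
        ORDER BY weight DESC
        LIMIT ?
    """
    params = [f"%{t}%" for t in terms] + [limit]
    return conn.execute(sql, params).fetchall()

print(search("sailing adventure ocean".split()))
# → [(1, 'Across the Atlantic', 10), (2, 'Lake fishing', 1)]
```

Only the placeholders vary with the number of terms; the query shape stays fixed, so it remains safe against SQL injection.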

The Problem

My concern is the scalability of this structure. I don't think that a large number of rows in the Tag table, combined with many OR clauses in that query, will give good response times. I ran some tests and the results were acceptable, but I'm wondering whether there is a better solution to my problem. Maybe NoSQL?

Thank you!

Best Answer

It depends on the number of articles you expect to have. If you have indices on the columns you are filtering on, it will be pretty fast. The joins themselves are fast and will surely not be the problem.

Your WHERE clause, however, is expensive: a LIKE pattern with a leading wildcard (such as '%sailing%') cannot use the index on tag.name, so you are basically scanning all rows in the table, loading them from disk to compare. This cannot be avoided in your case, but it will be the crucial part. Nevertheless, if you aren't going past 100,000 articles I wouldn't be too concerned. Still, test it if you want to be sure!

However, your use case is really about indexing documents (articles) and retrieving the top x of them for a search query. A faster and more efficient approach is to build, for each document, a vector that records how often each word occurs in it. The search query is transformed into such a vector as well, and ranking then comes down to comparing vectors. (Simplified.)

You don't have to implement that yourself, of course; there are libraries for document indexing and searching. Have a look at Lucene (or even Solr, which should do everything for you). Just to give you some ideas: with those tools you can build your own little 'Google'. Lucene would, of course, build its own (NoSQL) database/index, so you may need to separate this functionality from the rest of your application (I don't know your context).