Sql-server – Combine relational DB and Elastic search

database-designelasticsearchsql server

We have a large quanitity of text files we want to free-text/full text search, combined with relational structured metadata about the text file.
So, a search could be "Give me all files that belong to group X(or sub groups of X), have author (Ari and Bari and Mari), belongs to organization Y, and contains the text "synthetic". The latter part being a full-text search, and the other being already stored as relational data in our existing db.

In our database(which is rather complex), there are stored a way to ID the files, and a ton of various metadata about the file, spread among tens of tables, ranging from simple 1-1 relationships, to 1-many sets pr file, and even tree-structure relationship(things like "this file is type X, type X is a subgroup of type Y, etc).
And this metadata may change over time, all over the application(which is huge).

Now, I as a database admin, thought this could be solved by using SQL Server to do the search for structured metadata already in the DB, constraining the search to candidate-files, and then passing the candidate file id's to elastic search for full-text searching. (Re-indexing the file on elastic when a file is added or commited is trivial in our code)

However, the elastic-guys in our project naturally had a different idea:
To extract all the meta-data as well as the full-text content from the files, to elastic-search, and run the search exclusively in elastic.

This allows them to run full powered lucene queries easily, and load is taken off the database, which is nice. However, this also to me, introduces a nightmare to keep the structured metadata in sync, and blindly re-indexing/syncing everything periodically is not possible due to the scale of data.

I can see merits/concerns to both options. Is there a best practice for this kind of thing?

Best Answer

Use both.

There is a line that needs to be drawn by you and your team here. SQLSERVER is more expensive than ElasticSearch so when I faced a similar issue it made more sense to spend CPU resources on elasticsearch than on sqlserver.

There's a few things that can make you decide to index your text data in elasticsearch

What kind of load are you looking at ?

A few searches per minute or dozens per seconds ? This is subjective but again if a large amount of your database resources are spent on this one query you might want to offload that.

Keep your data structured

I find elasticsearch query language to be far less intuitive than SQL. I strongly suggest to have an as-normalized-as-possible version of your data in a standard relational database. Then base your elastic index on that.

Elasticsearch is great at many things but writing complex ad hoc queries with aggregations and/or subqueries isn't one of them.

How do you sync the data then ?

Triggers and queues are what I've been using.

Add a trigger on the table that has data you want to keep track of. Here's what one of the queues I've done looks like.

queue

The trigger logs the action (insert/update/delete) and from there you know what to do in your Elasticsearch index. I've found rebuilding the entire record in elastic wasn't too expensive so this is what I do.

This way, you can take a project with a large code base and index any data you like in elasticsearch without having to do any code changes. Everything is handled by the state of your data in your RDBMS of choice.

Elasticsearch (and all the other nosql/document stores) have amazing use cases but storing relational data as a main database is not one of them. Use relational databases for that.