Mysql – Search 200M documents of text by keyword in text

database-designfull-text-searchMySQLpostgresql

Data

~5M documents, would love to have ~223M, with the text of ~10-1000 words each. It must be utf8 and support any spoken language. All words in a text are unique and I can also order them or do whatever needed to have better performance. Full text is saved in another location. Data is almost read-only. For performance, I can generate data, add indexes and mark it as read-only.

Data structure:

{
  id: 1,
  reference_id: 1,
  text: 'text of unique. 10-1000 words each ',
}

Requirements

  • Results must come under 0.1s
  • Sample queries:

    1. find text with words: x or y or z
    2. find text with words: x and y and z
    3. would be great to find text with part of the word: hous would find house, housing etc.

Questions

  1. Is it possible to make it happen with MySQL/PostgreSQL?
  2. What database solutions you would offer?
  3. What config to look at?
  4. Index type?
  5. Query sample?
  6. Server requirements for RAM/HDD/CPU?
  7. Better more powerful machines or array of simpler?

Best Answer

Just think about your requirements and what you're asking and if an answer can even been given here. Wikipedia has 5.3 M documents. You're talking about a load on that scale. And you want to know whether or not you can query it in 0.1 sec? You're going to have provide the sample text. I'm all about making sample data, but how am I even going to generate sample data for this question/

What you want is enterprise Full-Text Search (FTS): PostgreSQL has that. Whether or not it will perform as well as you intend, I have no idea. I've used it for reasonably large projects, it's worked fantastically.

would be great to find text with part of the word: hous would find house, housing etc.

It also supports :* which does what you want. so just use hous:*

It is indexable using a GIST index. All of that is in the docs.

{
  id: 1,
  reference_id: 1,
  text: 'text of unique. 10-1000 words each ',
}

So just

CREATE TABLE foo (
  id            int PRIMARY KEY,
  reference_id  int, -- no idea what this is
  document      text,
);
CREATE INDEX ON foo USING GIST (to_tsvector(document));