Mysql – Search 200M documents of text by keyword in text

database-design full-text-search MySQL postgresql

Data

About 5M documents now (I would love to reach ~223M), each with ~10-1000 words of text. It must be UTF-8 and support any spoken language. All words within a text are unique, and I can order them or do whatever else is needed for better performance. The full text is stored in another location. The data is almost read-only: for performance, I can generate the data, add indexes, and mark it as read-only.

Data structure:

{
  id: 1,
  reference_id: 1,
  text: 'text of unique. 10-1000 words each ',
}

Requirements

Results must come back in under 0.1 s.
Sample queries:
1. find text with words: x or y or z
2. find text with words: x and y and z
3. would be great to find text with part of the word: hous would find house, housing etc.

Questions

Is it possible to make this happen with MySQL/PostgreSQL?
What database solutions would you suggest?
What config should I look at?
Which index type?
Query sample?
Server requirements for RAM/HDD/CPU?
Is it better to use a few more powerful machines or an array of simpler ones?

Best Answer

Just think about your requirements and what you're asking, and whether an answer can even be given here. Wikipedia has 5.3M documents. You're talking about a load on that scale, and you want to know whether or not you can query it in 0.1 s? You're going to have to provide the sample text. I'm all about making sample data, but how am I even going to generate sample data for this question?

What you want is enterprise Full-Text Search (FTS): PostgreSQL has that. Whether or not it will perform as well as you intend, I have no idea. I've used it for reasonably large projects, and it has worked fantastically.

would be great to find text with part of the word: hous would find house, housing etc.

It also supports the :* prefix operator in a tsquery, which does what you want, so just use hous:*
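
As a quick check of the prefix syntax, the sketch below runs without any table. The 'simple' text search configuration is an assumption here (it does no stemming and removes no stop words, which seems a reasonable fit for mixed-language text), and the sample strings are made up.

```sql
-- Prefix match: hous:* matches any lexeme that starts with "hous".
SELECT
  to_tsvector('simple', 'affordable housing market') @@ to_tsquery('simple', 'hous:*') AS matches_housing,
  to_tsvector('simple', 'a house by the sea')        @@ to_tsquery('simple', 'hous:*') AS matches_house,
  to_tsvector('simple', 'an hour by train')          @@ to_tsquery('simple', 'hous:*') AS matches_hour;
```

The first two columns should come back true and the last one false. Note that this is prefix matching only: a fragment from the middle of a word would not match.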

It is indexable using a GiST or GIN index (the PostgreSQL docs describe GIN as the preferred index type for text search). All of that is in the docs.

{
  id: 1,
  reference_id: 1,
  text: 'text of unique. 10-1000 words each ',
}

So just

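CREATE TABLE foo (
  id            int PRIMARY KEY,
  reference_id  int, -- no idea what this is
  document      text
);
-- An expression index needs the immutable, two-argument form of to_tsvector.
-- 'simple' (no stemming, no stop words) is an assumption for multilingual text.
CREATE INDEX ON foo USING GIST (to_tsvector('simple', document));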
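
The question also asked for a query sample. Below is a minimal sketch of the three sample queries, assuming the index is built on to_tsvector('simple', document); the 'simple' configuration is an assumption, and x, y and z stand in for real search words. The WHERE clause has to repeat the indexed expression exactly for the planner to consider the index. A GIN index is included as an alternative (its name is made up), since the PostgreSQL documentation describes GIN as the preferred text search index type.

```sql
-- Alternative index (hypothetical name). GIN is generally slower to build and
-- update than GiST, which matters less when the data is almost read-only.
CREATE INDEX foo_document_fts_gin
  ON foo USING GIN (to_tsvector('simple', document));

-- 1. Texts containing x OR y OR z
SELECT id, reference_id
FROM foo
WHERE to_tsvector('simple', document) @@ to_tsquery('simple', 'x | y | z');

-- 2. Texts containing x AND y AND z
SELECT id, reference_id
FROM foo
WHERE to_tsvector('simple', document) @@ to_tsquery('simple', 'x & y & z');

-- 3. Texts containing a word that starts with "hous" (house, housing, ...)
SELECT id, reference_id
FROM foo
WHERE to_tsvector('simple', document) @@ to_tsquery('simple', 'hous:*');
```

Running each of these under EXPLAIN ANALYZE against a realistic sample of the data is probably the only way to find out whether the 0.1 s target holds.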

Related Solutions

Sql-server – get similar Full Text Search functionality without using Full-Text Search

Native full-text can handle all of the features you've mentioned.

Ranking/relevance, weighting
Word inflection and expansion
Customisable noise words and thesaurus
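
As a rough illustration of the first two features, here is a sketch in T-SQL. The table dbo.Articles and its columns ArticleId, Title and Body are hypothetical, and the sketch assumes a full-text index on Body keyed on ArticleId; the weights are arbitrary.

```sql
-- Word inflection: match inflected forms of "house".
SELECT ArticleId, Title
FROM dbo.Articles
WHERE CONTAINS(Body, 'FORMSOF(INFLECTIONAL, house)');

-- Ranking with per-term weights: "salsa" counts for more than "peas".
SELECT a.ArticleId, a.Title, kt.[RANK]
FROM dbo.Articles AS a
INNER JOIN CONTAINSTABLE(
        dbo.Articles, Body,
        'ISABOUT(salsa WEIGHT(0.8), peas WEIGHT(0.2))') AS kt
    ON a.ArticleId = kt.[KEY]
ORDER BY kt.[RANK] DESC;
```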

Whether or not it will be the best solution for your particular domain is a tough call. Rather than attempt to hand-roll an alternative, spike a proof of concept using SQL FTS and repeat the exercise with one or more open source alternatives.

Lucene is the obvious contender to compare and comes with the "Stack Overflow search is now 81% less crappy" endorsement of the SO team.

Sql-server – Full-text search in SQL Server on multiple tables fails

We can query multiple tables in a full-text search by using views, but this has both advantages and disadvantages. In your case, you created the view dbo.vw_recipe_search, but it seems you are not using it in the altered procedure. When I query the view directly, I get the intended results:

SELECT *
FROM dbo.vw_recipe_search r
INNER JOIN FREETEXTTABLE(vw_recipe_search, *, 'salsa peas') kt
    ON r.search_id = kt.[KEY]
ORDER BY RANK DESC;

The problem with this approach is the ranking and weighting of each column. If, say, you want to give the title column a higher weight than descrip, you have to do a lot of work in the ORDER BY on the rank column. We used a Levenshtein string comparison function to order the result set better.
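
One way to approximate per-column weighting is to search each column separately and combine the ranks, sketched below. It assumes the view exposes title and descrip columns and that both are covered by its full-text index; the weights 2 and 1 are arbitrary. Full-text rank values only indicate relative order within a single query, so adding them up is a heuristic rather than a true relevance score.

```sql
SELECT r.*,
       COALESCE(t.[RANK], 0) * 2 + COALESCE(d.[RANK], 0) AS weighted_rank
FROM dbo.vw_recipe_search AS r
-- Rank of the match in the title column, if any
LEFT JOIN FREETEXTTABLE(dbo.vw_recipe_search, title, 'salsa peas') AS t
    ON r.search_id = t.[KEY]
-- Rank of the match in the descrip column, if any
LEFT JOIN FREETEXTTABLE(dbo.vw_recipe_search, descrip, 'salsa peas') AS d
    ON r.search_id = d.[KEY]
-- Keep only rows that matched in at least one column
WHERE t.[KEY] IS NOT NULL OR d.[KEY] IS NOT NULL
ORDER BY weighted_rank DESC;
```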
