SQL Server – Performing regex search queries on a text column in a table of 100+ million rows

sql server

I apologize if this is a newbie question but I'm feeling overwhelmed.

I have 100+ million 2-grams (strings of two words), 100+ million 3-grams, and 100+ million 4-grams.

I was thinking of putting the 2-grams in one table, the 3-grams in another, and so on, since there's quite a lot of data and I'd like to break it up at least a bit, and I don't know how else I could partition it. Plus, I'd like to avoid Enterprise-only features of SQL Server such as actual table partitioning.

Apart from the text column, I'd have Id (bigint), AppearanceCount (int), and Hash (the hashed text column; it doesn't even need to be a computed column, as I don't expect many, or any, edits to the text column).
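As a sketch, the table you describe might look like this in T-SQL (the table name, column sizes, and hash algorithm below are my assumptions, not from the question):

```sql
-- Hypothetical schema for the 2-gram table; names, types, and sizes are assumptions.
CREATE TABLE dbo.TwoGrams
(
    Id              bigint IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    [Text]          nvarchar(200)         NOT NULL, -- the 2-gram itself
    AppearanceCount int                   NOT NULL,
    Hash            binary(32)            NOT NULL  -- e.g. HASHBYTES('SHA2_256', [Text])
);

-- A unique index on the hash supports fast delete-by-hash lookups.
CREATE UNIQUE INDEX IX_TwoGrams_Hash ON dbo.TwoGrams (Hash);
```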

Typical usage scenarios for the 2-gram table (and the others as well) would be:

  • Search the table with a regex matching the text column.
  • Delete a row with a specified hash.
  • Insert a row.
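Assuming a table shaped like the one described above (all names here are hypothetical), the three scenarios might be sketched as:

```sql
-- 1. Search the text column with a pattern. SQL Server has no native regex,
--    but LIKE supports the % and _ wildcards plus [] character classes.
SELECT Id
FROM dbo.TwoGrams
WHERE [Text] LIKE '%hum[ea]%';  -- placeholder pattern

-- 2. Delete a row with a specified hash (a single seek on the Hash index).
DELETE FROM dbo.TwoGrams
WHERE Hash = HASHBYTES('SHA2_256', N'went hume');

-- 3. Insert a row, precomputing the hash.
INSERT INTO dbo.TwoGrams ([Text], AppearanceCount, Hash)
VALUES (N'went hume', 1, HASHBYTES('SHA2_256', N'went hume'));
```

Note that a LIKE pattern with a leading % cannot seek an index, so the search will scan the table (or an index on the text column), while the delete-by-hash stays a cheap single-row seek.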

In an average case, I believe searching the table with a regex should return around 5-10 results, but outliers are possible.

So, what could I do to make sure all of this works fast?
By "fast" I mean a cycle of the above operations completing in a few seconds for the smaller n-grams, and in no more than a few minutes for 4-grams and 5-grams (the current non-database system takes an hour for 4-grams).

I know about full text search, but I haven't been able to find how to use regexes with it. I would really, really prefer to have regexes work here.
If that is not at all possible, I can of course give more details, but I didn't want to overwhelm everyone with an even bigger wall of text.

I'm sorry if this is too general, but I'm in the planning phase of this and would like to know if SQL Server is viable for this.
I could perhaps use something else, but I'm most familiar and comfortable with SQL Server.


They're all basically syntagms, i.e., parts of a sentence that might or might not be grammatically wrong or contain a misspelling.
The words in those syntagms also have variations because the language I'm working with has cases and genders.
https://en.wikipedia.org/wiki/Grammatical_case#Indo-European_languages
https://en.wikipedia.org/wiki/Grammatical_gender

The words are not English, in case that matters, but I'll use some English examples so that it's understandable.
This is a contrived example, but if English had cases and we wanted to use the word "house" in various situations, we could have the following, where we're answering some questions:
What is this? "A house."
Where are you going? "To houise."
What are you buying? "Housua"

Basically, you have a root of a word that can change.
So, if people are misspelling the word house or using it in an improper way, we want to delete entries from our database of correct expressions that match a regular expression that signifies how the word house is used incorrectly.

2-grams and 3-grams aren't different at all; it's literally just that 3-grams have one more word.

Let's imagine that the word "home" has different forms based on the type of situation the sentence it's in describes.
Also let's imagine English has a word called "foohume" that has nothing to do with the word "home".

Examples of a 2-gram where for some reason people spell home as hume:
Went hume
Gone hume
Going hume
Nice humes
To huma
About humas
About foohume

Now, we want to delete all 2-grams which contain an incorrect use of the word home. So basically all of these except the last one.
So, someone somewhere would tell the database to delete every 2-gram that matches the following conditions that would be described with a regex:
– Is at the beginning of a 2-gram or has a space in front of it
– Contains the root "hum" followed by one of the suffixes "e", "es", "a", "as"
So basically it'd be something like "[$ ]hum[e, es, a, as]". I know it's not really a valid regex but you get the idea.
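For what it's worth, this intent does fit into SQL Server's LIKE syntax, which supports [] character classes even though it isn't full regex. A sketch, assuming a hypothetical table dbo.TwoGrams with the text in a [Text] column, and using space-padding so "start of string" behaves like a word boundary:

```sql
-- "[$ ]hum[e, es, a, as]" expressed as T-SQL LIKE patterns.
-- Padding [Text] with spaces turns start/end-of-string into ordinary
-- space-delimited word boundaries, so one pattern covers both cases.
SELECT Id
FROM dbo.TwoGrams                               -- hypothetical table name
WHERE ' ' + [Text] + ' ' LIKE '% hum[ea] %'     -- suffixes e, a
   OR ' ' + [Text] + ' ' LIKE '% hum[ea]s %';   -- suffixes es, as
```

As an actual regular expression (for an external tool or a CLR function), it would be something like `(^| )hum(e|es|a|as)( |$)`.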

What I fear is that full text search can't be used for this, and that we'll end up having to sequentially scan the entire table of 2-grams, check each row for a regex match, and return that row's Id (or the entire row, but Id only might be faster). If the condition were more straightforward than a regex, I suppose a full text search would return the results fairly quickly, since, as I said, there should be about 10 of them on average.

Best Answer

I recently built an SSIS package that went through a table with over 4 billion records; the first step in the package was a query that extracted about 50 million records. The query took just a few seconds. Granted, my query was able to use an index, but it did have multiple conditions and had to skip through the index to get at a subset of all the records and then test them against a couple more conditions. And the result set was of the same order of magnitude as one of your entire tables. So I don't think you will have to do much to make this happen.

It sounds like you may not know about the LIKE operator; you can find the details in the SQL Server documentation. LIKE isn't full regex, but its wildcards and [] character classes will let you express patterns like yours against the text column.

I'd recommend that you load the 2-grams into a table and give it a try. My guess is that a simple query with a LIKE condition will do the job for you.