Sql-server – How much data can SQL Server full text indexing handle

full-text-searchperformancesql-server-2012

I realize that the question is vague and it depends on hardware and our needs.

We currently have 5 million rows of data, a total of 5GB of data which we want to index using full text indexing. Our data increases quite rapidly and it's not unreasonable to assume that in a few years it will be closer to a billion rows and a TB of data.

The index is searchable by web site users, and they expect responses within a second or two.

Is it reasonable to assume that this data set will be indexable using SQL Server 2012 full text indexing? Is it common to do full text indexing of this amount of data? And is there any good reading on the subject, for example from others' experience?

Best Answer

I've not implemented FTS anywhere near that scale personally but there are documented examples:

In addition to being robust in its support of concurrent users, Metalife has found SQL Server 2005 full-text searches to be fast. “SQL Server 2005 is at least as fast as our custom code was,” says Pavlova. “We created a custom algorithm to return all full-text searches against our 1.4 terabyte database in less than a second, and that is what we’ve seen with SQL Server 2005.” [Source]

1.4TB is mentioned as the source database here but that doesn't necessarily translate to TB+ of full text indexed data.

This example is quite smart, extracting content from files before shuffling them off to cheap cloud storage:

There are currently over 50 million files (over 10 Terabytes of data) that the customer is migrating into a custom project management application (for international customers) that can be accessed via the application and can be searched upon. SQL Server 2008 full text indexing is used to index the content of these files which gives them rich searching capabilities within their application. [Source]

Newsgator is documented as running a 2.5 billion row, 4TB full text implementation and FileControl as having a 2 billion, 1TB system back in 2005. Given the other options available today, would they do the same now?

Given the licensing cost of SQL Server I'd be inclined to consider alternatives if they fit your use case. Lucene or Solr are the obvious open source choices.