MySQL – Searching 500 million records of binary data

database-recommendation, MySQL, schema

I will be inserting into a database the signatures of 500,000,000 pictures. The signatures will be generated using libpuzzle, and each signature is 338 bytes (so about 160 GB in total), plus a table for searching (read more below). I would prefer to keep the main database on one VPS server with a standard HDD (no SSD, because of cost issues).

The most important aspect is search time; insertion time does not matter.

In the past I have attempted this all within MySQL (with far fewer records) and had one database for everything; the main searching happened with a schema like:

--
-- Table structure for table `signatures`
--

CREATE TABLE IF NOT EXISTS `signatures` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `compressed_signature` varchar(338) NOT NULL,
  `picture_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `picture_id` (`picture_id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=1107725 ;

-- --------------------------------------------------------

--
-- Table structure for table `stored_pictures`
--

CREATE TABLE IF NOT EXISTS `stored_pictures` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(255) NOT NULL,
  `pid` bigint(20) unsigned NOT NULL,
  `num` int(11) NOT NULL,
  `updated_at` datetime DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `picture_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_url` (`url`),
  KEY `idx_picture_id` (`picture_id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=2773867 ;

-- --------------------------------------------------------

--
-- Table structure for table `words`
--

CREATE TABLE IF NOT EXISTS `words` (
  `pos_and_word` char(5) NOT NULL,
  `signature_id` int(11) NOT NULL,
  KEY `idx_pos_and_word` (`pos_and_word`),
  KEY `signature_id` (`signature_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

By the nature of libpuzzle, you search the words table for many signature_id values, then fetch all the corresponding compressed_signature values from the signatures table, do some math, and produce a score for each signature indicating how similar it is to the search image. Then, for each similarity above a threshold, I would get the data I needed from stored_pictures by looking up the picture_id.
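The candidate-gathering step can be sketched roughly like this (a simplified Python sketch, not the actual libpuzzle scoring; `words_index` and `min_matches` are hypothetical stand-ins for the words-table lookup and the threshold):

```python
from collections import Counter

def candidate_signatures(query_words, words_index, min_matches=2):
    """Count how many of the query's pos_and_word keys each stored
    signature shares, and keep signatures at or above a match threshold.

    words_index maps a pos_and_word key to the list of signature_ids
    containing it -- the same lookup the `words` table performs.
    """
    matches = Counter()
    for key in query_words:
        for sig_id in words_index.get(key, ()):
            matches[sig_id] += 1
    return {sig_id: n for sig_id, n in matches.items() if n >= min_matches}

# Toy index: signature 1 shares three keys with the query, 2 shares two, 3 shares one.
index = {
    "0abcd": [1, 2],
    "1efgh": [1, 3],
    "2ijkl": [1, 2],
}
print(candidate_signatures(["0abcd", "1efgh", "2ijkl"], index))
# → {1: 3, 2: 2}
```

The "do some math" scoring against each candidate's compressed_signature would then run only over this much smaller candidate set.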

It took about 5 minutes for a single search across 40,000,000 pictures – so I think there is room for improvement, especially because I want this to be fast up to 500 million records.

Should I separate all the non-essential data (only about 1% of the total size – everything that relates to the specific picture) into a separate database? On a separate server?

Because the search is just a massive lookup on pos_and_word that spits back all the signature_ids that could be a match, I assume that not needing any relations in the data frees me to pick a specific technology that will maximize my speed. Which technology is best for this?

Best Answer

You should consider using a real indexed search engine such as Elasticsearch. It runs on Java, uses a REST interface with JSON and so is exceedingly easy to program against, and it is free and open source with a good community behind it.

www.elasticsearch.org

You could have it running on the same server as your application, as long as you have a decent enough machine, and it's built specifically for searching hundreds of millions of documents in near real time.

It is pretty easy to install and set up, and extremely easy to customize. It also has the added benefit of being built to run in the cloud (though that is not a requirement) and can scale out via clustering very easily when your application begins to grow in popularity.
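The words-table lookup maps naturally onto an Elasticsearch terms query. A hedged sketch of building the request body (the index name `signatures` and field `pos_and_word` are assumptions mirroring the MySQL schema; the exact request shape can vary by Elasticsearch version):

```python
import json

def build_terms_query(words, size=100):
    """Build the JSON body of an Elasticsearch terms query matching any
    document whose pos_and_word field contains one of the given keys --
    the same role the MySQL `words` table played."""
    return {
        "size": size,
        "query": {"terms": {"pos_and_word": words}},
    }

body = build_terms_query(["0abcd", "1efgh"])
print(json.dumps(body))
```

This body would be POSTed to something like `http://localhost:9200/signatures/_search`; the scoring math against each hit's stored signature would still happen in your application.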