Searching for scalable storage solution

database-design · database-recommendation · nosql · scalability · storage

In one project I save processing statuses in a MySQL table:

`id`  `name`  `email`  `has_apples`  `has_bananas`
 1     Bob     a@b.c    1             0
 2     Mike    c@d.e    0             1
 3     John             1             1

The name+email pair must be unique.
For now everything works fine, but the table grows rapidly and currently contains about 100 million rows.
Development needs now require me to add, let's say, a has_oranges column.

Altering such a big table is painful and requires downtime, and it doesn't look like MySQL is the right tool for this problem.

What's more, the amount of this kind of data will grow in the future.
I'm looking for a long-term scalable solution that allows:

  • fast search by id
  • fast search by name
  • uniqueness enforcement on the name+email pair
  • schema scalability (adding columns easily)
  • storage scalability
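For reference, these requirements can all be stated as ordinary DDL. The following is a hypothetical sketch only; the table and column names are taken from the question, and `statuses` is an assumed name:

```sql
-- Sketch: the five requirements expressed as plain PostgreSQL DDL.
CREATE TABLE statuses (
  id           bigserial PRIMARY KEY,           -- fast search by id
  name         text NOT NULL,
  email        text,
  has_apples   boolean NOT NULL DEFAULT false,
  has_bananas  boolean NOT NULL DEFAULT false,
  UNIQUE (name, email)                          -- name+email pair kept unique
);
CREATE INDEX ON statuses (name);                -- fast search by name
```

Note that with a nullable email column, a plain UNIQUE constraint treats NULLs as distinct, so two rows with the same name and no email would both be allowed.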

I'm not familiar with NoSQL solutions, but it seems that maybe that's the way to go.

Best Answer

I'm not familiar with NoSQL solutions, but it seems that maybe that's the way to go.

I think you're off, by at least a few orders of magnitude, in your estimate of what requires NoSQL and big data:

  1. 100 million rows is not much, and "grows rapidly" needs to be quantified.
  2. Schema changes also need to be quantified: do you even need a fixed schema?
  3. Speaking for PostgreSQL, adding a column to a table does not rewrite the table.

Using PostgreSQL

Following my own advice, I'll create an email type. Here is an example:

CREATE EXTENSION citext;               -- case-insensitive text
CREATE DOMAIN email AS citext          -- validate against the HTML5 email pattern
  CHECK ( value ~ '^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$' );

-- Create and populate a 100-million-row test table in one statement.
CREATE TABLE contacts (id,name,emailaddress)
AS
  SELECT
    id::int,
    id::text || ' First Last',
    (id::text||'first@last.com')::email
  FROM generate_series(1,100e6) AS gs(id);

Keep in mind, some of that time is spent validating the email addresses against the HTML5 spec. Anyway, it all finishes in 720882.332 ms, or about 12 minutes, on my rusty laptop, and that's without the indexes. Now we add the indexes:

CREATE UNIQUE INDEX ON contacts (name, emailaddress);
ALTER TABLE contacts ADD PRIMARY KEY (id);
ANALYZE contacts;

Tada, test data done.

SELECT * FROM contacts WHERE id = 424242;
   id   |       name        |     emailaddress     
--------+-------------------+----------------------
 424242 | 424242 First Last | 424242first@last.com
(1 row)

Time: 4.534 ms

SELECT * FROM contacts WHERE name = '424242 First Last';
   id   |       name        |     emailaddress     
--------+-------------------+----------------------
 424242 | 424242 First Last | 424242first@last.com
(1 row)

Time: 5.224 ms

ALTER TABLE contacts ADD COLUMN has_apples bool;
ALTER TABLE
Time: 8.612 ms

ALTER TABLE contacts ADD COLUMN has_bananas bool;
ALTER TABLE
Time: 14.813 ms

Adding columns like that is probably not a good idea. Why not use the jsonb type instead?

ALTER TABLE contacts ADD COLUMN fruits JSONB;
ALTER TABLE
Time: 7.957 ms

CREATE INDEX ON contacts USING gin(fruits);
Time: 24928.756 ms

INSERT INTO contacts (id,name,emailaddress,fruits)
VALUES
  (100e6+1, 'Evan Carroll', 'foo@whatever.test', '{"bananas":true}');
INSERT 0 1
Time: 8.630 ms

SELECT * FROM contacts WHERE fruits @> '{"bananas":true}';
    id     |     name     |   emailaddress    | has_apples | has_bananas |      fruits       
-----------+--------------+-------------------+------------+-------------+-------------------
 100000001 | Evan Carroll | foo@whatever.test |            |             | {"bananas": true}
(1 row)
Time: 5.188 ms

Etc. I'm not sure what you mean by "scalable storage" or "unique control of the name+email pair".

Sizes are also manageable. For $70 worth of RAM, you can fit the entire index in memory.

Table: 7301 MB
Indexes: pkey: 2.1 GB; name/email unique: 6.4 GB; gin/jsonb: 102 MB
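If you want to check these numbers yourself, the standard catalog size functions will do it. This is a sketch using built-in PostgreSQL functions against the `contacts` table from above:

```sql
-- Report table, index, and total on-disk sizes in human-readable form.
SELECT pg_size_pretty(pg_relation_size('contacts'))       AS table_size,
       pg_size_pretty(pg_indexes_size('contacts'))        AS all_indexes,
       pg_size_pretty(pg_total_relation_size('contacts')) AS total;
```

Per-index sizes are available from `pg_relation_size('<index name>')` if you want the breakdown rather than the sum.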