Storing ~700 Tb in SQL Db

dynamodbnosql

I am a newbie to Db's on a whole, and I came across a tutorial, wherein we needed to store ~700 Tb of data over a few years. It was mentioned that scaling SQL for such limits of data is not the best approach, and hence in the tutorial they went ahead with NoSQL. I have a few doubts regarding this:

Is it possible to store such Tb's in a single machine, as per my understanding, horizontal scaling is not possible in SQL. Please correct this if I am wrong.
What will be the latency comparison of SQL vs NoSQL. Because of horizontal scaling, will a noSQL db like DynamoDB give lower latency than a SQL Db.

Best Answer

Is it possible to store such Tb's in a single machine

Definitely. Though that machine becomes a large single-point-of-failure.

Building a petabyte scale storage system that is efficient and fault tolerant can itself be a challenge, doing it with cost efficiency too even more so. I assume that being a "newbie as DBs" also means you are not particularly experienced with storage solutions either, so you don't want to be trying to put together a custom build.

A not un-common arrangement is to have a couple of clustered machines running your chosen DBMS (SQL Server etc.) using an "off the shelf" SAN based storage system. I did a quick calc with the first search result for "petabyte SAN" and a ~700Tb arrangement came in at around £40,000 and that is just the storage unit and drives to populate it: add onto that racks to hold it all, decent spec machines for the clustered database servers, the enterprise licensing that is likely to be needed, ... This level of storage is never going to be cheap. You also need to factor in the ongoing costs: electricity to power it all and the required air conditioning too, replacing drives when they fail, and also the man-power to monitor and maintain it all!

You would almost certainly be better off looking at managed solutions, perhaps with one of the bigger cloud providers, and let them worry about much of the scaling, redundancy, reliability, power, etc. problems for you. For this scale you are unlikely to find an "off the shelf" price so you'll need to talk to them directly. Be warned: this amount of storage, particularly for rapid access, with good resiliency (in terms of both data security and service availability) is not going to be remotely cheap, there is no way it can be.

as per my understanding, horizontal scaling is not possible in SQL. Please correct this if I am wrong.

It is possible, though whether it is practical compared to other options depends on your data and access patterns.

What will be the latency comparison of SQL vs NoSQL.

Because of horizontal scaling, will a noSQL db like DynamoDB give lower latency than a SQL Db.

All of the above can only really be answered with "it depends" without a lot more information about the data your are needing to store, and at least a vague idea of expected retrieval patterns and other requirements.

Just knowing the size and expected growth of the data is not nearly enough for a more detailed discussion.

Related Solutions

Mysql – Storing dynamic data in NoSQL

I'll answer your question in an orthodox manner, with a twist of heresy:

Orthodoxy: You shouldn't store data in a column in a relational database that isn't "atomic."

Heresy: In your specific situation, you could -- maybe -- consider this blob of JSON to be an atomic object.

Years ago, Chris Date said it like this:

"A relation is said to be in first normal form (abbreviated 1NF) if and only if it satisfies the condition that it contains scalar values only"

^{Date, C.J. An Introduction to Database Systems, 6th edition (Addison-Wesley, 1995)}

Later, he took a somewhat softer stance:

"1NF just means each tuple in the relation contains exactly one value, of the appropriate type, for each attribute. Observe in particular that 1NF places no limitations on what those attribute types are allowed to be."

^{Date, C. J. Database Design and Relational Theory: Normal Forms and All That Jazz (OReilly Media, 2012)}

The "exactly one value" I'm arguing for here is "exactly one JSON object" (which could, in turn contain a JSON array).

Storing things in JSON in a column is a bad idea if you need to the DBMS to manipulate it in any way, since, of course, it can't be properly indexed like properly normalized data can be... but, arguably, if you really really really don't intend for the DBMS to do anything with what you're storing other than write and read it, the case could be made to store a chunk of JSON in a single column, claiming the JSON array of values to be a single atomic value.

The big objection, I think, to doing this, is when it's done out of a lack of familiarity with the relational model or out of laziness or naivete. Obviously, there are a lot of ways it could be done wrong, but I'd suggest that there's nothing inherently wrong about storing a chunk of JSON in a database, As Long As You Know What You're Doing.™

And, of course, you could use a MySQL FULLTEXT index on it, now that those are supported in InnoDB as of (MySQL 5.6).

AWS Database Services for Logging Records like log4j or log4perl

Since you're looking at storing and querying 135,000,000 * 30 * 10 records that likely wouldn't benefit from traditional RDBMS features, I think Hadoop would be the way to go.

My experience is exclusively with Microsoft Azure - if you're not attached to Amazon you might check it out. Either way, Hadoop is open source so the majority of operations and activity should be the same regardless of your platform provider. I'd go with an option that allows you to test the performance of different cluster sizes so you're only paying for what you really need (especially with the 10x increase as an unknown right now).

The following link, despite its Azure focus, gives you a good tutorial of how to query log4j records with Hive. Amazon EMR appears to have great documentation.

Key points for your situation: Create a table structure over your files:

    CREATE EXTERNAL TABLE log4jLogs
    (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    STORED AS TEXTFILE LOCATION 'yourdrive:///logs/';

With the table defined you can then query similar to SQL:

    SELECT t4 AS sev, COUNT(*) AS ErrorCount
    FROM log4jLogs
    WHERE t4 = '[ERROR]'
    AND INPUT__FILE__NAME LIKE '%.log' GROUP BY t4;

With the EXTERNAL definition you can have multiple files and the table will include them all, so just drop them in the appropriate directory. If you're keeping 30 days worth you may want to get fancier and separate days into different folder structures/partition your Hive table.

Best Answer

Related Solutions

Mysql – Storing dynamic data in NoSQL

AWS Database Services for Logging Records like log4j or log4perl

Related Question