MySQL – Effectively saving a graph (terabytes) in a database

foreign-key, insert, MySQL

I'm new here so I'm not entirely sure what tags to use so please notify/edit if this should be changed, thanks!

Background
I have a graph with ~4 billion nodes and ~1 trillion edges that I want to store in a database (such as SQLite), since a database keeps the data on the hard disk rather than in RAM, which many other graph data structures require.

Data format
This graph format looks like:

accession node position orientation 
23101.1   1    1        plus
23101.1   100  2        plus
...
23101.1   100  1        min
...
~1 trillion rows

The plan
I am thinking of creating tables for accession, node, and orientation that I will refer to using foreign keys. The "main" table will then be:

accession node  position orientation 
<FK1>     <FK2> 1        <FK3>

(P.S. I'm not sure whether I should do the same for position, but it simply runs from 1 up to the graph-path length.)
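A minimal sketch of that normalized schema might look like the following (table names, column types, and the orientation encoding are my assumptions, not a definitive design; note that ~4 billion nodes brushes against the `INT UNSIGNED` limit, so `BIGINT` is safer there):

```sql
-- Lookup table for accessions; the UNIQUE key lets us find ids by value.
CREATE TABLE accessions (
    id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    accession VARCHAR(32) NOT NULL,
    UNIQUE KEY (accession)
);

-- Lookup table for nodes. BIGINT because ~4 billion rows is at the edge
-- of what INT UNSIGNED (max ~4.29 billion) can hold.
CREATE TABLE nodes (
    id   BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    node BIGINT UNSIGNED NOT NULL,
    UNIQUE KEY (node)
);

-- The "main" edge table, referencing the lookups via foreign keys.
-- Orientation fits in a TINYINT (e.g. 1 = plus, 0 = min), so a third
-- lookup table is probably unnecessary.
CREATE TABLE main (
    accession   INT UNSIGNED    NOT NULL,
    node        BIGINT UNSIGNED NOT NULL,
    position    BIGINT UNSIGNED NOT NULL,
    orientation TINYINT         NOT NULL,
    FOREIGN KEY (accession) REFERENCES accessions(id),
    FOREIGN KEY (node)      REFERENCES nodes(id)
);
```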

The question
I have only limited knowledge of SQL, but based on code like this I would have to execute something like the code below for each line in the graph file:

    INSERT INTO main (accession, node, position, orientation)
    VALUES (
        (SELECT id FROM accessions WHERE accession = '23101.1'),
        (SELECT id FROM nodes WHERE node = 1),
        1,
        1
    );

and also catch an error when the node/accession does not exist and insert it. However, I wonder whether obtaining the foreign keys with the SELECT ... WHERE will become really slow when there are billions of rows in the table. So overall, what would be a proper way to store this information in a database (if at all)?
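For what it's worth, one common way to avoid the catch-an-error-then-insert dance is `INSERT IGNORE` against a `UNIQUE` key: the insert becomes a no-op if the value already exists, and the id lookup then always succeeds. A sketch, assuming a `UNIQUE` key on `accessions.accession` and `nodes.node`:

```sql
-- Upsert the lookup values first; duplicates are silently skipped
-- thanks to the UNIQUE keys.
INSERT IGNORE INTO accessions (accession) VALUES ('23101.1');
INSERT IGNORE INTO nodes (node) VALUES (1);

-- Then insert the edge, resolving both foreign keys inline.
INSERT INTO main (accession, node, position, orientation)
VALUES (
    (SELECT id FROM accessions WHERE accession = '23101.1'),
    (SELECT id FROM nodes WHERE node = 1),
    1,
    1
);
```

The lookups stay fast because a `UNIQUE` key is a B-tree index: each `SELECT id ... WHERE` is an index probe, not a table scan, even with billions of rows.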

Best Answer

Column < Row < Block < Table

One column holds a number, date, string, or one of a small number of other possibilities. It takes a few bytes, or sometimes many.

One row like you described (4 columns, each taking a few bytes) will take perhaps 40 bytes once you add in some overhead.

One block is 16KB and holds (in your case) a few hundred rows. I mention blocks because the block is the unit of caching in InnoDB tables. All activity is done on blocks in RAM. If an operation needs rows from a block that is not in RAM, the block is brought into RAM, possibly pushing out some other block. (Read about "caching".)
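The cache in question is the InnoDB buffer pool. You can inspect how big it is and how often reads miss it and hit disk; these are standard MySQL variables:

```sql
-- Buffer pool size in bytes (the default is often only 128MB;
-- dedicated servers typically set it to ~70% of RAM).
SELECT @@innodb_buffer_pool_size;

-- Compare logical read requests with reads that actually went to disk.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
```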

One table is composed of as many blocks as are needed -- billions in your case!

So... because of caching, those billions of blocks can be handled. But with a lot of I/O.

I am assuming you need one row to describe one edge. And probably another table with nodes -- and 4 billion rows in that table.

These are huge numbers. Am I misinterpreting the number of edges and nodes? Do you expect to walk through all the edges for some operation? If so, let's see...

  • 40TB in the edge table;
  • 2.5 billion blocks, each needing to be read at least once;
  • an SSD that can handle 1K reads/second;
  • 2.5 million seconds to read all edges once;
  • that's about 1 month.
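You can check that arithmetic directly (a back-of-envelope, assuming 40 bytes/row, 16KB blocks, and 1K reads/second):

```sql
SELECT
    1e12 * 40                         AS total_bytes,  -- 40 TB of edge rows
    1e12 * 40 / 16384                 AS blocks,       -- ~2.44 billion 16KB blocks
    1e12 * 40 / 16384 / 1000          AS seconds,      -- at 1K random reads/second
    1e12 * 40 / 16384 / 1000 / 86400  AS days;         -- ~28 days, about a month
```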

Somethin's gotta give!