MySQL – Effectively saving a graph (terabytes) in a database

foreign-key, insert, MySQL

I'm new here so I'm not entirely sure what tags to use so please notify/edit if this should be changed, thanks!

Background
I have a graph with ~4 billion nodes and ~1 trillion edges that I want to store in a database (such as SQLite), since a database keeps the data on the hard disk rather than in RAM, which many other graph data structures require.

Data format
This graph format looks like:

accession node position orientation 
23101.1   1    1        plus
23101.1   100  2        plus
...
23101.1   100  1        min
...
~1 trillion rows

The plan
I am thinking of creating tables for accession, node, and orientation that I will refer to using foreign keys. The "main" table will then be:

accession node  position orientation 
<FK1>     <FK2> 1        <FK3>

(P.S. I'm not sure whether I should do the same for position, but it simply runs from 1 up to the graph-path length.)
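A minimal sketch of that normalized schema might look like the following (table names, column types, and the orientation encoding are my assumptions, not a definitive design; note that ~4 billion nodes brushes against the `INT UNSIGNED` limit, so `BIGINT` is safer there):

```sql
-- Lookup table for accessions; the UNIQUE key lets us find ids by value.
CREATE TABLE accessions (
    id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    accession VARCHAR(32) NOT NULL,
    UNIQUE KEY (accession)
);

-- Lookup table for nodes. BIGINT because ~4 billion rows is at the edge
-- of what INT UNSIGNED (max ~4.29 billion) can hold.
CREATE TABLE nodes (
    id   BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    node BIGINT UNSIGNED NOT NULL,
    UNIQUE KEY (node)
);

-- The "main" edge table, referencing the lookups via foreign keys.
-- Orientation fits in a TINYINT (e.g. 1 = plus, 0 = min), so a third
-- lookup table is probably unnecessary.
CREATE TABLE main (
    accession   INT UNSIGNED    NOT NULL,
    node        BIGINT UNSIGNED NOT NULL,
    position    BIGINT UNSIGNED NOT NULL,
    orientation TINYINT         NOT NULL,
    FOREIGN KEY (accession) REFERENCES accessions(id),
    FOREIGN KEY (node)      REFERENCES nodes(id)
);
```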

The question
I have only limited knowledge of SQL, but based on code like this I would have to execute something like the code below for each line in the graph file:

    INSERT INTO main (accession, node, position, orientation)
    VALUES (
        (SELECT id FROM accessions WHERE accession = '23101.1'),
        (SELECT id FROM nodes WHERE node = 1),
        1,
        1
    );

and also catch an error when the node/accession does not exist and insert it. However, I wonder whether obtaining the foreign keys with the SELECT ... WHERE will become really slow when there are billions of rows in the table. So overall, what would be a proper way to store this information in a database (if at all)?
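For what it's worth, one common way to avoid the catch-an-error-then-insert dance is `INSERT IGNORE` against a `UNIQUE` key: the insert becomes a no-op if the value already exists, and the id lookup then always succeeds. A sketch, assuming a `UNIQUE` key on `accessions.accession` and `nodes.node`:

```sql
-- Upsert the lookup values first; duplicates are silently skipped
-- thanks to the UNIQUE keys.
INSERT IGNORE INTO accessions (accession) VALUES ('23101.1');
INSERT IGNORE INTO nodes (node) VALUES (1);

-- Then insert the edge, resolving both foreign keys inline.
INSERT INTO main (accession, node, position, orientation)
VALUES (
    (SELECT id FROM accessions WHERE accession = '23101.1'),
    (SELECT id FROM nodes WHERE node = 1),
    1,
    1
);
```

The lookups stay fast because a `UNIQUE` key is a B-tree index: each `SELECT id ... WHERE` is an index probe, not a table scan, even with billions of rows.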

Best Answer

Column < Row < Block < Table

One column holds a number, date, string, or one of a small number of other possibilities. It takes a few bytes, or sometimes many.

One row like you described (4 columns, each taking a few bytes) will take perhaps 40 bytes once you add in some overhead.

One block is 16KB and holds (in your case) a few hundred rows. I mention blocks because the block is the unit of caching in InnoDB tables. All activity is done on blocks in RAM. If an operation needs rows from a block that is not in RAM, the block is brought into RAM, possibly pushing out some other block. (Read about "caching".)
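The cache in question is the InnoDB buffer pool. You can inspect how big it is and how often reads miss it and hit disk; these are standard MySQL variables:

```sql
-- Buffer pool size in bytes (the default is often only 128MB;
-- dedicated servers typically set it to ~70% of RAM).
SELECT @@innodb_buffer_pool_size;

-- Compare logical read requests with reads that actually went to disk.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
```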

One table is composed of as many blocks as are needed -- billions in your case!

So... because of caching, those billions of blocks can be handled. But with a lot of I/O.

I am assuming you need one row to describe one edge. And probably another table with nodes -- and 4 billion rows in that table.

These are huge numbers. Am I misinterpreting the number of edges and nodes? Do you expect to walk through all the edges for some operation? If so, let's see...

  • 40TB in the edge table;
  • 2.5 billion blocks, each needing to be read at least once;
  • an SSD that can handle 1K reads/second;
  • 2.5 million seconds to read all edges once;
  • that's about 1 month.
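You can check that arithmetic directly (a back-of-envelope, assuming 40 bytes/row, 16KB blocks, and 1K reads/second):

```sql
SELECT
    1e12 * 40                         AS total_bytes,  -- 40 TB of edge rows
    1e12 * 40 / 16384                 AS blocks,       -- ~2.44 billion 16KB blocks
    1e12 * 40 / 16384 / 1000          AS seconds,      -- at 1K random reads/second
    1e12 * 40 / 16384 / 1000 / 86400  AS days;         -- ~28 days, about a month
```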

Somethin's gotta give!