250 billion rows – computing interconnectedness on a really huge scale

database-design, optimization, performance, schema

I have a table of approximately 700k rows, each identified by a unique itemnumber.

Each row/item can be associated with any of the other rows/items in the table by calculating a single numerical value indicating the strength of the association, i.e. the mathematical "distance" between two items: a value of 0 represents no association (infinite distance), while a value of 1 indicates the same item (no distance).

These association numbers are computationally intensive to calculate and are based on data stored in a separate database. Therefore, precalculating them once for all row combinations, and then only for new rows as they are added (< 4k new rows per year), seems to make sense.

The resulting table of associations might look like this:

itemnumber  | associatedwithitem | associationstrength
23920390293 | 12356456885        | 0.12255888644888
45468411516 | 44565464884        | 0.91155684161123
45648855222 | 98956221818        | 0.00000000000000
45468411516 | 23920390293        | 0.46813185844468
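
For concreteness, a minimal sketch of such a table in SQL. The column names come from the example above; the table name and numeric type are assumptions (DECIMAL(15,14) simply matches the precision of the sample values, while the answer below bases its size estimates on a narrower DECIMAL(8,7)):

```sql
-- Hypothetical sketch only: the table name and exact types are assumptions.
-- DECIMAL(15,14) matches the precision of the sample values above.
CREATE TABLE associations (
    itemnumber           BIGINT         NOT NULL,
    associatedwithitem   BIGINT         NOT NULL,
    associationstrength  DECIMAL(15,14) NOT NULL,
    PRIMARY KEY (itemnumber, associatedwithitem)
);
```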

The size of such a table, however, would be immense:
((700,000 × 700,000) − 700,000) / 2 = 244,999,650,000 ≈ 250 billion rows,
even after throwing out all self-connections (the −700,000 in the formula) and storing each connection only one-way (the division by two in the formula).

I will be running only one type of query as follows:

"Given a list of itemnumber (see table example above), calculate the average (mean) association number."

Each list of itemnumbers run against the association table will generally contain fewer than 1k itemnumbers, but may rarely be as large as 50k. Because each number in a list of 1k itemnumbers is associated with 700k other itemnumbers, such a query would extract 700,000 × 1,000 = 700,000,000 association numbers and would then need the mean of those 700 million values calculated.
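
A hedged sketch of that query, assuming the associations table sketched above (pairs stored one-way) and a hypothetical item_list table holding the submitted itemnumbers:

```sql
-- Sketch under assumed names (associations, item_list); not a tuned implementation.
-- With one-way storage, a pair involving a listed item may sit in either column,
-- so both columns have to be checked.
SELECT AVG(a.associationstrength) AS mean_association
FROM associations AS a
WHERE a.itemnumber         IN (SELECT l.itemnumber FROM item_list AS l)
   OR a.associatedwithitem IN (SELECT l.itemnumber FROM item_list AS l);
```

Filtering on an OR across two different columns like this generally cannot be satisfied by a single index seek, which is one reason the answer below suggests materializing both directions of each pair.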

Any ideas for the following:

  1. Best data management system to hold this table
  2. Structure (250 billion rows vs. 700k rows with a blob containing the association data for each)
  3. Best way to extract data and calculate means

Any input would be helpful.

Best Answer

Short answer: it depends.

Long answer: the answer depends on several factors. Here are a few:

  1. Does your 700K row table get updated infrequently? If so, I would lean toward pre-calculating the means. If they are not static but change in a regular, predictable interval (e.g., monthly or yearly), pre-calculating may still be a solid answer. If they change often or unpredictably, pre-calculating becomes harder to justify. You do mention that the number of rows added per year is relatively small compared to the total.
  2. Do the existing item numbers ever change? If not, that is a factor in favor of pre-calculation. If they do change, pre-calculation becomes a little harder to justify, depending upon the frequency of change.
  3. Do you have the disk space available? Figure that you have 250,000,000,000 rows. Your association table would have two bigints (8 bytes apiece) and one decimal(8,7) (5 bytes), or 21 bytes per row. For 250 billion rows at 21 bytes per row, you're looking at a bit over 5 TB. You'll want to factor in disk space for the transaction log, database backups, possibly having development/QA/staging environments, (probably) having non-clustered indexes on the table, etc. Depending upon your database product, compression could reduce that by a fairly significant amount, but I'd say you would probably want at least 15-25 TB of disk space available before materializing; if you can't get that much disk space, then materializing the relationship just doesn't work. Even if you generate a surrogate 32-bit integer key for each of the 700K items (see the sketch after this list), you're looking at ~13 bytes per row, or roughly 3 TB for that table (plus the additional overhead mentioned above).
  4. How time-sensitive is this process? I don't know how long a regular run would take when you calculate it as part of the script, but if these jobs are something that can run in the background or overnight, then that would be a factor against pre-calculating. Even if you do pre-calculate, you're likely going to scan the 250 billion row table to get your answer, so performance won't be zippy in either event.
  5. How often do these requests come in? If it's once a month, that argues against pre-calculating. But if requests come in daily (or several at once), pre-calculation might be the better option.
  6. What else is this server used for? If you're running this from an active machine, you probably do not want to pre-calculate and select from a big table. The reason is that scanning 5 TB of data would drive pretty much everything else out of memory, hurting the performance of everything else. If, on the other hand, this is a research server, development server, or you don't mind slow performance from the server in the aftermath of running one of these queries, then pre-calculation wouldn't be a big deal.
  7. What is the frequency of item repeats in requests? In other words, does item 23920390293 (for example) show up in a lot of reports, or are the separate requests essentially distinct? If you see common elements, that argues in favor of pre-calculation (because then you only have to calculate once), but if the requests rarely repeat numbers, you don't gain as much from pre-calculation.
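
Regarding point 3, here is a hedged sketch of the surrogate-key variant, with assumed table and column names. A 32-bit key easily covers 700K items and takes each association row from two 8-byte keys down to two 4-byte keys, which is where the ~13 bytes per row comes from:

```sql
-- Assumed names throughout; shown only to illustrate the ~13-bytes-per-row estimate.
CREATE TABLE item (
    item_id     INT    NOT NULL PRIMARY KEY,  -- 32-bit surrogate key (4 bytes)
    itemnumber  BIGINT NOT NULL UNIQUE        -- original external identifier
);

CREATE TABLE association (
    item_id_a            INT          NOT NULL,  -- 4 bytes
    item_id_b            INT          NOT NULL,  -- 4 bytes
    associationstrength  DECIMAL(8,7) NOT NULL,  -- 5 bytes on e.g. SQL Server
    PRIMARY KEY (item_id_a, item_id_b)
);
```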

If you do keep this in a SQL table, I'd recommend against having a blob. I don't think you would get much (if any) size gain over a compressed table, and you might take a performance hit. Even if you use a separate server to perform application-level calculation (instead of doing it in the database), I think it still makes sense to keep the table normalized.

Also, if you have all the disk space in the world, you might want to keep both sides (distance between A and B, and distance between B and A). You double the amount of disk space required, but make the query that much simpler: one join between your item list and the master item relationship table, rather than two separate queries (and the additional difficulty of ensuring that you have all of the relationships between the tables).
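
To illustrate that last point (using the bigint-keyed associations/item_list names assumed in the earlier sketches): mirroring the one-way table is a single, storage-doubling statement, and the mean query then collapses to one join.

```sql
-- One-time mirroring of the one-way table so every pair exists in both
-- directions; this is what doubles the storage requirement.
INSERT INTO associations (itemnumber, associatedwithitem, associationstrength)
SELECT associatedwithitem, itemnumber, associationstrength
FROM associations;

-- With both directions present, the mean for a request list is a single join.
SELECT AVG(a.associationstrength) AS mean_association
FROM item_list AS l
JOIN associations AS a
  ON a.itemnumber = l.itemnumber;
```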