I have a table of approximately 700k rows, each identified by a unique itemnumber.
Each row/item can be associated with any other row/item in the table. The association is expressed as a single numerical value indicating the strength of the association, or the mathematical "distance" between two items: "0" represents no association (infinite distance), while "1" represents the same item (no distance).
These association numbers are expensive to compute and are based on data stored in a separate database. It therefore seems to make sense to precalculate them once for all row combinations, and then only for new rows as they are added (<4k new rows per year).
The resulting table of associations might look like this:
itemnumber | associatedwithitem | associationstrength
23920390293 | 12356456885 | 0.12255888644888
45468411516 | 44565464884 | 0.91155684161123
45648855222 | 98956221818 | 0.00000000000000
45468411516 | 23920390293 | 0.46813185844468
The size of such a table, however, would be immense:
((700,000 x 700,000) - 700,000) / 2 = 244,999,650,000, i.e. approximately 245 billion rows,
even after throwing out all self-connections (the -700,000 in the formula) and storing each connection only one way (the division by two in the formula).
I will be running only one type of query as follows:
"Given a list of itemnumber (see table example above), calculate the average (mean) association number."
Each list of itemnumbers run against the association table will generally contain fewer than 1k itemnumbers, but may rarely be as large as 50k. Because each number in a list of 1k itemnumbers is associated with 700k other itemnumbers, such a query would extract 700,000 x 1,000 = 700,000,000 association numbers and would then need the mean of those 700 million values calculated.
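For concreteness, the query I have in mind would look roughly like this (sketch only; "itemlist" is a placeholder for whatever temporary table holds the input list, and both columns have to be checked because each pair is stored only one way):

```sql
-- Rough sketch: "itemlist" holds the input itemnumbers, "associations" is the
-- precalculated pair table shown above. Both columns are checked because each
-- pair is stored only one way.
SELECT AVG(s) AS mean_association
FROM (
    SELECT a.associationstrength AS s
    FROM associations a
    JOIN itemlist l ON l.itemnumber = a.itemnumber
    UNION ALL
    SELECT a.associationstrength
    FROM associations a
    JOIN itemlist l ON l.itemnumber = a.associatedwithitem
) pairs;
```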
Any ideas for the following:
- Best data management system to hold this table
- Structure (~245 billion individual rows vs. 700k rows, each with a blob containing that item's association data)
- Best way to extract data and calculate means
Any input would be helpful.
Best Answer
Short answer: it depends.
Long answer: it depends on several factors. Here are a few thoughts:
If you do keep this in a SQL table, I'd recommend against using a blob. I don't think you would gain much (if any) space over a compressed table, and you might take a performance hit. Even if you use a separate server to perform the calculation at the application level (instead of doing it in the database), I think it still makes sense to keep the table normalized.
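As a rough sketch of what I mean by a normalized table (column types are just examples; adjust to your platform):

```sql
-- Sketch only: one row per precalculated pair.
CREATE TABLE associations (
    itemnumber          BIGINT NOT NULL,
    associatedwithitem  BIGINT NOT NULL,
    associationstrength DOUBLE PRECISION NOT NULL,
    PRIMARY KEY (itemnumber, associatedwithitem)
);
```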
Also, if you have all the disk space in the world, you might want to keep both sides (the distance between A and B, and the distance between B and A). You double the amount of disk space required, but the query becomes that much simpler: one join between your item list and the master item relationship table, rather than two separate queries (and the additional work of making sure you have covered all of the relationships between the tables).
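With both directions stored, the mean for an input list reduces to that single join (again a sketch; "itemlist" stands for whatever temporary table or values list holds the input itemnumbers):

```sql
-- Single-join version, assuming every (A, B) row also exists as (B, A).
SELECT AVG(a.associationstrength) AS mean_association
FROM associations a
JOIN itemlist l ON l.itemnumber = a.itemnumber;
```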