Are there any inherently non-partitionable datasets

partitioningscalability

Are there any practical datasets that, by their nature, cannot be partitioned in any practical way?

I am wondering about this because such datasets would be ill-suited to scaling out on commodity hardware and databases containing such data could be forced to scale up on expensive servers, at great cost.

Edit: by nature, I mean that the lack of partitionability is a consequence of the data and of the queries that must be made against it, rather than poor design choices.

Best Answer

Interesting question about the nature of data itself - we could almost stray into philosophy here!. As Phil Sumner points out, it would be a bad idea to try and do this using fields which always have a similar value - i.e. if one had a table ("Open_Source_Programmers", or even "Programmers" in general) that was partitioned on gender, the partitions would be very skewed.

See here for example - lots of articles about this (I Googled "how many open source programmers are female"), but as far as I have seen, it's certainly not above 5% and is probably closer to the 1.5% mentioned in the article I've quoted.

Similarly, I imagine that partitioning people who self-identified as a "programmer" would be very skewed on an age criterion. IMHO, few would fall outside 15-30.

From Codd's laws (The guaranteed access rule - sometimes called law 1 or 2 depending on whose definition you're looking at), every datum should be accessible "by specifying the name of the containing table, the name of the containing column and the primary key value of the containing row", every table should have a primary key.

By definition, if you, as a database practitioner are looking at a given dataset, it must be interesting to you (or, maybe more often, the business you're working for) for a reason. These business reasons will dictate your partitioning logic. Holding truly random data in a database makes no sense. If it were truly random, what criteria would you use to query it? How would you make sense of any results? A given datum (or data point) will (or should) have some sort of independent variable associated with it (time would probably be the main one).

But, to go back to my Open Source contributors example, any logical decision to partition the data would not include gender - maybe "Continent"? As far as I can see, a good first pass would be to go with "North America", "Europe" and "ROW" (rest of world).

Your question begs another, perhaps more fundamental, question. If data is not partitionable, is it data at all? Furthermore, when you query data, you are, ipso facto, partitioning it in some shape or form (SELECT * FROM.. with no WHERE clause excluded).

Information theory, philosophy are all going to come into play here (I've never formally studied either), but I'm very interested in this on an intellectual level and would be delighted for anybody to critique what I have written and point out any (hopefully not obvious) flaws.