Database partitioning – Horizontal and Vertical sharding – Difference between Normalization and Row Splitting

database-designdatabase-tuningpartitioning

I am trying to grasp the different concepts of Database Partitioning and this is what I understood of it:

Horizontal Partitioning/Sharding: Splitting a table into different tables that will contain a subset of the rows that were in the initial table (an example that I have seen a lot if splitting a Users table by Continent, like a sub table for North America, another one for Europe, etc…). Each partition being in a different physical location (understand 'machine'). As I understood it, Horizontal Partitioning and Sharding are the exact same thing(?).

Vertical Partitioning: From what I understood (http://technet.microsoft.com/en-us/library/ms178148%28v=sql.105%29.aspx ), there are 2 sorts of Vertical Partitioning:

  • Normalization (which consists of removing redundancies from a the database by splitting tables and linking them with a foreign key).

  • Row Splitting, here is what I don't understand, what is the difference between Normalization and Row Splitting? In what those 2 techniques differ from each other?

I have also read in this post (https://stackoverflow.com/questions/11707879/difference-between-scaling-horizontally-and-vertically-for-databases ) that the difference between Horizontal Partitioning and Vertical Partitioning is that in the first you scale by adding more machines, while in the second one you scale by adding more power (CPU, RAM) to your existing machine, is that a correct definition? I thought that the core difference between those 2 techniques resides in the way you split your tables.
This answer does make sense according to MongoDB's definition of Vertical Partitioning:
http://docs.mongodb.org/manual/core/sharding-introduction/

but it goes in contradiction with other answers or articles I have come across:

https://stackoverflow.com/questions/18302773/what-are-horizontal-and-vertical-partitions-in-database-and-what-is-the-differen
http://technet.microsoft.com/en-us/library/ms178148%28v=sql.105%29.aspx
http://building.wanelo.com/post/42361472646/the-case-for-vertical-sharding

I am sorry for the load of questions but I am a bit confused as a lot of different websites that I have came across say different things.

Any help clarifying would be greatly appreciated. Any link to a clear and simple demonstration with a few tables would also be very helpful.

Best Answer

Although Normalization and partitioning both produce a rearrangement of the columns between tables they have very different purposes.

Normalization is first considered during logical datamodel design. It is a set of rules which ensure that each entity type has a well-defined primary key and each non-key attribute depends solely and fully upon that primary key.

Partitioning comes in during physical database design, when we start to map logical attributes to physical columns and determine the operational characteristics required from the system. Sometimes it is an optimisation added after testing under load because performance was found to be inadequate. It can also play a role in implementing a data retention policy.

In partitioning we recognise that a table is made from rows and columns. When we partition we separate some of those rows (or columns) from the others and hold them in a physically different location.

Horizontal partitioning is when some rows are stored in one table, and some in another. There could be many sub-tables. A typical example is when currently-active transactional data is separated from old "archive" data. This keeps "hot" data compact, with associated performance improvements. We many be able to make the archive tables read-only, compressed and on cheaper disk, too.

As the next step each partition may be moved onto separate hardware. This is commonly know as "sharding." Advantages include being able to use many cheaper boxes rather than one very large, very expensive server, and being able to position a user's data geographically close to her. The cost is increased application complexity. Some DBMS incorporate this ability natively.

Vertical partitioning is when some columns are moved to a different table or tables. Similar to horizontal partitioning the motivation is to keep the "hot" table small so access is faster. Say you run an e-marketing company. 99% for the time you need a person's name and email address and nothing else. These will go in one table and all the other stuff which is useful but seldom-used - birthday, golf handicap, PA's phone number etc. - go in a different table. It can also help when the partitions have different update regimes or are owned by different sections of the business. The two tables can have the same primary key column, and corresponding rows could have the same key value. While it is possible to have multiple vertical partitions for a table, and to shard vertically, I've never come across it.

Vertical and horizontal partitioning can be mixed. One may choose to keep all closed orders in a single table and open ones in a separate table i.e. two horizontal partitions. For the open orders, order data may be in one vertical partition and fulfilment data in a separate partition.

The techniques I've talked about are ways to change the design to improve performance. Scaling is when you change the hardware. One can scale up by buying a bigger box with more RAM, CPU or faster disk, or scale out by moving some of the work onto a different box. Scale up is sometimes called scaling vertically whereas scale out can be called horizontal scaling. While horizontal scaling and sharding have an obvious relationship they are not synonymous. It would be possible to use replication technologies to copy an entire database to another location for use by the users there, thus achieving scale-out, without having to partition any tables.