Accumulo table design methodology

database-designnosql

I am just getting started with Accumulo and NoSQL databases and I am looking for some discussion on table design. I get the key value structure that is seen in the manual. However, if I am trying to recreate a relational database, I am not sure how relationships work. Can someone explain to some degree how to setup and "Hello World" database (i.e., manager-employee database). I want to use key-value implementation.

Best Answer

The first thing you should realize is that Accumulo is not a relational database. If you want to make it work like one, you need to create that functionality on top.

One way to do this would be to create rows that look like rows in a relational table, complete with foreign keys that you link together in your application's (an Accumulo client) code. Your client code must enforce the cascading updates/deletes and the other maintenance aspects of relations you may have come to expect from a relational database. Accumulo just stores and sorts Key/Value pairs for you. You decide the semantics of these Key/Value pairs.

Examples:

Note: For illustration purposes, Accumulo Key/Value pairs will be represented as:

RowID:ColumnFamily:ColumnQualifier -> Value

Employee table:

EmployeeID1:Data:Name -> Chuck
EmployeeID1:Data:Position -> Owner

EmployeeID2:Data:Name -> Barbara
EmployeeID2:Data:Position -> StoreManager
EmployeeID2:Data:Manager -> ManagerID1

EmployeeID3:Data:Name -> Tom
EmployeeID3:Data:Position -> SalesManager
EmployeeID3:Data:Manager -> ManagerID2

EmployeeID4:Data:Name -> Sally
EmployeeID4:Data:Manager -> ManagerID3

EmployeeID5:Data:Name -> Rick
EmployeeID5:Data:Manager -> ManagerID3

Manager table:

ManagerID1:Info:EmployeeID -> EmployeeID1
ManagerID1:Employees:1 -> EmployeeID2

ManagerID2:Info:EmployeeID -> EmployeeID2
ManagerID2:Employees:1 -> EmployeeID3

ManagerID3:Info:EmployeeID -> EmployeeID3
ManagerID3:Employees:1 -> EmployeeID4
ManagerID3:Employees:2 -> EmployeeID5

See how I've established foreign keys in both directions? Scanning the Manager table, you can quickly see which employees are managed by whom, but you'll have to look up each EmployeeID in the Employee table to see their information. Scanning the Employee table, you can quickly see information about a particular employee, including the ID of their manager, but you'll have to look up the manager's EmployeeID in the Manager table, and then look that up in the Employee table if you want the name of their manager.

This could be drastically simplified by combining the two tables into a single table and avoiding a separate ManagerID by using EmployeeID for both purposes. Something like:

One Employee table:

...
EmployeeID3:Data:Name -> Tom
EmployeeID3:Data:Position -> SalesManager
EmployeeID3:IsManagedBy:EmployeeID2 -> Since 2001
EmployeeID3:Manages:EmployeeID4 -> Since 2003
EmployeeID3:Manages:EmployeeID5 -> Since 2001
...

However, depending on your application, you may find that in a NoSQL database, you are better off flattening and duplicating your data, rather than normalizing with relations.

Flattened Employee table:

...
Tom:Data:Position -> SalesManager
Tom:IsManagedBy:Barbara -> Since 2001
Tom:Manages:Rick -> Since 2003
Tom:Manages:Sally -> Since 2001
...

Related Solutions

How to design relationships for variant data

This might force a bit of change for your tables, but we do something like this at work:

Ensure each table has a int/guid primary key of some sort:

Organization Table

ID | Name
1  | MyOrganization

Then create a type-table to house your different LeadSource types:

LeadSourceTypeKey | Description
1                 | Organization
2                 | Company Office

Then on your ProductSoldList Table add 2 columns LeadSourceID and LeadSourceType

Then you can easily query the ProductSoldList TABLE and know which LeadSource to join back to:

SELECT *
FROM   [ProductSoldList] PL
       INNER JOIN [Organization] O
       ON PL.LeadSourceID = O.ID AND
          PL.LeadSourceType = 1
       INNER JOIN [Company Office] CO
       ON PL.LeadSourceID = CO.ID AND
          PL.LeadSourceType = 2

Database Normalization – Foreign Key vs Varchar Field

The design choices you describe are not directly related to normalization.

I agree there should be a lookup table.

I think an OrderStatusID value would increase redundancy. The status (text) value presumably already satisfies many of the qualities of a good key: unique, stable, narrow, familiar to users, etc. Referential integrity can be applied to VARCHAR columns, of course! Each application that uses the key can assign it a enum as required and would be responsible for mapping enum values to status (text) values. This would presumably make the lookup table a single column, 'all-key' table (and therefore would satisfy 6NF, the highest normal form ;)

[If OrderStatusID is an attribute in the Order table then it would not be in 6NF but, as I say, I don't think you are actually asking about normalization at all.]

Best Answer

Examples:

Related Solutions

How to design relationships for variant data

Database Normalization – Foreign Key vs Varchar Field

Related Question