Design / Normalization of a Database

database-design sqlite

I’ve got a question concerning the design of a SQLite database. I’m building a database for analysis purposes that consists of works of literature and metadata concerning those entities.
What I’m asking is whether it makes sense to split my table into several tables, i.e. to build a relational database. Right now I’ve got one main table in which my entities are stored. The columns are a unique title, one for the author (who may well not be unique in the database), a publication year, a publisher (also not unique) and a lot of specialized fields containing tags connected with the work. Those tags are organized in three categories, and each entity can be connected with multiple tags (hence Tag A 1 and so on). One entity would look like this:

Title | Author | Year | Publisher | Tag A 1 | Tag A 2 | Tag A 3 | Tag B 1 | Tag B 2 | Tag B 3 | Tag C 1 | ...

Is it useful to build a table for authors, one for publishers and then a table for each category of tags, and then link those back with foreign keys? What would be the advantages besides no longer duplicating data? I’m going to query the database for things like "Titles with more than 2 manifestations of Tag A from 1980 to 1990" or "Authors using Tags A and B".

All my columns are filled with atomic values. If I’m correct, I’m looking at bringing my database into first normal form by trying to "eliminate repeating groups in individual tables" – is that correct?

Thanks for your help!

Best Answer

It's really up to you whether to normalize or not; it depends largely on the requirements of your implementation and on how you are displaying (or not displaying) the data.

I think the dataset you present would benefit a lot from an alternative structure such as:

      ----------------
      |   Authors    |
      ----------------
      | AuthorId     |<-.
      | Name         |  |
      ----------------  |
                        |
      ----------------  |
      |  Publishers  |  |
      ----------------  |
      | PublisherId  |<-+-.
      | Name         |  | |
      | IsActive     |  | |
      ----------------  | |
                        | |
      ----------------  | |
      |    Tags      |  | |
      ----------------  | |
  .-->|  TagId       |  | |
 | .--|  CategoryId  |  | |
 | |  |  TagValue    |  | |
 | |  ----------------  | |
 | |                    | |
 | |  ----------------  | |
 | |  |  Categories  |  | |
 | |  ----------------  | |
 | `->| CategoryId   |  | |
 |    | CategoryValue|  | |
 |    ----------------  | |
 |                      | |
 |    ----------------  | |
 |    |    Titles    |  | |
 |    ----------------  | |
 | .->| TitleId      |  | |
 | |  | AuthorId     |--` |
 | |  | PublisherId  |----`
 | |  | TitleName    |
 | |  | Year         |
 | |  ----------------
 | |                 
 | |  ----------------
 | |  |   TitleTags  |
 | |  ----------------
 | `--| TitleId      |
 `----| TagId        |
      ----------------

A junction table will store any number of tags per title, and an additional [tag] Categories table can group these tags, so you'll be able to query for entire groups instead of searching tag by tag.

Note that by also separating out the Publishers you can add more fields (like IsActive in the example), so your queries can be more granular and more specific about what you're looking for.
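As a rough sketch of how the diagram above might translate into SQLite, the following uses Python's built-in sqlite3 module. The column types, constraints and sample rows are assumptions made for illustration; only the table and column names come from the diagram. The two queries correspond to the examples in the question ("Titles with more than 2 manifestations of Tag A from 1980 to 1990" and "Authors using Tags A and B"), reading "Tag A" as "any tag in category A".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# Table and column names follow the diagram; types and constraints are assumed.
conn.executescript("""
CREATE TABLE Authors    (AuthorId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Publishers (PublisherId INTEGER PRIMARY KEY, Name TEXT NOT NULL,
                         IsActive INTEGER NOT NULL DEFAULT 1);
CREATE TABLE Categories (CategoryId INTEGER PRIMARY KEY,
                         CategoryValue TEXT NOT NULL UNIQUE);
CREATE TABLE Tags       (TagId INTEGER PRIMARY KEY,
                         CategoryId INTEGER NOT NULL REFERENCES Categories(CategoryId),
                         TagValue TEXT NOT NULL);
CREATE TABLE Titles     (TitleId INTEGER PRIMARY KEY,
                         AuthorId INTEGER NOT NULL REFERENCES Authors(AuthorId),
                         PublisherId INTEGER NOT NULL REFERENCES Publishers(PublisherId),
                         TitleName TEXT NOT NULL UNIQUE,
                         Year INTEGER);
CREATE TABLE TitleTags  (TitleId INTEGER NOT NULL REFERENCES Titles(TitleId),
                         TagId INTEGER NOT NULL REFERENCES Tags(TagId),
                         PRIMARY KEY (TitleId, TagId));

-- Made-up sample data, purely for illustration.
INSERT INTO Authors    VALUES (1, 'Author One'), (2, 'Author Two');
INSERT INTO Publishers VALUES (1, 'Publisher X', 1);
INSERT INTO Categories VALUES (1, 'A'), (2, 'B');
INSERT INTO Tags       VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 1, 'a3'), (4, 2, 'b1');
INSERT INTO Titles     VALUES (1, 1, 1, 'First Work', 1985),
                              (2, 2, 1, 'Second Work', 1995);
INSERT INTO TitleTags  VALUES (1, 1), (1, 2), (1, 3), (1, 4),
                              (2, 1), (2, 2), (2, 3);
""")

# Titles from 1980 to 1990 carrying more than 2 tags of category A.
titles_with_many_a_tags = conn.execute("""
    SELECT t.TitleName
    FROM Titles t
    JOIN TitleTags tt ON tt.TitleId = t.TitleId
    JOIN Tags g       ON g.TagId = tt.TagId
    JOIN Categories c ON c.CategoryId = g.CategoryId
    WHERE c.CategoryValue = 'A' AND t.Year BETWEEN 1980 AND 1990
    GROUP BY t.TitleId, t.TitleName
    HAVING COUNT(*) > 2
""").fetchall()

# Authors whose titles carry tags from both category A and category B.
authors_using_a_and_b = conn.execute("""
    SELECT a.Name
    FROM Authors a
    JOIN Titles t     ON t.AuthorId = a.AuthorId
    JOIN TitleTags tt ON tt.TitleId = t.TitleId
    JOIN Tags g       ON g.TagId = tt.TagId
    JOIN Categories c ON c.CategoryId = g.CategoryId
    WHERE c.CategoryValue IN ('A', 'B')
    GROUP BY a.AuthorId, a.Name
    HAVING COUNT(DISTINCT c.CategoryValue) = 2
""").fetchall()

print(titles_with_many_a_tags)
print(authors_using_a_and_b)
```

With the single wide table from the question, the first query would have to test every "Tag A n" column separately; with the junction table it is one GROUP BY regardless of how many tags a title has.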

Related Solutions

Representing N:N relation as a functional dependency in a database design

The question you raise has to do with the definition of first normal form (1NF). Whether the answer directly involves functional dependencies depends in part on the definitions you accept. Wikipedia has a fairly simple article about 1NF.

title                                author    year  category
-----------------------------------  --------  ----  ---------------------------------------
An Introduction to Database Systems  CJ Date   2003  databases, modeling, storage, retrieval

If you look at the column "category" one way, it contains a single value. Depending on your dbms and your design, that value might be the string "databases, modeling, storage, retrieval", or it might be the array "{databases, modeling, storage, retrieval}".

If you look at the column "category" another way, it contains four values. Those values are the four strings "databases", "modeling", "storage", and "retrieval".

In database design, the solution is to use two tables. But I don't think you can decompose the "bad" table by projection (which CJ Date identifies as the decomposition operator), because projection doesn't split the content of a column into multiple rows. (Projection doesn't give you four rows from the single value "databases, modeling, storage, retrieval", which is what you need to do. "Join", the recomposition operator, doesn't yield a single value like "databases, modeling, storage, retrieval", either.)

The inability to decompose by projection suggests that the solution to this problem doesn't have to do with functional dependencies. The resulting table would have three attributes, {title, author, category}, the only key would also be {title, author, category}, and that table would be in 5NF.
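To make the point concrete, here is a small sketch (Python with the built-in sqlite3 module) of getting from the single-row table to a {title, author, category} table. The table names books_flat and book_categories are made up for this example. Note that the split happens in application code, with an ordinary string split, precisely because a projection over books_flat cannot turn one value into four rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical names: books_flat is the "bad" table, book_categories the result.
CREATE TABLE books_flat (title TEXT, author TEXT, year INTEGER, category TEXT);
INSERT INTO books_flat VALUES
    ('An Introduction to Database Systems', 'CJ Date', 2003,
     'databases, modeling, storage, retrieval');

CREATE TABLE book_categories (
    title TEXT, author TEXT, category TEXT,
    PRIMARY KEY (title, author, category)   -- the only key is all three attributes
);
""")

# Read the flat rows first, then insert one row per comma-separated value.
flat_rows = conn.execute("SELECT title, author, category FROM books_flat").fetchall()
for title, author, category in flat_rows:
    for value in category.split(","):
        conn.execute(
            "INSERT INTO book_categories VALUES (?, ?, ?)",
            (title, author, value.strip()),
        )

categories = [
    row[0]
    for row in conn.execute("SELECT category FROM book_categories ORDER BY rowid")
]
print(categories)
```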

Design Tagging module in the database

What you describe in your second paragraph sounds like the Entity Attribute Value pattern.

This is a known pattern for dealing with the type of data you are talking about. It fits when you need to be extensible, drive the types from the data, and start tracking new types as you grow. It lets you stay flexible and agile and be a little more forgiving and generic in your database design and schema.

This pattern works, even in relational databases, but it does take more work to get good performance, and it can cause you headaches down the line depending on how many rows you have, how many inserts you are doing and how frequently you query it. This SO Question talks about some of the pitfalls nicely in the first answer listed, with 20 votes. I have seen this model work well, especially in environments where new types and entities really do keep coming in and we want that to be more application-driven or user-driven. We were fighting with performance more often in that model than in a traditional relational database model, but we needed the flexibility and we made it work. Definitely look at that SO question and do some more research on EAV and whatever DBMS you are using to find examples from others who have tried the same.
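For readers who have not met the pattern, a minimal EAV sketch in SQLite (via Python's sqlite3 module) might look like the following. The table name entity_attributes, its columns and the sample rows are all invented for illustration and are not taken from the question. It shows both sides of the trade-off: a new attribute type such as 'language' needs no schema change, but one common way to filter on two attributes at once is a self-join per attribute, which is part of why performance takes more care.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical EAV table: one row per (entity, attribute, value) triple.
CREATE TABLE entity_attributes (
    entity_id INTEGER NOT NULL,
    attribute TEXT NOT NULL,
    value     TEXT NOT NULL,
    PRIMARY KEY (entity_id, attribute, value)
);

-- Made-up sample data; 'language' appears without any ALTER TABLE.
INSERT INTO entity_attributes VALUES
    (1, 'genre', 'novel'),
    (1, 'motif', 'sea'),
    (2, 'genre', 'novel'),
    (2, 'language', 'German');
""")

# Entities with genre = novel AND motif = sea: one self-join per extra attribute.
matches = conn.execute("""
    SELECT g.entity_id
    FROM entity_attributes g
    JOIN entity_attributes m ON m.entity_id = g.entity_id
    WHERE g.attribute = 'genre' AND g.value = 'novel'
      AND m.attribute = 'motif' AND m.value = 'sea'
""").fetchall()

print(matches)
```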
