Sql-server – better way to handle a multi-level ParentId table structure

database-designperformancequery-performancesql server

I work for a publisher, and our products are mainly books and journals. They are most commonly structured as follows:

Book > Chapter

Book Series > Book > Chapter

Book > Volume > Chapter

Book Series > Book > Volume > Chapter

Journal > Volume > Issue > Article

Journal > Volume > Article

We currently have all of these records stored in the same table with Id and ParentId columns. For example, a book with TitleId = 1 that has 3 chapters would have the following rows:

Book: Id = 1, ParentId = 1
Chapter #1: Id = 2, ParentId = 1
Chapter #2: Id = 3, ParentId = 1
Chapter #3: Id = 4, ParentId = 1

All these records, whether they're books, chapters, journals, articles, etc., can have their Id connected to other tables for things like authors, prices, ownership, etc.

The problem this structure creates for us is that the nesting adds tons of overhead in certain situations. For example, if someone tries to access a journal article they purchased, we need to run multiple queries to know if they do, in fact, have access. We have an ownership table containing Id's of owned products, so we not only need to check if the user owns the journal article's Id directly, but we also need to check for ownership of the Id's for the article's parent issue, volume, and journal. (I.e., if the user owns the entire journal, it's implied they own all of the articles within that journal despite not having explicit access.) So our main "ownership" query is rather bulky since we need it to check for ownership at all possible levels of nesting.

There are many other similar situations. For example, if we need to get all the authors associated with a book, we need to run a query for the book directly, then walk down to each chapter and get all the authors for each one, or if the book has volumes, we need to walk down each volume, and get the authors for each volume's chapters.

Another situation is searching, where we need to basically aggregate all these different types to be seen as relatively equal so a search can be performed on them, and the results, whether they're a book, chapter, journal, articles, etc., all need to appear alongside each other in the search results.

What I'm wondering is if there's a way to improve our setup so we can get relational data faster (things like ownership checks, authors, prices, etc.), by either:

  1. Restructuring our main table to get away from the Id/ParentId setup, or,
  2. Building new tables/views off to the side.

I'm interested in any comments/advice/recommendations anyone has about our current setup, or proposing a new setup.

Best Answer

I would always store the intermediate entities, possibly by adding missing ones as dummy entries, so that you can query using always the same query.

E.g., store

Book > Chapter

as

Book Series (dummy) > Book > Volume (dummy) > Chapter

Now, you can query

SELECT * 
FROM
   BookSeries bs
   LEFT JOIN Book b      ON bs.BookSeriesID = b.BookSeriesID 
   LEFT JOIN Volume v    ON b.BookID = v.BookID
   LEFT JOIN Chapter c   ON v.VolumeID = c.VolumeID

where BookSeries and Volume will yield NULL columns, except for primary and foreign keys.

So, always store starting with BookSeries and end at the desired level. So If you want to store a book with no volume and no chapter, this is fine. The query will yield NULL columns for volumes and chapters (inlcuding for primary and foreign keys). This is ensured by the LEFT JOIN.

Journals have a different structure. Here two approaches are possible:

  1. Store journals as independent hierarchy in other tables.
  2. (partially) unify the two hierarchies and use a Type or Kind field as a discriminator to differentiate between books and journals
    BookSeries > BookOrJournal > Volume +--> Issue > Article
                                        |
                                        \--> Chapter