I have an entity with millions of instances. Each instance has to reference an ordered list of items. The lists have to be unique so no list will be stored more than once. But once created, both lists and entity instances are immutable. There will be far more entity instances than lists, and the database has to support fast insertions of entities.
So what's an insert-efficient, robust way of storing immutable, unique, ordered lists?
Edit: The list items are simple integers, and the typical length is about 5 items. Long lists of, say, 10 or 20 items are very unlikely, but possible.
Edit: So far, I've considered these approaches:
1) The lists table has these columns: <list_id> <order> <item>. So if list #5 contains the elements [10,20,30], the table will contain:
5 1 10
5 2 20
5 3 30
The entity table will have an item_list_id column that references the lists table (it's not a foreign key, since list_id is not a unique column in the lists table; this can be solved by adding another table with a single column containing all valid list_ids).
- This solution makes inserts a bit tricky
- It also places the responsibility for enforcing the uniqueness of lists on the application, which isn't great.
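As a rough sketch of option 1 (using SQLite via Python purely for illustration; table and column names are my own), including the extra id table so item_list_id can be a real foreign key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- extra single-column table of valid ids, as described above,
    -- so that item_list_id can be a real foreign key
    CREATE TABLE list_ids (list_id INTEGER PRIMARY KEY);
    CREATE TABLE lists (
        list_id INTEGER NOT NULL REFERENCES list_ids(list_id),
        ord     INTEGER NOT NULL,   -- "order" is a reserved word
        item    INTEGER NOT NULL,
        PRIMARY KEY (list_id, ord)
    );
    CREATE TABLE entities (
        entity_id    INTEGER PRIMARY KEY,
        item_list_id INTEGER NOT NULL REFERENCES list_ids(list_id)
    );
""")

# Storing list #5 = [10, 20, 30] takes one id row plus one row per element,
# which is part of why inserts are "a bit tricky": several statements per list,
# and nothing in the schema stops the same list being stored twice under two ids.
conn.execute("INSERT INTO list_ids VALUES (5)")
conn.executemany("INSERT INTO lists VALUES (?, ?, ?)",
                 [(5, 1, 10), (5, 2, 20), (5, 3, 30)])
rows = conn.execute("SELECT * FROM lists ORDER BY ord").fetchall()
print(rows)   # -> [(5, 1, 10), (5, 2, 20), (5, 3, 30)]
```

Note that the schema can enforce referential integrity here, but not list uniqueness; that check has to live in application code (or a stored procedure, as in the answer below... which is the drawback noted above).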
2) The lists table has these columns: <list_id> <item1> <item2> <item3> ... <itemN>. So if list #5 contains the elements [10,20,30], the table will contain:
5 10 20 30
The entity table will have an item_list_id column that references the lists table.
- This solution is less robust since list lengths are limited (although this isn't a huge problem for me since my lists are VERY unlikely to contain more than 10 or 20 elements)
- This approach is quite horrible to query ("find all lists in which a particular item appears" has to specify each and every column), and a nightmare to map using an ORM.
- Insertions of new entities are not too bad since my lists are typically 4-5 items long, so I can probably index the first few columns.
- Enforcement of uniqueness is still in the hands of the application.
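A small sketch of why option 2 is awkward to query (again SQLite via Python, illustrative names, N fixed at 5): "find all lists in which item 20 appears" has to name every column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A fixed number of item columns caps the list length (here N = 5);
# shorter lists pad with NULLs.
conn.execute("""
    CREATE TABLE lists (
        list_id INTEGER PRIMARY KEY,
        item1 INTEGER, item2 INTEGER, item3 INTEGER,
        item4 INTEGER, item5 INTEGER
    )
""")
conn.execute("INSERT INTO lists VALUES (5, 10, 20, 30, NULL, NULL)")

# "Find all lists in which item 20 appears" must enumerate every column,
# and grows every time N grows:
rows = conn.execute("""
    SELECT list_id FROM lists
    WHERE item1 = 20 OR item2 = 20 OR item3 = 20
       OR item4 = 20 OR item5 = 20
""").fetchall()
print(rows)   # -> [(5,)]
```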
3) Use solution #1, but replace the meaningless list_id with a hash (say, SHA-1) of the list in serialized form.
- This way uniqueness is more strictly enforced: identical lists produce the same hash, so a unique constraint on the hash column lets the database itself reject duplicates (hash collisions between distinct lists are theoretically possible but vanishingly unlikely).
- Insertions are simpler and quicker(?)
- The data integrity enforcement is still in the hands of the application.
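The hash-as-key idea from option 3 might be sketched like this (the serialization format is my own choice; the important property is that it is deterministic and unambiguous, so that e.g. [1,23] and [12,3] don't collide):

```python
import hashlib

def list_key(items):
    """Derive a deterministic key for an ordered list of integers by
    hashing a canonical serialization with SHA-1.  Identical lists always
    map to the same key, so a UNIQUE / PRIMARY KEY constraint on this
    column lets the database reject duplicate lists."""
    # Comma-separated decimal keeps the serialization unambiguous:
    # [1, 23] -> "1,23" vs [12, 3] -> "12,3".  Order is preserved.
    serialized = ",".join(str(i) for i in items)
    return hashlib.sha1(serialized.encode("utf-8")).hexdigest()

k1 = list_key([10, 20, 30])
k2 = list_key([10, 20, 30])   # same list, same key
k3 = list_key([10, 30, 20])   # same items, different order: different key
```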
Best Answer
I would go with a traditional set-based approach for this. Create entity, list, entityList, and listItem tables to store your data. Individual list and entity tables allow you to create lists without entities and vice versa. Constraints on these tables such as unique constraints and foreign keys will help preserve the integrity of the data.
For the list, create a single stored procedure which is the single point for inserting lists and listItems. This proc has logic to detect if a list already exists and serves as a gateway. This is straightforward to do using set operators like EXCEPT.
Here's a simple demo which shows how something like this might work. I'm using a table-valued parameter (TVP) to handle passing lists around and a scalar function to determine whether a list already exists. Scalar functions are evil when used in result sets (e.g. in SELECT, WHERE clauses, or JOINs) but are fine when used for what they are meant for: returning single values. Spend some time working through it and see if it could work for you.
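The original T-SQL demo isn't reproduced here, but the gateway idea can be sketched as follows (SQLite via Python standing in for SQL Server, a temp table standing in for the TVP, and illustrative names throughout): one entry point that uses EXCEPT in both directions to test whether an identical (position, item) set is already stored, and only inserts if not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE list     (listId INTEGER PRIMARY KEY);
    CREATE TABLE listItem (
        listId   INTEGER NOT NULL REFERENCES list(listId),
        position INTEGER NOT NULL,
        item     INTEGER NOT NULL,
        PRIMARY KEY (listId, position)
    );
    -- temp table standing in for the TVP that carries the candidate list
    CREATE TEMP TABLE candidate (position INTEGER, item INTEGER);
""")

def get_list_id(items):
    """Single gateway for inserting lists: return the id of an existing
    identical list, or create a new one.  Two EXCEPT queries (one in each
    direction) test set equality of the (position, item) pairs."""
    conn.execute("DELETE FROM candidate")
    conn.executemany("INSERT INTO candidate VALUES (?, ?)",
                     [(i + 1, v) for i, v in enumerate(items)])
    for (lid,) in conn.execute("SELECT listId FROM list").fetchall():
        extra = conn.execute("""
            SELECT position, item FROM candidate
            EXCEPT
            SELECT position, item FROM listItem WHERE listId = ?
        """, (lid,)).fetchone()
        missing = conn.execute("""
            SELECT position, item FROM listItem WHERE listId = ?
            EXCEPT
            SELECT position, item FROM candidate
        """, (lid,)).fetchone()
        if extra is None and missing is None:
            return lid          # identical list already stored
    cur = conn.execute("INSERT INTO list DEFAULT VALUES")
    lid = cur.lastrowid
    conn.executemany(
        "INSERT INTO listItem (listId, position, item) VALUES (?, ?, ?)",
        [(lid, i + 1, v) for i, v in enumerate(items)])
    return lid

a = get_list_id([10, 20, 30])
b = get_list_id([10, 20, 30])   # deduplicated: same id as a
c = get_list_id([10, 30, 20])   # different order -> different list
```

Because the EXCEPT comparison includes the position column, two lists with the same items in a different order correctly count as different lists. A real SQL Server version would wrap this in a stored procedure and a transaction so concurrent callers can't both insert the same list.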
I'm a big proponent of XML, but it is the wrong choice for this design. XML can be appropriate for semi-structured data, but your design is actually highly structured. The overhead of constantly shredding XML for comparison would be too much, plus you can't easily compare one piece of XML directly with another in SQL Server. Where XML would be useful here is in visualising your lists, as I've done at the end of my demo.
HTH