SQL Server – Grouping data in many-to-many relationships to improve performance

database-design, sql-server

I have a "classic" SQL Server three table many-to-many in the form

PC -< [PCHasSoftware] >- Software

I am having issues with data volumes and performance in a very large database scenario, so am exploring a further redesign to improve performance and cut storage, etc.

Many PCs will have the same software titles, so could we cut storage by reorganizing those titles into groups? Then instead of saying

"this PC has these 16 titles" - needing 16 rows per PC

we have instead:

"this PC has this group of 16 titles" needing 16 rows somewhere and only 1 row per PC.

So if 10,000 PCs have these 16 software titles, I'm storing 10,016 rows instead of 160,000 rows.

Two initial thoughts on that idea.

  1. Build software to identify the 8 most commonly occurring groups, then retain 8 bit flags on the PC table, and have a "template" PC which holds what those titles are in the same structure

  2. Organize into groups

    PC -< [PCHasSoftwareGroup] >- [Software Group] -< Software
    

or (edit)

    PC -< [PCHasSoftwareGroup] >- [Software Group] -< GroupHasSoftware >- Software
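
To make option 2 concrete, here is a minimal sketch of the extra tables (all names are my assumptions, including the key columns PC(PCID) and Software(SoftwareID); adjust to the actual schema):

    -- Sketch of option 2. PC and Software are assumed to already exist
    -- with int surrogate keys PCID and SoftwareID.
    CREATE TABLE SoftwareGroup (
        SoftwareGroupID int IDENTITY PRIMARY KEY,
        GroupName       nvarchar(100) NOT NULL
    );

    CREATE TABLE GroupHasSoftware (
        SoftwareGroupID int NOT NULL REFERENCES SoftwareGroup (SoftwareGroupID),
        SoftwareID      int NOT NULL REFERENCES Software (SoftwareID),
        PRIMARY KEY (SoftwareGroupID, SoftwareID)
    );

    CREATE TABLE PCHasSoftwareGroup (
        PCID            int NOT NULL REFERENCES PC (PCID),
        SoftwareGroupID int NOT NULL REFERENCES SoftwareGroup (SoftwareGroupID),
        PRIMARY KEY (PCID, SoftwareGroupID)
    );

    -- The original PCHasSoftware table stays, holding titles that don't
    -- belong to any group.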

The difficulties with these models I perceive are:

  • How to identify which groups are most popular. Probably a chunk of SQL that runs in an overnight batch job to "groupize" the standard model. I wondered if any clever "dba + maths" people have come up with algorithms that could do this (a naive starting point is sketched just after this list)

  • When one of those 16 titles is removed from a PC, you suddenly have to remove the group relationship and insert the remaining 15 rows as non-grouped detail rows, so potentially a lot of traffic for one update (a sketch of that ungrouping step also follows this list)

  • How to combine this with software which does not fall into any logical grouping. I'm not a fan of using UNION because SQL Server only seems guaranteed to generate good query plans when you keep things like that out, but that doesn't mean I won't use it if it's the best solution!

  • There will be some data that is specific to this PC, such as DateInstalled, DateLastSeen, etc. So we'll still have to store some data at the PC-software level, but this could still reduce data volumes compared to the data model described at the top of this post
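
Regarding the first bullet, one naive starting point (my sketch, not a finished algorithm) is to build a canonical "signature" per PC of its exact software set and count how many PCs share each signature; the most frequent signatures become candidate groups. This assumes SQL Server 2017+ for STRING_AGG:

    -- Count how many PCs share exactly the same set of software titles.
    -- CAST to varchar(max) so long title lists don't overflow STRING_AGG's
    -- default 8000-byte limit.
    WITH PCSignature AS (
        SELECT phs.PCID,
               STRING_AGG(CAST(phs.SoftwareID AS varchar(max)), ',')
                   WITHIN GROUP (ORDER BY phs.SoftwareID) AS Signature
        FROM   PCHasSoftware AS phs
        GROUP BY phs.PCID
    )
    SELECT TOP (8)
           Signature,           -- ordered, comma-separated SoftwareIDs
           COUNT(*) AS PCCount  -- PCs having exactly this set
    FROM   PCSignature
    GROUP BY Signature
    ORDER BY COUNT(*) DESC;

Note this only finds exact matches; spotting sets that are "almost" the same is a frequent-itemset mining problem and needs heavier machinery.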
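
And for the second bullet, the "ungroup" traffic might look something like this (parameter names and values are illustrative, and it assumes PCHasSoftware has no further mandatory columns):

    -- Hypothetical ungrouping of one PC when a single title is removed.
    DECLARE @PCID              int = 123,
            @SoftwareGroupID   int = 7,
            @RemovedSoftwareID int = 42;

    BEGIN TRANSACTION;

        -- Re-materialise the group's remaining titles as plain detail rows
        INSERT INTO PCHasSoftware (PCID, SoftwareID)
        SELECT @PCID, ghs.SoftwareID
        FROM   GroupHasSoftware AS ghs
        WHERE  ghs.SoftwareGroupID = @SoftwareGroupID
          AND  ghs.SoftwareID <> @RemovedSoftwareID;

        -- Then drop the group membership itself
        DELETE FROM PCHasSoftwareGroup
        WHERE  PCID = @PCID
          AND  SoftwareGroupID = @SoftwareGroupID;

    COMMIT;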

This may have other applications, and I imagine it has been done before. For example, in an order entry system, if it's very common for a certain combination of products to appear on orders, you could represent that combination as a single order line in the database and expand it during querying.

I'm just kicking around ideas here and would appreciate any thoughts. I wasn't even sure what to google for, but I've tried for the last week to find other examples of this and failed.

Best Answer

It may be worth looking at things from the other direction.

I suspect that you have a large number of PCs that are set up with exactly (or almost exactly) the same set of software. You may want to consider establishing a set of PC "profiles": a set of software that's going to be standard for hundreds (or even thousands) of PCs.

This has most of the same benefits and drawbacks as your current plan, but might let you affect more rows more easily. It might also make certain kinds of changes easier: in many cases, a new or updated piece of software will ultimately be deployed to all the machines; you can create a new profile and move PCs to it as the software change is applied.
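
As a rough sketch of what that could look like (table and column names are mine, building on the asker's model):

    CREATE TABLE PCProfile (
        PCProfileID int IDENTITY PRIMARY KEY,
        ProfileName nvarchar(100) NOT NULL
    );

    CREATE TABLE ProfileHasSoftware (
        PCProfileID int NOT NULL REFERENCES PCProfile (PCProfileID),
        SoftwareID  int NOT NULL REFERENCES Software (SoftwareID),
        PRIMARY KEY (PCProfileID, SoftwareID)
    );

    -- Each PC points at (at most) one profile...
    ALTER TABLE PC ADD PCProfileID int NULL REFERENCES PCProfile (PCProfileID);

    -- ...so moving a PC onto a new profile as the rollout reaches it is
    -- a single-column update:
    -- UPDATE PC SET PCProfileID = @NewProfileID WHERE PCID = @PCID;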

Functionally, this would look much the same as what you're talking about; I simply think the perspective helps nail things down a bit.

If you are going to create application groups, I would not generate them directly from what's on PCs today. That's useful information, but it should be examined carefully to identify meaningful groupings, not just "accidental" ones. It's better to create four different groups where each one actually contains software that has a logical reason to be on the same machine together, than to create just two groups, each of which contains some things that may be found together, but not on purpose.

Finally, note that this sort of thing always involves a trade-off. You may get better performance because you're dealing with smaller tables, but your queries will involve more tables and/or more joins, and be more complicated. The end result might actually be worse performance.
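
To illustrate, "what software is on this PC" under the grouped model might become something like the following, with two extra joins plus a UNION against the ungrouped detail rows (names carried over from the sketches above; Software.Title is assumed):

    DECLARE @PCID int = 123;  -- illustrative

    -- Titles the PC gets via its group memberships
    SELECT s.SoftwareID, s.Title
    FROM   PCHasSoftwareGroup AS pg
    JOIN   GroupHasSoftware   AS gs ON gs.SoftwareGroupID = pg.SoftwareGroupID
    JOIN   Software           AS s  ON s.SoftwareID = gs.SoftwareID
    WHERE  pg.PCID = @PCID

    UNION  -- de-duplicates in case a title is both grouped and direct

    -- Titles recorded directly against the PC
    SELECT s.SoftwareID, s.Title
    FROM   PCHasSoftware AS ps
    JOIN   Software      AS s ON s.SoftwareID = ps.SoftwareID
    WHERE  ps.PCID = @PCID;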