SQL Server – Database Design for Unused Columns with Same Schema

database-designschemasql serversql-server-2012

I am a web developer so I don't know a lot about databases. The company I joined recently has a very mature Desktop ERP built in .NET and SQL Server, they are providing services to huge corporate clients and there design is working fine. But they didn't develop any web based system. But there database design is quite unusual. Let me explain it and then I will post my questions.

So, Now I joined them to develop them a web based ERP (a replica of there desktop system in web). Since I am building the application from scratch, they have given me liberty to revamp any thing which I think would effect positively.

Now the design is,

They have around 150 tables in database.
Every table has the same schema definition.
They have divided the fields in three categories.
- Strings (so they assign 50 varchar(250) fields in database).
- DateTime (so they assign 15 smalldatetime fields in database).
- Numeric (so they assign 30 Numeric() fields in database).
All columns have names as (these names don't terrify the developers, in a week or two they are accustomed and even remembered many of the fields associations):
- Strings (S1, S2, S3, S4 and so on).
- DateTime (D1, D2, D3, D4 and so on).
- Numerics (N1, N2, N3, N4 and so on).
As I have told you the schema. Every table consist of 95 columns. And only 15-20 columns are actually been used. The remaining 75-80 columns are NULL.
The tables are well normalized and indexes are maintained.
Number of rows in most of the tables are less than 1000. Only the transaction table records touches several hundred thousands.
The precision for numeric columns are by default (1, 0). When any field is selected to be used then the precision is adjusted as per the requirement.
An empty database is of ~4MB.
This design makes there development quite easy. Since they have a number of columns and whenever they need a field. They just select the data type, i.e. String or Numeric or DateTime and the next available column is assigned.

I think this information is quite enough. Now I want to ask

Since, I don't know a lot about SQL. Is this design viable for web environment (web API which will be called from web client as well as from mobile)?
Since, every table has 75-80 NULL columns. Does they cost us a lot of memory in future, when transaction records will touch millions.
What are your suggestions to improve this design?

THANKS.

Best Answer

1) Yes. It's a maintenance/documentation nightmare but technically there's no reason it wouldn't work.

2) In general each null will be one bit of storage. So 80 null fields might be 10 bytes per row. The full answer is that it varies depending on data type but with varchar for the most part it's a good rule of thumb.

Some alternatives where you expect fields to be null, are column sets (see https://msdn.microsoft.com/en-AU/library/cc280521.aspx); where the engine munges those often unused columns into a single XML column. I don't think it's a very good fit for your purposes because it messes up indexing and introduces a bunch of limitations - but it's something to be aware of.

I don't think performance is going to be affected though. It will mean that you'll have additional lookups if queries are doing select * instead of just the columns they need (and so it would be hard to create proper covering indexes). But that's a generic problem and not specific to this design.

3) I think you should probably try to forget refactoring it and just go ahead building your application as best as you can with what you have.

A database developer would probably start by importing one of the existing databases into SSDT (a Visual Studio database project) which will create the schema, then they'd do any minor fixes required to get a build going, and see that it can create a usable empty database for use with the software.

After that they'd go over the source code and start using SSDT's right click refactor functionality to give the tables and columns proper names; which will rename all the procedures/functions etc that use it.

This still leaves any dynamic SQL broken, external reports broken, and anything that lives outside of the database broken (the application, web services, queries from customers with direct access!) This means it's a pretty big and collaborative effort to fix.

I'm almost certain they will run across fields which have multiple meanings - because that's what happens when people build software this way. Luckily if they were doing all of this then they can use that opportunity to split dual-usage fields into separate fields.

But can you do that? I imagine it's outside of your scope, and businesses generally don't like paying for such things when "why would you need to do that, the software works as-is!"

Related Solutions

Database Design – New Table vs New Columns

What you are wrestling with is vertical partitioning. This is a physical database design technique to improve performance. As with any physical database design technique, its applicability depends on the specific queries you are trying to optimize and if this technique will optimize them. From a logical standpoint, if these new fields depend upon the candidate key for your entity then they are facts about it that belong with it. First you should make sure you fully understand the functional dependence of these new fields on your candidate keys to verify they really are facts about daily page views. If they are, deciding to partition them into another table is a performance optimization that should only be done if it achieves your performance goals.

In general, vertical partitioning is useful if you will query these new columns infrequently and distinctly from the other columns in the original table. By placing those columns in another table that shares the same PK as your existing table, you can query it directly when you want those new columns and get much greater through-put as you will have many more rows per page on disk for this new table as all the columns from the original table won't be sitting on those rows. However, if you will always query these columns along with the columns in the original table then a vertical partition wouldn't make much sense as you will always have to outer join to get them. Pages from tables on disk come into the buffer pool of a DBMS independently, never pre-joined, and so that join will have to happen with every query execution even if the data is pinned in the buffer pool. In this scenario making them NULLABLE columns on the original table would enable the DBMS storage engine to store them efficiently when NULL and eliminate the need to join on retrieval.

It sounds to me like your use case is the latter and adding them as NULLABLE to your original table is the way to go. But as with everything else in database design, it depends, and in order to make the right decision you need to know your expected workload and what making a good choice depends on. One good example of a proper use case for vertical partitioning would be a person search panel, where your application has some very rarely populated information about a person that someone might want to search on but rarely does. If you put that information into a different table you have some good options for performance. You can write the search so that you have 2 queries - one that uses the main, always populated information to search (like last name or ssn) only, and one that outer joins the very infrequently populated information only when it is requested for search. Or you could take advantage of the DBMS optimizer if it is smart enough to recognize for a given set of host variables that the outer join is not needed and won't perform it, and thus you only have to create 1 query.

What DBMS platform are you using? The way in which the platform handles NULL column storage, optimizes your query, as well as the availability of sparse column support (SQL Server has this) will impact the decision. Ultimately I would recommend trying out both designs in a test environment with production sized data and workload and seeing which better achieves your performance objectives.

Database with insert data every 2 minutes

Inserting 10 rows each 2 minutes will result in (24*60/2)*10 rows per day (7200) -> that is not a large value to worry about. Also, it's great that you think on the "future - 10 years", but don't lose time with premature optimization.

If your only concern it's about this table where you're inserting data every 2 mins, there's no point in creating additional databases (one per year), so let's stick with this table (Table_A).

Now, because this table will slowly increase in time, and your want your queries to run fast, you have plenty of options:

you can partition the table by certain criteria. Because your using the table for a "weather station", and your data is time series values, your best option would be to partition by [station_id] then by [created]
Create a Table_A_Archive, where you can move data that would be to old to keep in Table_A.
How long do you intend to keep data in Table_A ? would it make sense to delete old rows that become obsolete for you application

... And so on.

Best Bet: Partitioning the existing table by [station_id] and [created], you will have "A" partitions for each station, "B" partitions for each month, and a total of AxB posible number of partitions. Once you partition Table_A, do the same thing for Table_A_Archive, and on the end of each year, move the data from Table_A to Table_A_Archive.

** IMPORTANT:** After you make the partitioning schema, keep in mind that all queries should have in the WHERE clause the conditions necesarly so that the query will hit as little partitions as posible.

Ex.

Select AVG(pressure)
FROM Table_A
WHERE station_id = 123
AND created between '2013-01-01 00:00:00' AND '2013-02-01 00:00:00'

The query above will only touch the partition "123" and the 2 sub partitions for months 1 and 2

Best Answer

Related Solutions

Database Design – New Table vs New Columns

Database with insert data every 2 minutes

Related Question