Sql-server – Database Design guidance required

database-design, sql-server, sql-server-2008, sql-server-2008-r2

I have a scenario where there is a table T with columns PK, A, B, C, D:

TABLE t
(
PK,
A,
B,
C,
D
)

I have to insert data from different sources, i.e. columns A, B, C and D are calculated separately, but they share the same primary key PK.

I would like to ask whether it is better to create individual tables for each column, then join them on PK and insert into the final table T:

PK|A, PK|B, PK|C, PK|D ----> JOIN AND INSERT INTO TABLE T(PK,A,B,C,D)

Or is it OK to update instead, i.e.

first have something like T (PK,A,B,C,D)
with data something like (1,a1,NULL,NULL,NULL),

then run an update statement for each of the columns B, C, D sequentially until all the data is there:

T(1,a1,b1,c1,d1)

Best Answer

I'm not sure either of these is a really great option. Here are my thoughts:

pk|A ,pk|B ,PK|C ,PK|D ----> JOIN AND INSERT INTO TABLE T(pk,a,b,c,d)

With this scenario you have redundant data. Unless your individual PK tables are just staging tables, I think you could run into some serious issues here, especially with different consumers pulling data from different tables. I wouldn't chance that whatsoever with this design.

Not to mention that if PK:A, PK:B, PK:C, and PK:D are all 1:1 relationships, I don't see any normalization benefit out of this either.

T (PK,A,B,C,D) with data something like (1,a1,NULL,NULL,NULL); run update statements for columns B, C, D sequentially and get all the data: T(1,a1,b1,c1,d1)

I'm not sure I like this design either, although I'd prefer it to the former. The problem is what happens if the business dictates that columns A, B, C, and/or D really aren't NULLable. Now you can run into a big problem of data conflicting with business constraints.
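To make the NULLability concern concrete, here is a minimal sketch. It assumes all four columns end up declared NOT NULL, uses made-up TEXT types, and runs on SQLite through Python's standard sqlite3 module rather than SQL Server, so treat it as an illustration of the constraint behaviour only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: the business rules make A, B, C and D all required.
conn.execute("""
    CREATE TABLE t (
        pk INTEGER PRIMARY KEY,
        a  TEXT NOT NULL,
        b  TEXT NOT NULL,
        c  TEXT NOT NULL,
        d  TEXT NOT NULL
    )
""")

# The "insert A first, update B, C, D later" approach starts with a
# partially filled row, which the constraints refuse.
try:
    conn.execute(
        "INSERT INTO t (pk, a, b, c, d) VALUES (1, 'a1', NULL, NULL, NULL)"
    )
    outcome = "inserted"
except sqlite3.IntegrityError as exc:
    outcome = f"rejected: {exc}"

print(outcome)
```

The partial row never reaches the table, so the step-by-step update design only works while the columns stay nullable.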

I think the real solution here may be to take a step back and analyze your data modification. Instead of having a step-based data insert, is there no way to do a single insert of all the required and necessary data into this table? In the application, retrieve all of the required data before each item's insert. That way you won't be defying any business rules, holding redundant data, or incurring the additional operational overhead of the designs above.

In other words, if data columns A, B, and C are required, then you shouldn't be pushing that data out to this table until you have it. Without all of those required columns you wouldn't have compliant data, and it shouldn't live in that table until it is. And if D is nullable and optional data, then sure that's fine if it doesn't appear until a later time with a different DML operation.
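One possible way to sketch this "insert only complete rows" idea is shown below. The staging table names (stage_a to stage_d), the TEXT types, and the choice of A, B, C as required and D as optional are assumptions made for illustration, and SQLite (via Python's sqlite3) stands in for SQL Server:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical staging tables, one per independently computed column.
    CREATE TABLE stage_a (pk INTEGER PRIMARY KEY, a TEXT NOT NULL);
    CREATE TABLE stage_b (pk INTEGER PRIMARY KEY, b TEXT NOT NULL);
    CREATE TABLE stage_c (pk INTEGER PRIMARY KEY, c TEXT NOT NULL);
    CREATE TABLE stage_d (pk INTEGER PRIMARY KEY, d TEXT NOT NULL);

    -- Final table: A, B, C assumed required, D assumed optional.
    CREATE TABLE t (
        pk INTEGER PRIMARY KEY,
        a  TEXT NOT NULL,
        b  TEXT NOT NULL,
        c  TEXT NOT NULL,
        d  TEXT
    );

    INSERT INTO stage_a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO stage_b VALUES (1, 'b1'), (2, 'b2');
    INSERT INTO stage_c VALUES (1, 'c1'), (2, 'c2'), (3, 'c3');
    INSERT INTO stage_d VALUES (1, 'd1');
""")

# Inner joins keep only keys whose required columns are all present;
# the left join lets the optional D column stay NULL.
conn.execute("""
    INSERT INTO t (pk, a, b, c, d)
    SELECT sa.pk, sa.a, sb.b, sc.c, sd.d
    FROM stage_a AS sa
    JOIN stage_b AS sb ON sb.pk = sa.pk
    JOIN stage_c AS sc ON sc.pk = sa.pk
    LEFT JOIN stage_d AS sd ON sd.pk = sa.pk
""")
conn.commit()

rows = conn.execute("SELECT pk, a, b, c, d FROM t ORDER BY pk").fetchall()
print(rows)
```

Key 3 has no B value yet, so it stays in staging and never appears in T, while key 2 is inserted with D left NULL for a later DML operation.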

Related Solutions

Database design: how to handle the “archive” problem

It's not clear to me if these requirements are for auditing purposes or just simple historical reference such as with CRM and shopping carts.

Either way, consider having a main and a main_archive table for each major area where this is required. "Main" will only have current / active entries, whereas "main_archive" will have a copy of everything that ever goes into main. Inserts / updates into main_archive can be done by a trigger on insert / update into main. Deletes against main_archive can then run across a longer period of time, if ever.

For the referential issues such as Cust X bought Product Y, the easiest way to solve your referential concern of cust_archive -> product_archive is to never delete entries from product_archive. Generally, churn should be much lower in that table so size shouldn't be too bad of a concern.
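A rough sketch of the main / main_archive trigger idea follows. The column names (id, payload, archived_at) and trigger names are invented for illustration, and the trigger syntax is SQLite's (run through Python's sqlite3), so it would need adapting to your own DBMS:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main (id INTEGER PRIMARY KEY, payload TEXT NOT NULL);

    -- The archive keeps every version, so id is not unique here.
    CREATE TABLE main_archive (
        id          INTEGER NOT NULL,
        payload     TEXT NOT NULL,
        archived_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Copy every inserted row into the archive.
    CREATE TRIGGER main_after_insert AFTER INSERT ON main
    BEGIN
        INSERT INTO main_archive (id, payload) VALUES (NEW.id, NEW.payload);
    END;

    -- Copy every updated row (its new state) into the archive.
    CREATE TRIGGER main_after_update AFTER UPDATE ON main
    BEGIN
        INSERT INTO main_archive (id, payload) VALUES (NEW.id, NEW.payload);
    END;
""")

conn.execute("INSERT INTO main (id, payload) VALUES (1, 'v1')")
conn.execute("UPDATE main SET payload = 'v2' WHERE id = 1")
conn.execute("DELETE FROM main WHERE id = 1")  # archive rows are untouched

main_rows = conn.execute("SELECT id, payload FROM main").fetchall()
archive_rows = conn.execute(
    "SELECT id, payload FROM main_archive ORDER BY rowid"
).fetchall()
print(main_rows)
print(archive_rows)
```

After the delete, main is empty but the archive still holds both versions of the row, which is what lets archived references keep resolving.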

HTH.

Database Design – Creating an Application Log Database

I did the following

CREATE TABLE L(
Time_Series_TS TIMESTAMP, 
Channel VARCHAR(10), 
Operation VARCHAR(10), 
Function VARCHAR(10), 
Duration INT);

Then

INSERT INTO L VALUES('2014-06-10 09:00:03.457', 'Channel1', 'Operation3', 'Function15', 15);
INSERT INTO L VALUES('2014-06-10 09:00:08.245', 'Channel2', 'Operation5', 'Function10', 22);
INSERT INTO L VALUES('2014-06-10 09:00:22.005', 'Channel1', 'Operation3', 'Function15', 48);
INSERT INTO L VALUES('2014-06-10 09:01:03.457', 'Channel2', 'Operation3', 'Function15', 296);
INSERT INTO L VALUES('2014-06-10 09:01:08.245', 'Channel2', 'Operation5', 'Function10', 225);
INSERT INTO L VALUES('2014-06-10 09:01:22.005', 'Channel1', 'Operation3', 'Function15', 7);
INSERT INTO L VALUES('2014-06-10 09:01:16.245', 'Channel2', 'Operation5', 'Function10', 10);
INSERT INTO L VALUES('2014-06-10 09:01:47.005', 'Channel1', 'Operation3', 'Function15', 20);

I added a few records to your sample for checking. Then ran this query

SELECT MINUTE(Time_Series_TS) AS Minute, Channel, Operation, Function, 
COUNT(*) AS "Count/min", SUM(Duration) AS Duration 
FROM L
GROUP BY Minute, Channel, Operation, Function
ORDER BY Minute, Channel, Operation, Function;

Which gave

+--------+----------+------------+------------+-----------+----------+
| Minute | Channel  | Operation  | Function   | Count/min | Duration |
+--------+----------+------------+------------+-----------+----------+
|      0 | Channel1 | Operation3 | Function15 |         2 |       63 |
|      0 | Channel2 | Operation5 | Function10 |         1 |       22 |
|      1 | Channel1 | Operation3 | Function15 |         2 |       27 |
|      1 | Channel2 | Operation3 | Function15 |         1 |      296 |
|      1 | Channel2 | Operation5 | Function10 |         2 |      235 |
+--------+----------+------------+------------+-----------+----------+

This appears to be the result you want (note 63 as the first duration, as per my earlier comment). You can then use HOUR(), DAYOFMONTH() and even YEAR() to aggregate over those units as well with this query.

For performance, I did create an index

CREATE INDEX L_Index ON L(Channel, Operation, Function) using BTREE;

and explained the query before and after creating it, but there was no difference. This is hardly a surprise, since the optimizer probably decided that there's no point in using an index on such a small table. Obviously, I can't test with your data, but there are a couple of points. If you are performing this operation over a large number of records with a large number of fields, you may run into issues, and if you create many indexes, your insert performance will decrease. Is it possible for you to categorise your data in some way to reduce the number of fields, i.e. split your big table into ones with fewer fields? Check out different scenarios, test and see what happens with your data, your queries, your application and your hardware.

[EDIT]

For something more human readable, you might like to try something like

SELECT TIME(FROM_UNIXTIME(UNIX_TIMESTAMP(Time_Series_TS) - MOD(UNIX_TIMESTAMP(Time_Series_TS), 60))) AS Minute,
..
..

for your first field.
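The arithmetic in that expression (epoch seconds minus epoch seconds modulo 60) can be sketched outside the database as follows. The helper name minute_bucket is made up, and the timestamps are assumed to be UTC, whereas MySQL would use the session time zone:

```python
from datetime import datetime, timezone

def minute_bucket(ts: datetime) -> int:
    """Return the epoch second at which ts's minute starts."""
    epoch = int(ts.timestamp())   # whole seconds since 1970-01-01 UTC
    return epoch - epoch % 60     # drop the seconds within the minute

# Two events in the same minute, and one exactly an hour later.
a = datetime(2014, 6, 10, 9, 0, 3, 457000, tzinfo=timezone.utc)
b = datetime(2014, 6, 10, 9, 0, 22, 5000, tzinfo=timezone.utc)
c = datetime(2014, 6, 10, 10, 0, 8, 245000, tzinfo=timezone.utc)

print(minute_bucket(a) == minute_bucket(b))   # same bucket
print(minute_bucket(a) == minute_bucket(c))   # different bucket
print(a.minute == c.minute)                   # minute-of-hour alone collides
print(datetime.fromtimestamp(minute_bucket(a), tz=timezone.utc))
```

A minute-of-hour value such as the one MINUTE() returns only ranges over 0 to 59, so rows from 09:00 and 10:00 would land in the same group once the data spans more than an hour; the truncated timestamp keeps them apart.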

[EDIT - Response to UPDATE-1]

OK - so in my schema, you are indexing by (Minute, Channel, Operation, Function)? See here for the documentation on composite indexes in MySQL. If your queries have a predominantly left-to-right orientation, i.e. you [always | usually] query Channel first, then Operation, then Function, you could try an index on Minute + (the usual three). If it's fairly arbitrary, then you could try using 6 indexes, but this will hit insert performance. How much, I can't say, but if this is a DW type app which performs the analysis, you can batch the inserts and only occasionally take the hit for that. You'll have to do a few tests and EXPLAIN your queries with realistic sample data; as I said earlier, with just a few records the optimizer ignores indexes because the table is too small.

Interestingly, on the MySQL manual page given above, there's a hashing strategy which looks interesting - take MD5 hashes of CONCAT(Your_Column_List_Here). One other thing that I can suggest is that instead of using the

SELECT TIME(FROM_UNIXTIME(UNIX_TIMESTAMP(Time_Series_TS) - MOD(UNIX_TIMESTAMP(Time_Series_TS), 60))) AS Minute,...

just remove the TIME() and FROM_UNIXTIME() wrappers and then you'll be storing INTs, which appear to be better for indexing than DATETIMEs - see here for a benchmark. Also, as previously mentioned, you should move your data off Production and perform the OLAP/DW on another machine. You could also test out the InfiniDB solution that I suggested. It's drop-in compatible with MySQL (no learning curve). Then there are all the NoSQL solutions - we could be here all day :-). Take a look at a few scenarios, evaluate and test, and then choose what best fits your budget and requirements.

Forgot: make your OLAP/DW system read-only for performing queries - no transactional overhead! Make the OLAP/DW tables MyISAM? This last one is controversial - again, test and see.
