Help with database design

database-designperformance

I'm quite the beginner when it comes to databases, so please bear that in mind.

I need to store user statistics in several categories for an application.

In the near future I will most likely need to add another application, with different categories, but for the same user.

Both applications will likely change which data is saved over time. perhaps via adding or removing individual fields or even entire categories.

Taking into account performance, maintenance and simplicity of future queries, what's the best approach to database design in this case?

Should I have one table for each application?

Or should I divide everything into smaller tables, maybe even stored in a different database?

Best Answer

You could always do something like this:

Create a users table to store all of your user info:

create table users
(
  id int, 
  user_name varchar(50)
);

insert into users values
(1, 'John'),
(2, 'Jill'),
(3, 'Bob'),
(4, 'Sam');

You state that you might need to apply this type of tracking for multiple applications. So you can create an application table to store the app name and any details.

create table applications 
(
  id int, 
  app_name varchar(50)
);

insert into applications values
(1, 'App 1'),
(2, 'App 2'),
(3, 'App 3');

Next create a categories table to store the info on each of the categories that you want to track by the users. This table will include a column that will contain the application id. This table will contain a foreign key relationship to the applications table. This allows you to have some flexibility in having the same category in multiple apps.

create table categories 
(
  id int,  
  cat_name varchar(100),
  a_id int 
);

insert into categories values
(1, 'Cat 1', 1),
(2, 'Cat 2', 1),
(3, 'Cat 3', 2),
(4, 'Cat 4', 3);

Finally, create the users_categories table which will contain the stats that you want to track, including user, category and then the action_date (the date the event took place).

create table users_categories
(
  u_id int,
  c_id int,
  action_date datetime
);
insert into users_categories values
(1, 1, current_timestamp()),
(2, 3, '2012-01-15 10:15'),
(3, 4, '2012-08-24 12:48'),
(1, 2, '2012-06-08 08:50');

Then to query the data, you would use something similar to this:

select *
from users u
left join users_categories uc
  on u.id = uc.u_id
left join categories c
  on uc.c_id = c.id
left join applications a
  on c.a_id = a.id

And the results would contain (See SQL Fiddle with Demo):

ID | USER_NAME | U_ID   | C_ID   | ACTION_DATE           | CAT_NAME | A_ID    | APP_NAME
------------------------------------------------------------------------------------------
1  | John      | 1      | 1      | '2012-09-28 06:47:53' | Cat 1    | 1       | App 1
1  | John      | 1      | 2      | '2012-06-08 08:50:00' | Cat 2    | 1       | App 1
2  | Jill      | 2      | 3      | '2012-01-15 10:15:00' | Cat 3    | 2       | App 2
3  | Bob       | 3      | 4      | '2012-08-24 12:48:00' | Cat 4    | 3       | App 3
4  | Sam       | (null) | (null) | (null)                | (null)   | (null)  | (null)

A few other suggestions, you could technically remove the a_id from the categories table and create a join table between categories and applications, similar to this:

create table categories_applications
(
  c_id int,
  a_id int
);
insert into categories_applications values
(1, 1),
(2, 3),
(3, 1),
(1, 2);

This would lead to one additional join in your queries, but the results would be the same:

select *
from users u
left join users_categories uc
  on u.id = uc.u_id
left join categories c
  on uc.c_id = c.id
left join categories_applications ca
  on c.id = ca.c_id
left join applications a
  on ca.a_id = a.id

Related Solutions

Topics and posts represented by one table or two

Generally...

A a rule of thumb: don't pre-optimize for performance. I think a lot of developers assume that joins are inefficient and they don't trust DBMS to do what it's built to do.

Start with a properly normalized design. Make sure your indexes and queries are optimized for your particular balance of reads and writes.

If and when you start to find that performance can't keep up with the best hardware you can afford, then start to think about denormalizing.

If you denormalize early you are just setting yourself up for maintenance headaches down the road.

More Specifically...

Looking at your suggested table layouts, I would suggest that you're trying to make TOPIC do too much. Anything which could appear in POST (e.g. poster_id) almost certainly doesn't belong in TOPIC. I suggest you adjust your thinking a little bit. I get the impression you are thinking very much about what topics and posts are going to look like on the page. This may be leading you to thinking of topics as a small super-set of posts, whereas they are probably rather more like subject headings. The fact that you plan on displaying the first post under each subject heading along with the heading isn't a good reason to co-mingle posts and headings.

I think you may want to rethink some of your cumulative total columns too. Thinks like up and down votes may need to be tracked in their own tables. You may need to do this to keep people from up or down voting repeatedly and to allow people to rescind their votes. Similarly, you may want to know all editors, not just the last editor.

Sql-server – How much overhead is involved with querying a DB on the same server

If you just want a readable copy of the data in table A, then there are ways to make this efficient without worrying about whether the copy is in the same database or a different database. Or a different server, for that matter, as long as you can reliably point your application differently depending on whether it is reading or writing, and as long as there isn't a bunch of referential integrity tied to A.

Let's keep things simple and have everything in the same database. Create two new schemas:

CREATE SCHEMA fake   AUTHORIZATION dbo;
CREATE SCHEMA shadow AUTHORIZATION dbo;

Now create a copy of A in the dbo and shadow schemas:

SELECT * INTO shadow.B FROM dbo.A;
SELECT * INTO dbo.B    FROM dbo.A;

(Then add all of the appropriate indexes, etc. to both tables.)

Now, create a stored procedure that does the following:

TRUNCATE TABLE shadow.B;

INSERT shadow.B SELECT * FROM dbo.A;

-- optionally add an UPDATE STATISTICS command here

BEGIN TRANSACTION;

  ALTER SCHEMA fake TRANSFER     dbo.B;
  ALTER SCHEMA dbo  TRANSFER  shadow.B;

COMMIT TRANSACTION;

ALTER SCHEMA shadow TRANSFER fake.B;

-- optionally truncate the now stale copy, though I keep
-- this around for debugging / comparisons:
TRUNCATE TABLE shadow.B;

Then schedule it to run in a job (frequency will depend on your tolerance for how long this process takes and how fresh the copy of the data needs to be). You can always call it manually if you want it to be refreshed in between runs.

As I mentioned above, unless you use the Auto-Close feature, the overhead of accessing a table in a different database on the same instance is negligible compared to accessing the same table in the same database. So the process or any related efficiency doesn't really change that much if you put B in a different database (just an extra prefix on the INSERT/SELECT, and the pointing of the application as mentioned above). The process also doesn't change that much if you move the copy to a different server (though linked servers, distributed transactions etc. make this a little more cumbersome). You will see the process might be a little slower across servers, but users shouldn't notice unless they also now have to join data between the two servers - if they can query the other server independently to pull data from B they won't be able to tell the difference, unless the other server is severely underpowered.

If you have referential integrity you may find that you will need to drop/re-create and/or disable/re-enable constraints in order to schema transfer (I've almost always used this trick for relatively static lookup data and, as a copy primarily used for reporting, did not have DRI). Also both copies of A will have their own statistics, so half the time you'll have one set of statistics, and half the time you'll have the other set. I haven't done extensive testing of what happens with query plans etc. after the schema transfer.

Of course if you store B in a different database then perhaps the DRI issues etc. are not all that important anyway, but I thought I would mention them.

I have, however, blogged a bit more about how I arrived at this technique:

http://www.sqlperformance.com/2012/08/t-sql-queries/t-sql-tuesday-schema-switch-a-roo

Best Answer

Related Solutions

Topics and posts represented by one table or two

Sql-server – How much overhead is involved with querying a DB on the same server

Related Question