I'm working with an application that could end up with millions of rows. I'm just wondering if it is OK to use an auto-increment number as the primary key. I have read that BIGINT, INT, etc. have limits. So, for example, if the site is live for 10 years, is it OK to use an auto-increment key? Do big websites like Facebook also use auto-increment numbers as their primary keys?
Auto number for big data
auto-increment | database-design
Related Solutions
KEY-VALUE NO!!!
A table for phone numbers -- sure. It would have userid, phone_num, and (if you like) a phone type, such as ENUM('fax', 'home', ...). Then JOIN to the main table.
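A minimal sketch of that phone-number table (names are illustrative, and a users main table with an id column is assumed):

```sql
CREATE TABLE phone_numbers (
    userid INT UNSIGNED NOT NULL,   -- references the main users table
    phone_num VARCHAR(20) NOT NULL,
    phone_type ENUM('fax', 'home', 'mobile', 'work') NOT NULL DEFAULT 'home',
    PRIMARY KEY (userid, phone_num)
);

-- Then JOIN back to the main table:
SELECT u.name, p.phone_num, p.phone_type
FROM users u
JOIN phone_numbers p ON p.userid = u.id;
```

The composite primary key lets one user have any number of phone numbers without any limit baked into the main table.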
To keep unlimited, unsearchable data, have a column with a bunch of key-value stuff. I like to do it in JSON, then compress it (in the app) and store it in a BLOB or MEDIUMBLOB. That keeps it easily accessible to the app, reasonably compact, and quite open-ended.
In the table, have only columns that you need to search on; put the rest into the extra JSON column.
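A sketch of that layout, with hypothetical table and column names; only the indexed columns are searchable, and everything else lives in the compressed JSON blob that the app reads and writes:

```sql
CREATE TABLE products (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name  VARCHAR(100)  NOT NULL,   -- searchable
    price DECIMAL(10,2) NOT NULL,   -- searchable
    extra MEDIUMBLOB,               -- compressed JSON, opaque to SQL
    INDEX (name),
    INDEX (price)
);
```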
More discussion:
- http://forums.mysql.com/read.php?125,428546,428769#msg-428769
- http://forums.mysql.com/read.php?125,402095,402218#msg-402218
Another approach is MariaDB's "dynamic columns". This even lets you index randomly added 'columns'.
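A brief sketch of the dynamic-columns approach in MariaDB (table and attribute names invented for illustration), using MariaDB's COLUMN_CREATE and COLUMN_GET functions:

```sql
-- Ad-hoc attributes are packed into a BLOB by COLUMN_CREATE.
CREATE TABLE product (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    attrs BLOB   -- holds the dynamic columns
);

INSERT INTO product (name, attrs)
VALUES ('phone', COLUMN_CREATE('color', 'black', 'weight_g', 180));

-- Read one attribute back, casting it to the type you expect:
SELECT name, COLUMN_GET(attrs, 'color' AS CHAR) AS color
FROM product;
```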
2000-3000 customers -- Yawn.
For convenience, assume that the live database is called LiveDb and the archive database is called ArchiveDb.
- Add a UNION ALL view in LiveDb pointing to the tables in the ArchiveDb database via a synonym (there is no need to do a combined db with synonyms)
- "Partition" on visit.date, and denormalise this column to visit_payments too if it isn't there already (this improves co-located join performance)
- Only archive the two large tables if possible (this reduces the chance of tripping up the optimiser). Keep the UNION ALL view and the other tables in LiveDb so all joins to the smaller tables stay local
- Add a CHECK constraint on the tables in both LiveDb and ArchiveDb that describes the range of visit.date contained in each table. This helps the optimiser eliminate the archive table from both seeks and scans that filter on visit.date. You will have to periodically update this constraint
- In the UNION ALL view, add a WHERE clause that filters on visit.date. This is in addition to the hint you already provide via the check constraint, and it maximises the chance of filters being pushed down
- If you have Enterprise Edition, partition the table in the archive database (but NOT in the live database). If you want to get really fancy, use filegroup-level backup/restore of the archive database to save on backup times
- Consider putting ArchiveDb in SIMPLE recovery mode if it isn't already; you are not likely to need transaction log backups of ArchiveDb
- Use INSERT ... WITH (TABLOCK) SELECT ... WITH (ROWLOCK) to force minimal logging on the destination when moving data between LiveDb and ArchiveDb
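The core of the setup above can be sketched as follows. All table and column names here are hypothetical (a visit table keyed by cid, with everything before 2020-01-01 archived); each statement is run in the database noted in the comment:

```sql
-- In ArchiveDb: describe the date range held there.
ALTER TABLE dbo.visit
    ADD CONSTRAINT ck_visit_archive CHECK (visit_date < '20200101');

-- In LiveDb: the complementary range.
ALTER TABLE dbo.visit
    ADD CONSTRAINT ck_visit_live CHECK (visit_date >= '20200101');

-- In LiveDb: a synonym for the archived table.
CREATE SYNONYM dbo.visit_archive FOR ArchiveDb.dbo.visit;
GO

-- In LiveDb: the UNION ALL view, repeating the date filter
-- so it can be pushed down.
CREATE VIEW dbo.visit_all AS
SELECT * FROM dbo.visit         WHERE visit_date >= '20200101'
UNION ALL
SELECT * FROM dbo.visit_archive WHERE visit_date <  '20200101';
GO

-- Moving rows with minimal logging on the destination:
INSERT INTO dbo.visit_archive WITH (TABLOCK)
SELECT * FROM dbo.visit WITH (ROWLOCK)
WHERE visit_date < '20200101';
```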
None of the above guarantees that the optimiser will eliminate the archive tables from seeks and scans, but it makes elimination more likely.
When the elimination doesn't happen, these are the effects you may see (the list may be incomplete). For seeks, you will get an additional seek on every query (this drives up IOPS). For scans, the results can be disastrous for performance, as you may end up scanning both the archive and the live tables. Here are the typical ways you can trip up the optimiser:
- If you join the visit% tables together and don't include visit.date in the join criteria (this is why you want to denormalise). Because of this, you may wish to modify some of your queries
- If you get a hash join between visit.date and another table (for example a date dimension), you may not get the right elimination of tables
- If you try to aggregate data over the archived tables
- If you filter on anything BUT visit.date, for example a seek directly on the key of the view
For the last scenario, you can guard yourself against the worst effects by adding another check constraint on cid, if this is possible. You did mention that the sequence of cid is not "clean" with respect to the dates and progression of rows in the table. However, could you maintain a table that contains the information "there are no cid above this number since this visit.date" or similar? That could then drive an additional constraint.
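One hypothetical way to drive such a constraint (all names invented for illustration): keep a watermark recording the largest cid that has been archived, and re-create matching CHECK constraints on cid from it whenever an archive run moves the watermark:

```sql
-- In LiveDb: the watermark, updated after each archive run.
CREATE TABLE dbo.archive_watermark (
    max_archived_cid BIGINT NOT NULL,
    as_of_visit_date DATE   NOT NULL
);

-- In ArchiveDb: no cid above the watermark exists here.
-- (Drop and re-create with the new literal when the watermark moves.)
ALTER TABLE dbo.visit
    ADD CONSTRAINT ck_visit_cid_max CHECK (cid <= 5000000);

-- In LiveDb: the complementary range.
ALTER TABLE dbo.visit
    ADD CONSTRAINT ck_visit_cid_min CHECK (cid > 5000000);
```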
Another thing to be careful about is that parallel queries may spawn a LOT more threads once you query the partitioned view (as both "sub-tables" will be exposed to the same parallel optimisations). For that reason, you may want to limit MAXDOP on the server or on the queries that go parallel.
By the way, if you know the queries well, you may not even need the same indexes in the two databases (this assumes you are 100% sure you will get the right elimination of tables). You could even consider using column stores for ArchiveDb.
Best Answer
Assuming you use a 64-bit integer (an unsigned BIGINT; note that a plain INT is only 32 bits), you have 2^64 = 18,446,744,073,709,551,616 possible values. Even if you did an insert every nanosecond, that is about 1.8 × 10^10 seconds of inserts, so your system would run out of numbers after roughly 584 years - I wouldn't worry too much! :-)
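For reference, a MySQL table using such a key might look like this (table and column names are illustrative):

```sql
CREATE TABLE events (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    created_at DATETIME NOT NULL
    -- ...other columns...
);
```

If you ever did approach the limit of a smaller type, ALTER TABLE to widen the column is always an option, so starting with an auto-increment key is not a trap.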