I'm working with an application that could end up with millions of rows. I'm just wondering if it is OK to use an auto-increment number as the primary key. I have read that BIGINT, INT, etc. have limits. So, for example, if the site is live for 10 years, is it OK to use an auto-increment key? Do big websites like Facebook also use auto-increment numbers as their primary keys?
Auto number for big data
auto-increment | database-design
Related Solutions
KEY-VALUE NO!!!
A table for phone numbers -- sure. It would have userid, phone_num, and (if you like) a phone type, such as ENUM('fax', 'home', ...). Then JOIN to the main table.
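A minimal sketch of that phone-number table (names are illustrative, and a users main table with an id column is assumed):

```sql
CREATE TABLE phone_numbers (
    userid INT UNSIGNED NOT NULL,   -- references the main users table
    phone_num VARCHAR(20) NOT NULL,
    phone_type ENUM('fax', 'home', 'mobile', 'work') NOT NULL DEFAULT 'home',
    PRIMARY KEY (userid, phone_num)
);

-- Then JOIN back to the main table:
SELECT u.name, p.phone_num, p.phone_type
FROM users u
JOIN phone_numbers p ON p.userid = u.id;
```

The composite primary key lets one user have any number of phone numbers without any limit baked into the main table.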
To keep unlimited, unsearchable data, have a column with a bunch of key-value stuff. I like to do it in JSON, then compress it (in the app) and store it in a BLOB or MEDIUMBLOB. That keeps it easily accessible to the app, reasonably compact, and quite open-ended.
In the table, have only columns that you need to search on; put the rest into the extra JSON column.
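A sketch of that layout, with hypothetical table and column names; only the indexed columns are searchable, and everything else lives in the compressed JSON blob that the app reads and writes:

```sql
CREATE TABLE products (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name  VARCHAR(100)  NOT NULL,   -- searchable
    price DECIMAL(10,2) NOT NULL,   -- searchable
    extra MEDIUMBLOB,               -- compressed JSON, opaque to SQL
    INDEX (name),
    INDEX (price)
);
```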
More discussion:
- http://forums.mysql.com/read.php?125,428546,428769#msg-428769
- http://forums.mysql.com/read.php?125,402095,402218#msg-402218
Another approach is MariaDB's "dynamic columns". This even lets you index randomly added 'columns'.
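A brief sketch of the dynamic-columns approach in MariaDB (table and attribute names invented for illustration), using MariaDB's COLUMN_CREATE and COLUMN_GET functions:

```sql
-- Ad-hoc attributes are packed into a BLOB by COLUMN_CREATE.
CREATE TABLE product (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    attrs BLOB   -- holds the dynamic columns
);

INSERT INTO product (name, attrs)
VALUES ('phone', COLUMN_CREATE('color', 'black', 'weight_g', 180));

-- Read one attribute back, casting it to the type you expect:
SELECT name, COLUMN_GET(attrs, 'color' AS CHAR) AS color
FROM product;
```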
2000-3000 customers -- Yawn.
For convenience, assume that the live database is called LiveDb and the archive database is called ArchiveDb.
- Add a UNION ALL view in LiveDb pointing to the tables in the ArchiveDb database via a synonym (there is no need to do a combined db with synonyms)
- "Partition" on visit.date, and denormalise this column to visit_payments too if it isn't there already (this improves co-located join performance)
- Only archive the two large tables if possible (this reduces the chance of tripping up the optimiser). Keep the UNION ALL view and the other tables in LiveDb so all joins to the smaller tables stay local
- Add a CHECK constraint on the tables in both LiveDb and ArchiveDb that describes the range of visit.date contained in each table. This helps the optimiser eliminate the archive table from both seeks and scans that filter on visit.date. You will have to periodically update this constraint
- In the UNION ALL view, add a WHERE clause that filters on visit.date. This is in addition to the hint you already provide via the check constraint, and it maximises the chance of filters being pushed down
- If you have Enterprise Edition, partition the table in the archive database (but NOT in the live database). If you want to get really fancy, use filegroup-level backup/restore of the archive database to save on backup times
- Consider putting ArchiveDb in SIMPLE recovery mode if it isn't already; you are not likely to need transaction log backups of ArchiveDb
- Use INSERT ... WITH (TABLOCK) SELECT ... WITH (ROWLOCK) to force minimal logging on the destination when moving data between LiveDb and ArchiveDb
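The core of the setup above can be sketched as follows. All table and column names here are hypothetical (a visit table keyed by cid, with everything before 2020-01-01 archived); each statement is run in the database noted in the comment:

```sql
-- In ArchiveDb: describe the date range held there.
ALTER TABLE dbo.visit
    ADD CONSTRAINT ck_visit_archive CHECK (visit_date < '20200101');

-- In LiveDb: the complementary range.
ALTER TABLE dbo.visit
    ADD CONSTRAINT ck_visit_live CHECK (visit_date >= '20200101');

-- In LiveDb: a synonym for the archived table.
CREATE SYNONYM dbo.visit_archive FOR ArchiveDb.dbo.visit;
GO

-- In LiveDb: the UNION ALL view, repeating the date filter
-- so it can be pushed down.
CREATE VIEW dbo.visit_all AS
SELECT * FROM dbo.visit         WHERE visit_date >= '20200101'
UNION ALL
SELECT * FROM dbo.visit_archive WHERE visit_date <  '20200101';
GO

-- Moving rows with minimal logging on the destination:
INSERT INTO dbo.visit_archive WITH (TABLOCK)
SELECT * FROM dbo.visit WITH (ROWLOCK)
WHERE visit_date < '20200101';
```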
None of the above guarantees that the optimiser will eliminate the archive tables from seeks and scans, but it makes elimination more likely.
When the elimination doesn't happen, these are the effects you may see (the list may be incomplete). For seeks, you will get an additional seek on every query (this drives up IOPS). For scans, the results can be disastrous for performance, as you may end up scanning both the archive and the live tables. Here are the typical ways you can trip up the optimiser:
- If you join the visit% tables together and don't include visit.date in the join criteria (this is why you want to denormalise). Because of this, you may wish to modify some of your queries
- If you get a hash join between visit.date and another table (for example a date dimension), you may not get the right elimination of tables
- If you try to aggregate data over the archived tables
- If you filter on anything BUT visit.date, for example a seek directly on the key of the view
For the last scenario, you can guard yourself against the worst effects by adding another check constraint on cid, if this is possible. You did mention that the sequence of cid is not "clean" with respect to the dates and progression of rows in the table. However, could you maintain a table that contains the information "there are no cid above this number since this visit.date" or similar? That could then drive an additional constraint.
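One hypothetical way to drive such a constraint (all names invented for illustration): keep a watermark recording the largest cid that has been archived, and re-create matching CHECK constraints on cid from it whenever an archive run moves the watermark:

```sql
-- In LiveDb: the watermark, updated after each archive run.
CREATE TABLE dbo.archive_watermark (
    max_archived_cid BIGINT NOT NULL,
    as_of_visit_date DATE   NOT NULL
);

-- In ArchiveDb: no cid above the watermark exists here.
-- (Drop and re-create with the new literal when the watermark moves.)
ALTER TABLE dbo.visit
    ADD CONSTRAINT ck_visit_cid_max CHECK (cid <= 5000000);

-- In LiveDb: the complementary range.
ALTER TABLE dbo.visit
    ADD CONSTRAINT ck_visit_cid_min CHECK (cid > 5000000);
```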
Another thing to be careful about is that parallel queries may spawn a LOT more threads once you query the partitioned view (as both "sub-tables" will be exposed to the same parallel optimisations). For that reason, you may want to limit MAXDOP on the server or on the queries that go parallel.
By the way, if you know the queries well, you may not even need the same indexes in the two databases (this assumes you are 100% sure you will get the right elimination of tables). You could even consider using column stores for ArchiveDb.
Best Answer
Assuming you use a 64-bit integer (an unsigned BIGINT; note that a plain INT is only 32 bits), you have 2^64 = 18,446,744,073,709,551,616 possible values. Even if you did an insert every nanosecond, that is about 1.8 × 10^10 seconds of inserts, so your system would run out of numbers after roughly 584 years - I wouldn't worry too much! :-)
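For reference, a MySQL table using such a key might look like this (table and column names are illustrative):

```sql
CREATE TABLE events (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    created_at DATETIME NOT NULL
    -- ...other columns...
);
```

If you ever did approach the limit of a smaller type, ALTER TABLE to widen the column is always an option, so starting with an auto-increment key is not a trap.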