It depends on your engine. Common wisdom is that reads are cheap; a few bytes here and there will not significantly impact the performance of a small-to-medium-size database.
More importantly, it depends on the uses to which you will put the primary key. Integer serials have the advantage of being simple to use and implement. They also, depending on the specific implementation of the serial generator, have the advantage of being quickly derivable, as most databases just store the next serial number in a fixed location rather than deriving it on the fly with SELECT MAX(id) + 1 FROM foo.
The question becomes: how does a 5-character key present a "meaningful value" to you and to the application? How is this value created, and does it take more or less time than fetching an incrementing serial number? While an integer key does save a trivial amount of space, the vast majority of systems will never notice the savings.
There are no performance implications, save that the character scheme can never be generated automatically by the engine, since your "keys" are underivable. For your specific domain, don't bother with artificial keys; just use Chinese, Japanese, Thai, and so on as the key values. While you cannot guarantee uniqueness over every possible application, in your scope it is much more reasonable to use them instead of horrible, forced 5-character abbreviations. There is no significant performance impact until you get into the millions of tuples.
Alternatively, if you're just tracking by country of origin, and not specific regional cuisines (Cantonese, Sichuan, Sicilian, Umbrian, Calabrian, Yucatecan, Oaxacan, etc.), you could always just use ISO 3166 codes.
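A minimal sketch of that natural-key approach (table and column names here are illustrative, not from your schema):

CREATE TABLE cuisine (
    name VARCHAR(64) NOT NULL PRIMARY KEY  -- 'Chinese', 'Japanese', 'Thai', ...
);

CREATE TABLE recipe (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title   VARCHAR(200) NOT NULL,
    cuisine VARCHAR(64)  NOT NULL,
    FOREIGN KEY (cuisine) REFERENCES cuisine (name)
);

The cuisine is readable directly off any recipe row, with no join needed just to display it.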
If I have 10,000 recipes, doesn't the difference between a 5-character and a 20-character key start to add up?
Space is cheap. When you're talking 10,000,000 recipes that you're doing OLAP operations on, then, maybe. With 10k recipes, you're looking at roughly 150k of space (10,000 rows × 15 extra bytes).
But again, it depends. If you have many millions of records, and are doing joins on them, then it makes sense to denormalize a lookup this trivial (into a materialized view). For all practical purposes, the relative join efficiency on a modern machine between a 5-character key and a variable-length key is so similar as to be identical. Happily, we live in a world of plentiful CPU and plentiful disk. The real killers are too many joins and inefficient queries, rather than character-by-character key comparison. With that said, always test.
Performance-and-tuning questions at this level are so database-dependent that generalizations are extremely difficult. Build two sample models of the database, populate them with the estimated numbers of records, then see which one is faster. In my experience, key length doesn't make a huge difference compared with good indexes, good memory configuration, and the other critical performance-tuning elements.
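A rough sketch of that experiment (names, types, and row counts are placeholders; timing mechanics vary by engine):

-- Model A: 5-character key; Model B: integer surrogate key.
CREATE TABLE recipe_a (cuisine CHAR(5) NOT NULL, title VARCHAR(200), KEY (cuisine));
CREATE TABLE recipe_b (cuisine_id INT UNSIGNED NOT NULL, title VARCHAR(200), KEY (cuisine_id));

-- Populate both with the same estimated row counts, then compare plans and
-- wall-clock times for your real queries against each model:
EXPLAIN SELECT COUNT(*) FROM recipe_a WHERE cuisine = 'THAI0';
EXPLAIN SELECT COUNT(*) FROM recipe_b WHERE cuisine_id = 3;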
By default, your usage of mysqldump will lock each table as it dumps it, and there will not be any consistency among tables. You will also not get any stored procedures, stored functions, or events backed up (triggers, by contrast, are included by default), and if any of your view definitions are invalid (referencing nonexistent tables or columns that you may have dropped or renamed since the view was defined), the dump process will terminate when that error is encountered.
If all of your tables are InnoDB, you can get a consistent snapshot by adding this:
--single-transaction
This option disables the default --lock-tables behavior (which locks each table as it is dumped, then unlocks it when done dumping it) and instead issues the following statements to the server at the beginning of the dump (which you can observe by enabling the general query log):
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION WITH CONSISTENT SNAPSHOT;
Of course, this only helps with InnoDB. If your tables are MyISAM or any other non-transactional engine, you only have one option:
--lock-all-tables
...which locks all of the tables for the entire dump, since that's the only way to ensure consistency without transactions to assist.
To back up your stored procedures, functions, and events (and to be explicit about triggers, which are on by default), add:
--routines --events --triggers
To prevent mysqldump from failing when there's an invalid view definition, add:
--force
To capture your binlog position and store it in the backup file, add:
--master-data=2
This option doesn't work unless binary logging is enabled on the server you're dumping, since it reads the server's binary log coordinates.
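Putting those options together, a full invocation might look like this (the user, password handling, and output file name are placeholders):

mysqldump --single-transaction --routines --events --triggers \
          --force --master-data=2 \
          --user=backup_user --password --all-databases > full_dump.sql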
Your question was about backing up the database "with rsnapshot," although, to put a fine point on it, you're proposing backing up the database with mysqldump and then backing up the dump file with rsnapshot... It's hard to tell from a brief look at the rsnapshot site whether that's going to be a good choice, because rsnapshot appears to have file-level granularity. It doesn't appear to store diffs, but instead stores a full copy of each file that changes between backup cycles:
"The amount of space required is roughly the size of one full backup, plus a copy of each additional file that is changed."
-- http://www.rsnapshot.org/rsnapshot.html
I welcome correction if I'm wrong on this point. In this case, you only have 1 file, your dump file from mysqldump, which will of course change every time.
Whatever you do, don't entertain the thought of trying to back up the actual files (.frm, .ibd, .MYD, etc.) underlying your tables in MySQL. That does not work while the server is running; even though under rare conditions you might get a backup that seems to be intact, it isn't.
Update:
There's an important distinction between what mylvmbackup is doing and "backing up the actual files." To use mylvmbackup, you have to be using LVM, which can be visualized as a layer of abstraction between your filesystem and your hard drive. With LVM, you have the ability to freeze the file system and make a point-in-time snapshot of the entire filesystem.
It looks like mylvmbackup is doing a FLUSH TABLES followed by a FLUSH TABLES WITH READ LOCK prior to looking up the replication coordinates (if specified in configuration), which is the same process mysqldump uses when --single-transaction and --master-data are both specified. This stabilizes MySQL, but it does not fully quiesce InnoDB, so when a backup like this is restored, InnoDB will think it crashed and will do crash recovery... but it should be a clean recovery.
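For illustration, here is roughly the sequence mylvmbackup automates, done by hand in an interactive mysql session (the volume group, logical volume, and mount point names are placeholders). The client's system command runs a shell command without closing the session, so the read lock stays held while the snapshot is taken:

mysql> FLUSH TABLES;
mysql> FLUSH TABLES WITH READ LOCK;
mysql> SHOW MASTER STATUS;  -- record the replication coordinates
mysql> system lvcreate --snapshot --size 1G --name mysql_snap /dev/vg0/mysql_data
mysql> UNLOCK TABLES;

The snapshot can then be mounted (mount /dev/vg0/mysql_snap /mnt/mysql_snap) and the data directory copied from it at leisure.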
If your web site's data files were in the same filesystem, then an LVM snapshot would contain everything, consistent to a single point in time; but whether having your other web site files in the same filesystem as the database is good practice (if, indeed, running your web server on the same machine is good practice) is another discussion.
Whatever the approach, it's vitally important in any backup strategy to periodically verify that your backups can actually be restored (on a different machine). The truly safest method would be to set innodb_fast_shutdown = 0 and actually shut down MySQL before taking the backup, but that's probably not practical.
Of course, if you want to go with something like rsnapshot, you could always snapshot the web site files, then snapshot the database, then snapshot the web site files again. If nothing changes in the site files while the database is being backed up, rsnapshot will use very little space for that second run. If things do change, it should be fairly trivial to understand the differences if you need to recover.
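As a sketch, that ordering maps onto rsnapshot's configuration roughly like this (the paths and dump script are placeholders; fields in rsnapshot.conf must be tab-separated, and backup_script is rsnapshot's hook for commands that generate files, such as a mysqldump wrapper):

backup          /var/www/                       localhost/www-before/
backup_script   /usr/local/bin/mysql_dump.sh    localhost/mysql/
backup          /var/www/                       localhost/www-after/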
My recommendation is, as Max has suggested, to store the files on the filesystem and store a path in the db as a varchar. This has the advantage of not having to deal with the files in the db. Even in other databases, I think storing frequently accessed binary files in the db is asking for performance issues (storing less frequently used files in the db, OTOH, is a good thing).
However, it does have the disadvantage of requiring multiple backups to capture all of your data. A secondary option might be to store the files in the db as a blob (in a separate table) and then store them on the filesystem too. This way you can recreate a file from the db if it goes missing, and you can count on your database being a complete set of the files for backup purposes. This may be needless complexity, though.
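A minimal sketch of that layout (table and column names are illustrative):

-- The canonical copy lives on the filesystem; the db stores only the path.
CREATE TABLE media_file (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    file_path VARCHAR(1024) NOT NULL
);

-- Optional mirror of the bytes, so the database alone is a complete backup set.
CREATE TABLE media_file_blob (
    file_id INT UNSIGNED NOT NULL PRIMARY KEY,
    content LONGBLOB NOT NULL,
    FOREIGN KEY (file_id) REFERENCES media_file (id)
);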
Also, if you are storing video files in the db, one real reason to go to PostgreSQL would be its stream-based large-object (LOB) interface, which would allow you to chunk the output.
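For illustration, a sketch of chunked reads using PostgreSQL's server-side large-object functions (the OID 16385 is a placeholder for an object created earlier with lo_create or lo_import; 262144 is the INV_READ open flag):

BEGIN;
SELECT lo_open(16385, 262144);  -- open the large object read-only; returns a descriptor
SELECT loread(0, 65536);        -- read the first 64 KB via descriptor 0
-- repeat loread() until it returns fewer bytes than requested
COMMIT;

Client-side drivers expose the same interface more conveniently (e.g. lo_read() in libpq), which is what makes streaming video out in chunks practical.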