It depends on your engine. Common wisdom is that reads are cheap; a few bytes here and there will not significantly impact the performance of a small to medium size database.
More importantly, it depends on the uses to which you will put the primary key. Integer serials have the advantage of being simple to use and implement. Depending on the specific implementation, they also have the advantage of being quickly derivable, as most databases store the next serial number in a fixed location rather than computing it on the fly with SELECT MAX(ID) + 1 FROM foo.
The question becomes: how does a 5-character key present a "meaningful value" to you and to the application? How is this value created, and does it take more or less time than generating an incrementing serial number? While a trivial amount of space is saved relative to some integer types, the vast majority of systems will ignore this space savings.
There are no performance implications, save that a character scheme can never be generated automatically by the engine, as your "keys" are not derivable. For your specific domain, don't bother with artificial keys; just use Chinese, Japanese, and Thai as the key values. While you cannot guarantee uniqueness across every possible application, within your scope it is much more reasonable to use them instead of horrible, forced 5-character abbreviations. There is no significant performance impact until you get into the millions of tuples.
Alternatively, if you're just tracking by country of origin, and not specific regional cuisines (Cantonese, Sichuan, Sicilian, Umbrian, Calabrian, Yucatecan, Oaxacan, etc.), you could always just use ISO 3166 codes.
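As a minimal sketch of that approach (table and column names here are illustrative, not from the original schema), the cuisine lookup could be keyed directly by the ISO 3166-1 alpha-2 code:

```sql
-- Natural key: the ISO 3166-1 alpha-2 country code, e.g. 'CN', 'JP', 'TH'
CREATE TABLE cuisine (
    country_code CHAR(2) PRIMARY KEY,
    cuisine_name VARCHAR(100) NOT NULL
);
```

A fixed two-character code gives you most of the compactness of an integer key while staying human-readable in ad-hoc queries.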
If I have 10,000 recipes doesn't the difference between a 5-character and 20-character key start to add up?
Space is cheap. When you're talking about 10,000,000 recipes that you're running OLAP operations on, then, maybe. With 10k recipes, the 15 extra bytes per key work out to roughly 150 KB of space.
But again, it depends. If you have many millions of records and are doing joins on them, then it makes sense to denormalize a lookup this trivial (into a materialized view). For all practical purposes, the relative join efficiency on a modern machine between a 5-character key and a variable-length key is so similar as to be identical. Happily, we live in a world of plentiful CPU and plentiful disk. The real performance killers are too many joins and inefficient queries, not character-by-character comparison. With that said, always test.
Performance-and-tuning questions at this level are so database-dependent that generalizations are extremely difficult. Build two sample models of the database, populate them with the estimated numbers of records, then see which one is faster. In my experience, key length doesn't make a huge difference compared with good indexes, good memory configuration, and the other critical performance-tuning elements.
I think the easiest solution for you would be to create a stored procedure that takes @GameName and @ConsoleName as parameters.
CREATE PROCEDURE InsertGame @GameName VARCHAR(50), @ConsoleName VARCHAR(50)
AS
BEGIN
    DECLARE @ConsoleID INT

    -- Look up the console; @ConsoleID stays NULL if it doesn't exist yet
    SELECT @ConsoleID = ConsoleID FROM Console WHERE ConsoleName = @ConsoleName

    IF @ConsoleID IS NULL
    BEGIN
        INSERT Console (ConsoleName) VALUES (@ConsoleName)
        SELECT @ConsoleID = SCOPE_IDENTITY()
    END

    INSERT Game (Name, ConsoleID) VALUES (@GameName, @ConsoleID)
END
The stored procedure first looks up the ID for the console name you provided. If you passed a @ConsoleName that doesn't already exist in the Console table, it INSERTs it there and captures the new ID. It then INSERTs a record into the Game table with @GameName and the @ConsoleID.
Then you'd call that stored procedure for each Game you were planning to enter:
EXEC InsertGame 'Grand Theft Auto 5', 'PS4'
EXEC InsertGame 'Grand Theft Auto 5', 'PC'
EXEC InsertGame 'Final Fantasy XV', 'PS4'
EXEC InsertGame 'Overwatch', 'PC'
EXEC InsertGame 'Civilization 5', 'PC'
And so on until you've inserted all of your games.
Please note that the exact syntax I've used is for the database platform I am most familiar with, but the underlying logic should be usable in any database platform.
Also, as I write this I am struck by the idea that what you really want is a many-to-many relationship between game titles and platforms, since a platform can have many games and a game can be available on multiple platforms. That would take a little more work, but it stores the data more efficiently and lets you easily determine which platform(s) a given game is available for without doing a text match.
So, at a high level, you'd need another table. The existing Game table would no longer have a ConsoleID column. Instead, there would be a new junction table, likely called Console_Game, containing only ConsoleID and GameID as its keys. The stored procedure would then need to check for the existence of both ConsoleName and GameName, inserting into both tables as necessary, and finally write a record to Console_Game to show that the given game is available on the given console.
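As a minimal sketch of that junction table (column names follow the ones used above; the REFERENCES targets assume Console and Game each have an integer identity key named ConsoleID and GameID respectively):

```sql
-- Junction table: one row per (console, game) pairing.
-- The composite primary key prevents recording the same pairing twice.
CREATE TABLE Console_Game (
    ConsoleID INT NOT NULL REFERENCES Console (ConsoleID),
    GameID    INT NOT NULL REFERENCES Game (GameID),
    PRIMARY KEY (ConsoleID, GameID)
)
```

Listing the platforms for a given game then becomes a simple join through Console_Game rather than a text match.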
This will be more work upfront, but it is a more scalable solution than what you currently have. If this is for a job and not a hobby you should probably go that route.
Best Answer
UUIDs are useful when you have clients independently generating unique identifiers. id INT UNSIGNED AUTO_INCREMENT is smaller, faster, 'ordered', etc. Use UUIDs only if you don't have a viable alternative. More discussion: http://mysql.rjweb.org/doc.php/uuid
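As a rough sketch of the trade-off (table names here are illustrative; UUID_TO_BIN requires MySQL 8):

```sql
-- 4-byte, monotonically increasing key: compact and index-friendly,
-- since new rows always land at the end of the primary-key index
CREATE TABLE t_serial (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY
);

-- 16-byte UUID key: safe for independent clients generating IDs, but
-- values arrive in random order, scattering writes across the PK index
CREATE TABLE t_uuid (
    id BINARY(16) PRIMARY KEY  -- populate with UUID_TO_BIN(UUID())
);
```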
In my opinion, sha256 is overkill for a 'digest'.
500K rows INSERTed per day? That's about 6 per second; not a problem. When you get to 100/sec, we should talk further.