Sql-server – Designing a table for large amount of rows. No relationships

database-designperformancequery-performancesql server

I am looking to design a database that'll allow me to store references to files. The table is pretty simple:

Id (GUID)(PK)
Path (nvarchar(250))
IsPublic (bit)
FileType (int)

The files can be accessed via url by putting the GUID in the url, such as files/{GUID}, so if request comes in, I can query the table by the GUID as Primary Key.

Do you think this is a valid approach? What insert/query performance am I looking at with hundreds of millions of records?

The database engine is Microsoft SQL Server.

Some queries I am predicting:

SELECT * FROM Files where ID = {GUID}

INSERT INTO Files () VALUES ()

UPDATE Files SET IsPublic = true WHERE ID = {GUID}

UPDATE Files SET Label = {NewLabel} WHERE ID = {GUID}

DELETE FROM Files WHERE ID = {GUID}

Most work will happen in select and insert. There will be very few updates and deletes. Maybe 10 inserts per second at its peak.

Each service stores the FileIds next to the tables that use it. Items service has ItemFiles which will then retrieve the Files from the service/table.

Best Answer

ID - I don't like GUIDs where GUIDs can be avoided. They're 16 bytes and thrown randomly into your tables causing fragmentation and page splits. A 4-byte INT or 8-byte BIGINT as a sequence or identity column would provide the same functionality, be smaller, and insert into the table sequentially.

Path - Does this really need to be NVARCHAR? This will take twice the amount of storage as VARCHAR unless you use row-level compression. Going from 500 bytes to 250 bytes doesn't seem like a lot for one row, but for a million rows you're at 500,000,000 to 250,000,000, which is 250 MB according to your friendly hard drive marketing department.

FileType - This is really a small nitpick. Do you really need to have a 4-byte INT here? This allows you to have billions of file types. Chances are a 1-byte TINYINT limited to 255 choices would be more than enough.

The biggest thing from your concerns is getting the GUID out of there. On an insert-intensive table you don't want SQL Server writing all over the place doing page splits. I mention the size of the columns which is a nice-to-have, but its the sequential writes that are going to make you the happiest with your performance.

However, sequential numbers are a bit of a security risk if someone is allowed to view some files in the system but not others and you rely on them not being able to guess a GUID to avoid them finding other files. If they bring up file 1000 then they can be pretty sure that files 999 and 1001 probably exist and view those as well. I don't know if this is a concern based on what you wrote.