SQL Server – Are there any performance differences with a GUID vs INT clustered indexed column, especially with joins on those columns?

clustered-index, index, performance, query-performance, sql-server, uuid

Basically what the title says: What are the differences in performance (if any) between using a clustered index on a GUID column vs a clustered index on an INT column, especially when those columns are part of your join predicates?

Edit: To clarify, I was always under the assumption that GUIDs typically perform worse than INTs when used in join predicates.
I understand a GUID is 4x the data size of an INT, but I was asking my question to really understand why and what's going on "under the hood" that causes GUIDs to perform worse, even after the indexes are rebuilt and there's minimal fragmentation differences between the two types.

Best Answer

I understand a GUID is 4x the data size of an INT, but I was asking my question to really understand why and what's going on "under the hood" that causes GUIDs to perform worse, even after the indexes are rebuilt and there's minimal fragmentation differences between the two types.

I would say that the issue is primarily just what you started out with here, in what I quoted: UNIQUEIDENTIFIER is 4x bigger than INT (16 bytes compared to 4). This hurts performance in three ways (I am ignoring fragmentation since the question explicitly accounts for that):

  1. Every byte compared is additional work, and more work for the CPU takes more time. That time won't be noticed on modern systems when looking at 100 rows. But over 1 million rows? Now you are talking about comparing 16 million bytes vs. comparing only 4 million bytes.

  2. Memory is not free: neither in terms of money, nor in terms of time spent allocating and deallocating. In order to join rows, data pages containing those rows must be read into memory (the buffer pool). There is not an infinite amount of memory (especially if you are using Standard Edition or Express Edition). The more memory one operation takes up, the less is left over for other concurrent operations. But even if memory were infinite, it still takes time to load those values into memory so that they can then be compared. Loading 16 million bytes naturally takes longer than loading 4 million bytes.

  3. Disk is not free: even with SSDs, physical I/O is still a cost you pay, which is why things are cached in memory. Data pages are only 8 KB. If you have a 16-byte value, it will reduce the number of rows that can fit on that 8 KB data page as compared to having a 4-byte value. The more data pages a table takes up, the more physical I/O is needed to read and write the same number of rows.

    A related issue is indexes. You will likely index the columns that you are joining on, and the same concern applies as with the main table: fewer index rows per page means more pages required for the index, which means more time spent reading and writing every index affected by a given operation. A new row in a table must also be added to every index on that table (minus any filtered indexes that would exclude it), and since we are talking about the clustered index, its key is copied into every non-clustered index as well.
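To put rough numbers on the three costs above, here is a back-of-the-envelope sketch in Python. The key sizes, row count, and 8 KB page size come from the answer (8,060 bytes is the usable row space on a SQL Server data page); the per-row payload size and the non-clustered index count are made-up values purely for illustration.

```python
# Back-of-the-envelope arithmetic for the three costs above.
# Key sizes and row count are from the text; OTHER_COL_BYTES and
# NONCLUSTERED_INDEXES are hypothetical, for illustration only.
ROWS = 1_000_000
INT_KEY = 4               # INT: 4 bytes
GUID_KEY = 16             # UNIQUEIDENTIFIER: 16 bytes
PAGE_BYTES = 8060         # usable row space on an 8 KB data page
OTHER_COL_BYTES = 100     # hypothetical fixed payload per row
NONCLUSTERED_INDEXES = 5  # hypothetical index count

def bytes_compared(key_bytes: int) -> int:
    # Cost 1: every joined row compares the full key value.
    return ROWS * key_bytes

def pages_needed(key_bytes: int) -> int:
    # Cost 3: a wider key means fewer rows per page, so more pages.
    rows_per_page = PAGE_BYTES // (key_bytes + OTHER_COL_BYTES)
    return -(-ROWS // rows_per_page)  # ceiling division

def index_key_bytes(key_bytes: int) -> int:
    # Related issue: each non-clustered index row carries a copy of
    # the clustered key, so its width is paid once per index, per row.
    return ROWS * NONCLUSTERED_INDEXES * key_bytes

for key in (INT_KEY, GUID_KEY):
    print(key, bytes_compared(key), pages_needed(key), index_key_bytes(key))
```

With these (made-up) payload and index counts, the GUID key compares 4x the bytes, needs roughly 12% more data pages, and stores an extra 60 MB of key copies across the non-clustered indexes, all before fragmentation enters the picture.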