Mysql int vs varchar as primary key (InnoDB Storage Engine

innodbMySQLperformanceprimary-key

I am build a web application (project management system) and I have been wondering about this when it come to performance.

I have an Issues table an inside it there are 12 foreign keys linking to various other tables. of those, 8 of them I would need to join to get the title field from the other tables in order for the record to make any sense in a web application but then means doing 8 joins which seems really excessive especially since I am only pulling in 1 field for each of those joins.

Now I have also been told to use a auto incrementing primary key (unless sharding is a concerns in which case I should use a GUID) for permanence reasons but how bad is it to use a varchar (max length 32) performance wise? I mean most of these table are probably not going to have at many records (most of them should be under 20). Also if I use the title as the primary key, I won't have to do joins 95% of the time so for 95% of the sql, I would even occur any performance hit (I think). The only downside I can think of is that I have is I will have higher disk space usage (but down a day is that really a big deal).

The reason I am use lookup tables for a lot of this stuff instead of enums is because I need all of these values to be configurable by the end user through the application itself.

What are the downsides of using a varchar as the primary key for a table not excepted to have many records?

UPDATE – Some Tests

So I decided to do some basic tests on this stuff. I have 100000 records and these are the base queries:

Base VARCHAR FK Query

SELECT i.id, i.key, i.title, i.reporterUserUsername, i.assignedUserUsername, i.projectTitle, 
i.ProjectComponentTitle, i.affectedProjectVersionTitle, i.originalFixedProjectVersionTitle, 
i.fixedProjectVersionTitle, i.durationEstimate, i.storyPoints, i.dueDate, 
i.issueSecurityLevelId, i.creatorUserUsername, i.createdTimestamp, 
i.updatedTimestamp, i.issueTypeId, i.issueStatusId
FROM ProjectManagement.Issues i

Base INT FK Query

SELECT i.id, i.key, i.title, ru.username as reporterUserUsername, 
au.username as assignedUserUsername, p.title as projectTitle, 
pc.title as ProjectComponentTitle, pva.title as affectedProjectVersionTitle, 
pvo.title as originalFixedProjectVersionTitle, pvf.title as fixedProjectVersionTitle, 
i.durationEstimate, i.storyPoints, i.dueDate, isl.title as issueSecurityLevelId, 
cu.username as creatorUserUsername, i.createdTimestamp, i.updatedTimestamp, 
it.title as issueTypeId, is.title as issueStatusId
FROM ProjectManagement2.Issues i
INNER JOIN ProjectManagement2.IssueTypes `it` ON it.id = i.issueTypeId
INNER JOIN ProjectManagement2.IssueStatuses `is` ON is.id = i.issueStatusId
INNER JOIN ProjectManagement2.Users `ru` ON ru.id = i.reporterUserId
INNER JOIN ProjectManagement2.Users `au` ON au.id = i.assignedUserId
INNER JOIN ProjectManagement2.Users `cu` ON cu.id = i.creatorUserId
INNER JOIN ProjectManagement2.Projects `p` ON p.id = i.projectId
INNER JOIN ProjectManagement2.`ProjectComponents` `pc` ON pc.id = i.projectComponentId
INNER JOIN ProjectManagement2.ProjectVersions `pva` ON pva.id = i.affectedProjectVersionId
INNER JOIN ProjectManagement2.ProjectVersions `pvo` ON pvo.id = i.originalFixedProjectVersionId
INNER JOIN ProjectManagement2.ProjectVersions `pvf` ON pvf.id = i.fixedProjectVersionId
INNER JOIN ProjectManagement2.IssueSecurityLevels isl ON isl.id = i.issueSecurityLevelId

I also ran these query with the following additions:

  • Select specific item (where i.key = 43298)
  • Group by i.id
  • Order by (it.title for int FK, i.issueTypeId for varchar FK)
  • Limit (50000, 100)
  • Group and limit together
  • Group, order, and limit together

The results for these where:

QUERY TYPE : VARCHAR FK TIME / INT FK TIME


Base query: ~4ms / ~52ms

Select specific item : ~140ms / ~250ms

Group by i.id : ~4ms / ~2.8sec

Order by : ~231ms / ~2sec

Limit : ~67ms / ~343ms

Group and limit together : ~504ms / ~2sec

Group, order, and limit together : ~504ms /~2.3sec

Now I don't know what configuration I could make to make one or the other (or both) faster but it seems like the VARCHAR FK see faster in queries for data (sometimes a lot faster).

I guess I have to choice whether that speed improvement is worth the extra data/index size.

Best Answer

I follow the following rules for primary keys:

a) Should not have any business meaning - they should be totally independent of the application you are developing, therefore I go for numeric auto generated integers. However if you need additional columns to be unique then create unique indexes to support that

b) Should perform in joins - joining to varchars vs integers is about 2x to 3x slower as the length of the primary key grows, so you want to have your keys as integers. Since all computer systems are binary, I suspect its coz the string is changed to binary then compared with the others which is very slow

c) Use the smallest data type possible - if you expect your table to have very few columns say 52 US states, then use the smallest type possible maybe a CHAR(2) for the 2 digit code, but I would still go for a tinyint (128) for the column vs a big int which can go up to 2billion

Also you will have a challenge with cascading your changes from the primary keys to the other tables if for example the project name changes (which is not uncommon)

Go for sequential auto incrementing integers for your primary keys and gain the inbuilt efficiencies that database systems provide with support for changes in the future