First, each time you `UPDATE` the `status` column, the index on that column has to be updated as well. Evaluate your indexing to see whether you really need the index on `status`. My guess is no: it has extremely low cardinality, and MySQL probably won't use it anyway.
If you ignore me and think you do need it, follow the advice in the article to drop the index before your loop and re-add it after you're done.
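As a sketch, dropping and re-adding the index around the loop might look like this (the index name `idx_status` is a placeholder; check `SHOW INDEX FROM data` for the real one):

```sql
-- Drop the index so the bulk updates don't have to maintain it.
ALTER TABLE data DROP INDEX idx_status;

-- ... run the update loop here ...

-- Re-add the index in one pass when the loop is done.
ALTER TABLE data ADD INDEX idx_status (status);
```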
Here are some other things you might do if that doesn't help:
You are taking all the columns from `data` but only using `number`. Don't do a `SELECT *`, but instead a `SELECT number`. That won't help your writes, but it is good performance practice: only select the columns you're using.
Your `number` index isn't getting used at all. This means it isn't selective enough to be useful for these updates. (Slight tangent: how many rows does a single `UPDATE` affect?) I would drop it, or at least fold it into the `process` index.
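Folding `number` into the `process` index might look like this (the index names are placeholders; check `SHOW INDEX FROM data` for yours):

```sql
-- Replace the standalone index with a composite one, so a lookup on
-- WHERE process = X AND number = ... can use both columns.
ALTER TABLE data
  DROP INDEX idx_number,
  ADD INDEX idx_process_number (process, number);
```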
It looks like `process` is selective enough for MySQL to whittle the number of rows down to 16k, instead of 1 million. In light of this, I would add `AND process=X` to your update statements (I'm assuming you know `process` from the original `SELECT` statement):
```sql
-- FAILED --
UPDATE data SET status = 2, error = '$error' WHERE process = X AND number = $data['number'];

-- SUCCESS --
UPDATE data SET status = 1 WHERE process = X AND number = $data['number'];
```
A hint about unnecessary indexes in InnoDB: since you don't have a primary key defined, InnoDB uses a hidden one and appends it to every secondary index entry it writes. So for each index you define, you add the size of the indexed column(s) plus the size of the hidden primary key to the data file. If you're not using an index (or MySQL can't use it), you are wasting space and adding overhead each time you insert a new `number` (the same goes for `status`, as discussed earlier).
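A hedged way to check which indexes are actually earning their keep (the literal values here are placeholders):

```sql
-- Which index, if any, would MySQL pick for this lookup?
EXPLAIN SELECT number FROM data WHERE process = 1 AND number = 12345;

-- List every index on the table with its cardinality estimate;
-- a cardinality near 1 (as on status) suggests the index is rarely useful.
SHOW INDEX FROM data;
```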
I concur with your speculation that it's bad design practice, and I'd recommend against it. Any time you do this sort of thing, it almost inevitably comes back to bite you, and the performance difference compared to a properly designed schema should be small.
You say that as it stands now, your process "would involve searching through the first table for all question_ids answered by the user, then searching the topics_and_questions table" but that's not really a good analysis of what should happen behind the scenes.
Properly designed, this process would involve checking an index on the first table for all questions answered by the user, joining to the second table to find the topics of the questions (and possibly a third table to find the names of the topics), but all in one query and with no full table scans.
If your first table (no table name provided) has (user_id, question_id) as its primary key (as it probably should), then finding the questions answered by each user will be a fast operation, no matter how many records are in the table. Rows in InnoDB are stored in primary key order, so there's no "searching through" the entire table that needs to happen... the storage engine can go directly to where that user's records are without a full table scan. If (user_id, question_id) isn't the primary key, then adding an index on (user_id, question_id) will optimize this part of the query.
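For illustration, the single-query approach might look like this (`answers` is a placeholder name for the unnamed first table, and the column names are assumptions):

```sql
-- One query, no full table scans: find the user's answered questions via
-- the (user_id, question_id) index, then join to get each question's topic.
SELECT t.topic_id, COUNT(*) AS questions_answered
FROM answers AS a
JOIN topics_and_questions AS t ON t.question_id = a.question_id
WHERE a.user_id = 123
GROUP BY t.topic_id;
```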
In `topics_and_questions`, either (question_id) or (question_id, topic_id) should probably be the primary key, depending on whether a question can be in multiple topics. If it can, of course, then counting questions answered becomes more complicated.
Since you appear to be allowing a user to answer the same question more than once, there may be yet another issue to consider: every time you update this new proposed table, you have to duplicate part of what you're trying to avoid. You have to check whether the question just answered has been answered before, and only increment this counter if it hasn't, I would assume.
Best Answer
A connection survives until one of these happens: the client disconnects explicitly, the connection sits idle longer than the server's `wait_timeout`, or the server (or the network between you) goes away.
Normally, you should code your app to connect once near the beginning and disconnect when finished. (There are rare use cases for having multiple connections open at once.)
Most APIs have an obscure setting that determines whether you get all the rows of a `SELECT` in one "round trip" to the server, or whether they are fed to you individually (maybe in small chunks?) over many round trips. There are advantages and disadvantages to either. 99.9% of the time, I am happy getting the entire result set in one round trip, even if it is 100K rows.