First, each time you `UPDATE` the `status` column, you also have to update the index (source). Evaluate your indexing to see whether you really need the index on the `status` column. My guess is no: it has extremely low cardinality, and MySQL probably won't use it anyway.
If you ignore me and think you do need it, follow the advice in the article to drop the index before your loop and re-add it after you're done.
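The drop/re-add pattern from the article looks roughly like this (the table name `data` is from your query, but the index name is an assumption, since yours isn't shown):

```sql
-- Drop the low-cardinality index before the bulk UPDATE loop
ALTER TABLE data DROP INDEX idx_status;

-- ... run the UPDATE loop here ...

-- Re-create it afterwards, if you decide you still need it
ALTER TABLE data ADD INDEX idx_status (status);
```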
Here are some other things you might do if that doesn't help:
You are taking all the columns from the `data` table but only using `number`. Don't do a `SELECT *`; instead do a `SELECT number`. That won't help your writes, but it is good performance practice: only select the columns you're using.
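As a minimal sketch (the `WHERE` clause is an assumption, since your original query isn't shown):

```sql
-- Instead of pulling every column:
SELECT * FROM data WHERE process = X;

-- fetch only what the loop actually uses:
SELECT number FROM data WHERE process = X;
```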
Your `number` index isn't getting used at all, which means it is not unique enough to be useful for updating. (Slight tangent: how many rows does a single `UPDATE` affect?) I would drop it, or at least add it to the `process` index.
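If you do keep an index for these updates, a composite index on both filter columns would let MySQL narrow the rows far better than `number` alone (the index names here are assumptions):

```sql
-- Drop the index that MySQL isn't using anyway
ALTER TABLE data DROP INDEX idx_number;

-- One composite index serves both WHERE process = X and WHERE process = X AND number = Y
ALTER TABLE data ADD INDEX idx_process_number (process, number);
```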
It looks like `process` is unique enough for MySQL to whittle the row count down to 16k instead of 1 million. In light of this, I would add `AND process=x` to your update statements (I'm assuming you know `process` from the original `SELECT` statement):
```sql
-- FAILED --
UPDATE data SET status = 2, error = '$error' WHERE process = X AND number = $data['number'];
-- SUCCESS --
UPDATE data SET status = 1 WHERE process = X AND number = $data['number'];
```
A hint about unnecessary indexes in InnoDB: since you don't have a primary key defined, InnoDB uses a hidden one and includes it when it writes the indexes. So for each index, you add the size of the index plus the size of the hidden primary key to the data file. If you're not using an index (or MySQL can't use it), you are wasting space and adding overhead every time you insert a new `number` (the same goes for `status`, as discussed earlier).
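One way to stop paying for the hidden key is to declare an explicit primary key yourself, so secondary indexes reference a column you control (the column name is an assumption):

```sql
-- InnoDB then clusters on this key instead of a hidden 6-byte row id,
-- and every secondary index stores this value as its row pointer
ALTER TABLE data ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;
```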
You write:

> Each customer can have multiple sites, but only one should be displayed in this list.
Yet your query retrieves all rows. That would be a point to optimize. You also do not define which `site` is to be picked.
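In PostgreSQL, one common way to express "one site per customer" is `DISTINCT ON` (table and column names here are assumptions, since your schema isn't shown):

```sql
-- Returns exactly one site row per customer;
-- the second ORDER BY column decides which one "wins"
SELECT DISTINCT ON (customer_id) *
FROM   site
ORDER  BY customer_id, site_id;
```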
Either way, it does not matter much here. Your `EXPLAIN` shows only 5026 rows for the `site` scan (5018 for the `customer` scan), so hardly any customer actually has more than one site. Did you `ANALYZE` your tables before running `EXPLAIN`?
From the numbers I see in your `EXPLAIN`, indexes will give you nothing for this query; sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning? Or maybe the query itself is faster and "half a second" includes network transfer? `EXPLAIN ANALYZE` would tell us more.
If this query is your bottleneck, I would suggest you implement a materialized view.
After you provided more information, I find that my diagnosis pretty much holds.
The query itself needs 27 ms, so not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected: the slow part is the network transfer (plus ssh encoding/decoding, and possibly rendering). You should only retrieve 100 rows; that would solve most of it, even if it means executing the whole query every time.
If you go the route with a materialized view like I proposed, you could add a serial number without gaps to the table, plus an index on it, by adding a column `row_number() OVER (<your sort criteria here>) AS mv_id`.
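A sketch of what that could look like (the sort column and selected columns are placeholders; only `mv_id` and the view name come from the proposal above):

```sql
CREATE MATERIALIZED VIEW materialized_view AS
SELECT row_number() OVER (ORDER BY c.name) AS mv_id, c.*
FROM   customer c;

-- The index that makes range queries on mv_id fast
CREATE UNIQUE INDEX ON materialized_view (mv_id);
```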
Then you can query:
```sql
SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id < 2800;
```
This will perform very fast. `LIMIT` / `OFFSET` cannot compete: the server has to compute and sort the whole table before it can skip ahead and pick your 100 rows.
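For contrast, this is what the `LIMIT` / `OFFSET` version of the same page would look like (a sketch, using the same assumed view):

```sql
-- The server must sort everything and step over the first 2700 rows
-- before it can return any
SELECT *
FROM   materialized_view
ORDER  BY mv_id
LIMIT  100 OFFSET 2700;
```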
pgAdmin timing
When you execute a query from the query tool, the message pane shows something like:
Total query runtime: 62 ms.
And the status line shows the same time. I quote the pgAdmin help about that:

> The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.
If you want to see the time on the server, use SQL `EXPLAIN ANALYZE`, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:
Total runtime: 0.269 ms
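For example, prefixing the range query from above (server-side time only; the client transfer is excluded):

```sql
EXPLAIN ANALYZE
SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id < 2800;
```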
Best Answer
Having thought a bit more about the problem, I realise there's actually a relatively simple solution. Since each process only updates its 'own' fields, the solution is to split the data into two tables, allowing each half to be updated by its own process; any `SELECT`s can be done by joining on the common primary key.
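A sketch of such a split (all table and column names are assumptions, since the original schema isn't shown):

```sql
-- Each process writes only to its own table, so the two UPDATE
-- streams never contend for the same rows
CREATE TABLE data_status (
    number INT NOT NULL PRIMARY KEY,
    status TINYINT NOT NULL,
    error  VARCHAR(255)
);

CREATE TABLE data_process (
    number  INT NOT NULL PRIMARY KEY,
    process INT NOT NULL
);

-- Reads re-assemble the full row via the shared primary key
SELECT s.status, s.error, p.process
FROM   data_status  s
JOIN   data_process p USING (number);
```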