I have records like
A5
A4
Z1
B2
C7
C1A
C11A
B1
B4
I want them to be sorted in this manner
A4
A5
B1
B2
B4
C1
C11A
C7
Z1
using the ORDER BY clause.
I want them sorted alphabetically first and then by numeric value.
natural-sort order-by postgresql sorting
Let me try to explain why you should not do that: why you should never assume that an SQL product will return a result set in a specific order unless you specify one, whatever indexes (clustered or non-clustered, B-trees, R-trees, k-d trees, fractal trees or any other exotic kind) the DBMS is using.
Your original query tells the DBMS to search the SensorValues table, find rows that match the 3 conditions, order those rows by Date descending, keep only the first row from those and - finally - select and return only the SensorValue column.
SELECT TOP 1 SensorValue
FROM SensorValues
WHERE SensorId = 53
AND DeviceId = 3819
AND Date < 1339225010
ORDER BY Date DESC ;
These are very specific orders you have given to the DBMS, and the result will most probably be the same every time you run the query. (There is a chance it might not be, if more than one row matches the conditions and has the same max Date but a different SensorValue. Let's assume for the rest of the conversation that no such rows exist in your table.)
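If such ties could exist, one way to keep the result deterministic is to add a tie-breaker to the ORDER BY. A minimal sketch, assuming that among rows sharing the same max Date the highest SensorValue is the one wanted (that preference is an assumption, not something stated in the question):

```sql
-- Sketch (T-SQL): same query as above, plus a tie-breaker so that rows
-- sharing the same max Date cannot yield different results between runs.
-- Assumption: among tied rows, the highest SensorValue is the one wanted.
SELECT TOP (1) SensorValue
FROM SensorValues
WHERE SensorId = 53
  AND DeviceId = 3819
  AND Date < 1339225010
ORDER BY Date DESC, SensorValue DESC;
```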
Does the DBMS have to do this, to run this query, the exact way I describe it above? No, of course not and you know that. It may not read the table but read from an index. Or it may use two indexes if it thinks it's better (faster). Or three. Or it may use a cached result (not SQL Server but other DBMS cache query results). Or it may use parallel execution one time and not the next time it runs. Or ... (add any other feature that affects execution and execution plans).
What is guaranteed though is that it will return the exact same result, every time you run it - as long as no rows are inserted, deleted or updated.
Now let's see what your suggestion says:
SELECT TOP 1 SensorValue
FROM SensorValues
WHERE SensorId = 53
AND DeviceId = 3819
AND Date < 1339225010 ;
This query tells the DBMS to search the SensorValues table, find rows that match the 3 conditions, not care about the order, keep only one row and - finally - select and return only the SensorValue column.
So, it basically says the same as the first one, except that you want only one result that matches the conditions and you don't care which one.
Now, can we assume that it will give always the same result because of the clustered index?
- If it does use this clustered index every time, yes.
But will it use it?
- No.
Why not?
- Because it can. The query optimizer is free to choose a path of execution every time it runs a statement. Whatever path it sees fit at that time for that statement.
But isn't using the clustered index the best/fastest way to get results?
- No, not always. It might be the first time you run the query. The second time, it may use a cached result (if the DBMS has such a feature, not SQL Server*). The 1000th time the result may have been removed from the cache and another result may exist there. Say, you had executed this query just before:
SELECT TOP 1 SensorValue
FROM SensorValues
WHERE SensorId = 53
AND DeviceId = 3819
AND Date < 1339225010
ORDER BY Date ASC ; --- Notice the `ASC` here
and the cached result (from the above query) is another, different one that still matches your conditions but is not the first in your (wanted) ordering. And you have told the DBMS not to care about the order.
OK, so only cache can affect this?
- No, many other things, too.
*: SQL Server does not cache query results, but the Enterprise Edition does have an Advanced Scanning feature that is somewhat similar, in that you may get different results because of concurrent queries. Not sure exactly when this kicks in, though. (Thanks to @Martin Smith for the tip.)
I hope you are convinced that you should never rely on an SQL query returning results in a specific order unless you specify one. And never use TOP (n) without ORDER BY, unless of course you just want n rows in the result and you don't care which ones are returned.
The LEFT JOIN in @dezso's answer should be good. An index, however, will hardly be useful (per se), because the query has to read the whole table anyway. The exception is index-only scans in Postgres 9.2+ under favorable conditions, see below.
SELECT m.hash, m.string, count(m.method) AS method_ct
FROM methods m
LEFT JOIN nostring n USING (hash)
WHERE n.hash IS NULL
GROUP BY m.hash, m.string
ORDER BY count(m.method) DESC;
Run EXPLAIN ANALYZE on the query. Do it several times to exclude caching effects and noise. Compare the best results.
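The timing runs could look like the sketch below. The BUFFERS option (available since PostgreSQL 9.0) is an addition not mentioned in the answer; it reports how many pages were found in shared buffers ("hit") versus fetched from outside them ("read"), which may help to judge caching effects between runs.

```sql
-- Sketch: run this several times and compare the best "Execution time"
-- (labeled "Total runtime" in older PostgreSQL versions).
-- BUFFERS is optional; plain EXPLAIN ANALYZE works as well.
EXPLAIN (ANALYZE, BUFFERS)
SELECT m.hash, m.string, count(m.method) AS method_ct
FROM methods m
LEFT JOIN nostring n USING (hash)
WHERE n.hash IS NULL
GROUP BY m.hash, m.string
ORDER BY count(m.method) DESC;
```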
Create a multi-column index that matches your query:
CREATE INDEX methods_cluster_idx ON methods (hash, string, method);
Wait? After I said an index wouldn't help? Well, we need it to CLUSTER the table:
CLUSTER methods USING methods_cluster_idx;
ANALYZE methods;
Rerun EXPLAIN ANALYZE. Any faster? It should be.
CLUSTER is a one-time operation to rewrite the whole table in the order of the used index. It is also effectively a VACUUM FULL. If you want to be sure, you'd run a pre-test with VACUUM FULL alone to see what can be attributed to that.
If your table sees a lot of write operations, the effect will degrade over time. Schedule CLUSTER at off-hours to restore the effect. Fine tuning depends on your exact use case. See the manual about CLUSTER.
CLUSTER is a rather crude tool and needs an exclusive lock on the table. If you can't afford that, consider pg_repack, which can do the same without an exclusive lock. More in this later answer:
If the percentage of NULL values in the column method is high (more than ~ 20 percent, depending on actual row sizes), a partial index should help:
CREATE INDEX methods_foo_idx ON methods (hash, string)
WHERE method IS NOT NULL;
(Your later update shows your columns to be NOT NULL, so not applicable.)
If you are running PostgreSQL 9.2 or later (as @dezso commented), the presented indexes may be useful without CLUSTER if the planner can utilize index-only scans. This is only applicable under favorable conditions: no write operations that would affect the visibility map since the last VACUUM, and all columns in the query have to be covered by the index. Basically, read-only tables can use this any time, while heavily written tables are limited. More details in the Postgres Wiki.
The above mentioned partial index could be even more useful in that case.
If, on the other hand, there are no NULL values in column method, you should
1.) define it NOT NULL and
2.) use count(*) instead of count(method); that's slightly faster and does the same in the absence of NULL values.
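Applied to the query from the top of this answer, the two steps might look like this. This is only a sketch and assumes the column really holds no NULL values:

```sql
-- Step 1: declare the column NOT NULL.
-- Assumption: method contains no NULL values; otherwise this fails with an error.
ALTER TABLE methods ALTER COLUMN method SET NOT NULL;

-- Step 2: count(*) counts rows without inspecting method for NULL.
SELECT m.hash, m.string, count(*) AS method_ct
FROM methods m
LEFT JOIN nostring n USING (hash)
WHERE n.hash IS NULL
GROUP BY m.hash, m.string
ORDER BY count(*) DESC;
```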
If you have to call this query often and the table is read-only, create a MATERIALIZED VIEW.
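A minimal sketch of that idea follows. The view name methods_ct_mv is hypothetical, and CREATE MATERIALIZED VIEW requires PostgreSQL 9.3 or later:

```sql
-- Hypothetical view name; stores the aggregated result once.
CREATE MATERIALIZED VIEW methods_ct_mv AS
SELECT m.hash, m.string, count(m.method) AS method_ct
FROM methods m
LEFT JOIN nostring n USING (hash)
WHERE n.hash IS NULL
GROUP BY m.hash, m.string;

-- Readers query the stored result and sort as needed.
SELECT * FROM methods_ct_mv ORDER BY method_ct DESC;

-- If the underlying tables change after all, rebuild the stored result.
REFRESH MATERIALIZED VIEW methods_ct_mv;
```

The stored result is not updated automatically, which is why this fits best when the table is read-only or changes rarely.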
Exotic fine point:
Your table is named nostring, yet seems to contain hashes. By excluding hashes instead of strings, there is a chance that you exclude more strings than intended. Extremely unlikely, but possible.
Best Answer
For your request:
I assume (deriving from your sample data) you want to ORDER BY:
1.) The first letter, treated as text.
2.) The first number (consecutive digits), treated as integer.
3.) The whole string to break remaining ties, treated as text. May or may not be needed.
SQL Fiddle.
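The three sort keys listed above can be sketched in PostgreSQL as follows. The table name alnum and column name data are hypothetical, and this is one possible way to express the keys, not necessarily the exact query from the original answer:

```sql
-- Hypothetical table and column names (alnum, data), sample rows from the question.
CREATE TEMP TABLE alnum (data text);
INSERT INTO alnum (data) VALUES
  ('A5'), ('A4'), ('Z1'), ('B2'), ('C7'), ('C1A'), ('C11A'), ('B1'), ('B4');

SELECT data
FROM alnum
ORDER BY left(data, 1)                         -- 1. first letter, as text
       , (substring(data FROM '[0-9]+'))::int  -- 2. first run of digits, as integer
       , data;                                 -- 3. whole string breaks remaining ties
```

Because the digits are compared as integers, C7 sorts before C11A (7 < 11). Rows without any digits yield NULL for the second key and sort last by default. The cast to int raises an error for digit runs too large for an integer.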
More details: