I have the following table with ~2,000,000 rows:
CREATE TABLE TableA (idA integer PRIMARY KEY, idB TEXT UNIQUE)
From time to time, I need to process every row; the order does not matter.
So I process 200 rows at a time, wrapped in a transaction:
SELECT idA, idB FROM TableA ORDER BY rowid ASC LIMIT 200
But when I checked the query with EXPLAIN QUERY PLAN,
I noticed that it was doing a table scan:
0 0 0 SCAN TABLE TableA USING INTEGER PRIMARY KEY (~1000000 rows)
I am especially curious about the ~1000000 rows
It does not matter whether I select 1 row or 1000 rows; I still get a table scan with ~1 million rows.
Is this not the most efficient way of querying the table for 'any' data?
As I said, the order does not matter, the data has to be processed.
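For context, the batch processing described above can be sketched like this in Python's sqlite3 module. This is only an illustration of the setup, not the actual code; the table is shrunk to 500 rows, the per-row work is a placeholder, and `process_batches` is a hypothetical name. It resumes each batch from the last seen idA (the rowid) rather than using OFFSET, so a batch never re-reads earlier rows:

```python
import sqlite3

def process_batches(conn, batch_size=200):
    """Process every row of TableA in batches, ordered by rowid (idA).

    Resumes from the last rowid seen instead of using OFFSET, so each
    batch starts with an index seek rather than re-scanning from row 1."""
    last = 0
    processed = 0
    while True:
        rows = conn.execute(
            "SELECT idA, idB FROM TableA WHERE idA > ? "
            "ORDER BY idA LIMIT ?",
            (last, batch_size),
        ).fetchall()
        if not rows:
            break
        with conn:  # one transaction per batch
            for idA, idB in rows:
                processed += 1  # placeholder for the real per-row work
        last = rows[-1][0]
    return processed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (idA integer PRIMARY KEY, idB TEXT UNIQUE)")
conn.executemany("INSERT INTO TableA VALUES (?, ?)",
                 [(i, f"b{i}") for i in range(1, 501)])
print(process_batches(conn))  # 500
```

Because idA is declared INTEGER PRIMARY KEY, it is an alias for the rowid, so `WHERE idA > ?` seeks directly into the table's B-tree.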
Edit: If I run the query with no ORDER BY:
SELECT * FROM TableA LIMIT 200
I get a fairly similar result:
SCAN TABLE TableA USING COVERING INDEX sqlite_autoindex_TableA_1 (~1000000 rows)
Best Answer
EXPLAIN QUERY PLAN shows only an estimate, and the LIMIT clause is ignored when doing this estimation. (And this number is so misleading that recent versions of SQLite do not show it.)
A plain

SELECT * FROM TableA LIMIT x

is the fastest way to get x rows from a table. SQLite computes result rows on demand, so it will not read the entire table but will stop scanning after x rows.

Table scans are likely to be inefficient when you are searching for a single row or a few rows, or when you have a filter that greatly reduces the number of rows to be returned. But when the query is going to return all rows anyway, it is not possible to reduce the number of rows that actually need to be read from disk, so the best way to reduce disk I/O is to avoid touching anything but the table itself. (In this case, a covering index can replace the table.)
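You can confirm that the estimate ignores LIMIT with a quick sketch using Python's sqlite3 module (the exact plan text varies between SQLite versions, but the plan is the same with and without LIMIT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (idA integer PRIMARY KEY, idB TEXT UNIQUE)")

def plan(sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN for sql."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Identical plans: the planner's estimate simply ignores the LIMIT clause.
print(plan("SELECT idA, idB FROM TableA"))
print(plan("SELECT idA, idB FROM TableA LIMIT 200"))
```

Both calls print the same SCAN step; only at execution time does LIMIT stop the scan after 200 rows.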