SQLite: strange index behavior with a GROUP BY operation

sqlite

CREATE TABLE [newtable](
  [id], 
  [HLevel], 
  [LeftBower], 
  [RightBower], 
  [NodeNumber], 
  [NodeCount], 
  [SortPath], 
  [ID:1] INT, 
  [parent] INT, 
  [name] TEXT, 
  [folder] NUM, 
  [size] INT);

CREATE TABLE [tally]([N]);

This is the index I used:

explain query plan select pid ,newtable.size from (
select id as pid, leftbower as lft, rightbower as rgt from newtable where parent =1 and folder = 1
)
as base join newtable on newtable.id> base.lft and newtable.id<base.rgt

output:
id  parent  notused detail
5   0   0   SEARCH TABLE newtable USING INDEX vdsew (folder=?)
12  0   0   SEARCH TABLE newtable USING INDEX dd (id>? AND id<?)

but with group by:

explain query plan select pid ,newtable.size from (
select id as pid, leftbower as lft, rightbower as rgt from newtable where parent =1 and folder = 1
)
as base join newtable on newtable.id> base.lft and newtable.id<base.rgt group by pid;

output:
id  parent  notused detail
8   0   0   SCAN TABLE newtable
10  0   0   SEARCH TABLE newtable USING INDEX vdsew (folder=?)
23  0   0   USE TEMP B-TREE FOR GROUP BY

As a ridiculous consequence, to get the grouped result I create a temp table to store the join result and then select from that temp table with GROUP BY. These two steps together take only about 200ms, while the combined query takes much, much longer, depending on which index is used. Take this case, for instance (with the indices I posted here).

This query:

create table temp as select pid ,newtable.size from (
select id as pid, leftbower as lft, rightbower as rgt from newtable where parent =1 and folder = 1
)

as base join newtable on newtable.id> base.lft and newtable.id<base.rgt ;
select pid, sum(size) from temp group by pid;

takes only 250ms to get the result, while

select pid ,sum(size) from (
select id as pid, leftbower as lft, rightbower as rgt from newtable where parent =1 and folder = 1
)

as base join newtable on newtable.id> base.lft and newtable.id<base.rgt 
group by pid

takes a very long time, depending on which index is used, and produces the same output. In this case, it freezes forever. (That is because it uses the folder index. After I delete the folder index, the query uses the parent index and takes about 14 seconds.) Anyway, that is another question. Let's first focus on why index usage becomes so strange once GROUP BY is added.
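For anyone who wants to reproduce the comparison, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows, the reduced column list, and the temp table name `joined` are assumptions made for illustration; they are not the real data. It only checks that the two-step and the combined query return the same sums, not how long each takes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced version of the schema: only the columns the queries touch.
conn.execute(
    "CREATE TABLE newtable(id, LeftBower, RightBower, parent INT, folder NUM, size INT)"
)

# Made-up sample data: two folders under parent 1, each containing a few files.
rows = [
    # id, LeftBower, RightBower, parent, folder, size
    (1, 1, 10, 0, 1, 0),   # root
    (2, 2, 6, 1, 1, 0),    # folder A, covers ids 3..5
    (3, 3, 3, 2, 0, 10),
    (4, 4, 4, 2, 0, 20),
    (5, 5, 5, 2, 0, 30),
    (6, 6, 9, 1, 1, 0),    # folder B, covers ids 7..8
    (7, 7, 7, 6, 0, 5),
    (8, 8, 8, 6, 0, 7),
]
conn.executemany("INSERT INTO newtable VALUES (?, ?, ?, ?, ?, ?)", rows)

# Shared FROM clause, same shape as in the queries above.
from_clause = """
FROM (
  SELECT id AS pid, LeftBower AS lft, RightBower AS rgt
  FROM newtable WHERE parent = 1 AND folder = 1
) AS base
JOIN newtable ON newtable.id > base.lft AND newtable.id < base.rgt
"""

# Two-step variant: materialize the join, then aggregate.
conn.execute("CREATE TEMP TABLE joined AS SELECT pid, newtable.size AS size " + from_clause)
two_step = conn.execute(
    "SELECT pid, sum(size) FROM joined GROUP BY pid ORDER BY pid"
).fetchall()

# Combined variant: join and aggregate in one statement.
one_step = conn.execute(
    "SELECT pid, sum(newtable.size) " + from_clause + " GROUP BY pid ORDER BY pid"
).fetchall()

print(two_step)
print(one_step)
```

Wrapping each variant in a timer and prefixing the statements with EXPLAIN QUERY PLAN on the real data would show where the two diverge.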

Best Answer

It's probably because your index doesn't include the size field, so it isn't a covering index for this query. When you aggregate over size with a GROUP BY, the query optimizer's cost estimate then favors a TEMP B-TREE. See the Temporary Sorting B-Trees section of the "EXPLAIN QUERY PLAN" doc.

You can create an index on newtable that includes the size field as well and re-check the EXPLAIN. More information on a similar issue in this DBA.StackExchange answer.
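The suggestion above could be tried out roughly as follows. This is only a sketch using Python's sqlite3 module: the index names `idx_parent_folder` and `idx_id_size`, the reduced column list, and the sample rows are all assumptions, since the original index definitions were not shown. The plan SQLite prints depends on the SQLite version and on table statistics, so no particular plan is claimed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE newtable(id, LeftBower, RightBower, parent INT, folder NUM, size INT);
-- Hypothetical indexes; the names are made up for this sketch.
CREATE INDEX idx_parent_folder ON newtable(parent, folder);
-- Includes size, so the range join can be answered from the index alone.
CREATE INDEX idx_id_size ON newtable(id, size);
""")

# Made-up sample data: two folders under parent 1 with a few files each.
conn.executemany(
    "INSERT INTO newtable VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 10, 0, 1, 0),
        (2, 2, 6, 1, 1, 0),
        (3, 3, 3, 2, 0, 10),
        (4, 4, 4, 2, 0, 20),
        (5, 5, 5, 2, 0, 30),
        (6, 6, 9, 1, 1, 0),
        (7, 7, 7, 6, 0, 5),
        (8, 8, 8, 6, 0, 7),
    ],
)
conn.execute("ANALYZE")  # refresh planner statistics

query = """
SELECT pid, sum(newtable.size) FROM (
  SELECT id AS pid, LeftBower AS lft, RightBower AS rgt
  FROM newtable WHERE parent = 1 AND folder = 1
) AS base
JOIN newtable ON newtable.id > base.lft AND newtable.id < base.rgt
GROUP BY pid ORDER BY pid
"""

# Each plan row is (id, parent, notused, detail); only detail is printed.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for row in plan:
    print(row[3])

result = conn.execute(query).fetchall()
print(result)
```

Comparing the printed plan with and without `idx_id_size` on the real table should show whether the covering index changes the planner's choice.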

Related Solutions

SQLite – When to Use Index Types Other Than Integer (e.g., String)

There are a lot of places internally where it is possible to reference data using a user id, which is better than a string. For example, if you had 4 tables referencing the user table, they'd each have a string column which is less efficient than 4 integer indexed columns. This is the reason why most schemas use this layout; you're more likely to want to reference a user than to refer to the user by name.

However, you can also set username to be an index, which you might want to do if you are looking it up a lot.

To answer your edited question, a 24 character text field will require roughly 24 bytes. Depending on whether it's a varchar or a char it will be stored differently, but the full amount will be needed for searching, so let's consider them both to be 24.

An integer field will require 4 bytes.

You can see how this will add up over time. For comparisons, assume a linear-time comparison, meaning 24 operations to compare two string fields but only 4 operations to compare two integer fields. In practice, they will probably take roughly the same amount of time, though the string comparison cannot be faster.

If you want to search by username, create a table with the schema of username | user_id. That'll let you search by username and join with the other data you are looking for while still being efficient.
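A minimal sketch of that layout, using Python's sqlite3 module. The table and column names (`users`, `posts`, `title`) and the sample rows are hypothetical and only serve to illustrate the idea: the string is indexed once in the lookup table, and every other table references the integer id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema for illustration only.
-- UNIQUE makes SQLite create an index on username automatically.
CREATE TABLE users(
  user_id  INTEGER PRIMARY KEY,
  username TEXT NOT NULL UNIQUE
);
-- Other tables reference the integer id, not the string.
CREATE TABLE posts(
  post_id INTEGER PRIMARY KEY,
  user_id INTEGER NOT NULL REFERENCES users(user_id),
  title   TEXT
);
CREATE INDEX idx_posts_user ON posts(user_id);
""")

conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(1, 1, "first"), (2, 2, "other"), (3, 1, "second")],
)

# Look the user up by name once, then join on the integer key.
titles = [
    row[0]
    for row in conn.execute(
        """
        SELECT posts.title
        FROM users JOIN posts ON posts.user_id = users.user_id
        WHERE users.username = ?
        ORDER BY posts.post_id
        """,
        ("alice",),
    )
]
print(titles)
```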

SQLite – How to Select All Rows with a Minimum Value

As you have seen, a simple GROUP BY will not work because it would return only one record per group.

Your join works fine. For a large table, it will be efficient only if there is an index on the join columns (num and text).

Alternatively, you could use a correlated subquery:

SELECT *
FROM t
WHERE num = (SELECT MIN(num)
             FROM t AS t2
             WHERE t2.text = t.text);

SQLFiddle

When being executed, this query does not require a temporary table (your query does for the result of u), but will execute the subquery for each record in t, so text should be indexed. (Or use an index on both text and num to get a covering index.)
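To try the correlated subquery together with the two-column index mentioned above, something like the following sketch could be used. The index name `idx_t_text_num` and the sample rows are assumptions; the table and column names follow the query shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t(text TEXT, num INT);
-- Hypothetical index name; covers both columns used by the subquery.
CREATE INDEX idx_t_text_num ON t(text, num);
""")

# Made-up rows; group 'a' has two rows tied at the minimum.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1), ("a", 1), ("a", 3), ("b", 2), ("b", 5)],
)

min_rows = conn.execute(
    """
    SELECT *
    FROM t
    WHERE num = (SELECT MIN(num)
                 FROM t AS t2
                 WHERE t2.text = t.text)
    ORDER BY text, num
    """
).fetchall()
print(min_rows)
```

Unlike a plain GROUP BY, this keeps every row that ties for the minimum within its group.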
