How to Check if Specific Number of Rows Exist Without COUNT(*) in PostgreSQL

postgresqlpostgresql-performance

I know we can check if at least one row of a table exists using EXISTS keyword, for example:

select * from A where exists(select from B where B.id = A.id)

What if I want to check if at least N rows exist, without using COUNT?

Performance matters to me. I don't want the whole table to be scanned to count matching rows. I just need my query to stop when N rows are found. I mean, the existence of at least N rows is what matters, and not the total count. That could save time if table is large.

Is there such feature in PostgreSQL?

Best Answer

You can add this in the EXISTS subquery:

LIMIT 1 OFFSET (N-1)

In other words, the subqery first tries to finds (N-1) rows, skip them and return the next (Nth) row. So, it will return something (one row) if and only if there are N+ rows.

You need appropriate indexes, i.e. an index on B(id) in this case, to make it as fast as possible, obviously.

Related Solutions

Postgresql – the purpose of the keyword “ID” in PostgreSQL

In PostgreSQL, inserting a column named

"ID"

is one thing. Inserting a column named

ID

is another.

create table test (
  "ID" integer not null
);

insert into test values (1);

select *
from test
where ID = 1;

ERROR: column "id" does not exist

Quoting an identifier also makes it case-sensitive, whereas unquoted names are always folded to lower case.

Postgresql – Searching a big table for a few results without scanning the whole table

but using the "limit" keyword cant prevent the search over the whole set of data(is it correct?)

Correct; it'll just limit how many results are returned.

For some kinds of query it can also limit how many are scanned in the first place, but you can't rely on that in the general case.

does PostgreSQL database have to create this 1M search result rows temporarily in memory and then give me my needed 100 results?

Create, yes. In memory, not necessarily. It'll often spill to disk, or discard unneeded results as it goes so it only keeps the top 100. It depends on the details of the query.

or is there any way to tell PostgreSQL to stop searching as soon as it find 100 records?

Write a query where that is possible. You haven't shown the query, or schema, so this part is impossible to usefully answer.

In general, if you have an ORDER BY on some kind of search relevance field then the whole data set must be searched then filtered for the top-n results.

In short: "It depends". Specifically, it depends on the schema, available indexes, and on exactly what you're searching for and what results you want.

This:

SELECT id 
FROM mytable 
WHERE somefield > 100 
ORDER BY somefield DESC
LIMIT 100;

would only scan the needed rows if there was an index on mytable(somefield DESC).

This:

SELECT id, title
FROM mytable
WHERE title LIKE '%something%'
ORDER BY calculate_relevance(title, 'something')
LIMIT 100;

would always scan the whole table, both because an infix text pattern match (LIKE '%blah%') can't use a b-tree index, and because you can't create an expression index on a function taking a literal parameter like the case of the imaginary calculate_relevance function above.

So... it depends totally on what you're searching for and how.

Is this answer too hand-wavey and general for you? So's the question. Supply specifics, and you'll get more specific answers. A http://sqlfiddle.com/ of the schema with sample data, plus your PostgreSQL version, a table of the expected results, and a description of what you want to find is usually the minimum for this kind of thing.

Update: A few problems here.

There's no index on tag1 and/or tag2 so Pg must do a seqscan of the table, searching it until it finds 100 results. This is extremely inefficient.
There's no ORDER BY, so PostgreSQL can return whatever results it feels like for a LIMIT and return them in any order.
Your schema shows signs of being unnecessarily denormalized. tag1, tag2, etc usually suggest bad table design. I strongly recommend that you study relational database design and normalization.

In this specific case, you'd be able to avoid the full table scan if you CREATE INDEX con_tag1_tag2_idx ON con(tag1, tag2). But that'll only work for searches that use tag1 and optionally also tag2. It won't help (much) with searches for just tag2, searches for ref1, etc. (Thanks Erwin for the important note on multicolumn index use with the first column not included, it was a surprise to me).

Don't create lots of indexes. Each one uses disk space and takes disk I/O to insert, update and delete.

The answer is often proper normalization.

Overall I think you need to buy/borrow a couple of good introductory books on relational database design and basic to intermediate SQL, preferably PostgreSQL-focused, and do some study.

Best Answer

Related Solutions

Postgresql – the purpose of the keyword “ID” in PostgreSQL

Postgresql – Searching a big table for a few results without scanning the whole table

Related Question