If you don't need the "optimal fill", you could use a window function and something like this:
select *
from (
   select id, name, price, category,
          sum(price) over (order by price, date_created) as so_far
   from my_table
   where price > 40
     and category = 'buy'
) t
where so_far <= 12;
If you are mixing "groups" of orders, you might need a partition by clause as well, as sketched below. Change the order by inside the window function to meet your ordering criteria.
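A minimal sketch of the partitioned variant; group_id is a hypothetical grouping column, not from the original question:

select *
from (
   select id, name, price, category, group_id,
          sum(price) over (partition by group_id  -- group_id is hypothetical
                           order by price, date_created) as so_far
   from my_table
   where price > 40
     and category = 'buy'
) t
where so_far <= 12;

With partition by, the running sum starts over for each group, so the so_far filter is applied per group rather than across all rows.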
First, the displayed query plan says:
Limit (cost=0.43..44.98 rows=17 width=553) ...
Obviously you added a LIMIT clause to the query, which is not in the query you show. If you didn't do the same in your second test, you may already have your answer: the same query with LIMIT (and without ORDER BY) can stop as soon as enough rows are found and can be much faster. It can also lead to a different query plan to begin with.
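A sketch of the shape in question; table, filter, and limit are placeholders. Without an ORDER BY, the executor can return the first 17 matching rows it finds and stop scanning:

SELECT *
FROM   my_table
WHERE  category = 'buy'
LIMIT  17;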
In principle, EXPLAIN ANALYZE measures the time spent by the Postgres process. Measurement is taken on the server. It adds a bit of overhead as compared to just executing the SQL command alone. For very cheap commands, that overhead may be more than the query cost. Overhead also depends on the used options to some degree. You can minimize overhead with EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF). (The TIMING option of EXPLAIN is in no way related to psql's \timing; it is used to toggle additional measurements per plan node.)
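For example, to get actual row counts with minimal measurement overhead (the query itself is just a placeholder):

EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT * FROM my_table WHERE category = 'buy';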
The manual:

The ANALYZE option causes the statement to be actually executed, not only planned. Then actual run time statistics are added to the display, including the total elapsed time expended within each plan node (in milliseconds) and the total number of rows it actually returned.
About the TIMING option:

Include actual startup time and time spent in each node in the output. The overhead of repeatedly reading the system clock can slow down the query significantly on some systems, so it may be useful to set this parameter to FALSE when only actual row counts, and not exact times, are needed. Run time of the entire statement is always measured, even when node-level timing is turned off with this option. This parameter may only be used when ANALYZE is also enabled. It defaults to TRUE.
\timing in psql, on the other hand, measures the time between sending the command and receiving results, including network latency. Measurement is taken on the client. That can be many times the execution time alone. This seems to be your case.

Due to the overhead mentioned above, the opposite might be observed in cases with low latency and minimal data volume sent over the wire (especially with psql executed on the same server): \timing can report faster times.
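For reference, client-side timing in psql is toggled like this (the query is a placeholder):

\timing on
SELECT count(*) FROM my_table;
\timing off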
There is a third good option: use the shell command time (also measuring on the client, obviously). See:
Asides:
When I actually execute it (with \timing) ...

You are aware that EXPLAIN ANALYZE actually executes the given query, with all side effects, as opposed to plain EXPLAIN.
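If side effects are unwanted while profiling a data-modifying statement, one common pattern is to wrap the call in a transaction that is rolled back (the UPDATE is a placeholder):

BEGIN;
EXPLAIN ANALYZE
UPDATE my_table SET price = price * 1.1 WHERE category = 'buy';  -- actually runs
ROLLBACK;  -- discard the changes, keep the plan output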
I tried to programmatically loop over the ids and run a separate query for each and it produces better results (around 1.5s)

A loop can produce query plans customized to each actual input value, based on statistics for the value, while a large IN list or a large input array will produce a plan based on generic estimates. See:
You can achieve similar effects as with your loop with a LATERAL join from a (derived) table, with many SELECT statements tied together with UNION ALL, or with a loop in a (plpgsql) server-side function, without adding many times the network latency and/or planning overhead. Related:
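A minimal sketch of the LATERAL variant, assuming an integer id column; the array of target ids is a placeholder:

SELECT t.*
FROM   unnest('{1,2,3}'::int[]) AS ids(id)   -- hypothetical list of ids
CROSS  JOIN LATERAL (
   SELECT *
   FROM   my_table m
   WHERE  m.id = ids.id
   ) t;

Each row of the derived table drives its own index lookup, similar to one iteration of your loop, but in a single round trip with a single plan.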
Best Answer
ts_rank_cd yields a real. If you cast this to double precision, the result becomes:

That explains what you observe, because the real is cast to double precision if you compare it to a double precision number.

The solution is to explicitly cast 0.1 to real, so that a real comparison is used:
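A sketch of that fix; table, column, and query text are hypothetical placeholders:

SELECT *
FROM   docs                                             -- hypothetical table
WHERE  ts_rank_cd(tsv, to_tsquery('english', 'foo')) > real '0.1';

The literal syntax real '0.1' is equivalent to 0.1::real; either spelling keeps the comparison in real, so the ts_rank_cd result is not widened to double precision first.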