Postgresql – What does ‘fully determines’ mean in the context of postgres functional dependency statistics

postgresql

I am following the example of how to use 'create statistics' in postgresql from this page.

CREATE TABLE tbl (
                     col1 int,
                     col2 int
);

INSERT INTO tbl SELECT i/10000, i/100000
FROM generate_series (1,10000000) s(i);

CREATE STATISTICS s1 (dependencies) on col1, col2 from tbl;
ANALYZE tbl;

SELECT stxname, stxkeys, stxdependencies
FROM pg_statistic_ext
WHERE stxname = 's1';

returns:

stxname,stxkeys,stxdependencies
s1,1 2,{"1 => 2": 1.000000}

However, I'm confused by this line

Looking at this, we can see that Postgres realizes that col1 fully
determines col2 and therefore has a coefficient of 1 to capture that
information.

In the example there are many records in col2 that are repeated. So how does col1 'fully determine' col2? What does this mean?

Best Answer

It means that if col1 is identical for two rows, then col2 will also be identical.

In other words, if the functional dependency were perfect, you could remove col2 from the table and use a lookup table that maps col1 to col2. But don't forget that these are only statistical data from a sample of the rows, so even if you see a dependency of 1 in the extended statistics, that is no proof that it is really always so (only very often).

Related Solutions

Postgresql – Convert units of measurement

This looks ugly:

  with uu(unit, coefficient, u_ord) as (
    select
     unit, 
     coefficient,
     case 
      when log(u.coefficient) < 0 
      then floor (log(u.coefficient)) 
      else ceil(log(u.coefficient)) 
     end u_ord
    from
     unit_conversion u 
  ),
  norm (label, norm_qty) as (
   select
    s.label,
    sum( uc.coefficient * s.quantity ) AS norm_qty
  from
    unit_conversion uc,
    substance s
  where
    uc.unit = s.unit
  group by
    s.label
  ),
  norm_ord (label, norm_qty, log, ord) as (
   select 
    label,
    norm_qty, 
    log(t.norm_qty) as log,
    case 
     when log(t.norm_qty) < 0 
     then floor(log(t.norm_qty)) 
     else ceil(log(t.norm_qty)) 
    end ord
   from norm t
  )
  select
   norm_ord.label,
   norm_ord.norm_qty,
   norm_ord.norm_qty / uu.coefficient val,
   uu.unit
  from 
   norm_ord,
   uu where uu.u_ord = 
     (select max(uu.u_ord) 
      from uu 
      where mod(norm_ord.norm_qty , uu.coefficient) = 0);

but seems to do the trick:

|   LABEL | NORM_QTY | VAL |       UNIT |
-----------------------------------------
| mercury |   1.5e-8 |  15 | microlitre |
|   water |  0.00112 | 112 | centilitre |

You don't really need the parent-child relationship in the unit_conversion table, because the units in the same family are naturally related to each other by the order of coefficient, as long as you have the family identified.

PostgreSQL – Understanding [FROM x, y] Syntax

It creates an implicit CROSS JOIN. It's the SQL-89 syntax.

Here I use values(1) and values(2) to create pseduo-tables (value tables) merely for examples. The thing after them t(x), and g(y) are called FROM-Aliases the character inside the parenthesis is the alias for the column (x and y respectively). You could just as easily create a table to test this.

SELECT *
FROM (values(1)) AS t(x), (values(2)) AS g(y)

Here is how you'd write it now.

SELECT *
FROM (values(1)) AS t(x)
CROSS JOIN (values(2)) AS g(y);

From there you can make this an implicit INNER JOIN by adding a conditional.

SELECT *
FROM (values(1)) AS t(x)
CROSS JOIN (values(1)) AS g(z)
WHERE x = z;

Or the explicit and newer INNER JOIN syntax,

SELECT *
FROM (values(1)) AS t(x)
INNER JOIN (values(1)) AS g(z)
  ON ( x = z );

So in your example..

FROM apod, to_tsquery('neutrino|(dark & matter)') query

This is essentially the same as the newer syntax,

FROM apod
CROSS JOIN to_tsquery('neutrino|(dark & matter)') AS query

which is actually the same, in this case, because to_tsquery() returns a row and not a set as,

SELECT title, ts_rank_cd(
  textsearch,
  to_tsquery('neutrino|(dark & matter)')
) AS rank
FROM apod
WHERE to_tsquery('neutrino|(dark & matter)') @@ textsearch
ORDER BY rank DESC
LIMIT 10;

However, the above could potentially cause to_tsquery('neutrino|(dark & matter)') to occur twice, but in this case it doesn't -- to_tsquery is marked as STABLE (verified with \dfS+ to_tsquery).

STABLE indicates that the function cannot modify the database, and that within a single table scan it will consistently return the same result for the same argument values, but that its result could change across SQL statements. This is the appropriate selection for functions whose results depend on database lookups, parameter variables (such as the current time zone), etc. (It is inappropriate for AFTER triggers that wish to query rows modified by the current command.) Also note that the current_timestamp family of functions qualify as stable, since their values do not change within a transaction.

For a more complete comparison of the differences between SQL-89, and SQL-92, see also my answer here

Best Answer

Related Solutions

Postgresql – Convert units of measurement

PostgreSQL – Understanding [FROM x, y] Syntax

Related Question