sql - Counting distinct rows using recursive cte over non-distinct index -

given following schema:

create table identifiers (   id text primary key );  create table days (   day date primary key );  create table data (   id text references identifiers   , day date references days   , values numeric[]  );  create index on data (id, day);

what best way count distinct days between 2 timestamps? i've tried following 2 methods:

explain analyze select count(distinct day)  data  day between '2010-01-01' , '2011-01-01';                                                                         query plan                                                                         ----------------------------------------------------------------------------------------------------------------------------------------------------------  aggregate  (cost=200331.32..200331.33 rows=1 width=4) (actual time=1647.574..1647.575 rows=1 loops=1)    ->  index scan using data_day_sid_idx on data  (cost=0.56..196942.12 rows=1355678 width=4) (actual time=0.348..1180.566 rows=1362532 loops=1)          index cond: ((day >= '2010-01-01'::date) , (day <= '2011-01-01'::date))          heap fetches: 0  total runtime: 1647.865 ms (5 rows)  explain analyze select count(distinct day)  days day between '2010-01-01' , '2011-01-01';                                                            query plan                                                            --------------------------------------------------------------------------------------------------------------------------------  aggregate  (cost=18.95..18.96 rows=1 width=4) (actual time=0.481..0.481 rows=1 loops=1)    ->  index scan using days_pkey on days  (cost=0.28..18.32 rows=252 width=4) (actual time=0.093..0.275 rows=252 loops=1)          index cond: ((day >= '2010-01-01'::date) , (day <= '2011-01-01'::date))          heap fetches: 252  total runtime: 0.582 ms (5 rows)

the count(distinct day) against days runs well, requires me keep secondary table (days) keep performance reasonable. in general sense, i'd test if recursive cte allow me achieve similar performance without maintaining secondary table. query looks this, doesn't run yet:

explain analyze recursive cte (    (select day data order 1 limit 1)    union    (  -- parentheses required    select d.day      cte  c    join   data d on d.day > c.day    order  1 limit 1    ) ) select day  cte day between '2010-01-01' , '2011-01-01';

updates

thanks ideas. looks maintaining trigger-based table of distinct days best way go, both storage , performance-wise. @erwin's update, recursive cte in running. useful.

with recursive cte (    (  -- parentheses required because of limit    select day      data     day >= '2010-01-01'::date  -- exclude irrelevant rows    order  1    limit  1    )     union    select (select day data             day > c.day            ,    day < '2011-01-01'::date  -- see comments below            order  1            limit  1)      cte c     day not null  -- necessary because corr. subq. returns row    ) select count(*) ct   cte  day not null;                                                                               query plan --------------------------------------------------------------------------------------------------------------------------------------------------------------------  aggregate  (cost=53.35..53.36 rows=1 width=0) (actual time=18.217..18.217 rows=1 loops=1)    cte cte      ->  recursive union  (cost=0.43..51.08 rows=101 width=4) (actual time=0.194..17.594 rows=253 loops=1)            ->  limit  (cost=0.43..0.46 rows=1 width=4) (actual time=0.191..0.192 rows=1 loops=1)                  ->  index scan using data_day_idx on data data_1  (cost=0.43..235042.00 rows=8255861 width=4) (actual time=0.189..0.189 rows=1 loops=1)                        index cond: (day >= '2010-01-01'::date)                        heap fetches: 0            ->  worktable scan on cte c  (cost=0.00..4.86 rows=10 width=4) (actual time=0.066..0.066 rows=1 loops=253)                  filter: (day not null)                  rows removed filter: 0                  subplan 1                    ->  limit  (cost=0.43..0.47 rows=1 width=4) (actual time=0.062..0.063 rows=1 loops=252)                          ->  index scan using data_day_idx on data  (cost=0.43..1625.59 rows=52458 width=4) (actual time=0.060..0.060 rows=1 loops=252)                                index cond: ((day > c.day) , (day < '2011-01-01'::date))                                heap fetches: 0    ->  cte scan on cte  (cost=0.00..2.02 rows=100 width=0) (actual time=0.199..18.066 rows=252 loops=1)          filter: (day not null)          rows removed filter: 1  total runtime: 19.355 ms (19 rows)

and discussed exists query

explain analyze select count(*) ct   generate_series('2010-01-01'::date, '2010-12-31'::date, '1d'::interval) d(day)  exists (select 1 data day = d.day::date);                                                                    query plan -----------------------------------------------------------------------------------------------------------------------------------------------  aggregate  (cost=674.32..674.33 rows=1 width=0) (actual time=95.049..95.049 rows=1 loops=1)    ->  nested loop semi join  (cost=0.45..673.07 rows=500 width=0) (actual time=12.438..94.749 rows=252 loops=1)          ->  function scan on generate_series d  (cost=0.01..10.01 rows=1000 width=8) (actual time=9.248..9.669 rows=365 loops=1)          ->  index scan using data_day_idx on data  (cost=0.44..189.62 rows=6023 width=4) (actual time=0.227..0.227 rows=1 loops=365)                index cond: (day = (d.day)::date)                heap fetches: 0  total runtime: 95.620 ms (7 rows)

several notes:

simple query on table `day`

select count(distinct day)    days  day between '2010-01-01' , '2011-01-01';

day defined pk, distinct waste in case.

recursive cte correlated suquery

this alternative if there no day table unique entries. technique pays if there multiple many rows per day, equivalent of loose index scan faster simple distinct on base table:

with recursive cte (    (  -- parentheses required because of limit    select day      data    where  day >= '2010-01-01'::date  -- exclude irrelevant rows    order  1    limit  1    )     union    select (select day data             day > c.day            and    day < '2011-01-01'::date  -- see comments below            order  1            limit  1)      cte c     day not null  -- necessary because corr. subq. returns row    ) select count(*) ct   cte  day not null;

index

only makes sense in combination matching index on data:

create index data_day_idx on data (day);

day must leading column. index have in question on (id, day) can used too, far less efficient:

notes

it cheaper exclude irrelevant rows early. integrated predicate query.
detailed explanation:
- optimize group query retrieve latest record per user
the case @ hand simpler - simplest possible actually.
your original time frame day between '2010-01-01' , '2011-01-01' - not intended. between .. , .. includes upper , lower bound, way of 2010 plus 2011-01-01. seems more want exclude last day. using d.day < '2011-01-01' (not <=).

`exists` special case

since testing range of enumerable days (as opposed range infinite number of possible values), can test alternative exists semi-join:

select count(*) ct   generate_series('2010-01-01'::date, '2010-12-31'::date, '1d'::interval) d(day)  exists (select 1 data day = d.day::date);

again, same simple index essential.

sql fiddle demonstrating both queries big test table of 160k rows.

Search This Blog

Print F