High Cardinality Time Series Database Design

database-designtime-series-database

I have data that is being collected every 30 minutes from a website. This data looks like the following. The "epoch" row labels are placeholders for actual epoch timestamps of each sample. The columns of names along the top would expand indefinitely as more unique names are found.

table n (multiple tables of data like this)
-------------------------------------------
        James   Tom   Jeff   Leon   Sarah  [...]
epoch1  1       44    22     NULL   3
epoch2  4       33    23     1      NULL
epoch3  5       32    22     2      NULL
...
epochn  x       x     x      x      x

What I'm doing is scraping the website for anything classified as a name by an NLP library, and recording frequencies for the top 40 names found. So I scrape the entire "page" of this website and find that there are 22 instances of the name "Jeff", then record it in the table at Jeff/Epoch1 with frequency 22.

I already have the name tagger working, but it's just a matter of finding something that could handle this sort of data and pull metrics like "10 most popular names in table 4 at epoch between 24 hours ago and now". I've looked at PostgreSQL but it has a conservative column limit. Ideally columns would be added for each new name found, but this doesn't seem feasible with the databases I've found so far.

Anyone know of any open-source software that could store this data, or best practices?

Best Answer

Just normalize it.

crawl-
crawl_id|crawltime
1       |2018-01-02 2:47 PM

terms-
crawl_id | term | count
1        | Joe  | 15
Related Question