Postgresql – Database scheme for web log analytics

database-designpostgresqlschema

I analyse web logs (nginx) and want store (organize) result in DB.
After analyse and aggregate my result looks like:

  • Date and Time (Truncate it by hours)
  • Country
  • City
  • URL (Page on the website)
  • Referrer (source website)
  • UTM_Medium
  • UTM_Campaign
  • c_sessions
  • c_views
  • avg_session_time
  • c_js_event_1
  • c_js_event_2
  • c_js_event_3

And so on, about 30 columns.

For that data I prepare "big table"

CREATE TABLE public.cca_data_hours
(
  datetime timestamp without time zone NOT NULL,
  id_geo_country integer NOT NULL,
  id_geo_city integer NOT NULL,
  id_web_page bigint NOT NULL,
  id_web_referrer bigint NOT NULL,
  id_utm_campaign bigint NOT NULL,
  id_utm_medium bigint NOT NULL,
  c_sessions integer,
  c_views integer,
  avg_sd integer,
  c_jse_1 integer,
  c_jse_2 integer,
  c_jse_3 integer,
  ...
  CONSTRAINT cca_data_hours_pri PRIMARY KEY (datetime, id_web_page, id_web_referrer, id_utm_campaign, id_utm_medium, id_geo_country, id_geo_city)
)

I expect 100 000 000 records.

My selections

  • I want see all data by countries, cities
  • I want see data per page
  • I want see data per source (referrer)
  • I want see data per page, source and utm campaign
  • Some other combinations

My questions

  • Is it normal use so many columns as primary key (index)? Primary key now contains 7 columns, it may up to 10-15 columns in the future.

  • I think this is a simple solution how to organize this data in one object-relational DB table. Do you know better solution, practice?

Best Answer

So many columns in a primary key sounds strange, at least. That means to me that the data is not really adhering to a sensible uniqueness definition - this is more often than not like this in the case of logs. For this reason, I wouldn't even try to find a PK here, just define indexes based on the queries used against the table. One could call this opportunistic indexing ;)

Note that in general, in a relational models PKs are very important. Here it is more the case the data is not really relational, only stored in a RDBMS.