I analyse web logs (nginx) and want store (organize) result in DB.
After analyse and aggregate my result looks like:
- Date and Time (Truncate it by hours)
- Country
- City
- URL (Page on the website)
- Referrer (source website)
- UTM_Medium
- UTM_Campaign
- c_sessions
- c_views
- avg_session_time
- c_js_event_1
- c_js_event_2
- c_js_event_3
And so on, about 30 columns.
For that data I prepare "big table"
CREATE TABLE public.cca_data_hours
(
datetime timestamp without time zone NOT NULL,
id_geo_country integer NOT NULL,
id_geo_city integer NOT NULL,
id_web_page bigint NOT NULL,
id_web_referrer bigint NOT NULL,
id_utm_campaign bigint NOT NULL,
id_utm_medium bigint NOT NULL,
c_sessions integer,
c_views integer,
avg_sd integer,
c_jse_1 integer,
c_jse_2 integer,
c_jse_3 integer,
...
CONSTRAINT cca_data_hours_pri PRIMARY KEY (datetime, id_web_page, id_web_referrer, id_utm_campaign, id_utm_medium, id_geo_country, id_geo_city)
)
I expect 100 000 000 records.
My selections
- I want see all data by countries, cities
- I want see data per page
- I want see data per source (referrer)
- I want see data per page, source and utm campaign
- Some other combinations
My questions
-
Is it normal use so many columns as primary key (index)? Primary key now contains 7 columns, it may up to 10-15 columns in the future.
-
I think this is a simple solution how to organize this data in one object-relational DB table. Do you know better solution, practice?
Best Answer
So many columns in a primary key sounds strange, at least. That means to me that the data is not really adhering to a sensible uniqueness definition - this is more often than not like this in the case of logs. For this reason, I wouldn't even try to find a PK here, just define indexes based on the queries used against the table. One could call this opportunistic indexing ;)
Note that in general, in a relational models PKs are very important. Here it is more the case the data is not really relational, only stored in a RDBMS.