I'm sorry for the vague question description, but I'm honestly a noob in DB design, as I've only really worked with small databases. This might change in the near future though. I have a spider which periodically crawls a website that sells merchandise. The objective of this spider is to:
1. Obtain essential info from each object being sold, such as name, product_code, url, description, etc.
2. Get access to the price history of said object (by periodically rerunning the crawler), with respective timestamp.
These two requirements lead to two possible (in my limited db knowledge) database designs:
- First Option:
  An INSERT only table called products, with no UNIQUE fields (except row number), containing all fields mentioned above. Every repeated run of the crawler will duplicate existing rows.
  On each product found by the crawler, one simple (and guaranteed) insert on the product table.
  When consulting price history, simply group by product_code.
- Second Option:
  Two INSERT only tables, one called products, containing all essential data (description, name, etc). This table also contains a UNIQUE field called product_code. The second table, called prices, contains only the fields product_code, timestamp and price. This table contains no UNIQUE fields, except for the row id itself.
  On each product found by the crawler, two inserts:
  - One simple and guaranteed insert on the prices table.
  - An insert attempt on the products table, which will only actually go through when a never before inserted product is found.
  When consulting price history, simply group by product_code on the prices table.
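To make the second option concrete, here is a minimal sketch, assuming SQLite through Python's standard sqlite3 module. The helper names record and price_history, the column name crawled_at, and storing the price as integer cents are illustrative assumptions, not part of the design described above. The "insert attempt" is written as INSERT OR IGNORE, which is SQLite's syntax; other databases spell this differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        product_code TEXT NOT NULL UNIQUE,
        name         TEXT,
        url          TEXT,
        description  TEXT
    );
    CREATE TABLE prices (
        id           INTEGER PRIMARY KEY,
        product_code TEXT NOT NULL,
        crawled_at   TEXT NOT NULL,      -- ISO 8601 timestamp (assumed format)
        price_cents  INTEGER NOT NULL    -- assumption: price kept in minor units
    );
""")


def record(conn, item, crawled_at):
    """Store one crawled item: always a price row, a product row only if new."""
    # INSERT OR IGNORE silently skips the row when product_code already exists.
    conn.execute(
        "INSERT OR IGNORE INTO products (product_code, name, url, description) "
        "VALUES (?, ?, ?, ?)",
        (item["product_code"], item["name"], item["url"], item["description"]),
    )
    conn.execute(
        "INSERT INTO prices (product_code, crawled_at, price_cents) VALUES (?, ?, ?)",
        (item["product_code"], crawled_at, item["price_cents"]),
    )


def price_history(conn, product_code):
    """Return (timestamp, price) pairs for one product, oldest first."""
    return conn.execute(
        "SELECT crawled_at, price_cents FROM prices "
        "WHERE product_code = ? ORDER BY crawled_at",
        (product_code,),
    ).fetchall()


# Two crawler runs that see the same (made-up) product at different prices.
item = {"product_code": "A-1", "name": "Mug", "url": "https://example.com/a-1",
        "description": "Blue mug", "price_cents": 999}
record(conn, item, "2024-01-01T00:00:00")
record(conn, {**item, "price_cents": 899}, "2024-01-02T00:00:00")
conn.commit()
```

After both runs, products holds a single row for A-1 while prices holds one row per run, which is the disk-usage difference between the two options.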
My opinion
I believe the second option will scale better in disk usage, as many fields will not be repeated needlessly (as in the first option).
The first option is probably faster though as it's simply one guaranteed INSERT. How much faster though? How bad is it, in the second option, to try and constantly fail to insert?
EDIT – Note when considering the above point that, in the second option, the products table will be much smaller than the prices table.
Please go ahead and suggest other designs if you think they may be better! I'm completely open to discussion.
Best Answer
Use both together.
A common design strategy in screen-scraping is to separate the task of fetching data from the task of processing data.
Have one thread focused on just fetching the data. Store that data in fairly raw form in a single "scrape" table. You can let these rows accumulate, as they may be useful in debugging problems. Occasionally purge old rows once they are no longer of interest.
Use another thread or another app to process those data samples. As you process each sample, parse the raw scraped data, normalize it so that the repeating product info (description, etc) goes in a parent "product" table and the changing data (price, date-time the sample was taken) goes in a child "price" table, and mark the "scrape" row as processed.
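The fetch/process split described above might be sketched as follows, assuming SQLite through Python's sqlite3 module. All column names, the functions fetch_step and process_step, and the idea that each raw sample is a JSON string are assumptions made for illustration; a real scraper would store and parse HTML instead.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE scrape (
        id         INTEGER PRIMARY KEY,
        fetched_at TEXT NOT NULL,
        raw        TEXT NOT NULL,               -- payload stored as fetched
        processed  INTEGER NOT NULL DEFAULT 0   -- 0 = pending, 1 = done
    );
    CREATE TABLE product (
        product_code TEXT PRIMARY KEY,
        name         TEXT,
        description  TEXT
    );
    CREATE TABLE price (
        id           INTEGER PRIMARY KEY,
        product_code TEXT NOT NULL REFERENCES product(product_code),
        sampled_at   TEXT NOT NULL,
        price_cents  INTEGER NOT NULL
    );
""")


def fetch_step(conn, fetched_at, raw):
    """Fetcher side: store the raw sample and do nothing else."""
    conn.execute("INSERT INTO scrape (fetched_at, raw) VALUES (?, ?)", (fetched_at, raw))
    conn.commit()


def process_step(conn):
    """Processor side: normalize pending samples; return how many were handled."""
    pending = conn.execute(
        "SELECT id, fetched_at, raw FROM scrape WHERE processed = 0 ORDER BY id"
    ).fetchall()
    for scrape_id, fetched_at, raw in pending:
        data = json.loads(raw)  # assumption: samples are JSON documents
        with conn:  # one transaction per sample: all three statements or none
            conn.execute(
                "INSERT OR IGNORE INTO product (product_code, name, description) "
                "VALUES (?, ?, ?)",
                (data["product_code"], data["name"], data["description"]),
            )
            conn.execute(
                "INSERT INTO price (product_code, sampled_at, price_cents) "
                "VALUES (?, ?, ?)",
                (data["product_code"], fetched_at, data["price_cents"]),
            )
            conn.execute("UPDATE scrape SET processed = 1 WHERE id = ?", (scrape_id,))
    return len(pending)


# Two made-up samples of the same product, then one processing pass.
sample = {"product_code": "A-1", "name": "Mug", "description": "Blue mug", "price_cents": 999}
fetch_step(conn, "2024-01-01T00:00:00", json.dumps(sample))
fetch_step(conn, "2024-01-02T00:00:00", json.dumps({**sample, "price_cents": 899}))
handled = process_step(conn)
```

Because the fetcher only ever appends to scrape, a parsing bug in the processor never loses data: fix the parser, reset the processed flag on the affected rows, and run the processing step again.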
Use the product code as the key field linking the two tables if you are certain that it is reliable: the code is truly unique and always will be. In my experience such conditions are never true: screw-ups happen, and such codes are recycled, mistakenly re-assigned, or eventually changed when companies are reorganized or merged. So I always use a surrogate key of type UUID, but you may decide to stick with natural keys. Reasonable people disagree on this.
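As a rough illustration of the surrogate-key variant, the sketch below assumes SQLite through Python's sqlite3 module and stores the UUID as text (Postgres has a native uuid type for this). The table layout and the helpers add_product and add_price are hypothetical. The product code becomes ordinary data, so a recycled code does not merge two products' price histories.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE product (
        id           TEXT PRIMARY KEY,   -- surrogate key: a UUID stored as text
        product_code TEXT NOT NULL,      -- the site's code, kept as plain data
        name         TEXT
    );
    CREATE TABLE price (
        id          INTEGER PRIMARY KEY,
        product_id  TEXT NOT NULL REFERENCES product(id),
        sampled_at  TEXT NOT NULL,
        price_cents INTEGER NOT NULL
    );
""")


def add_product(conn, product_code, name):
    """Insert a product under a freshly generated UUID and return that UUID."""
    product_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO product (id, product_code, name) VALUES (?, ?, ?)",
        (product_id, product_code, name),
    )
    return product_id


def add_price(conn, product_id, sampled_at, price_cents):
    """Attach one price sample to a product via its surrogate key."""
    conn.execute(
        "INSERT INTO price (product_id, sampled_at, price_cents) VALUES (?, ?, ?)",
        (product_id, sampled_at, price_cents),
    )


# Made-up scenario: the seller recycles code "A-1" for a different product.
old_id = add_product(conn, "A-1", "Mug")
new_id = add_product(conn, "A-1", "Teapot")
add_price(conn, old_id, "2024-01-01T00:00:00", 999)
add_price(conn, new_id, "2024-06-01T00:00:00", 2499)
conn.commit()

teapot_prices = conn.execute(
    "SELECT price_cents FROM price WHERE product_id = ?", (new_id,)
).fetchall()
```

The trade-off is that the processor now has to decide when a scraped code refers to an existing product and when it refers to a new one, a decision the natural-key design makes implicitly.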
If you are truly concerned about scaling, specify your expected volume of data. A serious enterprise database such as Postgres can easily handle millions of rows, even on low-end computer hardware such as a Mac mini. (Though, for critical data you would of course be using a better machine with error-correcting memory and RAID storage.)