My goal
I want to record data for a market basket analysis (https://en.wikipedia.org/wiki/Affinity_analysis) on computer software and query this data frequently.
Data description
One user may own one item and use it for a certain time. Usage is stored as an integer in minutes. There is a distinction between usage_forever and usage_recent: usage_recent gives insight into the usage over the past week for one product, while usage_forever with a value of zero indicates that the user bought the product but never used it. I use MySQL as the database management system.
SQL fiddle link with sample data: http://sqlfiddle.com/#!9/92d4bc
ownership table
+----------------------+-------------------+-------------+------------------------+------------------------+
| userID | productID | modified | usage_forever | usage_recent |
+----------------------+-------------------+-------------+------------------------+------------------------+
| bigint(20)unsigned#1 | int(10)unsigned#1 | timestamp#1 | mediumint(8)unsigned#1 | mediumint(8)unsigned#1 |
+----------------------+-------------------+-------------+------------------------+------------------------+
CREATE TABLE product(
productID INT UNSIGNED NOT NULL PRIMARY KEY,
product_name VARCHAR(191)
) ENGINE=INNODB CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE TABLE product_user(
userID BIGINT UNSIGNED NOT NULL PRIMARY KEY,
display_name VARCHAR(32) NOT NULL,
products_in_account SMALLINT NOT NULL DEFAULT 0 -- this is a derived value, denormalized for convenience
) ENGINE=INNODB CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE TABLE ownership (
userID BIGINT UNSIGNED NOT NULL,
productID INT UNSIGNED NOT NULL,
modified TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP NOT NULL,
usage_forever MEDIUMINT UNSIGNED NULL,
usage_recent MEDIUMINT UNSIGNED NULL,
PRIMARY KEY(userID, productID),
UNIQUE INDEX ux_ownership_productID_userID (productID, userID),
CONSTRAINT fk_ownership_productUser_userID
FOREIGN KEY (userID)
REFERENCES product_user(userID)
ON UPDATE CASCADE,
CONSTRAINT fk_ownership_product_productID
FOREIGN KEY (productID)
REFERENCES product(productID)
ON UPDATE CASCADE
) ENGINE=INNODB CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
The problem
The last time I stored this data there were huge performance issues. My product table contained 35,000 items and my user table 35,000 rows. This resulted in 1.4 million entries in my ownership table. A few users own 11,000+ items, but most users own between 100 and 1,000 items. Queries against this table were painfully slow, so I sliced it down to a tiny selection in order to perform my analysis.
Use cases, ordered by most important task first:
- Frequent select queries for specific users or specific items. This is the most important use case.
- Update queries to usages for 50,000 users a day.
- Insert queries for new users, up to 1,000 times a day.
- Market basket analysis by user, group of users, product, or group of products.
- Very rare delete queries.
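As a sketch of the basket-analysis use case, a pairwise co-occurrence query against the ownership table defined above might look like this (the HAVING threshold is an arbitrary illustration, not a recommendation):

```sql
-- Count how often each pair of products appears in the same account:
-- the raw input for market basket / affinity analysis.
SELECT a.productID AS product_a,
       b.productID AS product_b,
       COUNT(*)    AS accounts_owning_both
FROM ownership AS a
JOIN ownership AS b
  ON  a.userID    = b.userID
  AND a.productID < b.productID   -- avoid mirrored duplicate pairs
GROUP BY a.productID, b.productID
HAVING COUNT(*) >= 100            -- arbitrary minimum-support cutoff
ORDER BY accounts_owning_both DESC;
```

The self-join can use the existing PRIMARY KEY (userID, productID), but on millions of rows this is still an expensive aggregation, which is why the scale questions below matter.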
My solution
My new proposal for this table is to split it into ownership_forever and ownership_recent:
ownership_forever, PRIMARY KEY(userID, productID)
+----------------------+-------------------+------------------------+
| userID | productID | usage_forever |
+----------------------+-------------------+------------------------+
| bigint(20)unsigned#1 | int(10)unsigned#1 | mediumint(8)unsigned#1 |
+----------------------+-------------------+------------------------+
ownership_recent, PRIMARY KEY(userID, productID); PRIMARY KEY for the archive (userID, productID, modified)
+----------------------+-------------------+-------------+------------------------+
| userID | productID | modified | usage_recent |
+----------------------+-------------------+-------------+------------------------+
| bigint(20)unsigned#1 | int(10)unsigned#1 | timestamp#1 | mediumint(8)unsigned#1 |
+----------------------+-------------------+-------------+------------------------+
This way many empty rows are omitted, because many users never used their products. The fact remains, however, that these tables will become huge:
- I will have to work with at least 300,000 users. Extrapolating from my sample, the new ownership_forever table will contain 12 million rows.
- In the future I want this table to scale up to 10 million users. Extrapolating from my sample, ownership_forever will then contain 400 million rows.
- I plan to periodically dump ownership_recent into a partitioned archive table.
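A sketch of what that periodic dump could look like, assuming RANGE partitioning by year of modified (partition names, boundaries, and the one-week cutoff are illustrative assumptions):

```sql
-- Archive table; the partitioning column must be part of every unique key,
-- hence 'modified' in the PRIMARY KEY.
CREATE TABLE ownership_recent_archive (
    userID       BIGINT UNSIGNED NOT NULL,
    productID    INT UNSIGNED NOT NULL,
    modified     TIMESTAMP NOT NULL,
    usage_recent MEDIUMINT UNSIGNED NULL,
    PRIMARY KEY (userID, productID, modified)
) ENGINE=INNODB
PARTITION BY RANGE (UNIX_TIMESTAMP(modified)) (
    PARTITION p2018 VALUES LESS THAN (UNIX_TIMESTAMP('2019-01-01 00:00:00')),
    PARTITION p2019 VALUES LESS THAN (UNIX_TIMESTAMP('2020-01-01 00:00:00')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Periodic dump: copy rows older than a week, then delete them.
-- Run both statements inside one transaction so no rows are lost in between.
INSERT INTO ownership_recent_archive (userID, productID, modified, usage_recent)
SELECT userID, productID, modified, usage_recent
FROM ownership_recent
WHERE modified < NOW() - INTERVAL 7 DAY;

DELETE FROM ownership_recent
WHERE modified < NOW() - INTERVAL 7 DAY;
```

Note that MySQL does not allow foreign keys on partitioned InnoDB tables, which is usually acceptable for a pure archive.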
Questions
I am not experienced enough to estimate if this is a feasible solution. Can you assist in answering the following questions?
- Does this approach make sense? Are there alternatives that I have missed?
- Should I also use partitions for my non-archive tables?
- I could convert the BIGINT userID into a smaller INT(10). Does this have any effect on performance?
- What do you propose in order to query these large tables in an efficient way?
- Can MySQL handle this kind of data? Did I make a big mistake in my first sample, or why did I have such slow performance?
- How should I set up the indexes?
- Is this viable with a relational database on one machine, or a distributed database, or do I already need to set up a data warehouse?
- My new data is supplied in JSON format, which hints that NoSQL / document-based databases might be applicable. I have never worked with any of those. Can you comment on whether I should look into them?
The JSON format:
{"reply": {
"products_owned": 89,
"products": [
{
"productID": 11111,
"usage_forever": 11
},
{
"productID": 222222,
"usage_forever": 0
},
{
"productID": 333333,
"usage_forever": 0,
"usage_recent" : 69
}]}}
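For what it's worth, MySQL 8.0 can shred this JSON directly into relational rows with JSON_TABLE, so a document database is not required just because the feed arrives as JSON. A sketch, with @json standing in for one reply document:

```sql
-- MySQL 8.0+: turn the "products" array into relational rows.
SET @json = '{"reply": {"products_owned": 89, "products": [
  {"productID": 11111,  "usage_forever": 11},
  {"productID": 222222, "usage_forever": 0},
  {"productID": 333333, "usage_forever": 0, "usage_recent": 69}]}}';

SELECT jt.productID, jt.usage_forever, jt.usage_recent
FROM JSON_TABLE(
    @json, '$.reply.products[*]'
    COLUMNS (
        productID     INT UNSIGNED       PATH '$.productID',
        usage_forever MEDIUMINT UNSIGNED PATH '$.usage_forever',
        usage_recent  MEDIUMINT UNSIGNED PATH '$.usage_recent'
    )
) AS jt;
```

Missing keys (such as usage_recent in the first two objects) come back as NULL by default, which maps cleanly onto the nullable columns in the ownership table.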
Best Answer
tl;dr - MySQL is fine for this scenario. I recommend a slightly changed design.
I would take a slightly different approach.
I have put together a revised schema which I believe is better structured. There is no need for both a product_user and an ownership table - I have combined them into a table called portfolio.

Now, it's with the portfolio table that things become interesting. You have your user_id and product_id as you'd expect. Then, you have a usage type (CHAR(1)), which means that the usage_forever and usage_recent fields in your schema disappear. You could make these three fields (user, product, usage_type) the natural PRIMARY KEY - I've just made them UNIQUE with a surrogate.

The owned_since field would be better (IMHO) as the number of seconds since '1970-01-01 00:00:00' (use FROM_UNIXTIME/UNIX_TIMESTAMP to convert back and forth). UNIX time is more of a standard - or why not just a TIMESTAMP (is the old MySQL restriction on the number of those per table a concern?)?

The usage_type field means that for one record in your fiddle, there are two in mine (see below) - but I believe that this gives you extra flexibility and will make querying easier in the long run. As a rule of thumb (very generally), tables are "better" when they have more records and fewer fields.
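A minimal sketch of that portfolio table as described (the surrogate key, the qty column name, and the usage_type codes are my assumptions; the po_user_id/po_product_id names are taken from the indexing discussion below):

```sql
CREATE TABLE portfolio (
    po_id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, -- surrogate
    po_user_id    BIGINT UNSIGNED NOT NULL,
    po_product_id INT UNSIGNED    NOT NULL,
    usage_type    CHAR(1)         NOT NULL,  -- e.g. 'F' = forever, 'R' = recent (assumed codes)
    qty           MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,  -- usage in minutes
    owned_since   INT UNSIGNED    NOT NULL,  -- seconds since '1970-01-01 00:00:00'
    UNIQUE KEY ux_portfolio (po_user_id, po_product_id, usage_type)
) ENGINE=INNODB CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Converting owned_since back and forth:
-- SELECT FROM_UNIXTIME(owned_since) FROM portfolio;
-- INSERT ... VALUES (..., UNIX_TIMESTAMP('2019-06-01 12:00:00'));
```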
@RickJames (a regular contributor here) says this about partitioning in his MySQL (ex-employee) rules of thumb:
So, I would say no. Focus on indexing - check your execution plans if things run slowly - normally the problem sticks out like a sore thumb (what is it with thumbs today? :-) )
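Checking an execution plan costs nothing - just prefix the query with EXPLAIN (here against the original ownership table from the question):

```sql
EXPLAIN
SELECT productID, usage_forever
FROM ownership
WHERE userID = 201;
-- Look at the "key" column of the output: it should show the
-- PRIMARY key (userID, productID) being used for this lookup.
-- "type: ALL" (a full table scan) on a large table is the sore
-- thumb to look for.
```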
Small performance impact, but it does no harm to always have your INTEGER ranges set to the smallest size possible, so for 35,000 users, you'll want an UNSIGNED MEDIUMINT.
.I find that SQL queries are a pretty good way to, err..., query the data!
Sorry about that :-), but this is what SQL is designed for! Obviously, you won't want to be doing SELECT * FROM huge_table; very often, but for point and/or small range queries - with good indexes and a decent disk subsystem (what is yours, BTW?) - you should be fine!

Yes, MySQL can handle this magnitude of data - especially since your tables are not very wide and don't have BLOBs. As for your previous performance issues, I'm afraid we'd need much more information (DDL and DML, query plans, hardware info...). Also, you say that "queries to this table were painfully slow" - we'd need a definition of "slow".
- we'd need definitions of "slow".For your portfolio table - on po_user_id, on po_product_id, on usage_type if it's not too skewed. (po_user_id, po_product_id) and vice-versa. You also mention product groups - that will add a layer of complexity. You may need to tweak your indexing strategy at that point. For better or worse, there are no silver bullets - if there were, they'd be the default and guys like me would be out of a job! Depending on your queries, you might want some covering indexes - can't say in advance.
So far, your scenario doesn't appear very complicated, so my suggestion is to use an RDBMS for the time being (I recommend PostgreSQL if you're going Open Source - vastly superior to MySQL). Get a decent disk sub-system. If you spend any money at all, it should be there. Use RAID 10 (NOT RAID 5 or anything else).
I would keep the number of records in my main portfolio table to a minimum - try to move no-longer-queried or rarely-queried historical (archived) records off the system altogether, or keep them in a separate table or tables.
Relational databases work best with relational data. JSON (apart from being the flavour du jour) is for transmission and presentation layers - take your JSON and convert it to relational form and back if necessary. PostgreSQL has some fantastic JSON capabilities, and MySQL is getting them too - but they are nowhere near as sophisticated as PostgreSQL's.
tl;dr - stick with an RDBMS on decent disks (SSD, maybe - $1300 will get you one of these babies). You have 400,000,000 records at maybe 100 bytes/record = 40 GB. You could have blisteringly fast performance on your database with decent design and minimal outlay!
A final (and very important) point about your data:
My translation of your SQLFiddle into my schema gives this (for user1 - id = 201)
As you can see, in my data there are records with 0 quantity. Some of the records in your fiddle have usage_forever and usage_recent both set to 0 as well. Now, unless I've misunderstood something, it makes no sense to keep track of a quantity-0 ownership. In my schema it is irrational to keep the records with 0 quantity, and in your schema it is likewise illogical to keep ones which have both usage_forever and usage_recent = 0 (which results in two 0-quantity records for me).

I hope I've made some sense and that this has been helpful! The fiddle can be found here.