MySQL – Database design / database tuning for market basket analysis on ownership of products

database-design, MySQL, partitioning

My goal

I want to record data for a market basket analysis (https://en.wikipedia.org/wiki/Affinity_analysis) on computer software and query this data frequently.

Data description

One user may own one item and use it for a certain time. Usage is saved as an integer in minutes. There is a distinction between usage_forever and usage_recent: usage_recent gives insight into the usage of one product up to one week in the past, while a usage_forever of zero indicates that the user bought the product but never used it. I use MySQL as the database management system.

SQL fiddle link with sample data: http://sqlfiddle.com/#!9/92d4bc

ownership table
+----------------------+-------------------+-------------+------------------------+------------------------+
|        userID        |     productID     |  modified   |     usage_forever      |      usage_recent      |
+----------------------+-------------------+-------------+------------------------+------------------------+
| bigint(20)unsigned#1 | int(10)unsigned#1 | timestamp#1 | mediumint(8)unsigned#1 | mediumint(8)unsigned#1 |
+----------------------+-------------------+-------------+------------------------+------------------------+

CREATE TABLE product(
    productID INT UNSIGNED NOT NULL PRIMARY KEY,
    product_name VARCHAR(191)
) ENGINE=INNODB CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

CREATE TABLE product_user(
    userID BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    display_name VARCHAR(32) NOT NULL,
    products_in_account SMALLINT NOT NULL DEFAULT 0 -- this is a derived value, denormalized for convenience
) ENGINE=INNODB CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

CREATE TABLE ownership (
    userID BIGINT UNSIGNED NOT NULL,
    productID INT UNSIGNED NOT NULL,
    modified TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP NOT NULL,
    usage_forever MEDIUMINT UNSIGNED NULL,
    usage_recent MEDIUMINT UNSIGNED NULL,
    PRIMARY KEY(userID, productID),
    UNIQUE INDEX ux_ownership_productID_userID (productID, userID),
    CONSTRAINT fk_ownership_productUser_userID
        FOREIGN KEY (userID) 
        REFERENCES product_user(userID) 
        ON UPDATE CASCADE,
    CONSTRAINT fk_ownership_product_productID 
        FOREIGN KEY (productID) 
        REFERENCES product(productID)
        ON UPDATE CASCADE
) ENGINE=INNODB CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

The problem

The last time I stored this data there were huge performance issues. My product table contained 35,000 items and my user table 35,000 rows, which resulted in 1.4 million entries in my ownership table. A few users own more than 11,000 items, but most own somewhere between 100 and 1,000. Queries to this table were painfully slow, so I sliced it down to a tiny selection in order to perform my analysis.

Use cases, ordered by most important task first

  • Frequent select queries for specific users or specific items. This is
    the most important use case.
  • Update queries to usage values for 50,000 users a day
  • Insertion queries for new users, up to 1,000 a day
  • Market basket analysis by user, group of users, product, or group of
    products (a sketch of the kind of query I mean follows this list)
  • Very rare delete queries
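
For illustration, the co-occurrence query I have in mind would look roughly like this (only a sketch against the ownership table above; the support threshold is a placeholder):

-- count how often pairs of products are owned together
SELECT a.productID AS product_a,
       b.productID AS product_b,
       COUNT(*)    AS owned_together
FROM ownership a
JOIN ownership b
  ON  b.userID    = a.userID
  AND b.productID > a.productID   -- count each unordered pair once
GROUP BY a.productID, b.productID
HAVING COUNT(*) >= 100            -- placeholder support threshold
ORDER BY owned_together DESC;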

My solution

My new proposal for this table: Split it into ownership_forever and ownership_recent:

ownership_forever, PRIMARY KEY(userID, productID)
+----------------------+-------------------+------------------------+
|        userID        |     productID     |     usage_forever      |
+----------------------+-------------------+------------------------+
| bigint(20)unsigned#1 | int(10)unsigned#1 | mediumint(8)unsigned#1 |
+----------------------+-------------------+------------------------+

ownership_recent, PRIMARY KEY(userID, productID); PRIMARY KEY for the archive: (userID, productID, modified)
+----------------------+-------------------+-------------+------------------------+
|        userID        |     productID     |  modified   |      usage_recent      |
+----------------------+-------------------+-------------+------------------------+
| bigint(20)unsigned#1 | int(10)unsigned#1 | timestamp#1 | mediumint(8)unsigned#1 |
+----------------------+-------------------+-------------+------------------------+

This way many rows with zero usage are omitted, because many users never used their products. The fact remains, however, that these tables will become huge:

  • I will have to work with at least 300,000 users. Extrapolating from my
    sample, the new ownership_forever table will contain 12 million rows.
  • In the future I want this table to scale up to 10 million users.
    Extrapolating from my sample, ownership_forever would then contain 400
    million rows.
  • I plan to periodically dump ownership_recent into a partitioned archive
    table (a rough sketch follows this list).
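
A rough sketch of the archive table and the periodic dump (partition names and boundaries are placeholders):

CREATE TABLE ownership_recent_archive (
    userID BIGINT UNSIGNED NOT NULL,
    productID INT UNSIGNED NOT NULL,
    modified TIMESTAMP NOT NULL,
    usage_recent MEDIUMINT UNSIGNED NULL,
    PRIMARY KEY (userID, productID, modified)
) ENGINE=INNODB
PARTITION BY RANGE (UNIX_TIMESTAMP(modified)) (
    PARTITION p2018_06 VALUES LESS THAN (UNIX_TIMESTAMP('2018-07-01')),
    PARTITION p2018_07 VALUES LESS THAN (UNIX_TIMESTAMP('2018-08-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- the periodic dump, roughly:
INSERT INTO ownership_recent_archive (userID, productID, modified, usage_recent)
SELECT userID, productID, modified, usage_recent
FROM ownership_recent;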

Questions

I am not experienced enough to estimate if this is a feasible solution. Can you assist in answering the following questions?

  1. Does this approach make sense? Are there alternatives that I have
    missed?
  2. Should I also use partitions for my non-archive tables?
  3. I could convert the bigint userID into a smaller int(10). Does this have any effect on performance?
  4. What do you propose in order to query these large tables in an efficient way?
  5. Can MySQL handle this kind of data? Did I make a big mistake in my first attempt, or why did I have such slow performance?
  6. How should I set up the indexes?
  7. Is this viable with a relational database on one machine, a distributed database or do I already need to set up a data warehouse?
  8. My new data is supplied in a JSON format. This hints that NoSQL / document-based databases might be applicable. I never worked with any of those. Can you comment on whether I should look into those?

The JSON format:

{"reply": {
        "products_owned": 89,
        "products": [
            {
                "productID": 11111,
                "usage_forever": 11
            },
            {
                "productID": 222222,
                "usage_forever": 0
            },
            {
                "productID": 333333,
                "usage_forever": 0,
                "usage_recent" : 69
            }]}}

Best Answer

tl;dr - MySQL is fine for this scenario. I recommend a slightly changed design.

  • Does this approach make sense? Are there alternatives that I have missed?

In my opinion, I would take a slightly different approach.

I have put together a revised schema which I believe is better structured. There is no need for both a product_user and an ownership table - I have combined them in a table called portfolio.

CREATE TABLE product
(
    product_id   INT NOT NULL PRIMARY KEY,
    product_name VARCHAR(191)
);

CREATE TABLE user -- there was no user table in your fiddle!
(
  user_id INT NOT NULL PRIMARY KEY,
  user_name VARCHAR(25) NOT NULL -- make this unique
);

CREATE TABLE portfolio
(
  portfolio_id  INT NOT NULL AUTO_INCREMENT, -- surrogate; AUTO_INCREMENT so the sample INSERTs below can omit it
  po_user_id    INT,     -- FK references user table
  po_product_id INT,     -- FK references product table
  usage_type    CHAR(1), -- either 'F' (forever) or 'R' (recent)
  quantity      INT,
  owned_since   INT,     -- no. of minutes (UNIXTIME? see discussion below)
  last_modified TIMESTAMP,
  CONSTRAINT portfolio_pk PRIMARY KEY (portfolio_id),
  CONSTRAINT portfolio_uq UNIQUE (po_user_id, po_product_id, usage_type) 
);

ALTER TABLE portfolio ADD CONSTRAINT portfolio_user_fk    FOREIGN KEY (po_user_id)    REFERENCES user (user_id);
ALTER TABLE portfolio ADD CONSTRAINT portfolio_product_fk FOREIGN KEY (po_product_id) REFERENCES product (product_id);

Now, it's with the portfolio table that things become interesting. You have your user_id and product_id as you'd expect. Then, you have a usage type (CHAR(1)) which means that the usage_forever and usage_recent fields in your schema disappear. You could make these three fields (user, product, usage_type) the natural PRIMARY KEY - I've just made them UNIQUE with a surrogate.

The owned_since field would be better (IMHO) as the number of seconds since '1970-01-01 00:00:00' (use FROM_UNIXTIME()/UNIX_TIMESTAMP() to convert back and forth). Unix time is more of a standard - or why not just use a TIMESTAMP? (Older MySQL versions only allowed one automatically initialized/updated TIMESTAMP column per table, but that restriction was lifted in 5.6.)
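
For example, a minimal illustration of the conversion (the epoch value assumes a UTC session time zone):

SELECT UNIX_TIMESTAMP('2018-05-15 00:00:00') AS as_epoch_seconds, -- 1526342400 in UTC
       FROM_UNIXTIME(1526342400)             AS as_datetime;      -- '2018-05-15 00:00:00' in UTC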

The usage_type field means that for one record in your fiddle, there are two in mine (see below) - but I believe that this gives you extra flexibility and will make querying easier in the long run. As a rule of thumb (very generally), tables are "better" when they have more records and fewer fields.

INSERT INTO product (product_id, product_name)
VALUES(1, 'product#1'), (2, 'product#2'), (3, 'product#3'), (4, 'product#4');

INSERT INTO user VALUES (201, 'user#1'), (202, 'user#2'), (203, 'user#3'), (204, 'user#4');

-- There are two records for every one in your sample, 
-- i.e. I have separated out the F's and the R's!

-- 8 records here for user 201 (your user 1000000004805148)
-- one each for Forever (F) i.e. usage_forever, and Recent (R) - usage_recent

INSERT INTO portfolio (po_user_id, po_product_id, usage_type, quantity, owned_since, last_modified) 
VALUES
--              QTY   SINCE(MINS)    MODIFIED
  (201, 1, 'F',  122,      100000,   '2018-05-15 00:00:00'),
  (201, 1, 'R',  122,        1000,   '2018-05-15 00:00:00'),
  (201, 2, 'F',  122,        1000,   '2018-05-15 00:00:00'),
  (201, 2, 'R',    0,         500,   '2018-05-15 00:00:00'),
  (201, 3, 'F',    0,        1000,   '2018-05-15 00:00:00'),
  (201, 3, 'R',    0,         600,   '2018-05-15 00:00:00'),
  (201, 4, 'F',    0,        1000,   '2018-05-15 00:00:00'),
  (201, 4, 'R',    0,         500,   '2018-05-15 00:00:00'),

-- for discussion of records with QTY = 0, see below!
-- these quantities are based on the OP's fiddle!

  (202, 1, 'F',    1,         500,   '2018-05-15 00:00:00'),
  (202, 1, 'R',    0,         500,   '2018-05-15 00:00:00'),
  (202, 2, 'F',    2,         500,   '2018-05-15 00:00:00'),
  (202, 2, 'R',    1,         500,   '2018-05-15 00:00:00'),

  (203, 1, 'F',    0,       10000,   '2018-05-15 00:00:00'),
  (203, 1, 'R',    0,         500,   '2018-05-15 00:00:00'),

  (204, 3, 'F', 1555,       10000,   '2018-05-15 00:00:00'),
  (204, 3, 'R',  100,         500,   '2018-05-15 00:00:00')
;
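
If you ever need the original two-column layout back, conditional aggregation over usage_type gets you there (a sketch against the portfolio table above):

-- pivot the F/R rows back into usage_forever / usage_recent columns
SELECT po_user_id,
       po_product_id,
       MAX(CASE WHEN usage_type = 'F' THEN quantity END) AS usage_forever,
       MAX(CASE WHEN usage_type = 'R' THEN quantity END) AS usage_recent
FROM portfolio
GROUP BY po_user_id, po_product_id;
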
  • Should I also use partitions for my non-archive tables?

@RickJames (a regular contributor here and an ex-MySQL employee) says this about partitioning in his rules of thumb:

It is so tempting to believe that PARTITIONing will solve performance problems. But it is so often wrong.

PARTITIONing splits up one table into several smaller tables. But table size is rarely a performance issue. Instead, I/O time and indexes are the issues.

So, I would say no. Focus on indexing - check your execution plans if things run slow - normally it sticks out like a sore thumb (what is it with thumbs today? :-) )
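
For example, a quick plan check on a hypothetical point query:

EXPLAIN SELECT * FROM portfolio WHERE po_user_id = 201;

If the key column in the output is NULL, the query is scanning rather than using an index, and that is usually where the time goes.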

  • I could convert the bigint userID into a smaller int(10). Does this have any effect on performance?

Small performance impact, but it does no harm to always set your INTEGER ranges to the smallest size possible, so for 35,000 users you'll want an UNSIGNED MEDIUMINT.
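
If you do shrink the column, it is a quick change (a sketch; if a foreign key already references the column, you will need to drop and re-create that constraint around these statements, and both sides must end up with the same type):

ALTER TABLE user      MODIFY user_id    MEDIUMINT UNSIGNED NOT NULL;
ALTER TABLE portfolio MODIFY po_user_id MEDIUMINT UNSIGNED;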

  • What do you propose in order to query these large tables in an efficent way?

I find that SQL queries are a pretty good way to, err..., query the data!

Sorry about that :-), but this is what SQL is designed for! Obviously, you won't want to be doing SELECT * FROM huge_table; very often, but for point and/or small range queries - with good indexes and a decent disk subsystem (what is yours BTW?) - you should be fine!

  • Can MySQL handle this kind of data? Did I make a big mistake in my first attempt, or why did I have such slow performance?

Yes, MySQL can handle this magnitude of data - especially since your tables are not very wide and don't have BLOBs. As for your previous performance issues, I'm afraid we'd need way more information (DDL and DML, query plans, hardware info...). Also, you say that "queries to this table were painfully slow" - we'd need a definition of "slow".

  • How should I set up the indexes?

For your portfolio table: index po_user_id, po_product_id, and usage_type if it's not too skewed, plus the composite (po_user_id, po_product_id) and vice versa. You also mention product groups - that will add a layer of complexity, and you may need to tweak your indexing strategy at that point. For better or worse, there are no silver bullets - if there were, they'd be the default and guys like me would be out of a job! Depending on your queries, you might want some covering indexes - can't say in advance.
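
As a starting point (a sketch only; whether an index earns its keep depends on your actual queries):

-- the UNIQUE constraint (po_user_id, po_product_id, usage_type) already
-- serves lookups that lead with the user, so the main addition is the
-- reverse column order for lookups by product
CREATE INDEX ix_portfolio_product_user ON portfolio (po_product_id, po_user_id);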

  • Is this viable with a relational database on one machine, a distributed database or do I already need to set up a data warehouse?

So far, your scenario doesn't appear very complicated, so my suggestion is to use an RDBMS for the time being (I recommend PostgreSQL if you're going Open Source - vastly superior to MySQL). Get a decent disk sub-system. If you spend any money at all, it should be there. Use RAID 10 (NOT RAID 5 or anything else).

I would keep the number of records in my main portfolio table to a minimum - try to move no-longer-queried or rarely-queried historical records off the system altogether (archive them) or keep them in a separate table or tables.

  • My new data is supplied in a JSON format. This hints that NoSQL / document-based databases might be applicable. I never worked with any of those. Can you comment on whether I should look into those?

Relational databases work best with relational data. JSON (apart from being flavour du jour) is for transmission and presentation layers - take your JSON and convert it to relational and back if necessary. PostgreSQL has some fantastic JSON capabilities, and MySQL is getting them too - though nowhere near as sophisticated as PostgreSQL's.
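
For example, if you end up on MySQL 8.0, JSON_TABLE() can flatten the reply shown in the question straight into rows (a sketch with a shortened literal; on 5.7 you would do the flattening in application code or with JSON_EXTRACT()):

-- turn the JSON reply into relational rows (MySQL 8.0+)
SELECT jt.productID, jt.usage_forever, jt.usage_recent
FROM JSON_TABLE(
       '{"reply": {"products_owned": 2, "products": [
           {"productID": 11111,  "usage_forever": 11},
           {"productID": 333333, "usage_forever": 0, "usage_recent": 69}]}}',
       '$.reply.products[*]'
       COLUMNS (
         productID     INT PATH '$.productID',
         usage_forever INT PATH '$.usage_forever',
         usage_recent  INT PATH '$.usage_recent'   -- NULL when the key is absent
       )
     ) AS jt;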

tl;dr - stick with an RDBMS with decent disks (SSD maybe - $1300 will get you one of these babies). You have 400,000,000 records at maybe 100 bytes/record = 40GB. You could have blisteringly fast performance on your database with decent design and minimal outlay!

A final (and very important) point about your data:

My translation of your SQLFiddle into my schema gives this (for user1 - id = 201)

--              QTY   SINCE(MINS)    MODIFIED
  (201, 1, 'F',  122,      100000,   '2018-05-15 00:00:00'),
  (201, 1, 'R',  122,        1000,   '2018-05-15 00:00:00'),
  (201, 2, 'F',  122,        1000,   '2018-05-15 00:00:00'),
  (201, 2, 'R',    0,         500,   '2018-05-15 00:00:00'),
  (201, 3, 'F',    0,        1000,   '2018-05-15 00:00:00'),
  (201, 3, 'R',    0,         600,   '2018-05-15 00:00:00'),
  (201, 4, 'F',    0,        1000,   '2018-05-15 00:00:00'),
  (201, 4, 'R',    0,         500,   '2018-05-15 00:00:00'),

As you can see, in my data, there are records with 0 quantity. Some of the records in your fiddle have usage_forever and usage_recent both set to 0 as well. Now, for me, unless I've misunderstood something, it makes no sense to keep track of a quantity of 0 ownership.

In my schema, it is irrational to keep the records with 0 quantity, and in your schema it is also illogical to keep ones which have both usage_forever and usage_recent = 0 (which results in two 0 QTY records for me).

I hope I've made some sense and that this has been helpful! The fiddle can be found here.