Questioning a DB Consultant and their recommendations

database-design

We have hired a systems design consultant to help us architect a new CRM system, consolidating several business needs into one system instead of the several systems we currently run.

It's the typical project that a company spends a fortune on, and then discovers it doesn't do what it thought it would. I'm trying to make sure that doesn't happen to us.

Could I have some community feedback on some parts of the solution our consultant has recommended? Personally I'm unsure, but I'd defer to their expertise, and to this community's collective knowledge.


Our User Tables

Some of our users have verified accounts with us. Some will have bought tickets through a third party provider. Others will simply reach out to our customer service, and may not be customers yet.

One way of doing this would be to use three tables: VerifiedUsers (data from our verified user accounts — email, name, post code, etc.), TicketPurchasers (data pulled from our ticket provider, also verified — email, name, post code, etc.), and Individuals (essentially just an email address used to contact our customer service).
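To make the question concrete, here is a minimal sketch of that three-table layout, with the two verified tables pointing back at Individuals as the master. All table and column names are my own illustration (using SQLite for brevity), not the consultant's actual design:

```python
import sqlite3

# Illustrative schema only -- names and structure are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Individuals (
    individual_id INTEGER PRIMARY KEY,
    email         TEXT NOT NULL UNIQUE   -- often the only data we have
);
CREATE TABLE VerifiedUsers (
    verified_user_id INTEGER PRIMARY KEY,
    individual_id    INTEGER REFERENCES Individuals(individual_id),
    email            TEXT NOT NULL,
    name             TEXT,
    post_code        TEXT
);
CREATE TABLE TicketPurchasers (
    purchaser_id  INTEGER PRIMARY KEY,
    individual_id INTEGER REFERENCES Individuals(individual_id),
    email         TEXT NOT NULL,
    name          TEXT,
    post_code     TEXT
);
""")
```

The "master table" suggestion amounts to the two foreign keys above: every verified record hangs off a row in Individuals.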

Obviously it would be great to make connections between these tables, so we can pull together as much information about our customers/users as possible.

This all seems fine to me so far. Where I get a little unsure is that the consultant has suggested the Individuals table should be the master table.

This table is populated with the "dirtiest" data: the least-verified information, sent by the people with the least at stake (potentially not even customers — sometimes they write in just to rant at our customer service team).

Also, if someone uses several email addresses to reach out to customer service, how are we supposed to make a reliable, trustworthy connection to the TicketPurchasers and VerifiedUsers tables? It seems likely that one person could easily end up with several rows in the Individuals table, and we would never really know.
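One common mitigation I have seen (sketched here as an assumption, not something the consultant proposed) is to separate "person" from "email address": keep a canonical person row and record every known address as an alias pointing at it, so several emails can resolve to one individual:

```python
import sqlite3

# Hypothetical alias table: one canonical person, many email addresses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Persons (
    person_id INTEGER PRIMARY KEY
);
CREATE TABLE EmailAliases (
    email     TEXT PRIMARY KEY,  -- each address may appear only once
    person_id INTEGER NOT NULL REFERENCES Persons(person_id)
);
""")
conn.execute("INSERT INTO Persons (person_id) VALUES (1)")
conn.executemany(
    "INSERT INTO EmailAliases (email, person_id) VALUES (?, ?)",
    [("jo@home.example", 1), ("jo@work.example", 1)],
)

# Both addresses now resolve to the same person.
rows = conn.execute(
    "SELECT DISTINCT person_id FROM EmailAliases WHERE email IN (?, ?)",
    ("jo@home.example", "jo@work.example"),
).fetchall()
```

This doesn't solve identity resolution by itself — someone still has to decide that two addresses belong to the same person — but it gives the merge somewhere to live.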

Question: Does this make sense, even just from a database integrity point of view?


Building our own analytics

The consultant has also recommended we start creating our own analytics. This ostensibly makes sense as well: It would be great to be able to collate every touch point our customers/users make.

The idea would be that a table would be filled with a user_id (presumably taken from the Individuals table), an action_id (referencing the action they took — e.g. "Logged into their account", "Contacted customer service") and a timestamp.
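That event log might look something like the following minimal sketch (names and the composite index are my assumptions; the action strings are just examples from the question):

```python
import sqlite3
from datetime import datetime, timezone

# Minimal sketch of the proposed analytics/event table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Actions (
    action_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE UserEvents (
    event_id    INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL,
    action_id   INTEGER NOT NULL REFERENCES Actions(action_id),
    occurred_at TEXT NOT NULL       -- ISO-8601 timestamp
);
-- Supports the typical query: "what did user X do, in order?"
CREATE INDEX idx_events_user_time ON UserEvents (user_id, occurred_at);
""")
conn.execute("INSERT INTO Actions VALUES (1, 'Logged into their account')")
now = datetime.now(timezone.utc).isoformat()
conn.execute(
    "INSERT INTO UserEvents (user_id, action_id, occurred_at) VALUES (?, ?, ?)",
    (42, 1, now),
)
count = conn.execute(
    "SELECT COUNT(*) FROM UserEvents WHERE user_id = 42"
).fetchone()[0]
```

The table is append-only, which is part of why it grows without bound — nothing is ever updated or deleted in place.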

This data could be potentially interesting to look at (although I'd like to sit down and ensure it's useful, too), but after a few quick calculations it seems we could fill 1,000,000 rows in a year.

If we continue growing as a company (as we wish to), this table of analytics could easily hit 10,000,000 rows in a few years, and just keep growing.

This scares me.

For a start, I don't like the idea of a table just growing and growing. It makes me uneasy. Secondly, running live operations against a table of millions of rows (which is what would be required) could be prohibitively resource intensive. (Yes, I guess it depends how much my organisation is prepared to spend on this.)

Question: Are my concerns valid, or is it my inexperience?

Edit: In terms of size, it looks like I need to relax: https://stackoverflow.com/a/1995078/199700 Phew.

Thanks for any help!

Best Answer

If we continue growing as a company (as we wish to), this table of analytics could easily hit 10,000,000 rows in a few years, and just keep growing.

This scares me.

10M rows isn't something to be worried about. Even 10B rows isn't a problem for a single host. The main criterion is the throughput of the DB/engine: to prevent hangups and outages you need to be able to ingest/return data at a rate at least 3x faster than your actual (and potential) average load. 10x is better, of course.

This goal can be achieved in two ways. The first is DB/query optimization; in most cases the results are more than acceptable. But every engine has limitations, and those can be overcome by some kind of scaling. That second way is clustering, segmentation, load balancing, etc. — at that point the DBA should hand the problem to a system architect, since it's a different set of competencies. Here is a good explanation of how it works:
https://github.com/donnemartin/system-design-primer
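To illustrate the "optimize queries first" point, here is a small, hedged demonstration (SQLite used purely for convenience): with an appropriate index, a lookup against a large event table searches the index instead of scanning every row, which is what keeps per-query cost roughly flat as the table grows.

```python
import sqlite3

# Build a table with 100,000 event rows, then index the lookup column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE UserEvents (user_id INTEGER, action_id INTEGER, occurred_at TEXT)"
)
conn.executemany(
    "INSERT INTO UserEvents VALUES (?, ?, ?)",
    [(i % 1000, 1, "2024-01-01") for i in range(100_000)],
)
conn.execute("CREATE INDEX idx_user ON UserEvents (user_id)")

# Ask the planner how it would run a per-user lookup: the plan's detail
# column should name the index rather than a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM UserEvents WHERE user_id = 7"
).fetchone()
```

When even well-indexed queries can't keep up, that's the signal to move to the second approach (partitioning, clustering, read replicas) described in the primer linked above.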