MySQL – Need help on best database design for quick retrieval of data

MySQL

I'm in the process of designing a database that will replace the many CSV files I currently use for data storage, which are starting to get messy and inconsistent. I am using C# in Visual Studio. It is for Bricklink/Lego data, and I'll just explain the section that I need help with below, simplifying the actual figures:

The part of the database I'll focus on has 3 tables:

- Parts contains PartID and about 30 other fields (e.g. mass, averageSalePrice). There are 50,000 Parts.
- Store contains StoreID and about 10 other fields. There are 1,000 Stores.
- StoreParts links the two in a many-to-many relationship. It contains PartID, StoreID, Date, Price and Notes.
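To make that concrete, here is a rough sketch of the schema I have in mind (the column types and the few named columns are just assumptions for illustration; the real tables have many more fields):

```sql
-- Rough sketch only; most of the ~30 Parts columns and ~10 Store columns are omitted.
CREATE TABLE Parts (
    PartID            VARCHAR(32)   PRIMARY KEY,
    Mass              DECIMAL(10,3),
    AverageSalePrice  DECIMAL(10,2)
    -- ... roughly 30 other fields
);

CREATE TABLE Store (
    StoreID    INT PRIMARY KEY,
    StoreName  VARCHAR(100)   -- assumed column, for illustration
    -- ... roughly 10 other fields
);

CREATE TABLE StoreParts (
    StoreID  INT          NOT NULL,
    PartID   VARCHAR(32)  NOT NULL,
    Date     DATE         NOT NULL,
    Price    DECIMAL(10,2),
    Notes    TEXT,
    PRIMARY KEY (StoreID, PartID, Date),
    FOREIGN KEY (StoreID) REFERENCES Store (StoreID),
    FOREIGN KEY (PartID)  REFERENCES Parts (PartID)
);
```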

Now here is my issue: each store has 10,000 parts, so there would be about 10 million records in StoreParts (more if I record multiple dates). One query that I am likely to run would need to retrieve all of the parts for a given store and compare their Price to the averageSalePrice in Parts. I'm worried this may run very slowly, as it would have to go through 10 million records of StoreParts to find the 10,000 for that store.
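To illustrate, the query I have in mind would be something like this (the store ID and date are placeholder values); my concern is whether MySQL would have to scan the whole 10-million-row StoreParts table to answer it:

```sql
-- All parts stocked by one store on one date, comparing Price to AverageSalePrice.
SELECT sp.PartID,
       sp.Price,
       p.AverageSalePrice,
       sp.Price - p.AverageSalePrice AS PriceDifference
FROM StoreParts AS sp
JOIN Parts      AS p  ON p.PartID = sp.PartID
WHERE sp.StoreID = 1234            -- placeholder store
  AND sp.Date    = '2024-01-01';   -- placeholder date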

When I was using CSV files to store the data, I had one file for each store/date, so it only had to open that file with the 10,000 parts. I feel this would be more efficient than having to find one store's parts in the list of 10 million or more.

Is there a way I can set up my database so there is a separate table for each store? I feel that this would be more efficient to search, but from my experience it does not fit with best practice for database design, as I would have 1,000 store tables. If I also record store data on different dates (e.g. one store has the price of all its parts on 100 different dates), then things could get way too big and slow.

I would welcome any advice on this, as I would love to do this properly and not have CSV files sitting around all over the place as I currently do. Thank you.

Best Answer

A modern relational database should be able to handle 10 million records in a table with no particular problem. As long as StoreParts is indexed on StoreID (for example via a composite primary key starting with StoreID), a query for one store only touches that store's ~10,000 rows rather than scanning all 10 million. You may have to spec up the hardware (mainly memory) if your queries aren't performant. I think the most expensive part of this would be the initial ingest, where you have to parse, organize, and import the CSVs. If you get new data regularly in the form of CSVs, that may be a concern, but it can usually be addressed by writing the ETL code in something efficient like Go.
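As a sketch of what I mean by the ingest: MySQL's LOAD DATA INFILE can bulk-load one store/date CSV straight into the link table, which is usually much faster than inserting row by row. The file path, column list, and constant values below are made-up placeholders to match the schema in the question:

```sql
-- Bulk-load one store/date CSV into StoreParts (path, column order and
-- constants are placeholders; adjust to the real CSV layout).
LOAD DATA LOCAL INFILE '/data/store_1234_2024-01-01.csv'
INTO TABLE StoreParts
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES                      -- skip the header row
(PartID, Price, Notes)
SET StoreID = 1234,
    Date    = '2024-01-01';
```

Whether you drive this from Go, C#, or plain SQL scripts, the important part is loading in bulk rather than issuing one INSERT per row.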

My next suggestion would be to use something non-relational. I see that you do joins across these datasets, but a non-relational store where each record is a specific part in a specific store might not be unreasonable. MongoDB or other NoSQL options would likely work for this use case. NoSQL databases handle large amounts of data very well but don't let you do much in terms of joins; if this is the extent of your dataset, that might work well for you.

I tend to lean toward Elasticsearch for datasets like these. Calling it a database might start a fight in certain circles, and this isn't the primary intended use for Elasticsearch, but I've found it works well as a sort of NoSQL database. Its search API is easy to integrate with applications, and it's very easy to scale up to the performance you need.

tl;dr: Don't knock a relational database with the three tables you describe; I think it would work fine. But if that's the only join you'll ever need, flatten it and put it in a NoSQL database like Mongo, which should make it easier to achieve high performance. Depending on what you do with the data, Elasticsearch might also be worth looking into.

I apologize that this is a rather general answer, but it's a rather high-level question.