MySQL – Need help on best database design for quick retrieval of data

MySQL

I'm in the process of designing a database that will replace the many CSV files I currently use for data storage, which are starting to get messy and inconsistent. I am using C# in Visual Studio. It is for Bricklink/Lego data, and I'll just explain the section that I need help with below, simplifying the actual figures:

The part of the database I'll focus on has 3 tables:

- Parts contains PartID and about 30 other fields (e.g. mass, averageSalePrice). There are 50,000 Parts.
- Store contains StoreID and about 10 other fields. There are 1,000 Stores.
- StoreParts links the two in a many-to-many relationship. It contains PartID, StoreID, Date, Price and Notes.
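To make that concrete, here is a rough sketch of the schema I have in mind (the column types and the few named columns are just assumptions for illustration; the real tables have many more fields):

```sql
-- Rough sketch only; most of the ~30 Parts columns and ~10 Store columns are omitted.
CREATE TABLE Parts (
    PartID            VARCHAR(32)   PRIMARY KEY,
    Mass              DECIMAL(10,3),
    AverageSalePrice  DECIMAL(10,2)
    -- ... roughly 30 other fields
);

CREATE TABLE Store (
    StoreID    INT PRIMARY KEY,
    StoreName  VARCHAR(100)   -- assumed column, for illustration
    -- ... roughly 10 other fields
);

CREATE TABLE StoreParts (
    StoreID  INT          NOT NULL,
    PartID   VARCHAR(32)  NOT NULL,
    Date     DATE         NOT NULL,
    Price    DECIMAL(10,2),
    Notes    TEXT,
    PRIMARY KEY (StoreID, PartID, Date),
    FOREIGN KEY (StoreID) REFERENCES Store (StoreID),
    FOREIGN KEY (PartID)  REFERENCES Parts (PartID)
);
```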

Now here is my issue: each store has 10,000 parts, so there would be about 10 million records in StoreParts (more if I record multiple dates). One query that I am likely to run would need to retrieve all of the parts for a given store and compare their Price to the averageSalePrice in Parts. I'm worried this may run very slowly, as it would have to go through 10 million records of StoreParts to find the 10,000 for that store.
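To illustrate, the query I have in mind would be something like this (the store ID and date are placeholder values); my concern is whether MySQL would have to scan the whole 10-million-row StoreParts table to answer it:

```sql
-- All parts stocked by one store on one date, comparing Price to AverageSalePrice.
SELECT sp.PartID,
       sp.Price,
       p.AverageSalePrice,
       sp.Price - p.AverageSalePrice AS PriceDifference
FROM StoreParts AS sp
JOIN Parts      AS p  ON p.PartID = sp.PartID
WHERE sp.StoreID = 1234            -- placeholder store
  AND sp.Date    = '2024-01-01';   -- placeholder date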

When I was using CSV files to store the data, I had one file for each store/date, so it only had to open that file with the 10,000 parts. I feel this would be more efficient than having to find one store's parts in the list of 10 million or more.

Is there a way I can set up my database so there is a separate table for each store? I feel that this would be more efficient to search, but from my experience it does not fit with best practice for database design, as I would have 1,000 store tables. If I also record store data on different dates (e.g. one store has the price of all its parts on 100 different dates), then things could get way too big and slow.

I would welcome any advice on this, as I would love to do this properly and not have CSV files sitting around all over the place as I currently do. Thank you.

Best Answer

A modern relational database should be able to handle 10 million records in a table with no particular problem. As long as StoreParts is indexed on StoreID (for example via a composite primary key starting with StoreID), a query for one store only touches that store's ~10,000 rows rather than scanning all 10 million. You may have to spec up the hardware (mainly memory) if your queries aren't performant. I think the most expensive part of this would be the initial ingest, where you have to parse, organize, and import the CSVs. If you get new data regularly in the form of CSVs, that may be a concern, but it can usually be addressed by writing the ETL code in something efficient like Go.
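As a sketch of what I mean by the ingest: MySQL's LOAD DATA INFILE can bulk-load one store/date CSV straight into the link table, which is usually much faster than inserting row by row. The file path, column list, and constant values below are made-up placeholders to match the schema in the question:

```sql
-- Bulk-load one store/date CSV into StoreParts (path, column order and
-- constants are placeholders; adjust to the real CSV layout).
LOAD DATA LOCAL INFILE '/data/store_1234_2024-01-01.csv'
INTO TABLE StoreParts
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES                      -- skip the header row
(PartID, Price, Notes)
SET StoreID = 1234,
    Date    = '2024-01-01';
```

Whether you drive this from Go, C#, or plain SQL scripts, the important part is loading in bulk rather than issuing one INSERT per row.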

My next suggestion would be to use something non-relational. I see that you do joins across these datasets, but a non-relational store where each record is a specific part in a specific store might not be unreasonable. MongoDB or other NoSQL options would likely work for this use case. NoSQL databases handle large amounts of data very well but don't let you do much in terms of joins; if this is the extent of your dataset, that might work well for you.

I tend to lean toward Elasticsearch for datasets like these. Calling it a database might start a fight in certain circles, and this isn't the primary intended use for Elasticsearch, but I've found it works well as a sort of NoSQL database. Its search API is easy to integrate with applications, and it's very easy to scale up to the performance you need.

tl;dr: Don't knock a relational database with the three tables you describe; I think it would work fine. But if that's the only join you'll ever need, flatten it and put it in a NoSQL database like Mongo, which should make it easier to achieve high performance. Depending on what you do with the data, Elasticsearch might also be worth looking into.

I apologize that this is a rather general answer, but it's a rather high-level question.