Mysql – Lots of Indexes – MySQL vs MongoDB – Migrating

indexmongodbMySQL

So, I've never used MongoDB, I just read a lot about it and I think it's going to be good for my project. Also, I do not have lots of experience with MySQL and to be even more honest I have no clue of what I'm about to ask.

Scenario:

MySQL table profile:

  • id = [pk, auto_increment, smallint]
  • user_id = [pk, fk, varchar]
  • category_id = [pk, fk, smallint]
  • role_id = [pk, fk, tinyint]
  • country_id = [pk, fk, smallint]
  • state_id = [pk, fk, smallint]
  • legal_document = [pk, varchar, ?unique]
  • name = [pk, varchar, ?unique]
  • type = [pk, boolean]
  • last_activity = [pk, date]

All the fk you see there are MySQL tables of course. Then I'm thinking to use MongoDB to store the profile info, profile_info collection should contain documents like:

{
  '_id' : 1 (profile_id),
  'address': 'Some street in some state of some country :P',
  'phone': [5555555, 5555555],
  'email': 'example@example.com',
   etc...
}

I'm planning to use mongodb because my project needs to be public ASAP and we may add several new profile info and other things and I do not want to do alter table and migrate from tools with lots of rows.

With that said, we may have to add a few more pks to the MySQL table, so I was thinking of migrating all the project to MongoDB and I do not know if that is a good move.

Questions:

  • It's better to keep the pks in MySQL and trivial info in MongoDB, or would it be fine if I move all to MongoDB?

  • May MongoDB be faster if I just migrate the whole project but I keep the structure like that? I mean like having profile and profile_info collections instead of just profile.

  • I'm worried about how many resources MongoDB could use for a 'table/collection' with that many indexes, I want to keep disk space and ram at minimum use. Is there a critical difference between MySQL and MongoDB?

PS: SSDs are going to be used in the system.

PS II: All tables are just planned, nothing is written yet. I'm a bit of plan very ahead person so please be patient with me.

Best Answer

Given your description I strongly suggest against using MongoDB. Not because it would necessarily a bad choice (although I believe it is in your case for reasons other than pure technical ones).

Here are the points that caught my eye.

Data modeling

Trying to use MongoDB with relational data model with no adaptations almost always leads to tears and misery except for the most trivial use cases. And that is the better end of the story. The worse end is loosing money, potentially big time.

The reason for this is that with SQL, you identify your entities and their attributes and relations and then bang your head against the wall for a few hours to get your upper left above and beyond JOINS right to get your questions derived from your use cases answered. All while avoiding data redundancy like the devil holy water.

Data modeling with MongoDB works different. You identify the use cases and the questions derived from them and model your data in a way so that those questions can be answered in the most efficient way.

Since this is a bit abstract, let me give you an

Example

Let us assume you have a web application called "chirper", with users doing chirps. With SQL you would now model your data, coming up with something like a user table and a chirp table.

The first use case you encounter is that you want the latest 10 chirps to be displayed on the applications home page, together with the username of the chirper. With SQL it is easy enough, you do a join on chirps and users, sort the result descending by data and limit to 10 records.

With MongoDB, you'd have a careful look at what you need. For various reasons you do not want to embed the chirps into the user collection. But since you do need to relate the chirps to a user anyway, you decide to do so by username and come up with a "schema" for your chirps collection like this:

{
  _id: ObjectId("570b87a56931b8f21a8bf25c"),
  user: "jdoe",
  date: ISODate("2016-04-11T11:17:08.415Z"),
  text: "Chirp!!!"
}

and, miraculously enough, all you have to do is to do a

db.chirps.find({}).sort({date:-1}).limit(10)

which gives you the same result without a JOINS at the expense of relatively cheap disk space.

Resource limitation

I want to keep disk space and ram at minimum use

MongoDB is a lot, but surely not resource friendly. It was never meant to be a replacement for MySQL – which by the way was specifically designed to be a relatively lightweight general purpose RDBMS. I strongly advice against running MongoDB and anything else on the same server for production purposes. The reasons are manifold, but here are the most important ones:

  1. MongoDB will take up to 85% - 90% of your available, physical RAM. This is because the working set (the indices and a subset of the data) are kept in RAM.
  2. Depending on the storage engine and its configuration you use for MongoDB, even CPU utilization can be considerable.
  3. A heavily loaded MongoDB needs a lot of disk IO. I have seen systems with SSDs in a RAID0 in which the IO rate was at its limit.

Leaving out the petty details for the moment, this would mean that during situations where you do not want it the most (your application is taking off and has a lot of users), MongoDB would battle with the other parts of the application over resources.

Do not get me wrong: MongoDB is not a resource eating monster per se. But if it does what it is intended to do, namely dealing with huge amounts of data and data changes, you do not want to have the parts of your application battle for resources.

That being said: There are ways to limit MongoDBs resource allocation and make sure those limits are obeyed. Probably the most well known as of today is Docker. You have to decide wether it is worth it to run MongoDB in a Docker container.

As for the data file size: There is transparent compression available for the default wiredTiger storage engine. You can choose from either snappy(default) or the better known zlib compression. Both come at the expense of a higher CPU utilization, as mentioned above.

Time to market

Forgive me, but you demonstrate a lack of basic knowledge about MongoDB. Since you are not sure what to do, the best you can do is to research every step carefully, make a decision, rinse and repeat. And I am not even taking into account that you would need to remodel your data and most likely make massive changes to your applications persistence layer. If I were you and wanted/needed a fast time to market, I'd go with what I have as of now. In case problems occur during production, I'd get myself a consultant specializing on NoSQL to find out wether one of the various NoSQL databases suits your needs, identify said DBMS (which may or may not be MongoDB), get myself a specialist for that and do the migration only then. With a specialist on your side.

Administration

A mistake commonly made by people is that they think it is as easy to manage a MongoDB deployment as it is to getting it to work. It is not. Any DBA, Sysadmin or even DevOps (even when used in the wrongest sense of the word) should be able to get a sharded cluster running. Choose the proper dimensions, identify problems, decide when to scale, deal with non-trivial problems and errors? Not so much. And note that the latter is highly subjective and prone to the skills of the DBA in question. Do you really want to store your production data on a system you barely understand?

Conclusion

Imho, MongoDB does not fit the requirements you stated. Changing now will most likely increase your time to market, and very much so since you do not understand the intricacies and pitfalls of MongoDB. In the worst case you built up technical debts while simultaneously diminishing the performance of your application.

MongoDB is less well suited for your resource limitation requirements, and finding a sweet spot for resource limitations that give you an acceptable performance requires an experienced DBA and quite some time, during which your application might run with sub-par performance.

Let me put it in a bit of context: MongoDB might very well be suitable for your use cases, and even excel. But I do not think it is feasible to change your persistence technology as of now, given your level of knowledge and the stated requirements.

hth