MySQL – Store arrays in table

array, database-design, MySQL, PHP

I'm not sure if this is the best SE site for my question; if not, please help me find the right one.

While developing a web application in PHP and MySQL, I'm unsure how best to store a large quantity of data coming from arrays.

Scenario

The application receives text messages from clients around the world; the number of clients might reach several thousand.
Each one sends a message with a JSON text payload a few times per day.

The JSON contains a few fields that identify the client itself, plus an array of about 10 objects, for example:

{
    "id_client": "abcdefgh",
    "id_msg": 3245,
    "datetime": "20180308101130",
    "data": [
        { "temp": 20.5 },
        { "humy": 45.1 },
        { "batt": 12.0 },
        { "cnt1": 123456 },
        ...
    ]
}

I need to store this information in the database.

My thoughts

The keys of the dictionary passed in the data array are stored in a separate table, along with their description, units, etc.

The first three fields (id_client, id_msg, datetime) are required to identify the exact message received. Hence, when I receive such a message, I currently create a row in a table for each entry in the dictionary:

+-----------+--------+----------------+------+--------+
| id_client | id_msg |    datetime    | key  | value  |
+-----------+--------+----------------+------+--------+
| abcdefgh  |   3245 | 20180308101130 | temp | 20.5   |
| abcdefgh  |   3245 | 20180308101130 | humy | 45.1   |
| abcdefgh  |   3245 | 20180308101130 | batt | 12.0   |
| abcdefgh  |   3245 | 20180308101130 | cnt1 | 123456 |
+-----------+--------+----------------+------+--------+
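For concreteness, this design might be declared roughly as follows (the column names come from the message above; the types are my assumptions, since they aren't given):

```sql
-- One row per key/value pair; the three identifying columns
-- (id_client, id_msg, datetime) are repeated on every row.
CREATE TABLE message_data (
    id_client  VARCHAR(16)   NOT NULL,
    id_msg     INT           NOT NULL,
    `datetime` DATETIME      NOT NULL,
    `key`      VARCHAR(16)   NOT NULL,  -- backticked: KEY is a reserved word in MySQL
    value      DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (id_client, id_msg, `key`)
);
```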

I'm wondering whether this is the right approach for such a situation.
I'm worried about the number of rows: if I have 1000 clients each sending a message with 10 entries twice per day, I will append:

  • 1000 * 10 * 2 = 20,000 rows per day
  • 20,000 * 30 = 600,000 rows per month
  • 20,000 * 365 = 7,300,000 rows per year

The data needs to stay online for at least a year at a time so I can run queries and statistics over it.

What about this approach? Is there something better for this kind of application?

Best Answer

You are duplicating the common data on multiple rows, which can lead to consistency issues. Read up on database normalisation and the normal forms for the standard ways to deal with this and to help keep your data consistent and efficient.

To break that duplicated data out you could do something like:

                    Message             MessageData
Client              ==============      ================
==============      id_msg    (PK) <--- id_msg (FK) (PK)
id_client (PK) <--- id_client (FK)      key         (PK)
client_name         time_stamp          value
contact_no

You potentially have a second problem though: the message data table presented above is still a property bag, or Entity-Attribute-Value (EAV) model, which while flexible can be problematic in a number of ways. You'll find many discussions online (including here on DBA.SE) and in print (Bill Karwin's excellent "SQL Antipatterns" has a relevant chapter) about the pros and cons of EAV structures, so I'll not delve into more detail here.

If those array keys in your JSON example are fixed then a better layout might be:

                    Message        
Client              ============== 
==============      id_msg    (PK) 
id_client (PK) <--- id_client (FK) 
client_name         time_stamp     
contact_no          temperature
                    humidity
                    battery
                    cnt1

Or, if those keys are common in the data but there may be others and you need to allow arbitrary key/value pairs, you can use a hybrid of fixed columns and a property bag:

                    Message             OtherMessageData
Client              ==============      ================
==============      id_msg    (PK) <--- id_msg (FK) (PK)
id_client (PK) <--- id_client (FK)      key         (PK)
client_name         time_stamp          value
contact_no          temperature
                    humidity
                    battery

7,300,000 rows per year

With appropriate table and index design 7.3 million rows is not a lot at all.
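For instance (the query pattern here is an assumption, not something stated in the question), a composite index matching your most common filter keeps per-client, time-range statistics fast even at millions of rows:

```sql
-- Supports queries like:
--   WHERE id_client = ? AND time_stamp BETWEEN ? AND ?
CREATE INDEX ix_message_client_time
    ON message (id_client, time_stamp);
```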