MySQL – Maintain History of Records but Using Most Recent

database-design, MySQL, query

Let's say I'm developing a database for a transportation service that owns certain vehicles and hires drivers who are assigned to specific vehicles. The database comprises the following tables:

vehicles

drivers

vehicle_drivers

id | vehicle_id | driver_id | date_assigned

daily_vehicle_reports

Here daily_vehicle_reports records the state of each vehicle at the end of each day.

At the end of each month, I generate data on all reports for that month. My problem is that the driver of any vehicle can change on any given day, while the data on the previous driver remains in the database. Normally, to generate the data I would use a query similar to the following:

SELECT
    dvr.*,
    v.number,
    v.date_registered,
    d.passport_number,
    d.first_name,
    d.last_name,
    d.dob,
    d.address
FROM daily_vehicle_reports dvr
INNER JOIN vehicles v ON(v.id = dvr.vehicle_id)
INNER JOIN vehicle_drivers vd ON(vd.vehicle_id = v.id)
INNER JOIN drivers d ON(d.id = vd.driver_id)
WHERE MONTH(dvr.report_data) = 4
AND YEAR(dvr.report_data) = 2019

This gives me repeated rows for each report, because multiple drivers are found for the same vehicle. How can I modify the query to include ONLY the last driver assigned as of that date (the date of the report)?

Best Answer

This is the well-known "max per group" problem, which can be solved with a subselect:

SELECT
       dvr.*,
       v.number,
       v.date_registered,
       d.passport_number,
       d.first_name,
       d.last_name,
       d.dob,
       d.address
  FROM daily_vehicle_reports AS dvr
  JOIN vehicles              AS v  ON v.id = dvr.vehicle_id
  JOIN vehicle_drivers       AS vd ON vd.vehicle_id = v.id

  JOIN ( SELECT vehicle_id                            -- additional join with a subselect
              , MAX(date_assigned) AS date_assigned   -- last assignment for each vehicle
           FROM vehicle_drivers
          GROUP BY vehicle_id
       ) AS vdm  ON vdm.vehicle_id = vd.vehicle_id    -- joined back to the full table `vehicle_drivers`
                AND vdm.date_assigned = vd.date_assigned -- to exclude every row except the last one

  JOIN drivers               AS d ON d.id = vd.driver_id
 WHERE MONTH(dvr.report_data) = 4      -- wrapping the column in a function prevents index use
   AND  YEAR(dvr.report_data) = 2019   -- same problem here
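
Note that the subselect above picks the latest assignment of each vehicle overall, not the latest one as of each report's date. If the driver at the time of each report is what is wanted, as the question asks, one possible variant is a correlated subquery. This is only a sketch under assumptions not stated in the question: that `date_assigned` and `report_data` are directly comparable, and that a vehicle has at most one assignment per `date_assigned` value (ties would still produce duplicate rows). The alias `vd2` is introduced here for illustration.

```sql
SELECT
       dvr.*,
       v.number,
       v.date_registered,
       d.passport_number,
       d.first_name,
       d.last_name,
       d.dob,
       d.address
  FROM daily_vehicle_reports AS dvr
  JOIN vehicles              AS v  ON v.id = dvr.vehicle_id
  JOIN vehicle_drivers       AS vd ON vd.vehicle_id = dvr.vehicle_id
  JOIN drivers               AS d  ON d.id = vd.driver_id
 WHERE MONTH(dvr.report_data) = 4
   AND  YEAR(dvr.report_data) = 2019
   -- keep only the assignment that was current on the day of the report
   AND vd.date_assigned = ( SELECT MAX(vd2.date_assigned)
                              FROM vehicle_drivers AS vd2
                             WHERE vd2.vehicle_id = dvr.vehicle_id
                               AND vd2.date_assigned <= dvr.report_data )
```

Because of the inner joins, a report dated before the vehicle's first assignment would be dropped from the result; a LEFT JOIN would be needed to keep such rows.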

To speed up the filtering by date, use a range condition instead:

WHERE dvr.report_data BETWEEN '2019-04-01' AND '2019-04-30'

This kind of date-range filter can use an index on report_data (if one exists) and is much faster.
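
One caveat: if `report_data` is a DATETIME rather than a DATE column (the question does not say), `BETWEEN '2019-04-01' AND '2019-04-30'` stops at midnight at the start of April 30 and misses later rows from that day. A half-open range avoids this for either type. The following is a sketch only; the index name `idx_dvr_report_data` is made up for illustration.

```sql
-- Hypothetical index name; skip this if report_data is already indexed.
CREATE INDEX idx_dvr_report_data
    ON daily_vehicle_reports (report_data);

-- Half-open range: from the first day of the month (inclusive)
-- up to the first day of the next month (exclusive).
SELECT dvr.*
  FROM daily_vehicle_reports AS dvr
 WHERE dvr.report_data >= '2019-04-01'
   AND dvr.report_data <  '2019-05-01';
```

Running `EXPLAIN` on the query is a quick way to check whether the index is actually chosen.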

Related Solutions

Mysql – Database design suggestions for a data scraping/warehouse application

These are general recommendations, since you do not show the full range of queries you plan to run (what kind of analytics you plan to do).

Assuming you do not need real-time results, you should just denormalize your data at the end of the period, precalculate your aggregated results once for all necessary timeframes (by day, by week, by month), and work only with summary tables. Depending on the queries you intend to run, you may not even need the original data.

If durability is not a problem (you can always recalculate statistics, as the raw data is elsewhere), you can use a caching mechanism (external, or the memcached plugin included in MySQL 5.6), which works great for writing and reading key-value data in memory.

Use partitioning (it can also be done manually), because in this kind of application the most frequently accessed rows are usually also the most recent. Delete old rows or archive them to other tables to use memory efficiently.

Use InnoDB if you want durability and highly concurrent writes, and if your most frequently accessed data is going to fit into memory. There is also TokuDB: it may not be faster in raw terms, but it scales better for insertions on huge, tall tables and allows compression on disk. There are also analytics-focused engines like Infobright.
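
As a rough illustration of the summary-table idea, the sketch below rolls one day of raw rows up into a daily table. Every table and column name in it (`raw_measurements`, `daily_summary`, `source_id`, `recorded_at`, `metric_value`) is hypothetical, since the original question's schema is not shown here.

```sql
-- Hypothetical raw table receiving the scraped rows.
CREATE TABLE raw_measurements (
    id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    source_id    INT UNSIGNED    NOT NULL,
    recorded_at  DATETIME        NOT NULL,
    metric_value DECIMAL(12,2)   NOT NULL,
    PRIMARY KEY (id),
    KEY idx_recorded_at (recorded_at)
) ENGINE=InnoDB;

-- Hypothetical summary table: one row per day and source.
CREATE TABLE daily_summary (
    summary_date DATE           NOT NULL,
    source_id    INT UNSIGNED   NOT NULL,
    row_count    INT UNSIGNED   NOT NULL,
    total_value  DECIMAL(18,2)  NOT NULL,
    PRIMARY KEY (summary_date, source_id)
) ENGINE=InnoDB;

-- End-of-day job: clear any earlier run for that day, then recompute it.
DELETE FROM daily_summary WHERE summary_date = '2019-04-01';

INSERT INTO daily_summary (summary_date, source_id, row_count, total_value)
SELECT DATE(recorded_at), source_id, COUNT(*), SUM(metric_value)
  FROM raw_measurements
 WHERE recorded_at >= '2019-04-01'
   AND recorded_at <  '2019-04-02'
 GROUP BY DATE(recorded_at), source_id;
```

Weekly and monthly tables could then be built from `daily_summary` rather than from the raw rows, which keeps the reporting queries small.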
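
A minimal sketch of range partitioning by month, assuming a hypothetical table `scraped_events` with a `recorded_at` timestamp column (neither name comes from the original question). MySQL requires the partitioning column to be part of every unique key, which is why the primary key here includes `recorded_at`.

```sql
CREATE TABLE scraped_events (
    id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    recorded_at  DATETIME        NOT NULL,
    metric_value DECIMAL(12,2)   NOT NULL,
    PRIMARY KEY (id, recorded_at)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(recorded_at)) (
    PARTITION p2019_03 VALUES LESS THAN (TO_DAYS('2019-04-01')),
    PARTITION p2019_04 VALUES LESS THAN (TO_DAYS('2019-05-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- Removing a whole month of old rows is then a metadata operation
-- rather than a row-by-row DELETE.
ALTER TABLE scraped_events DROP PARTITION p2019_03;
```

Queries that filter on `recorded_at` with a range condition only need to touch the matching partitions.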

Edit:

23 insertions/second is feasible on any storage, even with a bad disk, but:

You do not want to use MyISAM: it cannot do concurrent writes (except under very specific conditions), and you do not want huge tables that become corrupted and lose data.
InnoDB is fully durable by default; for better performance you may want to reduce the durability or have a good backend (disk caches). InnoDB tends to get slower on insertion with huge tables. A table counts as "huge" once the upper parts of the primary key and other unique indexes, which are needed to check for uniqueness, no longer fit into the buffer pool. That threshold varies with the memory available. If you want scalability beyond that, you have to partition (as I suggested above), shard, or use one of the alternative engines I mentioned before (TokuDB).

SUM() statistics do not scale on the standard MySQL engines. An index increases performance, again because most of the operations can be done in memory, but one entry per row still has to be read, in a single thread. I mentioned design alternatives (summary tables, caching) and alternative engines (column-based) as solutions to that. But if you do not need real-time results, only report-like queries, you shouldn't worry too much about it.

I suggest you do a quick load test with fake data. I've had many clients doing analytics on social-network data in MySQL without problems (well, at least after I helped them :-) ), but your decision may depend on your actual non-functional requirements.
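
One common way to "reduce the durability" is the `innodb_flush_log_at_trx_commit` setting, combined with batching several rows per transaction. This is a sketch of that option, not necessarily what the answer's author had in mind. The table `raw_measurements` and the inserted values are invented for illustration, and `SET GLOBAL` needs administrative privileges.

```sql
-- With the value 2, the redo log is written at each commit but flushed to
-- disk only about once per second, so an OS crash or power loss can lose
-- roughly the last second of transactions.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;

-- Hypothetical table, defined only so that the example is self-contained.
CREATE TABLE IF NOT EXISTS raw_measurements (
    id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    source_id    INT UNSIGNED    NOT NULL,
    recorded_at  DATETIME        NOT NULL,
    metric_value DECIMAL(12,2)   NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Batching rows into one transaction means one commit instead of three.
START TRANSACTION;
INSERT INTO raw_measurements (source_id, recorded_at, metric_value)
VALUES (1, '2019-04-01 10:00:00', 12.50),
       (1, '2019-04-01 10:00:01', 13.00),
       (2, '2019-04-01 10:00:01',  7.25);
COMMIT;
```

The default value of 1 keeps full durability, so this trade-off only makes sense when the raw data can be re-fetched or recalculated.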

MYSQL Including Missing Values Using the Previous Most Recent Record

A correct form of the query would be:

SELECT
    t1.date,
    (SELECT v1.type_id FROM tbl_values v1 WHERE v1.type_id = 100 AND v1.date <= t1.date ORDER BY v1.date DESC LIMIT 1) AS `type`,
    (SELECT v1.`value` FROM tbl_values v1 WHERE v1.type_id = 100 AND v1.date <= t1.date ORDER BY v1.date DESC LIMIT 1) AS `value`
FROM tbl_calendar t1
HAVING `type` IS NOT NULL

with the result:

2016-12-02  100 1.00
2016-12-03  100 1.00
2016-12-04  100 2.00
2016-12-05  100 2.00
2016-12-06  100 3.00
2016-12-07  100 3.00
2016-12-08  100 4.00
2016-12-09  100 4.00
2016-12-10  100 5.00

But again, please compare your query with your expected result.

What do you want to have in the result (type, date, value) with:

SELECT 
    c.date,

    o1.value_id AS "o1.value_id",
    o2.value_id AS "o2.value_id",

    o1.type_id AS "o1.type_id",
    o2.type_id AS "o2.type_id",

    o1.date AS "o1.date",
    o2.date AS "o2.date"
