These are general recommendations, since you do not show the full extent of the queries you plan to run (that is, which kind of analytics you intend to do).
Assuming you do not need real-time results, you should denormalize your data at the end of each period, precalculate your aggregated results once for all necessary timeframes (by day, by week, by month), and work only with summary tables. Depending on the queries you intend to run, you may not even need to keep the original data.
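A minimal sketch of that summary-table approach; `events`, `daily_summary` and their columns are hypothetical names, not from your schema:

```sql
-- Hypothetical raw table: events(event_date DATE, site_id INT, val INT)
CREATE TABLE daily_summary (
  day     DATE   NOT NULL,
  site_id INT    NOT NULL,
  total   BIGINT NOT NULL,
  PRIMARY KEY (day, site_id)
);

-- Precalculate once, at the end of the period:
INSERT INTO daily_summary (day, site_id, total)
SELECT event_date, site_id, SUM(val)
FROM events
GROUP BY event_date, site_id;
```

Weekly and monthly rollups can then be built from daily_summary itself, which is far smaller than the raw data.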
If durability is not a problem (you can always recalculate statistics, since the raw data lives elsewhere), you can use a caching mechanism (external, or the memcached plugin included in MySQL 5.6), which works great for writing and reading key-value data in memory.
Use partitioning (it can also be done manually): with this kind of application, the most frequently accessed rows are usually also the most recent. Delete or archive old rows to other tables to use your memory efficiently.
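As a sketch, assuming a hypothetical events table, range partitioning by date makes dropping or archiving old data almost free:

```sql
-- Hypothetical table, partitioned by year. The partitioning column
-- (event_date) must be part of every unique key, so it is in the PK.
CREATE TABLE events (
  event_date DATE NOT NULL,
  site_id    INT  NOT NULL,
  val        INT  NOT NULL,
  PRIMARY KEY (event_date, site_id)
)
PARTITION BY RANGE (YEAR(event_date)) (
  PARTITION p2015 VALUES LESS THAN (2016),
  PARTITION p2016 VALUES LESS THAN (2017),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Removing a whole year is a quick metadata operation,
-- much cheaper than a huge DELETE:
ALTER TABLE events DROP PARTITION p2015;
```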
Use InnoDB if you want durability and highly concurrent writes, and your most frequently accessed data fits into memory. There is also TokuDB: it may not be faster in raw terms, but it scales better for insertions into huge, tall tables and supports on-disk compression. There are also analytics-focused engines such as Infobright.
Edit:
23 insertions/second is feasible on almost any storage, even with a bad disk, but:
You do not want to use MyISAM: it cannot do concurrent writes (except under very specific conditions), and you do not want huge tables that can become corrupted and lose data.
InnoDB is fully durable by default; for better performance you may want to reduce the durability guarantees or have a good storage backend (disk caches). InnoDB tends to get slower on insertion as tables become huge. The definition of "huge" here is: the upper parts of the primary key and other unique indexes must fit into the buffer pool so uniqueness can be checked in memory, and that threshold varies with the memory available. If you want scalability beyond that, you have to partition (as suggested above) or shard, or use one of the alternative engines mentioned before (TokuDB).
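For example, one common way to trade a little durability for insert throughput on InnoDB (a server-wide setting, so weigh it carefully):

```sql
-- Flush the redo log to disk once per second instead of at every
-- commit; a crash can lose up to about one second of transactions.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
```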
SUM()-style aggregate queries do not scale on the standard MySQL engines. An index increases performance, again because most of the operations can be done in memory, but one index entry per row still has to be read, in a single thread. I mentioned design alternatives (summary tables, caching) and alternative engines (column-based) as solutions to that. But if you do not need real-time results, only report-like queries, you shouldn't worry too much about it.
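If you go the summary-table route, the summary can also be kept current incrementally as rows arrive, so reports never scan the raw data; a sketch, with daily_summary and its columns as hypothetical names:

```sql
-- Upsert one event's contribution into the per-day rollup.
-- (42 and 17 are placeholder values for site_id and val.)
INSERT INTO daily_summary (day, site_id, total)
VALUES (CURRENT_DATE, 42, 17)
ON DUPLICATE KEY UPDATE total = total + VALUES(total);
```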
I suggest you do a quick load test with fake data. I've had many clients doing social-network analytics on MySQL without problems (well, at least after I helped them :-) ), but your decision may depend on your actual non-functional requirements.
First, the condition WHERE date_field >= (CURDATE()-INTERVAL 1 MONTH)
will not restrict your results to the current month. It will fetch all dates from 30-31 days ago up to the current date (and into the future, if there are rows with future dates in the table).
It should be:
WHERE date_field >= LAST_DAY(CURRENT_DATE) + INTERVAL 1 DAY - INTERVAL 1 MONTH
AND date_field < LAST_DAY(CURRENT_DATE) + INTERVAL 1 DAY
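For example, to sum a value over the current calendar month with that inclusive-exclusive range (my_table and val are placeholder names):

```sql
-- Inclusive start, exclusive end. The open-ended upper bound also
-- keeps the condition sargable, so an index on date_field can be used.
SELECT SUM(val) AS month_total
FROM my_table
WHERE date_field >= LAST_DAY(CURRENT_DATE) + INTERVAL 1 DAY - INTERVAL 1 MONTH
  AND date_field <  LAST_DAY(CURRENT_DATE) + INTERVAL 1 DAY ;
```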
Now, to the main question. To produce all 28-31 dates of the month, even when the table has no rows for some of them, you can use a Calendar table (with all dates, say for years 1900 to 2200) or create the dates on the fly, with something like this (the days table can be either a temporary table, or you can even make it a derived table, at the cost of a somewhat more complicated query):
CREATE TABLE days
( d INT NOT NULL PRIMARY KEY ) ;
INSERT INTO days
VALUES (0), (1), (2), ....
..., (28), (29), (30) ;
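For reference, the same 0-30 list can also be produced as a derived table, without creating a real days table (the somewhat more complicated variant hinted at above); one common way in MySQL:

```sql
-- Cross join a units table (0-9) with a tens table (0-3),
-- then keep only 0..30.
SELECT a.n + 10 * b.n AS d
FROM (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2
      UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5
      UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8
      UNION ALL SELECT 9) AS a
CROSS JOIN
     (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2
      UNION ALL SELECT 3) AS b
WHERE a.n + 10 * b.n <= 30 ;
```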
SELECT
cal.my_date AS date_field,
COALESCE(t.val, 0) AS val
FROM
( SELECT
s.start_date + INTERVAL (days.d) DAY AS my_date
FROM
( SELECT LAST_DAY(CURRENT_DATE) + INTERVAL 1 DAY - INTERVAL 1 MONTH
AS start_date,
LAST_DAY(CURRENT_DATE)
AS end_date
) AS s
JOIN days
ON days.d <= DATEDIFF(s.end_date, s.start_date)
) AS cal
LEFT JOIN my_table AS t
ON t.date_field >= cal.my_date
AND t.date_field < cal.my_date + INTERVAL 1 DAY ;
The above should work for any type of the date_field column (DATE, DATETIME, TIMESTAMP). If the date_field column is of type DATE, the last join can be simplified to:
LEFT JOIN my_table AS t
ON t.date_field = cal.my_date ;
Best Answer
Before going into the main problem, there are a few more issues with the query:

- The use of string literals ('1907', '1') for values that are compared with columns that seem to have integer type (SiteID, UtilityID). If the columns are integers, use integers, not strings.
- Using BETWEEN and LAST_DAY() for datetime/timestamp comparisons. Assuming that the w.Timestamp column is indeed a TIMESTAMP, this will give you incorrect results unless all your timestamps have a 00:00:00 time part: LAST_DAY('2016-01-01') will be '2016-01-31 00:00:00', so you lose the whole last day of the month except its very first second. Changing that to BETWEEN '2016-06-01' AND '2016-07-01' is somewhat better, but still wrong, as you'll also get a few results from the next day (in July). One way that works correctly with all date types (DATE, DATETIME, TIMESTAMP) is to use inclusive-exclusive range checks: >= for the start of the range and < for the first moment after it. If the t.Timestamp column is of DATE type, then BETWEEN can be used, although I prefer the consistency of the inclusive-exclusive form.
- Using the old ANSI syntax without JOIN. This is not an error, but it is error-prone (we might forget a joining clause) and harder to debug when there are many tables. It's better (in my opinion) to use the "new" (since 1992 ;) explicit JOIN syntax. It also makes it easier to change an INNER join to a LEFT or RIGHT outer join.

Now for the main problem: the solution is to start from the Trasmitter table and then LEFT JOIN the details (Daily) table:
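Putting those fixes together, a sketch of such a query; note that TransmitterID and Value are assumed column names, since the original schema isn't shown:

```sql
-- Sketch only: TransmitterID and Value are guessed names.
SELECT t.TransmitterID,
       COALESCE(SUM(w.Value), 0) AS total
FROM Trasmitter AS t
LEFT JOIN Daily AS w
       ON  w.TransmitterID = t.TransmitterID
       AND w.Timestamp >= '2016-06-01'
       AND w.Timestamp <  '2016-07-01'
WHERE t.SiteID = 1907
  AND t.UtilityID = 1
GROUP BY t.TransmitterID ;
```

The date conditions go into the ON clause, not the WHERE clause, so transmitters with no rows in Daily still appear with a total of 0.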