I am working on a project where I need to show whether users were logged in or not over a given interval (currently within a 24-hour period over the course of a month, but could change).
I have a table with, among other things, the user's login and a timestamp (in milliseconds)
CREATE TABLE tableA (user_login varchar(64), tstamp double, ...)
(NOTE: I don't want to store unix time seconds or date in sql – as I lose precision of 4 places – here in this case it is not relevant but project is big. I expect to need this information, with full precision, elsewhere. I don't need to worry if I store unix time – seconds as date/timestamp divide by 1000 or multiply etc.)
From this, I need create a "map" to see if each user has logged in daily ( so the step will be one day in milliseconds ). The start and end will vary by days in month. In addition to a simple present/absent Boolean value, I'd like to produce an aggregated daily login count as an additional value.
Now, I could query the table, select back all the data, and put this together easily enough in my preferred coding language. However, we'll have thousands of rows per user; transporting all this over the network is a waste of resources for the client. So, I decided it made more sense to do this at the database server end.
So: Given the following data:
base data ( input table A ):
user_login | tstamp | ...
---------------------------------------------------------------
a | 1 | ...
a | 3 |
a | 5 |
a | 6 |
And the following definition of the intervals we're interested in:
- start of analysis period (
dstart
) = 1 - end of analysis period (
dend
) = 10 - interval size = 3
How do I get the following results?
login|dt|presence | count (facultative column with aggregated count)
-----------------
a |4 |1 | 2
a |7 |1 | 2
a |10|0 | 0
where:
– dt = the end of each interval
– presence = was there at least one row found for the user within the interval (1 = yes, 0 = no)
– count = the total number of rows fond for this user in this interval
Additional notes:
– if a tstamp
falls on the boundary between two intervals, it's OK for it to be counted in both intervals.
Best Answer
So, at the heart of things, we want to take a list of users, combine that with a list of ranges, and report how many records each user has in each range (1 or more means "present", 0 means "absent").
First (since you explain that you're trying to track whether users have logged on to the system every day for a month), I should note that there are a number of good ways to do this simply using dates, and a
date
table (a list of all possible dates in your range), and that matching date to date (note, date, not datetime or timestamp) might be faster than checking intervals. However, from our conversations, I think you need times down to milliseconds for other aspects of your project, and maintaining both is pointless. This approach probably makes more sense than converting the milliseconds to a date to compare to another date.You have said you want to use three values to define the intervals:
@dstart
)@dend
)@interval
)Again, coming out of our discussion, your expectation is that these values will always relate as follows:
where N is the number of intervals to be checked. However, it's always a good idea to account for mistakes in the input. If the above statement is not true, then we have a partial interval to account for at the end of the analysis period. This can be handled three ways: add another full interval (that goes past @dend); add the partial interval (ending at @dend); or ignore the partial interval. You agreed that ignoring the partial interval made sense.
For our initial test data , we'll use the following:
IN fact, this gives us an example of a partial interval. With our sample values, we should be checking three intervals: from 1 through 3; from 4 through 6; and from 7 through 9. Any
tstamp
s set to 10 will be ignored.My first suggestion is to generate the interval ranges (the start and end values for each interval) with a recursive Common Table Expression, or CTE.
At its base, a CTE lets you define a temporary table for the duration of a single query statement. The expression can be used multiple times within the query, just like a actual table. It has the additional advantage of allowing you to build the "table" by creating an initial data set, and then using the existing set recursively to build the full data set. This is frequently used for assembling hierarchical data (like organization charts), but can also be used to build something like a list of sequential dates, or our intervals.
Here's the CTE to build the interval list:
We start with our initial interval:
Then, using that as our starting point, we add new intervals, starting each new interval with the end value from the previous interval:
The
WHERE
clause tells us when to stop:However, this didn't work out for you; it evidently requires a newer version of MariaDB than what you have available.
So, for option two, I went with using a "numbers table". If you're not acquainted with those, it's a table (or, in our case, a view) with one column, holding integer values (in most cases stepping by one, and starting at 0 or 1). This allows a set-based solution to tasks that might otherwise require a loop or a cursor.
I've borrowed the code to do this from the accepted answer to this question. Check that answer (and the article it links to) for more info on numbers tables.
We start by manually creating a view with the numbers from 0 to 15 (which require up to 4 bits to represent). We can then expand quickly by using a neat little trick. If take the bit representation of the numbers 0-15, shift the bits 4 places to the left, and "concatenate" each of these with our original numbers (i.e., a cross join), we get all numbers represented in 8 bits (the "concatenation is actually a bitwise OR). Do thsi once, and you get 0 - 255; do it again with those results, and you get 0 - 65,535; again, and you get 0 - 4,294,967,296; etc. We shouldn't need that many numbers, but I'll include the views in the full code at the end.
So, our first interval will run from
@dstart
to@dstart + @interval - 1
; our second from@dstart + @interval
to@dstart + (2 * interval) - 1
. More generally, the start of interval X is@dstart + ((X - 1) * @interval)
, and the end if@dstart + (X * @interval) - 1)
. Since our generator starts us at 0 instead of 1,X
will ben + 1
in our actual query:And, of course, we stop when the next
endRange
would be bigger than@dend
. I'll include a query to show you the intervals below, as well.Once we have the intervals, we simply need to match each interval up to the data for each user in our table, and see how many rows have a
tstamp
value that falls within the current range:The subquery aliased as
z
does all the real work here. We match every record intblA
with every interval; then, we group the records togather based onuser_login
andstartRange
, and count the records that fall into our range (with theSUM
medCASE
statement). Since each interval is compared with all recods for theuser_login
, we do get a row out even if none of the records fall within the interval; we do need to see cases where the user wasn't present.The outer part of the query uses the count to set
presence
(it's 0 if the count is zero, 1 otherwise); we could probably do this all at once, but it would make the query that much harder to understand, and we'd just be duplicating effort.So, here's the full code. I tested it using the MySQL 5.5 engine at db-fiddle.com (most of the code was tested in pieces using the MariaDB 10.2 RC engine at dbfiddle.uk, but I haven't been able to get that site to respond for the past couple of days).
Here's the results using the test data shown:
Note that user
x
only has 3 logins counted; his login with timestamp10
matches our@dend
value, but@dend
doesn't match up with the end of an interval, so anything more than '9' is ignored.I also generated some test data that should be more similar to what would actually be seen, using these values:
Results were: