MariaDB – Query to Show User Presence Within Interval

mariadb

I am working on a project where I need to show whether users were logged in or not over a given interval (currently within a 24-hour period over the course of a month, but could change).

I have a table with, among other things, the user's login and a timestamp (in milliseconds)

CREATE TABLE tableA (user_login varchar(64), tstamp double, ...)

(NOTE: I don't want to store unix time seconds or date in sql – as I lose precision of 4 places – here in this case it is not relevant but project is big. I expect to need this information, with full precision, elsewhere. I don't need to worry if I store unix time – seconds as date/timestamp divide by 1000 or multiply etc.)

From this, I need create a "map" to see if each user has logged in daily ( so the step will be one day in milliseconds ). The start and end will vary by days in month. In addition to a simple present/absent Boolean value, I'd like to produce an aggregated daily login count as an additional value.

Now, I could query the table, select back all the data, and put this together easily enough in my preferred coding language. However, we'll have thousands of rows per user; transporting all this over the network is a waste of resources for the client. So, I decided it made more sense to do this at the database server end.

So: Given the following data:

base data ( input table A ):

user_login     |   tstamp                 |  ...
---------------------------------------------------------------
    a          | 1                        |  ...
    a          | 3                        |
    a          | 5                        |
    a          | 6                        |

And the following definition of the intervals we're interested in:

  • start of analysis period (dstart) = 1
  • end of analysis period (dend) = 10
  • interval size = 3

How do I get the following results?

login|dt|presence | count (facultative column with aggregated count)
----------------- 
a    |4 |1        | 2
a    |7 |1        | 2
a    |10|0        | 0 

where:
– dt = the end of each interval
– presence = was there at least one row found for the user within the interval (1 = yes, 0 = no)
– count = the total number of rows fond for this user in this interval

Additional notes:
– if a tstamp falls on the boundary between two intervals, it's OK for it to be counted in both intervals.

Best Answer

So, at the heart of things, we want to take a list of users, combine that with a list of ranges, and report how many records each user has in each range (1 or more means "present", 0 means "absent").

First (since you explain that you're trying to track whether users have logged on to the system every day for a month), I should note that there are a number of good ways to do this simply using dates, and a date table (a list of all possible dates in your range), and that matching date to date (note, date, not datetime or timestamp) might be faster than checking intervals. However, from our conversations, I think you need times down to milliseconds for other aspects of your project, and maintaining both is pointless. This approach probably makes more sense than converting the milliseconds to a date to compare to another date.

You have said you want to use three values to define the intervals:

  1. The minimum value we want to look at (@dstart)
  2. The maximum value we want to look at (@dend)
  3. The size of each interval we want to examine (@interval)

Again, coming out of our discussion, your expectation is that these values will always relate as follows:

@datart + (N * @interval) - 1 = @dend

where N is the number of intervals to be checked. However, it's always a good idea to account for mistakes in the input. If the above statement is not true, then we have a partial interval to account for at the end of the analysis period. This can be handled three ways: add another full interval (that goes past @dend); add the partial interval (ending at @dend); or ignore the partial interval. You agreed that ignoring the partial interval made sense.

For our initial test data , we'll use the following:

INSERT INTO tblA (user_login, tstamp) VALUES
       ('a', 1),('a',3),('a',5),('a',6)
      ,('x', 2),('x',6),('x',7),('x',10)
;

SET @dstart = 1;
SET @dend = 10;
SET @interval = 3;

IN fact, this gives us an example of a partial interval. With our sample values, we should be checking three intervals: from 1 through 3; from 4 through 6; and from 7 through 9. Any tstamps set to 10 will be ignored.

My first suggestion is to generate the interval ranges (the start and end values for each interval) with a recursive Common Table Expression, or CTE.

At its base, a CTE lets you define a temporary table for the duration of a single query statement. The expression can be used multiple times within the query, just like a actual table. It has the additional advantage of allowing you to build the "table" by creating an initial data set, and then using the existing set recursively to build the full data set. This is frequently used for assembling hierarchical data (like organization charts), but can also be used to build something like a list of sequential dates, or our intervals.

Here's the CTE to build the interval list:

WITH recursive theRange AS 
     (SELECT @dstart as startRange
            ,@dstart + @interval as endRange
        FROM DUAL
      UNION ALL
      SELECT endRange, endRange + @interval
        FROM theRange
       WHERE endRange + @interval <= @dend
     )

We start with our initial interval:

SELECT @dstart as startRange
      ,@dstart + @interval as endRange
  FROM DUAL

Then, using that as our starting point, we add new intervals, starting each new interval with the end value from the previous interval:

SELECT endRange, endRange + @interval
  FROM theRange

The WHERE clause tells us when to stop:

 WHERE endRange + @interval <= @dend

However, this didn't work out for you; it evidently requires a newer version of MariaDB than what you have available.

So, for option two, I went with using a "numbers table". If you're not acquainted with those, it's a table (or, in our case, a view) with one column, holding integer values (in most cases stepping by one, and starting at 0 or 1). This allows a set-based solution to tasks that might otherwise require a loop or a cursor.

I've borrowed the code to do this from the accepted answer to this question. Check that answer (and the article it links to) for more info on numbers tables.

CREATE OR REPLACE VIEW generator_16
AS (SELECT 0 n)   UNION ALL (SELECT 1 n)   UNION ALL (SELECT 2 n)   UNION ALL 
   (SELECT 3 n)   UNION ALL (SELECT 4 n)   UNION ALL (SELECT 5 n)   UNION ALL
   (SELECT 6 n)   UNION ALL (SELECT 7 n)   UNION ALL (SELECT 8 n)   UNION ALL
   (SELECT 9 n)   UNION ALL (SELECT 10 n)  UNION ALL (SELECT 11 n)  UNION ALL
   (SELECT 12 n)  UNION ALL (SELECT 13 n)  UNION ALL (SELECT 14 n)  UNION ALL 
   (SELECT 15 n);

CREATE OR REPLACE VIEW generator_256
AS SELECT ( ( hi.n << 4 ) | lo.n ) AS n
     FROM generator_16 lo, generator_16 hi;

We start by manually creating a view with the numbers from 0 to 15 (which require up to 4 bits to represent). We can then expand quickly by using a neat little trick. If take the bit representation of the numbers 0-15, shift the bits 4 places to the left, and "concatenate" each of these with our original numbers (i.e., a cross join), we get all numbers represented in 8 bits (the "concatenation is actually a bitwise OR). Do thsi once, and you get 0 - 255; do it again with those results, and you get 0 - 65,535; again, and you get 0 - 4,294,967,296; etc. We shouldn't need that many numbers, but I'll include the views in the full code at the end.

So, our first interval will run from @dstart to @dstart + @interval - 1; our second from @dstart + @interval to @dstart + (2 * interval) - 1. More generally, the start of interval X is @dstart + ((X - 1) * @interval), and the end if @dstart + (X * @interval) - 1). Since our generator starts us at 0 instead of 1, X will be n + 1 in our actual query:

INSERT INTO intervalRange (startRange, endRange)
SELECT @dstart + (n * @interval)
      ,@dstart + ((n + 1) * @interval) - 1
  FROM generator_256
 WHERE @dstart + ((n + 1) * @interval) - 1 <= @dend
;

And, of course, we stop when the next endRange would be bigger than @dend. I'll include a query to show you the intervals below, as well.

Once we have the intervals, we simply need to match each interval up to the data for each user in our table, and see how many rows have a tstamp value that falls within the current range:

SELECT user_login, dt, CASE WHEN theCount = 0 THEN 0 ELSE 1 END as presence, theCount
  FROM (SELECT user_login, endRange as dt
              ,SUM(CASE WHEN     tblA.tstamp >= r.startRange
                             AND tblA.tstamp <= r.endRange
                     THEN 1
                     ELSE 0
                   END
                  ) as theCount
          FROM tblA CROSS JOIN intervalRange r
         GROUP BY user_login, endRange
       ) z
;

The subquery aliased as z does all the real work here. We match every record in tblA with every interval; then, we group the records togather based on user_login and startRange, and count the records that fall into our range (with the SUMmed CASE statement). Since each interval is compared with all recods for the user_login, we do get a row out even if none of the records fall within the interval; we do need to see cases where the user wasn't present.

The outer part of the query uses the count to set presence (it's 0 if the count is zero, 1 otherwise); we could probably do this all at once, but it would make the query that much harder to understand, and we'd just be duplicating effort.

So, here's the full code. I tested it using the MySQL 5.5 engine at db-fiddle.com (most of the code was tested in pieces using the MariaDB 10.2 RC engine at dbfiddle.uk, but I haven't been able to get that site to respond for the past couple of days).

CREATE TABLE tblA (user_login varchar(128), tstamp bigint, state varchar(128));

INSERT INTO tblA (user_login, tstamp) VALUES
       ('a', 1),('a',3),('a',5),('a',6)
      ,('x', 2),('x',6),('x',7),('x',10)
;


SET @dstart = 1;
SET @dend = 10;
SET @interval = 3;

CREATE OR REPLACE VIEW generator_16
AS (SELECT 0 n)   UNION ALL (SELECT 1 n)   UNION ALL (SELECT 2 n)   UNION ALL 
   (SELECT 3 n)   UNION ALL (SELECT 4 n)   UNION ALL (SELECT 5 n)   UNION ALL
   (SELECT 6 n)   UNION ALL (SELECT 7 n)   UNION ALL (SELECT 8 n)   UNION ALL
   (SELECT 9 n)   UNION ALL (SELECT 10 n)  UNION ALL (SELECT 11 n)  UNION ALL
   (SELECT 12 n)  UNION ALL (SELECT 13 n)  UNION ALL (SELECT 14 n)  UNION ALL 
   (SELECT 15 n);

CREATE OR REPLACE VIEW generator_256
AS SELECT ( ( hi.n << 4 ) | lo.n ) AS n
     FROM generator_16 lo, generator_16 hi;

CREATE OR REPLACE VIEW generator_64k
AS SELECT ( ( hi.n << 8 ) | lo.n ) AS n
     FROM generator_256 lo, generator_256 hi;

CREATE OR REPLACE VIEW generator_4b
AS SELECT ( ( hi.n << 16 ) | lo.n ) AS n
     FROM generator_64k lo, generator_64k hi;

CREATE TABLE intervalRange (startRange bigint, endRange bigint);

INSERT INTO intervalRange (startRange, endRange)
SELECT @dstart + (n * @interval)
      ,@dstart + ((n + 1) * @interval) - 1
  FROM generator_256
 WHERE @dstart + ((n + 1) * @interval) - 1 <= @dend
;


SELECT * FROM intervalRange;


SELECT user_login, dt, CASE WHEN theCount = 0 THEN 0 ELSE 1 END as presence, theCount
  FROM (SELECT user_login, endRange as dt
              ,SUM(CASE WHEN     tblA.tstamp >= r.startRange
                             AND tblA.tstamp <= r.endRange
                     THEN 1
                     ELSE 0
                   END
                  ) as theCount
          FROM tblA CROSS JOIN intervalRange r
         GROUP BY user_login, endRange
       ) z
;

Here's the results using the test data shown:

user_login | dt | presence | theCount
-----------+----+----------+---------
   a          1          1         2 
   a          4          1         2 
   a          7          0         0 
   x          1          1         1 
   x          4          1         1 
   x          7          1         1 

Note that user x only has 3 logins counted; his login with timestamp 10 matches our @dend value, but @dend doesn't match up with the end of an interval, so anything more than '9' is ignored.

I also generated some test data that should be more similar to what would actually be seen, using these values:

INSERT INTO tblA (user_login, tstamp) VALUES
       ('a', 1492834674999)
      ,('a', 1493038075000)
      ,('a', 1493056075000)
      ,('a', 1493088475000)
      ,('a', 1493162275000)
      ,('a', 1493209075000)
      ,('a', 1493218075000)
      ,('a', 1493300875000)
      ,('a', 1493381875000)
      ,('a', 1493390875000)
;

SET @dstart = 1492798675001;
SET @dend = 1493403475000;
SET @interval = 86400000;

Results were:

user_login |       dt      | presence | theCount
-----------+---------------+----------+---------
   a         1492885075000          1         1 
   a         1492971475000          0         0 
   a         1493057875000          1         2 
   a         1493144275000          1         1 
   a         1493230675000          1         3 
   a         1493317075000          1         1 
   a         1493403475000          1         2