First Words
You can safely ignore the sections below (and including) JOINs: Starting Off if you just want to take a crack of the code. The background and results just serve as context. Please look at the edit history before 2015-10-06 if you want to see what the code looked like initially.
Objective
Ultimately I want to calculate interpolated GPS coordinates for the transmitter (X
or Xmit
) based on the DateTime stamps of available GPS data in table SecondTable
that directly flank the observation in table FirstTable
.
My immediate objective to accomplish the ultimate objective is to figure out how to best join FirstTable
to SecondTable
to get those flanking time points. Later I can use that information I can calculate intermediate GPS coordinates assuming linear fitting along an equirectangular coordinate system (fancy words to say I don't care that the Earth is a sphere at this scale).
Questions
- Is there a more efficient way to generate the closest
before-and-after time stamps?- Fixed by myself by just grabbing
the "after," and then getting the "before" only as it related to the
"after".
- Fixed by myself by just grabbing
- Is there a more intuitive way that doesn't involve the
(A<>B OR A=B)
structure.- Byrdzeye provided the basic alternatives,
however my "real-world" experience didn't line up with all 4 of his
join strategies performing the same. But full credit to him for
addressing the alternative join styles.
- Byrdzeye provided the basic alternatives,
- Any other thoughts, tricks and advice you may have.
- Thusfar both byrdzeye and Phrancis have been quite helpful in this regard. I
found that Phrancis' advice was excellently laid out and provided
help at a critical stage, so I'll give him the edge here.
- Thusfar both byrdzeye and Phrancis have been quite helpful in this regard. I
I still would appreciate any additional help that I can receive with regard to question 3.
Bulletpoints reflect who I believe helped me most on the individual question.
Table Definitions
Semi-visual representation
FirstTable
Fields
RecTStamp | DateTime --can contain milliseconds via VBA code (see Ref 1)
ReceivID | LONG
XmitID | TEXT(25)
Keys and Indices
PK_DT | Primary, Unique, No Null, Compound
XmitID | ASC
RecTStamp | ASC
ReceivID | ASC
UK_DRX | Unique, No Null, Compound
RecTStamp | ASC
ReceivID | ASC
XmitID | ASC
SecondTable
Fields
X_ID | LONG AUTONUMBER -- seeded after main table has been created and already sorted on the primary key
XTStamp | DateTime --will not contain partial seconds
Latitude | Double --these are in decimal degrees, not degrees/minutes/seconds
Longitude | Double --this way straight decimal math can be performed
Keys and Indices
PK_D | Primary, Unique, No Null, Simple
XTStamp | ASC
UIDX_ID | Unique, No Null, Simple
X_ID | ASC
ReceiverDetails table
Fields
ReceivID | LONG
Receiver_Location_Description | TEXT -- NULL OK
Beginning | DateTime --no partial seconds
Ending | DateTime --no partial seconds
Lat | DOUBLE
Lon | DOUBLE
Keys and Indicies
PK_RID | Primary, Unique, No Null, Simple
ReceivID | ASC
ValidXmitters table
Field (and primary key)
XmitID | TEXT(25) -- primary, unique, no null, simple
SQL fiddle…
…so that you can play with the table definitions and code
This question is for MSAccess, but as Phrancis pointed out, there is no SQL fiddle style for Access. So, you should be able to go here to see my table definitions and code based on Phrancis' answer:
http://sqlfiddle.com/#!6/e9942/4 (external link)
JOINs: Starting off
My current "inner guts" JOIN Strategy
First create a FirstTable_rekeyed with column order and compound primary key (RecTStamp, ReceivID, XmitID)
all indexed/sorted ASC
. I also created indexes on each column individually. Then fill it like so.
INSERT INTO FirstTable_rekeyed (RecTStamp, ReceivID, XmitID)
SELECT DISTINCT ROW RecTStamp, ReceivID, XmitID
FROM FirstTable
WHERE XmitID IN (SELECT XmitID from ValidXmitters)
ORDER BY RecTStamp, ReceivID, XmitID;
The above query fills the new table with 153006 records and returns within a matter of 10 seconds or so.
The following completes within a second or two when this whole method is wrapped in a "SELECT Count(*) FROM ( … )" when the TOP 1 subquery method is used
SELECT
ReceiverRecord.RecTStamp,
ReceiverRecord.ReceivID,
ReceiverRecord.XmitID,
(SELECT TOP 1 XmitGPS.X_ID FROM SecondTable as XmitGPS WHERE ReceiverRecord.RecTStamp < XmitGPS.XTStamp ORDER BY XmitGPS.X_ID) AS AfterXmit_ID
FROM FirstTable_rekeyed AS ReceiverRecord
-- INNER JOIN SecondTable AS XmitGPS ON (ReceiverRecord.RecTStamp < XmitGPS.XTStamp)
GROUP BY RecTStamp, ReceivID, XmitID;
-- No separate join needed for the Top 1 method, but it would be required for the other methods.
-- Additionally no restriction of the returned set is needed if I create the _rekeyed table.
-- May not need GROUP BY either. Could try ORDER BY.
-- The three AfterXmit_ID alternatives below take longer than 3 minutes to complete (or do not ever complete).
-- FIRST(XmitGPS.X_ID)
-- MIN(XmitGPS.X_ID)
-- MIN(SWITCH(XmitGPS.XTStamp > ReceiverRecord.RecTStamp, XmitGPS.X_ID, Null))
Previous "inner guts" JOIN query
First (fastish…but not good enough)
SELECT
A.RecTStamp,
A.ReceivID,
A.XmitID,
MAX(IIF(B.XTStamp<= A.RecTStamp,B.XTStamp,Null)) as BeforeXTStamp,
MIN(IIF(B.XTStamp > A.RecTStamp,B.XTStamp,Null)) as AfterXTStamp
FROM FirstTable as A
INNER JOIN SecondTable as B ON
(A.RecTStamp<>B.XTStamp OR A.RecTStamp=B.XTStamp)
GROUP BY A.RecTStamp, A.ReceivID, A.XmitID
-- alternative for BeforeXTStamp MAX(-(B.XTStamp<=A.RecTStamp)*B.XTStamp)
-- alternatives for AfterXTStamp (see "Aside" note below)
-- 1.0/(MAX(1.0/(-(B.XTStamp>A.RecTStamp)*B.XTStamp)))
-- -1.0/(MIN(1.0/((B.XTStamp>A.RecTStamp)*B.XTStamp)))
Second (slower)
SELECT
A.RecTStamp, AbyB1.XTStamp AS BeforeXTStamp, AbyB2.XTStamp AS AfterXTStamp
FROM (FirstTable AS A INNER JOIN
(select top 1 B1.XTStamp, A1.RecTStamp
from SecondTable as B1, FirstTable as A1
where B1.XTStamp<=A1.RecTStamp
order by B1.XTStamp DESC) AS AbyB1 --MAX (time points before)
ON A.RecTStamp = AbyB1.RecTStamp) INNER JOIN
(select top 1 B2.XTStamp, A2.RecTStamp
from SecondTable as B2, FirstTable as A2
where B2.XTStamp>A2.RecTStamp
order by B2.XTStamp ASC) AS AbyB2 --MIN (time points after)
ON A.RecTStamp = AbyB2.RecTStamp;
Background
I have a telemetry table (aliased as A) of just under 1 million entries with a compound primary key based on a DateTime
stamp, a Transmitter ID and a Recording Device ID. Due to circumstances beyond my control, my SQL language is the standard Jet DB in Microsoft Access (users will use 2007 and later versions). Only about 200,000 of these entries are relevant to the query because of the Transmitter ID.
There is a second telemetry table (alias B) that involves approximately 50,000 entries with a single DateTime
primary key
For the first step, I focused on finding the closest timestamps to the stamps in the first table from the second table.
JOIN Results
Quirks that I've Discovered…
…along the way during debugging
It feels really odd to be writing the JOIN
logic as FROM FirstTable as A INNER JOIN SecondTable as B ON (A.RecTStamp<>B.XTStamp OR A.RecTStamp=B.XTStamp)
which as @byrdzeye pointed out in a comment (that has since disappeared) is a form of cross-join. Note that substituting LEFT OUTER JOIN
for INNER JOIN
in the code above appears to make no impact in the quantity or identity of the lines returned. I also can't seem to leave off the ON clause or say ON (1=1)
. Just using a comma to join (rather than INNER
or LEFT OUTER
JOIN
) results in Count(select * from A) * Count(select * from B)
rows returned in this query, rather than just one line per table A, as the (A<>B OR A=B) explicit JOIN
returns. This is clearly not suitable. FIRST
doesn't seem to be available to use given a compound primary key type.
The Second JOIN
style, although arguably more legible, suffers from being slower. This may be because an additional two inner JOIN
s are required against the larger table as well as the two CROSS JOIN
s found in both options.
Aside: Replacing the IIF
clause with MIN
/MAX
appears to return the same number of entries.
MAX(-(B.XTStamp<=A.RecTStamp)*B.XTStamp)
works for the "Before" (MAX
) timestamp, but doesn't work directly for the "After" (MIN
) as follows:
MIN(-(B.XTStamp>A.RecTStamp)*B.XTStamp)
because the minimum is always 0 for the FALSE
condition. This 0 is less than any post-epoch DOUBLE
(which a DateTime
field is a subset of in Access and that this calculation transforms the field into). The IIF
and MIN
/MAX
methods The alternates proposed for the AfterXTStamp value work because division by zero (FALSE
) generates null values, which the aggregate functions MIN and MAX skip over.
Next Steps
Taking this further, I wish to find the timestamps in the second table that directly flank the timestamps in the first table and perform a linear interpolation of the data values from the second table based on the time distance to those points (i.e. if the timestamp from the first table is 25% of the way between the "before" and "after", I would like 25% of the calculated value to come from 2nd table value data associated with the "after" point and 75% from the "before"). Using the revised join type as part of the inner guts, and after the suggested answers below I produce…
SELECT
AvgGPS.XmitID,
StrDateIso8601Msec(AvgGPS.RecTStamp) AS RecTStamp_ms,
-- StrDateIso8601MSec is a VBA function returning a TEXT string in yyyy-mm-dd hh:nn:ss.lll format
AvgGPS.ReceivID,
RD.Receiver_Location_Description,
RD.Lat AS Receiver_Lat,
RD.Lon AS Receiver_Lon,
AvgGPS.Before_Lat * (1 - AvgGPS.AfterWeight) + AvgGPS.After_Lat * AvgGPS.AfterWeight AS Xmit_Lat,
AvgGPS.Before_Lon * (1 - AvgGPS.AfterWeight) + AvgGPS.After_Lon * AvgGPS.AfterWeight AS Xmit_Lon,
AvgGPS.RecTStamp AS RecTStamp_basic
FROM ( SELECT
AfterTimestampID.RecTStamp,
AfterTimestampID.XmitID,
AfterTimestampID.ReceivID,
GPSBefore.BeforeXTStamp,
GPSBefore.Latitude AS Before_Lat,
GPSBefore.Longitude AS Before_Lon,
GPSAfter.AfterXTStamp,
GPSAfter.Latitude AS After_Lat,
GPSAfter.Longitude AS After_Lon,
( (AfterTimestampID.RecTStamp - GPSBefore.XTStamp) / (GPSAfter.XTStamp - GPSBefore.XTStamp) ) AS AfterWeight
FROM (
(SELECT
ReceiverRecord.RecTStamp,
ReceiverRecord.ReceivID,
ReceiverRecord.XmitID,
(SELECT TOP 1 XmitGPS.X_ID FROM SecondTable as XmitGPS WHERE ReceiverRecord.RecTStamp < XmitGPS.XTStamp ORDER BY XmitGPS.X_ID) AS AfterXmit_ID
FROM FirstTable AS ReceiverRecord
-- WHERE ReceiverRecord.XmitID IN (select XmitID from ValidXmitters)
GROUP BY RecTStamp, ReceivID, XmitID
) AS AfterTimestampID INNER JOIN SecondTable AS GPSAfter ON AfterTimestampID.AfterXmit_ID = GPSAfter.X_ID
) INNER JOIN SecondTable AS GPSBefore ON AfterTimestampID.AfterXmit_ID = GPSBefore.X_ID + 1
) AS AvgGPS INNER JOIN ReceiverDetails AS RD ON (AvgGPS.ReceivID = RD.ReceivID) AND (AvgGPS.RecTStamp BETWEEN RD.Beginning AND RD.Ending)
ORDER BY AvgGPS.RecTStamp, AvgGPS.ReceivID;
…which returns 152928 records, conforming (at least approximately) to the final number of expected records. Run time is probably 5-10 minutes on my i7-4790, 16GB RAM, no SSD, Win 8.1 Pro system.
Reference 1: MS Access Can Handle Millisecond Time Values–Really and accompanying source file [08080011.txt]
Best Answer
I must first compliment you on your courage to do something like this with an Access DB, which from my experience is very difficult to do anything SQL-like. Anyways, on to the review.
First join
Your
IIF
field selections might benefit from using a Switch statement instead. It seems to be sometimes the case, especially with things SQL, that aSWITCH
(more commonly known asCASE
in typical SQL) is quite fast when just making simple comparisons in the body of aSELECT
. The syntax in your case would be almost identical, although a switch can be expanded to cover a large chunk of comparisons in one field. Something to consider.A switch can also help readability, in larger statements. In context:
As for the join itself, I think
(A.RecTStamp<>B.XTStamp OR A.RecTStamp=B.XTStamp)
is about as good as you're going to get, given what you are trying to do. It's not that fast, but I wouldn't expect it to be either.Second join
You said this is slower. It's also less readable from a code standpoint. Given equally satisfactory result sets between 1 and 2, I'd say go for 1. At least it's obvious what you are trying to do that way. Subqueries are often not very fast (though often unavoidable) especially in this case you are throwing in an extra join in each, which must certainly complicate the execution plan.
One remark, I saw that you used old ANSI-89 join syntax. It's best to avoid that, the performance will be same or better with the more modern join syntax, and they are less ambiguous or easier to read, harder to make mistakes.
Naming things
I think the way your things are named is unhelpful at best, and cryptic at worst.
A, B, A1, B1
etc. as table aliases I think could be better. Also, I think the field names are not very good, but I realize you may not have control over this. I will just quickly quote The Codeless Code on the topic of naming things, and leave it at that..."Next steps" query
I couldn't make much sense of it how it was written, I had to take it to a text editor and do some style changes to make it more readable. I know Access' SQL editor is beyond clunky, so I usually write my queries in a good editor like Notepad++ or Sublime Text. Some of the stylistic changes I applied to make it more readable:
So as it turns out, this is a very complicated query indeed. To make sense of it, I have to start from the innermost query, your
ID
data set, which I understand is the same as your First Join. It returns the IDs and timestamps of the devices where the before/after timestamps are the closest, within the subset of devices you are interested in. So instead ofID
why not call itClosestTimestampID
.Your
Det
join is used only once:The rest of the time, it only joins the values you already have from
ClosestTimestampID
. So instead we should be able to just do this:Maybe not be a huge performance gain, but anything we can do to help the poor Jet DB optimizer will help!
I can't shake the feeling that the calculations/algorithm for
BeforeWeight
andAfterWeight
which you use to interpolate could be done better, but unfortunately I'm not very good with those.One suggestion to avoid crashing (although it's not ideal depending on your application) would be to break out your nested subqueries into tables of their own and update those when needed. I'm not sure how often you need your source data to be refreshed, but if it is not too often you might think of writing some VBA code to schedule an update of the tables and derived tables, and just leave your outermost query to pull from those tables instead of the original source. Just a thought, like I said not ideal but given the tool you may not have a choice.
Everything together: