For each row, find the next closest value in another table (Teradata)

teradata

I've seen a few close examples here but I'm not quite getting something close enough for me to noodle out a fix.

I have events, where trucks come in and out of a gate and I have two tables. The first is all ingate events the second is all outgate events. The problem is conceptually easy: For each ingate, find the nearest outgate. The dilemma I have is that obviously trucks can come back. So I might have 300 ingates for a unit on the ingate table with a corresponding 300 outgates.

Both tables look like this, except the other is session.Outgate

session.Ingate:

CHAS_INIT|  CHAS_NR   |     Ingate_TIME
---------+------------+---------------------
  MAEC   |   66862    | 2018-01-29-01.34.00
  MAEC   |   67218    | 2018-03-24-01.29.00
  MAEC   |   67557    | 2018-03-30-01.26.00
  MAEC   |   67557    | 2018-04-06-01.49.00
  MAEC   |   68773    | 2018-02-22-01.47.00

While this is a breeze to do in a spreadsheeting tool, I'm stumped on if it can be done during a join. Any helps is greatly appreciated.

Best Answer

select CHAS_INIT, CHAS_NR, Ingate_TIME,
   -- find the previous IN-time
   last_value(case when inout = 'i' then Ingate_TIME end ignore nulls)
   over (partition by CHAS_INIT, CHAS_NR -- does this determine a truck?
         order by Ingate_TIME)
from
 (  -- combine both in & out into a single result
    select 'i' as inout, CHAS_INIT, CHAS_NR, Ingate_TIME
    from session.Ingate
    union all 
    select 'o' as inout, CHAS_INIT, CHAS_NR, Ingate_TIME
    from session.Outgate
 ) as dt
-- only return OUT rows
qualify inout = 'o'

Related Solutions

Teradata – Driving Responsive Web Application

as I said in my answer to this question, Teradata wouldn't be my first choice for a web backend: it really shines on moving/elaborating large quantities of data, but the typical workload of an application (frequent small queries) is not the best scenario for it.

Obviously there are other voices (see here) but there is quite an agreement about some critical points:

Teradata locking behavior is the primary challenge for an OLTP style application.

and:

use as much cache as you can: the Teradata connection is a precious resource you'd better not use for information that doesn't need to be realtime

It is clearly not impossible to implement such a solution on Teradata, but it requires your application to adapt fully to Teradata peculiarities (e.g. you could have to create a non-trivial application data model, to strive for single-AMP queries)

An example of the challenges you're going to face:

in an application I had to update a flag field on a set of rows in a table, chosen with a condition that was not the Teradata "primary index". However, there was an index on the columns used in the condition.

Usually, I would have used a simple "update": however, in Teradata, this would have created a "table-level lock". At the end it turned out it was "better" to select all the rows in the application and do all the updates (in batch) one by one by primary index, to use the "hash-level" locks.

=========================================================

Edited: after the additional info from OP

First the good news:

Mostly read-only since it is a mostly "read only" application, you can setup an efficient caching system. You could also "pre-cache" some common content so that the hit of first querying it won't be apparent to the user.

Why that's important in this case: connections in Teradata can't be created over a specific number for appliance. That's a pool wìthat will be shared between users in the application, the ETL and the normal OLAP users. In addition, queries that are not single-AMP won't have a very good latency, and they'll be more subject to the instan load of the system

Fine-grained queries

It might be a good thing if in this way you're able to do mostly single-AMP queries.

Why that's important in this case: single-AMP queries (the ones querying/insert/deleting using the primary key as criteria) have lower lock impact and can be managed without paying the cost of coordination between the AMPs. On the other side, having such a "chatty" protocol wil occupy more the connections.

Materialized tables

It's good that you are already thinking about materializing some data. You might even find out that it will be useful have the same data materialized in two or more tables with different primary indexes, to satisfy your user interface (e.g. if you have different selection criteria on the same entity). However this will have a cost in terms of ETL time and disk space.

Now, the bad news:

More than 1000 users As the application is not "completely" read-only, this number of users will probably create access conflict in the cases when they have to write.

Why that's a problem: Forget multi-versioning: here readers block writers and viceversa. Your select might stop the update of another user from completing. If you are not able to use primary index for every update you do, you'll be locking the full table, and even if you use primary index, you are not going to have a row-level lock (you have an "hash-level".. normally it works ok, but it can have some issues with batch inserts). There's a workaround this conflict betwen readers and writers (LOCK ROW FOR ACCESS) but it opens the door to dirty reads, and you should carefully think if you can accept this in your application

Web application and OLAP on the same machine

Independently from the underlying technology, it's always a challenge to accomodate these two different workload on the same platform.

Why that's a problem: Teradata has some workload management rules that can help in this field, however they need a careful tuning from DBAs and, however, extraordinary load (e.g. end of month batches) will likely have an impact on the user experience of the application

Hourly ETL

The hourly ETL will be a stress point for the application, since you'll be loading the same tables that you are querying. Since you cannot afford to lock the readers, you'll probably resort to "dirty reads", but you'll have to find some policy to manage the fact that the application could have temporarily inconsistent data.

Why that's a problem: Already said: table locks, dirty reads and the presence of two different workloads

Number of queries

I know I said that "finer grained queries" can be a good thing, but here we're talking of 100 queries per user for a page, and a minimum of 20 queries per user action. Now, if you execute them serially, you're increasing latency (and you're looking for sub-second response time); but if you parallelize them, you're occupying the equivalent of the maximum sessions of a Parsing Engine just for an user (I'm not considering caching here)

Disclaimer: I've been working last two years in more traditional Teradata projects, but when OLTP and raw latency counted I don't remember figures to go a lot less than the 150ms you mention. Maybe 130ms. But that was after optimizations and configuring the system to accommodate our workload, and just on the simpler queries/macros.

But, then, if you already have an instance, and you want to verify Teradata performance on this scenario, I'd suggest you to ask Teradata Sales support for a "Proof of Concept" on your environment. I think they'll gladly accept (the more load you have, the more likely you'll need additional nodes).

The reasons I wanted to know so many things about your application is that, at certain conditions (mostly read data, few active users, clearly defined batch and online windows), it is easy to create applications on Teradata (as on every other system). When some of these constraints are lifted, the lock architecture and the necessity to target single-AMP queries, makes it really difficult (not impossible, but really difficult) to achieve satisfying performances.

How to select the not null value of a column when it has both null and not null value in teradata

What I did to solve this was:

CREATE TABLE x 
(
  id SMALLINT NOT NULL, 
  hcc SMALLINT NULL, 
  d_cd SMALLINT NOT NULL
);

and populate it:

INSERT INTO x VALUES
(1,    null,   1),
(1,    5   ,  1),
(2,    null,  2),
(2,    10,     2),
(2,    20,     3),
(3,    null,   3);

And then run this query:

WITH y AS
(
  SELECT id, hcc
  FROM x
  WHERE hcc IS NOT NULL
),
z AS
(
  SELECT id, hcc
  FROM x
  WHERE hcc IS NULL AND id NOT IN (SELECT id FROM y)
)
SELECT id, hcc FROM y
UNION
SELECT id, hcc FROM z
ORDER BY id;

Result:

As desired!

In future, could you please provide a fiddle with your table DDL and DML! This should work with Teradata - I double-checked and it does support Common Table Expressions!