How to prevent Pentaho DI/Spoon from deadlocking

deadlocketlpentaho

I have a transformation with four steps:

  1. Read a pipe-delimited text (a text file input step)
  2. Write to a staging table, with truncation (import table)
  3. Run a stored procedure against the table populated in #2
  4. Run a stored procedure against the table modified in #3

These have hops between them: 1 to 2 to 3 to 4.

It consistently hangs (tested several times, waiting up to 1.6 hours); that is, Pentaho reports starting the job in the Logging window, then nothing more. The GUI itself does not hang, in fact the Stop button disables after a moment, but the transformation does not proceed.

It appears to be deadlocking: checking for blocked sessions on the target database (PostgreSQL), I see two sessions, one for the stored procedure in step #3 (which shouldn't have started yet), which is blocked by a session which is trying to truncate the table in step #2. The TRUNCATE never completes (though Postgres does not report it as blocked), so no records get written and step #3 cannot run. If I kill either session, the transformation fails, with no records written; the sprocs in steps #3 or #4 sometimes run, but with no new data on which to operate.

If I disable the hop from 2 to 3, the job runs fine, and it takes just a couple of minutes.

Why is the sproc in step #3 trying to run when it has a dependency on step #2? Does a Pentaho "hop" not mean "run A, wait for it to end, and then run B"? What can I do to prevent the deadlock situation, or have I misread what's going on?

Best Answer

Noob error! Pentaho "hops" are not sequential; they will all run simultaneously.

The correct way to handle this type of process - read data, write data, and then manipulate the data in SQL - is to put the read/write in a transformation by itself, and then put that transformation into a job with the SQL procedure(s) following the transformation.