PostgreSQL Data Synchronization – Keeping Bank Transactions in Sync

data synchronizationpostgresql

I'm counting on your expertise in this one, I'm developing a scrapper for banking transactions history.
My problem is, scrapping should be done regularly and the bank doesn't supply any unique identification field for each transaction, thus I need to find a way of not importing already imported data..
I can't greedly import if not exists because outliers might ruin the reliability of the application:

1) "01-01-2014", "ABCDEFG", "+100,00", "+1024,56"
2) "01-01-2014", "ABCDEFG", "-57,00", "+967,56"
3) "01-01-2014", "ABCDEFG", "-43,00", "+924,56"
4) "01-01-2014", "ABCDEFG", "+100",   "+1024,56"

Here the 1st and 4th are different operations, that should both be imported into the db. I'm probably looking for a solution that takes into account the whole imported data sequence.
Any ideas on how to implement this, possibly efficiently?! I'm using postgresql, here's my table's DDL's:

CREATE TABLE "extrato" (
    "extrato_id"  VARCHAR(4) PRIMARY KEY,
    "linha"    INTEGER NOT NULL
);

CREATE SEQUENCE linha_extrato_seq;

CREATE TABLE "linha_extrato" (
    "linha_id" INTEGER PRIMARY KEY DEFAULT nextval('linha_extrato_seq'),
    "dt_mov" DATE NOT NULL,
    "dt_val" DATE NOT NULL,
    "descricao" VARCHAR(60) NOT NULL,
    "quantia" NUMERIC(10,2) NOT NULL,
    "saldo" NUMERIC(10,2) NOT NULL,
);
ALTER SEQUENCE linha_extrato_seq OWNED BY linha_extrato.id;

As Craig Ringer said the lack of a timestamp field or unique identifications might ruin the reliability of the application, so I'm doing the best i can to lessen the impact of the "incomplete" data I have.
An important detail is that the synchronization process is to be done frequently, say the last sync was in day 1, then the second sync will take into account the transactions from day 1 and forth, thus overlaps if existent will always happen in the last synced date.

Thus far I tough about the following procedure:

FOR first_row to last_row in to_insert
    IF !exists(cur_row)
        insert([cur_row,remaining_rows]);
        break;
    END IF
END FOR

Any way this can be implemented in PL/pgSQL? can i process the set of inserts as a unique operation?

Best Answer

It probably isn't possible to do this robustly.

If your previous sequence ends in:

"01-01-2014", "ABCDEFG", "+100,00", "+1024,56"

and the next sequence begins with:

"01-01-2014", "ABCDEFG", "+100",   "+1024,56"

is that the same transaction as before? Or is the data really:

"01-01-2014", "ABCDEFG", "+100",   "+1024,56"
"01-01-2014", "ABCDEFG", "+100",   "+1024,56"

so you should import two transactions?

Without a timestamp, some useful unique transaction identifier, or the ability to know for sure what point you've already imported up to (say, transactions being split into discrete and non-overlapping monthly statements) I don't think you can do this reliably.