PostgreSQL Data Synchronization – Keeping Bank Transactions in Sync

data-synchronization, postgresql

I'm counting on your expertise for this one. I'm developing a scraper for banking transaction history.
My problem is that scraping has to be done regularly and the bank doesn't supply a unique identifier for each transaction, so I need a way to avoid importing data that has already been imported.
I can't naively "insert if not exists", because rows that look identical may be distinct transactions, and skipping them would ruin the reliability of the application:

1) "01-01-2014", "ABCDEFG", "+100,00", "+1024,56"
2) "01-01-2014", "ABCDEFG", "-57,00", "+967,56"
3) "01-01-2014", "ABCDEFG", "-43,00", "+924,56"
4) "01-01-2014", "ABCDEFG", "+100",   "+1024,56"

Here the 1st and 4th rows are different operations, and both should be imported into the database. I'm probably looking for a solution that takes the whole imported data sequence into account.
Any ideas on how to implement this, ideally efficiently? I'm using PostgreSQL; here is the DDL for my tables:

CREATE TABLE "extrato" (
    "extrato_id"  VARCHAR(4) PRIMARY KEY,
    "linha"    INTEGER NOT NULL
);

CREATE SEQUENCE linha_extrato_seq;

CREATE TABLE "linha_extrato" (
    "linha_id" INTEGER PRIMARY KEY DEFAULT nextval('linha_extrato_seq'),
    "dt_mov" DATE NOT NULL,
    "dt_val" DATE NOT NULL,
    "descricao" VARCHAR(60) NOT NULL,
    "quantia" NUMERIC(10,2) NOT NULL,
    "saldo" NUMERIC(10,2) NOT NULL
);
ALTER SEQUENCE linha_extrato_seq OWNED BY linha_extrato.linha_id;

As Craig Ringer said, the lack of a timestamp field or unique identifier might ruin the reliability of the application, so I'm doing the best I can to lessen the impact of the "incomplete" data I have.
An important detail is that synchronization runs frequently. Say the last sync was on day 1: the next sync will then cover the transactions from day 1 onward, so any overlap will always fall on the last synced date.

So far I have thought of the following procedure:

FOR first_row to last_row in to_insert
    IF !exists(cur_row)
        insert([cur_row,remaining_rows]);
        break;
    END IF
END FOR

Is there any way to implement this in PL/pgSQL? Can I process the set of inserts as a single operation?
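To make the idea of "taking the whole sequence into account" concrete, here is a minimal Python sketch of a variant of the procedure above. Instead of testing each row for existence on its own, it looks for the longest run at the end of the stored rows that matches the start of the freshly scraped rows, and treats everything after that run as new. The function name `rows_to_insert`, the tuple layout, and the assumption that both lists are in statement order (oldest first) are illustrative choices, not something the bank or PostgreSQL provides.

```python
from datetime import date
from decimal import Decimal


def rows_to_insert(stored, scraped):
    """Return the rows of `scraped` that follow its overlap with `stored`.

    Both lists hold tuples in statement order (oldest first). The overlap is
    the longest suffix of `stored` that equals a prefix of `scraped`.
    """
    for k in range(min(len(stored), len(scraped)), 0, -1):
        if stored[-k:] == scraped[:k]:
            return scraped[k:]
    return list(scraped)  # no overlap found: treat every row as new


# The four sample rows from the question, with amounts parsed to Decimal.
day = date(2014, 1, 1)
r1 = (day, "ABCDEFG", Decimal("100.00"), Decimal("1024.56"))
r2 = (day, "ABCDEFG", Decimal("-57.00"), Decimal("967.56"))
r3 = (day, "ABCDEFG", Decimal("-43.00"), Decimal("924.56"))
r4 = (day, "ABCDEFG", Decimal("100"), Decimal("1024.56"))  # same values as r1

already_stored = [r1, r2, r3]   # hypothetical result of the previous sync
new_scrape = [r2, r3, r4]       # hypothetical result of the next sync
print(rows_to_insert(already_stored, new_scrape))
```

In this example the 4th row is returned for insertion even though a row with the same values is already stored, because it lies after the matched run. The sketch cannot tell a repeated row from a genuinely new identical one when the overlap itself is ambiguous.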

Best Answer

It probably isn't possible to do this robustly.

If your previous sequence ends in:

"01-01-2014", "ABCDEFG", "+100,00", "+1024,56"

and the next sequence begins with:

"01-01-2014", "ABCDEFG", "+100",   "+1024,56"

is that the same transaction as before? Or is the data really:

"01-01-2014", "ABCDEFG", "+100",   "+1024,56"
"01-01-2014", "ABCDEFG", "+100",   "+1024,56"

so you should import two transactions?

Without a timestamp, some useful unique transaction identifier, or the ability to know for sure what point you've already imported up to (say, transactions being split into discrete and non-overlapping monthly statements), I don't think you can do this reliably.
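One way to approximate "discrete and non-overlapping" batches is to import only whole days that are already closed. The Python sketch below keeps rows dated strictly after the last fully imported day and strictly before the day of the scrape. It assumes the bank never adds or changes transactions on a day once that day has passed, which may not hold (for example with backdated postings); the function name `closed_day_rows` and the tuple layout with the movement date first are hypothetical.

```python
from datetime import date


def closed_day_rows(scraped, last_imported_day, scrape_day):
    """Keep rows whose date lies strictly between the two given days.

    Assumption: days before `scrape_day` are complete and will not change,
    so each day is imported exactly once and batches never overlap.
    """
    return [row for row in scraped if last_imported_day < row[0] < scrape_day]


scraped = [
    (date(2014, 1, 1), "ABCDEFG", "+100,00"),  # already imported earlier
    (date(2014, 1, 2), "ABCDEFG", "-57,00"),   # closed day, safe to import
    (date(2014, 1, 3), "ABCDEFG", "-43,00"),   # scrape day, may still grow
]
# Keeps only the 2014-01-02 row.
print(closed_day_rows(scraped, date(2014, 1, 1), date(2014, 1, 3)))
```

The trade-off is latency: transactions from the current day only appear after the day has closed.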

Related Solutions

PostgreSQL – How to Delete Duplicate Records Efficiently

Notes

The current version of the 9.3 major release is 9.3.6. The project recommends that ... all users run the latest available minor release for whatever major version is in use.
A multicolumn index on (vendor, sku, effective_date, id) would be perfect for this - in this particular order. But Postgres can combine indexes rather efficiently, too.
It might pay to add the otherwise irrelevant price as the last item of the index to get index-only scans out of this. You'll have to test.
Since you have concurrent deletes it may be a good idea to run a separate delete per vendor to reduce the potential for race conditions and deadlocks. Since there are only a few vendors, this seems like a reasonable partitioning. (Many tiny calls would be comparatively slow.)
I am running a separate SELECT (PERFORM in plpgsql, since we do not use the result) because the row locking clause FOR UPDATE cannot be used together with window functions. Don't let the keyword mislead you, this is not just for updates. I am locking all rows for the given vendor, since the result depends on all rows. Concurrent reads are not impaired, only concurrent writes have to wait until we are done. That's another reason why deleting rows for one vendor at a time in a separate transaction should be best.
sku is unique per product, so we can PARTITION BY it.
ORDER BY effective_date, id: your first version of the question included code for duplicate rows, so I added id to ORDER BY as additional tie breaker. This way it works for duplicates on (sku, effective_date) as well.
To preserve the last row for each set: AND (lead(id) OVER w) IS NOT NULL. Reusing the same window for lead() is cheap - independent of the added explicit WINDOW clause - that's just syntax shorthand for convenience.
I am locking rows in the same order: ORDER BY sku, effective_date, id. Make sure that concurrent DELETEs operate in the same order to avoid deadlocks. If all other transactions delete no more than a single row within the same transaction, there cannot be deadlocks and you don't need the row locking at all.
If concurrent INSERTs could lead to a different result (make different rows obsolete), you have to lock the whole table in EXCLUSIVE mode instead to avoid race conditions:
```
LOCK TABLE vendor_prices IN EXCLUSIVE MODE;
```
Do that only if it's necessary. It blocks all concurrent write access.
I am returning the number of rows deleted, but that's totally optional. You might as well return nothing and declare the function as RETURNS void.

PostgreSQL Python – Algorithm for Populating Tables

You can use a query like the one below to get all tables with their dependencies:

WITH fkeys AS (
SELECT

  c.conrelid AS table_id,
  c_fromtablens.nspname AS schemaname,
  c_fromtable.relname AS tablename,

  c.confrelid AS parent_id,
  c_totablens.nspname AS parent_schemaname,
  c_totable.relname AS parent_tablename

FROM pg_constraint c
JOIN pg_namespace n ON n.oid = c.connamespace

JOIN pg_class c_fromtable ON c_fromtable.oid = c.conrelid
JOIN pg_namespace c_fromtablens ON c_fromtablens.oid = c_fromtable.relnamespace

JOIN pg_class c_totable ON c_totable.oid = c.confrelid
JOIN pg_namespace c_totablens ON c_totablens.oid = c_totable.relnamespace
WHERE
  c.contype = 'f'
)

SELECT
  t.schemaname,
  t.tablename,
  fkeys.parent_schemaname,
  fkeys.parent_tablename

FROM pg_tables t
LEFT JOIN fkeys ON  t.schemaname = fkeys.schemaname AND 
                    t.tablename =  fkeys.tablename 
WHERE
  t.schemaname NOT IN ('pg_catalog', 'information_schema')

ORDER BY
  3 NULLS FIRST,
  4 NULLS FIRST
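Given the rows this query returns, a load order can be derived with a topological sort so that every parent table is populated before the tables that reference it. The following Python sketch uses the standard library's `graphlib` (Python 3.9+). The function name `load_order` and the sample table names are hypothetical; each input tuple mirrors the four output columns of the query, with `None` standing in for SQL NULL.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def load_order(rows):
    """Order tables so that referenced (parent) tables come first.

    `rows` holds (schemaname, tablename, parent_schemaname, parent_tablename)
    tuples; the parent fields are None for tables without foreign keys.
    """
    sorter = TopologicalSorter()
    for schema, table, parent_schema, parent_table in rows:
        child = (schema, table)
        parent = (parent_schema, parent_table)
        if parent_table is None or parent == child:
            sorter.add(child)  # no parent, or a self-reference we skip
        else:
            sorter.add(child, parent)
    # static_order() raises graphlib.CycleError if tables reference each
    # other in a cycle; such cases need separate handling.
    return list(sorter.static_order())


# Hypothetical query result.
rows = [
    ("public", "customer", None, None),
    ("public", "product", None, None),
    ("public", "orders", "public", "customer"),
    ("public", "order_item", "public", "orders"),
    ("public", "order_item", "public", "product"),
]
for schema, table in load_order(rows):
    print(f"{schema}.{table}")
```

Self-referencing foreign keys are ignored here on the assumption that such a table can be loaded in a single pass.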

The query below additionally provides foreign key details. It derives the relation type from the primary key definitions. Note that it does not check unique constraints.

WITH fkey AS (

SELECT
  n.nspname AS fkey_schema_name,
  c.conname AS fkey_name,

  c_fromtablens.nspname AS table_schema_name,
  c_fromtable.relname AS table_name,

  c_totablens.nspname AS foreign_table_schema_name,
  c_totable.relname AS foreign_table_name,

  c.conrelid,
  unnest(c.conkey) AS fkey_field_num,

  c.confrelid,
  unnest(c.confkey) AS fkey_foreign_field_num

FROM pg_constraint c
JOIN pg_namespace n ON n.oid = c.connamespace

JOIN pg_class c_fromtable ON c_fromtable.oid = c.conrelid
JOIN pg_namespace c_fromtablens ON c_fromtablens.oid = c_fromtable.relnamespace

JOIN pg_class c_totable ON c_totable.oid = c.confrelid
JOIN pg_namespace c_totablens ON c_totablens.oid = c_totable.relnamespace
WHERE
  c.contype = 'f'

),

pkey AS (

SELECT
  c.conrelid,
  unnest(c.conkey) AS conkey
FROM pg_constraint c
WHERE
  c.contype = 'p'
)

SELECT 
  fkey.fkey_schema_name,
  fkey.fkey_name,
  fkey.table_schema_name,
  fkey.table_name,
  fkey.foreign_table_schema_name,
  fkey.foreign_table_name,
  a.attname AS field_name,
  a_f.attname AS foreign_field_name,
  CASE
    WHEN pkey.conrelid IS NULL AND pkeyf.conrelid IS NULL THEN 'N-N'
    WHEN pkey.conrelid IS NOT NULL AND pkeyf.conrelid IS NOT NULL THEN '1-1'
    WHEN pkey.conrelid IS NULL AND pkeyf.conrelid IS NOT NULL THEN 'N-1'
  END AS relation_type
FROM fkey
JOIN pg_attribute a ON a.attrelid = fkey.conrelid AND a.attnum = fkey.fkey_field_num
JOIN pg_attribute a_f ON a_f.attrelid = fkey.confrelid AND a_f.attnum = fkey.fkey_foreign_field_num
LEFT JOIN pkey ON pkey.conrelid = a.attrelid AND pkey.conkey = a.attnum
LEFT JOIN pkey pkeyf ON pkeyf.conrelid = a_f.attrelid AND pkeyf.conkey = a_f.attnum
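The CASE expression in this query can be read as a small decision table over two facts: whether the referencing column is part of its table's primary key, and whether the referenced column is part of the parent's primary key. The Python sketch below mirrors that logic for illustration only; the function and parameter names are hypothetical.

```python
def relation_type(column_in_pkey, foreign_column_in_pkey):
    """Mirror the CASE expression of the query above.

    column_in_pkey:         the referencing column is part of a primary key
    foreign_column_in_pkey: the referenced column is part of a primary key
    """
    if not column_in_pkey and not foreign_column_in_pkey:
        return "N-N"
    if column_in_pkey and foreign_column_in_pkey:
        return "1-1"
    if not column_in_pkey and foreign_column_in_pkey:
        return "N-1"
    # The CASE has no branch for this combination, so the query yields NULL.
    return None


for flags in [(False, False), (True, True), (False, True), (True, False)]:
    print(flags, relation_type(*flags))
```

The last combination has no label in the query, so it comes back as NULL there and as `None` here.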
