Sql-server – Query “last” records in a table with non-sequential key fields

etloraclesql server

My goal is to clone a table from an Oracle database to a SQL Server database every midnight. I am using Talend Data Integration for this task.

The problem is that this table has around 70 million records (~100GB of data, including indexes), and I can't just drop it then load it again, because it is too resource intensive (especially because other tasks must be done).

The obvious solution would be an incremental update: I query the newly inserted records from Oracle and insert them into SQL Server.

So far, so good.

But the table I am querying from in Oracle has no ordering at all. Its unique key is composed of four fields that can receive any value, and I have no control over them. I must ensure that only "new" records are returned, but there is not a field (or set of fields) that I can use to sort the table and 'force' a sequential ordering. To be clear: none of the columns stores any kind of CreationDateTime datum.

I have only SELECT permissions on the Oracle database, so I can't make any changes, for example to flag records I've already imported.

I can connect to both databases from the same computer.

Querying the destination database for each record is also very resource intensive. Network and disk overheads are a problem, but they are not my main concern here.

Has anyone faced this problem before? Any idea how to solve it? Maybe there is no solution for this problem, but I was expecting some sort of 'trick': some hidden Oracle column, internal procedure, or other dark magic…?

My last resort is to make some assumptions to import, like "assume these records are ordered and import, then check consistency of some well chosen samples, if inconsistent, run full import again". This is very far from ideal, but at least I will not run a full import every time.

Best Answer

You can use ORA_ROWSCN pseudocolumn to find rows that have been changed since last update. By default it tracks changes by database block, not by row, but it can be used to limit the amount of data transferred between databases.

I would suggest a following procedure:

  1. Find the current Oracle SCN (System Change Number)

    select dbms_flashback.get_system_change_number from dual;

  2. Get rows that have been changed since last update

    select * from [source_table] where ora_rowscn > [last scn] and ora_rowscn <= [current scn]

  3. Resolve possible duplicates in your target database

  4. Store the [current scn] to target database so it can be used as [last scn] during the next iteration