I have two tables with 300+ columns and 20k+ rows each. Each row is identified by a primary key that both tables share. I have to find the identical rows, the rows with changes, and the rows without a corresponding match in the second table (i.e. there is a row with primary id = x in table 1 and no row with primary id = x in table 2). Given this large data set, can somebody suggest an "efficient" method to solve this?
Efficient matching rows across two tables with huge data set
optimization, oracle
Related Solutions
Just saw the update: a 60-column table with mostly VARCHAR(2k) fields -- that is (potentially) a monster table.
First things first...
You have to understand your bottleneck FIRST. On the app side, go all the way back to your single-threaded batch-insert solution (1k/2k/3k rows at a time), start it running, then log in to the DB machine and run 'top' -- see how much CPU time the DB process is taking AND how much (if any) wa% (I/O wait) time the machine is showing.
If top is showing you ANY wa% time, your DB is I/O bound, and you likely need to consider multiple DB machines (shards) or putting SSDs in the host machine.
That's it; your research stops here. It doesn't matter how much CPU the DB was taking or how saturated your app client was; if you are hitting I/O latency issues on the host DB, that is as fast as it will EVER go for you.
TIP: If hardware changes are out of the question, then depending on the filesystem you are running (on Linux) you can try disabling journaling or metadata writes (e.g. atime updates) on the DB volume to slightly improve performance at the filesystem level. You can do something similar on NTFS, but this will only give you a small boost -- it won't be 2x.
Now, second things second...
Let's say you had next to no wa% time but your CPU is pegged fully by the DB process. Your only option now is to introduce more DB machines (shards) and split the work.
Again, you're done with your research if this is the case. Nothing you can do to tweak the CPU to go faster.
Lastly, third things... third...
Let's say the DB isn't doing much of anything. Then, go to the client machine running the batch insert and check the CPU load -- is it pegged? If so, fire up some more machines doing the exact same batch inserts and see if you can get a linear ramp.
If the CPU isn't pegged, fire up some more threads on the same machine until it is pegged and see how the DB scales.
I think you may have already tried that, so my guess is that either your client host was already pegged (and more threads won't make a difference) or the DB was already hitting its limit and can't scale any further.
Addendum
Doing raw inserts on an unindexed table that has no garbage in it is essentially an APPEND operation which should be going as fast as the disk can handle the writes.
Creating more tables on the same host machine isn't going to help; if anything, it will increase your disk seeks (to get to the other tables on disk to append to) and will slow things down.
It is critical to figure out that bottleneck first, then we can optimize the hell out of it.
Hope that helps! Keep us posted.
Instead of using so many JOINs to get the result, you might get better performance pivoting the data using an aggregate function with a CASE expression. Oracle 10g doesn't have a PIVOT function so you'd have to use this type of query (aggregate/CASE) if you aren't going to use multiple JOINs on your table.
Since you have a limited number of questions (7), you can easily hard-code the query:
select
  id,
  name,
  max(case when question = 1 then Answer end) Q1_Answer,
  max(case when question = 1 then Reason end) Q1_Reason,
  max(case when question = 2 then Answer end) Q2_Answer,
  max(case when question = 2 then Reason end) Q2_Reason,
  max(case when question = 3 then Answer end) Q3_Answer,
  max(case when question = 3 then Reason end) Q3_Reason,
  max(case when question = 4 then Answer end) Q4_Answer,
  max(case when question = 4 then Reason end) Q4_Reason,
  max(case when question = 5 then Answer end) Q5_Answer,
  max(case when question = 5 then Reason end) Q5_Reason,
  max(case when question = 6 then Answer end) Q6_Answer,
  max(case when question = 6 then Reason end) Q6_Reason,
  max(case when question = 7 then Answer end) Q7_Answer,
  max(case when question = 7 then Reason end) Q7_Reason
from tablename
group by id, name;
Best Answer
Without making any schema change, your solution requires a table scan on both sides -- that's the most efficient you'll be able to do. You simply have to LEFT JOIN A to B and compare every field. However, if you expect to do this again in the future, you can do a couple of things to make life easier: add a computed hash column over the non-key fields, which will almost always detect a change -- then you compare this one column instead of all 300 to identify changes. Add an index to each table on the PK field(s) plus this hash column, and you can very quickly find new and changed records.
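A minimal sketch of the hash-column approach, assuming Oracle 12c+ for STANDARD_HASH and virtual columns, with hypothetical names (t1, t2, pk_id, and col1..col3 standing in for the full column list):

```sql
-- Add a virtual hash column computed over the non-key columns.
-- The '|' delimiter prevents 'ab'||'c' and 'a'||'bc' from colliding;
-- with 300 wide VARCHAR columns you may exceed VARCHAR2 concatenation
-- limits and need to hash in chunks or hash NVL'd column groups.
ALTER TABLE t1 ADD (row_hash AS (
  STANDARD_HASH(col1 || '|' || col2 || '|' || col3, 'SHA256')
) VIRTUAL);
ALTER TABLE t2 ADD (row_hash AS (
  STANDARD_HASH(col1 || '|' || col2 || '|' || col3, 'SHA256')
) VIRTUAL);

CREATE INDEX t1_pk_hash ON t1 (pk_id, row_hash);
CREATE INDEX t2_pk_hash ON t2 (pk_id, row_hash);

-- One pass then classifies every row: identical, changed, or unmatched.
SELECT COALESCE(a.pk_id, b.pk_id) AS pk_id,
       CASE
         WHEN a.pk_id IS NULL            THEN 'ONLY_IN_T2'
         WHEN b.pk_id IS NULL            THEN 'ONLY_IN_T1'
         WHEN a.row_hash = b.row_hash    THEN 'IDENTICAL'
         ELSE 'CHANGED'
       END AS status
FROM t1 a
FULL OUTER JOIN t2 b ON a.pk_id = b.pk_id;
```

Note that NULL columns need care (NVL each column before concatenating, or a NULL anywhere makes the whole hash input NULL), and a SHA-256 match is effectively conclusive, so the full-column compare is only needed if you choose a weaker hash.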
Microsoft SQL Server offers a ROWVERSION construct which makes this type of operation simple; the closest equivalent in Oracle looks to be system change numbers: https://stackoverflow.com/a/20488598/565869. If for any reason that doesn't work for you, you could always roll your own using triggers or by modifying the sprocs which update these tables (you are wrapping your UPDATEs in sprocs, right?).
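If you roll your own with triggers -- say, on an older release such as 10g, which lacks virtual columns and STANDARD_HASH -- a row-level trigger can keep a stored hash current. A sketch with hypothetical names (t1, col1..col3) using ORA_HASH, which does exist in 10g:

```sql
-- Stored (not virtual) hash column, maintained by trigger.
ALTER TABLE t1 ADD (row_hash NUMBER);

CREATE OR REPLACE TRIGGER t1_row_hash_trg
BEFORE INSERT OR UPDATE ON t1
FOR EACH ROW
BEGIN
  -- ORA_HASH returns a 32-bit hash, so collisions are rare but
  -- possible: confirm a hash match with a full-column compare if
  -- false "identical" results matter. NVL guards against a NULL
  -- column nulling out the whole concatenation.
  :NEW.row_hash := ORA_HASH(NVL(:NEW.col1, '~') || '|' ||
                            NVL(:NEW.col2, '~') || '|' ||
                            NVL(:NEW.col3, '~'));
END;
/
```

After backfilling row_hash on existing rows with a one-time UPDATE, the comparison query only needs to join on the PK and compare the single hash column.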