Sql-server – Join Tables of Varying Criteria in Order of Priority

join;sql serversql-server-2016

I am trying to join two tables based on varying criteria in order of priority.

The situation

I have inherited a clinical database which has inherently bad data quality from years of poor input and shockingly bad application build etc… all out of my control but I am tasked with creating a data warehouse from this system.

I have a table which has diary appointments in them

The structure is

(Appointments)

Appointment_ID (PK),
Appointment_Date,
Appointment_User (FK),
Appointment_Patient_ID (FK),
Appointment_Outcome

Events

When a patient is seen in an appointment there are events that occur..

The events table contains the following information

Event_ID (PK),
Event_Date,
Appointment_ID (FK),
Patient_ID (FK),
User_ID (FK),
Event_Type_ID (FK),
Event_Outcome

The poor design of the application allows:

1) Users create an event without linking it to an appointment.

2) It also allows someone to complete an event on behalf of a clinician (medical secretary) whilst impersonating them – however in the database it records the medical secretary's ID instead of the clinicians ID, but keeps the clinicians ID in the appointment table.

The poor data quality exists where:

1) The user creates the event on the wrong date (so no appointment_ID and date is wrong).

So whilst the event should contain the appointment_ID, user_ID and be on the same date as the appointment it isn't always.

also

A patient can have more than one appointment on a day with and with more than one clinician (as defined by the user_ID).

The ideal would be to get the data quality fixed but the organisation can't do this in time scale (there are people going in and fixing all the errors but this won't be finished for another couple of months) The organisation has agreed to best match on 5 sets of criteria in a certain order of priority and accept the risk that there will be some inaccuracies.

The required outcome

I need to create a table which shows the following information

All data from Appointments (a) on the left
All rows from all columns (where there is a match) on the right from Events (e)

The join should occur in the order of priority:

1) Join e on a.appointment_ID = e.appointment_ID

if there is no match then

2) Join e on a.patient_ID = e.Patient_ID AND a.date = e.date AND a.user_ID = e.user_ID AND e.event ID not found in the results from join 1

if there is no match then

3) Join e on a.patient_ID = e.Patient_ID AND a.date = e.date AND e.event ID not found in the results from join 1 or join 2

if there is no match then

4) Join e on a.patient_ID = e.Patient_ID AND a.date – e.date +/- 5 days AND a.user_ID = e.user_ID AND e.event ID not found in the results from join 1, 2 or 3

5) Join e on a.patient_ID = e.Patient_ID AND a.date – e.date +/- 5 days AND e.event ID not found in the results from join 1, 2, 3 or 4

I tried doing nested not IN statements on the join but as the data is in the GB and records are in the millions of rows per table it's taking half an hour to run the query and this needs to be updated on demand as people are checking the impact of the data quality amendments as they go.

Is there a resource and query run time efficient way of running the join on these 5 sets of criteria in order of priority?

Best Answer

This would be better accomplished through a series of inserts. Create a temporary table that contains all of the columns from table Appointments and table Events.

Insert into the temp table using each of your join predicates in the order of importance that you have listed above. For each insert after the first one (the one joined on appointment_id), use the following predicates:

appointment_id not in (select appointment_id from #yourTempTable) and
   event_id not in (select event_id from #yourTempTable)

Since your goal is to have all appointments listed, do a final insert for everything from Appointments that has:

appointment_id not in (select appointment_id from #yourTempTable)

Then you will want to run a report showing what events were not matched up to an appointment by searching all event_ids from Events that do not exist in #yourTempTable.