Sql-server – Finding missing gaps of data in a table with ~2.5 Million rows

gaps-and-islandssql serversql-server-2012

I work for an economics consulting company and all of our data is housed in SQL Server 11 (2012). Our data essentially consists of a date column, a raw data column, several calculations columns and a column that has a shortcode to distinguish each set of data from the next.

We have thousands of these sets that are all loaded into the same table (~2.5 Mil rows total) and then drawn from and ordered with a query. They are generally ordered by date, starting with the first month and moving month by month until the last month. The start and end dates of each series varies from just a year's worth of time to 100+ years.

Recently, we've been having some issues where random sections of data just disappear. The entire row just up and leaves, which makes finding these missing rows somewhat of a challenge without combing through every single month to check if a month is missing, a somewhat impossible task for a 2.5 million row table.

My boss has tasked me with writing a query/stored procedure that will look through this giant table and look to see which sets having missing rows and where they are.

I've been trying to work through this problem that is a little bit above where my SQL skills are and I can't seem to find anybody that has had a similar issue anywhere on the web. I'm going to go through what I already have and maybe someone can at least tell me if I'm headed in the right direction and possibly provide some insight as to where I should go from here.

The best solution I could find on the web was using a CTE to create a temp table of dates and then compare them with the original table. This works great if I was just scanning for problems in one particular data set, but I have many data sets within the same table, all with different starting and end dates. So I went with it anyways in hopes that I could eventually expand it to search through the lot of them. Here's my code:

declare @startDate Date, @endDate Date 
set @startDate = '2000-01-01'
set @endDate = '2016-11-01'

;with GetDates As  
(  
select @startDate as TheDate
UNION ALL  
select DATEADD(MONTH,1, TheDate) from GetDates
where DATEADD(MONTH,1, TheDate) <= @endDate
)

SELECT TheDate,SHORTCODE,MonthYear 
From GetDates
LEFT OUTER JOIN VWTBL_INDICATOR
ON GetDates.TheDate=VWTBL_INDICATOR.MonthYear
AND VWTBL_INDICATOR.SHORTCODE='RMI WEST'
OPTION(MAXRECURSION 1000)

'RMI West' would be the shortcode that marks this particular data set and it is missing data from November of 2004 to March of 2005 which appear as nulls when I execute this query. That is almost exactly what I need, but for every data set that I have in the table.

How do I write this query properly? Our company contracts out a bunch of the work done on our database so I'm not really all that familiar with the guts of it. We have an upload function and when I pull the data in using that, everything is fine. But when I look at the data a week or two later, it is missing random dates. They have been looking into a solution, but we need a way to find these gaps in the meantime.

We will always be checking months. Our data almost always starts on January in a given year and ends on the current month or close to it if the data hasn't come out yet.

I have a metadata table for the shortcodes, but the expected number of months that each shortcode has is not included as we haven't had any use for this information until now. I could inquire about adding in the fields and modifying the update script to include them when the data is updated, if this would be essential in getting this to work properly.

The company I work for has historically stored all of its data on an overwhelming number of Excel spreadsheets. They have been in the process of transitioning all of this data into SQL. My job is to check it through various methods to make sure that the data on SQL Server matches up with the data in the original spreadsheet. The raw data is accessible through the spreadsheet network on our file system. When updating data, we still use Excel to update the information and then load it in SQL using the upload function that I mentioned previously. The data has been mostly clean and matching for some time now. Only within the last couple months has this data gap issue arisen.

Best Answer

There is no need to generate dates.


The following query will give you a list of SHORTCODES with no rows at all:

select SHORTCODE from shortcodes
except
select SHORTCODE from VWTBL_INDICATOR

The following query will give you the continuous ranges of MonthYear per SHORTCODE.

select      SHORTCODE
            ,min(MonthYear) as from_MonthYear
            ,max(MonthYear) as to_MonthYear
            ,count(*)       as months

from       (SELECT   SHORTCODE
                    ,MonthYear
                    ,row_number() over (partition by SHORTCODE order by MonthYear)  as rn

            From     VWTBL_INDICATOR
            ) t

group by    SHORTCODE
            ,DATEADD(month,-rn,MonthYear)   

order by    SHORTCODE
            ,from_MonthYear

If you wish you can use the following version which has an additional layer of information:

  • missing_from_MonthYear + to_MonthYear: missing range in the middle
  • ranges: Number of ranges per SHORTCODE (ranges>1 means you have gaps in the middle)
  • range_seq: the sequential number of each SHORTCODE range
  • is_first: Indication for the first range per SHORTCODE (check from_MonthYear to see if you are missing preceding dates)
  • is_last: Indication for the last range per SHORTCODE (check to_MonthYear to see if you are missing following dates)

select      SHORTCODE
           ,from_MonthYear                                                                                  as exists_from_MonthYear
           ,to_MonthYear                                                                                    as exists_to_MonthYear
           ,dateadd (day,1,to_MonthYear)                                                                    as missing_from_MonthYear
           ,dateadd (day,-1,lead (from_MonthYear) over (partition by SHORTCODE order by from_MonthYear))    as missing_to_MonthYear
           ,count       (*) over (partition by SHORTCODE)                                                   as ranges
           ,row_number  ()  over (partition by SHORTCODE order by from_MonthYear)                           as range_seq
           ,case from_MonthYear when min(from_MonthYear) over (partition by SHORTCODE) then 1 end           as is_first
           ,case to_MonthYear   when max(to_MonthYear)   over (partition by SHORTCODE) then 1 end           as is_last

from       (select      SHORTCODE
                       ,min(MonthYear)  as from_MonthYear
                       ,max(MonthYear)  as to_MonthYear
                       ,count(*)        as months

            from       (SELECT      SHORTCODE
                                   ,MonthYear
                                   ,row_number() over (partition by SHORTCODE order by MonthYear)   as rn

                        From        VWTBL_INDICATOR
                        ) t

            group by    SHORTCODE
                       ,DATEADD(month,-rn,MonthYear)    
            ) t

order by    SHORTCODE
           ,from_MonthYear