Sql-server – Query producing radically different results running in ado.net vs SSMS

ado.netsql server

Top level: I have this query I'm running. It's been running for several weeks exactly as I expected, but the last couple of days I'm getting pretty random incorrect results when I run the query through ADO.Net but still getting the "correct" results when I run the exact same query in SSMS, and I am at a loss to see why.

I'm writing a process to harvest SEC filings from the Edgar site. Every night, the SEC publishes a small text file with the "light metadata" of the filings done on a given day (id for the company it's about, what type of filing it is, the id number of the filing). You can whip through these really fast and get the general idea of what's happening. It's the first stage of the project I wrote, and I could get the light metadata this way for 27 years or so of filings loaded in about an hour.

The thing that's really time consuming is fetching the whole filing and parsing that. As that side of the project has grown, every day I want to get the new data and then fetch and process a few extra full filings to backfill the filings I know about from the light meta data.

Another thing the process has to accommodate is that a company can submit a filing on Monday and upload a new version on Friday, or next week, or next year. So I have to recognize when I've gotten an update to a filing I've processed before.

In other words, people care about the more recent stuff, so I want to keep current and fill in the old stuff as time allows.

So I have the following query that, in high level terms, asks "give me the top X filings that a) I haven't fully processed before or b) got and update and I need to reprocess it"

FilingIndex is the table that gets the light metadata from the nightly file SEC posts about the daily activity

FilingCiks (Cik being a govt company id) is a one-to-many table about the companies referenced by the filing (at least one of these ids is required to fetch the full text for processing

FilingContent is my table where I keep the information from the full text I've fetched and parsed for a file.

So every night I run the process and get the list of "what's new" for today; that goes into FilingIndex and FilingCiks.

Then I run this query, saying "in order of the newest filings I've heard about, get me what I need to call the SEC for the full text". @max is the number of filings I found in the prior step + some number of thousands to help backfill.

I expect to see a list of all the filings I just got from the previous step, followed by entries where my backfilling last left off. So the first 5000 rows are today's/last night's filings, the next 5000 may pick up somewhere in 2018 where I left off.

I've processed a couple years of filings this way and the results have been what I expected every time. But yesterday for no reason I can find, the query below started returning scattershot results when run through ADO.Net but still returns what I expect in SSMS.

Through ADO.Net, the results are still reverse chron, but a) they don't start with the newest, and b) they skip lightly through the existing rows.

Any ideas why a) ADO.Net and SSMS would run so differently and b) why the ADO.Net run would just start behaving differently in the last day or so?

As I've said, I've been running this process for a while. I've back and filled back to August 2018 and the query was running just as I expected. But now it's jumping around in the calendar.

DECLARE @max int = 5000;
SELECT TOP (@max) idx.accession_number, ciks.cik, cast(case when idx.fetched < idx.crawlerdate then 1 ELSE 0 END as bit) as cleanup, idx.CrawlerDate
FROM
(SELECT idx.accession_number, idx.crawlerdate, idx.filingType, ROW_NUMBER() OVER (PARTITION BY idx.accession_number ORDER BY idx.crawlerdate DESC) AS rn, content.fetched
  FROM raw.FilingIndex idx
  LEFT JOIN raw.FilingContent content ON content.accession_number = idx.accession_number
  WHERE idx.accession_number < 999999999700000000 -- paper filings with no good data to mine
       AND (content.accession_number is null OR content.Fetched < idx.crawlerdate)
) idx
CROSS APPLY (SELECT TOP 1 cik from raw.filingciks ciks WHERE ciks.accession_number idx.accession_number AND ciks.crawlerdate = idx.crawlerdate) ciks
WHERE idx.rn = 1
ORDER BY idx.crawlerdate DESC, idx.Accession_Number asc

Best Answer

Are your idx.crawlerdate field's values UNIQUE for each value within each PARTITION of idx.accession_number?

In other words, if you have two values for crawlerdate that are the same within the same PARTITION then ROW_NUMBER() will randomly select which record wins in the tie when generating values for your rn field. (This is due to the nondeterministic nature of ordering data sets on a non-unique field.)

To fix this, you either need to add another field to uniqify the logic for the ORDER BY clause of the ROW_NUMBER() function OR you can use RANK() or DENSE_RANK() to generate your row numbers. If you use RANK() or DENSE_RANK() then rows with the same crawlerdate within the same PARTITION will get assigned the same rank number.

This is a great read on the nondeterministic ordering that can occur when using the ROW_NUMBER() function.

Also, I agree that with @JustinCave that your ciks query SELECT TOP 1 cik from raw.filingciks ciks WHERE ciks.accession_number idx.accession_number AND ciks.crawlerdate = idx.crawlerdate is nondeterminstic because you're using the TOP operator without an ORDER BY operator. Depending on your data, you can also get random results there.