PostgreSQL – Does System Process Slow Down Database Access?

performance postgresql

I have a fairly large database containing mostly HTML documents. I wrote a Python script that always fetches 1000 HTML documents at once from my database with cur.execute("SELECT id,url,html_file FROM html ORDER BY id OFFSET %s LIMIT %s;" % (offset, limit)). Afterwards I run a few regular expressions over the HTML documents. Because I have a lot of HTML files to go through, I track how long each step takes. For the first 4000 batches, retrieving the HTML documents from the database took around 3 seconds; now it is up to 4 minutes.

I'm running a Windows 7 machine and took a look at the Resource Monitor. The database is on its own HDD with nothing else on it, but in the Resource Monitor I could see that the System process constantly reads something out of my Postgres folder:

Image   PID File    Read (B/sec)    Write (B/sec)   Total (B/sec)   I/O Priority    Response Time (ms)
System  4   E:\PostgreSQL\data\base\10596207\10598404.1 13,855,193  0   13,855,193  Normal  9
System  4   E:\PostgreSQL\data\base\10596207\10598404.1 11,182,442  0   11,182,442  Normal  9

So the question is: is this normal, or is the System process the culprit here, and how do I stop it? (Malware and virus scanning are disabled, as is search indexing.)

I followed the advice from Reaces and found that Superfetch was already disabled. But I also downloaded Process Explorer from Sysinternals, and there I observed something strange: after about the same number of parsed HTML documents (about 4,513,000, fetched in batches of 1,000 from the database), the postgres process begins to write a lot to the HDD. Up to this point it had read ~3 TB from disk and written only ~100 MB.

But now, about 3000 HTML documents later, Process Explorer shows me that the postgres process has written 10 GB to the HDD, with heavy write access for every cur.execute("SELECT id,url,html_file FROM html ORDER BY id OFFSET %s LIMIT %s;" % (offset, limit)) command.

What is the database doing, and how do I stop it? My best guess is that the database is reorganizing itself to be more efficient next time, but that does not help me now.

Best Answer

OFFSET is not what you want to use.

The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large OFFSET might be inefficient.

http://www.postgresql.org/docs/9.3/static/queries-limit.html

Instead, add a WHERE clause along the lines of cur.execute("... WHERE id > %s ORDER BY id LIMIT %s", (last_id_from_previous_batch, 1000)). This reads and returns rows starting right after that id, instead of scanning OFFSET+LIMIT rows every time and discarding all but the last chunk. Once the offset is large enough, you are likely either swapping, or the query plan starts to require an on-disk sort, which causes the I/O you are seeing.
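A minimal, runnable sketch of this keyset-pagination pattern. It uses Python's built-in sqlite3 so it works standalone; with psycopg2 the placeholder is %s instead of ?, but the pattern is identical. The schema and row counts are made up for illustration:

```python
import sqlite3

# In-memory stand-in for the question's "html" table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE html (id INTEGER PRIMARY KEY, url TEXT, html_file TEXT)")
conn.executemany(
    "INSERT INTO html VALUES (?, ?, ?)",
    [(i, "http://example.com/%d" % i, "<html>%d</html>" % i)
     for i in range(1, 10001)])

def fetch_batches(conn, batch_size=1000):
    """Keyset pagination: remember the last id seen instead of using OFFSET."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, url, html_file FROM html WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]  # seed the next batch from the last row's id

batches = list(fetch_batches(conn))  # 10 batches of 1000 rows each
```

Because id is the primary key, each batch is an index range scan starting at last_id; the cost per batch stays constant no matter how deep into the table you are, which is exactly what OFFSET does not give you.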

Also, your code as written is vulnerable to SQL injection. Pass parameterized values as the second argument to execute(); the DB-API module will handle quoting and escaping for you, while the % operator will not.
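To make the difference concrete, here is a small sketch, again using sqlite3 so it runs standalone (sqlite3 uses ? placeholders where psycopg2 uses %s); the hostile input string is invented for the demonstration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE html (id INTEGER PRIMARY KEY, url TEXT)")
conn.execute("INSERT INTO html VALUES (1, 'http://example.com')")

hostile = "0 OR 1=1; DROP TABLE html"  # hypothetical hostile input

# Unsafe: % pastes the raw text into the SQL string before the driver
# ever sees it, so the input can rewrite the statement. (Left commented out.)
# cur.execute("SELECT id, url FROM html WHERE id > %s" % hostile)

# Safe: the value travels separately from the SQL text, so it is always
# treated as a single value and can never alter the statement.
rows = conn.execute(
    "SELECT id, url FROM html WHERE id > ?", (hostile,)).fetchall()

# The table survives untouched.
table_alive = conn.execute(
    "SELECT name FROM sqlite_master WHERE name = 'html'").fetchall()
```

With the parameterized form, the hostile string is just compared against id as data (matching nothing here) instead of being executed as SQL.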