I have an online MySQL (presto) database which keeps on auto-updating (i.e., new rows keep getting added to the tables(views, to be exact) every day). I want to fetch the data from it every day but don't want to refetch the data that I have already fetched before.
I am using python3 for extracting the data from the online host database as:
from pyhive import presto
import numpy as np
cursor=presto.connect().cursor()
sql = 'select * from schema.table'
cursor.execute(sql)
rows = np.array(cursor.fetchall())
cursor.close()
Doing this takes time and fetches all the data (new data as well as already fetched data). I am storing the data in rows
variable in a csv file afterward.
Is there any way to get only the new data (id
is the primary key in the table)?
Thanks in advance.
Best Answer
Assuming
id
is an incrementing number or something else which grows every time you insert a new row (such as a timestamp), this should be quite straightforward:First read the last
id
value in the csv file from the previous run, and then use this in your query line:UPDATE:
If
id
is not an incrementing key, e.g. it's a UUID, then this becomes much harder. Other DBMSes (e.g. Oracle) haverowid
which could probably have been used here, though that is not a MySQL feature. MariaDB hasrowid
in itsCONNECT
storage engine, so you could set up a local MariaDB instance, install the CONNECT engine, create aCONNECT
table connected to your remote MySQL tableschema.table
with the special rowid column using a statement such as: