MySQL – Why do the queries slow down after a few minutes when trying to build a DB?

excel, MySQL, python, Ubuntu

First – I am a complete newb to DBA, so please be gentle. (And I am fairly new to Python as well.) Oh, and I have no idea how to change any MySQL settings or the like, so all the MySQL settings are just the defaults. I am running Ubuntu 14.04 on an AMD64 machine with 8 GB of RAM.

I have a huge Excel (xlsx) spreadsheet (about 20K rows and 100 columns) that came from an old DB (which I do not have access to), and I am trying to use it to build out a new MySQL DB. I am using Python with openpyxl and MySQLdb. I have written the following:

(There is more to it, but I believe everything you may need is here. The code works fine; it is just unbearably slow after several iterations.):

import string
import datetime
import MySQLdb as mdb
from types import UnicodeType

theSList = []
theBigList = []
for col in range(1, ws.get_highest_column() + 1):
    # Row 2 holds the headers; strip punctuation and whitespace so each
    # header can be used as a column name.
    myStr1 = str(ws.cell(column=col, row=2).value)
    myStr1 = myStr1.translate(string.maketrans("", ""), string.punctuation)
    myStr1 = ''.join(myStr1.split())
    theBigList.append(myStr1)
    theSList.append('%s')

theDeetsString = 'INSERT INTO hf_fund_details(' + ", ".join([theBigList[i - 1] for i in deetsCols]) + ')' + ' VALUES (' + ", ".join([theSList[i - 1] for i in deetsCols]) + ')'
theMgrDeetsString = 'INSERT INTO hf_mgr_details (' + ", ".join([theBigList[i - 1] for i in mgrCols]) + ')' + ' VALUES (' + ", ".join([theSList[i - 1] for i in mgrCols]) + ')'
theFundStratString = 'INSERT INTO hf_strat_details (' + ", ".join([theBigList[i - 1] for i in stratCols]) + ')' + ' VALUES (' + ", ".join([theSList[i - 1] for i in stratCols]) + ')'
theFundIDsString = 'INSERT INTO hf_id_details (' + ", ".join([theBigList[i - 1] for i in idCols]) + ')' + ' VALUES (' + ", ".join([theSList[i - 1] for i in idCols]) + ')'
theFundFeeString = 'INSERT INTO hf_fee_details (' + ", ".join([theBigList[i - 1] for i in feeCols]) + ')' + ' VALUES (' + ", ".join([theSList[i - 1] for i in feeCols]) + ')'
theFundSPsString = 'INSERT INTO hf_servpro_details (' + ", ".join([theBigList[i - 1] for i in servproCols]) + ')' + ' VALUES (' + ", ".join([theSList[i - 1] for i in servproCols]) + ')'

con = mdb.connect('localhost', 'root', 'K2Kill3rs', 'igp_hf_db')

with con:
    cur = con.cursor()

    cur.execute("DROP TABLE IF EXISTS hf_mgr_details")
    cur.execute("DROP TABLE IF EXISTS hf_fund_details")
    cur.execute("DROP TABLE IF EXISTS hf_strat_details")
    cur.execute("DROP TABLE IF EXISTS hf_id_details")
    cur.execute("DROP TABLE IF EXISTS hf_fee_details")
    cur.execute("DROP TABLE IF EXISTS hf_servpro_details")

    cur.execute("CREATE TABLE hf_fund_details(" + theDeetsStr + ")")
    cur.execute("CREATE TABLE hf_mgr_details(" + theMgrDeetsStr + ")")
    cur.execute("CREATE TABLE hf_strat_details(" + theFundStratStr + ")")
    cur.execute("CREATE TABLE hf_id_details(" + theFundIDsStr + ")")
    cur.execute("CREATE TABLE hf_fee_details(" + theFundFeeStr + ")")
    cur.execute("CREATE TABLE hf_servpro_details(" + theFundSPsStr + ")")

for row in range(4, int(ws.get_highest_row()) + 1):  # +1 so the last row is not skipped
    if ws.cell(column=1, row=row).value is None:
        continue
    theValueList = []
    for col in range(1, ws.get_highest_column() + 1):
        theCellValue = ws.cell(column=col, row=row).value
        # Normalize: Yes/No -> 1/0, unicode -> ASCII, empty cell -> 0
        if theCellValue is not None:
            if theCellValue == 'Yes':
                theValueList.append(str(1))
            elif theCellValue == 'No':
                theValueList.append(str(0))
            elif type(theCellValue) is UnicodeType:
                theValueList.append(theCellValue.encode('ascii', 'ignore'))
            else:
                theValueList.append(theCellValue)
        else:
            theValueList.append(0)

    # NB: each pass through this block opens and commits a transaction
    # (MySQLdb's connection context manager commits on exit), so every
    # spreadsheet row costs a commit.
    with con:
        cur = con.cursor()

        cur.execute(theDeetsString, tuple([theValueList[i - 1] for i in deetsCols]))
        cur.execute(theMgrDeetsString, tuple([theValueList[i - 1] for i in mgrCols]))
        cur.execute(theFundStratString, tuple([theValueList[i - 1] for i in stratCols]))
        cur.execute(theFundIDsString, tuple([theValueList[i - 1] for i in idCols]))
        cur.execute(theFundFeeString, tuple([theValueList[i - 1] for i in feeCols]))
        cur.execute(theFundSPsString, tuple([theValueList[i - 1] for i in servproCols]))

    #ver = cur.fetchone()
    #print "Database version : %s " % ver
    print 'Completed Row #' + str(row) + ' at ' + datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')


if con:    
    con.close()

So this drops the old tables (the ones I created in the last attempt), creates new ones, and then cycles through every row of the Excel file, building 6 queries per row and putting the data into the 6 different tables. This starts out pretty quick, doing all 6 queries for each row in under a second. Within a few minutes, though, it slows down drastically, and within 30 minutes each iteration of the loop (1 Excel row, 6 queries) takes about 5-10 minutes.

After playing with it, I am pretty positive the problem is NOT with openpyxl but with the queries going to MySQL. It is not any particular table or query; each query just takes several seconds to complete. Is something blatantly wrong here?

As I understand it, MySQL does a bunch of maintenance in the background while all this is going on, and that could be what is slowing everything down. I am thinking it is trying to optimize the tables while I am still building them. Is there some way to put MySQL into a 'build mode' so that it doesn't try to do that? Is there a way to just build one massive query with all of these rows? Or queries that do a few hundred rows at once? (The Excel file has about 20K rows of data I need to get through.)
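For illustration, batching a few hundred rows per statement with MySQLdb's executemany() might look roughly like the sketch below; theDeetsString is the parameterized INSERT built above, and rows_from_worksheet stands in for a hypothetical iterable of value tuples:

import MySQLdb as mdb

con = mdb.connect('localhost', 'root', 'K2Kill3rs', 'igp_hf_db')
cur = con.cursor()

batch = []
for values in rows_from_worksheet:      # hypothetical: one tuple per spreadsheet row
    batch.append(values)
    if len(batch) >= 500:               # flush every few hundred rows
        cur.executemany(theDeetsString, batch)
        con.commit()                    # one commit per batch, not per row
        batch = []
if batch:                               # flush the remainder
    cur.executemany(theDeetsString, batch)
    con.commit()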

Again, these are the thoughts of someone who has never worked with DBs before, but if I am dumb, please feel free to tell me.

Best Answer

The slowness you're experiencing could be due to one or more of several causes.

It is important to note that 20K rows is a relatively small number of rows.

You may want to check for the following:

  • What is the structure of the tables: the storage engine, the keys (indexes), and the primary keys?
  • Output the generated query after a certain number of iterations, to make sure it is not getting cumulatively huge.
  • The slow query log may have some hints (see the snippet after this list).
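For reference, a minimal sketch of turning the slow query log on at runtime from the same MySQLdb connection (the 1-second threshold is illustrative, and the SUPER privilege is assumed):

cur.execute("SET GLOBAL slow_query_log = 'ON'")
cur.execute("SET GLOBAL long_query_time = 1")   # log anything slower than 1 second
cur.execute("SHOW VARIABLES LIKE 'slow_query_log_file'")
print cur.fetchone()                            # where the log is being written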

Depending on the cause that you find, one of the following solutions may apply:

  • Disable (or drop) the keys and re-enable them after the import is finished (sketched below).
  • Check that you have a proper primary key for each table, especially if the engine you are using is InnoDB.
  • Use bulk inserts.
  • Use LOAD DATA INFILE (sketched below).
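As a rough sketch of the first and last of those options (the table name, file path, and CSV layout are placeholders, and note that DISABLE KEYS only affects MyISAM secondary indexes):

cur.execute("ALTER TABLE hf_fund_details DISABLE KEYS")
# ... run the bulk inserts here ...
cur.execute("ALTER TABLE hf_fund_details ENABLE KEYS")

# Alternatively, dump the spreadsheet to a CSV first and let the server load it:
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/hf_fund_details.csv'
    INTO TABLE hf_fund_details
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
""")
con.commit()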

HTH