Postgresql – Parallel PostgreSQL Queries with R

parallelismpostgresqlr

I'm running a large query that for various reasons, I've broken into a series of smaller queries. The example below is just to show how the query is broken up by id; the actual query is much more complex:

UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 0  AND id < 10;
 UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 10 AND id < 20;;
  UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 20  AND id < 30;

etc...

I've written a function in R that generates the queries sequentially and sends them to the database using the R library RPostgreSQL:

library(RPostgreSQL)

num <- seq(0, 100, 10)

 query.func <- function(num){

      con <- dbConnect(PostgreSQL(), dbname = "name", post = 5432, user = "user_name", password = "password")

      num2 <- num + 10

       q1 <- paste('UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >=', num, 'AND id <', num2, sep = "")

         dbSendQuery(con, q1)

           dbDisconnect(con)
}  

lapply(num, query.func)

Rather than having this long sequence of queries run one after another, I wanted to try and take advantage of my server's multiple CPU's and run them in parallel. Using the R library parallel I've created a cluster and sent multiple queries to the cluster simultaneously:

no_cores <- detectCores() - 1

 cl <- makeCluster(no_cores)

  clusterExport(cl, "query.func")
  clusterExport(cl, "num")

  clusterEvalQ(cl, library(RPostgreSQL))

    parLapply(cl, num, query.func)

stopCluster()

Where query.func is defined as above.

When I run this R script at the terminal I receive no errors and if I check pg_stat_activity I see a list of active queries, each incrementally operating on a separate block of data. However, when I use top to check the resource management on my system, I still only see one postgres process. When I look at the CPU usage, I only see one CPU active at a time.

Are these queries really running simultaneously on separate CPU's? My understanding was that, prior for 9.6, a single postgres query could not be split across multiple cores but each connection could utilize a separate core (related question). Does the process I've outlined above open multiple connections and send a query to the database via each connection?

PostgreSQL 9.3/ Ubuntu 14.04 / R 3.3.2

Best Answer

UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 0  AND id < 10;
 UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 10 AND id < 20;;
  UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 20  AND id < 30;

That's insane especially if the query is not CPU bound. Just run one query with one update.

UPDATE targate_table set col1_target = col1_source
FROM source_table
WHERE (id BETWEEN 0 AND 10)
  OR (id BETWEEN 10 AND 20)
  OR (id BETWEEN 20 AND 30);

PostgreSQL doesn't lock the table for row updates. It only locks the rows. There is no reason to break this up.