Postgresql – Parallel PostgreSQL Queries with R

parallelismpostgresqlr

I'm running a large query that for various reasons, I've broken into a series of smaller queries. The example below is just to show how the query is broken up by id; the actual query is much more complex:

UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 0  AND id < 10;
 UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 10 AND id < 20;;
  UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 20  AND id < 30;

etc...

I've written a function in R that generates the queries sequentially and sends them to the database using the R library RPostgreSQL:

library(RPostgreSQL)

num <- seq(0, 100, 10)

 query.func <- function(num){

      con <- dbConnect(PostgreSQL(), dbname = "name", post = 5432, user = "user_name", password = "password")

      num2 <- num + 10

       q1 <- paste('UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >=', num, 'AND id <', num2, sep = "")

         dbSendQuery(con, q1)

           dbDisconnect(con)
}  

lapply(num, query.func)

Rather than having this long sequence of queries run one after another, I wanted to try and take advantage of my server's multiple CPU's and run them in parallel. Using the R library parallel I've created a cluster and sent multiple queries to the cluster simultaneously:

no_cores <- detectCores() - 1

 cl <- makeCluster(no_cores)

  clusterExport(cl, "query.func")
  clusterExport(cl, "num")

  clusterEvalQ(cl, library(RPostgreSQL))

    parLapply(cl, num, query.func)

stopCluster()

Where query.func is defined as above.

When I run this R script at the terminal I receive no errors and if I check pg_stat_activity I see a list of active queries, each incrementally operating on a separate block of data. However, when I use top to check the resource management on my system, I still only see one postgres process. When I look at the CPU usage, I only see one CPU active at a time.

Are these queries really running simultaneously on separate CPU's? My understanding was that, prior for 9.6, a single postgres query could not be split across multiple cores but each connection could utilize a separate core (related question). Does the process I've outlined above open multiple connections and send a query to the database via each connection?

PostgreSQL 9.3/ Ubuntu 14.04 / R 3.3.2

Best Answer

UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 0  AND id < 10;
 UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 10 AND id < 20;;
  UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 20  AND id < 30;

That's insane especially if the query is not CPU bound. Just run one query with one update.

UPDATE targate_table set col1_target = col1_source
FROM source_table
WHERE (id BETWEEN 0 AND 10)
  OR (id BETWEEN 10 AND 20)
  OR (id BETWEEN 20 AND 30);

PostgreSQL doesn't lock the table for row updates. It only locks the rows. There is no reason to break this up.

Best Answer

Related Solutions

Using Multiple Cores for Single MySQL Queries on Debian

UPDATE 2012-03-19 08:30 EDT

Sql-server – Parallel Plan Selection

Related Question