I'm running a large query that for various reasons, I've broken into a series of smaller queries. The example below is just to show how the query is broken up by id; the actual query is much more complex:
UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 0 AND id < 10;
UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 10 AND id < 20;;
UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >= 20 AND id < 30;
etc...
I've written a function in R that generates the queries sequentially and sends them to the database using the R library RPostgreSQL
:
library(RPostgreSQL)
num <- seq(0, 100, 10)
query.func <- function(num){
con <- dbConnect(PostgreSQL(), dbname = "name", post = 5432, user = "user_name", password = "password")
num2 <- num + 10
q1 <- paste('UPDATE target_table SET col1_target = col1_source FROM source_table WHERE id >=', num, 'AND id <', num2, sep = "")
dbSendQuery(con, q1)
dbDisconnect(con)
}
lapply(num, query.func)
Rather than having this long sequence of queries run one after another, I wanted to try and take advantage of my server's multiple CPU's and run them in parallel. Using the R library parallel
I've created a cluster and sent multiple queries to the cluster simultaneously:
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
clusterExport(cl, "query.func")
clusterExport(cl, "num")
clusterEvalQ(cl, library(RPostgreSQL))
parLapply(cl, num, query.func)
stopCluster()
Where query.func
is defined as above.
When I run this R script at the terminal I receive no errors and if I check pg_stat_activity
I see a list of active queries, each incrementally operating on a separate block of data. However, when I use top to check the resource management on my system, I still only see one postgres process. When I look at the CPU usage, I only see one CPU active at a time.
Are these queries really running simultaneously on separate CPU's? My understanding was that, prior for 9.6, a single postgres query could not be split across multiple cores but each connection could utilize a separate core (related question). Does the process I've outlined above open multiple connections and send a query to the database via each connection?
PostgreSQL 9.3/ Ubuntu 14.04 / R 3.3.2
Best Answer
That's insane especially if the query is not CPU bound. Just run one query with one update.
PostgreSQL doesn't lock the table for row updates. It only locks the rows. There is no reason to break this up.