I'm trying to run a data science experiment as explained in this tutorial.
The tutorial has 5 lessons, and each lesson has a few sub-sections in it.
For this tutorial, I am using
- a SQL Server 2016 RC3 Virtual Machine on Azure, with R Services enabled.
- RRE for Windows 8.0.0 as a data science client / R Client (for remote connection to SQL Server)
- A new SQL Login created with read, write and ddl access to the database – used to connect to the SQL Server via the R client.
I've successfully completed Lesson 1, i.e. Create SQL Server data objects from my R client, Query and Modify SQL Server data, and Define/set compute contexts.
I'm stuck with an error in the beginning of Lesson 2 of the tutorial.
Once I change the compute context from local to sql server, a simple summarize function (rxsummary) is throwing an error.
The error looks like this:
C:\Users\...\Project0\DeepDive Experiment.R(109): Error in try({ : ODBC statement error: [Microsoft][ODBC SQL Server Driver][SQL Server]Could not find stored procedure 'master..xp_ScaleR_init_job'. Error in rxInDbJobIdParam(schedulerJobInstance, FALSE) : hpcServerJob object has an invalid id. Ensure it was returned from a prior rxStartClusterJob() call Error in rxStartClusterJob(hpcServerJob, timesIsValidated = TRUE, continueOnFailure = FALSE) : Error in try({ : ODBC statement error: [Microsoft][ODBC SQL Server Driver][SQL Server]Could not find stored procedure 'master..xp_ScaleR_init_job'. Error in rxInDbJobIdParam(schedulerJobInstance, FALSE) : hpcServerJob object has an invalid id. Ensure it was returned from a prior rxStartClusterJob() call
Any help on
- why this error is occurring?
- how to find/check stored procedures in the master database – how to check if xp_scaleR_init_job exists?
- how to add/create the stored procedure if it does not exist?
will be appreciated.
For easy access, here's the full commented script until I reach the error:
###########################################DATA SCIENCE DEEP DIVE TUTORIAL###############################################
##Create the SQL Server Data Objects##
#Provide your database connection string in an R variable.
#DDUser01 is a login created on the sql server instance for remote login.
#It has read, write and ddl access to the DeepDive database.
sqlConnString <- "Driver=SQL Server;Server=*ip address*; Database=DeepDive;Uid=DDUser01;Pwd=*******"
#Specify the name of the table you want to create, and save it in an R variable.
sqlFraudTable <- "ccFraudSmall"
#Chunking
sqlRowsPerRead = 5000
#Define a variable to store the new data source
sqlFraudDS <- RxSqlServerData(connectionString = sqlConnString,table = sqlFraudTable, rowsPerRead = sqlRowsPerRead)
#Create a new R variable, sqlScoreTable, to store the name of the table used for scoring.
sqlScoreTable <- "ccFraudScoreSmall"
#Define a second data source object
sqlScoreDS <- RxSqlServerData(connectionString = sqlConnString,table = sqlScoreTable, rowsPerRead = sqlRowsPerRead)
##Load Data into SQL Tables Using R##
#Create an R variable, and assign to the variable the file path for the CSV file.
ccFraudCsv <- file.path(rxGetOption("sampleDataDir"), "ccFraudSmall.csv")
#RxTextData function to specify the text data source.
inTextData <- RxTextData(file = ccFraudCsv, colClasses = c(
"custID" = "integer", "gender" = "integer", "state" = "integer",
"cardholder" = "integer", "balance" = "integer",
"numTrans" = "integer",
"numIntlTrans" = "integer", "creditLine" = "integer",
"fraudRisk" = "integer"))
#Call rxDataStep to insert the data into the SQL Server table
rxDataStep(inData = inTextData, outFile = sqlFraudDS, overwrite = TRUE)
#Variable for creating a path to the source file - score
ccScoreCsv <- file.path(rxGetOption("sampleDataDir"), "ccFraudScoreSmall.csv")
#RxTextData function to get the data and save it in the variable
inTextData <- RxTextData(file = ccScoreCsv, colClasses = c(
"custID" = "integer", "gender" = "integer", "state" = "integer",
"cardholder" = "integer", "balance" = "integer",
"numTrans" = "integer",
"numIntlTrans" = "integer", "creditLine" = "integer"))
#Call rxDataStep to overwrite the current table with the new schema and data.
rxDataStep(inData = inTextData, sqlScoreDS, overwrite = TRUE)
##Query the Data ##
#Use the function rxGetVarInfo and specify the data source you want to analyze
rxGetVarInfo(data = sqlFraudDS)
##Modify Metadata##
#Mapping of USA State abbreviations (categorical) to their integer identifiers
#Create an R variable that holds the vector of strings to add to it - different states of the USA.
stateAbb <- c("AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC",
"DE", "FL", "GA", "HI","IA", "ID", "IL", "IN", "KS", "KY", "LA",
"MA", "MD", "ME", "MI", "MN", "MO", "MS", "MT", "NB", "NC", "ND",
"NH", "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA", "RI","SC",
"SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY")
#Create a column information object that specifies the mapping of the existing integer values to the categorical levels
#This statement also creates factor variables for gender and cardholder.
ccColInfo <- list(
gender = list(
type = "factor",
levels = c("1", "2"),
newLevels = c("Male", "Female")),
cardholder = list(type = "factor",
levels = c( "1", "2"),
newLevels = c("Principal", "Secondary")),
state = list(type = "factor", levels = as.character(1:51), newLevels = stateAbb)
)
#Update the SQL Server data source that uses the updated data
sqlFraudDS <- RxSqlServerData(connectionString = sqlConnString,
table = sqlFraudTable, colInfo = ccColInfo,
rowsPerRead = sqlRowsPerRead)
#Query new information
rxGetVarInfo(data = sqlFraudDS)
##Create and Set a Compute Context##
#Specify the connection string for the instance where computations will take place.
sqlConnString <- "Driver=SQL Server;Server=*ip address*; Database=DeepDive;Uid=DDUser01;Pwd=*******"
#Specify the location of the shared directory (temp folder for workspace objects) and save it in a variable.
sqlShareDir <- paste("c:AllShare", Sys.getenv("USERNAME"), sep="")
#Create shared directory if it does not exist
if (!file.exists(sqlShareDir)) dir.create(sqlShareDir, recursive = TRUE)
#Specify how you want the output handled.
#Here, you are indicating that the R session on the workstation should always wait for R job results,
#but not return console output from remote computations.
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
#Define the compute context object
sqlCompute <- RxInSqlServer(
connectionString = sqlConnString,
shareDir = sqlShareDir,
wait = sqlWait,
consoleOutput = sqlConsoleOutput)
#ALTERNATIVE:Enable Tracing on the Compute Context
sqlComputeTrace <- RxInSqlServer(
connectionString = sqlConnString,
shareDir = sqlShareDir,
wait = sqlWait,
consoleOutput = sqlConsoleOutput,
traceEnabled = TRUE, traceLevel = 7)
#Change Compute Context to the Server
rxSetComputeContext(sqlCompute)
##Compute Summary Statistics##
#Compute summary statistics for several of the variables
##THIS IS WHERE I FACE THE ERROR##
sumOut <- rxSummary(formula = ~gender + balance + numTrans + numIntlTrans + creditLine, data = sqlFraudDS)
Best Answer
Apparently, SQL Server 2016 needs an R Server Client (RRE) that is atleast 8.0.3. This website talks about it. I also got the same answer from Microsoft Support.
The R server I have is RRE 8.0.0. That was probably the reason for the error that I received. When I installed the Microsoft R Client instead, the script worked (barring rxCube) ! I could push computations to SQL Server and finish the tutorial.