SQL Server 2016 R Services – How to Connect to Hadoop Data

hadoop, r, sql-server-2016

I've only recently started experimenting with SQL Server 2016, so please correct me if my assumptions are wrong:

From some research on SQL Server R Services, I see that RxHDFSConnect and RxHDFSFileSystem functions help load data from Hadoop directly into a SQL Server 2016 database.

Can these functions still work without a Polybase connection to Hadoop?
If yes, why use the Polybase connection at all?

Best Answer

The fastest way to import data into SQL Data Warehouse is to use PolyBase...

https://azure.microsoft.com/en-gb/documentation/articles/sql-data-warehouse-overview-load/

I do not think the same recommendation exists for SQL Server 2016, but it still makes sense. The two options are slightly different: PolyBase lets you access Hadoop or Azure Blob storage directly and has been optimised for scale, e.g. with PolyBase scale-out groups. You write ordinary T-SQL against these external tables, which is one of the main advantages - no MapReduce jobs for you! Using R in SQL Server 2016, you would have to use the stored procedure sp_execute_external_script to call these functions, create the connection, and import the data into a data frame. You would then have to expose the data frame back to SQL Server as an output result set and write it to a table, something like ...

--NB This is pseudo-code / will not execute!

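INSERT INTO dbo.someTable ( ... )
EXEC dbo.sp_execute_external_script
    @language = N'R',
    @script = N'
library(RevoScaleR)  # the HDFS functions ship in the RevoScaleR package
OutputDataSet <- some code that allows the RevoScaleR HDFS functions to get some data ...';
-- WITH RESULT SETS cannot be combined with INSERT ... EXEC, so the columns of
-- dbo.someTable ( col1 INT NOT NULL, col2 FLOAT NOT NULL ... etc ) define the shape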

Just in terms of the number of steps, it feels more complicated. I haven't done any scale testing on this, but instinct tells me it is not meant for massive volumes. Interestingly, I could only find this page on importing data with RevoScaleR in the Google cache.
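For comparison, the PolyBase route described above might look roughly like the following. This is a minimal sketch, not a tested setup: it assumes PolyBase is installed and Hadoop connectivity has already been configured on the instance, and every object name, the name node address, and the HDFS path are hypothetical placeholders.

```sql
-- Hypothetical names throughout: HadoopCluster, CsvFormat, dbo.SalesHdfs, dbo.someTable.
-- The name node address and HDFS path are placeholders, not real endpoints.

-- 1. Point SQL Server at the Hadoop cluster
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://namenode.example.com:8020'
);

-- 2. Describe how the files are laid out (comma-delimited text here)
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- 3. Expose the HDFS directory as a table
CREATE EXTERNAL TABLE dbo.SalesHdfs (
    col1 INT NOT NULL,
    col2 FLOAT NOT NULL
)
WITH (
    LOCATION = '/data/sales/',
    DATA_SOURCE = HadoopCluster,
    FILE_FORMAT = CsvFormat
);

-- 4. From here on it is ordinary T-SQL
INSERT INTO dbo.someTable (col1, col2)
SELECT col1, col2
FROM dbo.SalesHdfs;
```

Once the three external objects exist, the import itself is a single INSERT ... SELECT, which is the "ordinary T-SQL" advantage mentioned above.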

What kind of use cases are you looking at, just out of interest?

Related Solutions

Using SQL Server 2016 R Services vs. Standalone Microsoft R Server

Possibly a slightly subjective question, but sometimes it's helpful to think of these things in scenarios, for example:

A predominantly SQL Server project, run by DBAs/devs (who may or may not know some R). A data scientist or R professional writes some R for them, or they do it themselves, and they test on their local machines, e.g. using RStudio or the Microsoft R IDE. When they are happy with the R, they plug it into their data workflow using sp_execute_external_script. They can easily incorporate this into other stored procedures, SSIS etc. This might be a good candidate for SQL Server 2016 with R Services.
A secure project: the data is in your secured SQL Server and you want to keep the data close to the analytics and not use the cloud. This might be a good candidate for an on-premises installation of SQL Server 2016 with R Services.
A predominantly data science project which requires data but does not require SQL Server access. This might be a good candidate for Microsoft R Server only.
High performance is required for a big data project. You do not want to share the box with the traditionally memory-hungry SQL Server. This might be a good candidate for Microsoft R Server only.

Just a few examples there. What examples can you think of?

HTH

Connecting Microsoft R Server to SQL Server with and without R Services

I can answer some of these questions hopefully:

The standalone machine has the benefit of most of the machine resources at its disposal. However, as far as benefits "while connecting" go, there are none. The point of SQL Server with R Services is to keep the analytics close to the data. This has little or no consequence for external clients connecting to the data, as I see it at least.

You can invoke SQL Server stored procedures from R using the RODBC library; for example, this worked for me:

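install.packages("RODBC")  # one-off install
library(RODBC)
# Windows authentication against a local named instance
dbhandle <- odbcDriverConnect('driver={SQL Server};server=.\\sql2014;database=tempdb;trusted_connection=true')
# The procedure's result set comes back as a data frame
data <- sqlQuery(dbhandle, "EXEC sp_who2")
data
odbcClose(dbhandle)  # release the connection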

Results: [screenshot of the sp_who2 output omitted]

Push compute is a feature of the ScaleR library, e.g. the function RxInSqlServer. See here for more details.
Regarding "anything special that SQL Server with R Services offers for a standalone Microsoft R Server...", you could conceivably call the stored procedure sp_execute_external_script from R using RODBC, but this seems a bit pointless. A dedicated SQL Server box (i.e. without R Services enabled) does not have to share its resources, and this partly answers your last question, I think: the point of having a dedicated box of either kind (MRS only or SQL Server only) is that neither has to share its resources. The point of having a SQL Server with R Services enabled is that the box can perform dual roles, resources permitting, while keeping the analytics close to the data. I hope that makes sense.

Here's a nice article on persisting machine learning models to a table using SQL Server with R Services. How would you do that in standalone R?
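One common way to persist a model with R Services (not necessarily what the linked article does) is to serialize it inside sp_execute_external_script and hand it back through a VARBINARY(MAX) output parameter. A minimal sketch, assuming R Services is enabled on the instance; the table dbo.models and the model name are hypothetical, and the model is fitted on R's built-in mtcars data purely for illustration.

```sql
-- Hypothetical table to hold serialized models
CREATE TABLE dbo.models (
    model_name NVARCHAR(100) NOT NULL PRIMARY KEY,
    model      VARBINARY(MAX) NOT NULL
);

DECLARE @model VARBINARY(MAX);

-- Train in R and return the serialized model as an output parameter
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
fit <- lm(mpg ~ wt, data = mtcars)          # toy model on a built-in dataset
model <- serialize(fit, connection = NULL)  # raw vector, maps to VARBINARY(MAX)',
    @params = N'@model VARBINARY(MAX) OUTPUT',
    @model = @model OUTPUT;

-- Store the model like any other row
INSERT INTO dbo.models (model_name, model)
VALUES (N'mpg_by_wt', @model);
```

To score with the stored model later, you would read the VARBINARY value back, pass it into another sp_execute_external_script call as an input parameter, and call unserialize() on it inside the R script.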

Maybe someone else can chime in, because honestly this seems very similar to your other questions. As you might guess, not that many people are doing this, so one of the best people to try it is you: try these things out and report back : )
