SQL Server 2016 R Services – How to Connect to Hadoop Data

hadooprsql-server-2016

I've only been experimenting with SQL Server 2016 recently. So, please correct me if my assumptions are incorrect:

From some research on SQL Server R Services, I see that RxHDFSConnect and RxHDFSFileSystem functions help load data from Hadoop directly into a SQL Server 2016 database.

  • Can these functions still work without a Polybase connection to Hadoop?
  • If yes, why use the Polybase connection at all?

Best Answer

The fastest way to import data into SQL Data Warehouse is to use PolyBase...

https://azure.microsoft.com/en-gb/documentation/articles/sql-data-warehouse-overview-load/

I do not think the same recommendation exists for SQL Server 2016, but it still makes sense. The two options are slightly different in that Polybase allows you to access Hadoop or Azure Blob storage directly and has been optimised for scale, eg PolyBase scale-out groups. You write ordinary T-SQL to access these external tables which is one of the main advantages - no Map Reduce jobs for you! Using R in SQL Server 2016, you would have to use stored procedure sp_execute_external_script to call these libraries, create the connection, and import the data into a dataframe. You would then have to expose the dataframe back to SQL Server as an output resultset, then write it to a table, something like ...

--NB This is pseduo-code / will not execute!

INSERT INTO dbo.someTable ( ... 
EXEC dbo.sp_execute_external_script
    @language = N'R', 
    @script = N'
library(RxHDFSConnect)
OutputDataSet <- some code that allows RxHDFSConnect to get some data ...'
WITH RESULT SETS ( ( col1 INT NOT NULL, col2 FLOAT NOT NULL ... etc ) );

Just in terms of number of steps it feels more complicated. I haven't done any scale testing on this but instinct tells me this is not meant for massive volumes. Interestingly I could only find this page on importing data with RevoScaelR in googlecache.

What kind of use cases are you looking out, just out of interest?