I've only been experimenting with SQL Server 2016 recently. So, please correct me if my assumptions are incorrect:
From some research on SQL Server R Services, I see that RxHDFSConnect and RxHDFSFileSystem functions help load data from Hadoop directly into a SQL Server 2016 database.
- Can these functions still work without a Polybase connection to Hadoop?
- If yes, why use the Polybase connection at all?
Best Answer
https://azure.microsoft.com/en-gb/documentation/articles/sql-data-warehouse-overview-load/
I do not think the same recommendation exists for SQL Server 2016, but it still makes sense. The two options are slightly different in that Polybase allows you to access Hadoop or Azure Blob storage directly and has been optimised for scale, eg PolyBase scale-out groups. You write ordinary T-SQL to access these external tables which is one of the main advantages - no Map Reduce jobs for you! Using R in SQL Server 2016, you would have to use stored procedure
sp_execute_external_script
to call these libraries, create the connection, and import the data into a dataframe. You would then have to expose the dataframe back to SQL Server as an output resultset, then write it to a table, something like ...Just in terms of number of steps it feels more complicated. I haven't done any scale testing on this but instinct tells me this is not meant for massive volumes. Interestingly I could only find this page on importing data with RevoScaelR in googlecache.
What kind of use cases are you looking out, just out of interest?