Books Online has this to say about query optimisation in PolyBase:
The query optimizer makes a cost-based decision to push computation to Hadoop when doing so will improve query performance. It uses statistics on external tables to make the cost-based decision. Pushing computation creates MapReduce jobs and leverages Hadoop's distributed computational resources.
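The cost-based decision can also be overridden manually. As a hedged illustration (the table and column names here are invented, but the query hints are documented for SQL Server 2016 PolyBase):

```sql
-- Hypothetical external table; names are illustrative only.
-- Force PolyBase to push computation down to Hadoop as a
-- MapReduce job, regardless of the cost-based estimate.
SELECT customer_id, SUM(order_total) AS total
FROM   dbo.HadoopOrders
WHERE  order_year = 2015
GROUP BY customer_id
OPTION (FORCE EXTERNALPUSHDOWN);

-- Or keep all computation inside SQL Server:
-- OPTION (DISABLE EXTERNALPUSHDOWN);
```

Without either hint, the optimizer falls back on the external-table statistics described below.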
My question: how and when does the optimizer collect these statistics?
Inside the SQL Server relational engine, statistics can be created implicitly or explicitly; they age as writes occur and can be refreshed explicitly or during table maintenance. Are similar strategies employed in PolyBase?
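For contrast, here is a minimal sketch of those options inside the relational engine (table and statistics names are invented):

```sql
-- Implicit: with AUTO_CREATE_STATISTICS ON, the optimizer builds
-- single-column statistics the first time a predicate needs them.
ALTER DATABASE CURRENT SET AUTO_CREATE_STATISTICS ON;

-- Explicit creation on a chosen column.
CREATE STATISTICS st_orders_year ON dbo.Orders (order_year);

-- Explicit refresh after heavy writes have aged the statistics.
UPDATE STATISTICS dbo.Orders st_orders_year WITH FULLSCAN;
```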
Given that the data may be loaded into Hadoop / Azure Blob Storage without SQL Server seeing it on the way in, and that data volumes will (most likely) be large, run-time ad hoc statistics creation through sampling or similar seems an unlikely strategy to me.
Best Answer
Creation and maintenance of PolyBase statistics is in the hands of the DBA.
The "getting started" guide provides several steps to establish PolyBase, the last of which is CREATE STATISTICS referencing the external table.
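That last step might look like the following sketch. All object names are invented, and the external data source and file format are assumed to have been created in the guide's earlier steps:

```sql
-- Illustrative names only; MyHadoopSource and MyDelimitedTextFormat
-- are assumed to exist from the earlier PolyBase setup steps.
CREATE EXTERNAL TABLE dbo.SensorReadings
(
    sensor_id   INT,
    reading_ts  DATETIME2,
    reading_val FLOAT
)
WITH
(
    LOCATION    = '/sensor/readings/',
    DATA_SOURCE = MyHadoopSource,
    FILE_FORMAT = MyDelimitedTextFormat
);

-- The final step: statistics on the external table, so the optimizer
-- has something to cost pushdown decisions against.
CREATE STATISTICS st_sensor_id ON dbo.SensorReadings (sensor_id);
```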
The documentation on CREATE STATISTICS states that, for an external table, SQL Server builds statistics by importing the external data (either fully scanned or sampled) into a temporary table, and further down notes that automatic creation and automatic update of statistics are not supported for external tables.
From this I conclude it is up to the system administrator to create statistics on external tables explicitly, and to refresh them as the underlying data changes.
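Since UPDATE STATISTICS is not supported on external tables (as of SQL Server 2016), refreshing in practice means dropping and re-creating the statistics object. A sketch, with invented names:

```sql
-- Refresh by drop-and-recreate. FULLSCAN is more accurate but
-- means importing the entire external data set, so sampling is
-- usually the pragmatic choice on a large store.
DROP STATISTICS dbo.SensorReadings.st_sensor_id;

CREATE STATISTICS st_sensor_id
    ON dbo.SensorReadings (sensor_id)
    WITH SAMPLE 5 PERCENT;
```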
One imagines that, as the product matures and gains traction, more features from the core database engine (auto-create and auto-update statistics) will be ported to PolyBase. On the flip side, do you really want to wait for results while a 5 PB store is sampled?