Sql-server – How does PolyBase collect statistics from external tables

hadooppolybasesql serversql-server-2016

Books online has this to say about query optimisation in PolyBase:

The query optimizer makes a cost-based decision to push computation to Hadoop when doing so will improve query performance. It uses statistics on external tables to make the cost-based decision. Pushing computation creates MapReduce jobs and leverages Hadoop's distributed computational resources.

My question – how and when does the optimizer collect these statistics?

Inside the SQL Server relational engine statistics can be created implicitly or explicitly, they age as writes occur, and can be refreshed explicitly or during table maintenance. Are there similar strategies employed in PolyBase?

Given the data may be loaded into Hadoop/ Azure blob without SQL Server seeing it on the way in, and data volumes will be large (most likely) run-time ad hoc statistic creation through sampling or similar seems an unlikely strategy to me.

Best Answer

Creation and maintenance of PolyBase statistics is in the hands of the DBA.

The "getting started" guide provides several steps to establish PolyBase, the last of which is CREATE STATISTICS referencing the external table.

The documentation on CREATE STATISTICS states:

Statistics for external tables

When creating external table statistics, SQL Server imports the external table into a temporary SQL Server table, and then creates the statistics. For samples statistics, only the sampled rows are imported. If you have a large external table, it will be much faster to use the default sampling instead of the full scan option.

and further down

Limitations and Restrictions

Updating statistics is not supported on external tables. To update statistics on an external table, drop and re-create the statistics.

From this I conclude it is up to the system administrator to

  • create appropriate statistics for the intended workload
  • monitor for stale statistics
  • implement maintenance procedures for external statistics
  • ensuring there is sufficient space for the "temporary table" (TempDB?)

One imagines that, as the product matures and gains traction, more features from the current data engine (auto create & auto update statistics) will be ported to PolyBase. On the flip side, do you really want to wait for results while a 5PB store is sampled?