Books Online has this to say about query optimisation in PolyBase:
The query optimizer makes a cost-based decision to push computation to Hadoop when doing so will improve query performance. It uses statistics on external tables to make the cost-based decision. Pushing computation creates MapReduce jobs and leverages Hadoop's distributed computational resources.
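The cost-based decision can also be overridden manually. As a hedged illustration (the table and column names here are invented, but the query hints are documented for SQL Server 2016 PolyBase):

```sql
-- Hypothetical external table; names are illustrative only.
-- Force PolyBase to push computation down to Hadoop as a
-- MapReduce job, regardless of the cost-based estimate.
SELECT customer_id, SUM(order_total) AS total
FROM   dbo.HadoopOrders
WHERE  order_year = 2015
GROUP BY customer_id
OPTION (FORCE EXTERNALPUSHDOWN);

-- Or keep all computation inside SQL Server:
-- OPTION (DISABLE EXTERNALPUSHDOWN);
```

Without either hint, the optimizer falls back on the external-table statistics described below.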
My question: how and when does the optimizer collect these statistics?
Inside the SQL Server relational engine, statistics can be created implicitly or explicitly; they age as writes occur and can be refreshed explicitly or during table maintenance. Are similar strategies employed in PolyBase?
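For contrast, here is a minimal sketch of those options inside the relational engine (table and statistics names are invented):

```sql
-- Implicit: with AUTO_CREATE_STATISTICS ON, the optimizer builds
-- single-column statistics the first time a predicate needs them.
ALTER DATABASE CURRENT SET AUTO_CREATE_STATISTICS ON;

-- Explicit creation on a chosen column.
CREATE STATISTICS st_orders_year ON dbo.Orders (order_year);

-- Explicit refresh after heavy writes have aged the statistics.
UPDATE STATISTICS dbo.Orders st_orders_year WITH FULLSCAN;
```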
Given that the data may be loaded into Hadoop / Azure Blob Storage without SQL Server seeing it on the way in, and that data volumes will (most likely) be large, run-time ad hoc statistics creation through sampling or similar seems an unlikely strategy to me.
Best Answer
Creation and maintenance of PolyBase statistics is in the hands of the DBA.
The "getting started" guide provides several steps to establish PolyBase, the last of which is CREATE STATISTICS referencing the external table.
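That last step might look like the following sketch. All object names are invented, and the external data source and file format are assumed to have been created in the guide's earlier steps:

```sql
-- Illustrative names only; MyHadoopSource and MyDelimitedTextFormat
-- are assumed to exist from the earlier PolyBase setup steps.
CREATE EXTERNAL TABLE dbo.SensorReadings
(
    sensor_id   INT,
    reading_ts  DATETIME2,
    reading_val FLOAT
)
WITH
(
    LOCATION    = '/sensor/readings/',
    DATA_SOURCE = MyHadoopSource,
    FILE_FORMAT = MyDelimitedTextFormat
);

-- The final step: statistics on the external table, so the optimizer
-- has something to cost pushdown decisions against.
CREATE STATISTICS st_sensor_id ON dbo.SensorReadings (sensor_id);
```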
The documentation on CREATE STATISTICS states that, for an external table, SQL Server builds statistics by importing the external data (either fully scanned or sampled) into a temporary table, and further down notes that automatic creation and automatic update of statistics are not supported for external tables.
From this I conclude it is up to the system administrator to create statistics on external tables explicitly, and to refresh them as the underlying data changes.
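Since UPDATE STATISTICS is not supported on external tables (as of SQL Server 2016), refreshing in practice means dropping and re-creating the statistics object. A sketch, with invented names:

```sql
-- Refresh by drop-and-recreate. FULLSCAN is more accurate but
-- means importing the entire external data set, so sampling is
-- usually the pragmatic choice on a large store.
DROP STATISTICS dbo.SensorReadings.st_sensor_id;

CREATE STATISTICS st_sensor_id
    ON dbo.SensorReadings (sensor_id)
    WITH SAMPLE 5 PERCENT;
```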
One imagines that, as the product matures and gains traction, more features from the core database engine (auto-create and auto-update statistics) will be ported to PolyBase. On the flip side, do you really want to wait for results while a 5 PB store is sampled?