Azure Synapse — SQL Pool
This article highlights the features, design, scalability, and costing of the Azure Synapse SQL pool (formerly SQL Data Warehouse).
What is Azure Synapse?
Azure Synapse is an end-to-end analytics solution on Azure that includes a data lake, data warehouse, ETL/ELT, ML capabilities, and visualization. It has four primary components:
1. Synapse SQL: Complete T-SQL based analytics with two SQL engines (a serverless query over external data is sketched after this list):
- SQL pool: provisioned compute (pay per DWU provisioned)
- SQL on-demand: serverless, to query external data (pay per TB processed)
2. Spark: Deeply integrated Apache Spark
3. Synapse Pipelines: Hybrid data integration (Azure Data Factory)
4. Studio: Unified user experience.
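To make the two engines concrete, below is a minimal sketch of a serverless (on-demand) query over Parquet files in a data lake; the storage account, container, and path are made-up placeholders, not part of this article's environment.

-- Serverless (on-demand) SQL: query Parquet files directly from the data lake.
-- The storage URL is a hypothetical placeholder; point it at your own account and path.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/sales/2020/*.parquet',
    FORMAT = 'PARQUET'
) AS sales_rows;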
For more information, refer to https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/modern-data-warehouse#architecture, which details the architecture and the data flow.
What is Synapse SQL pool?
Synapse SQL pool (formerly SQL Data Warehouse) is a massively parallel processing (MPP) enterprise data warehouse on Azure.
- Synapse SQL pool stores data in relational tables with columnar storage.
- Synapse SQL pool separates compute from storage, which enables independent scaling.
- The Synapse Service Level Objective (compute) is the scalability setting that determines the cost and performance of the data warehouse (SQL pool). It ranges from DW100c to DW30000c; higher configurations have more nodes. Compute is discussed in detail in later sections.
- Workload management is an additional feature for managing resource utilization, concurrency, and query priority (see the classifier sketch after this list).
- Synapse compute can be scaled up or down via SQL, PowerShell, or the Azure portal. The ability to scale via SQL is unique, and is achieved with an ALTER DATABASE command.
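As a hedged illustration of workload management, the classifier below routes queries from a hypothetical ELTRole database role to a static resource class with high importance; the classifier name, workload group, and member name are assumptions for the sketch.

-- Classify queries from the (hypothetical) ELTRole so they run in the
-- staticrc20 resource class with HIGH importance.
CREATE WORKLOAD CLASSIFIER wgcELT
WITH (
    WORKLOAD_GROUP = 'staticrc20',
    MEMBERNAME = 'ELTRole',
    IMPORTANCE = HIGH
);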
- Use the following SQL to verify which compute (service objective) is in use for a SQL pool:
use master;
SELECT
    db.name AS [Database],
    ds.edition AS [Edition],
    ds.service_objective AS [Service Objective]
FROM sys.database_service_objectives ds
JOIN sys.databases db
    ON ds.database_id = db.database_id;
Modify the compute to the required scale using the SQL below:
ALTER DATABASE DWHDB MODIFY (SERVICE_OBJECTIVE = 'DW200c');
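A scale operation can take several minutes to complete; one way to track its progress (a sketch, run in the master database) is the sys.dm_operation_status DMV:

-- Run in master: check the state of recent ALTER DATABASE (scale) operations.
SELECT major_resource_id AS [Database], operation, state_desc, percent_complete, start_time
FROM sys.dm_operation_status
WHERE operation = 'ALTER DATABASE'
ORDER BY start_time DESC;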
Please note that scaling or pausing a SQL pool impacts running queries. For more information, see https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-manage-compute-overview#drain-transactions-before-pausing-or-scaling.
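Before pausing or scaling, it helps to confirm that no queries are still in flight; a minimal check against the SQL pool's request DMV:

-- List requests that are still active (not yet completed, failed, or cancelled).
SELECT request_id, status, submit_time, command
FROM sys.dm_pdw_exec_requests
WHERE status NOT IN ('Completed', 'Failed', 'Cancelled');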
The Azure Synapse SQL pool cost is based on two factors:
- Compute cost: The compute cost of a SQL pool is determined by the Data Warehouse Units (DWU) provisioned. For pricing details, see https://azure.microsoft.com/en-ca/pricing/details/synapse-analytics/.
- Storage cost: This includes data storage and snapshots (7 days of snapshots). Storage costs approximately $0.17 per TB per hour in the “East US 2” region, and applies whether the SQL pool is active or paused. For current figures, refer to the pricing link above.
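For illustration at that rate, storing 1 TB for a 30-day month works out to roughly $0.17 × 24 × 30 ≈ $122, running or paused; compute charges come on top while the pool is active.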
What are Data Warehouse Units?
A Synapse SQL pool represents a collection of analytic resources that are being provisioned. Analytic resources are defined as a combination of CPU, memory, and IO.
These three resources are bundled into units of compute scale called Data Warehouse Units (DWUs). A DWU represents an abstract, normalized measure of compute resources and performance.
For higher performance, increase the number of data warehouse units; for lower performance, reduce them. Storage and compute are billed separately, so changing data warehouse units does not affect storage costs.
There is an additional concept of performance tier, which is either Gen1 Data Warehouse Units (DWUs) or Gen2 compute Data Warehouse Units (cDWUs). Gen2 is optimized and includes a local cache on the compute nodes, and it is billed differently from Gen1. The latest SQL pool version runs on Gen2, so this article uses cDWUs for costing.
A SQL pool has exactly one cDWU setting at a time, and it can be scaled up or down.
For cost optimization, compute can either be:
- Paused: No compute charges apply in this case.
- Scaled down: Minimal charges apply as per the configured cDWU, but the database stays online.
If compute is paused, the database is not accessible until it is resumed, which can be done through the portal. When scaling up or down, user sessions are impacted for the duration of the operation.
Distributions
A distribution is the basic unit of storage and processing for parallel queries that run on distributed data.
Compute node sizing is a combination of nodes and distributions: a lower cDWU configuration has fewer nodes with more distributions per node, while a higher configuration has more nodes with fewer distributions per node. For every cDWU setting, the total number of distributions across all nodes is always 60.
When Synapse SQL runs a query, the work is divided into 60 smaller queries that run in parallel.
Each of the 60 smaller queries runs on one of the data distributions.
The graph below shows the cDWU configurations: for DW100c there is one node (indicated in blue) holding all 60 distributions (indicated in orange); at the highest cDWU there are 60 nodes, each holding one distribution.
A table in Synapse is spread across 60 distributions, and every compute scale (cDWU) maps to the same 60 distributions, so this architecture never needs to redistribute data when compute is scaled up or down. It is more like mounting the storage distributions onto the compute nodes (service levels).
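To see how a table's rows are spread across the 60 distributions, one option is the command below (the table name is a made-up example):

-- Show row counts and space used per distribution for a table.
DBCC PDW_SHOWSPACEUSED("dbo.FactSales");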
A table can use one of the following distribution methods:
- Round Robin
- Hash
- Replicated
For further information, see https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute.
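As a sketch of these options, the DDL below hash-distributes a hypothetical fact table on its join key and replicates a small dimension; all table and column names are illustrative. (Round robin, the default, simply spreads rows evenly with DISTRIBUTION = ROUND_ROBIN.)

-- Hash distribution: rows are placed by a hash of CustomerId, co-locating joins on that key.
CREATE TABLE dbo.FactSales
(
    SaleId BIGINT NOT NULL,
    CustomerId INT NOT NULL,
    Amount DECIMAL(18,2)
)
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX);

-- Replicated: a full copy of the table on each compute node, suited to small dimensions.
CREATE TABLE dbo.DimCustomer
(
    CustomerId INT NOT NULL,
    CustomerName NVARCHAR(100)
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);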
Choosing the right scale requires an understanding of the data and query patterns; start with a small scale and increase it gradually. Another factor is dynamically scaling up and down to match business schedules and workloads.
Final thoughts:
Azure Synapse has all the required components for data analytics and looks to be a cost-optimal solution. It is still evolving in terms of its on-demand capability, and I am looking forward to further enhancements to the scaling feature so that it does not impact current queries/transactions; ideally it should wait for running queries to finish. In addition, it would be nice to have a feature for linking multiple computes (service levels) to the same storage without changing the endpoint.
Please do share your feedback on this article.