As introduced in Chapter 1, “Gaining the Azure Data Engineer Associate Certification,” there are two options when it comes to Azure Synapse Analytics SQL pools: serverless and dedicated. When you run serverless, there is little need to be concerned about database table structures, because you are querying data that already exists in other sources, and those sources should already be optimized for performance. If they are not, then table structure does become a concern. When running on a dedicated SQL pool, however, you need to understand a bit more about table and data structure. You need to know this for the exam as well.
Azure Synapse Analytics dedicated SQL pools were previously known as Azure SQL Data Warehouse.
Distributed tables are a feature of Azure Synapse Analytics dedicated SQL pools. When you run a dedicated SQL pool, your data is sharded and spread across 60 distributions. Having your data rightly sharded, partitioned, and indexed helps get it efficiently placed across those distributions, and the Data Movement Service (DMS) moves data among them to provide optimal massively parallel processing (MPP) opportunities. Note that a COLUMNSTORE INDEX is not optimal for tables with fewer than 60 million rows: because the data is spread across 60 distributions, no performance enhancement is gained unless there are more than 1 million rows per distribution. There are three methods by which dedicated SQL pools distribute data, each shown in the sketch after this list:
- Round‐robin (default)
- Hash
- Replicated
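As a minimal T‐SQL sketch (the table and column names here are hypothetical, not taken from any sample schema), the distribution method is declared in the WITH clause of CREATE TABLE:

```sql
-- Hash distribution: rows are placed by hashing the chosen column, so all
-- rows with the same ELECTRODE_ID land on the same distribution.
CREATE TABLE dbo.FactReading
(
    READING_ID    INT          NOT NULL,
    ELECTRODE_ID  INT          NOT NULL,
    READING_VALUE DECIMAL(7,3) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ELECTRODE_ID),
    CLUSTERED COLUMNSTORE INDEX
);

-- Round-robin distribution (the default): rows are spread evenly with no
-- placement logic, which makes it a good fit for staging tables.
CREATE TABLE dbo.StageReading
(
    READING_ID    INT          NOT NULL,
    ELECTRODE_ID  INT          NOT NULL,
    READING_VALUE DECIMAL(7,3) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);

-- Replicated: a full copy of the table is cached on each compute node,
-- which suits small dimension tables that are frequently joined.
CREATE TABLE dbo.DimElectrode
(
    ELECTRODE_ID  INT          NOT NULL,
    LOCATION      NVARCHAR(50) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE
);
```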
ROUND‐ROBIN
This term is common and is used in many scenarios; it has a lot to do with load balancing. Imagine that you have 60 nodes running behind a controller. Requests to execute a query are sent to the controller, which must then choose a node to forward each request to. If many requests arrive at the same time or in short sequence, it makes little sense to send them all to only one of the 60 nodes; that would likely overload that node while leaving all the others idle. Round‐robin is an algorithm that keeps track of which nodes have already been forwarded a request and attempts to distribute incoming requests equally and sequentially across all the nodes in the pool. Round‐robin is the default distribution method, so if you leave the DISTRIBUTION = ROUND_ROBIN argument out of your table definition, that is the distribution you get.
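A quick way to see this default in action (the table name here is hypothetical) is to create a table without a DISTRIBUTION argument and then check the policy it received using the sys.pdw_table_distribution_properties catalog view:

```sql
-- No WITH clause at all: in a dedicated SQL pool the table defaults to
-- ROUND_ROBIN distribution and a clustered columnstore index.
CREATE TABLE dbo.StageSession
(
    SESSION_ID INT       NOT NULL,
    SESSION_DT DATETIME2 NOT NULL
);

-- Confirm which distribution policy the table actually received.
SELECT t.name,
       dp.distribution_policy_desc   -- expect ROUND_ROBIN
FROM   sys.tables AS t
JOIN   sys.pdw_table_distribution_properties AS dp
       ON t.object_id = dp.object_id
WHERE  t.name = 'StageSession';
```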
Round‐robin delivers the fastest performance when it comes to loading data; which rows land on which distribution is random, and the data is spread evenly across the distributions. Because there is no optimization regarding where the data is placed, query performance might be slower than with a hash‐distributed table.
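One common way to get the best of both behaviors (a sketch that builds on the hypothetical dbo.StageReading table shown earlier) is to land data quickly in a round‐robin staging table and then reshape it into a hash‐distributed table with CREATE TABLE AS SELECT (CTAS):

```sql
-- Reshape the staged rows into a hash-distributed columnstore table,
-- trading a one-time data movement step for faster downstream queries.
CREATE TABLE dbo.FactReadingFinal
WITH
(
    DISTRIBUTION = HASH(ELECTRODE_ID),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT READING_ID,
       ELECTRODE_ID,
       READING_VALUE
FROM   dbo.StageReading;
```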