There are many concepts you must be aware of, comfortable with, and competent in to manage data efficiently. This section covers data concepts that will not only help you pass the Data Engineering on Microsoft Azure exam, but also help you do the job in the real world. Keep in mind that when relational structures or tables are discussed, the context is generally Azure SQL or Azure Synapse Analytics SQL pools. In contrast, DataFrames lean toward Azure Synapse Analytics Spark pools. Much of the following is about Azure Synapse Analytics SQL pools. When this is not the case, the context will be specifically called out.
Sharding
Sharding is a technique used primarily to store data that is too large to fit in a single database. It is also useful for separating data within a database into smaller, faster, easier‐to‐manage pieces called shards. Shards distribute data across different machines, also known as nodes. Figure 2.9 illustrates how data might be sharded across numerous database instances.
FIGURE 2.9 Azure Synapse Analytics Sharding example
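The idea of spreading rows across nodes can be sketched in a few lines of Python. This is only an illustration of hash-based sharding in general, not how Azure Synapse Analytics implements it internally. The function names `shard_for` and `distribute`, the `customer_id` key, and the choice of four shards are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap to the shard count.
    return int.from_bytes(digest[:8], "big") % shard_count


def distribute(rows, key_field, shard_count):
    """Group rows by the shard that their key hashes to."""
    shards = defaultdict(list)
    for row in rows:
        shards[shard_for(str(row[key_field]), shard_count)].append(row)
    return dict(shards)


# Illustrative data: 12 rows spread over 4 shards (both numbers are arbitrary).
rows = [{"customer_id": i, "amount": i * 10} for i in range(1, 13)]
shards = distribute(rows, "customer_id", 4)

for index in sorted(shards):
    print(index, [row["customer_id"] for row in shards[index]])
```

Because the hash is deterministic, the same key always lands on the same shard, which is what lets a query for one customer go straight to a single node. How evenly the rows spread depends on the key that is chosen.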
There are two features in Figure 2.9 that need some explanation, beginning with the Data Movement Service (DMS). DMS is responsible for moving data across the nodes when that is necessary for executing queries and returning the results. This is a platform‐level feature, and its behavior is determined by the way your table distributions are implemented. Table distributions will be discussed in a later section.
The other feature is the massively parallel processing (MPP) Engine, which is shown as a layer between the control node and the compute nodes. MPP manages and coordinates the processing of queries across all the dedicated SQL compute nodes in the cluster. DMS moves the data, while MPP manages the state and optimizes queries that are running in parallel (in other words, at the same time). Numerous clients or users can connect to the control node and execute queries at the same time. Running those queries one after another (i.e., sequentially) is not optimal, so MPP manages the query executions and runs them as quickly and efficiently as possible.
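The division of labor between a control node and its compute nodes can be sketched as a scatter and gather pattern. The following is a conceptual Python sketch only; it does not represent the actual MPP Engine or DMS. The functions `partial_sum` and `run_query`, the thread pool standing in for compute nodes, and the sample data are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(shard_rows):
    """Work done by one (simulated) compute node on its own shard."""
    return sum(row["amount"] for row in shard_rows)


def run_query(shards):
    """Simulated control node: fan the work out in parallel, then combine results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        # Each shard is processed concurrently rather than one after another.
        partials = list(pool.map(partial_sum, shards))
    return sum(partials)


# Three hypothetical shards, each holding a slice of the table.
shards = [
    [{"amount": 10}, {"amount": 20}],
    [{"amount": 30}],
    [{"amount": 40}, {"amount": 50}],
]

total = run_query(shards)
print(total)
```

Each simulated node computes a result over its own data only, and the coordinator merges the partial results. A simple aggregate like this needs no data to move between nodes; queries that join rows held on different nodes are where a data movement step would come in.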