To enable caching for a session on an Azure Synapse Analytics SQL pool, you would execute the following command. Caching is OFF by default.

SET RESULT_SET_CACHING ON

The first time a query is executed, the results are stored in cache. The next time the same query is run, instead of parsing through all the data […]
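The caching behavior described above can be sketched in plain Python. This is a toy model, not the Synapse engine: the `ResultSetCache` class, the `fake_scan` function, and the sample row are all hypothetical, and keying the cache on the exact query text is a simplifying assumption.

```python
from typing import Callable, Dict, List


class ResultSetCache:
    """Toy model of session-level result set caching (illustrative only)."""

    def __init__(self, run_query: Callable[[str], List[tuple]]) -> None:
        self._run_query = run_query            # hypothetical function that scans the data
        self._cache: Dict[str, List[tuple]] = {}
        self.enabled = False                   # mirrors the OFF default
        self.executions = 0                    # number of times the data was actually scanned

    def execute(self, sql: str) -> List[tuple]:
        # Cache hit: return the stored rows without scanning again.
        if self.enabled and sql in self._cache:
            return self._cache[sql]
        self.executions += 1
        rows = self._run_query(sql)
        if self.enabled:
            self._cache[sql] = rows
        return rows


def fake_scan(sql: str) -> List[tuple]:
    # Stand-in for the expensive work of reading the table; the row is made up.
    return [("POW", 42)]


session = ResultSetCache(fake_scan)
session.enabled = True                         # analogous to SET RESULT_SET_CACHING ON
first = session.execute("SELECT * FROM READING")
second = session.execute("SELECT * FROM READING")
```

With caching enabled, the second call returns the stored rows and `session.executions` stays at 1; with `enabled` left at its default, every call triggers a new scan.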
Querying Data – CREATE DATABASE dbName; GO
Data is not very useful without some way to look at it, search through it, and manipulate it—in other words, querying. You have seen many examples of managing and manipulating data from both structured and semi‐structured data sources. In this section, you’ll learn many ways to analyze the data in your data lake, data warehouse, […]
DataFrame
Up to this point you have seen examples that created a DataFrame, typically identified as df from a spark.read.* method:

df = spark.read.csv('/tmp/output/brainjammer/reading.csv')

Instead of passing the data to load into a DataFrame as a path via the read.* method, you could load the data into an object, named data, for example:

data = 'abfss://<uid>@<accountName>.dfs.core.windows.net/reading.csv'

Once […]
GROUPBY()
This method provides the ability to run aggregation, which is the gathering, summary, and presentation of data in an easily consumable format. The groupBy() method provides several aggregate functions; here are the most common:

avg() Returns the average of grouped columns
count() Returns the number of rows in that identified group
max() Returns the largest value in […]
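To make the semantics of these aggregates concrete, here is a minimal sketch in plain Python rather than PySpark. It shows what grouping followed by avg, count, and max computes. The electrode names and values are made up for illustration and do not come from the brainjammer data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (electrode, value) readings; names and values are illustrative only.
readings = [
    ("AF3", 4.0),
    ("AF3", 6.0),
    ("T7", 1.5),
    ("T7", 2.5),
    ("T7", 5.0),
]

# Step 1: group the values by key, as groupBy() does conceptually.
groups = defaultdict(list)
for electrode, value in readings:
    groups[electrode].append(value)

# Step 2: apply one aggregate per group.
summary = {
    electrode: {"avg": mean(values), "count": len(values), "max": max(values)}
    for electrode, values in groups.items()
}
```

Each key in `summary` corresponds to one output row of a grouped DataFrame, with one column per aggregate.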
Data Programming and Querying for Data Engineers
To perform the duties of an Azure data engineer, you will need to write some code. Perhaps you will not need to have a great understanding of encapsulation, asynchronous patterns, or parallel LINQ queries, but some coding skill is necessary. Up to this point you have been exposed primarily to SQL syntax and PySpark, which […]
Feature Availability
Hadoop external tables, created using the previous SQL syntax, are available only when using dedicated SQL pools and support CSV, Parquet, and ORC file types. Notice that the following SQL syntax has no TYPE argument. Omitting TYPE is supported only on serverless SQL pools, with CSV and Parquet […]
Data Sources
There are many locations where you can retrieve data. In this section you will see how to read and write JSON, CSV, and Parquet files using PySpark. You have already been introduced to a DataFrame in some capacity. Reading and writing data can happen totally within the context of a file, or the data can […]
HASH
This distribution model uses a hash function to distribute the data, as shown in Figure 2.10. For large table sizes, this distribution model delivers the highest query performance. Consider the following snippet, which can be added to the script that creates the READING table: DISTRIBUTION = HASH([ELECTRODE_ID]) This results in the data being deterministically distributed across […]
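The idea of deterministic hash distribution can be sketched in plain Python. This is an illustration under stated assumptions, not how the SQL pool is implemented: SHA-256 is a stand-in for whatever hash function the engine uses, `distribution_for` is a hypothetical helper, the default of 60 distributions is an assumption chosen for the example, and the ELECTRODE_ID values are made up.

```python
import hashlib
from collections import Counter


def distribution_for(key: str, num_distributions: int = 60) -> int:
    """Map a distribution-column value to a distribution number.

    SHA-256 is only a stand-in hash; 60 is an assumed distribution count.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_distributions


# Hypothetical ELECTRODE_ID values for six rows of the READING table.
electrode_ids = ["1", "2", "3", "1", "2", "1"]

# Rows with the same ELECTRODE_ID always land on the same distribution.
placement = [distribution_for(e) for e in electrode_ids]
rows_per_distribution = Counter(placement)
```

Because the mapping depends only on the column value, all rows that share an ELECTRODE_ID are stored together, which is what makes joins and aggregations on that column efficient. A column with few distinct values would leave most distributions empty.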
Data Concepts
There are many concepts you must be aware of, comfortable with, and competent in to manage data efficiently. This section covers many data concepts that will not only help you pass the Data Engineering on Microsoft Azure exam, but also help you do the job in the real world. Keep in mind that when discussing relational structure […]
Create an Azure Cosmos DB
FIGURE 2.6 Azure Cosmos DB APIs
FIGURE 2.7 Azure Cosmos Data Explorer
FIGURE 2.8 Azure Cosmos Data Explorer SQL query

The first query returns the scenario from all the files in that container. The second query returns the first reading for a specific scenario.