
Data Programming and Querying for Data Engineers

To perform the duties of an Azure data engineer, you will need to write some code. Perhaps you will not need to have a great understanding of encapsulation, asynchronous patterns, or parallel LINQ queries, but some coding skill is necessary. Up to this point you have been exposed primarily to SQL syntax and PySpark, which targets the Python programming language. Going forward you might see a bit of C# code, but SQL and PySpark will be the primary syntax used in examples. There are, however, a few examples of data manipulation code in C# in the /Chapter02/Source Code directory here: https://github.com/benperk/ADE. Take a look at those code examples if you have not already done so.

Please note that there are many books that focus specifically on programming languages like Python, R, Scala, C#, and the SQL syntax, and there are also books dedicated specifically to database structures and data storage theory and concepts. This book helps improve your knowledge and experience in all those areas; however, the focus is to make sure you get the knowledge required to become a great Azure data engineer, which then will lead to becoming accredited by passing the DP‐203 exam.

Data Programming

The coding you will do in the context of the DP‐203 exam will take place primarily within an Azure Synapse Analytics notebook. I haven’t covered notebooks in detail yet, but I will in the coming chapters. Other places where you might find yourself coding are Azure Databricks, Azure Data Factory, and Azure HDInsight. What you learn in this section relating to PySpark and SparkSQL will work across all those products. As you’ll see later, C# (dotnet) will work in each of these scenarios in some capacity as well.

PySpark/Spark

You might have noticed the use of the %%pyspark magic command and the spark session object when loading a DataFrame or manipulating data. Table 2.6 compares PySpark and Spark, which can help clarify your understanding.

TABLE 2.6 PySpark vs. Spark

| PySpark | Spark |
|---|---|
| An API that allows Python to work with Spark | A computational platform that works with Big Data |
| Requires Python, Big Data, and Spark knowledge | Requires Scala and database knowledge |
| Uses the Py4j library written in Python | Written in Scala |
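To make the distinction concrete, here is a minimal sketch of a Synapse notebook cell that uses the %%pyspark magic to reach Spark through the PySpark API. The storage path, container, and account names are hypothetical placeholders, not values from this book:

```
%%pyspark
# Load a CSV file from a hypothetical ADLS Gen2 path into a DataFrame
df = spark.read.load(
    'abfss://container@account.dfs.core.windows.net/sample.csv',
    format='csv',
    header=True)

# Display the first ten rows and the inferred schema
df.show(10)
df.printSchema()
```

The Python code in the cell is the PySpark API at work; the engine that actually reads the file and distributes the computation is Spark itself.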

For more complete details, visit the official PySpark and Apache Spark documentation sites.

The remainder of this section will focus on PySpark example syntax that you might see while taking the DP‐203 exam. It would, however, be prudent to have a look at the official documentation sites to get a complete view of the syntax, language, and capabilities. You might also consider a book on those specific languages.
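As a preview of that syntax, the two notebook cells below sketch the same query expressed through the PySpark DataFrame API and through SparkSQL. The table name SALES and its columns are hypothetical examples, not objects defined earlier in this chapter:

```
%%pyspark
# DataFrame API: filter rows and project columns from a hypothetical table
df = spark.read.table('SALES')
df.filter(df.AMOUNT > 100).select('ID', 'AMOUNT').show()
```

```
%%sql
-- SparkSQL: the same filter and projection expressed declaratively
SELECT ID, AMOUNT FROM SALES WHERE AMOUNT > 100
```

Both cells produce the same result; which style you use is largely a matter of preference and of which magic command the notebook cell declares.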
