The previous chapter introduced you to both Azure Stream Analytics/Event Hubs and Apache Spark/Apache Kafka. Those products are what you use to implement a data streaming solution, as illustrated in Figure 2.20. Notice the various kinds of data producers that can feed into Kafka. Any device that has permission and that can send correctly formatted data to the Kafka Topic works. Note that the Kafka distribution includes a producer shell, which you can use for testing. Apache Spark is a consumer of the data sent to Kafka. Apache Spark can then store the stream to files, hold the data in memory, or make the data available to an application. Before you begin reading the stream from your Apache Spark instance, you need a subscription to a Kafka Topic. When you subscribe to a topic, you receive a notification when data arrives. The notification contains either the data itself or metadata that the consumer can use to pull the data from Kafka.

FIGURE 2.20 Data streaming solution using Apache Kafka and Apache Spark
As the data enters your Apache Spark cluster, you can process and store it. The data can be stored in many formats, such as JSON, TXT, or CSV files. It is also possible to hold the data in memory or dump it to a command console. To read a JSON‐formatted stream of data from a Kafka Topic, you would use the spark.readStream method:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<IP:PORT>")
  .option("subscribe", "brainwave_topic")
  .option("startingOffsets", "latest")
  .load()
This code snippet first identifies that the stream source is Kafka. The first option() method supplies the address of the Kafka bootstrap server (a hostname or IP address and port) through the kafka.bootstrap.servers option. The next option() method tells Kafka which topic you want to subscribe to. The default value for startingOffsets is latest, which means the stream only receives data that arrives after the query starts. The alternative is earliest, which requests all the data retained in the topic at the time the code snippet is executed. Once the data is read, you can write the stream to any of the numerous datastores or to a console. The following code snippet illustrates how to write the stream to a console:
df.writeStream.format("console").outputMode("append").start().awaitTermination()
You can use a wide range of formats; the most common are file types (json, parquet, csv, etc.), console, and memory. When the format() method is a file type, the only supported outputMode is append. Other formats support update or complete modes. The start() method begins the write process and will continue writing until the session is terminated. Here are a few additional examples of writing the stream of data:
df.writeStream.format("json").option("path", "/path/to/brainwaves").start()
df.writeStream.format("memory").queryName("Brainwaves").start()
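Both of those examples are abbreviated. The following is a minimal sketch, assuming the same streaming DataFrame df from the earlier readStream snippet; the table name, output path, and checkpoint path are placeholders. The memory sink registers an in-memory table, named by queryName(), that you can query with Spark SQL while the stream runs, and a file sink also requires a checkpointLocation option and a call to start() before any data is written:
// A sketch only; df is the streaming DataFrame read from Kafka earlier, and
// "Brainwaves", the output path, and the checkpoint path are placeholders.

// The Kafka source exposes binary key and value columns, so cast the value
// to a string to make the JSON payload readable when it is written out.
val events = df.selectExpr("CAST(value AS STRING) AS value")

// Memory sink: registers an in-memory table that can be queried with
// Spark SQL while the streaming query is running.
val memoryQuery = events.writeStream
  .format("memory")
  .queryName("Brainwaves")
  .outputMode("append")
  .start()

spark.sql("SELECT * FROM Brainwaves").show()

// File sink: writes JSON files to the given path; a checkpoint location
// is required so the query can recover its progress.
val fileQuery = events.writeStream
  .format("json")
  .option("path", "/path/to/brainwaves")
  .option("checkpointLocation", "/path/to/checkpoint")
  .outputMode("append")
  .start()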
C# and .NET
Many, if not most, people who work with Big Data use programming languages such as Python, R, and Scala. It is equally true that when you work with Microsoft products, you will run into C# and .NET at some point. Having options is good for a company, because it can choose the language based on the skills of its staff. It is also good for Microsoft, because the door is no longer closed to people who know, use, and prefer open source technologies. In this section you will learn about a few C#‐compatible software development kits (SDKs) that exist for Azure Data Solution development.
The integrated development environment (IDE) most commonly used for coding C# is Visual Studio. There is also a popular Microsoft IDE named Visual Studio Code, which is a lightweight but feature‐rich alternative to the full Visual Studio edition. The content that follows targets Visual Studio. Notice in Figure 2.21 that there are some data‐oriented workloads available with the IDE.

FIGURE 2.21 Visual Studio data workloads
When you install these workloads, templates for those kinds of projects are installed, which can be helpful for getting started. When you create one of those projects, such as a SQL Server database project, you get a list of the types of items you can add to the project and build upon, such as External Table, External Data Source, Inline Functions, Tables, and Views. These templates and workloads exist so that you do not have to start from scratch, which is very beneficial, especially if you are not highly skilled in the area or have a tight deadline. An SDK is similar to templates in that you do not need to write all the code to perform common tasks, like making a connection to a database. Using an SDK, you can accomplish that in less than five lines of code, as the sketch that follows illustrates. Without the SDK, the same task could require hundreds of lines of code. Thank goodness for SDKs.
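To make that point concrete, here is a minimal, hypothetical sketch of connecting to an Azure SQL database from a .NET 6+ console application using the Microsoft.Data.SqlClient NuGet package; the server, database, credentials, and table name are placeholders you would replace with your own values:
// A sketch only; the connection string values and the table name are
// placeholders, and the Microsoft.Data.SqlClient package must be referenced.
using Microsoft.Data.SqlClient;

var connectionString =
    "Server=tcp:<yourServer>.database.windows.net,1433;" +
    "Database=<yourDatabase>;User ID=<user>;Password=<password>;Encrypt=True;";

// Opening the connection takes only a few lines; the SDK handles the
// protocol, authentication, and network details for you.
using var connection = new SqlConnection(connectionString);
connection.Open();

// Run a simple query and print the first column of each returned row.
using var command = new SqlCommand("SELECT TOP 5 * FROM dbo.<yourTable>", connection);
using var reader = command.ExecuteReader();
while (reader.Read())
{
    Console.WriteLine(reader[0]);
}
The connection itself really is just a SqlConnection constructed from a connection string plus a call to Open(); everything else is querying and reading the results.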