
DataFrame

Up to this point you have seen examples that created a DataFrame, typically identified as df, from a spark.read.* method:

df = spark.read.csv('/tmp/output/brainjammer/reading.csv')
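The read.csv() method also accepts optional parameters. As a brief illustration, assuming the CSV file has a header row, the following variation names the columns from that row and infers the column data types:

df = spark.read.csv('/tmp/output/brainjammer/reading.csv', header=True, inferSchema=True)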

Instead of passing the path of the data directly to the read.* method, you could first store it in a variable, named data, for example:

data = 'abfss://<uid>@<accountName>.dfs.core.windows.net/reading.csv'

Once you have the reference point set in the data variable, you can hand it to a read.* method, for example, spark.read.csv(data). A DataFrame can also be built without reading from a path at all: the createDataFrame() method constructs one from data that already exists in memory, such as a list of rows or a pandas DataFrame, and also takes a schema parameter. In the following call, data refers to such an in-memory collection, not a path:

df = spark.createDataFrame(data, schema)
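Here is a minimal sketch of that pattern, assuming an active SparkSession named spark; the rows and column names are hypothetical, chosen to mirror the scenario data used later in this section:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical in-memory rows to load into a DataFrame
data = [(1, 'ClassicalMusic'), (2, 'FlipChart')]

# An explicit schema: column names, data types, and nullability
schema = StructType([
    StructField('SCENARIO_ID', IntegerType(), False),
    StructField('SCENARIO', StringType(), True)])

df = spark.createDataFrame(data, schema)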

Once the data has been read into a DataFrame, many actions are available that you can take on that data. Remember that a DataFrame is immutable: the data loaded into it cannot be changed in place, and transformations instead return new DataFrames. The data can be structured like a table, having columns and rows with an associated schema and cast data types. The distribution of a DataFrame onto the nodes in the Spark pool, for example, is handled by the platform, which uses its own optimizers to determine the best placement and distribution model. Let's take a closer look at a few of the most important methods that you should know when working with a DataFrame. In practice, these methods are prefixed with df., such as df.show().
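To see immutability in practice, consider this brief sketch, assuming df contains the SESSION_DATETIME column shown later in this section:

# filter() returns a new DataFrame; df itself is unchanged
dfRecent = df.filter(df.SESSION_DATETIME >= '2021-07-31')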

SHOW()

This method renders rows to an output console, for example:

df.show(5, truncate=False, vertical=True)

The following parameters are supported:

n  Identifies the number of rows to return; if no value is passed, up to 20 rows are rendered.

truncate  True by default, meaning only the first 20 characters of each column value are rendered. If set to False, the entire column value is rendered.

vertical  False by default. If set to True, rows and columns are listed one after another from top to bottom, instead of the common left-to-right row and column alignment.
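A few illustrative calls, assuming df is the DataFrame loaded earlier; note that truncate also accepts an integer, which limits rendered values to that many characters:

df.show()                  # up to 20 rows, values truncated to 20 characters
df.show(3)                 # the first 3 rows
df.show(5, truncate=10)    # 5 rows, values truncated to 10 characters
df.show(5, vertical=True)  # 5 rows, rendered top to bottom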

JOIN()

The concept of a JOIN hasn't been fully covered yet; it is discussed later in this chapter. In summary, a JOIN is the means for combining data, based on given criteria, from two DataFrames into one.

dfSession = spark.read.csv('path/session.csv', header=True)
+------------+-------------+---------+------------------+
| SESSION_ID | SCENARIO_ID | MODE_ID | SESSION_DATETIME |
+------------+-------------+---------+------------------+
| 1          | 1           | 2       | 2021-07-30 09:35 |
| 2          | 1           | 2       | 2021-07-31 10:15 |
| 3          | 2           | 2       | 2021-07-30 12:49 |
| …          | …           | …       | …                |
+------------+-------------+---------+------------------+

dfScenario = spark.read.csv('path/scenario.csv', header=True)

+-------------+----------------+
| SCENARIO_ID | SCENARIO       |
+-------------+----------------+
| 1           | ClassicalMusic |
| 2           | FlipChart      |
| …           | …              |
+-------------+----------------+

dfSession.join(dfScenario, dfSession.SCENARIO_ID == dfScenario.SCENARIO_ID)

+------------+-------------+---------+------------------+-------------+----------------+
| SESSION_ID | SCENARIO_ID | MODE_ID | SESSION_DATETIME | SCENARIO_ID | SCENARIO       |
+------------+-------------+---------+------------------+-------------+----------------+
| 1          | 1           | 2       | 2021-07-30 09:35 | 1           | ClassicalMusic |
| 2          | 1           | 2       | 2021-07-31 10:15 | 1           | ClassicalMusic |
| 3          | 2           | 2       | 2021-07-30 12:49 | 2           | FlipChart      |
| …          | …           | …       | …                | …           | …              |
+------------+-------------+---------+------------------+-------------+----------------+

The default join type is inner, meaning rows that match are combined and rows without a match are excluded from the result. You can find a summary of all DataFrame methods on the Apache Spark documentation website at https://spark.apache.org. There are numerous methods that provide some very powerful capabilities.
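A different join type can be requested through the how parameter. As a brief sketch reusing the DataFrames above: passing the join column by name produces a single SCENARIO_ID column in the result, and how='left' keeps sessions that have no matching scenario:

# 'left' keeps every session, even those without a matching scenario;
# joining on the column name avoids the duplicated SCENARIO_ID column
dfJoined = dfSession.join(dfScenario, on='SCENARIO_ID', how='left')
dfJoined.show()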
