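This method creates a global temporary view, which has a lifetime of the Spark application. If a view with the same name already exists, then an exception is thrown.
df.createGlobalTempView('Brainwaves')
df2 = spark.sql('SELECT Session.POWReading.AF3[0].THETA FROM global_temp.Brainwaves')
Notice that the argument following FROM is the name of the view created in the previous line of code, qualified with global_temp. Spark registers global temporary views in the system-preserved global_temp database, so a query must reference the view through that prefix.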
CREATEORREPLACEGLOBALTEMPVIEW()
This method does the same as the previous method, but if a view with the same name already exists, no exception is thrown. Instead, the existing view is replaced (i.e., overwritten) by one based on the new data.
df.createOrReplaceGlobalTempView('Brainwaves')
df2 = df.filter('Session.POWReading.Counter > 5')
df2.createOrReplaceGlobalTempView('Brainwaves')
df3 = spark.sql('SELECT Session.POWReading.AF3[0].THETA FROM global_temp.Brainwaves')
This code snippet calls the createOrReplaceGlobalTempView() method twice: once to initialize the view and then again after the data is filtered using the filter() method. Finally, the data you wanted is extracted using SQL syntax.
When an object has application scope, it means anyone who has access to that application has access to that object. Multiple users can have access to an object with application scope. An object with session scope is accessible only to the user who instantiated that object.
CREATETEMPVIEW()
This method creates a view that has a lifespan of the Spark session. If the name of the view already exists, an exception is thrown.
df.createTempView('Brainwaves')
df2 = spark.sql('SELECT Session.POWReading.AF3[0].THETA FROM Brainwaves')
CREATEORREPLACETEMPVIEW()
You can probably guess how this one differs from the previous. This method does the same as the previous method, but if a view with the same name already exists, it is replaced instead of an exception being thrown.
df.createOrReplaceTempView('Brainwaves')
df2 = df.filter('Session.POWReading.Counter > 5')
df2.createOrReplaceTempView('Brainwaves')
df3 = spark.sql('SELECT Session.POWReading.AF3[0].THETA FROM Brainwaves')
PySpark Functions
In addition to the methods associated with a DataFrame that you just read about, there are many built-in PySpark functions. These functions are not called on a DataFrame object; instead, you import them and pass their results to DataFrame methods such as select() and withColumn(). There are numerous functions; a summary of the most common is provided here. You’ll find the official PySpark documentation, along with the complete API reference, at https://spark.apache.org.
EXPLODE()
There was a section about this function earlier, “Explode Arrays.” Review that section if you don’t remember specifically what it does. I’m covering it here again because you might be asked about it on the exam. The important thing to note is that this function parses an array and places each element of the array into its own row. Recall how to use explode():
from pyspark.sql.functions import explode
dfe = df.select('Session.Scenario', explode('Session.POWReading.AF3'))
This code snippet is from the previous example. It places each element of the AF3 electrode's array of frequency values into its own row, in a single new column that is named col by default.
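To make the row-per-element behavior concrete without a Spark cluster, here is a plain-Python sketch that mimics what explode() does to an array column. The explode_model() helper, the scenario names, and the readings are all hypothetical; this is a model of the behavior, not how Spark implements it.

```python
# Hypothetical rows: one scenario per row, with an array of AF3 readings.
rows = [
    {"Scenario": "ClassicalMusic", "AF3": [4.1, 3.7, 2.9]},
    {"Scenario": "Meditation", "AF3": [5.0]},
    {"Scenario": "Sleeping", "AF3": []},
]


def explode_model(rows, keep_key, array_key):
    """Return one (kept value, element) tuple per array element."""
    exploded = []
    for row in rows:
        # A missing or empty array contributes no output rows.
        for element in row[array_key] or []:
            exploded.append((row[keep_key], element))
    return exploded


exploded = explode_model(rows, "Scenario", "AF3")
for scenario, reading in exploded:
    print(scenario, reading)
```

Three input rows become four output rows: the Scenario value is repeated once per array element, and the row with the empty array disappears, which matches how explode() treats null or empty arrays.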
SUBSTRING()
It is common to work with strings when coding, and a frequent need is to extract a specific part of a string. The substring() function does this by taking a starting position and a length. It is useful for parsing out a date, for example.
from pyspark.sql.functions import substring
data = [(1, '2021-07-30'), (2, '2021-07-31')]
columns = ['id', 'session_date']
df = spark.createDataFrame(data, columns)
# withColumn() returns a new DataFrame, so the result is assigned back to df
df = df.withColumn('year', substring('session_date', 1, 4)) \
    .withColumn('month', substring('session_date', 6, 2)) \
    .withColumn('day', substring('session_date', 9, 2))
df.show()
+----+--------------+------+-------+-----+
| id | session_date | year | month | day |
+----+--------------+------+-------+-----+
| 1  | 2021-07-30   | 2021 | 07    | 30  |
| 2  | 2021-07-31   | 2021 | 07    | 31  |
+----+--------------+------+-------+-----+
Notice that substring() is not preceded with a df., which illustrates that it is not a method of the DataFrame; rather, it is a function from the pyspark.sql.functions module. Notice also the withColumn() method, which was covered in the previous section. It adds a column to the DataFrame, or replaces an existing column that has the same name. The following example uses the select() method with the substring() function. The year, month, and day values are the same, although only the selected columns are returned, so id is not included.
df.select('session_date', substring('session_date', 1, 4).alias('year'),
          substring('session_date', 6, 2).alias('month'),
          substring('session_date', 9, 2).alias('day'))
Instead of using the withColumn() method to add and name each column, the alias() method is used to name the column that each substring() call returns.
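One detail that is easy to miss in these calls is that the position argument of substring() is 1-based, unlike Python string indexes, which start at 0. The following plain-Python sketch models that mapping for positive positions; substring_model() is a hypothetical helper written for illustration and is not part of PySpark.

```python
def substring_model(value: str, pos: int, length: int) -> str:
    """Model substring(col, pos, len) for a positive, 1-based pos."""
    start = pos - 1  # convert the 1-based position to a 0-based index
    return value[start:start + length]


session_date = "2021-07-30"
year = substring_model(session_date, 1, 4)
month = substring_model(session_date, 6, 2)
day = substring_model(session_date, 9, 2)
print(year, month, day)
```

The arguments (1, 4), (6, 2), and (9, 2) are the same ones passed to substring() in the session_date examples, which is why the month starts at position 6 rather than 5.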