This method provides the ability to run aggregation, which is the gathering, summary, and presentation of data in an easily consumable format. The groupBy() method provides several aggregate functions; here are the most common:
avg() Returns the average of grouped columns
count() Returns the number of rows in that identified group
max() Returns the largest value in the group
mean() Returns the mean value of the group
min() Returns the smallest value in the group
sum() Returns the total value of the group
You can again either append a show() or add the result to a new DataFrame:
df.groupBy(‘frequency_id’).avg(‘value’).show()
df2 = df.groupBy(‘session_id’).count()
df.groupBy(‘frequency_id’).max(‘value’).show()
df.groupBy(‘frequency_id’).mean(‘value’).show()
df.groupBy(‘frequency_id’).min(‘value’).show()
df2 = df.groupBy(‘frequency_id’).sum(‘value’).where(col(‘value’)> 2.254)
The last line of the preceding code snippet uses the where() method to place a projection onto the resulting data.
WHERE() AND FILTER()
The where() method is an alias for filter(). These methods provide the same capabilities, which include accepting a condition and returning the rows that match the condition. You can render the result into a console or load the result into a new DataFrame.
df.filter(df.electrode_id == 2).show()df2 = df.filter(df.electrode_id == 2)
ORDERBY() AND SORT()
The sort() method performs the same manipulation as orderby(). Passing one or more column names to the method results in the output of the data being rendered in ascending order—in other words, from smallest to largest. Sorting in descending order is also supported, which results in the output being rendered largest to smallest.
df.orderBy(df.electrode_id.asc(), df.frequency_id.asc()).show()
df.sort(col(‘AF3/theta’).desc()).show()
df2 = df.sort(col(‘AF3/alpha’).desc())
As seen in the first line, you can identify which column to sort or order by referencing the element name in the DataFrame. Also notice in the first line that it is possible to sort and order by multiple columns. In that case, the data is sorted by the first parameter, then by the second. It is also possible to provide the name of the column as a string and pass it to the method. Finally, rendering the data to the console for review or into another DataFrame for further transformation is a typical activity.
SELECT()
This method returns only the columns identified and provides the ability to reduce the amount of rendered data. Either append a show() at the end to render the value in a console or load a DataFrame with the result. Adding ‘*’ to the select() method results in all columns being returned.
df.select(‘AF3/theta’, ‘AF3/alpha’).show()
df.select(‘*’)
df2 = df.select(‘AF3/theta’, ‘AF3/alpha’)
WITHCOLUMN()
This method provides one way to rename or add a column of data that will be loaded into a DataFrame:
df2 = df.withColumn(‘AF3 THETA’, col(‘AF3/theta’)) \
.withColumn(‘AF3 ALPHA’, col(‘AF3/alpha’)) \
.printSchema()
root
|– AF3 THETA: decimal (nullable = true)
|– AF3 ALPHA: decimal (nullable = true)
You may have a need to rename a column for some reason. For example, the existing column name may not represent the data well, or perhaps a downstream system that cannot be changed needs the column to be a different name. The withColumn() method is a way for you to change the column names for data to be loaded into a DataFrame.
The following DataFrame methods create a table or a view that gives you the ability to execute queries on the data using SQL syntax. A DataFrame should already be loaded with data. Using SQL syntax is an alternative to using the select() method covered earlier.