Working with dates and datetimes presents several challenges. In many scenarios a date is stored as a string, so before you can perform any calculation with it, the value must be converted to a date data type. Additionally, the date format is often specific to the platform or stack with which you are working, which adds complexity and can require additional troubleshooting. In general, you can find descriptions online to help you work through format problems, and the error messages are usually helpful. Getting a date into the format you require may be challenging at times, but it’s doable.
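To make the format problem concrete, here is a minimal plain-Python sketch (standard library only, no Spark session required) that parses the same moment from two differently formatted strings. The sample strings, the list of formats, and the helper name `parse_session_date` are illustrative assumptions, not part of any platform's API.

```python
from datetime import datetime

# Hypothetical list of formats a pipeline might encounter.
KNOWN_FORMATS = ["%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M", "%d.%m.%Y %H:%M"]

def parse_session_date(text, formats=KNOWN_FORMATS):
    """Return a datetime for the first format that matches, else raise ValueError."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"unrecognized date format: {text!r}")

iso = parse_session_date("2021-07-30 09:35:00")
us = parse_session_date("07/30/2021 09:35")

# Once parsed, both strings describe the same moment.
print(iso == us)
# A real datetime supports arithmetic that a string does not.
print((datetime(2021, 7, 31, 10, 15) - iso).days)
```

The order of the formats matters for ambiguous strings such as 01/02/2021, so a real pipeline should know which convention its source uses rather than guess. In PySpark, to_date() and to_timestamp() accept an optional format pattern as a second argument for the same purpose.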
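from pyspark.sql.functions import col, to_date, to_timestamp

# spark is assumed to be an existing SparkSession
data = [(1, '2021-07-30 09:35:00'), (2, '2021-07-31 10:15:00')]
columns = ['id', 'session_date']
df = spark.createDataFrame(data, columns)
# Keep the result so that later snippets can reference the new columns
df = df.withColumn('date', to_date(col('session_date'))) \
       .withColumn('timestamp', to_timestamp(col('session_date')))
df.show()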
+----+---------------------+------------+---------------------+
| id | session_date        | date       | timestamp           |
+----+---------------------+------------+---------------------+
| 1  | 2021-07-30 09:35:00 | 2021-07-30 | 2021-07-30 09:35:00 |
| 2  | 2021-07-31 10:15:00 | 2021-07-31 | 2021-07-31 10:15:00 |
+----+---------------------+------------+---------------------+
The value in the session_date column is a string, but the other two columns, date and timestamp, are not strings because they were cast into the date and timestamp data types. Review the following to gain a better understanding:
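from pyspark.sql.functions import col, current_date, datediff

# alias() must be applied to the datediff() result, not to its argument
df.select(col('date'), current_date().alias('today'),
          datediff(current_date(), col('date')).alias('date_diff')).show()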
+------------+------------+-----------+
| date       | today      | date_diff |
+------------+------------+-----------+
| 2021-07-30 | 2021-11-29 | 122       |
| 2021-07-31 | 2021-11-29 | 121       |
+------------+------------+-----------+
Notice the PySpark method current_date(), which returns the current date as determined by the system clock where the query is executed. The datediff() method compares two dates and returns the number of days between them. This method expects inputs of type date; a string that Spark cannot interpret as a date will not yield a usable result. There is another date-related method, months_between(), which, as its name implies, returns the number of months between two dates. The result can be fractional when the two dates fall on different days of the month.
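from pyspark.sql.functions import col, current_date, months_between

# alias() must be applied to the months_between() result, not to its argument
df.select(col('date'), current_date().alias('today'),
          months_between(current_date(), col('date')).alias('month_diff')).show()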
+------------+------------+------------+
| date       | today      | month_diff |
+------------+------------+------------+
| 2021-07-30 | 2021-11-29 | 3.96774194 |
| 2021-07-31 | 2021-11-29 | 3.93548387 |
+------------+------------+------------+
Again, the inputs need to be dates, not free-form strings. Date content stored in a string must first be converted to a supported date type for the date methods to run without exception.
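The day and month arithmetic can be checked by hand. The following is a plain-Python sketch, not Spark code: it assumes the rule given in Spark's documentation for months_between(), namely whole months when both dates fall on the same day of the month or are both month-ends, and otherwise a fraction based on a 31-day month, rounded to 8 digits. The function names are local stand-ins, and the sketch ignores any time-of-day component.

```python
import calendar
from datetime import date

def days_between(end, start):
    """Whole days from start to end, as a date subtraction."""
    return (end - start).days

def is_last_day(d):
    """True when d is the final day of its month."""
    return d.day == calendar.monthrange(d.year, d.month)[1]

def months_between_sketch(end, start):
    """Approximation of the documented months_between() rule for pure dates."""
    whole_months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day == start.day or (is_last_day(end) and is_last_day(start)):
        return float(whole_months)
    # Differing days of the month contribute a fraction of a 31-day month.
    return round(whole_months + (end.day - start.day) / 31, 8)

today = date(2021, 11, 29)  # fixed here so the sketch is reproducible
for d in (date(2021, 7, 30), date(2021, 7, 31)):
    print(d, days_between(today, d), months_between_sketch(today, d))
```

Because the fraction depends on the day of the month rather than on elapsed days, two dates one day apart can differ by slightly more or less than 1/31 of a month in other cases, for example around month-ends.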
WHEN()
This method is used when you want to apply conditions to the data in a DataFrame. It is similar to an if...else statement in Python or C#, or to CASE WHEN cond1 THEN result1 ELSE result2 END in SQL syntax. The when() method is commonly used in conjunction with otherwise(), which sets a default value for rows that do not match any of the when() conditions.
from pyspark.sql.functions import when

df2 = df.withColumn("new_scenario",
    when(df.Scenario == "ClassicalMusic", "Music")
    .when(df.Scenario == "FlipChart", "News")
    .otherwise(df.Scenario))
Assume that df was loaded with the SCENARIO table. Running the previous code snippet on the data would result in the following:
+-------------+----------------+--------------+
| SCENARIO_ID | SCENARIO       | new_scenario |
+-------------+----------------+--------------+
| 1           | ClassicalMusic | Music        |
| 2           | FlipChart      | News         |
| 3           | Meditation     | Meditation   |
| 4           | MetalMusic     | MetalMusic   |
| …           | …              | …            |
+-------------+----------------+--------------+
Using the withColumn() method results in the creation of a column with the provided name. The column is populated with the results of the when() and otherwise() methods. If a row matches one of the when() conditions, the corresponding new value is placed in the column; if there is no match, otherwise() applies the original Scenario value.
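The row-by-row logic of a when()/otherwise() chain can be sketched in plain Python. This is only an illustration of the semantics, not how Spark evaluates the expression; the helper `when_otherwise` and the `rules` list are hypothetical names introduced for this example.

```python
def when_otherwise(value, rules, default):
    """Return the result of the first rule whose condition matches, else the default."""
    for condition, result in rules:
        if condition(value):
            return result  # first match wins; later rules are not consulted
    return default

# Each rule pairs a condition with the replacement value, in the order of the chain.
rules = [
    (lambda s: s == "ClassicalMusic", "Music"),
    (lambda s: s == "FlipChart", "News"),
]

scenarios = ["ClassicalMusic", "FlipChart", "Meditation", "MetalMusic"]
# Passing the original value as the default mirrors .otherwise(df.Scenario).
new_scenario = [when_otherwise(s, rules, default=s) for s in scenarios]
print(list(zip(scenarios, new_scenario)))
```

Passing default=None instead corresponds to leaving otherwise() off the chain in PySpark, in which case unmatched rows receive a null value.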