The word static has numerous meanings, and the one that applies depends on the context in which it is used. In the database context, it means that once a schema is defined and created, it will not change. You find static schemas in relational (aka structured) databases. If you recall from the previous section, the SQL command used to create the schema was quite simple: just two statements. However, constructing the contents that live within the schema takes much design, planning, and management. That leads back to database tables, which exist within a database schema. In the relational database context, there is a concept known as schema on write, which is represented by the following steps and sketched in SQL after the list:
- Create the table.
- Load the table with data.
- Query the table.
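As a point of reference, here is a minimal SQL sketch of those three steps, reusing the MODE table from the ALTER example that follows. The column definitions and sample value are hypothetical and exist only to illustrate the flow; the sample value is intentionally longer than 25 characters to set up the failure described next.
-- 1. Create the table; the schema is fixed from this point forward.
CREATE TABLE MODE
(
    MODE_ID INT IDENTITY(1,1) PRIMARY KEY,
    MODE    NVARCHAR(50) NOT NULL
);
-- 2. Load the table with data.
INSERT INTO MODE (MODE) VALUES ('THIS-MODE-VALUE-IS-LONGER-THAN-25-CHARACTERS');
-- 3. Query the table.
SELECT MODE_ID, MODE FROM MODE;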
Assume that the amount of data loaded is very large and continues to grow, and that you need to change the data type of an existing column or add a new column. This is a very time-consuming action, because performing an ALTER command on the table to make this change will not work without some additional actions. Consider the following SQL syntax: if any value in that column is longer than 25 characters, the ALTER command will fail.
ALTER TABLE MODE
ALTER COLUMN MODE NVARCHAR(25)
The only way to make this kind of change is to remove the data in the column before changing it, or to drop and re-create the table. For small datasets this might be acceptable, but for large datasets it is not, due to the time required to execute those commands and actions, not to mention the risk of data loss.
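For illustration, a minimal sketch of the drop-and-re-create approach might look like the following, assuming the same hypothetical MODE table; the MODE_STAGE staging table is also hypothetical, and the LEFT() truncation shows one place where the data loss mentioned above can occur.
-- Preserve the existing rows in a staging table before dropping the original.
SELECT MODE_ID, MODE INTO MODE_STAGE FROM MODE;

-- Drop and re-create the table with the desired column definition.
DROP TABLE MODE;
CREATE TABLE MODE
(
    MODE_ID INT IDENTITY(1,1) PRIMARY KEY,
    MODE    NVARCHAR(25) NOT NULL
);

-- Reload the data, truncating values that no longer fit (this is where data loss can occur).
INSERT INTO MODE (MODE)
SELECT LEFT(MODE, 25) FROM MODE_STAGE;
-- Drop MODE_STAGE once the reload has been verified.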
Dynamic Schema
The opposite of a static schema is a dynamic one. This kind of schema is most prevalent in nonrelational databases. If you recall from the “Schema” section earlier, the table schema is inferred when the data is loaded into the DataFrame. If a schema is not defined when the DataFrame is loaded with data, Spark will do its best to infer the data types based on the row values. Here is an example of a schema definition using PySpark, followed by the loading of data from a data source like ADLS that references the schema:
from pyspark.sql.types import StructType, StructField, TimestampType, DecimalType

schema = StructType([
    StructField('Timestamp', TimestampType(), False),
    StructField('AF3/theta', DecimalType(), False),
    StructField('AF3/alpha', DecimalType(), False), …])
df = spark.createDataFrame(data, schema)
What this means is that the table schema can be modified prior to loading the data from its source. The schema is interpreted as the data is being loaded into the DataFrame, so if the data type of a given column changes, you can change it in the schema definition before you load the data.
Recognize that how an Azure data engineer approaches and consumes data occurs from many different vectors, where a vector is relational, nonrelational, and so forth. How you consume the data and the constraints within which you must operate are significantly different from one vector to the next. The point is that someone who has more experience with relational databases typically expects structure, and the concepts in a structured environment don’t always align with the concepts in a dynamic environment. When you are using a Spark pool on Azure Synapse Analytics, you would typically pull data from a nonrelational source and load it into memory for transformation and modeling, whereas in a SQL pool the data is most likely coming from a structured relational source and is expected to be in a defined table. Therefore, numerous kinds of table schemas and methods for loading data must exist, because there are numerous kinds of data storage structures. Lastly, I used the words typically and likely in the previous comments on purpose: although Spark pools have historically worked well with nonrelational data and SQL pools with relational data, it does not mean that either cannot be used in both scenarios.
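To contrast the explicitly defined schema shown earlier with an inferred one, here is a minimal PySpark sketch that lets Spark derive the column data types from the source data itself; the ADLS path, file layout, and option values are hypothetical placeholders.
# Schema-on-read: Spark infers the column names and data types from the file contents.
df = spark.read \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .csv('abfss://data@myadlsaccount.dfs.core.windows.net/brainwaves/*.csv')

# Review the inferred schema before transforming or modeling the data.
df.printSchema()
Inference like this is convenient for exploration, but the explicit StructType definition shown earlier gives you control over data types and nullability, which matters once the data is modeled and consumed downstream.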