Menu

Index – CREATE DATABASE dbName; GO

In its most common use, an index is the place you look for a key term to find a page number for the detailed explanation of that term. You will find an index at the end of this book. If you look for the term Index, you will find this page referencing it. An index in the database world is a technique used to speed up queries. Typically, you create an index on the pattern of most common WHERE clauses. Just in case you are not clear on how a standard SQL statement looks, here is an example:

SELECT VALUE from READING WHERE ELECTRODE_ID = 2 AND FREQUENCY_ID = 5

Without an index, the query would scan the entire table looking for rows of data that match the WHERE clause. However, if you create an index using the following snippet, then the query would perform much faster because the DBMS would already know where those rows are, just like an index in a book:

CREATE INDEX idx_electrode_frequency
ON READING (ELECTRODE_ID, FREQUENCY_ID)

If you do not create any indexes in your Azure Synapse Analytics dedicated SQL pool, then a Clustered Columnstore Index (CCI) is created for you. Recall the CLUSTERED COLUMNSTORE INDEX statement from the previous section, which illustrated the creation of a READING table. A CCI is the format in which data is physically stored on a disk. There are two indexing aspects you should know: logical and physical storage. The logical aspect is the way in which you can approach interfacing with the data using SQL queries, C#, or Python code to filter, retrieve, and analyze it. Physically, however, the data that you are accessing is stored in a file that exists on a storage drive. The CCI instructs the DBMS how to store the data in the physical file so that both an increase in performance and a decrease in storage capacity are achieved. The DBMS is then able to decipher the commands you send and apply them onto the datastored in the file and retrieve the result as quickly as possible. There are two other types of indexes: heap and clustered. Replacing CLUSTERED COLUMNSTORE INDEX with either HEAP or CLUSTERED INDEX results in those kinds of indexing.

A HEAP index and a CLUSTERED INDEX both store the data by row instead of by column as with a CLUSTERED COLUMNSTORE INDEX. The difference between the two is that a HEAP index stores the data without any underlying order, whereas the CLUSTERED INDEX does. Data stored with a HEAP will perform faster when queried, but a CLUSTERED INDEX will perform faster when manipulating data with INSERT, UPDATE, and DELETE commands. If you want to create an index when using a Spark pool, there is a library named Hyperspace. You can create an index for a Spark pool using the following syntax:

from hyperspace import hyperspacehs = Hyperspace(spark)
df = spark.read.json(‘abfss://<*.dfs.core.windows.net>/brainjammer.json’)
hs.createIndex(df, IndexConfig(‘ELECTRODE_ID’, [‘FREQUENCY_ID’], [‘VALUE’])

This command will create files in an index folder on your ADLS container. This will incur more cost for storage but less cost on the compute side because the queries will run more quickly and require less compute power. At the moment, Hyperspace does not provide an in‐memory capability, so the indexed data must be written to and read from disk.

Tables

A table is something that was referred to earlier in Figure 2.2. It is a structure that contains data organized in rows and columns. When running queries on Azure Synapse Analytics SQL pools, there are many considerations that pertain to how a table is created and how the data is stored within it. Again, keep in mind that when focusing on Spark pools, queries are often performed on data loaded into memory. They are primarily semi‐structured, loaded from a file and not relational.

Leave a Reply

Your email address will not be published. Required fields are marked *