If you already know what the term projection means, then you can use that as a basis for the meaning of pruning. You can also use the literal meaning of the word, which involves trimming branches of a tree or a bush. Also, many times there are some stems that simply come out of nowhere and make the entire bush look awkward, so that one gets cut away. Implementing projection is rather simple. You would not execute a global query like the following:
SELECT * FROM READING
Instead, you can project that query by adding a WHERE clause, which will reduce the amount of data returned:
SELECT * FROM READING WHERE ELECTRODE_ID = 1
Both projection and pruning will reduce the amount of time required to perform the query, but for different reasons. In many of the examples so far where HASH distribution tables have been created, you notice there has been either ELECTRODE_ID or FREQUENCY_ID used as the distribution key of the partition. Rightly assume that there are five electrodes. If you perform a query that is looking for data WHERE ELECTRODE_ID = 1, then the partitions that contain ELECTRODE_ID IN (2, 3, 4, 5) will not be searched, since that data will not exist on those nodes. Nodes are used here in place of partitions because it is common, when using huge datasets, that a partition is placed on its own node. An optimizer, which is typically part of the DBMS, is what implements such query execution plans. There is also a term called explain plan. Both topics are discussed in Part IV, “Secure, Monitor, and Optimize Data Storage and Data Processing.”
Data Shuffling
The data shuffling concept is typically applied to tables with a DISTRIBUTION value of HASH. Shuffling (or reshuffling when shuffling has taken place once already) is a data movement operation that results in the transfer of data from one node to another. This is an expensive operation and is likely to increase query execution latency. Shuffling cannot be avoided 100 percent of the time; it is often inevitable. However, knowing some reasons shuffling can happen is valuable. There are two primary reasons: The first is due to joining two tables that have different distributions. For example, if you have configured the READING table to be HASH distributed and SCENARIO to be ROUND_ROBIN, it is possible that the data on the node that is selected to execute the query does not have the SCENARIO data, which then requires the data to be transferred.
The other example to watch out for is like the first but has to do with joining two tables, both of which are HASH, but with different distribution keys. If you think about that for a second, you can probably already visualize why this might cause a data shuffle. For example, a READING table with a HASH distribution key of ELECTRODE_ID and another table READING_TMP with a HASH distribution key of FREQUENCY_ID can cause a data shuffle. If you executed a query that joined those two tables together, the data may not exist on the same node. In that case, the data from one would need to be moved/shuffled to the one on which the query is executed. JOINs and aggregates have not been fully explained up to now but are covered later in this chapter in the Querying data section. Skip ahead if you want to learn about them now, and then come back here.