
Data Management

Don’t confuse data management with database management, where the focus is on the mechanics of the DBMS. When you run your database on the Azure platform and select a PaaS product, the management of that database is no longer your or your company’s responsibility. Instead, the focus here is the management of the data: where the data is stored, who has access to it, what its meaning and use cases are, how long the data is useful, and how long it should be retained. You can apply a Time To Live (TTL) configuration to data, which helps manage its lifetime. Many of these topics have been touched on already in this chapter and will likely be covered again, but they are important concepts that pertain to data management. The data needs to be as close as possible to those who consume or produce it so that latency and potential data loss are minimized. There may also be constraints on the location where the data can be stored, based on industry and governmental regulations.
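If your datastore happens to be Azure Cosmos DB, for example, TTL can be set as a default on the container and overridden per item. The following is a minimal sketch using the azure-cosmos Python SDK; the account URL, key, database, container, and partition key names are placeholders, and the TTL values are illustrative only.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Placeholder connection details -- replace with your own account values.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("<database-name>")

# default_ttl is in seconds; items expire 30 days after their last write
# unless they carry their own "ttl" property.
container = database.create_container_if_not_exists(
    id="brainwaves",
    partition_key=PartitionKey(path="/session"),
    default_ttl=30 * 24 * 60 * 60,
)

# A per-item "ttl" overrides the container default; this reading
# expires after 7 days.
container.upsert_item({
    "id": "reading-001",
    "session": "meditation",
    "electrode": "AF3",
    "value": 42.7,
    "ttl": 7 * 24 * 60 * 60,
})
```

Setting TTL once at the container level keeps lifetime management out of your application logic; the per-item override is useful when a subset of the data has a different legal or business retention requirement.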

From a data privacy and governance perspective, you need to know the sensitivity level of each column, node, or file in your datastore. You also need to control access: you must know who has access, who is accessing the data, and how often. If you consider the data being used in this book, brainwave data, the validity of the data is likely determined by the sophistication of the device that captured it. Once better devices exist, the data captured today is no longer as useful as what can be collected with a better BCI. That leads to the question: how long should the data be retained?

There are numerous reasons to retain data. For one, being able to revisit historical data to show the path of progression from beginning to end can add value; for example, if all things remain equal, the trend will continue in that direction. Further, if an investigation is ever required to uncover the reasons for missing or beating projections, historical data is a place to find explanations for the outcome. This kind of data analytics is called predictive analytics. There are also reasons for expiring data, such as legal obligations, cost, and data staleness. A widely adopted privacy regulation, the General Data Protection Regulation (GDPR), contains guidelines for the retention of data under given circumstances. One general example is bookkeeping documents, where the retention period is seven years, after which the data should be permanently deleted. The retention period also depends on the type of data; for example, if it contains information that can identify an individual human, then it should be deleted as quickly as possible.
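As a concrete illustration of a retention rule like the seven-year bookkeeping example, the sketch below checks whether a record has outlived its retention period and should be purged. The category names and periods are hypothetical and stand in for whatever your legal and governance requirements actually dictate.

```python
from datetime import datetime, timezone, timedelta

# Hypothetical retention periods per data category (not legal guidance):
# bookkeeping records follow the seven-year example above, while data
# that identifies an individual is kept only briefly.
RETENTION = {
    "bookkeeping": timedelta(days=7 * 365),
    "personally_identifiable": timedelta(days=30),
    "brainwave_reading": timedelta(days=2 * 365),
}

def should_delete(category: str, created_at: datetime) -> bool:
    """Return True when a record is older than its retention period."""
    period = RETENTION.get(category)
    if period is None:
        return False  # unknown category: keep until it has been classified
    return datetime.now(timezone.utc) - created_at > period

# Example: a bookkeeping record created in 2015 is past its seven-year window.
print(should_delete("bookkeeping", datetime(2015, 1, 1, tzinfo=timezone.utc)))
```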

Data retention also comes with a cost. Consider the cost of storage space on ADLS Gen2, which currently starts at $0.00099 per gigabyte in the archive tier and runs up to $0.15 per gigabyte in Premium mode. If you are storing massive amounts of data, which is common in a Big Data context, you can do the math and determine how much storing a terabyte or even a petabyte of data costs; it is significant. Finally, at some point data becomes less relevant. For example, consider that the AF3/AF4 electrodes on a BCI have a 10 percent margin of error. When a new BCI device is created with a margin of error of 2.5 percent, the old data collected with a 10 percent margin of error is mostly useless. There is no real reason to store that data any longer. Delete it and save the money for the new, more precise data collections.
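To make that math concrete, the sketch below multiplies the per-gigabyte figures quoted above by a terabyte and a petabyte of stored data. It assumes those prices are monthly per-gigabyte rates and ignores transaction, retrieval, and redundancy charges, so treat the output as a rough order-of-magnitude estimate rather than a quote.

```python
# Per-gigabyte prices quoted above (assumed to be monthly rates).
PRICE_PER_GB = {"archive": 0.00099, "premium": 0.15}

GB_PER_TB = 1_024
GB_PER_PB = 1_024 ** 2

for tier, price in PRICE_PER_GB.items():
    tb_cost = price * GB_PER_TB
    pb_cost = price * GB_PER_PB
    print(f"{tier:>8}: 1 TB ≈ ${tb_cost:,.2f}/month, 1 PB ≈ ${pb_cost:,.2f}/month")

# Roughly: archive 1 PB ≈ $1,038/month, premium 1 PB ≈ $157,286/month.
```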
