The Present and Future of Policy-based Storage Automation
by Kamel Shaath, CTO, KOM Networks Inc.
We live in a digital age, where everything has to happen now, pronto, on the double. To cope, we put as much of our lives on “automatic” as we can, both personally and professionally, just to keep up. Well, it is now possible to apply the same principles of automation to the management of our data, saving not only time but also the significant costs that come from inefficient use of storage resources.
The data management problem is really simple: most applications cannot handle more than one source or target path for their data. Some applications have evolved over time and can handle part of this limitation after some manual administrative intervention, but most have not. So why is this important?
Well, eventually your storage resources will fill up or reach their expansion limit, so the application will be exposed to this problem; it’s just a matter of time. The big issue is that every application has to deal with the problem on its own; there is no standard mechanism that every application applies.
At the end of the day, several aspects have to be addressed when dealing with storage resource expansion:
Can I expand without disrupting access to the existing data?
This is not a new problem; historically the goal was to maintain the same data path and namespace, which minimizes the impact on existing applications. To achieve this you must be able to extend or expand your existing storage repository. Otherwise, you have to copy the data to the new storage resources, wait for the process to complete, then point your applications to the new location before users can start using the application again. In other cases expansion means reconfiguring the hardware, which requires a full backup, then a reconfiguration, then a restore of the data before the applications can be brought back online. In other words, this is not an easy task, and its difficulty depends greatly on your current hardware configuration and data organization.
How can I take advantage of the newer higher availability, lower cost storage hardware?
In most cases the newer hardware will not be compatible with your existing infrastructure, and you will most likely lose the redundancy and enhanced reliability of the new hardware once it is folded into your existing repository during expansion. The alternative is to eliminate the existing storage infrastructure from the configuration entirely: perform a full backup, reconfigure onto the new hardware, and restore the backup. In other cases a simple recursive copy of the existing data to the new hardware may be enough. Hopefully you will be able to use the older storage resources for something else.
How can I take advantage of the higher performance hardware?
The answer is similar to the previous question: you will either have to eliminate the older hardware or dedicate specific application data to the newer, leaner and meaner hardware in order to take advantage of it.
The good news is that storage vendors in both hardware and software disciplines are working on resolving these kinds of problems. The fact is that applications were not designed to manage storage resources, but merely to use them to store valuable information. The applications’ only concern is that there must be a path and available capacity. At the same time there are so many different applications out there that cover every possible aspect of our daily interactions that it would be virtually impossible to have all the different vendors, consultants and integrators agree on a common storage mechanism. Instead there is a focus from the storage vendors, especially on the software side, to make the storage repositories more intelligent and more attentive to the needs and demands of the applications.
Given the current status of applications, the best solution would be to introduce the intelligence in a middleware layer that can understand the context of the data that the applications create and use, as well as manage the various heterogeneous storage resources, which the applications have no context for. So what does this mean?
Applications usually interact with files, and each file has a context, type and attributes. A file is just a bunch of blocks (JBOB) that must maintain a certain predefined sequence in order to preserve its context; a Word document is significantly different from a TIFF image, and so on. At the same time, storage resources have varying characteristics across connectivity, reliability, performance, administration and configuration. Wouldn’t it be nice to automatically match data types to the most appropriate and cost-effective class of storage?
In the real world we typically store and distribute information based on a number of variables:
Frequency of Access
Regardless of the type of data, the frequency of access will diminish over time. The rate of decline in frequency of access will vary based on the data type and the associated application.
Regulatory obligations
Different local, state and federal government bodies and agencies have varying data retention requirements, which differ by the type of data and the need to maintain the records. These agencies include the FAA, SEC, ATF, IRS, FDA, NTSB, NASA, DOD and DOE.
Organizational/departmental mandates and directives
Some data is maintained forever; other data should be destroyed and shredded at a predetermined point in its lifecycle (data is usually an asset for some period of time, but can become a liability at the end of its useful life).
Performance requirements
Each application has different reliability and accessibility requirements. The decision on where to store data depends on the number of users, the type of information and the frequency of access. Data for applications that are critical to the organization is usually stored on the fastest, most reliable storage resources.
Data Lifecycle
There is a lifecycle for each type of data, which depends on the associated application as well as the need to fulfill other user obligations and mandates. The data lifecycle covers where the data is stored from inception until archiving or possible destruction. The fact remains that even though a particular application may be of greater importance, not all of its data should be maintained on the most reliable, highest-performance, and usually most expensive, storage real estate.
Latency and Proximity to users
In every organization time is money or, in some cases, a matter of life or death. The acceptable time to access data is relative to its importance, and this usually determines where the data is stored and how long it remains there before it is migrated to another, less frequently accessed storage resource.
The fact is that every application has storage requirements and different types of data associated with it. The most interesting aspect is that most applications are interrelated and hence share some common lifecycle and storage management requirements. This means that data belonging to applications can be grouped into data classes, where the classification is based on the associated application and on access and retention requirements.
Each data class can utilize a data lifecycle. The lifecycle would manage and determine where the data is stored relative to a number of variables. The parameters that control the destination where the data is stored should be based on characteristics and attributes of the data. These properties could include name mask, extension, age, last access, last modification, date of creation as well as file attributes. The entire process should be automated through the use of rules that would enforce the lifecycle storage policies.
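To make this concrete, here is a minimal sketch in Python of how such a rule might be expressed. The data class names, attribute fields and thresholds are purely illustrative assumptions, not taken from any particular product.

```python
import os
import time
from dataclasses import dataclass

@dataclass
class LifecycleRule:
    """A hypothetical rule mapping file attributes to a data class."""
    data_class: str                # e.g. "active", "reference", "archive"
    extensions: tuple              # file extensions the rule applies to
    max_days_since_access: float   # files idle longer than this fall through to the next rule

def classify(path: str, rules, default: str = "archive") -> str:
    """Return the data class for a file based on its extension and last-access age."""
    ext = os.path.splitext(path)[1].lower()
    idle_days = (time.time() - os.path.getatime(path)) / 86400
    for rule in rules:
        if ext in rule.extensions and idle_days <= rule.max_days_since_access:
            return rule.data_class
    return default

# Illustrative rule set: recently touched documents stay "active",
# older ones become "reference", everything else falls through to "archive".
RULES = [
    LifecycleRule("active",    (".docx", ".xlsx", ".tif"), 30),
    LifecycleRule("reference", (".docx", ".xlsx", ".tif"), 365),
]
```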
Effective lifecycle management also requires classification of the storage resources. Resources with common attributes and properties, ranging from connectivity, manufacturer, manageability and reliability to Quality of Service (“QoS”), Service Level Agreements (“SLAs”), performance and latency, are grouped together to create layers within the storage pool. This classification of the storage resources is important to simplify the distribution of the data according to the policies in the lifecycles. The ideology behind classifying storage resources is similar in principle to the concept of moving old boxes of stuff to the attic and keeping all the newer toys in the kids’ playroom.
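Pairing those data classes with storage classes can be described just as declaratively. The following Python sketch shows one possible representation; the layer names, mount points and cost figures are invented for illustration and are not drawn from any specific product.

```python
from dataclasses import dataclass

@dataclass
class StorageClass:
    """A hypothetical grouping of storage resources with common properties."""
    name: str
    mount_points: list    # volumes or paths that belong to this layer of the pool
    latency_ms: float     # rough access-latency expectation for the layer
    cost_per_gb: float    # relative cost, used when choosing a destination

# Illustrative layers: fast disk for active data, bulk disk for reference data,
# and near-line or removable media for archive data.
TIERS = {
    "active":    StorageClass("active",    ["/pool/fast0", "/pool/fast1"], 1.0,     0.50),
    "reference": StorageClass("reference", ["/pool/bulk0"],                10.0,    0.10),
    "archive":   StorageClass("archive",   ["/pool/tape0"],                60000.0, 0.01),
}

def destination_for(data_class: str) -> StorageClass:
    """Map a data class to a storage class; unknown classes land on the cheapest layer."""
    return TIERS.get(data_class, TIERS["archive"])
```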
These concepts in themselves are not new; however, the ability to apply them in a completely automated fashion is revolutionary. Automatically provisioning data into data classes, whose storage policies are managed via lifecycles that determine in which storage class the data is stored, lifts the burden of storage management from the applications.
This means that applications would not have to worry about supporting different storage targets and destinations to store and retrieve different classes of information. At the same time, the administrative tasks associated with capacity and allocation management would be significantly reduced. At the end of the day, consolidation delivers to the application a single volume with a single namespace that can be accessed to store and retrieve information without disruption.
There are several aspects to automating the entire storage management process: resource consolidation, resource provisioning, and automated data distribution. Under this model:
- Storage consolidation enables the aggregation of resources into common repositories of storage classes. This improves utilization by allowing applications access to all the storage classes. The consolidated volumes are managed and molded by the administrator to meet changes in demand: new storage classes can be added at will, and aging or failing storage resources can be retired. This facilitates modernization without impacting productivity; new technology can be phased in while older, high-operating-cost resources are phased out dynamically.
- The consolidated storage creates a large pool that is available to all applications. Storage resources are not allocated indefinitely to a specific application. Rather, unused capacity is pulled from the available pool only as applications need additional space in a given storage class. This greatly improves overall storage utilization and reduces the need to continually purchase new storage for data-generating applications.
- The data lifecycles are employed by the administrator to distribute the data according to whatever objectives have to be met. These lifecycles control which data remains on which resource, for how long, and where it will be stored next. This matters because data that must be accessed immediately can be kept on the fastest, most reliable storage resources while less frequently accessed data can be kept on slower, less reliable resources. The lifecycles manage where the data is stored when it is created and whether it ever has to be relocated (a sketch of such a sweep appears below). The benefits of this kind of automation include improved utilization and reduced human intervention.
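As a rough illustration of how such a lifecycle might be enforced, the sketch below periodically walks a consolidated volume and relocates files whose last-access age no longer matches their layer. The thresholds and pool paths are assumptions made for the example, and a real system would preserve the namespace (for instance with stubs or links) rather than simply moving files.

```python
import os
import shutil
import time

# Hypothetical layers, ordered from most to least frequently accessed.
# Each entry: (maximum idle days tolerated in the layer, root path of that layer).
TIER_POLICY = [
    (30,           "/pool/fast"),     # touched within the last month
    (365,          "/pool/bulk"),     # touched within the last year
    (float("inf"), "/pool/archive"),  # everything older
]

def lifecycle_sweep(volume_root: str) -> None:
    """Relocate files whose last-access age no longer matches the layer they sit on."""
    now = time.time()
    for dirpath, _dirs, files in os.walk(volume_root):
        for name in files:
            src = os.path.join(dirpath, name)
            idle_days = (now - os.path.getatime(src)) / 86400
            # Pick the first layer whose idle threshold the file still satisfies.
            target_root = next(root for limit, root in TIER_POLICY if idle_days <= limit)
            if not src.startswith(target_root):
                # A production system would leave a stub or link behind so the
                # application-visible path and namespace remain unchanged.
                shutil.move(src, os.path.join(target_root, name))
```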
These kinds of policies can be extended to control when resources are allocated and how they are utilized. At the same time, the classification of resources helps reduce the management burden on the administrative side. The overall performance of the system can be optimized by moving the most frequently accessed data onto the fastest resources in response to changes in demand, and the policies can allocate and de-allocate resources to match those changes. All of these policies can be defined in advance or adjusted on demand by the administrator.
This is not just storage resource management (SRM); we are talking about middleware. It is important to recognize that these kinds of data manipulation policies and processes cannot take place without context, and that context cannot be acquired at a later stage when the resources are scanned for their characteristics. The best place to capture and act on it is the layer that can manage and maintain that context: the file system.
So be prepared, because a whole new breed of non-conventional file systems will appear in the near future. These file systems will incorporate the intelligence required to act on storage allocation, automated provisioning and consolidation, and they are the most logical place to introduce this type of intelligence. They are not your standard, run-of-the-mill file systems, but rather virtual file systems that span resources, networks and servers, consolidating and automating the processes of storage management.
Submitted by KOM Networks Inc.
Source: Data Storage Connection