Archive for the ‘Network Security’ Category

Drive Failures and Data Protection in Large-capacity Storage Repositories

Monday, July 19th, 2010

As we see more and more customers’ data repositories headed into hundreds of terabytes and even petabytes, the issues of data protection, both in production and archive, are growing areas of concern for them. With more organizations wanting to minimize tape – or simply finding that they have too much data to archive that way – the emphasis shifts to better on-disk protection models that are flexible enough to accommodate changing needs with regard to data primacy.

In the traditional protection world, RAID 5 has largely given way to RAID 6, and RAID 7 appears on the horizon as the latest potential fix to what is a sliding scale of drive reliability mapped against repository size.

Drives fail. Always have, probably always will. They also do not always succeed in giving you back the same data you originally gave them, creating soft error rates that must be taken into account with true failures when determining data integrity. But as drives get larger and rebuild times proportionately longer, it becomes critical to either dramatically improve error rates or improve protection schema beyond current RAID capabilities – or both.

Drive error rates are reported as 10^x, meaning one predicted bit read error for every 10^x bits read. This measurement is referred to with some variability as the Bit Error Rate (BER), Unrecoverable Read Error (URE), or Unrecoverable Bit Error (UBE) – for the limited purposes of this article those terms are interchangeable. A 10^14 error rate represents a bad bit for every 12.5TB of reads, while at 10^16 that capacity increases to 1.25PB – it is easy to see that, at scale, enterprise-class drives with better BER ratings (larger exponents) are very important.
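The conversion from a rated exponent to a read volume is simple arithmetic, and a short sketch makes it easy to check the figures above. The function name is illustrative, and decimal units (1TB = 10^12 bytes) are assumed because they match the numbers quoted in this article.

```python
# Convert a drive's rated error exponent into the volume of reads per expected bad bit.
# Assumption: decimal units (1 TB = 10**12 bytes), chosen to match the article's figures.

def bytes_per_error(exponent: int) -> float:
    """Bytes read, on average, per one unrecoverable bit error for a 1-in-10^exponent rating."""
    bits_per_error = 10 ** exponent
    return bits_per_error / 8  # 8 bits per byte

for exp in (14, 15, 16):
    tb = bytes_per_error(exp) / 1e12
    print(f"10^{exp}: one expected bad bit per {tb:,.1f} TB read")
```

This reproduces the 12.5TB figure for 10^14 and the 1.25PB (1,250TB) figure for 10^16.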

If you imagine a 15-drive shelf at RAID 5 (assuming 1TB drives), there is a hot spare and a parity drive – that leaves about 13TB of useable space. Under normal conditions, BER is not an issue, as parity will take care of any bad data; however, if a drive fails, the entire 13TB has to be read. If the drives are rated 10^14 (1 error per 12.5TB), a read error during the rebuild is more likely than not; statistically, about one is expected. This means file loss or, worse, a rebuild error. In a ‘hot’ area of the repository, heavy utilization (and often younger drives if that data set is expanding) can cause decreased drive lifetimes[i], and often negatively impact access times.
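To put a rough number on that rebuild risk, the sketch below estimates the chance of at least one unrecoverable read error while reading the full 13TB. It assumes bit errors are independent and occur at exactly the rated frequency, which is a simplification: real drives tend to fail in correlated ways, so treat the result as an order-of-magnitude guide rather than a prediction. The function name is illustrative.

```python
import math

def p_read_error(bytes_read: float, exponent: int) -> float:
    """Probability of at least one unrecoverable bit error while reading bytes_read bytes.

    Assumption: independent bit errors at a rate of 1 in 10^exponent.
    """
    bits = bytes_read * 8
    per_bit = 10.0 ** -exponent
    # 1 - (1 - p)^n, computed with log1p/expm1 to avoid rounding loss for tiny p
    return -math.expm1(bits * math.log1p(-per_bit))

rebuild_bytes = 13e12  # the 13TB that must be read after a drive failure
for exp in (14, 15, 16):
    print(f"10^{exp}: {p_read_error(rebuild_bytes, exp):.1%} chance of a read error")
```

Under these assumptions a 10^14 rating gives roughly a two-in-three chance of hitting an error during the rebuild, while a 10^16 rating brings it down to about one percent.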

If drives get bigger and the BER stays the same, eventually the lines cross and there is a failure on every rebuild. RAID fails at this point. In 2007, storage pundit Robin Harris suggested that RAID 5 would be dead in 2009; it seems not to be true on the whole, but arguably very true at PB scale.[ii] Solutions then offered for this problem included mirroring data, replicating full data sets, or taking 1-2 weekly fulls – all inefficient ways to manage PB-scale repositories.

So many organizations now use RAID 6, where dual parity coupled with improved BER provides a much greater sense of data integrity. There are trade-offs, however: higher CPU utilization, increased wattage/GB, and higher parity overhead. Many repositories are efficiently functioning within those limits, but the risk continues to grow at scale and will eventually threaten RAID 6 just as it currently does RAID 5.[iii]
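The parity overhead part of that trade-off can be sketched in a few lines. The 14-drive array is an illustrative assumption (the 15-drive shelf from earlier, less its hot spare), and the calculation covers capacity only; it says nothing about CPU utilization or wattage.

```python
def parity_overhead(drives: int, parity_drives: int) -> float:
    """Fraction of raw array capacity consumed by parity."""
    if not 0 < parity_drives < drives:
        raise ValueError("parity_drives must be between 1 and drives - 1")
    return parity_drives / drives

array_size = 14  # assumption: a 15-drive shelf with one drive held back as a hot spare
for name, parity in (("RAID 5", 1), ("RAID 6", 2)):
    print(f"{name}: {parity_overhead(array_size, parity):.1%} of raw capacity used for parity")
```

For a fixed array size, moving from single to dual parity doubles the capacity given up to protection.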

While the 2019 reference in the Harris article foretelling the death of RAID 6 is admittedly a long way off – much can change within the industry in that scale of time – the more immediate problem appears to be drive throughput and its impact on rebuild times. Large drive capacities, coupled with lagging throughput metrics, create windows of vulnerability; when a drive fails, RAID 6 becomes RAID 5 for the duration of the rebuild (hours? days?), and RAID 5 simply becomes nothing. These are the windows in which data can be lost.
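A back-of-the-envelope estimate shows how that window grows with drive size. The capacities and throughput figures below are hypothetical values chosen for illustration, not measurements from any particular drive, and the result is a best case: it assumes the rebuild runs at full sustained throughput with no competing production I/O.

```python
def rebuild_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Best-case hours needed to write one replacement drive end to end.

    Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
    """
    seconds = (capacity_tb * 1e12) / (throughput_mb_s * 1e6)
    return seconds / 3600

# Hypothetical drive sizes and sustained rebuild rates, for illustration only
for capacity_tb in (1, 2, 4):
    for throughput in (100, 50):
        hours = rebuild_hours(capacity_tb, throughput)
        print(f"{capacity_tb} TB at {throughput} MB/s: {hours:.1f} hours exposed")
```

Because the window scales linearly with capacity and inversely with throughput, doubling drive size without a matching gain in throughput doubles the time the array spends with reduced protection.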

Parts of the industry are now actively calling for RAID 7 implementations to stay ahead of the curve[iv], but many of our customers view increased costs and decreased efficiency as an excuse to take a look at new approaches – dynamic protection schema, faster rebuild environments, protection from silent data corruption, and customizable CPU allocation within their repositories as ways to stem the tide of the RAID chase and bring efficiency and integrity to their ever-increasing data storage needs.


[i] “Failure Trends in a Large Disk Drive Population”, Eduardo Pinheiro, Wolf-Dietrich Weber, Luiz André Barroso, 5th USENIX Conference on File and Storage Technologies (FAST 2007)

[ii] http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162

[iii] http://storagemojo.com/2010/02/27/does-raid-6-stops-working-in-2019

[iv] http://queue.acm.org/detail.cfm?id=1670144

The Handling Of Critical Assets

Tuesday, September 29th, 2009

A couple of weeks ago, I boarded a flight out of Philadelphia on my way to Accunet’s Atlanta office. Poor weather in Philly had created a bit of a fiasco, but an hour after being announced as 73rd in line for take-off we were aloft. As we made our way south, the captain came on a couple of times to update us on our progress, sensitive to the fact that some of the passengers needed to make connections at Hartsfield to such faraway places as Oslo and Tel Aviv. Unfortunately the news was not much better once back on the ground; the scheduling mishap had left us without a parking space and meant another 20 minutes on the tarmac waiting for an open gate.

Needless to say, the level of collective angst amongst the passengers was increasing rapidly. I had no time constraints whatsoever, as my objective for the evening was nothing more than getting to a rental car and making my way 20 miles or so to my hotel; nonetheless, I was more than ready to be off of this particular flight. That got me thinking about my fellow travelers, and that almost every one of them was likely under more pressure than I to deplane. For those looking to make connections to Europe and beyond, every minute translated into a probability of an unscheduled night in Atlanta, time lost on expensive vacations, family or business events missed – for the airline, these passengers were critical assets.

It is an interesting exercise to read about the ergonomic analysis that the airline industry has performed in an effort to determine the best way to load a plane. Regardless of whether you might be inclined to argue that the reverse pyramid method is better than a rotating zone, one inescapable conclusion is that the plane is not going to move until everyone is in a seat and everything brought on board is stowed with at least a passing wave to safety. And as we have all seen, no loading plan can remain pristine when airlines must tip their hats to frequent flyers, people who need extra help, small children whose parents are bringing them so that they can board sooner, people with mismatched socks, etc. In the end, the boarding process simply represents the confluence of 100-odd lives to share a relatively brief common experience, and any efficiencies gained are relatively minor in the grand scheme of things. But what about the deplaning process? Clearly far less science has been applied to that end of the transaction, as all airlines seem resigned to the Free-for-All model – the fastest, least-burdened, rudest people almost always get to leave first.

So let’s go back to my flight to Atlanta, and find something relevant here aside from a rant on airline processes and human nature (despite both being such easy targets). There are people on the plane for whom minutes mean the continuation of their journeys or significant upheaval, which could impact the airline in hard or soft dollars. There is a group for whom any more inconvenience could induce them to choose another airline for their next trip. There are those who are aggravated, but will withstand whatever delay remains. And within each of these groups are the frequent or infrequent flyer groups of which the airline wants so desperately to keep track. The problem is that the airline has no idea where any of these groups are positioned in its aircraft.

So imagine a system that allowed the airline to make those distinctions and allowed passengers to move in alignment with the criticality of their circumstance. The cabin lights dim, and then a few seats at a time are lit and those passengers are free to leave. The frequent flyer heading to Oslo makes her flight; the family trying to get to Tel Aviv for a relative’s 100th birthday makes the party; those who need a good night’s sleep before an early morning meeting get it; and people like me who on this trip have no constraints wait our turn. The airline maximizes its handling of its critical assets, and those assets function within the system in accordance with their priority.

I see analogies in our clients’ efforts to retain, categorize, and protect the critical data assets that define their businesses. A lot of resources are expended on getting the passengers onto the plane and into the right seats – faster networks, converged server populations, higher-performance disk structures. Where things get challenging is when it comes time to differentiate that data, be it for the purpose of archiving, managing, or protecting it.

The protection decision may be the most challenging, as it is the hardest to modify after-the-fact. If an organization archives data incorrectly or inefficiently there could certainly be some opportunity costs, but an analysis of the data can often point the way to more effectual and economic ways to administer things moving forward; if instead it mischaracterizes data and lets important data out the door it may prove too late to recover from the loss of intellectual property or sensitive customer information.

Chances are you are doing things to protect data as it moves through the network, hopefully have ways to at least see that data should it try to leave your network, and have probably identified your endpoints – mobile and otherwise – as danger spots. But what about data at rest? Successfully characterizing and protecting that information helps not only with security while the data sits in its natural habitat, but puts organizations at an advantage when that data starts to move and must be monitored and controlled.

Here at Accunet we are involved with our clients’ efforts in this area every day, and we bring to the table insight and experience with a wide range of technologies and solutions. We would be happy to share that experience with you, and help your organization tackle this challenge.

The Greening of Today’s Data Center

Thursday, April 23rd, 2009

My youngest daughter has assumed a position of authority for herself in our house as the resident expert on environmental issues, a scepter she beats us with every time we throw a plastic bottle in the trash, walk out of a room with the lights on, or bring home groceries in anything but a reusable cloth sack. Despite the occasional aggravation of being called out simply for doing something I have been doing for forty years, in large part I applaud her efforts – the global problem is getting worse, and her generation’s burden will likely be greater than mine with regard to making amends. Plus, she’s saving me money, a worthy end indeed.

So while the lectures may get a bit tedious at times, I know the end result will be a good one, both for me and everyone else. And that got me to thinking about what she might do if turned loose in the data centers of the many verticals we work in every day. Within minutes I suspect the walls would be covered with the “Reduce, Reuse” stickers that seem to come home from school in endless supply.

Without drawing attention yet again to your 401k statement, there is little question that organizations are all about REDUCING at this point – fewer people, less space, and lower overhead. Yet with technology being such a key differentiator for many companies and institutions, it is important to reduce with an eye towards efficiency, and without any negative impact on utility. The good news is that from the branch office to the central data center there are many technologies available to accomplish this goal, often with matching benefits in simplification of management that are critical to today’s smaller IT staff.

Branch offices often pose a challenge for many reasons: they are difficult to staff, they often disrupt economy-of-scale pricing models, and they generate high risks for redundant purchases that are poorly utilized and expensive to maintain and replace. For these and other reasons, many companies are always looking for opportunities to centralize resources, thereby REDUCING the need for remote personnel, servers, and other hardware. To make consolidation projects work without negative user impact it becomes critical to provide the bandwidth that key applications require. Solutions from companies like Riverbed, Juniper, and Cisco offer command and control over WAN bandwidth, allowing applications to be centralized and simplifying delivery and management. Riverbed has actually taken the argument one step further, offering virtualized implementations of services (DNS, DHCP, Print, etc.) that do not lend themselves to centralization, further REDUCING the number of servers required at the branch office.

As employees and customers touch the edge of your network for information and services, it can be difficult to keep physical resources aligned with actual need. Web servers assigned to provide specific application services often get bogged down managing large numbers of connections, something they aren’t made to do. Solutions such as F5’s BIG-IP provide local traffic management designed to maximize bandwidth, and to offload tasks from your web servers – usually REDUCING the number you need to meet your service level goals. F5 augments this capability with balancing at a global level, allowing multiple sites to resource-burst into each other, offering internally-managed cloud computing concepts and REDUCING the need to build out each site to peak usage parameters.

Once inside the data center, it’s common to see the room bursting at the seams, with organizations struggling to find space, power, and cooling to satisfy the needs of growing storage repositories and archive platforms. More efficient power supplies advertised by myriad vendors are a good start, but often the problem calls for more drastic measures. Consider a solution from Isilon Systems, who has partnered with Ocarina Networks with the intention of doing the unthinkable for a disk vendor – optimizing online storage and thereby REDUCING storage footprint and resource consumption. Isilon’s vision is towards the big picture, helping its customers manage a key problem with Ocarina’s revolutionary de-duplication and compression capabilities and allowing them the flexibility to create purpose-built storage repositories that fit both the need and the room. Also bringing a creative approach to the storage market is Data Direct Networks, whose drive capacity metrics are dramatically REDUCING storage footprints, and whose D-MAID power feature automatically takes advantage of data dormancy and can create power reductions on the order of 80%.

The process of REDUCING the number of servers in the data center has been underway for some time, as standalone servers evolved into blade servers, and yet again into virtualized environments. The lure of eliminating servers and the ability to repurpose or recreate virtual machines with little more than a drag and drop is strong, but it is important to understand application flow to be able to identify good targets for virtualization, as well as to have reliable metrics to demonstrate that these projects are producing the promised results. Lakeside Software provides such a tool, offering network and application analysis designed to make virtualization projects deliver.

Lakeside takes us further to the end of our journey – the central office PC that sits on all night, all weekend, draining power and increasing costs without adding benefit. Users won’t remember to turn off their machines every night, and even if they did it would become impossible to manage the virus software update that you have scheduled for 2am. What’s needed is a centralized way to learn workstation usage patterns to establish automated sleep cycles that can be overridden by either scheduled software maintenance or unscheduled workflow – all while slowly but steadily REDUCING the electric bill.

My daughter won’t be showing up to ask you to take these steps, but someone will. The CIO needs to do more with fewer people and less space; the CFO wants the monthly expense to go down; your customers want to see you demonstrate a concern for the environment – regardless of who is telling you to reduce, the resources and tools you need are out there, and we’re here to help connect you to them.