It's been quite a bad two weeks at our own Espoo DC in regard to disk failures.

Many months went by without a single disk failure, but the last 1½-2 weeks have been maddening!
During these two weeks 4.6% of all our disks failed. Yes, a whopping 4.6%!
The failures span multiple disk models and manufacturers, and include both SSDs and HDDs.

On top of this, new storage arrays failed to perform due to a firmware/driver issue with LSI MegaRAID on Linux, which cost us a lot of precious time.

I'm happy to say that the total damage is rather small though, aside from some maddeningly long downtime on a few nodes.
During this ordeal we lost only about 25.15TiB of data capacity, which was approximately 60% utilized, resulting in just 15.09TiB of actual data lost.
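For the curious, that figure is simply the failed capacity multiplied by its utilization; a quick sketch in Python (variable names are just for illustration, and the 60% utilization is approximate):

    # Quick sketch of the data-loss arithmetic above.
    failed_capacity_tib = 25.15             # capacity on the failed arrays, in TiB
    utilization = 0.60                      # approximate share of that capacity holding data
    data_lost_tib = failed_capacity_tib * utilization
    print(f"{data_lost_tib:.2f} TiB lost")  # -> 15.09 TiB lost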

Right now a total of 50.31TiB (~56TB of "Sales" capacity) is still rebuilding, and all other arrays that had failed disks during this period have already finished their resync / recovery cycles.
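As a side note on the units: TiB is binary (1024-based), while the "Sales" figure is assumed here to mean decimal terabytes as printed on drive labels. A rough sketch of that conversion (treating the small remaining gap to ~56TB as rounding and reserved overhead):

    # Rough sketch of the TiB -> decimal-TB conversion.
    rebuilding_tib = 50.31
    rebuilding_tb = rebuilding_tib * 1024**4 / 1000**4
    print(f"{rebuilding_tb:.1f} TB")  # -> 55.3 TB, close to the quoted ~56TB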

Everything should be back in order after the new arrays finish stress testing in 24-48 hours.

Research into clustering the storage to avoid issues in situations like this has already started, and new storage arrays are already being planned.
Systems with local data drives were affected less, and systems with completely local storage were not affected at all. Traffic graphs show only a 20% dip in outgoing bandwidth over this period.

Tuesday, May 20, 2014
