After more than a month of waiting, the last of the testing gear finally arrived: three big boxes of precious hardware.
The two testing nodes are minimally configured: AMD 6-core 3.3GHz CPU, 8GB DDR3 ECC, 5x3TB Cuda drives, 1x256GB SSD cache, 2x32GB SSD boot drives, a HighPoint RAID controller for the cache drives, and a basic RAID controller for the boot drives.
The final configuration will be 32GB DDR3 ECC, up to 14 magnetic drives, and 4 cache drives. The two configurations we intend to launch with are 14x3TB + 4x256GB and 14x4TB + 4x256GB. Both will operate in parity 3 mode, allowing 3 drive failures before data is in danger. By my maths there will be 1 disk swap in either of the nodes during the first 6 months, and 2-3 failed drives in total over the 3-year design lifespan. If the failure rate exceeds 4 drives across the two nodes during those 3 years, then cooling and anti-vibration need to be enhanced.
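As a quick sanity check of those numbers, here is a back-of-envelope estimate in Python; the annualized failure rate (AFR) is an assumed placeholder, not a measured figure for these drives.

```python
# Back-of-envelope estimate of expected drive failures.
# The AFR below is an assumed placeholder, not a measured value.

DRIVES_PER_NODE = 14      # final configuration
NODES = 2
AFR = 0.03                # assumed ~3% annualized failure rate (placeholder)
LIFESPAN_YEARS = 3

total_drives = DRIVES_PER_NODE * NODES

expected_first_6_months = total_drives * AFR * 0.5
expected_over_lifespan = total_drives * AFR * LIFESPAN_YEARS

print("Expected failures in the first 6 months: %.1f" % expected_first_6_months)
print("Expected failures over %d years: %.1f" % (LIFESPAN_YEARS, expected_over_lifespan))
```

With a ~3% AFR this works out to roughly 2-3 failures across both nodes over 3 years, in line with the estimate above.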
Right now we are testing with a total of 10 drives, with both nodes running parity 1. Each drive is mounted with rubber insulation, and the drive caddy itself sits on rubber insulators for anti-vibration. We then put all the drives right next to each other to maximize heat build-up and push them to the edge during testing.
In our colocation room the storage servers will sit next to the cold air vent, getting the best cooling. The target temperature at the hottest spot, at roughly 1.5 m height in the room, will be 30°C. So yes, we will be running the hardware hot, but as Google's testing has shown, this does not necessarily translate into higher failure rates. If it does, we will drop the temperature 1°C at a time until failure rates are at an acceptable level. In the end we have built in redundancy by design, and it boils down to financial maths: is it better to let one extra unit fail per 6 months, or to lower the temperature and pay X amount extra in cooling?
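That cooling-vs-failures trade-off can be written down as a simple break-even comparison; the sketch below uses entirely hypothetical cost figures, only to show the shape of the calculation.

```python
# Break-even sketch for the cooling-vs-failures trade-off.
# All cost figures are hypothetical placeholders for illustration.

DRIVE_REPLACEMENT_COST = 120.0       # assumed cost of one replacement drive + swap work
EXTRA_FAILURES_PER_6_MONTHS = 1      # extra units lost by running hotter (assumed)
EXTRA_COOLING_COST_PER_MONTH = 25.0  # assumed extra cost of running the room cooler

failure_cost = EXTRA_FAILURES_PER_6_MONTHS * DRIVE_REPLACEMENT_COST
cooling_cost = EXTRA_COOLING_COST_PER_MONTH * 6   # over the same 6 months

if failure_cost < cooling_cost:
    print("Cheaper to accept the extra failure and rely on parity redundancy")
else:
    print("Cheaper to lower the temperature and pay for the extra cooling")
```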
This is why we test things: for one, we need to get extra-short lockable SATA cables and better SATA power cables (the ones that come with the Corsair PSU take up too much space with SSD drives), and we need to modify the cases to accommodate an extra 120mm fan or two. Maybe some additional 80mm fans too. Also, someone (me) forgot to purchase low-profile video cards!
Thankfully my apartment has a spare room to fiddle with these in before we get the colocation room; assembling these takes up a surprising amount of space!
Some pictures: http://imgur.com/xlmRlpO,kn15CWr,BZRXtN3
Now the testing may begin. The things we are testing are heat build-up, PSU efficiency, performance (of course), and reliability. Our intention is to push these to the limits and test all kinds of malfunction scenarios. Just building up these nodes revealed a couple of drawbacks in the design; for example, with a standard-size ATX motherboard the last 5 hard drives are ridiculously hard to get into or out of their slots, so we might need to drop the storage server down to just 8 drives.
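For the heat build-up part, something along these lines could be used to log drive temperatures while the nodes are under load; this is a minimal sketch that assumes smartmontools is installed, runs as root, and uses placeholder device names and polling interval.

```python
#!/usr/bin/env python3
# Minimal sketch for logging drive temperatures during the heat build-up tests.
# Assumes smartmontools is installed and the script runs as root;
# the device list and polling interval are placeholders.

import subprocess
import time

DEVICES = ["/dev/sd%s" % c for c in "abcde"]  # placeholder device names
INTERVAL_SECONDS = 60

def drive_temperature(device):
    """Return the Temperature_Celsius raw value from `smartctl -A`, or None."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    for line in result.stdout.splitlines():
        if "Temperature_Celsius" in line:
            # In the standard SMART attribute table, RAW_VALUE is the 10th column.
            return int(line.split()[9])
    return None

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    readings = {dev: drive_temperature(dev) for dev in DEVICES}
    print(stamp, readings)
    time.sleep(INTERVAL_SECONDS)
```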
No InfiniBand tests yet.
- Aleksi
Tuesday, May 28, 2013