Today we are going to do maintenance on multiple parts of our infrastructure.
The primary router will be changed to more suitable hardware, which will result in approximately 45 minutes of downtime. Basically, we are downgrading it. Yes, downgrading. We over-configured the gateway by a vast margin, making it more expensive and more power hungry than required, so we are doing a small downgrade. CPU-wise it will remain the same, which is the only thing that matters: even during the heaviest load so far we have seen CPU peaks of less than 1% due to routing.
Storage will also be reworked today; due to flaws in the ZFS design, this will cause intermittent downtime. We will move one vnode at a time, which means that vnode is down for the duration of the copy operation; we restart it once it has been copied to another RAID array. After that we build a new storage brick and another array, to which we copy from this intermediate array, again causing intermittent downtime.
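For illustration, here is a minimal sketch of that per-vnode loop, assuming libvirt-managed vnodes, rsync for the copy, and hypothetical vnode names and mount points; it is not our actual migration script.

import subprocess

# Hypothetical vnode names and array mount points, purely for illustration.
VNODES = ["vnode01", "vnode02", "vnode03"]
OLD_ARRAY = "/srv/old-array"
NEW_ARRAY = "/srv/new-array"

for vnode in VNODES:
    # The vnode is down from the shutdown until the copy completes.
    # (virsh shutdown returns immediately; a real script would wait for
    # the guest to actually power off before starting the copy.)
    subprocess.run(["virsh", "shutdown", vnode], check=True)
    subprocess.run(["rsync", "-aH",
                    f"{OLD_ARRAY}/{vnode}/", f"{NEW_ARRAY}/{vnode}/"], check=True)
    subprocess.run(["virsh", "start", vnode], check=True)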
The storage work is likely to last 8+ hours due to the sheer volume of data to be moved. After this maintenance we expect to see several times the storage performance we see currently, and this should be a setup to stay with for a long time. We will start to roll out the storage network and cluster after we verify that this gives us the "holy triangle" of storage: Cheap, Fast and Reliable.
This time around we are going to rely on an older, proven, well-known setup: EXT4 on RAID5/6. Sure, we lose snapshotting, checksumming and inline compression, but via bcache we should get a more usable cache for our systems and *solid* performance.
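As a rough illustration of the intended layering (mdadm array -> bcache -> EXT4), a sketch follows. The device names are hypothetical and udev/registration details are glossed over, so treat it as a picture of the stack rather than our provisioning script.

import subprocess

# Hypothetical devices: six spinning disks for the array, one SSD for the bcache cache.
DISKS = ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde", "/dev/sdf", "/dev/sdg"]
CACHE_SSD = "/dev/sdh"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. RAID6 array over the spinning disks.
run(["mdadm", "--create", "/dev/md0", "--level=6",
     "--raid-devices=" + str(len(DISKS))] + DISKS)

# 2. Make the array a bcache backing device, with the SSD as its cache.
run(["make-bcache", "-C", CACHE_SSD, "-B", "/dev/md0"])

# 3. Plain EXT4 on the resulting bcache device.
run(["mkfs.ext4", "/dev/bcache0"])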
The problem is that ZFS was designed with only one goal in mind: data integrity. This means a RAIDZ[1,2,3] volume only *has* single-disk performance, despite being advertised by many (including one big, old-time enterprise that has always been at the forefront of industrial computing) as being even faster than RAID5/6. This simply is not true. Your only other options are to accept weaker redundancy and lose 50% of your storage by using mirrored pairs, or to build a multitude of RAIDZ vdevs and thus lose much more than 50% of the performance. Neither of these scenarios is acceptable. If we have understood correctly, even with mirrored pairs you still lose 50% of the performance. So this translates to roughly two options: 50% storage + 50% performance loss, OR 20-25% storage + 80-90% performance loss.
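To make those numbers concrete, here is a back-of-the-envelope comparison for a hypothetical 10-disk pool, using the performance assumptions stated above (roughly one disk's worth of random IOPS per RAIDZ vdev, one per mirror pair); the disk count is an example, not our actual layout.

# Working assumptions from the paragraph above, not official ZFS figures.
DISKS = 10

def summarize(name, usable_disks, disk_equivalents_of_iops):
    storage_loss = 1 - usable_disks / DISKS
    perf_loss = 1 - disk_equivalents_of_iops / DISKS
    print(f"{name:20s} storage loss ~{storage_loss:.0%}, performance loss ~{perf_loss:.0%}")

summarize("5x mirrored pairs", usable_disks=5, disk_equivalents_of_iops=5)  # ~50% / ~50%
summarize("raidz2, one vdev",  usable_disks=8, disk_equivalents_of_iops=1)  # ~20% / ~90%
summarize("raidz1, two vdevs", usable_disks=8, disk_equivalents_of_iops=2)  # ~20% / ~80%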
The bottom line is: ZFS is good for backup and streaming loads. It has magnificent write speeds but poor read speeds, the cache warms up too slowly, cache expiry is far too long for our type of load (hotspots change frequently), and IOPS figures are very poor both in RAIDZ and cache-wise. There are also absolutely zero pool-reshaping features apart from expanding by swapping in larger disks. The nice features such as inline compression and checksumming unfortunately do not outweigh the lack of performance; in any case, with our load the data volumes have an essentially 0% compressible ratio.
EXT4 + mdadm RAID5/6: RAID5 achieves near-hardware read speeds while suffering on writes, and RAID6 is not much slower than that. We expect to see 75%+ of raw hardware performance while losing 25% of the storage initially. EXT4 performs well overall, especially with a little bit of tuning; its feature set is limited, but what it does, it does well. mdadm is slightly harder to manage (failed disks are very easy to replace under ZFS!), and while write speeds suffer, writes are not a heavy load for our use: generally a 1:25 write-to-read ratio, meaning 25 times more is read than written!
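As an example of that tuning, the main knob is aligning EXT4 to the RAID geometry via stride and stripe-width; the disk count and chunk size below are hypothetical, not our actual layout.

# Hypothetical geometry: 6-disk RAID6 (4 data + 2 parity), 512 KiB chunk, 4 KiB blocks.
total_disks = 6
parity_disks = 2
chunk_kib = 512
block_kib = 4

stride = chunk_kib // block_kib                       # filesystem blocks per chunk
stripe_width = stride * (total_disks - parity_disks)  # blocks per full data stripe

print(f"mkfs.ext4 -b 4096 -E stride={stride},stripe-width={stripe_width} /dev/md0")
# -> mkfs.ext4 -b 4096 -E stride=128,stripe-width=512 /dev/md0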
Further, ZFS on Linux fails to compile against the latest kernel; there are some hacks to get it to compile, but we have not attempted them, while EXT4 is built into the kernel. ZoL documentation is also extremely weak, and it still has some worrying regressions. EXT4 has no such regressions, and its documentation, while rarely needed, is vast.
BtrFS: it is getting more mature by the day. Reading up on the BtrFS site, the only major drawback we noticed is the lack of the redundancy levels we need; we would like to use 2 or 3 parity disks. BtrFS is in the kernel, so we will look into it someday.
After we validate that the storage is performing top-notch, we will start to roll out more machines again. We have tens of nodes on standby waiting to be built, tested and configured, and several already up but not yet being utilized.
We are also about to purchase a big lot of hard drives within the next several weeks. Initially, for every disk we put online we can put 1-1.5 servers online, adding the missing storage on an as-required basis. This should quickly rise to a 1:2-3 ratio as we grow the pool and add more SSD caching. Finally, once the storage starts to be fully utilized, we estimate a total need of 1.25 disks' worth of performance per node, but the scale-out architecture we are planning lets us trivially keep adding storage on an *as needed* basis, allowing us to offer these extremely competitively priced services.
By 1.25 disks' performance per node we mean performance relative to a modern 7200 RPM SATA disk, whether that performance (IOPS) is delivered via SSDs or magnetic drives.
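As a small worked example of that sizing (the node count and per-disk IOPS figure are illustrative only, not our actual fleet numbers):

NODES = 40                   # hypothetical node count
PERF_PER_NODE = 1.25         # disk-equivalents of performance per node, as estimated above
IOPS_PER_7200RPM_SATA = 75   # common ballpark for random IOPS of a 7200 RPM SATA disk

disk_equivalents = NODES * PERF_PER_NODE
print(f"{NODES} nodes need ~{disk_equivalents:.0f} disk-equivalents of performance "
      f"(~{disk_equivalents * IOPS_PER_7200RPM_SATA:.0f} random IOPS), "
      f"from any mix of SSD cache and magnetic drives.")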
Saturday, July 20, 2013