What we tried to accomplish with the PDS series and our own DC was essentially Super Smart Servers. We've been working furiously to make it a reality for the past 6 months, but never got to realize even half of it, while the backlog kept on piling up. This has been to date the most expensive project Pulsed Media has undertaken, and the most extensive as well, encompassing work from electronics, power distribution and networking all the way to the actual motherboards, software and storage. I'm sad to say we have to redirect our course of efforts.
For the time being, we are going to redirect our efforts to bare hardware setups. This means local drives, a PXE setup and other things to redevelop on our infrastructure. It will *greatly* simplify our setup and stabilize things.
What is a Smart Server then?
Basically: in the cloud era, virtual private servers are deployed from SAN, which provides absolutely painless management, reinstalls, migrations from one HW node to another and so on. No server provider offers such features for bare hardware - essentially bare hardware with the flexibility of a VPS and the performance of a dedicated server. This is what we tried to achieve. Some of the proof of concept work was done all the way back in 2011.
The biggest thing we needed to achieve was storage flexibility, scalability and steady performance. We wanted to start simple and build our way up as clustering software matures to the point where we can utilize it. We never realized there is no such thing as simple and cost effective in the storage industry.
Reliability, performance and cost are all important to us; in order to compete with the big vendors we needed to achieve exemplary storage efficiency. Our plan was therefore to use an array of inexpensive SATA drives with an SSD caching layer. Naturally, if the SSD cache is 1% of the pool size, we expected at least a 1% hit rate served at SSD performance levels (40k+ IOPS, 400MB/s+ in our math). Nothing could be further from the truth.
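To put numbers behind that expectation, here is a back-of-the-envelope sketch in Python. The figures are our planning numbers, the model is a naive blended-latency estimate, and the 13-drive pool is just a representative example, not a benchmark:

    # Back-of-the-envelope blended random-IO model for an SSD-cached SATA pool.
    # All figures are illustrative planning numbers, not measured results.

    ssd_iops = 40_000     # what we budgeted for a caching SSD
    hdd_iops = 100        # one 7200RPM SATA drive, random access
    hdd_count = 13        # a typical pool of ours at the time
    pool_iops = hdd_count * hdd_iops

    for hit_rate in (0.01, 0.05, 0.50):
        # Average service time per request, weighted by where it is served.
        avg_time = hit_rate / ssd_iops + (1 - hit_rate) / pool_iops
        print(f"hit rate {hit_rate:>4.0%}: ~{1 / avg_time:,.0f} IOPS "
              f"(pool alone: {pool_iops:,} IOPS)")

Even under this naive model, a small hit rate barely moves the blended numbers - and in practice it turned out to be worse still.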
SSDs are a joke in the server environment.
I'm serious. They are an absolute joke. Their failure rate is so insanely high that with the slightest issue on a node, the instant assumption is that the SSD has failed. Even when it's brand new, out of the box and in use for less than 10 minutes. Of course, many of them do work, but their failure rate is stupendously high.
SSD performance in a big array and as a cache is pathetic.
The bottom line is that almost no caching software has any kind of sane algorithm behind it; it's basically random blocks in the cache unless your SSD cache is in the range of 5%+ of the pool. And even when the right blocks are cached, the SSDs slow to a crawl because no caching software implements TRIM/DISCARD functionality. They mysteriously expect the server admin to do that for them (?!?). This is true for the convenient methods suitable for our use; we couldn't use something like bcache, which requires reformatting to take into use, potentially leading to reliability issues. Reliability is a big question when you are running potentially hundreds of end users from a single array (20+ disk arrays with big SSD caches).
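For reference, about the only sane workaround we know of is the crude one the caching layers seem to expect anyway: discard the SSD yourself before (re)building a cache on it. A minimal sketch, assuming util-linux's blkdiscard is installed and the device is not currently holding a live cache (this wipes its contents):

    #!/usr/bin/env python3
    # Discard (TRIM) an SSD before re-initializing it as a cache device.
    # WARNING: this erases the device contents -- only run it on an SSD that is
    # about to be rebuilt as a cache, never on one that is in active use.
    import subprocess
    import sys

    def discard_whole_device(dev: str) -> None:
        # blkdiscard issues DISCARD/TRIM over the whole block device, handing
        # the blocks back to the SSD firmware for wear leveling.
        subprocess.run(["blkdiscard", dev], check=True)

    if __name__ == "__main__":
        if len(sys.argv) != 2:
            sys.exit("usage: discard_cache_ssd.py /dev/sdX")
        discard_whole_device(sys.argv[1])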
SSDs are good if you can dedicate them fully to a task, like a RAID10 array of SSDs for DBs, or desktop computers; they are absolutely brilliant in those use scenarios.
We tried ZFS L2ARC, EnhanceIO and Flashcache among others. Lastly we tried autonomous tiering, but CPU utilization + latency became an issue with that: performance was simply brilliant until a core was fully utilized, but since the software is not properly multi-threaded and apparently uses the CPU inefficiently, it didn't work for a big array with a random access pattern. For sequential loads ZFS performance was absolutely the best we've seen; however, ZFS falls flat on its face on random access, and the Linux implementation falls flat on reliability as well.
A further issue was that all but L2ARC wanted to push all writes through the SSDs, whether or not in writeback mode. When you are expecting continuous write loads of up to 300MB/s+ that becomes an issue.
All but the tiering method suffered from wear leveling issues: it was left to the firmware, which meant you could only utilize about 75% of the SSD at most, because activity was so high that 10-15% of spare area was not sufficient for wear leveling. The firmwares struggled to do proper wear leveling as well; even the best SSDs occasionally failed at this, and we tried most of the major brands. With the tiering method we went overboard and set aside 30% of the SSD capacity for wear leveling, so it's not entirely certain whether it held up because of the tiering software or because so much was set aside that performance doesn't degrade over time. There's no easy way to test this either, and we prefer to err on the safer side when it comes to production systems.
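To put the write pressure into perspective, here is the quick arithmetic. The 480GB caching SSD and the assumption that a single SSD absorbs the whole write stream are illustrative:

    # How hard a write-through/write-back cache layer hits the caching SSD at
    # the sustained write rates we see. SSD size and the single-SSD assumption
    # are illustrative.

    write_rate_mb_s = 300
    seconds_per_day = 86_400
    ssd_capacity_gb = 480

    written_tb_per_day = write_rate_mb_s * seconds_per_day / 1_000_000
    drive_writes_per_day = written_tb_per_day * 1_000 / ssd_capacity_gb

    print(f"Data written per day : ~{written_tb_per_day:.1f} TB")      # ~25.9 TB
    print(f"Full drive writes/day: ~{drive_writes_per_day:.0f} DWPD")  # ~54
    # SSDs of this era are rated for a small fraction of that, which is why
    # over-provisioning for wear leveling (10-30% of capacity) matters so much.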
ZFS: Anything but what it's advertised for
ZFS L2ARC warm up times were ridiculous; the warmup could take 2 weeks (seriously). Further, the ZFS Linux implementation fails on reliability very badly. On our first, and last, ZFS box we used the best feeling and looking premium SATA cables we had and solid feeling power connectors, but those premium SATA cables were actually the worst I've ever seen: they had a failure rate of 80%+. On top of that, this batch of HDDs had a 60% failure rate, along with some of the power connectors failing miserably.
This led to intermittent connection issues to the hard drives, sometimes causing a 10-20 second "freeze", sometimes making the drive disappear from the OS.
Unlike any RAID array we've ever encountered, ZFS didn't go into read only mode. W.T.F?! It continued writing happily on the remaining disks, and only manually reading the status of the array would tell you it was degraded.
When the drive was reconnected, it would start resilvering (resyncing) with the now faulty data. Most data (80%+) was readable up until that point, but during the resilver/resync process the ZFS Linux implementation would corrupt the remaining data, and there was no way to stop or pause that process. We ended up losing a huge amount of customer data.
ZFS random access performance was ridiculous as well. Our node had 13 SATA drives plus a couple of latest model big SSDs for L2ARC, and it could peak at only 3 disks' worth of random access. We were baffled! After all, big vendors were advertising ZFS as the de facto highest performing option around. After a little research and lurking on the ZFS on Linux mailing list, it turned out that the way ZFS is designed, it doesn't work for random access *at all*: every single read or write would engage every drive in the array.
The only way to mitigate this is to have multiple VDEVs, i.e. multiple mirror or RAIDZ pools. This means that the highest random access performance (our load is 95% random) would be 50% of the hardware performance, while only having 50% of the capacity in use. That is a total no go in an environment like ours: customer demand is high storage with 1-2 disks' worth of performance. We would need to double up the disk quantity, which would mean a huge cost increase, not just in outright HW purchases, but also in electricity consumption, failure rates etc.
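A rough way to reason about it, using the commonly quoted rule of thumb that a RAIDZ vdev delivers about one member disk's worth of random IOPS and a pool scales with the number of vdevs (simplified; mirrors can do somewhat better on reads):

    # Simplified ZFS random-IO model: ~one disk's worth of random IOPS per
    # vdev; pool IOPS scale with vdev count rather than disk count.

    disk_iops = 100   # 7200RPM SATA, random access
    disks = 13        # our node

    one_big_raidz = 1 * disk_iops            # what we had: a single 13-disk vdev
    mirrors = (disks // 2) * disk_iops       # 6 x 2-disk mirrors + 1 spare disk

    print(f"1 x 13-disk RAIDZ : ~{one_big_raidz} IOPS, ~12 disks of capacity")
    print(f"6 x 2-disk mirrors: ~{mirrors} IOPS, ~6 disks of capacity")
    # Raw hardware is 13 x 100 = 1300 IOPS, so even the mirror layout tops out
    # around 50% of it while also roughly halving usable capacity.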
Further, the failure in reliability was a total no go. A RAID array should by default be sane enough to stop all writes if drives are missing.
L2ARC was not usable performance wise either: the maximum benefit from each SSD drive, the fastest on the market at the time, was slightly above SATA 7200RPM disk speeds, occasionally peaking at almost 2 drives' worth. The good thing with ZFS + L2ARC was that certain things were actually insanely fast. Things like our GUI loaded as fast as the end user's internet connection + browser could manage. In that sense, the ZFS L2ARC caching algorithm is by far, orders of magnitude, better than anything we've seen since.
iSCSI Backend
We started out with ISTGT because it was the default on some big name vendor's products, which were based on *BSD. However, after a while it was clear ISTGT is anything but ready for major production: on Linux we couldn't load new targets online but had to restart the daemon, and it was going to be hard work to hack ISTGT to run multiple daemons to increase performance.
Every time we added or removed a vnode there was a high probability that another, or all other, clients would go into read only mode due to the time the restart took. This obviously is a no go. ISTGT has a method for reloading the targets, but it does not work under Debian 7; after researching this, it looked like that functionality was never finished.
ISTGT also lacked any proper documentation.
Hindsight 20/20: ISTGT has in the end been the most stable, highest performing and easiest to manage of the lot.
We moved towards LIO, using targetcli, since it is built into the Linux kernel, had some great performance promises, and on paper had all the features we needed, including thin provisioning. For a new node thin provisioning doesn't really matter, so we didn't look into it initially, but after a storage node has been running for a month or two it's a must have. There were also some performance regressions. One would expect that something built into the Linux kernel would have at least semi decent documentation and everything open source.
Looking into the documentation: there's not a single open piece of documentation about thin provisioning or how to enable it. Judging from the software, it requires some obscure looking parameters which are not explained anywhere.
The same goes for the management API: something we definitely need, since as is, targetcli/LIO requires HUGE amounts of manual typing, mostly repeating yourself. Something a script could do with a one line command, e.g.: ./setupVnodeIscsi.php vnodeId vnodeUser vnodePass dataStorageInGiB osTemplate
Right now we have to create the files manually and type something like 30 lines or so for each vnode by hand. A little of this can be copy pasted, reducing the risk of typos, but typos still happen. There are no examples of how to use the API; the only documentation is a reference manual of commands, nothing on how to actually utilize it. The documentation is at: http://linux-iscsi.org/Doc/rtslib/html/
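For illustration, here is roughly the kind of one-command wrapper we had in mind, driving targetcli non-interactively from Python. The targetcli paths and arguments are approximate and version dependent (older releases use /backstores/iblock instead of /backstores/block, for example), and the naming scheme, the IQN prefix and the emulate_tpu guess for thin provisioning are our own assumptions, not anything the documentation spells out:

    #!/usr/bin/env python3
    # Sketch of a one-command vnode provisioning wrapper around targetcli,
    # replacing ~30 lines of manual typing. Paths/attributes are approximate
    # and depend on the targetcli/LIO version in use.
    import subprocess
    import sys

    def targetcli(cmd: str) -> None:
        # Run one targetcli command non-interactively.
        subprocess.run(["targetcli"] + cmd.split(), check=True)

    def setup_vnode(vnode_id: str, initiator_iqn: str, lv_path: str) -> None:
        name = f"vnode{vnode_id}"
        target_iqn = f"iqn.2013-12.net.pulsedmedia:{name}"   # illustrative prefix

        # Block backstore on top of a pre-created LVM logical volume.
        targetcli(f"/backstores/block create name={name} dev={lv_path}")
        # Our best guess at the "obscure parameter" enabling thin-provisioning
        # UNMAP support; not documented anywhere we could find.
        targetcli(f"/backstores/block/{name} set attribute emulate_tpu=1")

        # iSCSI target, LUN and an ACL for the client initiator.
        targetcli(f"/iscsi create {target_iqn}")
        targetcli(f"/iscsi/{target_iqn}/tpg1/luns create /backstores/block/{name}")
        targetcli(f"/iscsi/{target_iqn}/tpg1/acls create {initiator_iqn}")
        targetcli("saveconfig")

    if __name__ == "__main__":
        if len(sys.argv) != 4:
            sys.exit("usage: setup_vnode.py <vnodeId> <initiatorIQN> <lvPath>")
        setup_vnode(*sys.argv[1:])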
Hindsight: looking into configshell and its code might have helped in figuring it out.
Don't be fooled by the URL. It says linux-iscsi.org, but it was really the RisingTide Systems LLC site, now owned by Datera. As a business man, I congratulate them for their brilliant marketing work: they have marketed themselves as the de facto Linux iSCSI open source solution - even though the reality is that "community edition" users are just beta testers for them.
We approached RisingTide Systems and later on Datera; before the Datera acquisition we didn't even get a reply. When Datera finally acquired them we got a reply. We were curious where the Core-iSCSI files had gone; the only links I found pointed to kernel.org, but the files were not there anymore. Turns out they had made Core-iSCSI paid-only closed source, part of RTS Director/RTS OS. Datera didn't even bother to tell us the price. I doubt the price would have made any sense.
RTS OS, without the cluster management features of RTS Director that is, would have had a price tag of 950 euro per node. Considering that our biggest storage nodes would be 24 disk arrays with at most 8 SSDs, with a combined maximum hardware cost of around 6000€ per storage node, that would mean roughly a 1/6 price increase even in that case - and since the bulk of our storage nodes were at this point 10 disks max with 1-2 SSDs, the cost is unacceptable. Never mind that there was no proper information on whether it would benefit us at all, whether we would receive support, whether we could utilize SSD caching, what features it includes and so on. Combine that with the fact that the reply I got from them had, in my view, a certain attitude - it made me feel like the message was "Muah! You are so scr**** now that you are using our products, you are just forced to pay for it!". I also got the sense that the only way to get proper performance out of LIO was with Core-iSCSI; they wouldn't bother to optimize for the way Open-iSCSI communicates.
At the suggestion of a friend we looked into Open-E, but it has a price tag of more than 2000€ per node per year, though it would at least have a 60 day evaluation option. We decided against it, as it would force us to use minimum 45 disk pods, and it is too risky to put that much business in a single storage node at our current scale. But since it uses SCST in the backend, and SCST apparently has very high performance characteristics, we decided to look into SCST. We never got iSCSI authentication to work with it due to the lack of proper documentation, and that is the point where we are now.
SCST has a multitude of ways of configuring it, and the official documentation is basically snippets of code copy pasted together, so really no help at all. Never before in my life had documentation left me confused and dizzy; SCST was the first. Obviously a no go - such a high maintenance endeavour with custom kernels, compiling tools from the repo etc.
Managing storage nodes: Not as easy as it sounds
There are many dangerous aspects. Just last week we had a disk fail while the RAID was rechecking the array, resulting in a completely broken array, since the failed disk was a different one than the one being resynced. A caveat of RAID5.
A resync would always cause the storage node to crawl, and without forcing it to run at high speed (making the end nodes crawl) it would take weeks of no redundancy(!!!).
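For the record, the knob for forcing a resync to run at full speed, at the cost of the end nodes crawling, is the kernel's md rebuild speed limits; something along these lines, with the values just examples in KB/s:

    #!/usr/bin/env python3
    # Raise the Linux md RAID resync/recheck speed limits so a rebuild
    # finishes in hours instead of weeks. Values are examples, in KB/s.

    def set_md_resync_speed(min_kbs: int = 200_000, max_kbs: int = 500_000) -> None:
        # Same effect as: sysctl dev.raid.speed_limit_min / speed_limit_max
        with open("/proc/sys/dev/raid/speed_limit_min", "w") as f:
            f.write(str(min_kbs))
        with open("/proc/sys/dev/raid/speed_limit_max", "w") as f:
            f.write(str(max_kbs))

    if __name__ == "__main__":
        set_md_resync_speed()
        # Progress can then be followed from /proc/mdstat.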
The obvious conclusion from this is that legacy RAID methods don't work for what we are doing.
Looking into the future: Ceph
We are waiting for some developments in Ceph to start testing it on the side; the features should be released by Summer 2014. However, there are serious performance considerations with Ceph. Based on the scarce benchmarks available, it looks like Ceph performance is very poor for what we are trying to achieve, but we will not know until the time comes.
We are hoping that by then the tiering software has become better or better SSD caching solutions are available. If Ceph management is good, we would build gateway machines with RAID10 SAS 15k drives + SSD caches and huge amounts of RAM between Ceph and the SAN.
This would mean we would have two discrete performance and reliability domains: the Ceph cluster and the gateway. The Ceph cluster would provide the bulk storage at the cost of some performance, and the gateway machines would provide the performance. SAS 15k drives provide stable performance at 2.5-3x the random IOPS of a SATA 7200RPM drive, the smaller models don't cost *that much*, and with stable high performance they will far outperform SSDs in this role, with the downside of a larger electrical draw.
If SSD caching software with a proper algorithm and sane logic comes along, it could prove to be the key to driving end user performance to where it needs to be. We want to be able to offer every single node four SATA 7200RPM disks' worth of random IO performance. Currently we achieve more like one disk's worth.
With the management features of Ceph we can have multiple failure zones and storage domains. By defining "CRUSH maps", we can make sure that the redundant copies are in another server room, or even an entirely different building, to ensure maximum data reliability. With the upcoming erasure coding features, we could have, say, 10 disks' worth of redundancy out of 100 disks, and a simultaneous failure of 10 disks would result in no data loss whatsoever. Due to Ceph's automation, it would immediately begin making new redundancy data without waiting, so in a day or two, 10 of the remaining 90 disks would again be acting as redundancy, the only drawback being less free storage.
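To give an idea of what that would look like in practice, here is a sketch based on the pre-release Ceph documentation. The syntax may well change before the feature ships, and the profile/pool names, failure domain and placement group count are just our illustrative picks mirroring the 10-out-of-100 example above:

    #!/usr/bin/env python3
    # Sketch: create an erasure-coded Ceph pool with 10 coding chunks per 100,
    # i.e. any 10 simultaneous disk failures are survivable. Syntax is taken
    # from pre-release Ceph docs and may change; names/numbers are illustrative.
    import subprocess

    def ceph(*args: str) -> None:
        subprocess.run(["ceph", *args], check=True)

    # k data chunks + m coding chunks; up to m chunks may be lost without
    # losing data.
    ceph("osd", "erasure-code-profile", "set", "pm-bulk",
         "k=90", "m=10", "ruleset-failure-domain=host")

    # An erasure-coded pool using that profile. The placement group count is
    # a placeholder that would need tuning for the real cluster size.
    ceph("osd", "pool", "create", "bulk-storage", "4096", "4096",
         "erasure", "pm-bulk")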
Complete freedom of storage capacity and redundancy will ease management by a lot. In theory, with sufficient redundancy, when disk failures happen we don't need to react to them at all; we can just go and swap the failed disks once a week instead of immediately, taking the burden away from our support staff.
Further, we can then add storage as we go. For example, we could rack 10x 24-disk chassis ready to accept disks, put one disk in each to have a 10 disk pool ready for use, and as storage gets used, just go and add the quantity of disks required. This would make things so much more efficient.
For example, in that scenario we could just rack all the 100 or so nodes waiting for assembly + testing + racking right now and start putting them online, and as nodes get taken into production, just go add more disks.
Since performance is made on the gateway machines, the cluster's disk quantity doesn't affect end user performance enough to matter. We could simply maintain a healthy 30-60TiB buffer of storage and utilize whatever disks happen to be on hand or on the shelf at our vendors.
*This is what we wanted to achieve to begin with!*
But too many unexpected issues arose, so we never got around to coding the management system, and too many caveats were discovered in key pieces of software to make it a reliable reality. As a small business we simply lack the resources to hire developers to build something like Ceph with the performance characteristics we need ourselves. That would need a dedication of several thousand man hours at the very least, i.e. a 3-4 person development team to get it done in a sensible time frame, and those 3-4 people would all need to be A players in their respective development fields.
Other hardware issues
We had plenty of other hardware issues as well. Some of them funny, most of them totally unexpected. For example, failed power buttons. Duh! One would never expect a *power button* to fail :)
Some motherboards would fail to network boot unless hard power cycled. This is a totally weird issue; it happens every now and then on all types of motherboards, and power cycling solves it. They would simply not even see that there is PXE firmware on the NIC.
Pierced riser cables: occasional networking failures, disappearing NICs etc. It took a while to notice what was causing this. Since we are currently stacking the nodes close to each other, sometimes the bottom component pins of the upper node's motherboard would pierce the riser cable of the lower node, causing the NIC PXE firmware to fail to load, poor networking performance, random crashes etc. This one was hard to troubleshoot, as the cables always looked just fine, and it required a bit of luck to even notice this was happening; some of the mobos were just 2-7 millimeters too close to each other. Since then we have begun putting hot glue on the riser cables as extra protection and not assembling the mobos as close to each other.
Poor networking performance: bad NIC chips or overheating. Some RTL chips have very weak performance; the a, b and c revisions of the RTL 1Gbps NIC chip are a no go, only the e revision is now accepted. A few times poor networking performance was the result of a NIC chip overheating. Duh! One would expect that if they use such a big process node that the chip has a significant power draw (we found out up to 4W!!), they would put even a small heatsink on it. Ever since, we've been adding tiny heatsinks to the chips which don't have one, even on the latest model Intel GT NICs which are "green" and very power efficient.
Overheating chipsets: some mobos don't have a heatsink on the chipset, and since some came to us used, they were all brown around the chipset area. The odd thing was that this issue was first noted on *Intel* motherboards, duh! Initially we thought "OK! They have designed these for passive cooling in a cramped space and decided that no cooling is required". Eventually, though, we started adding heatsinks on each and every one of them.
Heatsinks: since those two issues, we've been adding heatsinks to some chips despite no obvious need. With some motherboards we put them under load, used an infrared thermometer to gauge chip temperatures and added heatsinks to the chips heating beyond 40C. Some ran as hot as nearly 60C after just 5 minutes or so!
NICs, NICs, NICs... We had to go through 10+ models until we found what we want to use: the best Intel has to offer. This too took months, to realize that despite the advertised specs, the RTL chips are not up to the task at all. Now we use only two models of Intel cards for the bulk of our mobos: Intel PRO/1000 GT and Intel PRO/1000 MT. These cost multiple times more, but fewer headaches and better performance make it all worth it. We still use the latest RTL 1Gbps chip on the PCI-E adapters, but most mini-ITX motherboards only have a 32bit PCI connector.
Mobos: initially we assumed that all mobos delivered are working and usable. We found out that is not true; we had per-stack failure rates as high as 60%! Even new Intel mobos tend to have significant failure rates initially, but then again, those were an experimental model to begin with, and many of the ones delivered were engineering/tech samples. Eventually Intel stopped production of that motherboard, the DN2800MT. Shame! We really, really, really liked this mobo; the only thing it lacked was a second 1Gbps network adapter.
Network booting (PXE booting): adding to the per-stack failure rates is a bizarre Linux network booting bug. After loading the initramfs, Linux uses DHCP to get an IP address + network boot info (iSCSI target info). On some mobos this fails on any of the NICs, on some mobos it fails only on the other NIC. Fortunately, this issue rarely happens on an add-on network adapter card and is usually a symptom of another issue (see riser cables above). Certain mobo models would claim PXE support but fail at various stages of the boot up. Some mobos would refuse to load the add-on NIC's PXE firmware. Most often these issues were with Foxconn brand mobos, which is weird since Foxconn manufactures so many of the mobos in production!
RAM: this mostly pertains to actually acquiring the modules, since most Atoms still utilize DDR2, and DDR2 memory modules are in very scarce supply nowadays since manufacturing has stopped or nearly stopped. This means 2G modules are *very* expensive at retail. We have acquired a LOT of memory modules via eBay, since our local vendors' supply dried up really fast and we need 2G modules for the bulk of the mobos. Most of the memory ads on eBay are a scam in one way or another; they go above and beyond any sane effort to scam you into buying server RAM or 2x1G kits instead of the seemingly advertised 2G modules. Now we have useless 1G modules in the *hundreds* on the shelf, and useless server memory modules in the dozens. Duh!
On the other hand, we have found that second hand memory modules have a freakishly low failure rate. We've had a higher failure rate with brand new modules than with used ones. I think the grand total of failed used modules is in the vicinity of 4 or so, while new modules are at least double that!
SATA CABLES! Huh! Don't get me started on this topic! The bottom line is: the cheapest looking red cables seem to be the most reliable. If it's a premium cable, it's trash; just cut it with side cutters and throw it away. You don't want that pain. I'm serious. If it's a normal cable - in fact, the cheaper the better, locking clip or no locking clip - the more reliable it seems to be. The more expensive it is, the more likely it is to be crap. But hey! That's science for you! That's why the Mythbusters love to do what they do (apart from getting to blow stuff up): the totally unexpected results. This is one of them. We didn't do proper science on these (i.e. didn't write down the exact failure rates), BUT the verdict is beyond obvious: we now throw certain types of SATA cables away immediately. For example those black ones with a white stripe and metal locking clips: almost none of them actually work.
If you have periodic disk slowdowns, or disks disappearing: swap the SATA cable first. If the problem persists, swap the power connector and the issues are very likely gone. If not, the disk is probably failing; Seagates fail gracefully like this, giving symptoms before eventually failing. Other brands tend to just die without warning.
Hard drives! Out of the many models and brands we've tried, Seagate Barracuda offers the best performance, price and reliability ratio. Despite our first batch being screwed up with a 60% failure rate - obviously damaged in freight (visible physical damage on connectors etc.) - the remainder has been very reliable. Also, performance is greater than that of any WD anywhere near the price. Only the WD Black is able to compete among today's drives; it's just slightly faster, but the power draw is so much higher and the price 50-60% more that it makes no sense to get them. It also looks like Seagates are the only ones failing "gracefully", i.e. giving symptoms before failure. Also, on RMAs, it is enough for us to say that a drive doesn't work on our workload and it gets swapped, no hassle, no fuss. If we say a drive is having symptom X, even if they can't recreate it, the disk gets replaced.
There are tons of hardware issues I've now forgotten; these were the most memorable ones. There have been dozens upon dozens of little gotchas, like how to route power wiring, or power LED + button wiring and their attachment (we now use hot glue to hold them in place). We also need to modify all the picoPSUs for the right connectors, and all the PSUs we use for the connector types we use, etc.
There have been network cabling gotchas as well. By the way, we've spent thousands on wholesale network cabling alone! :O There's just so freakishly much cabling going around - many colors, many lengths, managing them all etc. It's hard work to put in the network wiring for a stack of nodes!
Software issues have been plentiful as well, many of which have been described above with the storage issues. One of the most stupendous and annoying ones is the Debian installer bugs, though! Since we still have to install Debian on so many nodes with local drives, it's been annoying as hell.
The Debian installer is nowadays so full of bugs it's insane it ever made it into the distribution! First of all, you need to make sure you are installing to SDA, or at least that your boot sector is going to be on SDA. You can forget installing Debian on 2TB or larger disks - pretty much guaranteed failure. UEFI BIOS causes quite a bit of grief as well. The partitioner sometimes doesn't allow you to make the partitioning you need for a bootable system. Installing on RAID1 seems to be a crapshoot as well.
For someone with deeper knowledge, it's just easier to partition by hand in a rescue system and debootstrap the system than to even try using the installer.
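Roughly, that routine looks like the sketch below (device name, mount point and release are examples only; fstab, networking and a root password still have to be set up by hand afterwards):

    #!/usr/bin/env python3
    # Outline of the "skip the installer" route: partition by hand from a
    # rescue system, debootstrap Debian, then install a kernel and GRUB from
    # a chroot. Device names and the release are examples only.
    import subprocess

    def sh(cmd: str) -> None:
        subprocess.run(cmd, shell=True, check=True)

    disk = "/dev/sda"

    sh(f"parted -s {disk} mklabel msdos mkpart primary ext4 1MiB 100%")
    sh(f"mkfs.ext4 {disk}1")
    sh(f"mount {disk}1 /mnt")

    # Bootstrap a minimal Debian (wheezy at the time of writing) into /mnt.
    sh("debootstrap wheezy /mnt http://ftp.debian.org/debian")
    sh("echo 'deb http://ftp.debian.org/debian wheezy main' > /mnt/etc/apt/sources.list")

    # Bind the virtual filesystems and finish up inside the chroot.
    for fs in ("dev", "proc", "sys"):
        sh(f"mount --bind /{fs} /mnt/{fs}")
    sh("chroot /mnt apt-get update")
    sh("chroot /mnt env DEBIAN_FRONTEND=noninteractive apt-get install -y linux-image-amd64 grub-pc")
    sh(f"chroot /mnt grub-install {disk}")
    sh("chroot /mnt update-grub")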
So what now?
We are going to take our time to develop the smart server system. We tried to achieve too many things at once; so many things were in flux and moving constantly that no one was able to keep track of everything. Management of the storage system has been a total pain, and we've been in such a hurry to get production up and running that all the little problems have sucked up all the time we've got. So we are basically putting delivery of these on halt and going to traditional setups with local drives.
If you have an open order: don't worry, your node will eventually be delivered. We just don't want to make any hasty mistakes with the final storage nodes, in order to free up the maximum number of disks.
We would have LOTS of capacity to deliver LOTS of suboptimal smart servers, but we don't see that as a good choice: bad manageability, performance not up to our standards, bad reliability.
So we are moving in short order to the local disk variety, since this simplifies the setup by removing the second NIC, and each node having its own local drives creates what I call smaller reliability zones: when disk failures happen, fewer nodes are affected (just the one in question, unlike right now where 10 might fail in a bad case). This means we need to develop some things before we get up to speed with that: we want to finish our DIY blade design for one, and we need to do a PXE installer setup as well. Fortunately, for PXE server installation there is readily available software, both open source and closed source commercial applications, but all of this takes time.
In the meantime, we have enterprise varieties you can order, Dual Opteron and Dual Xeon options; these have a known lead time based mostly on freight time from the USA. All of them can be had with IPMI/DRAC for remote management, which means you can opt to install the OS yourself. It's essentially a barebones leased server option, where everything is tailored to the customer: 1 to 4 disks on each system, of your chosen type and size, etc. We also have a limited supply of 20 and 24 disk options and a few 16 bay DASes for utilization. You can opt for SATA or SAS, HW RAID etc., 100Mbps or 1Gbps, Guaranteed and Bulk volume varieties.
Just contact sales if you need such.
We also still offer leased servers from 3rd party DCs just like we used to; just contact our sales with the specifications you need.
The custom DIY blade design is expected to be finished around March or so, at which point we will no longer take preorders; all orders should be deliverable within 1 business day.
Sunday, December 29, 2013