Managing a lot of servers
It's less trivial than one would think. They are just plain servers, right?
That is true, they are plain servers, but as you keep growing, the servers start to differ slightly: package versions, backend software versions, configurations.
And when software without backward compatibility, such as rTorrent, is introduced, things get that much harder.
So to maintain a lot of servers easily and in a streamlined manner, we've built quite a few scripts ourselves, ranging from on-server management and management node scripts to monitoring and automating the software stack.
One question is how to maintain 40+ concurrent SSH sessions that you need infrequently, but still frequently enough that closing and reopening them gets annoying. How to manage passwords, or use passwordless logins? How to distribute login certificates? How to distribute software?
Small things become non-trivial when you have enough mass.
Things like a slight change to the FTP config. It's no fun to log in manually to, say, 60 servers.
And how to set up all the accounts when you get 40 orders in a single day is another question altogether!
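To illustrate the "one small change on 60 servers" problem, here is a minimal Python sketch of running the same command on many servers in parallel over SSH. It is not the actual tooling described in this post. The function names, the timeouts and the worker count are assumptions made up for the example, and it assumes key-based login is already in place so that ssh never prompts for a password.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_ssh_command(host, remote_command):
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so this assumes key-based login is already set up (an assumption).
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
            host, remote_command]


def run_remote(host, remote_command):
    """Run one command on one host; return (exit_code, combined output)."""
    try:
        proc = subprocess.run(
            build_ssh_command(host, remote_command),
            capture_output=True, text=True, timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # ssh missing or the command hung: report it as a failure.
        return 255, str(exc)
    return proc.returncode, proc.stdout + proc.stderr


def run_on_all(hosts, remote_command, runner=run_remote, max_workers=20):
    """Run the same command on every host in parallel.

    Returns a dict of host -> (exit_code, output). The runner is
    injectable so the fan-out logic can be tried without real servers.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {h: pool.submit(runner, h, remote_command) for h in hosts}
        return {h: f.result() for h, f in futures.items()}


def failed_hosts(results):
    """Hosts whose command did not exit with 0, sorted by name."""
    return sorted(h for h, (code, _) in results.items() if code != 0)
```

Calling run_on_all(hosts, "uptime") with a list of hostnames returns one result per server, and failed_hosts() on that result gives the short list of servers that still need manual attention.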
However, just being proactive about it and making progress on streamlining processes every single day yields results.
Just today, we solved several problems before the users on those servers even noticed them, simply because the tools we've built to manage servers let us notice things we wouldn't ordinarily notice.
Another added benefit is insight into performance characteristics: because of the mass, network metrics actually become meaningful, and we can see how things change. With a single server, you make a change hoping for better performance, but you simply do not know what the end result was, because it's just one server and because of the nature of the internet. Torrents don't transfer at a stable speed, they bounce up and down. The only way to see the effect is when it accumulates across tens of servers.
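The point about cumulative effects can be sketched in a few lines of Python. This is only an illustration of the idea, not the monitoring described here: the function name and the input format (a list of throughput samples per server, in whatever unit is used consistently) are assumptions.

```python
from statistics import mean


def fleet_change(before, after):
    """Compare average throughput across the fleet before and after a change.

    before/after: dict of server name -> list of throughput samples.
    Only servers present in both dicts are compared.
    Returns (fleet_mean_before, fleet_mean_after, relative_change).
    """
    servers = sorted(set(before) & set(after))
    if not servers:
        raise ValueError("no servers in common")
    # Average each server first so a server with many samples
    # does not dominate, then average across the fleet.
    mean_before = mean([mean(before[s]) for s in servers])
    mean_after = mean([mean(after[s]) for s in servers])
    return mean_before, mean_after, (mean_after - mean_before) / mean_before
```

On one server the samples bounce around too much to say anything. Averaged over tens of servers, a consistent shift in the relative change is much harder to explain away as noise.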
Another thing we did to speed things up is build our own shell environment, which makes it fast to connect to servers and run a few commands. It's nothing fancy, just some autocompletions built into the shell environment, some visual changes, some aliases and so on.
Another key is task abstraction and separation into smaller pieces. The smaller the pieces are, the easier they are to maintain.
It's also important to test. For this we operate lots of local VMs with different versions or software stacks, to see how they interoperate and how things work.
So at times when you wonder why setup is taking so long, or why we cannot support your special request, it's simply because we have a large mass to manage and need streamlined operations. We cannot maintain N*N different configurations. It's unmaintainable, and we would suffocate under the administrative overhead, which would force us to hire additional support people. So when you request something trivial-sounding like "Can I have public.MYUSERNAME.pulsedmedia.com so I can share files?", making that happen in fact requires us to add that support for every user, on every server, in an automated way.
Otherwise we'd end up with yet another custom configuration, which is hard to maintain. Never mind how to keep track of the custom configurations so we don't accidentally wipe them out.
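As a rough illustration of "for every user, on every server", the sketch below renders one config fragment per user from a single shared template, so there is exactly one thing to maintain. The key = value format, the /home/USERNAME/public path and the function name are made up for the example and are not the real configuration.

```python
from string import Template

# Hypothetical fragment format; the real server config is not shown here.
PUBLIC_SHARE_TEMPLATE = Template(
    "# generated, do not edit by hand\n"
    "hostname = public.$username.$domain\n"
    "docroot = /home/$username/public\n"
)


def render_all(users_by_server, domain="pulsedmedia.com"):
    """Render one fragment per (server, user) pair from the single template.

    users_by_server: dict of server name -> list of usernames.
    Returns a dict of (server, username) -> rendered text.
    """
    rendered = {}
    for server, users in users_by_server.items():
        for username in users:
            rendered[(server, username)] = PUBLIC_SHARE_TEMPLATE.substitute(
                username=username, domain=domain
            )
    return rendered
```

Because every fragment comes from the same template, changing the feature means changing one template and regenerating, instead of hand-editing each account.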
Therefore, we can only implement the most requested features, not every single feature or custom configuration.
So next time you think "why can't they support feature XYZ?", it's in fact because we need to make sure it doesn't have a negative impact, and we need to give it to everyone, on every single server.
On new servers we run a newer, customized rTorrent, and this has been a HUGE headache as of late. In fact, it has caused most of the setup errors. Why, you may ask? The config is not 100% the same on each server. We didn't envision a situation where rTorrent configuration backward compatibility would be broken, so we are currently supporting 2 different versions of rTorrent. It's almost as bad as supporting 2 different torrent clients altogether, and sometimes we simply forget which is which.
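One way to avoid forgetting which server has which config is to fingerprint the config files and group servers by fingerprint. The sketch below is an assumption about how that could look, not a description of the scripts mentioned in this post. It expects the config contents to have been collected already, and it treats lines starting with '#' as comments.

```python
import hashlib
from collections import defaultdict


def normalize(config_text):
    """Drop blank lines, '#' comment lines and surrounding whitespace,
    so cosmetic differences do not count as different configs."""
    lines = (line.strip() for line in config_text.splitlines())
    return "\n".join(l for l in lines if l and not l.startswith("#"))


def group_by_config(configs):
    """configs: dict of server name -> config file contents.

    Returns a dict of short fingerprint -> sorted list of servers
    whose normalized config is identical.
    """
    groups = defaultdict(list)
    for server, text in configs.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        groups[digest[:12]].append(server)
    return {fp: sorted(servers) for fp, servers in groups.items()}
```

With two supported rTorrent versions, the ideal result is exactly two groups. Any extra group is a server that has drifted and needs a look.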
That's why we are constantly developing in the background and making things better! You may think "oh, nothing has happened in a while", while in the backend we might have completely changed how we operate things.
We changed our management style quite radically just a month ago. It's not a visible change, but it yielded immense time savings. We are now in the process of streamlining this further to save even more time.
Friday, October 22, 2010