Published by nick on 02 Dec 2007 at 11:22 pm
Be the best system administrator in the world
I’ve worked as a sys admin. I’ve managed system administrators. I’ve been a customer of a system admin. I’ve worked with great system administrators, and the not-so-great. Here are some thoughts on how to be the best there is.
- Monitoring — This is the #1 secret. Monitoring is your automated Quality Assurance team. It will be there to save you, be there to point out your flaws, and enable you to fix problems before they are noticed by your customers. What’s worse than having something down for 24 hours? Having it down for two weeks before you even noticed. Monitor everything imaginable. In fact, monitor your monitors.
- Service level monitoring — are the ports listening for every important service?
- Smoketest monitoring — Regularly fetch a page just as an end user would, with a reasonable timeout. This ensures that all the pieces are working together and delivering real pages to end users.
- Network monitoring — Is the network up? From how many different locations? Have at least one offsite network monitor.
- Server stats — disk space, memory usage, CPU usage, network traffic, you get the idea. Ideally these are logged historically too, so you can pinpoint when the box started getting busy.
How to administer all this monitoring? Nagios is the industry standard. It has quite a learning curve and takes a lot of setup, but it is well worth it.
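Before committing to a full Nagios setup, the checks listed above can be sketched as a few shell functions. This is a minimal illustration, not a replacement for a real monitoring system: the function names, the timeouts, and the choice of `curl` and bash's `/dev/tcp` are my assumptions, and each function simply returns 0 when healthy and non-zero otherwise.

```shell
#!/usr/bin/env bash
# Minimal monitoring sketch. Function names and timeouts are illustrative assumptions.

# Service-level check: is something accepting connections on host:port?
check_port() {
    local host=$1 port=$2
    # bash's /dev/tcp pseudo-device opens a TCP connection; timeout caps the wait.
    timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Smoketest: fetch a page as an end user would, with a reasonable timeout.
check_page() {
    local url=$1
    # --fail makes curl exit non-zero on HTTP error responses (4xx/5xx).
    curl --silent --fail --max-time 10 --output /dev/null "$url"
}

# Pull the "Capacity" column (for example 42%) out of `df -P` output on stdin.
used_percent() {
    awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Server stats: is the filesystem holding this path below the usage threshold?
check_disk() {
    local path=$1 max_percent=$2 used
    used=$(df -P "$path" | used_percent)
    [ "$used" -lt "$max_percent" ]
}
```

A cron job could then run something like `check_port db1.example.com 3306 || echo "MySQL port is down on db1"` and `check_disk /var 90 || echo "/var is filling up"`, where the host name and thresholds are placeholders. Run at least one copy of these checks from an offsite box so that the monitor itself is not a single point of failure.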
- Minimize manual processes — Let’s pretend you’re going on vacation for 3 months. What will break while you are gone? Will disks fill up because backups are rotated by hand? What process will stop working because it needs some sort of manual intervention? Put mechanisms in place so that these items are self managed. Your goal should be to automate yourself out of a job. You’ll never get there, but this type of thinking will improve the stability and robustness of your systems.
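Taking the hand-rotated backups as an example, a small function run from cron removes that manual step. This is a sketch under assumptions: the function name, the `*.tar.gz` naming, and the flat backup directory are made up for illustration, and `-maxdepth` and `-delete` are extensions found in GNU and BSD `find` rather than plain POSIX.

```shell
#!/usr/bin/env bash
# Delete backup archives older than a given number of days.
# The *.tar.gz naming and single flat directory are assumptions for illustration.
rotate_backups() {
    local dir=$1 keep_days=$2
    # -mtime +N matches files whose age, counted in whole days, is greater than N.
    # -print lists each file that is removed, which is handy in cron mail.
    find "$dir" -maxdepth 1 -type f -name '*.tar.gz' -mtime +"$keep_days" -print -delete
}
```

For example, `rotate_backups /var/backups/db 14` (a hypothetical path) keeps roughly two weeks of archives. Pair it with a disk space monitor so you find out if the rotation ever stops running.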
- Be a Consistency Zealot - Do you have configuration files that are different for each server? Do you have one way of managing files for dev boxes and another for production? Staging? QA? Here’s a quick test that you can give yourself from bash to see how consistent you are. Try this:
# Print the checksum of the same config file on every box
for box in $myboxes; do
    ssh "$box" md5sum /etc/my.cnf
done
How did you do? If your systems are like most, there was at least one file that was out of sync. Minimize the number of these wherever practical, and aggressively defend this consistency. There must be a compelling reason to have things different. Tip: How about putting some monitors on critical configuration files so you know when they are changed?
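The loop above prints one checksum per box and leaves the comparison to your eyes. With more than a handful of boxes, it may help to let the script name the odd ones out. The sketch below is one way to do that; the function names are hypothetical, and it assumes `$myboxes` holds the same space-separated host list used in the loop.

```shell
#!/usr/bin/env bash
# Read "host checksum" lines on stdin and print the hosts whose checksum
# differs from the most common one. Ties are broken arbitrarily.
report_outliers() {
    awk '
        { sum[$1] = $2; count[$2]++ }
        END {
            for (c in count) if (count[c] > best) { best = count[c]; majority = c }
            for (h in sum) if (sum[h] != majority) print h
        }
    ' | sort
}

# Emit one "host checksum" line per box for the given file.
# $myboxes is assumed to be the same host list used in the loop above.
collect_checksums() {
    local file=$1 box
    for box in $myboxes; do
        # -n stops ssh from reading this function's stdin.
        printf '%s %s\n' "$box" "$(ssh -n "$box" md5sum "$file" | awk '{ print $1 }')"
    done
}
```

Running `collect_checksums /etc/my.cnf | report_outliers` prints nothing when every box agrees. A box that cannot be reached produces an empty checksum, so it normally shows up in the report as well.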
- Be a Simplicity Zealot — The number one rule of software, keep it simple, also applies to the system administration world. Sure, you can build a hot failover mechanism so that you save the 5 minutes of time to switch manually, but are you sure the extra complexity is worth it? If your lead developer can’t come in and figure out what is going on and fix something in an emergency, you are probably doing something wrong. I have had a couple of people who didn’t understand my simplicity push at first, but after living and breathing it for a while, they now also swear by it. If in doubt, go with the simpler path.
- Don’t be a dick — We’ve all worked with "that guy". He always talks about how stupid users are, and how everyone is beneath him. He can’t even bother to stop and talk to you, because you obviously wouldn’t understand. He answers almost every question with either "It depends" or "Read the F^@$#ing manual". He enjoys pointing out how stupid your question was, and makes you ask it perfectly before he’ll give you an answer. Don’t be this guy. Instead, try to be the guy that everyone likes to be around.
- Restores are worth more than gold — It’s critical to get backups right. People don’t want to have to worry about backups, they just expect that they are there. With backups, you can either be a hero, or a severe disappointment. Run around without people knowing and back up everything they expect, plus some things they don’t. Try to make someone’s day at least twice a year by having something backed up that they need. Tip: Also make it easy to restore. It’s awful to have a backup of something when the restore takes so long that the data is no longer useful by the time it comes back.
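One way to know a restore will work is to rehearse it regularly. The sketch below restores an archive into a scratch directory and compares it with the original. It is an illustration only: the function name is hypothetical, and it assumes the archive was created with `tar -czf archive.tar.gz -C /path/to/data .` so that the restored layout matches the source directory.

```shell
#!/usr/bin/env bash
# Restore a tar archive into a scratch directory and compare it with the
# original directory. Returns 0 only if the restore works and the contents match.
# Assumes the archive was created with: tar -czf ARCHIVE -C ORIGINAL .
verify_restore() {
    local archive=$1 original=$2 scratch status
    scratch=$(mktemp -d) || return 1
    # Extract, then compare recursively; diff -r exits non-zero on any difference.
    tar -xzf "$archive" -C "$scratch" && diff -r "$scratch" "$original" >/dev/null
    status=$?
    rm -rf "$scratch"
    return "$status"
}
```

On live data a mismatch may only mean the files changed after the backup ran, so this works best against a snapshot or data that changes rarely. Timing the run also tells you how long a real restore would take.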
- Be Trustworthy. People have to trust you. You can read these people’s e-mails, and you often have access to their personal files. You are expected to answer your phone at all times. A solid working relationship requires a high degree of trust in any job, but it’s even more important as a system administrator. Hold your word as something to be cherished. Maintain high credibility by holding back your thoughts until you verify what you say. Make sure that everyone knows that they can trust you with their secrets, and be there to catch them with a much needed backup when they mess up.
These principles will guide you towards System Administration heaven, where users are smart, disks never fill up, network lights are always green, and developers can’t screw up your systems.
Clint on 03 Dec 2007 at 8:25 am
Ugh, you users. Don’t you know you are just supposed to trust us that everything is right?!
As a sysadmin, I have to stress #5: Don’t be a dick. If you can’t be open and communicate with your users, people will do everything in their power to work around your controls, and you’ll waste lots of time keeping them in check (and of course, cleaning out their workstations because they got fired for hacking on your network ;).
On #3: monitoring when your config files change is not just good for uptime, it’s a good security practice. There’s a program called tripwire that is great for this.
For consistency, something like cfengine or systemimager is key if you have more than 15-20 boxes.
matt on 03 Dec 2007 at 9:57 am
Perfect systems are generally impossible to build, and when possible, not practical.
If you strive for perfect projects, you will likely miss the forest for the trees. Accept this, and make sure your systems are built to be "as good as it needs to be" ... or, when you have the resources, maybe even much better.
Make sure you have also planned for when things do go wrong. Many of the above points will give you an easier time fixing it (1,2,3,4,5,6). Yes, even #5: nobody wants to help a dick.