kief's blog

Looking to hire - again

We're looking to hire again. This time I've got 6 positions to fill in my team, and we've got 4 others.

In my team, we need a Lead/Manager for the Support Team, and a Support Engineer. This has proved a tricky position to specify, since these people need to be all-rounders. They need to be able to manage our infrastructure, from OS up, so they need to be strong on Linux, scripting, and "infrastructure" tools and concepts, like monitoring, configuration management, etc. But they also need to be able to support the applications themselves, which requires knowing about Java app servers (Tomcat in particular), databases, and web servers. On top of that, they need to be customer-support focused - they don't deal with the end users directly, but do need to deal with internal customers and key people, particularly technical contacts, at our customers and partners.

On top of that we need a Senior Java Architect and an Architect/Senior Developer type. These are again tricky. I firmly believe in Architects being hands-on, and we are an agile team, so they can't just be UML monkeys. But the core development team is offshore in Poland, and our software is complex, so it will be a challenge to get up to speed with it, interface between the non-technical people in London and the developers there, and take ownership of planning a major revision. Cool roles really, but a fairly new thing for us, so we can't be 100% certain how the role will work in practice.

We're also going to hire a QA manager and test engineer.

Fun times here!

We're looking to hire a sysadmin team lead

We need someone to take up the leadership of the London-based support team at the Map of Medicine, which manages and supports our server-based software products. This is a hands-on technical role suitable for someone with senior-level technical skills, ready to move into or improve in a team-lead position. We are also looking for a junior sysadmin type, probably someone looking for their second job in the field. The contact email is tech dash jobs at map of medicine dot c o m. No recruiters wanted.

What the team does

The responsibilities for the support team involve:

  • Delivering our applications to internal and external customers and supporting them (we provide second- and third-line support; first-line helpdesk support is handled by client organizations),
  • Helping new and existing customers with the technical aspects of implementing and integrating our software into their systems and processes
  • Running, improving, and expanding our hosting infrastructure, particularly as we grow internationally.

Our company combines the agility and rapid pace of a four-year-old software start-up with the stability and resources of an international, publicly listed publishing company.

Details

We develop and provide a software service that serves as a healthcare knowledge source across multiple healthcare organisations. We deliver this as a hosted service, and also as software that partner organisations can host.

The Map of Medicine application suite is a Java-server based web application that runs on Linux or Windows on a Tomcat application server using a MySQL, SQL Server, or Oracle database.

Technologies we use

Our hosting infrastructure currently spans a production farm in redundant data centres, and a development/test "lab", and will expand to another data centre internationally. Our production systems are primarily Linux with Apache and Tomcat, with some Windows servers running SQL Server. Our test lab uses virtual servers on VMware ESX, running Oracle, MySQL, and various other bits of technology.

Ensuring that our applications are highly available, reliable, and responsive is a key driver, so we implement monitoring, security, and centralized configuration management using mainly open source tools such as Nagios, Puppet, Subversion, and OpenLDAP, as well as custom scripts. Load balancing, database clustering, and data-centre failover are other cornerstones of our availability strategy.
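
To give a flavour of the "custom scripts" side of this, here is a minimal sketch of a Nagios-style responsiveness check in Python. It is illustrative only and not one of our actual checks: the function names, thresholds, and timeout are made up for the example. The one real convention it leans on is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import time
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(elapsed, warn, crit):
    """Map a response time in seconds to an (exit_code, label) pair."""
    if elapsed >= crit:
        return CRITICAL, "CRITICAL"
    if elapsed >= warn:
        return WARNING, "WARNING"
    return OK, "OK"


def check_url(url, warn=1.0, crit=3.0, timeout=10.0):
    """Fetch url once and return (exit_code, status_line).

    The warn/crit thresholds are hypothetical defaults for this sketch.
    Any failure to fetch the page at all is treated as CRITICAL.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except (OSError, ValueError) as exc:
        return CRITICAL, f"CRITICAL - {url} failed: {exc}"
    elapsed = time.monotonic() - start
    code, label = classify(elapsed, warn, crit)
    return code, f"{label} - {url} answered in {elapsed:.2f}s"


# Pure classification, no network needed:
print(classify(0.2, warn=1.0, crit=3.0))
```

To use something like this as a real check, a small wrapper would print the status line returned by check_url and exit with the returned code, since Nagios reads the plugin's exit status and first line of output.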

We must be able to rapidly configure and deploy our applications for testing, pilot projects, and expansion. Centralized configuration management is essential to this, as is our use of virtualization technologies.

We also provide knowledge management services internally and externally, and provide leadership to the rest of the company on how to effectively use these. These are based around our wiki and bug tracker using Confluence and Jira.

The Role

The Support and Infrastructure Team Lead needs to be an experienced system administrator who can lead a team of 4+ in planning and implementing new technologies and expanding and improving use of existing technologies. This is a hands-on technical role. The Team Lead needs to be highly customer-focused, able to discuss daily operational problems and issues with non-business people and customers. They will also need to be able to think longer term and be involved in planning new applications and services and consult with client-organisations to plan their technical implementation of our software.

Required Skills

  • Strong knowledge of Linux and Windows server operating systems,
  • Strong knowledge of web-application technologies including Apache, load-balancing, virtual web-hosting, HTTP protocol,
  • Knowledge of SQL databases, including installation and principles of tuning, and at least basic SQL,
  • Experience in managing multi-server infrastructures including monitoring, configuration, user and host directory services (LDAP, DNS), shared storage, backups, etc.,
  • Scripting languages (Bourne/Bash shell scripting, Perl),
  • Excellent written and verbal communication,
  • Experience working in a technical support environment, understanding of issue tracking systems, Service Level Agreements (SLAs), first/second/third line support arrangements and other technical support processes,
  • Ability to manage and prioritize a fast-paced, rapidly changing workload keeping customer and business needs at the fore,

Desirable Skills

  • Java application servers, particularly Tomcat,
  • Server virtualization technology, particularly VMware ESX,
  • Shared storage, particularly iSCSI,
  • Experience of healthcare systems or working with or in a clinical environment
  • Presentation skills

Agile service management pattern: Deploy often

I'd like to start on a set of rules for running and supporting online services in a way that takes advantage of lean production principles. This is carrying on the thoughts stirred up by my recent exposure to ITIL, which is pretty much the IT infrastructure equivalent of the waterfall development principle.

So the first pattern I came up with in that post was around keeping people involved throughout the process of planning, rolling out, and running a service, rather than having each of these things done by a different set of people, relying on the mythical "knowledge transfer" process. I'm sure there's a lot more to say on that one, but for the moment I'd like to get another idea out.

This pattern can be called DeployOften. I've found that a good rule in most areas of life is: if you find something difficult to get right, do it more often. This is obvious in some contexts - it's called practice - but in day to day business it goes against the grain to seek out painful tasks and repeat them more than is necessary to get through the job. The benefit is that the more you do something, the better you get at it, and it becomes less painful.

The principle of doing difficult things more often is found throughout agile development methodologies, an obvious example being test-driven development. Where testing is usually a painful and dull process, typically skipped or skimped, agile makes it a central behaviour. In fact, the very first thing you should do when coding is to write automated tests.

A difficult and painful part of service management is deploying new or updated software, part of what ITIL calls the transition phase. One organisation I know of tries to release updates to their software every 3 months, and each time it's a trial. Deploying the software to the server inevitably turns up surprises, and acceptance testing by users drags on with multiple rounds as updated releases are built to fix problems discovered.

This is in spite of automated nightly and iteration builds, which somehow never bring out the same issues that come out even on staging servers, which use snapshots of current live data sets, and more rigidly mimic the live deployment environment.

The DeployOften pattern suggests that the operations team should deploy each iteration build onto a production-like environment rather than waiting for the nearly-complete release. This will raise cries from the ops people, who don't exactly have light workloads already. But by deploying every two weeks they will get it down to a very quick process, and also turn up deployment problems much sooner in the development cycle, which should raise the developers' awareness of the kinds of things they need to keep in mind to make deployment easier.

ITIL can suck, but shouldn't

The pattern of my career over the past five years or so has involved moving newish, smallish internet/software companies to a post-startup hosting infrastructure. My past three companies were all small companies that developed and hosted internet applications, either for clients (in one case) or their own products. My role in each case was to move things to a more mature infrastructure, with configuration management, monitoring, directory services, and the other pieces needed to be able to manage a growing sprawl of servers and applications.

My focus has been much more on the technical than on the people processes for running and supporting the infrastructures. In my current job, the team I've brought in and the infrastructure we've built have reached a decent level, although there's certainly plenty more to do technically. But looking at what we've done and what I want to do next, I've realized that rather than moving on and doing the same thing at another company, the more interesting challenge will be to take things to the next level.

The next level for me is going beyond the technical to focus on the people and processes. The technology infrastructure is going to grow in size and sophistication, including spreading to multiple data centres globally, but the technical challenges seem like more of the same to me. The challenges that seem newer and more intriguing to me personally are more along the lines of how the hell we're going to organize and coordinate people doing development, infrastructure, and support in three, four, or more countries.

So I went on a course in ITIL version 3. Yikes. ITIL is basically a blueprint for organizing a huge IT operation with lots of bureaucratic processes, forms, and signoffs that will make it nearly impossible to get anything done, and ensure that responsibilities are divided so that nobody who is doing anything productive sees the big picture.

I don't think it has to be this way. I actually did find the course useful, although not as useful as it could have been given that most of the people on it were more interested in ticking off the certification than getting ideas on how to improve the organisations they work at. There were some pretty interesting people there, some of whom were obviously interested in fixing real problems "back home". If the course had been more of a workshop where we shared war stories and ideas, it would have rocked.

A lot of the concepts in ITIL are useful. I think it's more a matter of using your head when applying them, making sure to adapt the ideas in ways that fit your needs and objectives. It's very easy to see how organisations, especially large ones, take the ITIL material and use it to build horribly inefficient IT structures. I've worked with companies that use ITIL this way, and the course shed light on how they got this way.

The biggest problem with ITIL is that it's presented with clearly defined "phases" of strategizing, designing, deploying ("transitioning"), operating, and improving IT services. This is an invitation to a waterfall model, where (as in at least one organization I know of) each phase can even be run by a completely different team of people.

So one group designs the service, hands it off to another that rolls it out (tests and installs it), and then hands off to a completely separate team that supports it. In the organisation I've encountered, the operations team hasn't got the vaguest clue about the service.

Of course the transition process involves "knowledge transfer" where the people who set up the service train the support team, but anybody who's done this stuff in the real world should know better.

Knowledge that is transferred in a handover process is never, ever, ever going to be learned as well as knowledge that comes from actually being involved throughout the whole process. Having some hands-off manager (ahem) overseeing things all the way through doesn't cut it. The people who will actually be diving into runtime problems with an application need to have gotten their hands dirty trying to install the application, and even to have pitched into the meetings where the details of how the application should be integrated into the infrastructure were worked out.

Otherwise, you're going to end up in the situation of my nameless organisation, one which is actually often held up as an exemplar of ITIL. They host an application on their servers, installed by the transition team, and their support team had training on how to log into the server and investigate problems with it. But when users call up with problems, the support people, who probably support dozens of applications, have forgotten all of this. They call up the software vendor - who have no access to the servers.

Can you imagine how incompetent your organisation looks when it's clear that your support people have no idea that the application they support is run by their own company?

But I do think it's possible to take many of the ideas of ITIL and apply them in a more agile manner. A bit of Googling shows I'm not the only one who thinks so, but that there doesn't seem to have been much work done on the idea, at least publicly. It's certainly something that would take a bit of thought and work.

My first thought, clearly, is that an agile IT services process would have to embrace the lean management principle of empowerment by having the "workers" (for lack of a better word) involved throughout the process.

I've also thought that the kanban approach to agile is particularly suited to a sysadmin team, since it does away with the iterations/release cycle in favor of a queue of tasks that people pull from when they find they've got spare capacity.
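
To make the pull idea concrete, here's a toy sketch in Python. It's only an illustration of the principle, not a tool I'm proposing: the class and method names are invented, and the per-person work-in-progress limit is my own assumption about how "spare capacity" might be defined.

```python
from collections import deque


class KanbanBoard:
    """Toy pull queue: people pull work when they have capacity; nothing is pushed."""

    def __init__(self, wip_limit):
        self.wip_limit = wip_limit   # assumed cap on tasks one person has in progress
        self.backlog = deque()       # prioritised queue, most important first
        self.in_progress = {}        # person -> list of tasks they are working on

    def add_task(self, task):
        self.backlog.append(task)

    def pull(self, person):
        """Return the next task for person, or None if they are at their
        limit or the backlog is empty."""
        current = self.in_progress.setdefault(person, [])
        if len(current) >= self.wip_limit or not self.backlog:
            return None
        task = self.backlog.popleft()
        current.append(task)
        return task

    def finish(self, person, task):
        self.in_progress[person].remove(task)


board = KanbanBoard(wip_limit=1)
for task in ["patch web servers", "add LDAP replica", "tune MySQL"]:
    board.add_task(task)

first = board.pull("alice")     # alice has capacity, so she gets the top task
blocked = board.pull("alice")   # already at her limit, so nothing is handed out
board.finish("alice", first)
second = board.pull("alice")    # capacity freed, next task in the queue
print(first, blocked, second)
```

The point of the sketch is that there is no iteration boundary anywhere: the only things that control flow are the order of the backlog and how much each person already has on their plate.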

Anyway, I'm looking forward to thinking this stuff through and trying out ideas over the next year. Although I'm going to be far less hands-on technically, my focus does need to involve a thorough understanding of the technical aspects of what we're doing, so I don't think I'm going to become a total suit.

SAN Virtualization

Virtualization in the server farm relies heavily on storage, something I understand only to a certain level. This review by InfoStor discusses virtualization of the SAN itself to get the flexibility you really need to get the best value out of server virtualization. It's somewhat over my head at the moment - I haven't gotten to this point yet - so I'm putting this link here for future reference.

Found via virtualization.info.

Recommended blogs for virtualisation

I've added a few virtualization-oriented blogs to my feed reading. My favorite is VMUNIX Blues, by Mark Mayo. His posts aren't as frequent as the other two below (although much more frequent than mine!), but the quality is high and he's doing hands-on stuff. I first found him by Googling for thoughts on shared storage approaches, which turned up this post about using NFS with VMware Server.

Virtualization.info is by Alessandro Perilli, a consultant who specialises in this stuff.

VMblog seems to be driven by industry press-releases, but is nevertheless a worthwhile resource.

Comments off

Grr. Turn your back on your blog for a minute, and it gets filled up with comment spam. Ok, maybe it's been longer than a minute.

Anyway, I've upgraded to a newer version of Drupal, and turned off comments for the time being. I had to switch off my old theme because it has issues. The current theme will probably stick around for a while. I do plan on tackling the comments though.

Whipping up a solid LDAP infrastructure

I've been much too quiet lately. I'm still hard at work putting together what I hope will be a very strong infrastructure for my company's application hosting operations, with about 15 servers for production, content management, and staging and testing.

One of the core components of this infrastructure is an OpenLDAP server, which I've been working on over the past week. Up until now it's been enough to have a couple of accounts which are created locally on all of the servers by Puppet. I've got a chunk of disk space on a SAN which is shared across the machines, which is handy for having a common home area for key accounts I use to log in and administer the machines, as well as the Puppet templates and manifests.

The cool kids talk about operations

Tim O'Reilly, the boss of O'Reilly publishing and a key booster of the Web 2.0 meme, recently posted an article about operations.

One of the big ideas I have about Web 2.0 [is] that once we move to software as a service, everything we thought we knew about competitive advantage has to be rethought. Operations becomes the elephant in the room.

O'Reilly laments that most of the tools for deploying systems and applications on open source platforms (i.e. Linux) are not themselves open source. Luke Kanies and others have commented on the article with examples of open source deployment and operations management tools, including Puppet, and others I've mentioned for system configuration and network monitoring.

I disagree with Lessig's evaluation of the Net Neutrality camps

Lawrence Lessig declares that the two camps of the Net Neutrality debate are those who built the Net vs. those who never got it. I don't think that's accurate: the telecoms and cable networks have been a pretty key part of the Net. (Found via Rafe, btw.)

I think the real division here is content providers vs. pipe owners, and the attempt to do away with network neutrality is essentially a coup attempt by the people who own the pipes.

People use the Net for the content, so that's where the value is. The pipes are just a commodity, expected simply to deliver the content. Selling a commodity service means competing on price, which means low margins.