I’m accustomed to a certain amount of bluster and grim cynicism when I talk to customers. It’s a bad time to be running an IT shop these days, especially in government.
Even before we meet, my relationship with a customer is already strained: I’m a vendor, and most vendors have only two interests: 1) the budget, 2) how much of that budget is coming to me. There’s a whirlwind of double-speak and buzzwords that I can use to confuse, inspire, and otherwise distract from that. Talking about clouds or Something-as-a-Service is useful for this, since nobody knows what’s happening in that market, and nobody knows what IT is going to look like in five years, but everyone’s certain that it’s Really Important™. Before cloud, it was virtualization. Before virt, it was Service Oriented Architectures. And so on, forever. For my customer, that means the world is more confusing than it’s ever been, and I (as a vendor) will make it worse. He’s come by this bluster and grim cynicism honestly; a world-weary impatience is a great way to get those endless product roadmap presentations over as quickly as possible.
Sometimes, though, I get surprised. Point blank, a customer recently asked me what I would do with his $1 billion IT budget. He’s looking down the barrel of a 10-25% cut in budget and a raft of new, expensive mandates from Congress and the White House. As far as he’s concerned, his budget is $100, because $1 billion will only keep the lights on. His honesty, directness, and vulnerability floored me.
My friend isn’t dumb, and he hasn’t made any bad decisions. He’s working in a bad system. Let’s call it the “Cult of the Product.” It’s an almost religious conviction that we all share: buy this thing, and your problems will go away. Both the vendors and the customers are well-trained for this because we grew up saturated with consumer advertising. The recipe is simple: play on an IT manager’s fears, and then show her how to buy enough stuff to make that fear go away. In this respect, anti-virus and mouthwash vendors are in the same business.
Let’s pass over the icky psychology of it, and look at the consequences. The Cult of the Product underlies some of the most pernicious problems in this industry. We spend far more on capital expenses than we should, because we’re buying products we don’t need and never use. We’ve built acquisition systems that take 48 months to produce a requirements document because they’re optimized to find products, not solve problems. An “IT strategy” today, or what passes for it, is not much more than a tedious (and futile) process of aligning vendor roadmaps in the vain hope that a new requirement will be satisfied on time and on budget. We spend nearly all our time making product choices, and very little time thinking about how we’d like our IT shops to actually operate.
So my friend is right to hate his vendors, and he’s asking the right question. A bill of materials is not an answer. He needs to leave the Cult of the Product and focus on making his operation functional again.
The relentless churn of technology is the problem. Innovation from industry never lets up. Kryder’s Law suggests that storage capacity doubles every year. Butters’ Law says that the cost of network capacity halves every nine months. Moore’s Law says that the industry doubles computing power every 18 months. The cruel irony, as my colleague Michael Tiemann enjoys reminding us, is that we are not doing twice as well as we were 18 months ago. With every innovation, we accumulate more options, more products and more complexity. We’re doing worse.
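To put rough numbers on that churn, here is a small sketch that compounds the three laws above over a 48-month window, the same length as the requirements cycle mentioned earlier. The doubling and halving periods are the ones quoted in this article; the function name, the constant names, and the choice of window are illustrative assumptions, not a forecast.

```python
def growth_factor(period_months: float, elapsed_months: float) -> float:
    """Multiplier after `elapsed_months` for something that doubles every `period_months`."""
    return 2 ** (elapsed_months / period_months)

# Doubling (or halving) periods in months, as quoted in the text.
STORAGE_DOUBLING_MONTHS = 12      # Kryder
NETWORK_COST_HALVING_MONTHS = 9   # network capacity cost
COMPUTE_DOUBLING_MONTHS = 18      # Moore

WINDOW_MONTHS = 48  # illustrative: one 48-month requirements cycle

storage_gain = growth_factor(STORAGE_DOUBLING_MONTHS, WINDOW_MONTHS)
compute_gain = growth_factor(COMPUTE_DOUBLING_MONTHS, WINDOW_MONTHS)
# A cost that halves every N months is the reciprocal of a doubling.
network_cost_left = 1 / growth_factor(NETWORK_COST_HALVING_MONTHS, WINDOW_MONTHS)

print(f"storage capacity: x{storage_gain:.1f}")
print(f"computing power:  x{compute_gain:.1f}")
print(f"network cost:     {network_cost_left:.1%} of the starting price")
```

Taken at face value, a requirement written at the start of that window describes hardware with about one sixteenth of the storage and roughly one sixth of the computing power available when the document is finished.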
What we’re missing are the tools and processes that must surround a set of product choices. There must be a way to yoke our performance to Moore’s Law. Why do we spend so much more on capital expenses than our peers in industry? If we’ve been doing this for almost 50 years, why is IT so hard?
It starts with acquisition.
The disconnect between the performance of government agencies and innovation in the industry has nothing to do with the products we buy. The problem is how we buy them, and how they’re used: the existing acquisition and procurement patterns are unable to quickly incorporate new innovations and bring them to bear on useful work.
The acquisition rules are built for large-scale programs with multi-year milestones. Unfortunately, industry moves far too quickly for five-year plans. By the time a government program has defined requirements, milestones, and project plans, industry has made the underlying assumptions about efficiency, cost, and return on investment obsolete. This isn’t specific to government, either: large IT projects are 20 times more likely to fail than large projects in other industries. You can read the gory details in “Achieving Effective Acquisition of Information Technology in the Department of Defense,” from the National Research Council.
Congress acknowledged this in the last DOD appropriations bill and offered some direction. Section 804 of the bill ordered the DOD CIO to identify alternative acquisition strategies for IT, which should include:
(A) early and continual involvement of the user; (B) multiple, rapidly executed increments or releases of capability; (C) early, successive prototyping to support an evolutionary approach; and (D) a modular, open-systems approach.
We can see the DOD CIO is still figuring this out, because more agile acquisitions are part of the DOD Cloud Strategy. Meanwhile, the Federal CIO announced a new Digital Government strategy which, together with the 25-point Plan for IT Reform, relies heavily on alternative acquisition strategies like Shared First. At the lofty heights of policy, then, there’s some consensus on the solution: iterate quicker, on smaller pieces, and collaborate more.
Still, complaints about the sclerotic IT acquisition model have been around for years. These new strategies and budget pressures may compel some reform, but it’s safe to say that it will be years before we see a material improvement. In the meantime, the gap between what my friend should do and what he can do gets larger. It’s why his $1 billion budget isn’t an effective tool for change: he’s spending too much money solving the wrong problems. As much as he’d like to, he can’t burn everything to the ground and start over, so what’s the plan?
Making lots of small bets.
In 1807, DeWitt Clinton, the Governor of New York with the unimprovable nickname of “Magnus Apollo”, believed he could revolutionize trade in the young country by connecting the interior of the United States with its coastal urban centers. “Clinton’s Big Ditch,” as it was called, would stretch 363 miles from Buffalo to Albany, making it one of the largest public works in history. Clinton asked Thomas Jefferson for federal funds to pay for the mammoth project. Jefferson declined: “It is a splendid project, and may be executed a century hence.” This response should sound familiar to anyone who’s dealt with an appropriations committee.
Instead of asking for all the money up-front, Clinton took an incremental approach: build one part of the canal, and allow tolls from a completed segment to fund construction of the next. Using this method, the canal had paid for itself before it was even finished, eight years later.
With an IT environment where the stakes are so high, budgets are cut by at least 10%, and there’s uncertainty everywhere, it’s the wrong time to start building 363-mile canals. It’s too expensive, and we’re in no position to plan projects that finish 8 years from now.
Instead, a series of tiny, targeted investments makes much more sense. Let’s take, as an example, cloud infrastructures. They’re new, they’re complex, and they’re expensive – and there are hundreds being built all over government.
Some take a product-driven approach: pay millions of dollars, let a vendor roll in a “cloud in a box,” or a tightly integrated software stack and declare the mission accomplished. This is the Product Cult at work: spend enough money on something shiny, and it’ll fix the problem.
The trouble is that products often create more problems than they solve. In the case of clouds, will it work with the rest of your infrastructure? Maybe. Will this be the cloud technology everyone’s using in 5 years? Possibly. Are you prepared to negotiate the maintenance renewal, now that one vendor owns your entire infrastructure? Probably not. What you’ve purchased with those millions is a whole lot of uncertainty. You’ve made the problem worse.
Instead, consider a more incremental approach: a series of tiny steps, each one paying for the next. For example:
- Standardize your system configurations, so each system is like the other and can be cheaply managed at scale. This lowers your operational expenses by making your staff much more efficient, freeing them to work on other improvements.
- Create a monitoring framework, so you know when to migrate an overtaxed system to a system with spare capacity. You don’t know if you’re spending wisely unless you can measure the results.
- Arrange for chargebacks, so you can use your monitoring system to charge your customers for the resources they use. This is as much an acquisition question as a technical one.
- Ensure portability and interoperability between infrastructures so your tools will work with many clouds, whether they’re elsewhere in your agency, in the private sector, or run by system integrators, so workloads can be moved from one to the other based on cost and demand.
- Secure your infrastructure and the systems that reside there, and make sure you have a continuous compliance system in place. This is nearly impossible without standard configurations and monitoring.
- Rationalize your application portfolio so you get rid of redundancy and enforce some discipline in your software stacks. Most folks look towards a PaaS model to drive this kind of change. I’ll talk much more about this later, but in the meantime you should read Danny Bradbury’s excellent treatment of the subject.
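As one hedged illustration of how the monitoring and chargeback steps above feed each other, the sketch below turns metered usage samples into a bill per customer. The record fields, rates, and customer names are all hypothetical; a real shop would pull the samples from whatever monitoring framework it has standardized on.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class UsageSample:
    """One metered reading from the (hypothetical) monitoring system."""
    customer: str
    cpu_hours: float
    storage_gb_months: float

# Hypothetical internal rates, in dollars per unit.
RATES = {"cpu_hours": 0.05, "storage_gb_months": 0.10}

def chargeback(samples: list[UsageSample]) -> dict[str, float]:
    """Sum each customer's metered usage and price it at the internal rates."""
    bills: dict[str, float] = defaultdict(float)
    for s in samples:
        bills[s.customer] += (
            s.cpu_hours * RATES["cpu_hours"]
            + s.storage_gb_months * RATES["storage_gb_months"]
        )
    return {customer: round(total, 2) for customer, total in bills.items()}

samples = [
    UsageSample("bureau-a", cpu_hours=1000, storage_gb_months=200),
    UsageSample("bureau-b", cpu_hours=400, storage_gb_months=50),
    UsageSample("bureau-a", cpu_hours=500, storage_gb_months=0),
]
bills = chargeback(samples)
for customer, total in sorted(bills.items()):
    print(f"{customer}: ${total:,.2f}")
```

Nothing here depends on a product. The only requirement is that usage is measured consistently, which is why the monitoring step has to come first.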
The list of improvements you need for an effective cloud infrastructure is long, and when treated as a single, monolithic project, it makes carving a 363-mile canal out of the Upstate New York wilderness an attractive career choice. Instead, find the small steps you can make now that will immediately lower your operating expenses. Each step should be a success in its own right, either by making your staff more efficient or helping you better use the resources you already have. When you make mistakes, they’ll be smaller, more manageable mistakes that can be more easily corrected. You’ll avoid wasting money on large gambles, avoid being held hostage by a single vendor, and each step towards the cloud frees up budget for the next improvement.
You’ll notice that my list of improvements doesn’t mention products at all. It describes a process – the process we should have been using a long time ago.
Monoliths and Modularity
Besides the other benefits, an incremental, evolutionary process necessarily means more modularity in the systems we build. We have seen this behavior in open source projects, and it’s explicit in the Federal CIO’s “Contracting Guidance to Support Modular Development” report:
Modular development focuses on an investment, project, or activity of the overall vision and progressively expands upon the agencies’ capabilities, until the overall vision is realized. Investments may be broken down into discrete projects, increments, or useful segments, each of which are undertaken to develop and implement the products and capabilities that the larger investment must deliver… Modular development must be viewed within the larger context of capital programming and the different levels at which program development is accomplished.
More linear, “waterfall” approaches create the monolithic, brittle systems we use today. They can’t be easily upgraded, and probably rely on a handful of locked-in vendors to give them life. These monoliths reduce your choice, your flexibility, and your ability to change your mind during our iterative, more agile approach. Monoliths plague my friend with the $1 billion budget. There are second-order effects of monolithic systems, too, that we often overlook. I heard a story about the development of one of our new fighters which illustrates the point. It’s apocryphal, like all good homilies.
A Federal contractor began building a new fighter, and that meant writing a lot of software. They made the decision that this software should not be rebooted while the plane was flying. That may sound obvious, but it’s surprisingly common to reboot aircraft in flight. If something goes wrong with a particular subsystem, the quickest resolution is to turn it off and turn it back on again. Everyone with a personal computer will understand this.
The engineers of this new aircraft, though, put their foot down. For good reason, they believed that in the 21st century, we should be able to write software that is so reliable it won’t need to be rebooted. To make sure this was true, they wired the computer to the landing gear, so you could only reboot the plane if there was pressure on the landing gear, meaning the plane was on the ground. You can guess what happens next.
During a test flight, the computer froze. The pilot was unable to reboot, so he ejected. The prototype was lost. The chastened designers were now confronted with the daunting task of rewiring the plane to allow in-air reboots. Rather than wander through the nightmare of schematics, plans, and a whole bunch of software, they did what any good engineering team would do: they hacked it. An override switch would send fake data into the computer to make it appear that the landing gear was deployed and sitting on the ground, allowing the reboot to take place.
It’s safe to assume that this isn’t the only nasty hack on that airplane. Think about the cumulative effect that years of hacks like this have on system complexity, the maintenance burden, and reliance on the original engineering team. When you work with monoliths, you inevitably confront hacks like these and they have disastrous effects on your ability to upgrade, improve, or otherwise tinker with the overall system.
Hacks, though, are notoriously common to an IT operation. There’s all kinds of monolithic software and systems that you can’t touch, can’t fix, and are frankly intimidating. It’s the reason we still have antiquated FORTRAN and COBOL systems running significant missions: they work well enough, and we’re terrified of screwing with them. Hacks are also why it’s hard to link two systems together, like the merger of MHS and VA’s electronic records:
VA and DOD planned to integrate 12 specific areas in the project, and IT is the only one to be delayed, the GAO found. The others are complete or in progress…. In a previous GAO report issued in 2011, all three of the components were delayed, the new report says. The HCC implemented “costly workarounds to address the needs these capabilities were intended to serve.”
Modularity and Change
The modularity that comes from an incremental approach allows you to more easily replace individual components of your infrastructure, reducing the need for hacks. For instance, you may decide that the monitoring system you have isn’t good enough. If that monitoring system is tightly bound to the provisioning and security systems, you’re in trouble: because of a procurement decision made years ago, you’re married to this sub-par monitoring system for the useful life of the system. If it’s a loosely coupled component, on the other hand, it’s easier to change. I love the story of the Springfield 1861 rifle that illustrates this:
The Springfield Rifle cost $20 each at the Springfield Armory where they were officially made. Overwhelmed by the demand, the armory opened its weapons patterns up to twenty private contractors. The most notable producer of contract Model 1861 Springfields was Colt, who made several minor design changes in their version, the “Colt Special” rifled musket. These changes included redesigned barrel bands, a new hammer, and a redesigned bolster. Several of these changes were eventually adopted by the Ordnance Department and incorporated into the model 1863 rifled musket.
— “Springfield Model 1861,” Wikipedia.
Colt’s improvements to the rifle were made simple by the fact that the rifle was composed of interchangeable parts. This allowed them to focus on specific, iterative improvements to the rifle’s design. If one component wasn’t working properly, it could be replaced with another with a minimum of friction.
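In software terms, an interchangeable part is a component that the rest of the system reaches only through a narrow, agreed interface. The sketch below is one way to picture that, using entirely hypothetical class names and load figures: the placement logic depends on a `Monitor` interface, so the monitoring product behind it can be swapped without touching anything else.

```python
from typing import Protocol

class Monitor(Protocol):
    """The 'pattern' every monitoring part must fit."""
    def cpu_load(self, host: str) -> float: ...

class LegacyMonitor:
    """Stand-in for the monitoring system you already own."""
    def cpu_load(self, host: str) -> float:
        return {"web-1": 0.92, "web-2": 0.35}.get(host, 0.0)

class ReplacementMonitor:
    """Stand-in for a competitor's product that fits the same pattern."""
    def cpu_load(self, host: str) -> float:
        return {"web-1": 0.90, "web-2": 0.30}.get(host, 0.0)

def least_loaded(monitor: Monitor, hosts: list[str]) -> str:
    """Pick a migration target using only the shared interface."""
    return min(hosts, key=monitor.cpu_load)

hosts = ["web-1", "web-2"]
# Either part drops in; the caller does not change.
print(least_loaded(LegacyMonitor(), hosts))
print(least_loaded(ReplacementMonitor(), hosts))
```

If `least_loaded` instead reached into one vendor's internal data structures, replacing that vendor would mean rewriting the caller, which is the tightly bound case described above.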
A more modular system is therefore more available to change and makes it easier for us to lash our future to phenomena like Moore’s Law. We already see this in commodity computer hardware: as soon as a newer, faster CPU is available, chances are very good that we can buy it from the lowest bidder and make it immediately useful. Before x86 architectures made our hardware modular in this way, that wasn’t possible. We should have the same flexibility in our software.
This “availability to change” also makes for happy accidents. The Springfield Armory was able to benefit from Colt’s experiments only because modularity made it easier for Colt to tinker with the design, and for Springfield to incorporate Colt’s improvements. More recently, Red Hat Enterprise Linux gained its 15th Common Criteria certification, and one of the key features of this new certification, secure virtualization, is a “happy accident.” You can read all about that in another article I’ve written, “How Linux, sandboxes and happy accidents can help a soldier.” We should do more to draw system agility and this kind of serendipity from our IT systems.
A real transformation.
When we talk about the government’s IT reform, shared services, and cloud initiatives, we’re talking about an opportunity for redemption. These policies are disruptive, and we can use them as a lever to introduce more flexibility, innovation, and choice into our IT systems. Too many of us are squandering this opportunity. We introduce a proprietary, monolithic virtualization layer and call it a day. We buy “clouds in a box”. Our systems get more, not less, monolithic because the Product Cult has us thinking about these reforms as a set of product requirements, rather than a change in process. If we focus on process, instead of all these products we’ve glued together, we can make our transformation profound and much more meaningful.
I recently finished Arthur Herman’s excellent history of America’s industrial mobilization during World War II, “Freedom’s Forge.” It’s inspiring for all kinds of reasons, but the chapter on the history of Consolidated Aircraft’s B-24 Liberator bomber struck me in particular.
When the Liberator was first designed, aircraft were still hand-crafted. They used some modern manufacturing methods, like the assembly line, but by-and-large each plane was unique. The 483,000 parts were cast from soft metal or rubber molds, which meant that no two parts were exactly identical. This required each part (all of which were supposed to be interchangeable) to be individually customized before it would fit in an assembly. The Consolidated Aircraft hangar where the Liberators were built couldn’t hold an assembled plane, so the fuselage had to be rolled onto the tarmac, in the San Diego sun, before the wings were affixed. The parts naturally expanded in the heat, making it impossible to assemble the entire plane without further customization. Meanwhile, the Air Force showered Consolidated with change orders as they developed more mission requirements. The challenges of building a system with unheard-of complexity, at unprecedented scale, under constantly shifting requirements, should resonate with anyone who’s run an IT shop.
Charlie Sorensen from the Ford Motor Company was brought in to fix the problem. He made some marginal improvements to tools and products, like using steel molds which would more reliably reproduce certain parts, but most of their improvements were to the process. The first thing Ford discovered was that there were no plans for the Liberator’s design. It’s astonishing to think about, but there was no single blueprint that described how nearly half a million parts were supposed to fit together. That required Ford engineers to draft two railcars’ worth of blueprints. Sorensen then halted the stream of change orders from the Air Force, and declared the Liberator “finished.” Any changes would be retro-fitted onto the mass-produced version at a number of depots throughout the country. With a final, well-defined design, they could finally address the real problem.
Sorensen approached the process the same way he approached automobile manufacturing:
- Identify the essential units.
- Produce a factory layout for each of those units.
- Define the manufacturing flow.
This allowed him to understand each portion of the manufacturing process as a kind of interchangeable part itself, so each process step became something measurable and manageable – a module. This allowed Ford to move from a single assembly line to “multi-line”: four parallel assembly lines which merged into two, which merged into one, which produced a fully-assembled aircraft. To accommodate this plan, they built the largest manufacturing facility in the world: Willow Run, just outside of Detroit. With a well-defined design, a well-defined process, and the modular infrastructure necessary to bring it all together, Liberator productivity went from less than 350 a year to 650 a month.
Consolidated Aircraft’s Liberator should serve as a cautionary tale for IT. Components that should work together don’t. Requirements change faster than our ability to respond. Facilities aren’t up to the task. Without a focus on process, the only option is to buy our way out, and like Consolidated, IT isn’t in a position to spend that kind of money. So how do we take these manufacturing lessons and apply them to IT?
The well-defined component.
Ford used steel molds to ensure consistent, well-defined components. What’s the IT equivalent of the steel mold? Two things conspire against us here: a lack of standards and the act of customization. Like the inconsistent parts from Consolidated’s soft rubber molds, we are far too eager to customize, tweak, and hack software components in the vain belief that we’re “tuning” or “optimizing.” In fact, we’re breaking everything. The technical debt we accumulate with this kind of customization can overwhelm whatever marginal improvement we’ve gained in component performance. We don’t want hand-crafted parts. We want interchangeable parts: commodities that we can competitively source and exchange, and that become completely unremarkable contributors to the success of the overall system.
This requires standards. When an organization consolidates their existing infrastructure into virtualized environments, one of the first things they notice is just how much redundancy they have. Four messaging systems? Eight different databases? Think of the cost of maintaining these variants, and then think of all the second-order effects these non-standard components have on your process. What’s your equivalent of the landing gear hack? Again, I’ll direct you to Danny Bradbury’s insight on application rationalization.
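A first pass at spotting that redundancy can be as simple as counting the distinct products that fill each function in an inventory. The sketch below assumes a hypothetical inventory of (system, function, product) rows; every name in it is made up for illustration.

```python
from collections import defaultdict

# Hypothetical inventory rows: (system, function, product).
INVENTORY = [
    ("payroll", "database", "db-alpha"),
    ("travel", "database", "db-beta"),
    ("hr-portal", "database", "db-alpha"),
    ("payroll", "messaging", "mq-one"),
    ("travel", "messaging", "mq-two"),
    ("grants", "messaging", "mq-three"),
]

def variants_by_function(rows):
    """Map each function to the set of distinct products filling it."""
    variants = defaultdict(set)
    for _system, function, product in rows:
        variants[function].add(product)
    return dict(variants)

def redundant_functions(rows, max_variants=1):
    """Functions served by more products than the standard allows."""
    return {
        function: sorted(products)
        for function, products in variants_by_function(rows).items()
        if len(products) > max_variants
    }

report = redundant_functions(INVENTORY)
for function, products in sorted(report.items()):
    print(f"{function}: {len(products)} variants -> {', '.join(products)}")
```

The `max_variants` threshold is the standard itself, written down: raising or lowering it is a policy decision, not a purchase.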
This is what’s behind the DOD’s stated desire for “load and run” systems in their cloud strategy. I fear, though, that they’ve not sufficiently distinguished between what’s a commodity and what’s crafted. Conflating the two, as we’ve already demonstrated, can be a disaster. The National Association of State CIOs has published “Leveraging Enterprise Architecture for Improved IT Procurement,” which is a fuller treatment of this idea.
In case you feel like this “IT as Manufacturing” metaphor is a little too brainy to be practical, I’ll direct you to my company’s Casio/IBM Red Hat Enterprise Virtualization case study:
Previously, we used to do a total budget estimate for each project, covering everything from development to infrastructure. Now, we can separate the infrastructure element and budget in packages. Therefore, when it comes to important projects we know about in advance, not only can we prepare the infrastructure using the budget for the previous fiscal year, but even if urgent projects occur, we can now change the priority order and handle things with more flexibility. Our procurement is faster too; preparations that previously took up to a month can now be achieved in a few days, or sometimes in just a single day.
The well-defined process.
I realize that it’s not practical to say that we can never optimize or customize. Sometimes it’s quite necessary. To make those customizations scale, though, we cannot think of them as customizations to a product, but rather customizations to process. Customizations have to be confined, repeatable, and verifiable so we know what’s different and can do it the same, every time. They are a natural part of the system integration process. Distinguishing components from the integration process, like Sorensen did with the Liberator’s retro-fitting centers, lets us properly scale and use our own resources most effectively.
While Ford created the idea of mass production, Chevrolet created flexible mass production. They relied on machine tools that could be quickly retasked for different jobs. This meant that the assembly line wasn’t optimized for producing one product, but was optimized for creating any product. This flexibility meant that Chevrolet could produce a different car every year and come to market a full year ahead of Ford. Again, they focused on their process, not their product.
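One way to keep a customization confined, repeatable, and verifiable is to record it as an explicit overlay on a standard build instead of editing the build in place. The following is a minimal sketch under that assumption; the setting names and values are hypothetical.

```python
STANDARD_BUILD = {
    "os": "standard-linux-build",
    "max_connections": 100,
    "log_level": "info",
}

def apply_overlay(baseline: dict, overlay: dict) -> dict:
    """Return a new config: the baseline plus declared customizations only."""
    unknown = set(overlay) - set(baseline)
    if unknown:
        # Confine customization to settings the standard build defines.
        raise KeyError(f"not part of the standard build: {sorted(unknown)}")
    return {**baseline, **overlay}

def drift(baseline: dict, actual: dict) -> dict:
    """Report every setting that differs, as (standard value, actual value)."""
    return {
        key: (baseline.get(key), actual.get(key))
        for key in sorted(set(baseline) | set(actual))
        if baseline.get(key) != actual.get(key)
    }

# The customization lives in one reviewable place and is applied
# the same way every time.
database_overlay = {"max_connections": 500}
database_config = apply_overlay(STANDARD_BUILD, database_overlay)

print(drift(STANDARD_BUILD, database_config))
```

The standard build is never modified, so the overlay is the complete record of what is different, and `drift` can be run against a live system's settings to confirm that nothing else has changed.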
You may have heard the DevOps credo, “Infrastructure as code,” referring to the increasingly vague distinction between operations and development. What we’re really talking about is Chevrolet’s flexible mass production: a more process-oriented IT infrastructure, built on well-understood components and tools.
With DevOps, though, our infrastructure is actually the result of our process, like a snake eating its tail. We can imagine a process that allows us to constantly commoditize components of our infrastructure as they become better-defined and understood. We’re improving our technique as we improve our understanding of the problem. Do you build your own electrical generators? Write your own device drivers? Of course you don’t. The customized, hand-crafted solutions we build should eventually disappear into the infrastructure as commodities.
Commodity hardware is already part of most systems. Commoditized virtualization is getting pretty common. Commodity application platforms have a lot of promise, as well. What these have in common is the ability to abstract problems away by confining them to specific layers of the stack. We don’t particularly care what hardware we’re using as long as we have a well-understood virtualization layer. We don’t care what virtualization layer we have if we have a well-understood set of application environments. And so forth. Once defined and compartmentalized, users should be able to deploy these components as easily and cheaply as possible.
That’s why the *-as-a-Service model is so popular now, and well-demonstrated by one of my favorite Red Hat products, OpenShift. By providing a Platform-as-a-Service, which is just an efficient way to provide commodity IT “parts” to the people that need them, we can draw a clear line between what’s a commodity and what’s hand-crafted, and users of the system can play and innovate atop the commodities as much as they like.
If our infrastructure is to the point where we have well-defined parts, a well-defined process, and the facilities to deliver them, we’re already in great shape. For a long time, I thought a Platform-as-a-Service was the final step. If I can spin up a JBoss platform and a standardized Ruby-on-Rails instance whenever I need it, what more could IT possibly provide?
The snake is not yet eating its tail. We’ve given the consumers of all this effort a tremendous platform for innovation, with the solved problems abstracted away such that they can focus on their mission. What we’re missing is the means to capture that innovation. That’s where sharing comes in, and that’s why open source is central to this vision.
When we solve a problem, we want everyone to benefit from it. That means we need software we can customize, and software licenses that let us share. We need a repository – a GitHub, a SourceForge – where we can show our work and encourage others to improve it. Once we’re happy with it, we should be able to easily turn the solution into another commoditized component, available to everyone through our Platform-as-a-Service. In the context of OpenShift, we call these Cartridges. That name evokes the Springfield Rifle, and the similarity between information systems and manufacturing. That’s no coincidence.
An Engine for Innovation
If we take all the industry trends and the Federal initiatives together, they all point to this new approach to enterprise IT. Cloud First, Shared First, modular contracting approaches, and Congressional mandates are all driving us in this direction. It’s important that we focus less on the individual mandates and market pressures, and begin instead to focus on an IT enterprise that can reinvent itself, one that is able to quickly and easily incorporate whatever disruptions Kryder’s Law or Moore’s Law can throw at us.
We need an approach to products that encourages commoditization. We need a flexible mass production process that can quickly produce new commodities. We need facilities that can quickly deliver those commodities to our users. We need shared catalogs of those commodities that allow us to feed our missions.
The scope of this is intimidating, but again: any one of these steps is valuable on its own, and we need not burn everything to the ground just yet. Make sure your developers are sharing their code repositories and using open licenses. Make sure their work can be easily discovered by others. Establish standard builds for your most frequently-used components. Deliver those components through a platform-as-a-service, and write those components into your contracts. Generate those two railcars of blueprints for your infrastructure, so everyone knows where you are and where you’re headed. Over time, these incremental improvements accumulate. You’ve stopped building a data center and started building an engine for innovation.
Thanks to John Scott for turning me on to Freedom’s Forge; to Erich Morisse, David Egts, and James Labocki for hours of conversation on this; and to Christopher Dale and Skip Cole for copyediting help.