Stop Optimising Everything: How to Design Business Systems That Don't Snap Under Pressure
Why efficiency taken too far becomes a liability
The School of Knowledge is the weekly newsletter for SME owners and investors who want frameworks they can actually use — frameworks, checklists, and operating manuals every weekend, built to read on Sunday and use on Monday.
Photo by Barnabas Sani
There’s a video doing the rounds at the minute of Steven Bartlett banging on about how a few glasses of wine wasted—no, ruined—three days of his life. I will refrain from passing judgement on how anybody can have such a sad view of reality, but the guy’s getting hammered for his (and others’) cultish pursuit of ‘optimisation.’ Optimisation, once a word that got you appreciative nods from your boss in company meetings, is now a dirty—no, boring—word. In fact, it’s so boring a story I knew I had to use it as an opener to the third instalment of the How to Design the Systems That Run Your Business series, because it’s not just podcasters that fall prey to this type of thinking—so do businesses.
For part 1 click here, and for part 2 click here.
When you optimise for something in your business, what you’re really doing is cutting, trimming, or tightening something. Firming it up. What can look like ‘waste’, though, is often your lifeline in times of great stress. It’s like scrapping the lifeboats on a cruise ship so you can save on fuel.
Efficiency taken to its extreme strangles the system it’s supposed to be optimising.
Over-optimisation: the efficiency trap
Every cost-cutting programme, every consolidation of suppliers, and every operational optimisation tightens the system’s grip, leaving next to no margin for error. COVID-19 laid bare the ramifications of operational efficiency when taken to its extreme. Prior to COVID, most Western manufacturers practised just-in-time supply chains, and when the pandemic struck the impact was immediate and severe: Apple postponed product deliveries, Samsung and LG suspended production, Tesla closed factories, and US automotive inventory dropped from seventy-seven days of supply in January 2020 to fifty-five days by February 2021.
What Western commentary failed to grasp was that JIT supply chains aren’t bad in themselves; the bastardised versions of them are. Toyota’s original conception of JIT, as practised within the Toyota Production System, relied on short, stable supply chains, production levelling, built-in quality control, and long-term thinking. It’s a learning system designed to make problems visible so that they can be solved. What Western companies were doing was offshoring work to low-cost suppliers, aggressively cutting inventory, lengthening lead times, and removing any buffers for the sake of financial ‘optimisation.’ Such are the pressures of having quarterly earnings calls.
The difference between Toyota’s JIT model and Nissan’s was summed up by Nissan’s COO, Ashwani Gupta: “The just-in-time model is designed for supply-chain efficiencies and economies of scale. The repercussions of an unprecedented crisis like COVID highlight the fragility of our supply-chain model.”
I know you can’t just sit there and plan for pandemics, but if you’re going to remove your company’s emergency lifelines—and therefore increase the tension on your company’s proverbial elastic band—you shouldn’t be surprised when it snaps.
Normal accidents and the governance paradox
The Governance Paradox explains why system accidents aren’t anomalies but inevitable properties of the system’s structure.
Charles Perrow, a Yale sociologist, provided a theoretical framework for why some systems are inherently resistant to resilience engineering. In his 1984 book Normal Accidents, Perrow argued that when a system exhibits both interactive complexity (components interact in hidden, non-linear ways to produce unexpected failures) and tight coupling (processes happen fast, cannot be easily stopped, and leave no room for improvisation), accidents are not anomalies but inevitable properties of the system’s structure. He called them “normal accidents”—normal not because they are frequent, but because they are inherent.
For systems builders, this paradox is of great importance. Tight coupling calls for centralised management, where those in charge can make quick decisions. Interactive complexity, on the other hand, benefits from decentralised, slower decision-making, because novel problems and delayed information need people close to the fault to work out what is happening. This might sound contradictory, but Perrow’s framework anticipated disasters like Chernobyl, which struck two years after the book was published. When you add complex safety systems to already complex systems, interactive complexity grows, not shrinks.
When you have systems that are tightly coupled and interactively complex, it creates an impossible governance issue.
When designing business systems, you cannot simply bolt on resilience to an already tightly coupled and interactively complex system. It must be designed in from the beginning, through modularity and loose coupling, to allow failures to remain local. Amazon took this paradox seriously when they decomposed their monolithic codebase into microservices and organised around “two-pizza teams”: teams small enough to be fed by two pizzas, which in practice means no more than eight or so people. If one team’s service failed, the failure remained local and the others kept operating rather than the whole house coming down with it.
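The difference between a tightly coupled chain and loosely coupled units can be sketched in a few lines of Python. This is a toy illustration, not Amazon’s architecture: the unit names (billing, shipping, support) and both runner functions are hypothetical, invented purely to show a failure staying local.

```python
from typing import Callable, Dict

Unit = Callable[[], str]


def billing() -> str:
    return "ok"


def shipping() -> str:
    # Hypothetical local failure inside one unit.
    raise RuntimeError("courier API down")


def support() -> str:
    return "ok"


def run_coupled(units: Dict[str, Unit]) -> Dict[str, str]:
    """Tightly coupled: the first failure propagates and the whole run is lost."""
    return {name: unit() for name, unit in units.items()}


def run_isolated(units: Dict[str, Unit]) -> Dict[str, str]:
    """Loosely coupled: each unit's failure is caught and recorded locally."""
    results: Dict[str, str] = {}
    for name, unit in units.items():
        try:
            results[name] = unit()
        except Exception as exc:
            results[name] = f"failed: {exc}"
    return results


UNITS = {"billing": billing, "shipping": shipping, "support": support}

isolated = run_isolated(UNITS)
print(isolated)
```

Calling run_coupled(UNITS) instead raises the shipping error and returns nothing at all, even though billing and support were healthy. That is the business equivalent of one team’s problem taking every other team down with it.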
So how can you build resilience into your systems?
The four structural features of resilient systems
Resilient systems have four structural features: they have excess capacity in the form of redundancy; there is no single point of failure in the business; they diversify away risk; and they have internal buffers capable of absorbing external variety.
Redundancy means holding more capacity than you would normally need, so that it is there in the event of unforeseen circumstances. In 2011, Japan was ravaged by an earthquake that damaged a nuclear reactor in Fukushima. Toyota’s supply chain was devastated and it took them six months to restore production outside of Japan. The reason it took so long was that Toyota couldn’t even name all of its sub-tier contractors, let alone contact them. In response, Toyota conducted a comprehensive vulnerability assessment that identified up to 1,200 parts and materials that could potentially be affected by future disruptions. They drew up a list of 500 priority items to be secured with suppliers, including two to six months’ worth of semiconductors. When the global chip shortage hit in 2021, Toyota was the best-positioned automaker in the world, overtaking General Motors in US sales for the first time.
Modularity means designing systems that can absorb local points of failure without cascading. Haier, the Chinese appliance manufacturer, took this principle to its extreme under the leadership of Zhang Ruimin. From 2005 to 2012, Zhang eliminated over 12,000 middle management positions and reduced its hierarchical layers from twelve to three. The entire company was reorganised into some 4,000 micro-enterprises, each consisting of ten to fifteen people, with hiring, strategy, and compensation packages left for them to decide. Multiple micro-enterprises linked together to form ecosystems of micro-communities that share common goals but remain operationally independent. Zhang is quoted as describing his transformation as “changing a whole garden into a rainforest.” The resilience logic is straightforward: if one micro-enterprise fails, it doesn’t kill the rainforest.
Diversity ensures that the system’s components are not all vulnerable to the same failure mode. A portfolio of different suppliers spread across different regions or geographies, a leadership team with different cognitive skills, a technology stack that does not rely on a single vendor—these are all expressions of diversity as a resilience strategy. This connects directly to Ashby’s Law: a system with greater internal variety is better able to withstand external variety when circumstances change.
Buffers offer a cushion and a lifeline. Cash is the easiest tangible buffer there is. When companies don’t have cash, especially in downturns, they are headed to the company graveyard. When Toyota stockpiled up to six months’ worth of semiconductors, it might have looked like waste to some. But you don’t want to discover you need something only once it’s too late to get it.
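The value of a buffer comes down to simple arithmetic: the output you lose in a disruption is however long the disruption outlasts your cover. Here is a minimal sketch, assuming supply stops completely for the duration. The function name and the figures are illustrative assumptions, not Toyota’s data.

```python
def lost_months(cover_months: float, disruption_months: float) -> float:
    """Months of output lost when supply stops for `disruption_months`
    and the business holds `cover_months` of stock (or cash)."""
    if cover_months < 0 or disruption_months < 0:
        raise ValueError("months must be non-negative")
    return max(0.0, disruption_months - cover_months)


# Hypothetical three-month disruption: a lean operator versus a buffered one.
lean = lost_months(cover_months=0.5, disruption_months=3)
buffered = lost_months(cover_months=4, disruption_months=3)
print(lean, buffered)
```

The buffered business pays a carrying cost every month for stock it may never need. The lean one pays nothing until the disruption arrives, and then pays all at once.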
The role of slack: apparent waste as structural insurance
Organisational slack might look like structural inefficiency to one person, but opportunity—or insurance—to another. It takes on many forms: redundant employees, unused capacity, unexploited opportunities, discretionary budgets, and unstructured time.
Introduced in 1948, 3M’s fifteen per cent policy allows its employees to devote a portion of their paid working hours to self-directed, unfunded, and unmanaged projects. It is credited with the Post-it Note, which grew out of an adhesive Spencer Silver discovered in 1968 and which Art Fry helped turn into a product that launched in 1980. Google, inspired by 3M’s approach, not only adopted this policy but raised it to twenty per cent. This reportedly led to Gmail, AdSense, and Google Earth.
The purpose of organisational slack isn’t to act as a buffer to take your foot off the pedal, but to allow employees to explore their curiosity in the hope of it producing something tangible for your organisation. You’re dedicating time and care to your future business. Nick Sleep famously had ‘thinking sessions’ on Friday afternoons, and he did alright.
Stress-testing before reality does it for you
If resilience must be designed in, it must also be tested before a real crisis does the testing for you. Three methods stand out for their practical effectiveness. They are particularly effective because, as Daniel Kahneman said, teams suffer from WYSIATI (What You See Is All There Is) and fall prey to groupthink once a decision is made. These methods go some way towards actively hunting for blind spots.
Red teaming designates an individual or team to purposefully attack a strategy to find its vulnerabilities, blind spots, or poorly constructed assumptions. The concept’s roots date back to the Vatican’s Advocatus Diaboli, meaning devil’s advocate. The history of red teaming is littered with examples of it in use. After intelligence failures during the 1973 Yom Kippur War, Israel developed the “Tenth Man Rule”: when all analysts agree, a designated person must disagree and explore alternatives. The US Army formalised the discipline in 2004 at the University of Foreign Military and Cultural Studies at Fort Leavenworth. Netflix’s Chaos Monkey, created in 2010 during the company’s migration to AWS and open-sourced in 2012, is an infrastructure version of red teaming in action: a tool that ran continuously and randomly terminated server instances during business hours. The philosophy was clear: the best way to avoid failure is to fail constantly.
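The Chaos Monkey idea is simple enough to sketch as a toy simulation. This is not Netflix’s code: the instance names, the function names, and the rule that the service counts as “up” while at least one instance survives are all assumptions made for illustration.

```python
import random
from typing import Set


def chaos_round(instances: Set[str], rng: random.Random) -> str:
    """Terminate one randomly chosen live instance and return its name."""
    victim = rng.choice(sorted(instances))
    instances.remove(victim)
    return victim


def survives_chaos(replicas: int, kills: int, seed: int = 0) -> bool:
    """True if at least one instance is still serving after `kills` random terminations."""
    rng = random.Random(seed)
    instances = {f"web-{i}" for i in range(replicas)}
    for _ in range(kills):
        if not instances:
            break  # nothing left to terminate
        chaos_round(instances, rng)
    return len(instances) > 0


# Three replicas ride out two terminations; a single replica does not survive one.
print(survives_chaos(replicas=3, kills=2))
print(survives_chaos(replicas=1, kills=1))
```

The point of running this deliberately, and during business hours, is that a single point of failure shows up on your schedule rather than on the outage’s.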
Scenario planning is a strategic framework for creating and analysing multiple possible futures to prepare for uncertainty. The concept was pioneered by Pierre Wack at Royal Dutch Shell in 1971 and does not attempt to predict the future, but rather to expand the range of futures an organisation can perceive and prepare for. This very concept prepared Royal Dutch Shell for the October 1973 oil crisis, after a three-hour presentation by Wack in September 1972 had outlined precisely such a scenario, turning the company from an ugly sister to a behemoth.
The Premortem removes complacent thinking by forcing genuine quality analysis. When organisations and businesses make plans, they’re usually optimistic with risk analysis done as an afterthought: “we think this will work, but let’s consider some risks.” The problem with this type of thinking is that people stay overly optimistic, self-censor, and don’t want to seem negative—especially in group settings. The Premortem takes a page from Charlie Munger’s preference for inversion and starts at failure. When a plan is near being finalised, instead of identifying potential risks for the sake of it, Gary Klein proposes giving a short speech that goes something like this: “It’s a year from now and the plan has miserably failed. What happened?” It shifts the thinking exercise from theoretical to factual.
Final thoughts and the barbell strategy
Systems need room to expand and contract, and over-optimising everything to the nth degree puts stress on a system: stress that, at some point, is bound to break it. If you need evidence that businesses can’t handle indefinitely growing stress, you need only look at biology for what stress can do to a human body and mind when overworked. Or just ask poor Steven.
Operationalising this framework relies on one last piece of the puzzle: Nassim Taleb’s Barbell Strategy. You build buffers into your system and use scenario planning, red teaming, and premortems to avoid ruin. But you also use organisational slack for small trial-and-error bets that won’t blow up the company if they fail—but can produce meaningful upside if they come off.
Until next time, Karl.


