
Observability 2.0: Why Engineering Teams Are Moving Beyond the Three Pillars Model


Modern observability is broken. Engineering teams spend most of their time fighting cardinality explosions and vendor bills that multiply overnight, while struggling to understand their increasingly complex software systems. Charity Majors, co-founder of Honeycomb and co-author of "Observability Engineering," explains why the industry is shifting from the traditional three-pillars approach to unified storage solutions that treat observability as a development tool, not just an operational afterthought.

Key Takeaways

  • The three-pillars model (metrics, logs, traces) creates expensive data silos that force engineers to correlate information across 15 different tools for every request
  • Observability 2.0 uses unified storage and structured events, enabling real-time slicing and dicing across high-cardinality dimensions without cost explosions
  • Cardinality governance consumes most observability engineering time because traditional metrics tools break when unique identifiers exceed 100 values
  • OpenTelemetry has emerged as the solution to vendor lock-in, becoming one of the most active CNCF projects by commits and enabling portable instrumentation
  • Modern observability should underpin development workflows, not just operational debugging, with SLOs serving as the API for engineering teams
  • AI observability requires tracing entire workflows from software inputs through models to human feedback, not just monitoring model inputs and outputs
  • The best time to implement observability is while writing code, similar to writing tests—retrofitting loses the original intent and context

Timeline Overview

  • Early Career & Parse (2012-2013) — Charity experiences professionally humiliating outages at Parse due to unpredictable mobile app traffic spikes and inadequate tooling
  • Facebook & Scuba Discovery (2013-2014) — First exposure to high-cardinality observability through Facebook's internal Scuba tool, reducing debugging time from hours to seconds
  • Honeycomb Foundation (2016) — Co-founding Honeycomb with Christine Yen based on the life-changing Scuba experience, building their own database despite conventional wisdom
  • Industry Evolution (2017-2020) — Three-pillars model emerges and dominates, while Honeycomb develops serverless database architecture and unified storage approach
  • Modern Challenges (2020-Present) — Rising costs force industry consolidation, OpenTelemetry gains traction, AI creates new observability requirements for non-deterministic software

The Fatal Flaws of Three-Pillars Observability

  • The three-pillars model emerged in 2017 when Peter Bourgon coined the phrase describing observability as metrics, logs, and traces, but vendors adopted it primarily because they had separate products to sell for each pillar
  • Every request entering traditional systems gets stored across 15 different tools—metrics storage, dashboards, structured logs, unstructured logs, tracing tools, profiling tools, and analytics platforms—with no automatic correlation between them
  • Engineers become the human correlation layer, manually connecting data by recognizing patterns or copy-pasting IDs between systems, creating bottlenecks and increasing time to resolution
  • The cost multiplier effect makes this approach unsustainable as companies scale, with some organizations unknowingly storing identical request data in multiple expensive tools simultaneously
  • Datadog and similar platforms offer pre-defined bridges between tools, but these require advance planning about which information will be important and where connections need to exist
  • The fundamental problem is having many sources of truth instead of unified storage, leading to dead ends where clicking on data points cannot reveal deeper context or related information

Understanding Cardinality: The Hidden Cost Explosion

  • Cardinality refers to the number of unique items in a set—user IDs representing 100 million users create maximum cardinality, while static values like "species equals human" create minimum cardinality
  • Traditional metrics tools are built exclusively for low-cardinality data and break catastrophically when unique combinations exceed roughly 100 values, causing exponential cost increases
  • The infamous "custom metrics" billing model charges for every unique combination of metric name and tag values, not just for metrics you explicitly define—adding a single IP-address tag can increase bills 100x overnight
  • World-class observability teams spend the majority of their time governing cardinality rather than solving actual engineering problems, representing a massive opportunity cost and productivity drain
  • High-cardinality data provides the most debugging value because unique identifiers make it easier to isolate specific problems and understand system behavior patterns
  • The cruel irony is that the most expensive data to store in traditional systems is also the most valuable for understanding and debugging complex distributed applications
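The multiplication described above is easy to see with arithmetic: in a metrics system, every unique combination of tag values becomes its own time series, so series count (and often cost) is the product of the tags' cardinalities. A minimal sketch with illustrative numbers, not any vendor's actual pricing model:

```python
# In metrics systems, each unique combination of tag values becomes its
# own time series, so series count is the product of tag cardinalities.
from math import prod

def series_count(tag_cardinalities: dict[str, int]) -> int:
    """Number of distinct time series one metric can generate."""
    return prod(tag_cardinalities.values())

# A "safe" low-cardinality metric: a handful of series.
low = series_count({"region": 4, "status_code": 5})

# Add one high-cardinality tag (e.g. user_id) and it explodes:
# the bill multiplies by exactly the new tag's cardinality.
high = series_count({"region": 4, "status_code": 5, "user_id": 100_000})

print(low, high, high // low)  # → 20 2000000 100000
```

This is why adding a single innocuous-looking tag can blow up a bill overnight: the growth is multiplicative, not additive.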

Observability 2.0: Unified Storage and Structured Events

  • Observability 2.0 centers on unified storage where data exists once but can be visualized and accessed through multiple entry points, eliminating dead ends and correlation gaps
  • The architecture uses wide structured events organized around units of work, stored in columnar databases that enable real-time slicing and dicing without predefined schemas
  • Instead of storing request data 15 times across different tools, teams emit fewer but wider logs with rich context attached to each event, dramatically reducing storage multiplication
  • Honeycomb's "Bubble Up" feature exemplifies this approach—drawing a bubble around any graph anomaly automatically computes dimensional differences between the bubble contents and baseline data
  • ClickHouse, Snowflake, and other columnar stores enable this shift by handling high-cardinality data efficiently without requiring advance index or schema definition
  • The transition represents moving from an operational tool focused on errors and outages to a development tool that underpins entire software feedback loops and accelerates time to value
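The "fewer but wider logs" idea above can be sketched as one structured event per unit of work: start the event with request context, enrich it as the handler runs, and emit it exactly once at the end. All field names below are illustrative, not any vendor's actual schema:

```python
import json
import time
import uuid

def handle_request(user_id: str, endpoint: str) -> dict:
    """Build one wide structured event per unit of work."""
    event = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "endpoint": endpoint,
        "user_id": user_id,            # high-cardinality fields are welcome here
    }
    start = time.monotonic()
    # ... do the actual work, attaching any context that might matter later ...
    event["feature_flag.new_checkout"] = True
    event["cart_items"] = 3
    event["status"] = 200
    event["duration_ms"] = (time.monotonic() - start) * 1000
    return event                       # emitted once, stored once

# One JSON line per request, ready for a columnar store.
print(json.dumps(handle_request("user-8675309", "/checkout")))
```

Because every dimension rides along on the same event, a question like "which feature flag correlates with slow checkouts?" becomes a single query instead of a copy-paste hunt across tools.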

The Open Telemetry Revolution and Vendor Portability

  • OpenTelemetry has become one of the most active CNCF projects by commits and contributors, second only to Kubernetes in developer engagement and industry adoption
  • The core promise allows teams to instrument code once with OTel and redirect the telemetry fire hose to any compatible vendor, forcing competition based on value rather than lock-in
  • Semantic conventions and consistent naming within OTel pipelines enable vendors to build sophisticated automated features because they understand the data structure and meaning
  • The project inherits from its merged predecessors, OpenCensus and OpenTracing, with Ben Sigelman and Lightstep providing crucial architectural leadership for broad industry adoption
  • While critics argue OTel feels "big and bloated," most teams can extract value without understanding every component, and the tooling increasingly "just works" without deep configuration
  • This shift represents the first time in observability history where vendor migration is practically feasible, fundamentally changing purchasing dynamics and reducing long-term risk
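The "instrument once, redirect the fire hose" promise is concretely visible in an OpenTelemetry Collector configuration: the application emits OTLP, and switching vendors means editing the exporter block, not the code. A minimal sketch; the endpoint and API-key header are placeholders, while the `otlp` receiver and `otlphttp` exporter are standard Collector components:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp:
    endpoint: https://api.example-vendor.com   # swap vendors here, not in app code
    headers:
      x-api-key: ${env:VENDOR_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
```

Pointing the same pipeline at a second vendor during an evaluation is a config change, which is exactly what makes migration practically feasible for the first time.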

AI, Unknown Software, and the Future of Debugging

  • AI intersects observability in three critical areas: building and training models, developing with LLMs, and managing the universal problem of software of unknown origin
  • The proliferation of AI-generated code creates a worldwide version of Parse's original challenge—making unknown software uploaded by developers around the globe work reliably in production
  • AI observability cannot exist in isolation but must integrate with comprehensive software observability, tracing from all service inputs through models to human feedback loops
  • Many AI observability startups focus narrowly on model inputs and outputs while ignoring the broader trace-shaped, high-cardinality problem that spans entire application stacks
  • Production remains where code meets reality regardless of its origin—human-written or AI-generated code must be observed running in real environments to validate quality and behavior
  • The solution involves treating AI as intensified existing challenges rather than fundamentally new problems, requiring better traditional observability practices to handle non-deterministic software components
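The trace-shaped requirement above can be sketched with plain dictionaries: the model call is one span inside the wider workflow, sharing a trace ID with the service inputs before it and the human feedback after it. All span and field names here are illustrative:

```python
import time
import uuid

def span(trace_id: str, name: str, **attrs) -> dict:
    """One span in a trace: shared trace_id plus arbitrary attributes."""
    return {"trace_id": trace_id, "span": name, "ts": time.time(), **attrs}

trace_id = str(uuid.uuid4())
events = [
    span(trace_id, "http.request", endpoint="/summarize", user_id="u-42"),
    span(trace_id, "retrieval", docs_found=7),
    span(trace_id, "llm.call", model="some-model", prompt_tokens=512,
         completion_tokens=128, temperature=0.2),
    span(trace_id, "human.feedback", thumbs_up=False),  # closes the loop
]

# Because every span shares trace_id, one query answers questions like
# "which inputs and model parameters led to negative feedback?"
negative = [e for e in events if e.get("thumbs_up") is False]
```

Monitoring only the `llm.call` span in isolation would answer none of those questions, which is the narrowness the section above warns against.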

Engineering Culture and Organizational Transformation

  • Observability should underpin development feedback loops similar to how tests accelerate development—the best time to instrument is while writing code, capturing original intent and context
  • SLOs function as "the API for engineering teams," providing clear service level agreements that prevent micromanagement while enabling meaningful negotiations about reliability versus feature work
  • Modern observability enables engineering teams to explain their work in business language, helping CTOs and VPs of Engineering earn first-class seats at executive tables alongside other functions
  • Progressive deployment practices combined with observability 2.0 create greater value than either approach alone—feature flags, canaries, and rich telemetry enable confident rapid iteration
  • The shift from DevOps collaboration models toward platform engineering reflects observability ownership moving to teams whose internal customers are other engineers, not external operations
  • Testing in production becomes safer and more valuable when engineers can slice, dice, and immediately understand the impact of changes through high-cardinality dimensional analysis
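The "SLOs as the API for engineering teams" idea becomes concrete as an error budget: the SLO target defines how much unreliability is permitted, and the remaining budget is what teams negotiate over. A minimal sketch with illustrative numbers:

```python
def error_budget(slo_target: float, total_requests: int, failed: int) -> dict:
    """How much unreliability the SLO still permits in this window."""
    allowed_failures = total_requests * (1 - slo_target)
    return {
        "allowed_failures": allowed_failures,
        "budget_remaining": allowed_failures - failed,
        "budget_spent_pct": 100 * failed / allowed_failures,
    }

# A 99.9% SLO over 1M requests allows roughly 1,000 failures;
# 250 failures spends about a quarter of the budget.
b = error_budget(0.999, 1_000_000, 250)
```

When the budget is mostly intact, the team ships features aggressively; when it is nearly spent, reliability work wins the negotiation—no micromanagement required.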

Common Questions

Q: What makes observability costs spiral out of control?
A: The three-pillars model stores every request in 15+ different tools, while cardinality explosions in metrics systems can increase bills 100x overnight without code changes.

Q: How does observability 2.0 handle high-cardinality data differently?
A: Unified storage in columnar databases allows unlimited unique identifiers without exponential cost increases, enabling real-time exploration across any dimension.

Q: When should startups begin investing in observability?
A: As soon as code becomes real and intended for users—similar to writing tests, observability should be built while writing code to capture intent.

Q: Why is vendor lock-in less concerning in modern observability?
A: OpenTelemetry enables portable instrumentation, allowing teams to switch vendors based on value rather than being trapped by proprietary data formats.

Q: How does AI change observability requirements?
A: AI creates more software of unknown origin requiring trace-shaped observability from inputs through models to feedback, not just model monitoring.

The future of observability lies in treating it as essential development infrastructure rather than operational overhead. Teams that embrace unified storage, structured events, and development-centric workflows will build better software faster while avoiding the cost explosions and correlation nightmares plaguing traditional approaches.
