Elevating Search Resilience: A New Architecture for GitHub Enterprise Server High Availability

Introduction

Search is the backbone of almost every interaction on GitHub. From the obvious search bars and filtering mechanisms on the Issues page to the less obvious counts on releases, projects, and pull requests, a reliable search engine is critical. Recognizing this, GitHub's engineering team spent the past year reinforcing the search architecture of GitHub Enterprise Server (GHES) to make it more durable. The goal: reduce administrative overhead so that administrators can focus on what matters to their users.

The Challenges of Search in High Availability Setups

High Availability (HA) configurations are designed to keep GHES running even when parts of the system fail. An HA setup typically includes a primary node handling all writes and traffic, and replica nodes that stay synchronized and can take over if the primary fails. However, managing search indexes in such environments has historically been tricky.

Administrators had to be extremely careful with maintenance and upgrade steps. Performing those steps in the wrong order could corrupt search indexes or leave them locked, disrupting search for users. This fragility stemmed largely from how Elasticsearch, the underlying search database, was integrated into GHES.

How Elasticsearch Clustering Created Problems

In an HA setup, GHES uses a leader/follower pattern: the leader (primary) handles all writes and updates, while followers (replicas) are read-only. This pattern is deeply embedded in GHES operations. Elasticsearch, however, does not natively support such a primary–replica relationship. To work around this, GitHub engineering created an Elasticsearch cluster that spanned both primary and replica nodes. This approach made data replication straightforward and provided performance benefits, as each node could handle search requests locally.

Nevertheless, the downsides of clustering across servers eventually outweighed these advantages. For instance, Elasticsearch could decide to move a primary shard (the shard responsible for receiving writes) to a replica node. If that replica was then taken down for maintenance, the system could enter a deadlock: the replica would wait for Elasticsearch to become healthy, but Elasticsearch couldn't recover until the replica rejoined. This circular dependency caused significant instability.
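
The circular wait described above can be illustrated with a small wait-for graph. This is only a sketch of the reasoning, not GHES code: the component names (replica_startup, elasticsearch_healthy, replica_rejoined) are hypothetical labels for the three conditions in the scenario, and the cycle search is an ordinary depth-first traversal.

```python
from typing import Dict, List, Optional

# Hypothetical wait-for graph: each key cannot complete until every
# item in its list has completed. The names are illustrative only.
WAITS_FOR: Dict[str, List[str]] = {
    "replica_startup": ["elasticsearch_healthy"],   # replica waits for a healthy cluster
    "elasticsearch_healthy": ["replica_rejoined"],  # cluster needs the replica's shard back
    "replica_rejoined": ["replica_startup"],        # rejoining requires the replica to start
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    visiting: List[str] = []  # current depth-first path, in order
    done = set()              # nodes fully explored without finding a cycle

    def visit(node: str) -> Optional[List[str]]:
        if node in done:
            return None
        if node in visiting:
            # Back edge: the cycle is the path from the first occurrence onward.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dependency in graph.get(node, []):
            cycle = visit(dependency)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    cycle = find_cycle(WAITS_FOR)
    print(" -> ".join(cycle) if cycle else "no deadlock")
```

Any cycle in such a graph means no component can make progress on its own, which is why the situation could not resolve itself without intervention.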

Past Attempts to Stabilize the System

Over several GHES releases, engineers tried to make the clustered mode more robust. They introduced checks to ensure Elasticsearch remained in a healthy state and built processes to correct drifting states. A more ambitious effort involved building a “search mirroring” system to move away from clustering altogether. However, database replication is inherently challenging, and these early attempts struggled with consistency.
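
The source does not describe what those health checks looked like. As a rough idea of the kind of check involved, the sketch below inspects a response shaped like the one returned by Elasticsearch's cluster health API (GET /_cluster/health). The function name maintenance_blockers, the expected_nodes parameter, and the specific rules are assumptions made for illustration, not GHES's actual logic.

```python
from typing import Any, Dict, List


def maintenance_blockers(health: Dict[str, Any], expected_nodes: int) -> List[str]:
    """List reasons the cluster looks unsafe to touch; an empty list means none found.

    `health` is a dict shaped like an Elasticsearch cluster health response.
    The rules below are illustrative assumptions, not GHES behavior.
    """
    blockers: List[str] = []
    if health.get("status") != "green":
        blockers.append(f"status is {health.get('status')!r}, expected 'green'")
    if health.get("number_of_nodes", 0) < expected_nodes:
        blockers.append("fewer nodes than expected have joined the cluster")
    # Shards that are moving, still starting, or unassigned suggest the
    # cluster has not settled yet.
    for field in ("relocating_shards", "initializing_shards", "unassigned_shards"):
        if health.get(field, 0) > 0:
            blockers.append(f"{field} = {health[field]}")
    return blockers


if __name__ == "__main__":
    # Sample response for a two-node setup where the replica is offline.
    sample = {
        "status": "yellow",
        "number_of_nodes": 1,
        "relocating_shards": 0,
        "initializing_shards": 0,
        "unassigned_shards": 5,
    }
    for reason in maintenance_blockers(sample, expected_nodes=2):
        print("blocked:", reason)
```

A check like this can report that the cluster is unhealthy, but it cannot break a circular wait by itself, which is consistent with the article's point that checks alone did not fix the clustered design.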

The New Search Architecture for High Availability

After years of iterative work, the GitHub engineering team successfully redesigned the search architecture to achieve high availability without the pitfalls of cross-server clustering. The new system eliminates the deadlock scenarios and reduces the risk of index corruption during maintenance or upgrades.

Key improvements include:

Decoupled search service: Search is now run as a separate, independent service that can be restarted or upgraded without affecting the core GHES operations.
Improved failover handling: Replica nodes can now seamlessly take over search responsibilities if the primary node becomes unavailable, without waiting for Elasticsearch cluster health.
Simplified administration: The new architecture reduces the number of manual steps required for maintenance, lowering the risk of user error.

These changes mean administrators spend less time managing search infrastructure and more time delivering value to their users. The result is a more resilient GitHub Enterprise Server that maintains search performance even under failure conditions.

Conclusion

The journey to rebuild the search architecture for high availability was long and required a deep understanding of both Elasticsearch and GHES operations. By moving away from a clustered Elasticsearch model and introducing a dedicated search service with robust failover, GitHub has made GHES significantly more reliable. This article covers the key challenges and the solution; for deeper technical details, refer to the official documentation.

This article is based on the original post “How we rebuilt the search architecture for high availability in GitHub Enterprise Server.”
