Blog

Understanding load balancers for high performance and reliability

A technical overview of load balancers, including algorithms, architectures, and design considerations for low-latency and large-scale production systems.

November 4, 2024 | 26 min read
Alexander Patino
Solutions Content Leader

A load balancer distributes incoming work across multiple servers or resources so no one server becomes a bottleneck. In practical terms, it acts as an intermediary between users and a group of backend servers, directing each user request to one of the servers to make overall processing more efficient. By spreading tasks more evenly, load balancing improves response times and prevents some nodes from being overloaded while others sit idle.

Applications often serve millions of users concurrently, which requires more than one server to handle the load. A load balancer sits in front of these servers, sometimes called a server farm or cluster, so all servers share the traffic roughly equally. In essence, it is like a traffic cop for your application: Clients send requests to the load balancer, and the load balancer forwards each request to an appropriate backend server. This uses available computing resources more efficiently and also makes the service more reliable because if one server fails or slows down, the load balancer routes new requests to other servers that are working better.

Moreover, a load balancer is usually transparent to users. Users interact with one endpoint, such as a public IP or domain name, unaware that behind the scenes their requests may be handled by any one of several servers. The load balancer makes the group of servers look like one unified service. As a result, applications serve more users and deliver faster responses than would be possible with one machine.

How load balancers work

A load balancer listens for incoming client requests, then decides which backend server should handle each request, according to a distribution policy or algorithm and the current status of each server. Once the choice is made, the load balancer forwards the request to that server and, in many cases, acts as the client to the server and as the server to the client, which allows it to manage the conversation in both directions transparently.

For reliability, load balancers typically perform health checks on servers. This means the load balancer regularly pings or sends test requests to each backend server to verify it’s responsive. If a server is down or not responding properly, the load balancer stops sending user traffic to it and diverts new requests to other servers that are running better. Users don’t experience errors and might not even notice that one of the servers went offline. When the failed server recovers, or a new server is added, the load balancer detects it and resumes routing traffic there as needed.
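As a minimal sketch of this idea, a health checker can probe each backend on a timer and keep a map of which servers are in rotation. The backend addresses and the `/health` endpoint below are hypothetical, and real checkers typically require several consecutive failures before marking a server down:

```python
import time
import urllib.request

# Hypothetical backend pool; True means "in rotation".
BACKENDS = {"10.0.0.1:8080": True, "10.0.0.2:8080": True}

def check(server: str, timeout: float = 2.0) -> bool:
    """Return True if the backend answers its health endpoint in time."""
    try:
        with urllib.request.urlopen(f"http://{server}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def health_loop(interval: float = 5.0) -> None:
    """Periodically probe every backend and mark it in or out of rotation."""
    while True:
        for server in BACKENDS:
            BACKENDS[server] = check(server)
        time.sleep(interval)

def healthy_backends() -> list[str]:
    """Servers currently eligible to receive traffic."""
    return [s for s, ok in BACKENDS.items() if ok]
```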

Another aspect of how load balancers work is session persistence, also known as sticky sessions, in situations where it’s needed. By default, a load balancer might send each request to a different server, and for many stateless web services, this is fine. 

However, consider a user shopping on an e-commerce site: As they add items to their cart, those cart details might be stored in the memory of the server that handled the first request. In such cases, it’s important for all later requests from that same user’s session to go to the same server, so the user’s cart and login state don’t get lost. 

Load balancers provide session persistence by marking a client’s browser with a cookie or using the client’s IP address to consistently map to the same server. This means one user’s session data stays on a single server, at the slight expense of evenly spreading those specific requests. If that server becomes unavailable, the load balancer routes the user to a new server and, depending on application design, might initiate session failover or a new session on the new server.

Because the load balancer is important infrastructure, it is usually designed with redundancy and high availability in mind. In production setups, there are often at least two load balancer instances, in active-passive or active-active mode, so if one load balancer instance fails, the other takes over. This prevents the load balancer itself from becoming a single point of failure. Cloud load balancing services and on-premises appliances often have this redundancy built in, running in clusters so the failure of one unit does not bring down the service.

When a client request comes in, the load balancer selects a target server based on its algorithm and the servers’ health, forwards the request, and then returns the server’s response back to the client. If a server is overloaded or offline, the load balancer dynamically adjusts to protect the overall service. This process happens for every single request, usually in milliseconds or less, making the distribution invisible to the end user. By centralizing these traffic decisions, a load balancer simplifies application scaling and resilience: you can add or remove servers behind it without changing the client-facing endpoint, and you can survive server outages without disrupting users.


Load balancing algorithms

When directing traffic, a load balancer follows a policy or algorithm to decide which server should handle each request. Choosing the right algorithm is important because it affects how evenly and efficiently traffic is spread. Load balancing algorithms generally fall into two categories: static algorithms, which use a fixed method not dependent on current server state, and dynamic algorithms, which consider real-time server conditions. Here are some of the most common load balancing algorithms and how they work:

Round robin (and other static methods)

Round robin is one of the simplest and most widely used load balancing methods. In a round-robin scheme, the load balancer sends each new request to the next server in a list, cycling through all servers sequentially. 

For example, if there are three servers and the first request goes to Server A, the next request goes to Server B, then Server C, and then back to Server A, and so on. This approach doesn’t examine how busy each server is; it just assumes an even distribution will even out the load over time. Variations of round robin include weighted round robin, where servers are assigned weights or proportions, and a server with a higher weight gets more of the requests. This is useful if some machines are more powerful than others. Round-robin DNS is used at the domain name system level as a basic form of load distribution, by rotating through a list of IP addresses for a given hostname.
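The round-robin rotation and its weighted variant can each be sketched in a few lines of Python; the server names and weights here are purely illustrative:

```python
import itertools

class RoundRobin:
    """Cycle through servers in order, ignoring their current load."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class WeightedRoundRobin:
    """Repeat each server by its weight before cycling, so a weight-2
    server receives twice as many requests as a weight-1 server."""
    def __init__(self, weighted_servers):  # e.g. [("big", 2), ("small", 1)]
        expanded = [s for s, w in weighted_servers for _ in range(w)]
        self._cycle = itertools.cycle(expanded)

    def pick(self):
        return next(self._cycle)
```

Production implementations usually interleave weighted picks more smoothly rather than sending bursts to the same server, but the distribution over time is the same.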

Static algorithms, such as round robin, work well when all servers have similar capacity and incoming requests are relatively uniform. They are simple and have low overhead. However, because they ignore the current load on each server, static methods sometimes lead to suboptimal distribution. For instance, if one server has become slow due to external load or hardware issues, round robin will still send it traffic in turn, potentially loading more work on an already slow node.

Least connections and other dynamic methods

Dynamic load balancing algorithms try to account for the live state of each server. One of the most common dynamic strategies is least connections. In the least connections method, the load balancer monitors how many active connections each server currently has (or, in a web context, how many active sessions or pending requests). When a new request arrives, it is directed to the server with the fewest active connections. 

The assumption is that a server handling fewer connections is likely less busy and can accept new work more quickly. This helps prevent overloading a server that’s already handling a lot of traffic. A refinement is weighted least connections, which, similar to weighted round robin, accounts for servers having different capacities by multiplying the connection counts by a weight.
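A minimal least-connections picker, with optional weights to model unequal capacity, might look like this sketch (server names and weights are assumptions):

```python
class LeastConnections:
    """Route each new request to the server with the fewest active
    connections, scaled by a capacity weight when one is given."""
    def __init__(self, servers, weights=None):
        self.active = {s: 0 for s in servers}
        self.weights = weights or {s: 1 for s in servers}

    def pick(self):
        # Weighted least connections: compare connection counts
        # divided by capacity, so bigger servers absorb more load.
        return min(self.active, key=lambda s: self.active[s] / self.weights[s])

    def on_open(self, server):
        self.active[server] += 1

    def on_close(self, server):
        self.active[server] -= 1
```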

Another dynamic approach is least response time, where the load balancer tries to send traffic to the server that is responding the fastest on average. In this case, the balancer might measure or be informed of the average response latency of each server in addition to connection counts, and direct new requests to the server that currently offers the quickest response. This method reacts to differences in server performance or workload; if one server slows down, the algorithm favors others to maintain a better overall response time.
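As a rough illustration, a least-response-time picker can smooth per-request latencies with an exponentially weighted moving average (EWMA); the smoothing factor and the cold-start handling below are arbitrary choices, not a standard:

```python
class LeastResponseTime:
    """Pick the server with the lowest smoothed latency. Each completed
    request reports its latency; alpha controls how quickly the
    average reacts to change."""
    def __init__(self, servers, alpha=0.2):
        self.alpha = alpha
        self.ewma = {s: 0.0 for s in servers}  # 0.0 = no data yet

    def report(self, server, latency_ms):
        prev = self.ewma[server]
        # Seed with the first observation, then blend in new samples.
        self.ewma[server] = latency_ms if prev == 0 else (
            self.alpha * latency_ms + (1 - self.alpha) * prev)

    def pick(self):
        return min(self.ewma, key=self.ewma.get)
```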

More advanced load balancers use resource-based or load-aware algorithms, which take into account server metrics, such as CPU usage or memory usage. In these setups, an agent on each server reports its health and load metrics to the load balancer, which then distributes traffic efficiently, such as not sending new jobs to a server that is high on CPU or about to run out of memory.

Session affinity and hashing

As discussed earlier, sometimes it’s necessary to stick a particular user’s traffic to the same server, known as session persistence. Load balancers accomplish this with algorithms that use hashing or cookies.

One such method is source IP hash. In a source IP hash algorithm, the load balancer takes the client’s IP address, and sometimes the requested URL or other information, and runs a hash function to produce a consistent output that maps to one of the servers. The result is that the same client IP always hashes to the same backend server, as long as the pool of servers remains the same. This way, a returning client is likely to be sent to the same server they were using earlier, which is useful for session persistence without storing any session data on the load balancer itself.
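A source IP hash reduces to hashing the client address and taking the result modulo the pool size; the choice of SHA-256 here is illustrative, as real balancers often use faster non-cryptographic hashes:

```python
import hashlib

def pick_by_source_ip(client_ip: str, servers: list[str]) -> str:
    """Map a client IP to a server deterministically: the same IP
    always lands on the same server while the pool is unchanged."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

Note the modulo step is why this simple form is disruptive when the pool changes: resizing `servers` remaps most clients, which is the problem consistent hashing addresses.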

Another common technique at the application layer is to use a cookie-based affinity. In this case, when a load balancer routes a client to a server for the first time, it sets a cookie in the client’s browser identifying that server. On later requests, the balancer sees the cookie and routes the client to the same server. Many HTTP load balancers, such as AWS Application Load Balancer or NGINX, support this out of the box, often called “sticky sessions.”

Consistent hashing algorithms, used in systems such as distributed caches and some load balancers, reduce disruption when servers are added or removed. In consistent hashing, both servers and keys, such as client identifiers, are hashed onto the same ring or continuum of values; each client or request maps to the next server on the ring. This way, if one server goes down or a new server is added, the mapping adjusts slightly, but many clients will still hash to the same server as before. This provides a balance between load distribution and routing stability. 
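A minimal consistent hash ring, using virtual nodes to smooth the distribution, might look like this sketch; the vnode count is a tunable assumption:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash servers (with virtual nodes) onto a ring of integers; each
    key maps to the next server point clockwise, so removing a server
    only remaps the keys that pointed at its points."""
    def __init__(self, servers, vnodes=100):
        self._ring = []  # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def pick(self, key: str) -> str:
        # Wrap around the ring with the modulo.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]
```

Removing a server from the pool leaves every key that mapped to a surviving server untouched, which is the routing stability the text describes.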

The choice of algorithm affects application performance and resource use. In practice, load balancers allow configuration of different algorithms, and the best choice depends on the workload pattern. For instance, round robin might be sufficient and fast for homogeneous, stateless web requests, while least response time or least connections might be better for workloads with variable processing times per request. Some systems combine approaches or switch algorithms on the fly based on conditions. Regardless of the specifics, the goal is always to use all servers effectively without overloading any one server unnecessarily.


Types of load balancers

Not all load balancers work at the same level of the network stack. The two most common categories are Layer 4 load balancers and Layer 7 load balancers, referencing the OSI model layers. There are also different form factors for load balancers, from dedicated hardware appliances to software-based balancers to cloud-managed services. Each type has its use cases and advantages, especially when considering high-performance, low-latency requirements.

Layer 4 load balancers (transport level)

Layer 4 load balancing operates at the transport layer (TCP/UDP) and is concerned only with network and transport information, such as IP addresses, ports, and protocols, without understanding the application data. A Layer 4 load balancer receives a TCP connection or UDP datagram and forwards it to one of the backend servers based on its algorithm. It doesn’t inspect the content of the message beyond the basic headers. 

Because of this narrow focus, L4 load balancers are fast and efficient. They handle high throughput and introduce little latency, because they’re just shuttling packets between client and server after making a routing decision. This is why L4 load balancing is often used when ultra-low latency is important or when dealing with non-HTTP protocols.

For example, a network load balancer at Layer 4 might be used for streaming traffic, VoIP, or as a general TCP proxy for database traffic. It considers factors such as the client’s IP and port and uses a method such as round robin or least connections to pick a target server. Then it will typically perform network address translation to send the packets to that server. The client and server complete their TCP handshake through the load balancer, and thereafter, the load balancer efficiently relays the byte stream. L4 load balancers handle millions of connections per second and high packets-per-second rates, making them suitable for high-performance networking scenarios.

However, L4 load balancers do not understand anything about the actual web request or application behavior. They cannot make routing decisions based on HTTP paths, headers, or content types. They also cannot easily provide features such as compressing responses or injecting cookies, because they don’t look at the payload. Their strength is simplicity and speed.

Layer 7 load balancers (application level)

Layer 7 load balancers operate at the application layer. For web services, this usually means they understand protocols such as HTTP, HTTPS, and gRPC. These load balancers look at the request data to make much more granular decisions. 

For instance, a Layer 7 balancer reads the URL path or host header of an HTTP request and decides to route traffic for /api/* endpoints to one cluster of servers, but traffic for /images/* to a different cluster that might be better for serving static content. They also terminate SSL by decrypting incoming HTTPS requests and possibly re-encrypting when sending to the backend, insert or strip HTTP headers, and implement policies such as content-based routing or user-based routing. In short, L7 load balancing adds content switching and awareness to the distribution of requests.
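Content-based routing of this kind reduces to a longest-prefix match over configured routes. The pools and prefixes in this sketch are hypothetical, mirroring the /api/* and /images/* example above:

```python
# Hypothetical route table: path prefix -> backend pool.
ROUTES = [
    ("/api/", ["api-1:8080", "api-2:8080"]),
    ("/images/", ["static-1:8080"]),
]
DEFAULT_POOL = ["web-1:8080", "web-2:8080"]

def route(path: str) -> list[str]:
    """Return the backend pool for a request path, matching the
    longest configured prefix first."""
    for prefix, pool in sorted(ROUTES, key=lambda r: -len(r[0])):
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL
```

A balancing algorithm such as round robin or least connections would then pick one server from the returned pool.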

Because they work at the level of application messages, L7 load balancers support features such as web application firewalls, cookie-based session stickiness, and more sophisticated health checks, such as checking specific URLs on a server. They are essential in microservices architectures and web APIs, where you might route requests not just based on load but on the function of the request. For example, many microservice deployments use an L7 load balancer, or “API gateway” or ingress controller in Kubernetes terms, to route incoming requests to the correct service based on the URL or other attributes.

The tradeoff is that Layer 7 load balancers are more complex and slightly slower than L4 balancers, because they must parse messages and possibly buffer and modify them. Inspecting HTTP headers and payloads takes CPU and memory. In high-throughput systems, this means an L7 balancer might become a bottleneck if not scaled properly. 

Nevertheless, L7 load balancers are optimized, often built on efficient event-driven I/O and C/C++ code or eBPF in the kernel, and still have low latency for most situations. The additional latency introduced is usually on the order of milliseconds or less, which is acceptable in exchange for the advanced routing and control capabilities.

Most enterprise deployments use both L4 and L7 load balancing in combination. A common pattern is to have a Layer 7 load balancer at the edge handling HTTP termination and routing among web services, and then perhaps use Layer 4 load balancing for internal traffic between tiers. 

For example, in a cloud environment, you might use an L7 Application Load Balancer to direct user traffic, which then forwards to a pool of application servers. Those application servers might themselves use an L4 load balancer to distribute requests to a cluster of database or cache servers deeper in the system. Using L7 where you need intelligence and L4 where you need raw performance creates an efficient, scalable overall architecture.

Hardware, software, and cloud load balancers

Beyond the OSI layer, load balancers are categorized by how they are deployed:

Hardware load balancers 

These are physical appliances: specialized network devices running proprietary software from vendors such as F5 Networks, Citrix, or Cisco, dedicated to load balancing and often to related functions such as encryption offload and intrusion detection. 

Hardware load balancers, sometimes called Application Delivery Controllers, were popular in traditional data centers. They often have custom ASICs or FPGAs optimized for network processing, which means they handle high throughput with low latency. Enterprises historically relied on hardware appliances for mission-critical, high-volume sites because of their performance and stability. They typically support both L4 and L7 load balancing. 

The downside is cost and flexibility: they are expensive, and because they are physical devices, you scale by buying more boxes. In today’s cloud-centric world, reliance on hardware is lessening, but they are still used, especially in environments with ultra-low latency demands and in industries that require dedicated on-premises networking gear.

Software load balancers

A software load balancer is an application or OS-level service running on standard servers that performs the load balancing function. Examples include HAProxy, NGINX, Envoy, and Traefik in the open-source world, or cloud-provided software such as AWS’s Elastic Load Balancer or Azure’s load balancer services. These run on general-purpose hardware or virtual machine/containers and use the server’s CPU to examine and route traffic. 

Software load balancers are flexible because you can deploy them on commodity hardware, configure or script them, and scale horizontally by running more instances as needed. They have kept pace with performance improvements; for instance, software load balancers such as NGINX or Envoy handle hundreds of thousands of requests per second per instance on CPUs. 

In many cases, software load balancers are only a little slower than hardware devices and are cheaper. They also integrate well with automation and DevOps processes, because they can be treated like any other software service.

Cloud-managed load balancers

Major cloud providers offer load balancing as a service, such as AWS ELB/ALB/NLB, Google Cloud Load Balancing, Azure Load Balancer, and Application Gateway. These are software load balancers managed and distributed by the cloud platform. When you use a cloud load balancer, you don’t worry about individual instances because the cloud provider keeps the load balancing service available, scalable, and maintained. Cloud load balancers often do global routing too. For example, Google’s cloud load balancer provides one global anycast IP that serves users from multiple regions. 

For companies already in the cloud, these services simplify operations because they handle patching, scaling, and redundancy behind the scenes. They are usually designed to be multi-tenant and robust, capable of handling spikes in traffic. One potential consideration is cost, as cloud providers charge for data processed and time used, but they save the effort of managing your own load balancing servers.

Many enterprises use a hybrid approach: For internal traffic or special needs, they might run software load balancers on-premises or in virtual machines, while using cloud load balancers for user-facing traffic. The core concepts remain the same, so the knowledge of load balancing transfers across these environments.

In terms of performance, hardware and cloud load balancers each have their advantages: hardware for raw performance in a data center, and cloud for scalability close to users globally. But a well-tuned software load balancer on commodity hardware meets the needs of most high-performance, low-latency systems today. The choice often comes down to existing infrastructure, expertise, and specific feature needs such as SSL offloading, custom routing logic, or integration with container orchestration.

Global load balancing

So far, we’ve discussed load balancing within a set of servers in one location. Many large-scale systems also use global server load balancing to distribute traffic across multiple data centers or geographic regions. Global load balancing typically operates via DNS or at the application layer using anycast networks. It directs a user’s request to the best region, usually the nearest or healthiest, rather than just balancing among servers in one site.

For example, a global load balancer directs a European user to a service’s European data center, while a user in Asia is served from an Asian data center, reducing latency for each user. It is also more resilient against site outages: if one region goes down due to a network failure or power outage, the global load balancer routes all users to another region that is still running. This makes applications more available than local load balancing alone.

DNS-based global load balancing is common: the DNS server responds to queries with different IP addresses based on the user’s location or where capacity is available. More advanced systems use anycast IP addresses announced from multiple locations; the network routes a user to the closest instance of that IP. At the application level, some cloud load balancers offer global anycast services where one load balancer IP or URL fronts servers in many regions and directs each request to the nearest running backend based on real-time latency measurements.

Global load balancing often involves health monitoring just like local load balancers, but extended to data centers or clusters. For instance, it might detect when a site’s service is down or performing poorly and temporarily stop sending users there. It also incorporates geographic policies, such as sticking users to their home region for data sovereignty or regulatory reasons, unless there’s a failure.

Enterprises with a worldwide user base or multiple active-active data centers rely on global load balancing to meet their performance and uptime requirements. It adds another tier of distribution: first deciding which location should serve the user, and then the local L4/L7 balancer at that location picks the specific server. The result is a resilient, distributed system where both local resources and global resources are used efficiently.


Benefits of load balancing for high-performance systems

Load balancing is fundamental to building high-performance, low-latency services that scale. By distributing workload and providing redundancy, load balancers deliver several benefits enterprises need. 

High availability and reliability

A primary benefit of load balancers is improving system availability. If one server goes offline due to a crash or maintenance, the load balancer stops sending traffic to it and redistributes work to the remaining servers. This means the overall application stays available to users despite individual component failures. 

You can maintain servers one at a time with rolling upgrades without taking down the service: the load balancer directs users to other servers while one is being serviced. In high-uptime environments, multiple load balancer instances are also used, so the load balancing itself isn’t a single point of failure. The net effect is a more fault-tolerant system: Users experience fewer outages and fewer errors because the load balancer handles failures behind the scenes.

Load balancers also perform continual health checks to detect a troubled server, even if it hasn’t crashed, such as if it’s responding slowly or returning error status codes. They proactively remove such a server from rotation until it recovers. This self-healing is important in large-scale systems where, at any given time, some servers might be experiencing issues. By isolating problems, the load balancer ensures that users may never hit the bad node at all, or that only a few requests do before it’s fenced off.

For global services, load balancing across data centers improves reliability even in the face of an entire site outage. If a region fails, global load balancing routes users to backup sites for disaster recovery. This level of resiliency, surviving not just machine failures but cluster or zone failures, is increasingly expected for enterprise systems.

Scalability and elasticity

Load balancing also supports horizontal scaling. Instead of scaling up one supercomputer, you can add many commodity servers and use a load balancer to distribute work among them. This increases capacity by adding more servers behind the load balancer. For enterprises, this lets you design your service to handle growing traffic by scaling out, which is often more cost-effective and flexible than scaling up one machine.

By preventing any one server from getting swamped with too much load, a load balancer allows the overall system throughput to increase almost linearly with each server added. For example, if one web server handles 1,000 requests per second (req/s), then 10 servers with a good load balancer might handle almost 10,000 req/s, because the balancer spreads the clients roughly evenly among them. Without a balancer, clients might all go to one server and overload it while others do nothing.

This intelligent distribution also avoids bottlenecks. A load balancer senses if traffic is spiking and spreads the load so no one server becomes the choke point that slows everything down. Many load balancers, especially cloud ones, also allow auto scaling, where they integrate with orchestration systems to spin up new server instances when load increases and then later spin them down when load decreases. The load balancer in those scenarios sends traffic to new servers as they come online. This elasticity helps cope with bursty workloads or rapid growth, common in high-traffic enterprise applications.

Ultimately, load balancing gives architects the freedom to design systems that meet demand by adding resources on the fly. It decouples the client’s view of capacity from the number of servers. This not only improves performance under load but also adds flexibility: you can replicate production environments, scale parts of an application independently, and scale geographically, all using the load balancer to route traffic appropriately.

Security and attack mitigation

While the main job of a load balancer is to handle traffic distribution, it also contributes to security as a frontline defense tool. By spreading requests among many machines, a load balancer makes it harder for an attacker to overwhelm any one server. 

For instance, in a basic denial-of-service attack scenario, if an attacker directs a flood of traffic at your service, a load-balanced environment absorbs that traffic across multiple servers, while a single server would fail under the load. 

Some cloud load balancers provide DDoS protection by scaling up their capacity and using scrubbing centers to filter bad traffic. Even in on-premises setups, spreading traffic across multiple links or servers via a load balancer buys time and capacity during an attack and allows other security appliances or software to engage. 

Many load balancers also come with features that improve security: they enforce consistent SSL/TLS configurations, acting as a central point for certificate management and encryption policies; sanitize or validate HTTP requests; and block traffic from suspicious sources. In a multi-tier architecture, a load balancer can be configured to allow traffic only on certain ports or with certain protocols through to the backend, acting as a smart gatekeeper. 

In addition, load balancers integrate with firewall and IDS/IPS solutions by directing traffic through them. For example, a load balancer might route all incoming requests through a cluster of web application firewall nodes.

Finally, by centralizing ingress through a load balancer, an enterprise gains one point to enforce security policies. Backend servers can be locked down so only the load balancer communicates with them, reducing their exposure. The load balancer is often the only part exposed to the internet, which simplifies auditing and securing the perimeter.

By eliminating single targets, load balancers make it harder for an attacker to take the service down, and they help manage and mitigate unusual traffic patterns. This helps maintain performance and uptime even under hostile conditions.

Consistent performance and low latency

For high-performance, low-latency systems, avoiding slowdowns is a top priority. Load balancing helps provide consistent, predictable response times. By preventing overload on individual servers, requests aren’t stuck in long queues waiting to be processed. If one server were to take more load than it can handle, processes using that server would see high response times, or even timeouts, creating a “long tail” of slow requests. A good load balancer avoids that by keeping the work more evenly distributed, so all servers run in a comfortable range and most requests complete in similar times. In other words, balancing the load helps eliminate hotspots that cause latency outliers.

Uneven traffic distribution increases variability in response times: Some users get fast responses, others get slow responses from a busy server. Even distribution, by contrast, reduces this variability. In large-scale systems, those rare slow responses, or tail latency, harm user experience and system throughput disproportionately. Load balancing is one of the primary strategies to mitigate tail latency because no server is consistently slower than the others due to overload.

Additionally, load balancers improve overall latency through features such as sending users to the nearest server, as in global load balancing. A user’s request that goes to a data center across the globe takes longer, due to network propagation delay, than if it goes to a nearby data center. Global load balancing takes advantage of geography to reduce network latency, such as by directing European users to a European server rather than making them reach a U.S. server. The result is faster content delivery and a snappier experience for users worldwide.

Load balancers introduce little overhead of their own. In many cases, the added latency from a load balancer hop is on the order of microseconds to a few milliseconds, less than the variability caused by an overburdened server or a congested network path. By terminating connections and then using efficient protocols to the backends, they improve the efficiency clients see, especially for short-lived connections.

Load balancing contributes to a stable and efficient performance profile: It keeps response times more uniform and prevents localized slowdowns. This consistency is important in enterprise systems where service level objectives often specify that, say, 99th percentile latency must stay below a certain threshold. Meeting those tail latency goals depends on not overloading any one part of the system. 


Aerospike and load balancing

Aerospike is a real-time data platform with high performance and low latency at scale that exemplifies many of these principles. An Aerospike database cluster partitions and distributes data across nodes for an internal form of load balancing, so no one node handles all the traffic. 

This intelligent distribution is part of how Aerospike delivers predictable, sub-millisecond response times even as workloads grow. In the same way a network load balancer spreads client requests among servers, Aerospike’s cluster design spreads data and queries efficiently among its database nodes to prevent hotspots and bottlenecks.

For enterprises building high-performance applications, technologies such as Aerospike complement load balancers at the application tier. A robust load balancing strategy directs incoming requests to services, and Aerospike then fetches or updates the data quickly. Both layers work together for speed and reliability. By using Aerospike alongside load balancing, organizations confidently scale out their systems, knowing the data layer and the traffic layer will each remain balanced and resilient under pressure.

Aerospike’s focus on consistent low-latency operations, even under heavy load, aligns with the goals of load balancing: maximize throughput and avoid slowdowns.

Try Aerospike Cloud

Break through barriers with the lightning-fast, scalable, yet affordable Aerospike distributed NoSQL database. With this fully managed DBaaS, you can go from start to scale in minutes.