[Originally published 09/24/2018]
Distributed systems have evolved from simply marshaling the resources of multiple devices (usually lower-end or commodity hardware) to actively controlling entire ecosystems of related processes and assets across multiple datacenters and device types. In its earliest form, a distributed system was basically a cluster. While the term cluster has many different connotations depending on the field in which it is used, the earliest of these beasts generally contained locally accessible hardware, and that hardware was generally considered to be symmetrical. That is, the managed device pool tended to contain devices that were roughly similar in terms of:
- Processor Architecture
- Disk space and I/O
- Network bandwidth
The Single Director Model
From these early clusters, we see the first design pattern for distributed management: a single director that gives marching orders to agents on other (as mentioned above, usually identical) devices to progress a given distributed workload (see also: A Simple Example of a Distributed Workload). In the picture below this primordial pattern is illustrated as our starting point.
This is also known as a “Master-Slave” configuration, but I tend to shun those terms as uninformative. In modern distributed systems, a “Slave” device could become the “Master” under the right conditions, or vice-versa. We will instead use the terms “Director” and “Agent”. A Director is the process which gives orders to the Agent from our point of view in the current job sequence. An Agent is the process which receives requests and acts on them according to the behavior coded into it for that class of request.
The rationale behind calling one machine a “Master” is drawn from infrastructure domains as well. Traditionally, devices that serve as “Masters” tend to be asymmetrical from the rest of the infrastructure footprint. As a single machine that handles requests for multiple agents, legacy “Master” type systems tended to be very high performance where the “Slaves” tended to be lower tier hardware. The rationale behind this was intuitive. The “Master” (I will stop torturing you with quotation marks and capitalizations now, I think I have made my point) was usually a single point of failure, so redundant disks, network interfaces and power supplies were often included in its hardware profile even though they were not necessary or cost-effective to include in the slaves.
The most apparent issue with these earlier arrangements was fault tolerance. The Single Director Model (or SDM as I will refer to it henceforth for brevity) was also a single point of failure. Although agents in well-designed systems could fail or move in and out of the computational fabric with little or no service disruption to the overall plant, the loss of the director was a catastrophe. The director could fail for a variety of reasons, including:
- Overload (with regards to request handling)
- Network segregations
- Password updates, patches and firewall modifications
- Failures of supporting services (i.e. databases, middleware, NFS, etc.)
The SDM also suffers from capacity issues. It is not always apparent how to scale an SDM up. Although issues of overload can plague the SDM, it is not trivial to simply add another director to the infrastructure, and it is not always possible (or, as noted above, cost-effective) to increase the physical capabilities of the machine which serves the director function.
What this commonly leads to is an overabundance of directors, each running its own fiefdom of agent devices. Consequently, these devices cannot always be utilized by other directors when their capacity demands are low, as they may have very different configurations based on what they are commonly tasked to do. Moving an agent seamlessly from one director to another can be a daunting task in this case. Usually, you will be faced with either creating agents which are unnecessarily bloated (meaning that they contain a superset of every library needed by every known request they could be asked to handle in the environment) or rebuilding the agent to a specific purpose each time it enters or leaves the computational fabric. Either one of these options puts an enormous strain on administrative resources (both human and automated) to track and execute.
To illustrate this more concretely, let us suppose that we work in a global financial institution. We have several request classes that we could consider, but let us just consider a distributed system which processes EOD batches. We will immediately notice that this utilization profile will lead to large gaps in utilization simply because EOD falls at a different time for each geographical region. This implies that there will be times when agents allocated to processing requests from London will be bored while agents processing requests from New York will be overburdened. As moving agents around between directors can be problematic for reasons mentioned above, the response usually tends towards adding more agents (hardware) to the overburdened location. This only exacerbates the utilization issue: there will be more bored agents whenever New York is not at EOD, so more power is wasted over time than is gained by reducing the peak burden.
The Dual Director Model
Now let us return to the director models (agent utilization and fabric transfers are a deeper topic and deserve their own article). A way that some entities have approached resiliency issues is simply to add another director which acts as a shadow to the active director. In the event of a failure of the primary director, the data replication between the primary and the shadow can be severed (it usually is severed by default at this point anyway), the shadow director becomes primary, and the agent processes which were bound to the primary can be re-homed to the shadow so that services may continue (as illustrated below).
While this strategy can be effective, it too suffers from several issues.
- Proper replication strategies can be difficult to define as directors are usually reliant on several different systems (i.e. databases, NFS, etc.) in the back-end to effectively coordinate their agent pool
- Each of these systems may possess their own strategy which may not work as expected during system failures leading to dropped or incomplete requests
- The shadow director adds another layer of complexity but does not give you an equal level of payoff since it primarily sits idle
- Transitioning active agents to a new director can entail a great deal of administrative work (usually during a DR event, those administrators will likely have their hands full with other tasks as well)
- For compliance purposes in most businesses you will also need to regularly test your shadow director to ensure that it can take the load over
- If you have a shadow director located off-site (to increase geo-diversity) it can be problematic as the shadow cannot always know if it has been segregated
- Failing back to the primary presents all the challenges listed above but in the opposite direction
There are hybrids of this model that address some of the concerns noted above. A common one I encountered was to have Director-A responsible for half of the job flow required from the plant while Director-B was responsible for the other half. They would both also replicate their information to the other director so that, if a failure occurred, the opposite side director could take over for the entire flow. There is some elegance to this solution as neither director sits idle. However, while you may have more burst capacity on each director, ultimately you are still leaving a machine idle (or at least two halves) so the performance gain is usually minimal. Furthermore, now you are saddling yourself with two-way replication as opposed to one-way (and the problems associated with that simply double, especially during failback).
Regardless of how elegantly a dual-director solution is crafted and maintained (and discounting completely how one defines what “half” of the job flow means), you will be forced by the nature of dual-sided design to accept some compromises. And still your greatest day to day issue will be that, outside of adding additional capacity (disk, cores, RAM, etc.) to the directors themselves or hosting the directors on more powerful machines altogether, you cannot scale it.
OK, so what’s the answer?
So that is a pithy headline. The truth is that there is no answer. Single and dual headed director systems can be very effective if managed correctly, but their full management cost is not generally considered until things go wrong. Also, the management of those systems is not generally as cut and dried as the sellers of those systems would lead you to believe.
Many of the limitations that are placed on these systems are administrative and managerial. Some are purely subjective. The proper management of a distributed system crosses many of the traditional silos in an IT organization. Coordinating replication and system backups, patching, business continuity functions, networking and compliance concerns all factor in (or at least should be considered). Another factor to be considered is the actual services being provided by the computational fabric and what their relative criticality actually is to the business environment. You will need to navigate a complex terrain of processes (and egos most likely) to form a cohesive picture of what is needed from the system.
My focus has primarily been on resilience and scalability for this article (I will tackle the other dimensions in future posts). To answer the question, I will sketch a distributed platform which solves for these variables. The platform should have the following elements:
- Multiple directors – the system has three or more directors generally available to an agent operating in the computational fabric
- Geographic diversity – the directors should not need to be collocated physically to achieve their goals
- Fault tolerance – the failure of a director should not be able to bring down the whole fabric (although physical devices hosting agents within the same location as that director may become unreachable)
- Scalability – the addition of a director should be possible to increase bandwidth (for reasons I will outline below, you will always add two directors at a time)
Quorum Based Management
There is an old brain teaser I remember about a man in WW2 London who is forced to run from his home during a bombing raid. He is a bit particular (far more than I would be in that situation) about his socks matching and only wears either white or black. Since he is operating in total darkness, he cannot know what color sock he is grabbing from his drawer. What is the minimum number of socks he would need to grab to ensure that he has at least two pairs that match (you can look up the “Pigeonhole Principle” if you need help)?
The answer is five. It is possible that he could get lucky and grab four white or four black (or two of each), but to truly guarantee a result he needs five. A simpler example would be to drop his requirement to a single matching pair in which case he would need three for a guarantee (although he could get lucky with two). This is illustrated in the sequences below:
- White -> White -> Irrelevant = Match
- Black -> Black -> Irrelevant = Match
- Black -> White -> [Black/White] = Match
- White -> Black -> [Black/White] = Match
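The sock puzzle is small enough to check exhaustively. The sketch below (an illustrative aside; the function names are my own) enumerates every possible draw of n socks from two colors and finds the smallest n that guarantees the required number of matching pairs:

```python
from itertools import product

def matching_pairs(draw):
    """Count the disjoint same-color pairs in a draw of socks."""
    return sum(draw.count(color) // 2 for color in set(draw))

def min_socks_for(pairs_needed):
    """Smallest draw size that guarantees pairs_needed matching pairs,
    checked against every possible White/Black draw of that size."""
    n = 1
    while True:
        if all(matching_pairs(d) >= pairs_needed
               for d in product("WB", repeat=n)):
            return n
        n += 1

print(min_socks_for(1))  # 3 socks guarantee one matching pair
print(min_socks_for(2))  # 5 socks guarantee two matching pairs
```

Note the worst case for two pairs is a 3/1 split among four socks; the fifth sock must complete a pair on one side or the other.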
Similarly if you had two friends and two possible choices (let us say either going to a movie or playing cards) you could make your own choice – playing cards, ask one of your friends their choice, and if you agree it does not really matter what the third friend thinks (although that is not the greatest way to keep friends). If you disagree because the friend you asked wants to go to a movie, then your third friend becomes the decider since what they want to do mathematically is bound to agree with one of you (I sure went a long way to make the title funny, hope it makes sense now).
These are esoteric examples perhaps, but they illustrate a simple principle: in an odd-numbered voting system there cannot be a deadlock. If a majority can agree, progress can be made. There are tools for distributed systems that capitalize on this mathematical principle to ensure that even if a component is lost, your system can still make progress.
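The no-deadlock claim can itself be verified by brute force. This minimal sketch (the `majority` helper is hypothetical, not from any particular library) checks every possible vote among an odd number of voters choosing between two options:

```python
from itertools import product

def majority(votes):
    """Return the choice holding a strict majority, or None on deadlock."""
    for choice in set(votes):
        if votes.count(choice) * 2 > len(votes):
            return choice
    return None

# With an odd number of voters and two choices, deadlock is impossible:
for n in (3, 5, 7):
    assert all(majority(v) is not None for v in product("AB", repeat=n))

# With an even number of voters, it is not:
print(majority("AABB"))  # None
```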
ZooKeeper is one such system and has been used as a synchronization backbone for many resilient systems. Each agent in a ZooKeeper menagerie (an appropriate term in this case) has a list of the addresses of all the nodes in the management system (in ZooKeeper parlance, this is called the ensemble). At any given time, there is a designated “leader”, but this leader can be changed if the current leader becomes unavailable. This is done through a built-in voting process in the ensemble and is transparent to the attached agents.
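To make the transparency point concrete, here is a toy simulation. The `Ensemble` class and the lowest-id election rule are my own illustrative inventions; ZooKeeper's real election protocol is considerably more involved. The point is only that attached agents never track who leads; they ask the ensemble:

```python
class Ensemble:
    """Toy model: the live node with the lowest id is the leader."""
    def __init__(self, node_ids):
        self.live = set(node_ids)

    def leader(self):
        return min(self.live)

    def fail(self, node_id):
        self.live.discard(node_id)

ens = Ensemble([1, 2, 3])
print(ens.leader())   # 1
ens.fail(1)           # the current leader becomes unavailable...
print(ens.leader())   # 2 ...and a new one is chosen transparently
```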
At any given time, there are three (or more) managers, but only two need to agree to progress the job stream (it follows that three need to agree if you have five, and so on). This insulates the fabric from one of the major problems experienced by distributed systems: split-brain. Split-brain occurs when two managers believe they are in control of the fabric at the same time (usually due to network segregation). For further academic details on split-brain and related topics, please refer to the “Byzantine Generals Problem”.
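Why does majority quorum rule out split-brain? Because a network split divides the ensemble into disjoint groups, and two disjoint groups cannot both contain a strict majority. A quick sketch (the `quorum` helper is illustrative) verifies this for every possible split:

```python
def quorum(ensemble_size):
    """Minimum number of nodes that must agree: a strict majority."""
    return ensemble_size // 2 + 1

# Enumerate every way a split can divide the ensemble into two sides;
# at most one side can ever reach quorum, so split-brain cannot occur.
for n in (3, 5, 7):
    q = quorum(n)
    for side_a in range(n + 1):
        side_b = n - side_a
        assert not (side_a >= q and side_b >= q)
```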
Another bonus to utilizing quorum management is that it is scalable. If you need more request processing you can add another pair of managers (I’ll leave it as an exercise for the reader to prove why it needs to be two, but that should not be too challenging if you have followed along this far). Five is a good number for high performance/heavily loaded environments. Seven is the highest I have seen in practice. Any higher and the synchronization costs can tend to diminish the returns significantly. Please see the end of this article for a full diagram of what an enterprise deployment might look like.
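The reader's exercise about adding managers in pairs falls out of the majority arithmetic. With an ensemble of n nodes, a majority survives as long as no more than (n - 1) // 2 nodes fail, and that figure only increases when n goes from odd to the next odd number. A short illustration (function name is my own):

```python
def tolerated_failures(ensemble_size):
    """Nodes that can fail while a majority quorum still survives."""
    return (ensemble_size - 1) // 2

for n in range(3, 10):
    print(n, tolerated_failures(n))
# 3 and 4 both tolerate 1 failure; 5 and 6 both tolerate 2; and so on.
# Adding a single node to an odd ensemble adds cost but no resilience,
# which is why directors are added two at a time.
```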
Although quorum-based management is not a silver bullet, it does tend to solve certain problems rather nicely. It also tends to cost less for the actual machines. Furthermore, as long as a system can reach quorum, it doesn’t matter if you reboot a node and rolling updates become less problematic (although they must still be handled carefully). The important factor is that failure or maintenance on a single node need not stall the entire fabric. Also, with proper design, the re-homing of agents to their new management shards can be done without manual intervention from support staff.
Now you will also need a scalable data storage solution for administrative purposes of your quorum management system (an article on these systems is forthcoming), but it is fairly simple to implement in tandem.
Or you could take a look at one of our enterprise offerings: Odin.
Thanks for your time. I hope you found this informative. Please feel free to fight with me in the comments below.
- Split-brain overview – https://en.m.wikipedia.org/wiki/Split-brain_(computing)
- Byzantine Problem Overview – https://en.wikipedia.org/wiki/Byzantine_fault_tolerance
- Pigeonhole Principle – https://en.wikipedia.org/wiki/Pigeonhole_principle
- Apache Zookeeper – https://zookeeper.apache.org/
The Bigger Picture