As systems become both more complex and more resilient, it is increasingly important to provide distributed configuration and coordination services to a plethora of agents (coded in various languages and paradigms) across a business's framework. Consequently, providing appropriate tooling in various languages has become central to operational success in many environments. Quick recovery to a known state of operations and continued progress of computational workloads pose a challenge to some frameworks. That challenge is manageable as long as success does not depend on two processes sharing literal address space to communicate configuration or environment data (and if that is an issue for you in this day and age, you likely have bigger problems to tackle first). Distributed configuration management, like many things, can be extremely complicated depending on your operational environment. Some of these complexities may be out of your control, but the basics can still be straightforward.
Deciding on Storage
One of the first things you will need to tackle is deciding which option for the actual storage of configurations suits you best. The simplest option is to have a shared filesystem on which all of your applications rely as the source of their configuration information. While this is the simplest in theory, in practice it tends to carry a large cost. Running a shared filesystem with any level of enterprise-wide presence and resilience can be fairly costly from an operational perspective (especially if you need to provide real-time access across physical zones or datacenters). The upside of this approach is that if you already have a group that manages this from an enterprise perspective, the lion’s share of that operational expense is likely already built into your business budget for storage.
For larger organizations this is a decent approach, although even when operational costs are factored out, it is not without its downsides. By relying on another group in the organization to provide this service, you are likely adding another layer of complexity to every action you need to take. At the end of the day, every change will likely be another ticket in somebody else’s queue, and your services may suffer from a lack of agility as a result.
Another thing to consider when including another business group as a link in your application delivery chain is what the consequences for recoverability might be. If there is a global event that is impacting storage, where are you with respect to recovery? Unless there is an explicit (and tested) recovery plan in place where your data has a known priority and SLA for recovery, you may find yourself in a long queue when unexpected events impact storage.
For these reasons, if you decide to utilize a shared filesystem and your security situation permits, you should look into utilization of a cloud-based provider with known SLAs and recovery capabilities.
A quick warning: you may be tempted to roll your own filesystem solution using a distributed filesystem framework (Ceph, Lustre, etc.). If these are already present and available to you in your environment, go right ahead! If they are not, don’t. These frameworks are fantastic, but from an administrative and resource perspective they tend to be far too heavy for just this single purpose.
Utilizing a database has many of the same pitfalls as noted above for filesystems. In order to provide resilience you will likely need several heavy instances scattered across your environment. If you keep the management of those instances confined to your local group, you will likely be on the hook for security updates, backups, etc. Outsourcing the management to an enterprise-wide group saddles you with the exact same risk profile associated with the filesystem-based approach.
Another thing to be aware of is that operational failures of databases can lead to far more intricate recovery issues than those of plain filesystems. This is especially true if your database of choice has aged a few major versions (which is not uncommon) since it was installed. In these cases, successful recovery from an archived image becomes increasingly unlikely and may require rebuilding the server completely before a recovery can occur. This also assumes that you have the appropriate version of every impacted binary available to you when this situation occurs (which incurs administrative overhead that you would rather not have, to say the least).
It is worth noting that generally configuration information tends to be fairly flat (usually contained in a file or single database entry per agent). Utilizing a fully managed and resilient database just to house this class of information is very much like building a parking garage to store a golf cart.
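To make the "flat" shape of this data concrete, here is a minimal sketch that stores one configuration entry per agent in a single table. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in; the table name, column names, and helper functions are hypothetical, and a real deployment would look different.

```python
import json
import sqlite3

# In-memory database used only for illustration (assumption: any SQL store
# with a single key/value table would serve the same purpose).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agent_config (agent TEXT PRIMARY KEY, config TEXT NOT NULL)"
)


def save_config(conn, agent, config):
    """Store the whole configuration for one agent as a single row."""
    conn.execute(
        "INSERT OR REPLACE INTO agent_config (agent, config) VALUES (?, ?)",
        (agent, json.dumps(config)),
    )
    conn.commit()


def load_config(conn, agent):
    """Return the configuration for one agent, or raise KeyError if absent."""
    row = conn.execute(
        "SELECT config FROM agent_config WHERE agent = ?", (agent,)
    ).fetchone()
    if row is None:
        raise KeyError(agent)
    return json.loads(row[0])


save_config(conn, "billing-worker", {"log_level": "info", "port": 8080})
print(load_config(conn, "billing-worker"))
```

The entire schema is one key and one value per agent, which is why a fully managed, resilient database is usually more machinery than the job calls for.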
As with filesystems, if this is something that can be provided via a cloud service with known/dependable practices and SLAs, it is probably a palatable option.
A third option is a distributed coordination service. There are two players in this space that I will highlight: ZooKeeper and etcd. Both are adequate for the general tasks. I will first cover the characteristics they share that make them a solid choice for storing configuration information, and then go into specifics about when one should be chosen over the other.
- Lightweight: both services are extremely lightweight. They do not consume inordinate amounts of host resources or network bandwidth. They do not require high-octane hardware to be effective nor do they require a great deal of deployed dependencies to operate.
- Deployment: both services are easy to deploy. They can both easily coexist on infrastructure that is servicing other needs and both are amenable to containerization. Deployment can easily be scripted with multiple tools that your organization is very familiar with (likely tools that are already part of your dev-ops toolkit).
- Availability: hosts in the wild will always be going up and down. Assuming that the services are deployed across five diverse hosts, two hosts can be down at any given time and the service will still be viable and ready to respond to requests.
- Migration: services can be moved to new hosts without interruption, provided that quorum is maintained throughout the operation (meaning that more than half of the total hosts remain up at all times). It is worth noting here that ZooKeeper is slightly more cumbersome in this respect, but it makes up for it in other ways.
- Presentation: to an end-user of these services the presentation is very similar to a directory structure which maps a tree of nodes to stored values. This enables many analogs to using a filesystem and is very intuitive.
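- Versioning: both automatically increment a version number on a node each time its value is updated. etcd (in its v3 API) also retains earlier revisions until they are compacted, which allows you to read back and restore an earlier configuration if a recent update has suddenly begun to cause issues. ZooKeeper records only the counter, so rolling back means re-writing a previous value that you have kept elsewhere.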
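The quorum arithmetic behind the Availability and Migration points above can be sketched in a few lines. This is a minimal illustration of the majority rule only, not code from either service, and the function names are hypothetical.

```python
def quorum_size(total_hosts: int) -> int:
    """Smallest number of hosts that is more than half of the ensemble."""
    if total_hosts < 1:
        raise ValueError("total_hosts must be at least 1")
    return total_hosts // 2 + 1


def tolerated_failures(total_hosts: int) -> int:
    """How many hosts can be down while a majority is still up."""
    return total_hosts - quorum_size(total_hosts)


def has_quorum(total_hosts: int, hosts_up: int) -> bool:
    """True if enough hosts are up for the service to keep answering."""
    return hosts_up >= quorum_size(total_hosts)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

For five hosts the quorum is three, so two can be down, matching the example above. Note that four hosts tolerate only one failure, the same as three, which is why odd-sized deployments are the usual choice.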
So why choose one over the other? There are a few things to consider which might sway you towards a specific service.
- API Integration: if you intend to directly integrate a service into your applications, you should likely choose ZooKeeper. It is a more mature platform and therefore has a greater range of libraries available in more languages than etcd. That said, you should not be directly integrating the services if you can avoid it (more on that in the next section).
- Dependencies: if you cannot install Java on the target hosts under any circumstance, you need to pick etcd.
- Dynamic Locking: if your system needs the ability to manage dynamic updates/locks, you should use ZooKeeper. It is possible to do this in etcd, but it is not as intuitive.
- Volatility: while both services can contend with hosts becoming unavailable, etcd is slightly easier to re-home than ZooKeeper.
One of the largest downsides of these distributed services is user access management. Both packages have some built-in access control, but truly granular management can require a highly specialized configuration and assumes a level of administrative control and know-how that may not be readily available in your environment.
Defining a DAL (Data Access Layer)
Regardless of the choice of storage layer, the group that is responsible for managing the distributed configuration service should provide a set of well-defined and maintained accessor functions tailored to their application groups. As noted above, you generally should not be directly integrating third-party libraries into your application landscape for this specific purpose. In reality, your application groups should not know about or depend on the choice of storage layer. The configuration management group should provide a generic interface for accessing configuration information without revealing the intricacies of the mechanism used to store it.
While maintaining in-house interfaces can be cumbersome at times, this interface can be fairly simple in most cases. Configuration information is not something that is generally requested multiple times in a short timeframe, nor is configuration information generally large or cumbersome to send/receive. Another benefit of this approach is that you can switch out your storage layer at a later time with minimal disruption to your deployed application footprint.
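As a rough illustration of that idea, the sketch below defines a small storage-agnostic interface with two interchangeable backends. Everything here is hypothetical (the ConfigStore name, the get/put methods, the one-JSON-file-per-agent layout), and it is not the interface from the article referenced below; it only shows how application code can stay unaware of the storage choice.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict


class ConfigStore(ABC):
    """Hypothetical interface that application groups code against."""

    @abstractmethod
    def get(self, agent: str) -> Dict[str, Any]:
        """Return the configuration for an agent, or raise KeyError."""

    @abstractmethod
    def put(self, agent: str, config: Dict[str, Any]) -> None:
        """Replace the configuration for an agent."""


class InMemoryStore(ConfigStore):
    """Backend useful for tests; keeps serialized JSON in a dict."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, agent: str) -> Dict[str, Any]:
        return json.loads(self._data[agent])  # KeyError if unknown

    def put(self, agent: str, config: Dict[str, Any]) -> None:
        self._data[agent] = json.dumps(config)


class FileStore(ConfigStore):
    """Backend that keeps one JSON file per agent under a root directory."""

    def __init__(self, root: str) -> None:
        self._root = Path(root)

    def _path(self, agent: str) -> Path:
        return self._root / f"{agent}.json"

    def get(self, agent: str) -> Dict[str, Any]:
        try:
            return json.loads(self._path(agent).read_text(encoding="utf-8"))
        except FileNotFoundError:
            raise KeyError(agent) from None

    def put(self, agent: str, config: Dict[str, Any]) -> None:
        self._root.mkdir(parents=True, exist_ok=True)
        self._path(agent).write_text(json.dumps(config), encoding="utf-8")


def load_settings(store: ConfigStore, agent: str) -> Dict[str, Any]:
    """Application code depends only on the ConfigStore interface."""
    return store.get(agent)
```

Because load_settings only sees ConfigStore, a backend for ZooKeeper or etcd could be added later as another subclass without touching application code. A production version would also want to validate agent names before using them in file paths.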
For a deeper dive into creating a simple DAL for accessing distributed configurations please refer to this article: DAL Interface for Distributed Configuration Management in Python.
Standardize the Data Storage Model
You need not be overly draconian in specifying a rigid data model or standard for what your configuration data should look like while in storage. However, you will make your life easier if you have at least some high-level idea of what that data should nominally look like. Making modest assumptions and setting broad guidelines will make the creation and maintenance of an appropriate DAL structure far easier in the long run.
In this respect, mandating that all hostname entries start with “__hostname” is likely a step too far. But mandating, for instance, that all configuration data be stored as JSON or XML is probably the right level of control and gives appropriate flexibility to your application development groups to work with the configuration storage as they see fit without feeling too imposed on. Making this a high-level decree will make it far easier for the configuration management group to maintain systems, interfaces, and related tools.
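A format-level rule like this is easy to enforce at the DAL boundary. The following is a minimal sketch assuming JSON is the mandated format; the parse_config name and the extra requirement that the top level be a JSON object are assumptions made for illustration, not rules from this article.

```python
import json
from typing import Any, Dict


def parse_config(raw: str) -> Dict[str, Any]:
    """Accept only text that parses as a JSON object; reject anything else.

    The keys and values inside the object are deliberately not inspected,
    which leaves application groups free to structure their own data.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"configuration is not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        # Assumption: a top-level object is required so every entry can be
        # treated as a set of named settings.
        raise ValueError("configuration must be a JSON object at the top level")
    return parsed


print(parse_config('{"hostname": "app01", "retries": 3}'))
```

The check constrains the container, not the contents, which is the level of control argued for above.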