Using Stack and App to configure and manage infrastructure

The aim is to replace the misused, misunderstood and fundamentally flawed concepts of role and mainclass with two new 'facts' called stack and app. This page lays out the motivation for doing this and provides some basic rules.

Motivation

The aim of this change is to improve how we organize information about hosts and services, so that we can configure them, discover them, deploy to them, and group and collect metrics for them.

We want to automatically group stacks together for things like ganglia grids, alerting and deploying. We want to make it easy to share information across a whole stack/stage tuple (think r2 and the db connectors, but there are plenty more examples). We also want to make it hard to cross stack boundaries, enforcing swimlanes by convention.

There are some key pieces of information that are useful:

datacentre (eg. dc1, dc2, eu-west-1, gul3)
stack (eg. r2, discussion, identity)
app (eg. frontend, microapp, api)
stage (eg. CODE, PROD)

The current wisdom is that this is (or should be) all the information we need to end up with a machine that:

provides a useful service
shows up in deployment systems
is displayed in dashboards

Knowing the datacentre allows us to correctly set up all the 'platform' things that should be invisible to devs, like DNS servers, NTP servers, etc, and may also be useful for things like which TIP to talk to for services that are multiply colocated.

The stack should be a single string - eg, the respubs are only part of the r2 stack. Having a variable that cuts across roles allows us to, firstly, share information between roles in a way that's mildly tedious now, and also allows us to group roles into useful collections (like ganglia grids).

The app variable roughly maps to our current role variable, but is allowed to be an array. This allows a machine to show up as, eg, both an api server and a solr server if the two have been combined on some machines in some stages. Where necessary, this can easily be a string that is interpreted as a comma-separated list.

The stage variable is the one everyone knows and loves - we don't need to change this.

There are of course other pieces of information coming out of facter that are useful for all sorts of things - this is not meant to say "stop using $::osfamily!"

Some examples of this:

stack	app	inferred mainclass equivalent
identity	api	identity::api
identity	admin	identity::admin
identity	db	identity::db
flexible	microapp	flexible::microapp
flexible	db	flexible::db
r2	frontend	r2::frontend
r2	admin	r2::admin
soulmates	api,solr	soulmates::api,soulmates::solr
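
The mapping in the table can be sketched in a few lines of Python. This is an illustration only: the function name mainclass_equivalents is hypothetical, and it assumes that app arrives as a comma-separated string, as described above.

```python
def mainclass_equivalents(stack, app):
    """Return the puppet class names implied by a stack and an app value.

    `app` may be a single name or a comma-separated list (an assumption
    taken from the description of the app variable above).
    """
    # Split the app value and ignore empty entries or stray whitespace.
    apps = [name.strip() for name in app.split(",") if name.strip()]
    # Each app becomes a class inside the module named after the stack.
    return ["%s::%s" % (stack, name) for name in apps]


print(mainclass_equivalents("identity", "api"))        # ['identity::api']
print(mainclass_equivalents("soulmates", "api,solr"))  # ['soulmates::api', 'soulmates::solr']
```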

The benefits of doing this work are:

it makes it easier to group related things in puppet so that logically close things are close together; eg, all the pieces of soulmates can live in a soulmates module - this helps for clarity if nothing else and reduces the proliferation of top level wrapper modules in puppet/modules
it makes it easier to share related information (eg, db connector strings for an entire stack can go in $stack::params, etc)
it makes it easier to make automated things that consume this information, like monitoring grids or deployment schemes
it makes it easier to treat an entire stack as a 'foreign' module that we manage with something like librarian-puppet, so that we can host, eg, the flexible module on github and give the flexible team commit access to it.
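
As a rough illustration of the kind of automated consumer meant above, the sketch below groups hosts by their stack/stage tuple, which is the sort of grouping a monitoring grid might use. The host records, their field names and the function name are all hypothetical; a real tool would read this information from facts or instance tags.

```python
from collections import defaultdict

# Hypothetical host records, invented for this example.
HOSTS = [
    {"name": "host-a", "stack": "r2", "app": "frontend", "stage": "PROD"},
    {"name": "host-b", "stack": "r2", "app": "admin", "stage": "PROD"},
    {"name": "host-c", "stack": "r2", "app": "frontend", "stage": "CODE"},
    {"name": "host-d", "stack": "identity", "app": "api", "stage": "PROD"},
]


def group_by_stack_and_stage(hosts):
    """Map each (stack, stage) tuple to the names of the hosts in it."""
    groups = defaultdict(list)
    for host in hosts:
        groups[(host["stack"], host["stage"])].append(host["name"])
    return dict(groups)


for (stack, stage), names in sorted(group_by_stack_and_stage(HOSTS).items()):
    print("%s/%s: %s" % (stack, stage, ", ".join(names)))
```

Because a host belongs to exactly one stack, every host lands in exactly one group.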

Rules and guidelines

Because many different systems will model their configuration around these variables, some rules are needed.

Firstly, it is important to note that stack is a string; a single server cannot be in more than one stack (eg, a database cannot belong to both identity and content) - swim-laning is enforced in our manifest, which is considered a good thing.

Secondly (a rule): stack and app values must match [a-z][a-zA-Z0-9]{1,11}, i.e. a lowercase letter followed by one to eleven characters that can be lower or upper case letters or digits. Hyphens, colons and other symbols are not allowed (underscores are not allowed either, although we may allow them later). This is due to restrictions in the systems that consume this data, such as puppet and ganglia.

Thirdly, it is recommended that you avoid implementation details in naming the apps - i.e. app should be db rather than mongo or mysql.
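
The naming rule above can be checked mechanically. The following is a minimal sketch, not an existing tool: the function names are hypothetical, and it assumes that each name in a comma-separated app value must satisfy the pattern on its own.

```python
import re

# The pattern quoted in the rule above.
NAME_PATTERN = re.compile(r"[a-z][a-zA-Z0-9]{1,11}")


def is_valid_name(value):
    """True if the whole of value matches the naming rule."""
    return NAME_PATTERN.fullmatch(value) is not None


def validate_stack_and_app(stack, app):
    """Return a list of problems; an empty list means both values are fine."""
    problems = []
    # stack is a single string, so a comma here fails the pattern as well.
    if not is_valid_name(stack):
        problems.append("invalid stack: %r" % stack)
    # Assumption: app may be a comma-separated list, checked name by name.
    for name in app.split(","):
        if not is_valid_name(name):
            problems.append("invalid app: %r" % name)
    return problems


print(validate_stack_and_app("soulmates", "api,solr"))
print(validate_stack_and_app("identity,content", "my-db"))
```

A check like this could run when instances are created or when deployment scripts are parsed, so that naming problems surface early.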

Enforcing rules and guidelines

These rules and guidelines should be enforced and highlighted as early as possible. We often find that people have to redo work because they did not follow rules that they were not really aware of or that were not enforced. These rules should be easy to enforce at the time that instances are created, when deployment scripts are parsed, or through warnings in the discovery tools that highlight configuration issues.

Implementation and transition

By creating new variables instead of trying to fix the meaning of the existing ones, we skip some of the complexity of finding and fixing all of our uses of the role fact. It also means we can do a patented Guardian transition that leaves old stuff around forever - hurrah! No, really, the old stuff should be straightforward enough to kill.

As we transition there are a bunch of things that currently consume mainclass or role that we need to move over to use stack and app.

GC2 launch.py script

This script has been modified to ensure that the new concepts of stack and app are picked up from the YAML launch configurations and added as facts in launched instances. It currently creates pseudo mainclass and role facts for backwards compatibility. Here is the relevant section:

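# Legacy launch configurations carry mainclass and role directly.
if 'mainclass' in struct:
    facts['mainclass'] = struct['mainclass']
    facts['role']      = struct['role']
# New-style configurations carry stack and app; derive the legacy facts.
elif 'stack' in struct:
    facts['stack']     = struct['stack']
    facts['app']       = struct['app']
    # app may be a comma separated list, eg 'api,solr'
    apps = struct['app'].split(',')
    # one pseudo mainclass per app, eg ['soulmates::api', 'soulmates::solr']
    mainclass = [("%s::%s" % (struct['stack'], app)) for app in apps]
    facts['mainclass'] = mainclass
    # sort so the pseudo role does not depend on the order the apps were listed
    apps.sort()
    # pseudo role, eg 'soulmates-apisolr' (sorted app names are concatenated)
    facts['role']      = "%s-%s" % (struct['stack'], "".join(apps))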

Puppet entry point for GC2 nodes

The entry point in puppet, which determines from the mainclass (or stack and app) facts which puppet classes should be applied to a particular instance, has been modified accordingly. The original code:

if $::mainclass {
  $classes = split($::mainclass, ',')
  class { $classes: }
}

has been replaced with:

# $stack == soulmates
# $app   == 'api,solr'
if $::stack and $::app {
  $legacy_mode = false
} elsif $::role {
  $legacy_mode = true
} else {
  fail('no role mode determined as no useful combination of facts exists')
}

if ! $legacy_mode {
  $apps = split($::app, ',')
  $classes = prefix($apps, "${::stack}::")
  class { $classes: }
} elsif $::mainclass {
  $classes = split($::mainclass, ',')
  class { $classes: }
}

Discovery (cloud-find-hosts, deployinfo and Prism)

The cloud-find-hosts tool has been improved to pick up all tags, with special handling for the stack and apps variables. The name apps is used instead of app because app was previously used to hold the mainclass value.

The deployinfo mechanism provided by Riff-Raff is being extracted into a new standalone tool called Prism.

Prism is being built with the new concepts in mind from the start with various discovery endpoints that look up the stacks and apps.

Transition to-do list

There are more consumers that need to be transitioned and things that need to be tidied up:

EC2 - the above needs to be replicated for the EC2 launch scripts
Riff-Raff - the deployment system needs to look up instances using stack and app instead of mainclass (which it currently calls app)
