Smashing the Stack

Hello everyone.

I hope you had a good holiday season.

I took time off over the holiday to redeploy my infrastructure, making it more automated, more reliable, and better organised.

This multi-part blog post will cover the entire journey, from planning to execution and, finally, the firefighting.

TLDR: I redeployed my infrastructure and documented it. Link here

Before we begin, I want to cover the reasons why I did all this.

My homelab in 2019

My infrastructure in 2019 was a single desktop computer. Over the year it grew from that to one server, then two, then five.

My homelab now

While I had done my best to manage them as separate entities, my infrastructure had become large enough that I needed to start treating it the way an enterprise would. Additionally, the gradual, unplanned way it had grown had given rise to a number of issues.

Drift

As my infrastructure slowly grew from one server to two, then to five, the machines began to drift heavily. To me, drift means the slight differences in configuration between machines that make them incompatible with one another when you attempt to automate tasks.

One such example of drift is filesystem layout relating to Docker.

On machine A, the mount point for Docker volumes might be /srv/docker; on machine B, it would be /opt/docker. This wasn't an issue when managing two or three machines, but it became one when I wanted to automate the deployment of Docker images onto these machines using Ansible. While I could have hardcoded the values to point to the correct location on each host, it would have added a lot of unnecessary complexity to my Ansible playbooks.
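As a rough sketch of what that workaround looks like (the hostnames, the docker_data_root variable and the container are hypothetical; the paths are the ones from the example above), every host needs its own variable, and every task has to thread that variable through:

    # host_vars/machine-a.yml  (hypothetical host)
    docker_data_root: /srv/docker

    # host_vars/machine-b.yml  (hypothetical host)
    docker_data_root: /opt/docker

    # Every deployment task then has to carry the per-host indirection,
    # here using the docker_container module from the community.docker collection:
    - name: Deploy a container onto whichever path this host happens to use
      community.docker.docker_container:
        name: example-app            # hypothetical container
        image: nginx:stable
        volumes:
          - "{{ docker_data_root }}/example-app:/usr/share/nginx/html:ro"

It works, but every role and playbook now has to know about the variable, which is exactly the kind of unnecessary complexity I wanted to avoid.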

This kind of drift was present all over my machines, from sysctl variables to daemon configurations.

Drift happens because of a lack of proper change control and planning, and in my infrastructure's infancy, I didn't feel the need to enforce either.

Bloat & Complexity

Bloat became a problem for my hypervisors because I wasn't treating them as hypervisors; I was treating them as general-purpose, do-everything servers.

One served as a DNS server, a web server and a build server. Each of these roles should have been segregated into its own virtual machine, so that in the event I needed to shut the hypervisor down, the virtual machines could be moved onto a different hypervisor.

At the time, it seemed easier to just install the necessary packages and be on my way, instead of setting up a virtual machine first and then the packages. What this approach missed is that I would now have to maintain an additional component on top of the hypervisor, rather than a component next to it. A system is a lot less complex when it doesn't have 20 other services running alongside it.

This bloat became a real problem when a cascading failure in one service sent a hypervisor into emergency mode, leaving the entire system unusable until I fixed the issue. If I had put the service into a virtual machine, I wouldn't have been in that mess.

Bloat also increased the total number of packages installed, so updates became more of an issue as the hypervisors grew in complexity.

Bloat happens when systems become too monolithic and services aren't segregated properly.

(a lack of) Abstraction

This is somewhat related to my previous two points, but it's different enough that I feel the need to call it out. In my infrastructure, I had failed to abstract my services properly. This meant that, for example, my NFS server was hydrogen.srv.oxide.one, not nfs1.nix.oxide.one.

In an ideal world, I shouldn't have to care that NFS is served by hydrogen; I can point my services to nfs1, and I'll be fine.
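A minimal sketch of what that abstraction buys you, assuming nfs1.nix.oxide.one is simply a DNS alias (e.g. a CNAME) for hydrogen, and using a made-up export path and mount point:

    # Services mount by the service alias, never by the machine name
    # (ansible.posix.mount shown here; src and path are illustrative).
    - name: Mount shared storage via the nfs1 service alias
      ansible.posix.mount:
        src: "nfs1.nix.oxide.one:/export/shared"
        path: /mnt/shared
        fstype: nfs
        state: mounted

If NFS ever moves off hydrogen, only the alias needs to change; nothing that consumes the service has to be touched.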

While this isn't a major issue on its own, it does become annoying when trying to keep track of my infrastructure and keep things running smoothly.

Lack of repeatability

One of the problems I realised I had was that my hypervisors were not repeatable. I had failed to document how the machines were set up, along with the fixes and configuration changes I had made to turn each machine into what it was. Some of those changes were useful and needed; others were not.

In practice, this meant that if I suddenly lost the OS on a hypervisor, I would have no real way to rebuild it. I had originally built it by googling my way forward, and that was not sustainable.

The goal

With these issues in mind, I concluded that it would generally be easier to redeploy my infrastructure than to retrofit fixes. I wanted this redeployment to be the last one I would ever need to do. To make sure of that, the process had to be documented and automated wherever possible.

I gave myself some goals to be able to mark this infrastructure deployment as ‘successful’. They are as follows:

  • Be repeatable.
  • Do not treat any machines differently.
  • Keep the hypervisors as simple as possible.

First, I wanted my infrastructure to be repeatable. If I wipe a hypervisor, do a fresh install of the OS and make the changes defined in the documentation, I should end up exactly where I was before. If not, then I need better documentation. Keeping to this also eliminates drift.

Second, I should not treat my machines differently unless there is a justifiable reason to do so. This will help eliminate drift and complexity.

Third, keep the hypervisors as simple as possible. If something doesn't need to run on the hypervisor, don't run it there. Ensure that the hypervisors do not depend on each other, so that each can run independently. Anything that creates a cross-machine dependency should have checks in place to ensure cascading failures don't happen. This addresses both the bloat and the lack of abstraction.

So, I had my issues; I had my goal. Now to plan.

The plan

I began to document everything about my machines. I had little visibility of what my infrastructure actually was, and if I wanted to redeploy, I should at least document the mistakes I had made so I wouldn't make them again. This task alone took up the first week of my holiday, but it was worth it.

Having a good view of everything in my domain made it clear how much drift had crept in, on both the hardware and the software side. One of my hypervisors had over 100GB of RAM, yet another had 32GB. On the software side, one hypervisor was set up and synchronised with LDAP, yet another was not.
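To give a flavour of the kind of per-host record I ended up keeping (the host names and exact figures below are illustrative, written here as an Ansible-style inventory rather than my actual notes):

    hypervisors:
      hosts:
        hv-a:                 # had grown to well over 100GB of RAM, joined to LDAP
          ram_gb: 128
          ldap_joined: true
        hv-b:                 # still on 32GB of RAM, never joined to LDAP
          ram_gb: 32
          ldap_joined: false

Simply writing the differences down side by side made the drift impossible to ignore.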

It was impressive that everything was running as smoothly as it was, in all honesty.

Once I began building out the documentation, a nagging thought formed in my head:

Am I creating a spec, or am I creating documentation?

A spec states the requirements for a system: how it should function and behave. Documentation, on the other hand, describes how the system currently functions. While what I was writing was, in essence, documentation, it felt a lot like a spec, so I treated it as such.

For each component I wrote about, I determined whether it was useful to the end goal. If it was, I documented it, and it became part of the spec. If it wasn't relevant to my end goal and I couldn't justify its existence, it didn't make it into the spec.

One such example of this is the use of Nvidia drivers in the kernel. At some point in my infra, I had installed a GPU into each server so that it could perform hardware encoding of video. This resulted in the Nvidia kernel drivers being installed.

If I were treating this as Documentation, I would point this out and include instructions for how to set this up on a new hypervisor.

I determined that I shouldn't be using out-of-tree kernel drivers if I wanted to ensure stability, so it didn't become part of the spec. This approach led to me writing a hybrid spec/doc sheet that takes the best parts of my old infra and builds a rock-solid foundation for future expansion.

Once redeployed, the spec will become the documentation, and all will be well.

I call this infrastructure oxide one. It's currently at version 1, but it will be updated well into the future, as it is now the single source of truth for all of my machines.

The next blog post will cover how I went about turning the spec into action using Ansible.

The wiki is available here