This post is the first of a two-part miniseries identifying and correcting old mistakes. Part Deux is also available.
Today's atonement for old mistakes: Git repos used in production in which nested/disparate accounts run code.
The 'How We Got Here'
The production environment we spun up several years ago at work for my data warehouse sourcing/ETL/shipping jobs was quite literally the first environment we'd set up in which Git (and GitHub Enterprise) was heavily involved. This mindset shift came about as we switched hosts after a major server OS change, and the idea was to be able to stand up a new "fully functional" production environment much more nimbly, relying on as little institutional memory as possible. Think disaster recovery, but more practically the ability to move to a new production host. Scripts to configure the basics of the host/environment live in their own repos, and a small number of additional repos are organized by the nature of the jobs/projects. As such, rather than having twenty-five distinct repos for different jobs/projects, we have five that each contain one or more jobs/projects by type. This decision was made to simplify the rebuild/port and, generally speaking, has been a solid choice.
Almost all changes are made outside of production, committed, pushed (PRs are rarely involved due to the nature of most changes which are "date/time/term related"), then pulled into production where they happily run. Prior to this change, almost all changes were made live, in production, and without any practical ability to back out of a borked change.
In other words, mostly all good things! Except...
The original design grouped jobs and projects together by their nature, and that resulted in most of our "vendor systems" jobs being grouped together into one repository. This seemed fine "on paper," but a problem arose when we first put this into production. Most of our "vendor systems" jobs run under their own accounts due to access rights, key authorizations, and so forth. The design ensures role and access separation, but injected a unique issue when these were all rolled into one central Git repository:
git pull would occasionally fail depending on the pulling user and the changes coming down, specifically when changes crossed 'projects' as they might during term changes (many similar changes rolled into one commit). The pulling user might not have access to the contents of other project directories in the repo, so the pull could not update those files. It was a known design problem, and something "we'd address sometime in the future."
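To make the failure mode concrete, here is a minimal Python sketch of the check we were effectively doing in our heads before each pull. The project directory names and account names are made up for illustration, and the sketch assumes each top-level directory in the repo belongs to exactly one account.

```python
from pathlib import PurePosixPath

# Hypothetical mapping of top-level project directories to the account
# that owns and runs them. These names are illustrative only.
PROJECT_OWNERS = {
    "vendor_alpha": "svc_alpha",
    "vendor_beta": "svc_beta",
    "vendor_gamma": "svc_gamma",
}


def accounts_touched(changed_paths, owners=PROJECT_OWNERS):
    """Return the set of accounts whose project directories a change touches."""
    touched = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        # Files outside any known project (README, etc.) are ignored here.
        if parts and parts[0] in owners:
            touched.add(owners[parts[0]])
    return touched


def pull_is_risky(changed_paths, pulling_user, owners=PROJECT_OWNERS):
    """True when the change touches a project owned by a different account."""
    return bool(accounts_touched(changed_paths, owners) - {pulling_user})
```

The list of changed paths could come from something like git diff --name-only against the fetched branch. A term change that edits several vendor directories in one commit is exactly the case where pull_is_risky returns True for every individual account.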
The Interim Solution
We built and used a number of ways to control for the problem, including:
- Setting repo folder permissions with sticky bits, umask settings, and a common group for the .git folder;
- Running git pull as root when changes were known to cross accounts;
- Creating a script to re-apply permissions for the entire repo contents (after pulling as root); and
- Generally committing/pushing/pulling changes in sequence with each user account on the production host.
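As a rough idea of what the permission re-apply step looks like, here is a sketch in Python. It is not our actual script: the directory-to-account mapping is hypothetical, and it defaults to a dry run that only reports what would change, since changing ownership to another user normally requires root.

```python
import os
import shutil
from pathlib import Path


def plan_ownership(repo_root, owners):
    """Yield (path, account) pairs for everything under each project directory.

    `owners` maps a top-level directory name to the account that should own it.
    """
    root = Path(repo_root)
    for project, account in sorted(owners.items()):
        project_dir = root / project
        if not project_dir.is_dir():
            continue  # project not present in this checkout
        yield project_dir, account
        for dirpath, dirnames, filenames in os.walk(project_dir):
            for name in sorted(dirnames + filenames):
                yield Path(dirpath) / name, account


def apply_ownership(repo_root, owners, dry_run=True):
    """Re-apply ownership; with dry_run=True only report what would change."""
    changed = []
    for path, account in plan_ownership(repo_root, owners):
        if not dry_run:
            # Changing ownership to another user normally requires root.
            shutil.chown(path, user=account)
        changed.append((str(path), account))
    return changed
```

Run after a root pull, something like apply_ownership("/path/to/repo", owners, dry_run=False) would hand each project directory back to its own account. The path here is a placeholder.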
At the time this was all developed, we were not familiar enough with the deeper workings of Git to understand that there are ways to help address this situation with configuration variables and hooks. The reality, though, is that the repos with the most frequent and numerous changes were not affected, so anyone (me) making changes in the affected repos only did so one to three times per year. Merely knowing there was a sequence and process was "good enough." For other folks, some of this detail existed in each repo's README.
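For the curious, here is one way the "hooks" half of that could look. This is a sketch of an option, not something we ran. Git's core.sharedRepository setting covers group-writable files inside .git, and a post-merge hook runs after the merge step of a successful git pull. The script name fix_permissions.sh is an assumption for illustration. The hook runs as the pulling user, so it only helps where that user already has the rights to make the change, and it never runs if the pull itself fails.

```python
import stat
from pathlib import Path

# Body of a hypothetical post-merge hook. The script name
# `fix_permissions.sh` is an assumption, not something from our repos.
HOOK_BODY = """#!/bin/sh
# Runs after a successful merge, which includes the merge step of `git pull`.
exec "$(git rev-parse --show-toplevel)/fix_permissions.sh"
"""


def install_post_merge_hook(repo_root):
    """Write an executable post-merge hook into repo_root/.git/hooks."""
    hook = Path(repo_root) / ".git" / "hooks" / "post-merge"
    hook.parent.mkdir(parents=True, exist_ok=True)
    hook.write_text(HOOK_BODY)
    # Git only runs hooks that are executable.
    mode = hook.stat().st_mode
    hook.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return hook
```

Hooks live in .git/hooks and are not part of what gets pulled, so each clone on each host would need this installed separately.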
The Change Catalyst
For several years this worked out "okay," but this spring we were required to make more substantial changes due to our database environment undergoing a major upgrade. This required client software upgrades on our side (the production host), password and functional account changes, and some other changes that I'll cover in part two. The takeaway, though, is that these changes touched a lot of things across and throughout the repos, and this created a whole mess of permissions and file changes. Sure, we'd created a workaround, but it was time to really "fix" the problem.
The true solution was to extract and split up the offending repos based not solely on their function, but also by their account/access rights/permissions. Reviewing the various permissions, I discovered that it would be simplest/best to:
- Leave seven projects alone in repo A;
- Leave two projects alone in repo B;
- Move one project from repo A into repo B;
- Move two projects from repo B into repo A; and
- Move three projects from repo B into their own distinct repos.
Making these changes would completely eliminate the need to ever use root to perform a git pull and also eliminate manual or semi-automatic (scripted) reconfiguration of permissions following a pull. But it's a lot of moving around and moving parts, so it needed to be done methodically. This usually came down to the following steps:
- Move files from one repo to the other;
- Commit the deleted files from the source repo;
- Update paths and other repo-specific details on the moved code/scripts;
- Sanity check (creative searches for defunct paths, etc.);
- Commit the changes on the destination repo; and
- Push everything to GitHub.
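The sanity check step is the one most worth automating. Here is a minimal sketch of a "creative search" in Python, assuming the defunct paths are known ahead of time as plain strings. The example path in the usage note is invented. A plain git grep for the old path would do much the same job.

```python
from pathlib import Path


def find_stale_references(repo_root, defunct_paths):
    """Return (file, line_number, line) for each line mentioning a defunct path."""
    root = Path(repo_root)
    hits = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip directories and Git's own metadata.
        if not path.is_file() or ".git" in relative.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file; nothing to search
        for number, line in enumerate(text.splitlines(), start=1):
            if any(old in line for old in defunct_paths):
                hits.append((str(relative), number, line.strip()))
    return hits
```

Calling find_stale_references(".", ["/opt/repo_b/vendor_gamma"]) in the destination repo before committing should come back empty. Anything it does return is a script or cron file still pointing at the old location.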
The steps on the production host then more or less boiled down to git pulls and re-installing the cron files for each repo, since they'd changed so dramatically.
So Far, So Good!
Overall, I know why 'procrastinating me' put this off for ... five-plus years, but at the same time the whole set of migrations/changes identified above only took about an hour and a half from end to end. Those 90 minutes weren't even contiguous.
Other than a few harrowing minutes as one job was moved between repos, everything has also completed without error or notice. It's nice when things just work!
The harrowing minutes, however, were a bit tricky. One of the jobs/projects moved between repos is a middleware solution. It receives data from a source, transforms it, and then a destination pulls said data. This project (the middleware part) also runs every minute. During the couple of minutes of actual cutover, I received several error notifications due to paths not found, etc., while cron jobs were modified and source/destination jobs were adjusted to point to new paths.
I'm glad we made the move to Git so many years ago, but I'm ever so much more glad to have finally fixed this stupid design issue that I'd just been putting up with for future me to figure out.
Now time to go rewrite those READMEs...