Thoughts on owning Change Control in your Splunk environment
Change Control and the Deployment Server
Background
Recently, I was talking to my buddy Todd about deployment servers and their purpose in a Splunk architecture. You can see his excellent blog on the subject here (https://www.oldlogsnewtricks.com/post/splunk-deployment-server-cluster-might-be-your-solution). Funnily enough, I was chatting with another buddy about change control and Splunk just a few days before that. Neither of these guys told me off, so I am guessing that means I have some insight into change control (and I guess Splunk) that is worth sharing. So I thought I’d write a blog post on what we all talked about.
As tech professionals we often refer to change as something we do - but we don’t think of it strategically. This leads others to step in and do it for us. By the end of this post I am hoping to change your mind about change control.
Let me paint a picture here: you’re running your niche Splunk Enterprise instance that is growing fast, or perhaps your company signed a major contract with Splunk for a massive implementation that will take 18 months to roll out.
In either case you suddenly find you need to install Splunk forwarders (basically agents) on Linux hosts in the PCI zone to collect data. Perhaps you need to open firewalls to allow syslog through. Or maybe you need to enable CPU/RAM metrics collection from a critical app.
At a smaller scale you might be allowed to walk in and change things, but once real money is on the line rest assured these changes are not going to be allowed without SOME sort of process.
So you sit in a couple of meetings, and a manager that you like well enough (but who doesn’t understand the risk profile of Splunk) pontificates about how the change process will work. Alright, that’s fine. You’re not the boss, and who cares? You want your fingers on the keyboard creating new value! Every hour Splunk isn’t crunching data is money wasted, right? So you smile, nod, maybe toss in your $0.02, and move on.
In the coming months you’ll notice it can take days to get a log added to Splunk. It might take weeks to get a firewall changed, and even the simplest update of a Splunk app can take countless hours of talking and tickets moving around just to get approved to go into the CI/CD pipeline for Puppet to release.
You’ll find that fixing a Splunk App security bug can take days. Patch a forwarder? Weeks. Just adding value can slow to a snail’s pace.
You’re producing tickets with diagrams and notes that no one is really reading, neither creating value nor reducing risk. As a result:
- You’re always waiting while the ticket is cycled from team to manager to team.
- You’re physically tapping on shoulders to get approvals.
- Your work is starting to pool up, with a ticket backlog.
I’ve seen teams create “fake work” just to look busy while all this is happening.
In short: your talent is wasted!
You can see how change control has created a joyless working experience. It’s bad for the employee and it’s just as bad for the organization. Is it any wonder that you’ve lost the joy and excitement of working with a great product like Splunk when this happens? In DevOps terms we call this problem the loss of locality. That is: you can’t make atomic changes to the larger system.
Now, no one sets out to create a wasteful, broken change control process. That’s the scary part: this all comes from a good place. We add overworked and undertrained people from understaffed teams to a process. Then, we introduce complex ticketing systems somewhere in there. And the people leading this? They’re not accountable for the waste and delays they create!
You Need to Offer Change Control
YOU MUST take an interest in the Change Control process BEFORE someone who has no idea what you do for your company writes your policy for you. YOU need to stand up and offer a complete change control model before someone else does.
So how do you win this? It’s NOT someone else’s job to control the risk Splunk introduces to your organization. YOU need to make it YOUR job.
How often has a job fallen in your lap that you didn’t want? It’s shockingly easy to assume large chunks of change control responsibility. Sure, not all of it, but if you come in with a plan and confidence you can control the CR narrative. The time you get back from a better-aligned CR process will pay dividends.
So where do you start? Splunk’s Success Framework! The Splunk Success Framework (SSF) is a Center of Excellence model that provides a fill-in-the-blanks document to ensure that your Splunk deployment is well run.
At the time of this writing Splunk has provided a matrix on change control. I’ll go ahead and link you, but the reality is links change. Try Googling something like “Splunk SSF Change Management” if you’re in the future and this link doesn’t work. Or try the Internet Archive at https://archive.org/
The first time I saw this document, I really needed to read it twice, breathe, and relax.
I’ve had VERY professional people flip out when they read this – screaming about “overhead” and “busy work” etc. and… I get it. We’re talking about no less than 30 questions that need answers. And many of these are “no brainer” answers to technical people in the know.
But I need you to step back and try to understand that THIS is the problem your company is solving when they create anti-agile change processes. Try to have some empathy. This isn’t busy work; uptime, security, and communication are what write your checks.
The reality is that so many IT folks hate change control so much that the word “CR” alone has left them emotionally compromised, and honestly I’m not sure I can help those people. But those who embrace CR will realize it’s their ally in getting things done FASTER and with just the amount of quality the business needs.
There is a type of change we call “standard change” or in Splunk terms “Out of scope”. These things don’t require a major conversation, don’t need excessive documentation and are the “day to day” work. You might see this in something as simple as fixing a password, adding users, or rebooting Splunk boxes when there is no impact. This type of change is the key to success; you will define this change type very clearly early on and expand it. If you don’t take the time to consciously define your levels of CR, your processes will fall into the most cumbersome default process on the table, and this will undoubtedly be either too much or too little CR for what you’re trying to get done.
Personally I see it as 3 levels, but you need to customize this to the culture of your organization.
Standard Change (Out of scope, as Splunk calls it): These are things you can do as long as you have the most basic of permission, for example an email, a single ticket, or a text from the boss or user.
Non-Scary Change (In scope, as Splunk calls it): These are things you know have impact, but you’ve done them before. You have well-thought-out automations and/or runbooks that shield your company from risk, but you should still inform people since there might be social or technical impact.
Scary Change: This is my highest level of change control. For example, when we’re doing something we don’t have a script for, we don’t have automation for, or that hasn’t been done in a lab first for some reason. These are the scary kinds.
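As a rough illustration, the three levels could be written down as a small decision rule. This is only a sketch in Python: the field names (has_impact, has_runbook, tested_in_lab) and the exact cut-offs are assumptions made up for this example, not anything defined by Splunk, and you would tune them to your own organization.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeLevel(Enum):
    STANDARD = "Standard Change"
    NON_SCARY = "Non-Scary Change"
    SCARY = "Scary Change"


@dataclass
class ChangeRequest:
    # All fields are hypothetical; rename them to match your ticketing system.
    summary: str
    has_impact: bool      # could users or other systems notice this change?
    has_runbook: bool     # is there a script, automation, or runbook for it?
    tested_in_lab: bool   # has it been done in a lab first?


def classify(change: ChangeRequest) -> ChangeLevel:
    """Map a change onto one of the three levels (assumed rules, adjust freely)."""
    if not change.has_impact:
        # Day-to-day work with no impact: basic permission is enough.
        return ChangeLevel.STANDARD
    if change.has_runbook and change.tested_in_lab:
        # Impactful, but rehearsed and automated: inform people and proceed.
        return ChangeLevel.NON_SCARY
    # Impactful and missing a runbook or a lab run: the scary kind.
    return ChangeLevel.SCARY


if __name__ == "__main__":
    examples = [
        ChangeRequest("reset a user password", False, False, False),
        ChangeRequest("minor forwarder update", True, True, True),
        ChangeRequest("first-time change with no script", True, False, False),
    ]
    for change in examples:
        print(f"{change.summary}: {classify(change).value}")
```

Writing the rule down this explicitly, even on paper, gives everyone something concrete to agree or disagree with when you negotiate what counts as a Standard Change.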
Once you get everyone to agree to your categories, guess what? You’ve won. This is your angle. As you present what is standard and non-standard, don’t paint yourself into a corner. Start to include things like “minor forwarder updates” to be excluded from formal change control. Add contingencies for “Splunk TAs” to release faster through your Jenkins pipeline or, even better, self-manage with a Splunk deployment server. Earn buy-in and document what you know to be safe and sane into the Standard Change model. Keep developing your automations to change your work from Scary Change to well-automated Non-Scary Change.
If you don’t like your CR process, chances are you’re not involved with it. Help grow it. Tune it to what’s right for the risk appetite of your company vs actual risk being created.