Our Journey with Salt (Medium.com)

Our Journey with Salt

When we hit the web to find places that discussed best practices, tips, tricks, etc., we were surprised by the sheer silence of the community. (Or the terribleness of our Google searches.) While we don’t have any tremendous insight, each of our Operations Engineers has experience with different config management tools, so with our collective wisdom we decided to chart our own course and document it along the way.

The Myth of the Working Manager (Medium.com)

The Myth of the Working Manager

When we talk about management, what we’re often describing are more supervisory tasks than actual management. Coordinating time off, clearing blockers and scheduling one-on-ones is probably the bare minimum necessary to consider yourself management. There’s an exhaustive list of other activities that management should be responsible for, but because most of us have spent decades being led in a haze of incompetence, our careers have been devoid of these actions. That void eventually shapes our expectations, and what follows is our collective standards being silently lowered.

People, Process, Tools...In That Order (Medium.com)

I’ve started doing more of my blogging on Medium.com’s blogging platform. It’s just easier in terms of generating readership, commenting, recommendations etc. But I still will post links to those articles here, since this is sort of my digital home page.

People, Process, Tools…In That Order

In technology circles, we tend to solve our problems in a backwards fashion. We pick the technology, retrofit our process to fit with that technology, then we get people onboard with our newfound wizardry. But that’s exactly why so many technology projects fail to deliver on the business value they’re purported to provide. We simply don’t know what problem we’re solving for.

When You Think You're a Fraud (Medium.com)

When You Think You’re a Fraud

I can’t help but compare myself to others in the field, but I’m pretty comfortable recognizing someone else’s skill level, even when it’s better than my own. My triggers are more about what others expect my knowledge level to be at, regardless of how absurd those expectations are.

Metrics Driven Development

Metrics Driven Development

The premise is simple: a series of automated tests that can confirm a service is emitting the right kind of metrics. The more I thought about it, the more I considered its power.

Securely Connecting to Your Home Server

I’ve found myself needing to connect to my home network remotely quite a bit lately. There are a number of solutions out there for this, including setting up your own VPN tunnel using open source tools. While I am a fan of VPN, I’ve opted for a simpler solution using OpenSSH and port forwarding on your router.

The horror of port forwarding isn’t necessarily the act of the forwarding itself, but the service in the background you’re forwarding to. I didn’t feel comfortable having Transmission or CouchPotato publicly accessible from the web. It was less about someone submitting crazy torrent downloads and more about those services having potential vulnerabilities in their prepackaged web servers.

Generating the SSH Keypair

The first thing you’ll want to do is make sure you can login to the machine via a public/private key pair. In order to do that, you’ll need to first generate a keypair.

ssh-keygen -t rsa -b 4096 -f <key_file_name.key>

You’ll want a nice big fat key size (hence the 4096) just to future-proof it. And considering this is for your home machine, the overhead you’ll pay for it is largely irrelevant. This command will generate two files: the file name you specified with -f, and a file of the same name with the .pub extension.

It is very important to understand the difference between these two keys. One is the public key (.pub) which you are allowed to share with other systems for authentication. The other (without the .pub) is the private key, which you should never, never, never share. Your private key also requires specific file permissions, so you’ll want to execute the following.

chmod 600 <key_file_name.key>

Now you need to get that public key onto the server. The easiest way to do that is with the command ssh-copy-id. Once again, I’ll stress the importance of knowing the difference between the two keys: you should only copy your public key (the file ending in .pub) to the server.

ssh-copy-id -i <key_file_name.key.pub> <user>@<ipaddress>

It will ask you for your password and then copy the file to the correct location. Now test that it works:

ssh -i <key_file_name.key> <user>@<ipaddress>

If you set everything up right, you should be logged in without being prompted for a password.

Securing OpenSSH

Before you forward any ports to OpenSSH, there are a few things you should do:

  • Disable insecure ciphers. Out of the box, OpenSSH enables a number of older ciphers which can make your server susceptible to various attacks.

  • Disable root logins. It seems odd that this is sometimes allowed, depending on your distro. Very seldom do you want to allow root to log in via an OpenSSH shell. It’s better to have an unprivileged user log in and then sudo to perform specific actions. This might seem like overkill for a single user system, but it’s a good habit to get into.

  • Disable the use of passwords. Sounds crazy right? Not really. You should only be using public/private keypairs to authenticate to your SSH server. This makes it impossible for your password to be brute-forced. Now you just have to make sure you protect your private key from other people accessing it.
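As a rough sketch of what those three changes look like in practice, here are the corresponding sshd_config directives. The cipher list is just one example of a modern-only selection, not a canonical recommendation, so check current guidance before copying it verbatim.

```
# /etc/ssh/sshd_config (excerpt)
# Restrict to modern ciphers only (example selection)
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com
# No logging in directly as root
PermitRootLogin no
# Keypair authentication only; no passwords
PasswordAuthentication no
ChallengeResponseAuthentication no
```

After editing, restart the sshd service for the changes to take effect.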

Instead of just copy/pasting the OpenSSH config, I figured it made more sense to put it on GitHub so that we can continue to evolve and refine it over time. Check back periodically to see if there are any updates to the config.

Enable Port Forwarding on your Router

Enabling port forwarding is specific to your router. If you already know how to do it, then just make sure you forward port 22 to port 22 on your target server. If your home internet service provider gives you a static IP address, then connecting is easy. If not, you may want to look at a dynamic DNS service so that as your IP address changes, your DNS record is updated.

If you need to check what your public IP address is, you can execute the following

curl icanhazip.com

That should return your public IP address.

SSH Tunneling

Now you can set up your SSH tunnel to send traffic over port 22 to any of the ports on the remote machine. From your client, open a terminal and execute

ssh -i <private key> -L <local port>:<destination host>:<remote port> <user>@<server ip>

So that’s a handful.

  • The private key is from the key pair generated in the previous step.
  • local port - the port on your local machine you’ll connect to in order to reach the remote service. For simplicity’s sake, I’d just use the same number as the remote port.
  • destination host - the host the SSH server forwards traffic to, as resolved from the server’s side. For a service running on the server itself, localhost works.
  • remote port - the port of the service you want to reach on the remote server.
  • username@server ip - the user you’ll be connecting to on the server, along with the public IP address (or DNS name, if you set one up) of your router. It will resemble an email address.

ssh -i server.key -L 3000:plex.myserver.net:3000 jeff@plex.myserver.net

If you need to forward multiple ports, you can repeat the -L flag for as many ports as you need.
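If you want to sanity-check a multi-forward command without actually connecting, OpenSSH’s -G flag prints the configuration the client would use and then exits. A sketch, reusing the hypothetical host and user from the example above:

```shell
# -G resolves the full client configuration without opening a
# connection; grep pulls out the forwarding entries parsed from -L.
ssh -G \
  -L 3000:plex.myserver.net:3000 \
  -L 32400:plex.myserver.net:32400 \
  jeff@plex.myserver.net | grep -i localforward
```

You should see one localforward line per -L flag, which confirms the syntax before you burn time on a failed login.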

The last step is to open your browser and go to “http://localhost:port” where port is the local port you specified in the above command. Your port should be forwarded and connected to the remote server.

Voila, you’re now communicating with your remote server securely.

UPDATE: Some folks have said that they have problems connecting remotely, with a “connection refused” or “server unexpectedly closed the connection”.

I’ve seen this happen on some setups, and adding a localhost entry usually resolves the issue. In your /etc/hosts file, just add an entry that matches your server’s fully qualified domain name. (ex: plex.myserver.net)
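For example, on the server, an /etc/hosts line like the following (using the hypothetical name from earlier) maps the FQDN back to the loopback address:

```
127.0.0.1   localhost plex.myserver.net
```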

Logs vs Metrics

I constantly struggle with the idea that a log entry and a metric are not the same thing. Both provide value, but they’re telling different stories. The problem is that the distinction is a lot like the famous definition of pornography: I can’t define it, but I know it when I see it.

In my shop, our applications emit a lot of data. Now some of you are reading this and thinking that it’s a first-world problem in the technology space; I get that. But the problem comes when I try to use the right tool to evaluate that data. Should this go into a log file to be aggregated by a tool like ELK or Splunk? Or should this just be a tick that fires and is sent to a metrics collector like Graphite? How do I articulate that choice to a developer?

Let’s look at Twitter for example. Let’s say they want to track tweets per second. For our purposes the options are

  • Log to a file when a tweet happens. That log is shipped to an aggregator. Then a log search tool evaluates the log messages (because there could be other log messages there), does some math, and then displays it.
  • The application uses some transactional annotation around the “new tweet” functionality. That sends an “event” type metric with some additional metadata and fires it off to a metric service.
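As a rough sketch of the second option, here’s what firing a single event at a Graphite/Carbon listener looks like using its plaintext protocol. The metric path and host are made up for illustration, and the nc line is commented out since it assumes a reachable Carbon listener:

```shell
# Build one Graphite plaintext-protocol datapoint: <path> <value> <epoch>
line="twitter.tweets.submitted 1 $(date +%s)"
echo "$line"

# Shipping it is a one-liner against Carbon's default plaintext port:
#   echo "$line" | nc -w1 graphite.example.com 2003
```

Compare that to the log pipeline: no shipping, parsing or searching, just a tick with a timestamp.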

Both are conceptually the same thing, but the log file approach seems dirty to me. Other than the benefit of some abstractions, the core tenets of both approaches are the same. My discomfort must lie in the intent of these two things.

A log entry should give you information or detail about a specific event, whereas a metric should just let you know that an event occurred.

An example of a good log message

  • Tweet number 238472 was submitted successfully by user @darkandnerdy at 11:05am
  • User DarkandNerdy has failed a login attempt. The user has 4 concurrent login failures

An example of a metric

  • 11:05 - tweet submission successful
  • 11:15 - Failed login attempt

The specificity that a log message allows is what makes it valuable. But more importantly, as an Operations person, that specificity should be tunable. I only need it when a problem arises. I shouldn’t be forced to have logging set to a low level (like INFO) just to have an idea of how the system is performing. In normal operation, I want my log file to contain anomalies, not stacks of information telling me that everything is ok. (And if that is the case, maybe those logs need some separation from general log messages)

Through the process of writing this blog post I’ve become much clearer on how I feel about these two things. My “thesis statement” if you will, would be this:

Log messages are notifications about events as they pertain to a specific transaction within the application. Metrics are notifications that an event occurred, without any ties to a transaction.

Ok so what’s the difference? Well, again putting on my Operations hat, metrics can be considerably smaller because they convey considerably less information. They’re also far easier to evaluate. Both of these points have an impact on how we store, process and retain metrics.

A log file, however, gives you details on a transaction, which may allow you to tell a more complete story for a given event. The transactional nature of log messages in aggregate gives you much more flexibility in terms of surfacing information (not just data) about the business.

So now that I think I’ve cracked my first world problem, I’d pass on a bit of advice: any data is better than no data. If things are preventing you from having a decent separation between the two, just HAVING the data makes a huge difference. Once you have it, you can argue over the semantics. :-)

The High Cost of Low Friction Infrastructure

The impact that cloud computing is having on technology organizations is undeniable. It’s allowing companies to move at an unprecedented speed, churning out infrastructure as effortlessly as pushing a button in a Jenkins job. Gone are the days of counting infrastructure delivery time in quarters. Now the timescale is in minutes. While the agony of endless change gates for hardware has been applauded by most of us in the industry, our zeal for the speed gained has left us unaware of what’s been lost. There are no perfect systems. Everything comes with a cost.

Systems Aren’t Always Code

When we talk about systems design in technology, we’re often discussing some computer-to-computer or computer-to-human interactions. But in between these sophisticated transactions lies the original system: two humans directly interacting with each other. (Albeit that interaction is increasingly becoming digital as well)

These systems exist as approval processes, peer review and the often-vilified meeting. These human touch points, however, serve as signals within the human system. If you make a purchase order request for new hardware, that signals to the operations group that something within the organization has changed. Are we expecting an increase in demand? Has a new project started? Questions begin to flow, the human network does its thing, and people begin to interpret the signal and take action. More than a handful of organizations have processes that are initiated as a result of these signals, to ensure broad involvement and assessment of risk and impact.

In the world of the cloud though, many organizations haven’t thought through their human systems and where these gated signals will now come from. An engineer can spin up a system, make a DNS entry, share a link and have a “production” system up and running before lunch. A self-contained team has all the tools necessary to go to production. It’s a blessing if you’ve accounted for it and a curse if you haven’t.

Conway’s Law has often been used as a cautionary tale. What we fail to realize however is how often a company uses Conway’s law as a control mechanism for certain actions. By creating a barrier between systems, development and design, you’ve forced an interaction that improves communication. The quality of that communication is debatable, but the fact that it exists at all is a win in this context. I’m not arguing whether that limiter is good or bad, but your organization has to understand where and by what degree it depends on that limiter. How does the security organization become aware of new instances?

Who is responsible for patching those instances? Monitoring them? These questions are typically birthed at the point of that forced interaction. Could the team deploying the system ask and answer these questions internally? Absolutely. And often they do. But that is extremely company dependent and comes with its own set of challenges that are outside the scope of this post.

The issues I have outlined all have workable solutions, but you can’t solve a problem unless you’re aware it exists. Be honest about your gates and checkpoints. You wouldn’t have added them unless they provided some value. Maybe that value has been over- or underestimated, but the move to the cloud is a great time to re-assess them.

Do More With More

Something I’ve always found ironically hilarious in the virtual server space is why it came about and where we’re at now. Tech organizations saw all this underutilized hardware and thought virtual machines would be able to help address this and save cost. Now the industry has loads of under-utilized virtual machines on over-subscribed hardware. If you find yourself in this situation, the move to the cloud may not offer much relief.

During the days of gated hardware acquisition, there was a certain pain that went with ordering a server. Forms, inquiries from management, meetings and just a bit of your soul were all common prices to pay as someone rolling out new hardware. As a result, there was a sort of self-preservation-like preference for using hardware your team already had. With the cloud, that friction is removed, making it easier and more attractive to simply spin up a new server. There’s value to that: a tidy separation of concerns, isolation for maintenance and a performance profile that is in line with your application’s needs.

The rub comes in when this mode of thinking becomes institutionalized. Before long, it’s habit that every application gets its own set of servers. But does that make sense for every solution? If billing and marketing have an internal-only application, does it make sense for each of them to have their own server? And then if they’ve got one, they have to have two right? (Redundancy and all) What is the cost of a 30 person unit sitting on 4 servers? You may have hidden that cost in VMWare, but in the cloud it comes front and center.

There’s that old economic principle, the law of demand. As the price of a good reaches 0, demand becomes infinite. While the price of cloud computing is nowhere near zero, organizations have effectively removed the purchaser of the good from the consequence of that good’s cost. We’ve trivialized the acquisition process to the point where need and want are synonymous. If you’re not careful, you can quickly fail to deliver on the promise of cheaper infrastructure using the cloud.


I bring up the topics above not as a deterrent to the cloud, but as a warning. Adopting the cloud isn’t just about speed of delivery, but about a shift in how your organization thinks about the technology life cycle. There are no free lunches. With speed comes some added risk. Is it worth the payoff? Every company has to decide that for themselves. But a healthy dose of organizational pragmatism is good “one-size fits all” advice.

Troubleshooting Fallacies

An outage on a large distributed system can be a very difficult thing to troubleshoot. The pressures of an outage can often lead to making poor or shortcut decisions. When the system is down, the drive is to get the system back operational as quickly as possible. But sometimes we sacrifice process for the sake of speed.

In this frantic state, we fall victim to what I call Troubleshooting Fallacies.

#1 - Subsystems Fail All or Nothing

When we’re troubleshooting an issue we start to quickly assess the potential failure points in the system that may be causing the problem. Sometimes this may happen at lightning speed at a subconscious level. For example:

  • You know it’s not the network because you can ping the server
  • DNS isn’t the problem because the hostname is resolving
  • The application can connect to the database because you verified credentials and connectivity from the command line.

The list could continue, but you get the idea. The criteria we use for eliminating these subsystems is usually razor thin, ignoring the complexity within each of those sub-systems. We behave as if the subsystem fails in totality. But it’s quite possible that the subsystem is failing in a sort of nuanced way. Maybe only specific DNS servers are failing. Or maybe they’re only failing from a specific node. Maybe the host can connect to the database, but the application is having trouble connecting, possibly due to a configuration issue.

When you begin to exhaust your options, be sure that you verify the subsystems involved as concretely as possible. Sometimes a high level pass of the subsystem simply isn’t enough.

#2 - No Errors = No Problems

When trouble starts and the resolution hasn’t been narrowed down, you might call in a team member from each area: storage, networking, application development, and site support. As each functional area signs in, most admins have already begun rationalizing why it isn’t their particular subsystem that’s at fault. This sort of system bias leads the admin to perform a less-than-ideal verification of their system.

Instead of verifying that things are working correctly, the admin simply confirms that no errors are being thrown. But this mode of verification assumes that the admin knows every type of failure mode of the system and how it manifests itself, an unlikely reality. This is what I call checking for errors instead of verifying success.

Checking for errors can be a bit oversimplified if you don’t believe your system is at fault. It exposes a kind of confirmation bias, where you may dismiss anomalous, but not necessarily damning, information. Verifying success is a bit more thorough, but requires a level of preparation before the incident, as it will most likely require some automation. Verifying success may look like:

  • Scripted versions of API calls
  • Database connection tests from nodes in question
  • Synthetic transactions against a web interface
  • Response code verifications

Now you’ll notice that most of these sound like potential monitors that should be part of the system. That is true, but a lot of these monitors would probably be implemented in the system admin’s language of choice, versus the language the system is written in. (Think Python vs. Java) These success verification applications should try to mimic or mirror the languages and libraries used in the application to help reveal potential software incompatibilities.
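Here’s a minimal sketch of one item from that list, a response-code verification. It spins up a throwaway local HTTP server so the whole thing is self-contained; in real life the URL (and the port, 8099 here) would be your actual service endpoint:

```shell
# Stand up a disposable HTTP server to play the role of the real service.
python3 -m http.server 8099 --bind 127.0.0.1 >/dev/null 2>&1 &
srv=$!
sleep 1

# Verify success explicitly: assert on the response code we expect,
# rather than just scanning logs for the absence of errors.
code=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8099/)
kill "$srv"

if [ "$code" = "200" ]; then
  echo "check passed: got $code"
else
  echo "check FAILED: got $code" >&2
  exit 1
fi
```

The point is the assertion: the script states what success looks like up front, instead of assuming that silence means health.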

#3 - Finding the Root Cause is a Must

Root cause analysis is a tricky subject. In most of these overly complex systems, the idea of a root cause is a bedtime story we tell managers to help them sleep at night. The hard truth is that the more complicated our systems become, the more nuanced our failures are. Root cause is fast becoming a thing of the past. Failure looks more like various sub-systems entering a failed state, which in turn produces a system-level failure mode. Example:

  1. HTTP Requests come into a web server without a timeout value
  2. The HTTP requests result in database calls
  3. The database is saturated, so the queries take 2-3 seconds to respond.
  4. The service time of requests to the HTTP server is too long to handle the arrival rate of requests without queueing.
  5. The HTTP server runs out of available threads and makes the service completely unavailable.
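A quick back-of-envelope on step 4 shows why the thread pool drowns. By Little’s law, the concurrency you need is the arrival rate times the time each request spends in flight; the numbers here are invented for illustration:

```shell
# 100 requests/sec arriving, each in flight ~2.5s waiting on the database.
# Little's law: concurrency needed = arrival rate x time in system.
awk 'BEGIN { arrival = 100; service = 2.5; print arrival * service }'
```

With those (made-up) numbers you need roughly 250 threads just to keep pace, which is well past many default thread pool sizes, and that’s how step 5 happens.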

Is the database the root cause? Or is it the fact that HTTP requests are allowed to execute without a timeout value? Or maybe the HTTP server doesn’t have a sufficient number of threads for peak traffic volume?

Any one of those could be a valid root cause, which means none of them are really the root cause. The lack of root cause does prevent a tidy answer for an incident report, but it does promote a more thorough understanding of your system and its various failure modes.

If your leadership insists on root cause analysis, I suggest you take that analysis as far back as possible. You may not like where it takes you.

#4 - It Should Be a Quick Fix

As problems arise, admins can repeatedly underestimate the impact of the issue. The quick fix always seems to be just around the corner, which ultimately slows down the final resolution because of a haphazard, gut-feeling approach to troubleshooting.

It’s easy to be undisciplined in your troubleshooting process during an outage. The pressure is on and you leap from one potential issue to another, a lot of times without any scientific evidence to back it up. It’s imperative, however, that we maintain control and process, even in the face of a management mob desperate for resolution.

This of course presupposes that you have a methodology. If you don’t have a process I would highly suggest looking at the USE Method by Brendan Gregg. It’s a practical approach that allows you to break up a system into various components and then test those components for particular failures.

Gregg’s approach has a particular performance bent to it, but with a little bit of effort it can be adapted to suit analysis at any level of the application stack.

Wrap Up

This list is far from exhaustive, but highlights some of the issues that teams fall victim to. Recognizing your mistakes is the first step to avoiding them. I hope to continue to expand on this list over time.

Importance != Priority

In today’s economy, workers in every industry are facing a common challenge: do more with less. The less part is a particularly interesting constraint, because it adds to the number of concurrent tasks per team member. You constantly hear about this situation and a person’s ability to “juggle multiple priorities”. But I personally reject this line of thinking on the basis that the statement itself is flawed, unproductive and ultimately impossible.

When we say a task is a priority, what we’re really saying is that the task is important. You can have an unbounded number of important items on your list, but importance describes a task in its own space. Priority, however, describes a task in relation to other tasks, and as a result you can truly only have a single priority at a time. That doesn’t reduce the importance of the other tasks at hand, but it does clearly identify one item as the single most important thing on the docket. Priority is binary and global across the scope of work that you’re performing. Let’s use an example to help clarify.

You have a presentation that you’re giving to C-level executives. It’s the type of presentation that careers are made of. You’ve been prepping all year for this one moment. Right before you’re about to go up, you get an email that says there’s a huge problem with the financial system and your assistance is needed. Normally this would cause you to drop everything and address it right away. But today, the board meeting is the priority. That doesn’t detract from the importance of the financial system problem, but it has to be delayed behind the board presentation. You’ve clearly identified the priority of the moment. You’re not going to check in on the progress of the financial problem during the meeting, are you? It gets tabled and/or delegated (and possibly made someone else’s priority) so that you can address the real priority, the board meeting.

Now sticking with that same example, you’re about to go on in front of the board when you get a phone call. Your significant other has been in a car accident. It’s pretty bad and they’re currently being air-lifted to a nearby medical facility. What’s the priority now? Are you going to deliver the board presentation from the car? From the hospital? I doubt it.

The example is extreme but it highlights a few things.

  1. Priorities are fluid
  2. Priorities are (usually) obvious

This means that sometimes important things slip. It’s the nature of the world. We like the idea of multiple priorities because it makes it easier to rationalize our choices. “I’m not saying work is more important than you, honey. You’re both my priority.” But that’s bullshit. You’re not at dinner or at the zoo with your kid, you’re in your office working.

The idea of a single priority forces us to stare the reality of our choices right in the eyes. You’ve chosen this over that. Own that choice and everything that it implies. Or re-evaluate your priority.

Refactoring Pet Peeves

Somewhere during my wild romps on the Internet I came across an interesting article that talked about the concept of “defactoring”. I think we’re all familiar with the idea of “refactoring”, but defactoring goes against the many rules that have been ingrained in us as technologists. I think that’s why I love it so much.

Defactoring reduces the number of ways we can recombine the pieces of code we have…We’d make it less flexible.

In the technology sector we’re often reading about the latest and greatest technology, architecture/software pattern and how you’re basically a luddite if you don’t adopt these practices. Proper factoring of code is probably one of the first times I experienced this phenomenon. I rushed to make sure all of my functions were small and tight. I mastered the art of taking a 30 line shell script and exploding it into 120 lines of well organized bliss.

The problem that I think “defactoring” tries to address is that of complexity for the sake of complexity and cool kid points. While I love well organized code as much as the next guy, most of the reasons you would break code apart aren’t valid for a lot of the programs being written. (Especially on the Systems end of the house)

The biggest reasons to refactor code, in my experience, are:

  • Reusability of code
  • Testability of code
  • Isolating complexity

If refactoring your code doesn’t provide any of these benefits, then what’s the real purpose of doing it? It’s just another logical break you have to jump to when reading the code. Does it add any value?

Before you move that for loop to its own function, ask yourself “What does this buy me?” If you can’t answer that question in 10 seconds or less, you probably don’t need to do it. Yes, people might judge your code, but if it’s not that, it’ll be something else. (Engineers are a persnickety bunch)

The world has enough complexity. Don’t add to it without good reason.

Airmail2 + Omnifocus

Full disclosure. I’ve had a few drinks and should probably wait till morning to write this. But waiting is for suckers. YOLO

I’m a big fan of Omnifocus and Airmail. Together they’re like peanut butter and jelly. But one thing that absolutely drives me insane is the way Airmail’s integration works with Omnifocus.

When I convert an email in Airmail to a task in Omnifocus, it creates a task with the task name being the subject. But the body of the message (the note) actually becomes a link to the Airmail message. That’s all fine and dandy, but I use Gmail as my mail provider. When I archive messages, Omnifocus seems to get confused about how to find the message based on the URL link. Rubbish. I want something simple and stupid. Enter Applescript.

With Applescript I was able to quickly write a tool that allows me to convert the body of the email message into a note in the Omnifocus task, which eliminates the need for me to keep the message around at all in my Inbox. Below is the script, but you can check out the Gist here.

tell application "Airmail 2"
	-- Grab the HTML body and subject of the currently selected message
	set theMessage to selected message
	tell theMessage
		set theContent to htmlContent
		set theSubject to subject
	end tell
end tell

tell application "OmniFocus"
	tell quick entry
		-- Convert the HTML body to plain text using textutil
		set theRTFMessage to do shell script "echo " & quoted form of theContent & " | /usr/bin/textutil -convert txt -stdin -stdout -format html"
		-- Create the inbox task with the subject as the name and the body as the note
		make new inbox task with properties {name:theSubject, note:theRTFMessage}
		set note expanded of every tree to true
	end tell
end tell

The script is pretty vanilla, except for the line referencing /usr/bin/textutil. Textutil is an awesome little utility on OS X for converting text between various formats. It’s part of the Cocoa framework, so it should be available on all Macs running OS X. (Gotta get specific for people that still think Linux is a desktop OS. OOOOH BUUURRRNNN)

Now you need to make the script useful.

  1. Open Script Editor on your Mac and copy pasta the script into it. Save it somewhere and make a note of the location.
  2. Launch Automator, and choose “Service” as the Document type.
  3. Open a Finder window and drag your saved script onto the Automator build section.
  4. In the upper right hand section, change the in “any application” drop down to Airmail2. (You might have to click other and browse for it)
  5. Save the Service via File -> Save

Now that you’ve created the service, you’ll want to create a shortcut for it in Airmail.

Launch System Preferences and go to Shortcuts. Go to App Shortcuts in the left hand bar. Click the “+” icon.

In the Application section, choose Airmail 2. In Menu Title, type the exact name of the service you created above. Choose a Keyboard Shortcut for the last field. I personally use CMD+SHIFT+, but YMMV. Choose what works for you.

Voila. Get better emails in your Omnifocus. Now I’m just waiting for everyone to tell me there was an easier way to do this. Because I CAN’T be the only one frustrated by this.

Can You Stomach Root Cause Analysis?

Lately I’ve become extremely interested in accident analysis techniques. This is largely useful in the manufacturing and transportation industries, but there has been a growing trend to adopt these types of practices in the technology arena. Think Kanban, Lean Startup, and the Theory of Constraints to name a few.

But accident analysis and safety digs deep into the nature of failure within a system. Some of my favorite thinkers in the field like Sidney Dekker and Nancy Leveson have been forcing me to go beyond the surface of an issue and to dig deeper into the organizational issues that are equal contributors to failure.

Root Cause Analysis (RCA) is something that gets touted all the time in technology. When a system goes down, we’re desperately trying to find out what caused things to go bad. Despite our best efforts, we never seem to go far enough with RCAs.

One of my favorite mantras regarding root cause is that “Root cause is simply where we stop looking.” We go far enough down the rabbit hole that we simply can’t explain further, don’t have the will to explain further, or we’ve reached a politically acceptable answer. (Leveson, Engineering a Safer World)

So why do we go through the theater of Root Cause Analysis in technology? Because we need to explain the unexplainable. Because if we can’t explain something, how can we possibly give assurances it won’t happen in the future?

The technology field has done a great job of pretending that everyone has their shit together. No one should ever have a failure that goes undetected. Anyone who isn’t alerted before a problem happens is an idiot. These are all worthwhile goals, but they are so far away from the reality of where we are in technology. But thanks to blogs, social media and the wisdom of hindsight, gaps in system and failure monitoring are painted as largely the result of unqualified staff. This belief is held by management, furthered by people in the industry who talk a good game, and then ultimately internalized by those with imposter syndrome. So we stress and we agonize over the root cause of a thing. And that’s not entirely a bad thing, but here’s the rub.

Let’s say we get to a point where we find that the root cause was setting X not being set to a reasonable or correct value. That’s it. We change setting X and explain how, had X been set to a sane value, the chain of events would have stopped and everything been fine. But why was X set to the value it was? Easy, the person before us was an idiot. But is that really the case? What factors went into that decision? What organizational pressures were present that forced a more conservative value? Could we not spend the money for extra hardware that would better utilize X? Was there no time to do performance testing, so we settled on the low value for X? Did our predecessor not get the training necessary to understand the impact of X? These are all questions that need to be answered to truly do root cause analysis. And the truth of the matter is, most organizations don’t have the stomach for it.

Companies have a hard time looking themselves in the mirror and assessing themselves in an honest light. How many projects weren’t given enough time to be done right? How many projects have to skip a vital phase of the testing process due to time constraints? These are the problems you run into throughout your career, across countless companies, leading one to believe it’s less about the company and more about the human condition. But regardless of the source of these problems, they are all things that contribute to the cause of failure in our systems. Organizations that have operational excellence are the orgs who aren’t afraid to look at themselves honestly and follow the root cause of failure, no matter where it takes them.

Next time you participate in an RCA, take it to the next level. Don’t stop at the 5th “why?” Go to the 100th, or the 1000th, or however long it takes to show organizationally where change needs to happen. Don’t absolve yourself of all responsibility, but make sure everyone knows that the failure is not yours alone, but the organization’s as a whole.

Man of Steel - Not My Superman

People have taken turns beating up on the Zack Snyder film Man of Steel. I’m coming in on this quite late, but now that I’ve got children, I find myself reliving parts of my life and then reflecting on them.

My daughter is a huge fan of Superman. (Full disclosure, she’s only 2 years old) As I think about how to further expose her to one of our greatest superheroes, I obviously took to the body of film work on the character. At no point did Man of Steel cross my mind as a film to view. Figuring out why I didn’t want to show it to my daughter helped me understand what I didn’t like about the film.

Superman is one of those characters that embodies an idea: the honorable boy scout, the powerful guardian and, most importantly, the uncompromising moralist, to name a few. These traits combined are what give us the ability to have light moments with the character. Watching Superman walk through a hail of gatling gun fire without so much as a scratch on his suit is awesome! It fills you with joy as the bad guys get theirs, and it makes you want to cheer at the top of your lungs for Superman!

I missed those moments in this new incarnation of the character. I missed the joy and cheer that came with the previous Superman films. But beyond the joy that was lost, it was the loss of some of the central tenets of the character that really made it difficult for me. Nothing illustrates this more than the killing of Zod.

To understand why killing Zod is such a major problem for me, you have to understand my feelings on Superman. He’s not just any hero. He is an all-powerful, unstoppable hero. His weaknesses are Kryptonite and the loved ones around him. That’s it. When someone this powerful is flying through your skies, it’s difficult to trust them. It’s even more difficult to believe they have your best interests at heart.

But that’s the beauty of Superman. His beliefs are uncompromising. And if you agree with his belief system, it gives us, the protectorate, the trust necessary to bestow upon him the role of protector. When Superman kills Zod, not only does he betray his belief system, but he also destroys the trust that’s been built with the people. If he’ll kill Zod, what’s to stop him from deeming others a risk worthy of killing? (I know this gets muddied with the whole Doomsday thing, but I submit that Doomsday was a mindless, non-sentient killing machine. The equivalent of a robot. You may have dissenting views)

With the murder of Zod, Superman is no longer this symbol of hope, altruism and unwavering morality. He’s now this guy that protects us as long as we don’t step out of bounds. As long as we don’t cross the line that he’s set, we’re OK. But if our views differ, we may become dangerous enough to be killed, which limits his ability to be the completely trustworthy character that, in my view, Superman should be.

So that’s my 2 cents on the film. I plan to re-watch it at some point. I’ve also taken some of my friends’ advice and started watching the old Superman: The Animated Series cartoon, which is also on Amazon Prime if you’re a member. The show better embodies the way I want my daughter to think of Superman for the time being. When she’s older, I’ll let her make her own choices. :-)

Feature Toggles in Puppet

When we’re performing Puppet changes, we try to work within the framework of some sort of SDLC. We’re in the process of migrating to Stash from SVN, so our processes are a bit in flux. But one problem we run into constantly is how to balance long running feature branches and separation of Puppet code that is still in development or testing.

The systems team works very closely with the developers, sometimes with our changes being dependent on one another. An example is when we move a file from being hosted on our web servers to being served by S3. It requires a coordinated change of the Apache config to handle the redirects and of the development code that automatically handles the population of S3. While this is being tested, we need to have two different copies of the Apache configuration for the site.

A) - Copy in production where files are located on the web server itself.

B) - Copy in test that handles a redirect to the S3 location.

The testing that takes place may take a day or it may take weeks. The longer testing takes, the more drift there is between my Puppet development branch and the master branch which is getting pushed to production regularly. I could just be a rebase monster while this is being tested, but sooner or later, I’ll fail in my responsibility and I’ll have some awful merge waiting to happen. I needed a better way and the best thing I could come up with was some form of Feature Toggle.

With a feature toggle, I have the ability to release some code, without all nodes receiving that code path. More specifically for my use case, I can commit code to master, without fear of it actually being executed. This is often leveraged in Continuous Integration environments to prevent incomplete code from impacting production.

With Puppet I decided to implement something very similar using if blocks and Puppet Enterprise console variables. When I’m developing something, I put my resource declarations in a block like so:

if $sitemap_redirect_feature == 'enabled' {
  # Puppet resource declarations for the new feature
} else {
  # Default activities
}
Then in the Puppet Enterprise console, I’ll assign the variable sitemap_redirect_feature the value enabled. If you’re not a Puppet Enterprise customer or aren’t using the console, you could also specify it via a hiera lookup with a default value.

hiera('sitemap_redirect_feature', 'disabled')

This makes it easier to assign to groups of servers based on your hiera configuration.

Because of the way Puppet variables are evaluated, any node that doesn’t explicitly set the variable will follow the else path.
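Putting the pieces together, a minimal sketch of the whole pattern might look like the following. (This assumes the puppetlabs-apache module for the vhost resource; the class, site and bucket names are purely illustrative.)

```puppet
class profile::webserver {
  # Fall back to 'disabled' on any node where the toggle isn't set.
  $sitemap_redirect_feature = hiera('sitemap_redirect_feature', 'disabled')

  if $sitemap_redirect_feature == 'enabled' {
    # Feature under test: redirect sitemap requests to S3.
    apache::vhost { 'www.example.com':
      docroot         => '/var/www/html',
      redirect_source => '/sitemap.xml',
      redirect_dest   => 'https://s3.amazonaws.com/example-bucket/sitemap.xml',
    }
  } else {
    # Current production behavior: serve the file from the web server.
    apache::vhost { 'www.example.com':
      docroot => '/var/www/html',
    }
  }
}
```

Once the feature graduates to production, the else branch and the toggle lookup get deleted and only the new vhost definition remains.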

The plus side to this is while you’re figuring out exactly how resources should be laid out, you can still commit to master without fear of breaking anything. (Just make sure you do all your static analysis so that your Puppet code is at least valid)

Once your testing is complete and you’re ready to push the changes to production, you simply delete the if/else wrapper so that your updated resource declarations are always executed, and push your code.

I’ve been using this pattern for a few weeks now and so far it is working out pretty well. I may refine the approach as I run into new hurdles.

Strategy vs Solution

In the technology arena, things are constantly changing and new technologies are being spun out at a rapid rate. The problem is that as technologists, we’re eager to try out the new hotness, with Docker being the new darling child. Just ask Google about the hype cycle behind Docker.

I’m not going to debate the anointed position of Docker. It is a very cool and incredibly useful technology. But what I do take issue with is using Docker for the sake of using Docker, without any real examination of the problems that are trying to be solved. Docker gets trotted out as a strategy, rather than taking its rightful place as a solution for a strategy.

Containerization of your application may or may not be a straightforward exercise. You could spend weeks getting things tuned and set up so that you can now deploy your application via Docker. You’re living the dream of developing on your desktop and having that same container move all the way through your pipeline into production. But if your build still takes 90 minutes, is it worth the effort? Have you actually solved your pain point?

I’m not dismissing the other intangibles that Docker offers, but I’m a big fan of the Theory of Constraints. Optimizing for anything other than the bottleneck is just a waste.

It sounds like I’m picking on Docker, but it’s just an easy example because of its current popularity. But I’ll give an example closer to home.

I’m working on a Fantasy Football site in my spare time. One of my strategies is to collect information from all of the various sites that provide fantasy data projections.

Notice how my strategy is devoid of any specific technology or implementation. That’s how a strategy should be defined. In clear terms that don’t hint towards a specific solution or direction.

Well, I lost sight of that and immediately jumped to the solution. I wrote a series of scrapers to go out to various websites and pull down the information, without any thought to my actual strategy. I jumped to the solution because it’s an easy thing to do as an engineer.

Fast forward a few weeks and I’m spending more time fixing the scrapers and coding defensively against changes to the source website, instead of continuing development of my application. But if I think about my strategy I could probably come up with a few quick solutions.

  • Mechanical Turk - I could pay someone probably less than $10 to manually enter the data into a CSV document. Writing a CSV importer is a lot simpler than an HTML scraper.
  • Fantasy Data - While a bit pricier, I could also pay for an API endpoint to provide me with a bunch of data. ESPN, CBS, and Yahoo all have similar services available at varying prices.

Between the 3 options that I briefly described (Mechanical Turk, Fantasy Data and a custom scraper), the Mechanical Turk option makes the most sense for me. It’s inexpensive, delivers the value I’m looking for and requires the lowest amount of effort on my side, allowing me to focus on my core product.

The moral of the story is, remember to evaluate why you want to implement a technology. The strategy should be separate from the solution so that you can make sure you’re addressing your pain points.

My New Understanding of the MVC Pattern

I’m relatively new to the Rails community. I come from the Python/Django world, but I’ve been enjoying the transition, except for one minor part: models.

When I dig around looking for info on how to structure my code, I keep running into Best Practices that advocate for a skinny controller/fat model pattern. The idea being that the model contains most of the program logic. I feel like an ass-hat because I’m the new guy but this sounds crazy to me, and others definitely agree. Why limit ourselves to three class types?

I’ve started to move some of my logic into separate classes that are not connected to a model or a controller. They’re utility classes that deal with external sources of data that don’t need to be persisted and definitely don’t fit the role of the controller. In fact, their primary purpose is fetching data from other sources, to be consumed elsewhere. With that use case in mind, I was a bit surprised when I mentioned this to a few programming buddies and it seemed like they hadn’t thought of it. While we couldn’t come up with a compelling reason why this was wrong, I was a little perturbed that it wasn’t something regularly done. So now I have a generically named folder ‘classes’ to house some of these items.

I’ve been doing some research on MVC purely as a design pattern and I realize that I’ve been making one fatal mistake that’s limiting my usage of the pattern.

Model != Persistence.

The problem I often run into is that my models are shaped based on how I store them in the database. But sometimes how I store an object isn’t necessarily how I want to interact with the object. I end up traversing a bunch of relationships via the ORM. And if my actual storage strategy changes, I suddenly have to update code everywhere that doesn’t necessarily care about how the data is saved. But in reading more about MVC, I’ve learned my model doesn’t have to mirror my storage, as long as the model knows how to persist the object.

I’ll be playing around with inserting an additional layer of abstraction for my models to allow me to interact with the object in its logical form, as opposed to its actual form in the database.
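One way to sketch that abstraction in Ruby (the `Projection` and `ProjectionRecord` names are hypothetical, and the persisted “row” is a plain Struct standing in for an ActiveRecord object):

```ruby
# Hypothetical stand-in for the persisted row shape (e.g. an ActiveRecord row).
ProjectionRecord = Struct.new(:player_name, :points, :source)

# The logical model the rest of the application interacts with.
class Projection
  attr_reader :player_name, :points

  def initialize(player_name:, points:)
    @player_name = player_name
    @points = points
  end

  # The model knows how to persist itself, but callers never see the row shape.
  def to_record(source)
    ProjectionRecord.new(player_name, points, source)
  end

  # Rebuild the logical model from however it happens to be stored.
  def self.from_record(record)
    new(player_name: record.player_name, points: record.points)
  end
end

projection = Projection.new(player_name: "A. Rodgers", points: 21.5)
record = projection.to_record("espn")
puts record.source                          # => espn
puts Projection.from_record(record).points  # => 21.5
```

If the storage strategy changes later, only `to_record`/`from_record` have to change; everything that consumes `Projection` stays untouched.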

We’ll see how it goes.

Why Everyone Should Attend a Conference

This has been a week of conference bliss for me. I attended Puppet Camp Chicago earlier in the week and spent the rest of the week at Linux Con. I’ve never been a big conference attendee in the professional aspect of my life, so it was a bit of a first. I have to tell you, it’s an awesome experience.

My experience has left me with a single question: why are managers not pushing harder for employees to attend conferences? I’m paying for Linux Con out of my own pocket, but conference attendance is something bosses should embrace. It may seem like a scheme for employees to get a week off with paid expenses, but I assure you, it’s more than that.

The energy at a convention is like nothing you’ve experienced before. The space is filled with upbeat professionals that are tackling problems both incredibly similar to and radically different from your own. The conference talks usually run the gamut in terms of experience levels. As an attendee you’d be hard-pressed not to find something you’re interested in. Here’s my lineup for Day 1 of the conference. This doesn’t include all of the talks I had to skip because of timing conflicts.

  • Linux Performance Tools - There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? This talk summarizes the three types of performance tools: observability, benchmarking, and tuning, providing a tour of what exists and why they exist. Advanced tools including those based on tracepoints, kprobes, and uprobes are also included: perf_events, ktap, SystemTap, LTTng, and sysdig. You’ll gain a good understanding of the performance tools landscape, knowing what to reach for to get the most out of your systems.

  • Tuning Linux for Your Database - Many operations folk know the many Linux filesystems like EXT4 or XFS, they know of the schedulers available, they see the OOM killer coming and more. However, appropriate configuration is necessary when you’re running your databases at scale. Learn best practices for Linux performance tuning for MySQL, PostgreSQL, MongoDB, Cassandra and HBase. Topics that will be covered include: filesystems, swap and memory management, I/O scheduler settings, using the tools available (like iostat/vmstat/etc), practical kernel configuration, profiling your database, and using RAID and LVM.

  • Solving the Package Problem - In the beginning there was RPM (and Debian packages) and it was good. Certainly, Linux packaging has solved many problems and pain points for system admins and developers over the years – but as software development and deployment have evolved, new pain points have cropped up that have not been solved by traditional packaging. In this talk, Joe Brockmeier will run through some of the problems that admins and developers have run into, and some of the solutions that organizations should be looking at to solve their issues with developing and deploying software. This includes Software Collections, Docker containers, OStree and rpm-ostree, Platform-as-a-Service, and more.

  • From MySQL Instance to Big Data - MySQL is the most popular database on the web, but how do you grow from one instance on a single LAMP box to meet the needs of high availability, big data, and/or ‘drinking from the fire hose’ without losing your sanity? This presentation covers best practices such as DRBD, read/write splitting, clustering, the new Fabric tool, and feeding Hadoop. 80% of Hadoop sites are fed from MySQL instances and it can be frustrating without guidance. MySQL’s Fabric will manage sharding and provide more flexibility for your data. And using the memcached protocol to access data as a key/value pair can be up to 9 times faster than SQL (but

All of these talks are items that can help my career and my employer today. It has given me a level of enthusiasm that I haven’t had in quite some time. Now imagine if you could give that level of education, motivation and enthusiasm to every member of your team.

My conference buddy and I have already identified several technologies we want to look at implementing, as well as developed contacts with people who are already using them. We’ve met with some great people at Puppet Labs, like Lindsey Smith, the Puppet Enterprise product owner, who listened to our real world problems and pain points. He also got us set up with the Puppet Labs Test Pilot Program so that we can be involved in the direction of Puppet Enterprise.

We grabbed a few beers with Morgan Tocker, the MySQL Community Manager at Oracle. We shared stories, talked about some of our struggles with MySQL and just generally had a good time, and got a ton of insight into potential pain points in the future as well as features to leverage in upcoming releases.

When we get back to the office on Monday, we’ve got a ton of things to discuss, evaluate, re-evaluate and expand upon. That’s the power of conferences, and if you’re a manager, it’s why you should consider the next request for conference funds a little more carefully.

My Puppet Development Environment

I was in attendance at Puppet Camp Chicago today and had some really awesome conversations with people. It’s always worthwhile to hear how people are approaching similar problems to yours. It was also nice to get a chance to meet some of the developers of my favorite Puppet modules, but I digress.

One of the conversations that came up was what our local development process looked like for Puppet. Many people are attempting to find the right mixture of process and tools to help develop their infrastructure. With this in mind, I figured it might be worthwhile to share my developer setup. YMMV.

VIM - VIM is my editor of choice. Of course saying you use VIM is like saying “I have a car”. Nobody just uses VIM these days. There’s always some plugins that get mixed in there, my setup is no different.

  • vim-ruby - VIM Ruby is a nice plugin for all types of fun, helpful bits. Check it out.

  • NERDTree - A great plugin that adds some file browsing capabilities to VIM. Well worth it to avoid buffer hell.

  • Powerline - A great add-on for VIM, zsh, and bash that adds an awesome status bar to your VIM interface. The git status in the toolbar is extra helpful.

  • tmux - Tmux isn’t really a VIM plugin, but it is essential to my workflow. Being able to create multiple windows, split panes and easily navigate amongst them with keyboard shortcuts is invaluable.

  • Custom VIM Functions - I have one main custom function that I use heavily for linting. The function determines whether the file is a Puppet (*.pp), JSON (*.json) or ERB (*.erb) file and runs the appropriate linter. Below is a copy of it.

function! LintFile()
    let l:currentfile = expand('%:p')
    if &ft == 'puppet'
        let l:command = "puppet-lint " . l:currentfile
    elseif &ft == 'eruby.html'
        let l:command = "erb -P -x -T '-' " . l:currentfile . " | ruby -c"
    elseif &ft == 'json'
        let l:command = 'jsonlint -q ' . l:currentfile
    endif
    silent !clear
    execute "!" . l:command
endfunction

" Bind the function to a key of your choice, e.g.:
map <Leader>l :call LintFile()<CR>

Virtual Machine Setup

My local development environment consists of two virtual machines, a Puppet master and a Puppet client. I'm using [Virtualbox](https://www.virtualbox.org) for virtualization, but really any VM tool should be fine.

The nice thing about having a virtual Puppet client on your desktop is that you can snapshot it to get your VM back to an initial state. So before you do any development on the Puppet client, make sure you [take a snapshot](http://www.virtualbox.org/manual/ch01.html#idp55591632) so that you can get back to a clean starting point.

On the Puppet master VM you'll want to [create a shared folder](https://www.virtualbox.org/manual/ch04.html#sharedfolders) in Virtualbox or your VM manager of choice. Point the shared folder to whatever folder holds your Puppet manifests on the local machine. Now [mount the shared folder](https://www.virtualbox.org/manual/ch04.html#sf_mount_manual) in your VM so that it's accessible within the virtual machine. You should now have access to your local Puppet manifests from within the virtual machine.

Last but not least, modify the [modulepath](https://docs.puppetlabs.com/references/latest/configuration.html#modulepath) on the virtual Puppet master and add the shared folder path to it. By adding the shared folder to your modulepath, you can develop your Puppet manifests on your local machine, with all your tools, without the need to develop inside the VM or to sync files from your local machine to your VM.

Remote Puppet Development

Occasionally you might hit a use case that isn't testable on a local machine and you need to test it on a Puppet master in your pre-prod environment. (You do have a pre-prod environment, right?) When this situation comes up, it's nice to have [Puppet Environments](http://puppetlabs.com/blog/git-workflow-and-puppet-environments) set up. Most people use them in a dynamic fashion, but you can definitely use them statically. (And with SVN) After you've created the environments, it's just a matter of getting your files to the right path on the remote server. Rsync is a great tool for this, as it allows you to get your files to the remote server for testing without the need to actually commit code that you're not sure will work yet. (Which in some environments might trigger a long, time consuming series of automated checks and builds)
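As a sketch, pushing a working copy of a module into a static 'dev' environment on a pre-prod master could be as simple as the following (the host name and environment path are hypothetical; adjust them to your master's layout):

```shell
# Mirror the local working copy of the tomcat module to the pre-prod master's
# static 'dev' environment, without committing anything first.
rsync -avz --delete \
  modules/tomcat/ \
  puppetmaster.example.com:/etc/puppet/environments/dev/modules/tomcat/
```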

That's pretty much it for my development environment. I should also mention that if you're working on a Mac, it might be worth checking out [Dash](http://kapeli.com/dash), which is an awesome developer documentation tool. It basically sucks down the Docsets of various programming languages and tools. (Puppet being one of them)

At some point I'll probably write a follow up post to detail our actual development and deployment workflow. Hope this helps some poor soul out there on the web.

Organizing Puppet Code

I feel like every team I talk to, at some point decides they need to blow up their Puppet code base and apply all of the things they’ve learned to their new awesome codebase. Well, we’re at that point in my shop and there’s a small debate going on about how to organize our Puppet modules.

This is really not meant to be a mind-blowing blog post, but more of a catalog of thoughts for me as I make my argument for separate repositories for each Puppet module. A few background items.

  • We’ll be using the Roles/Profiles pattern. What I’m calling “modules” are the generic implementations of technologies. These are the modules I’m suggesting go into separate repositories. I’m OK with profiles and roles co-existing in a single repository.
  • We’re coming from a semi-structured world where all modules lived in a single SVN repository. Our current deployment method for Puppet code is an svn up on the Puppet Master.
  • We’ll be migrating to Git (specifically Stash)
  • We’ll have multiple contributors to the code base in 2 different geographic locations. (Chicago and New York for now) The 2 groups are new to each other and haven’t been working together long.

I think that’s all the housekeeping bits. My reasons for keeping separate Git repositories per module are not at all revolutionary. It’s some of the same arguments people have been writing about on the web for a while now.

Separate Version History

As development on modules moves forward, the commits for these items will be interspersed between commit messages like “Updating AJP port for Tomcat in Hiera”. I know tools like Fisheye (which we use) can help eliminate some of the drudgery of flipping through commit messages, but you know what else would help? Having a separate repo where I can just look at the revision history for the module.

Easier Module Versioning

With separate repositories, we can leverage version numbers for each release of the module. This allows us to freeze particular profiles that leverage those modules at a specific version number until they can be addressed and updated. With two disparate teams, this allows one team to continue forward with potentially disruptive development, while other profiles have time to catch up to whatever breaking changes are introduced.

Access to Tools

Tools like Librarian-Puppet and R10K are built around the assumption that you are keeping your Puppet modules in separate repositories. I haven’t done a deep dive on the tools yet, but from what I can tell, using them with a single monolithic repository is probably going to be a bit of a hurdle.

Easier Merging/Rebasing and Testing

The Puppet code base is primarily supported by the Systems team. The world of VCS is still relatively new to the Systems discipline. As we get more comfortable with these tools, we tend to make some of the same mistakes developers make in their early years. The thing that comes to mind is commit size and waiting too long to merge upstream. (Or REBASE if that’s your thing) Keeping the modules in separate repositories tightens the problem space you’re coding for. If you need to create a new parameter for a Tomcat module, a systems guy will probably:

  1. Create the feature branch
  2. Modify the Tomcat module to accept parameters
  3. Modify the profiles to use the parameters
  4. Test the new profiles
  5. Make more tweaks to the tomcat module
  6. Make more tweaks to the profiles
  7. Test
  8. Commit
  9. Merge

With separate modules, the problem space shrinks to “Allow Tomcat to accept SHUTDOWN PORT as a parameter”. We’ve removed the profile from the problem scope for now and have just focused on Tomcat accepting the parameter. Which also means you need a new, independent way to test this functionality, which means tests local to the module. (rspec-puppet anyone?) This doesn’t even include the potential for merge hell that could occur when this finally does get committed and pushed to master. Now I’m not naive. I know that this could theoretically happen even with separate modules, but I’m hoping the extra work involved would serve as a deterrent. Not to mention that gut feeling you get when you’re approaching a problem the wrong way.
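As a sketch, a module-local test with rspec-puppet might look something like this (the module layout and file path are hypothetical, and the usual rspec-puppet `spec_helper` scaffolding is assumed):

```ruby
# spec/classes/tomcat_spec.rb -- hypothetical rspec-puppet sketch
require 'spec_helper'

describe 'tomcat' do
  # Exercise the new parameter in isolation from any profile.
  let(:params) { { :shutdown_port => '8006' } }

  it 'renders the shutdown port into server.xml' do
    should contain_file('/etc/tomcat/conf/server.xml').with_content(/port="8006"/)
  end
end
```

The point is that the test lives with the module and runs without any profile or role in sight.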

In favor of a Single Repository

I don’t want to discount that there could be some value in managing all your code as a single repository. Here are the arguments I’ve heard so far.

It complicates deployments of Puppet Code

True that. Nothing is easier than having to execute an svn up command…except for running a deploy_puppet command. Sure you’d have to spend cycles writing a deployment script of some sort, but if that’s a valid reason then we’re just being lazy. I might be terribly optimistic, but it doesn’t seem like a hard problem to solve.

In addition, I’ve always preferred the idea of delivering Puppet code (or any code for that matter) as some sort of artifact. Maybe we have a build process that delivers an RPM package that is your Puppet code. A simple rpm -iv or yum update and we’ve got the magic.

It complicates module development

Sometimes when people are developing modules, their modules depend on other modules. I strongly dislike this approach, but it is a reality. You would now have to check out two separate modules and all of their dependencies in order to develop effectively.

Truth is, this sounds like bad workflow. A single module shouldn’t reach into other modules. In the rare event that it does (say your module leverages the concat module) then these dependencies shouldn’t just be inferred by an include statement. They should be managed by some tool like librarian-puppet, because in reality, that’s all it is. Every other language has an approach to solving dependency management (pip, gradle, bundler and now librarian)
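With librarian-puppet, for example, those dependencies become explicit in a Puppetfile rather than being inferred from include statements (the module names, versions and Git URL below are illustrative):

```ruby
# Puppetfile for librarian-puppet -- declares module dependencies explicitly.
forge "https://forgeapi.puppetlabs.com"

# A Forge module, pinned to a known-good version.
mod "puppetlabs/concat", "1.1.0"

# An internal module living in its own Git repository.
mod "tomcat",
  :git => "ssh://git@stash.example.com/puppet/tomcat.git",
  :ref => "v0.2.0"
```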

What’s Next?

With my thought process laid out with pros and cons, I still feel pretty strongly about separate repositories for each module. Another solution might be to create some scripts that manage the modules in a way that fools the users who care into thinking that they’re dealing with a single repository. But this tends to defeat some of the subliminal messaging I hope to gain from separate modules. (Even though that’s probably a pipe dream)

I’ll be sure to post back what becomes of all this and if the team has any other objections.