People have taken turns beating up on the Zack Snyder film Man of Steel. I’m coming in on this quite late, but now that I’ve got children, I find myself reliving parts of my life and then reflecting on them.

My daughter is a huge fan of Superman. (Full disclosure: she’s only 2 years old.) As I think about how to further expose her to one of our greatest superheroes, I naturally turned to the body of film work on the character. At no point did Man of Steel cross my mind as a film to show her. Figuring out why I didn’t want to show it to my daughter helped me understand what I didn’t like about the film.

Superman is one of those characters that embodies an ideal: the honorable boy scout, the powerful guardian and, most importantly, the uncompromising moralist, to name a few. These traits combined are what give us the ability to have light moments with the character. Watching Superman walk through a hail of automatic gatling gun fire without so much as a scratch on his suit is awesome! It fills you with joy as the bad guys get theirs, and it makes you want to cheer at the top of your lungs for Superman!

I missed those moments in this new incarnation of the character. I missed the joy and cheer that came with the previous Superman films. But beyond the joy that was lost, it was the loss of some of the central tenets of the character that really made it difficult for me. Nothing illustrates this more than the killing of Zod.

To understand why killing Zod is such a major problem for me, you have to understand my feelings on Superman. He’s not just any hero. He is an all-powerful, unstoppable hero. His weaknesses are Kryptonite and the loved ones around him. That’s it. When someone this powerful is flying through your skies, it’s difficult to trust them. It’s even more difficult to believe they have your best interests at heart.

But that’s the beauty of Superman. His beliefs are uncompromising. And if you agree with his belief system, it gives us, the people he protects, the trust necessary to bestow upon him the role of protector. When Superman kills Zod, not only does he betray his belief system, but he also destroys the trust that’s been built with the people. If he’ll kill Zod, what’s to stop him from deeming others a risk worthy of killing? (I know this gets muddied with the whole Doomsday thing, but I submit that Doomsday was a mindless, non-sentient killing machine. The equivalent of a robot. You may have dissenting views.)

With the murder of Zod, Superman is no longer a symbol of hope, altruism and unwavering morality. He’s now this guy that protects us as long as we don’t step out of bounds. As long as we don’t cross the line that he’s set, we’re OK. But if our views differ, we may become dangerous enough to be killed. That limits his ability to be the completely trustworthy character that, in my view, Superman should be.

So that’s my 2 cents on the film. I plan to re-watch it. I’ve also taken some of my friends’ advice and started watching the old Superman: The Animated Series cartoon, which is also on Amazon Prime if you’re a member. That show better embodies the way I want my daughter to think of Superman for the time being. When she’s older, I’ll let her make her own choices. :-)

Logs vs Metrics

I constantly struggle with the idea that a log entry and a metric are not the same thing. Both provide value, but they’re telling different stories. The problem is that the distinction is a lot like pornography: I can’t define it precisely, but I know it when I see it.

In my shop, our applications emit a lot of data. Now some of you are reading and thinking that this is a first world problem in the technology space; I get that. But the problem is when I try to use the right tool to evaluate that data. Should this go into a log file to be aggregated by a tool like ELK or Splunk? Or should this just be a tick that fires and is sent to a metrics collector like Graphite? How do I articulate that choice to a developer?

Let’s look at Twitter for example. Let’s say they want to track tweets per second. For our purposes the options are

  • Log to a file when a tweet happens. That log is shipped to an aggregator. Then a log search tool evaluates the log messages (because there could be other log messages there), does some math and then displays the result.
  • The application uses some transactional annotation around the “new tweet” functionality. That sends an “event” type metric with some additional metadata and fires it off to a metric service.

Both are conceptually the same thing, but the log file approach seems dirty to me. Other than the benefit of some abstractions, the core tenets of both approaches are the same. My discomfort must lie in the intent of these two things.

A log entry should give you information or detail about a specific event, whereas a metric should just let you know that an event occurred.

Examples of good log messages:

  • Tweet number 238472 was submitted successfully by user @darkandnerdy at 11:05am
  • User DarkandNerdy has failed a login attempt. The user has 4 consecutive login failures

Examples of metrics:

  • 11:05 - tweet submission successful
  • 11:15 - Failed login attempt

The specificity that a log message allows is what makes it valuable. But more importantly, as an Operations person, I need that specificity to be tunable. I only need that specificity when a problem arises. I shouldn’t be forced to have logging set to a low level (like INFO) just to have an idea of how the system is performing. In normal operation, I want my log file to contain anomalies, not stacks of information telling me that everything is OK. (And if that is the case, maybe those logs need some separation from general log messages)

Through the process of writing this blog post I’ve become much clearer on how I feel about these two things. My “thesis statement” if you will, would be this:

Log messages are notifications about events as they pertain to a specific transaction within the application. Metrics are notifications that an event occurred, without any ties to a transaction.

OK, so what’s the difference? Well, again putting on my Operations hat, metrics can be dramatically smaller because they convey considerably less information. They’re also far easier to evaluate. Both of these points have an impact on how we store, process and retain metrics.

A log file, however, gives you details on a transaction, which may allow you to tell a more complete story for a given event. The transactional nature of log messages in aggregate gives you much more flexibility in terms of surfacing information (not just data) about the business.
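To make the distinction concrete, here’s a rough sketch of the two side by side. The logger setup and the statsd-style counter below are illustrative assumptions, not a prescription for any particular stack; the point is that the log line carries the details of a specific transaction, while the metric is nothing more than a tick saying the event happened.

import logging
import socket

logger = logging.getLogger("tweets")

def send_metric(name, value=1, host="127.0.0.1", port=8125):
    # Fire-and-forget counter in the statsd wire format ("name:value|c"),
    # the kind of thing a Graphite-backed collector would ingest.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(f"{name}:{value}|c".encode(), (host, port))

def submit_tweet(user, tweet_id):
    # ... the actual "new tweet" work happens here ...

    # Log entry: information about this specific transaction
    logger.info("Tweet number %s was submitted successfully by user @%s", tweet_id, user)

    # Metric: only the fact that a tweet submission occurred
    send_metric("tweets.submitted")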

So now that I think I’ve cracked my first world problem, I’d pass on a bit of advice. Any data is better than no data. If something is preventing you from having a clean separation between the two, just HAVING the data makes a huge difference. Once you have it, you can argue over the semantics. :-)

The High Cost of Low Friction Infrastructure

The impact that cloud computing is having on technology organizations is undeniable. It’s allowing companies to move at an unprecedented speed, churning out infrastructure as effortlessly as pushing a button in a Jenkins job. Gone are the days of counting infrastructure delivery time in quarters. Now the timescale is in minutes. While the death of endless change gates for hardware has been applauded by most of us in the industry, our zeal for the speed gained has left us unaware of what’s been lost. There are no perfect systems. Everything comes with a cost.

##Systems Aren’t Always Code##

When we talk about systems design in technology, we’re often discussing some computer-to-computer or computer-to-human interaction. But in between these sophisticated transactions lies the original system: two humans directly interacting with each other. (Albeit that interaction is increasingly becoming digital as well)

These systems exist as approval processes, peer review and the often vilified meeting. These human touch points, however, serve as signals within the human system. If you make a purchase order request for new hardware, that signals to the operations group that something within the organization has changed. Are we expecting an increase in demand? Has a new project started? Questions begin to flow, the human network does its thing, and people begin to interpret the signal and take action. More than a handful of organizations have processes that are initiated as a result of these signals, to ensure broad involvement, risk assessment and impact analysis.

In the world of the cloud though, many organizations haven’t thought through their human systems and where these gated signals will now come from. An engineer can spin up a system, make a DNS entry, share a link and have a “production” system up and running before lunch. A self-contained team has all the tools necessary to go to production. It’s a blessing if you’ve accounted for it and a curse if you haven’t.

Conway’s Law has often been used as a cautionary tale. What we fail to realize, however, is how often a company uses Conway’s Law as a control mechanism for certain actions. By creating a barrier between systems, development and design, you’ve forced an interaction that improves communication. The quality of that communication is debatable, but the fact that it exists at all is a win in this context. I’m not arguing whether that limiter is good or bad, but your organization has to understand where and to what degree it depends on that limiter.

How does the security organization become aware of new instances? Who is responsible for patching those instances? Monitoring them? These questions are typically birthed at the point of that forced interaction. Could the team deploying the system ask and answer these questions internally? Absolutely. And often they do. But that is extremely company-dependent and comes with its own set of challenges that are outside the scope of this post.

The issues I have outlined all have workable solutions, but you can’t solve a problem unless you’re aware it exists. Be honest about your gates and checkpoints. You wouldn’t have added them unless they provided some value. Maybe that value has been over- or underestimated, but the move to the cloud is a great time to re-assess them.

##Do More With More##

Something I’ve always found ironically hilarious in the virtual server space is why it came about and where we’re at now. Tech organizations saw all this underutilized hardware and thought virtual machines would be able to help address this and save cost. Now the industry has loads of under-utilized virtual machines on over-subscribed hardware. If you find yourself in this situation, the move to the cloud may not offer much relief.

During the days of gated hardware acquisition, there was a certain pain that went with ordering a server. Forms, inquiries from management, meetings and just a bit of your soul were all common prices to pay as someone rolling out new hardware. As a result, there was a sort of self-preservation-like preference for using hardware your team already had. With the cloud, that friction is removed, making it easier and more attractive to simply spin up a new server. There’s value to that: a tidy separation of concerns, isolation for maintenance and a performance profile that is in line with your application’s needs.

The rub comes in when this mode of thinking becomes institutionalized. Before long, it’s habit that every application gets its own set of servers. But does that make sense for every solution? If billing and marketing each have an internal-only application, does it make sense for each of them to have their own server? And then if they’ve got one, they have to have two, right? (Redundancy and all) What is the cost of a 30-person unit sitting on 4 servers? You may have hidden that cost in VMware, but in the cloud it comes front and center.

There’s that old economic principle, the law of demand. As the price of a good reaches 0, demand becomes infinite. While the price of cloud computing is nowhere near zero, organizations have effectively removed the purchaser of the good from the consequence of that good’s cost. We’ve trivialized the acquisition process to the point where need and want are synonymous. If you’re not careful, you can quickly fail to deliver on the promise of cheaper infrastructure using the cloud.

##Conclusion##

I bring up the topics above not as a deterrent to the cloud, but as a warning. Adopting the cloud isn’t just about speed of delivery, but about a shift in how your organization thinks about the technology life cycle. There are no free lunches. With speed comes some added risk. Is it worth the payoff? Every company has to decide that for itself. But a healthy dose of organizational pragmatism is good “one-size-fits-all” advice.

Troubleshooting Fallacies

An outage on a large distributed system can be a very difficult thing to troubleshoot. The pressures of an outage can often lead to making poor or shortcut decisions. When the system is down, the drive is to get the system back operational as quickly as possible. But sometimes we sacrifice process for the sake of speed.

In this frantic state, we fall victim to what I call Troubleshooting Fallacies.

#1 - Subsystems Fail All or Nothing

When we’re troubleshooting an issue we start to quickly assess the potential failure points in the system that may be causing the problem. Sometimes this may happen at lightning speed at a subconscious level. For example:

  • You know it’s not the network because you can ping the server
  • DNS isn’t the problem because the hostname is resolving
  • The application can connect to the database because you verified credentials and connectivity from the command line.

The list could continue, but you get the idea. The criteria we use for eliminating these subsystems are usually razor thin, ignoring the complexity within each of those subsystems. We behave as if the subsystem fails in totality. But it’s quite possible that the subsystem is failing in a more nuanced way. Maybe only specific DNS servers are failing. Or maybe they’re only failing from a specific node. Maybe the host can connect to the database, but the application is having trouble connecting, possibly due to a configuration issue.

When you begin to exhaust your options, be sure that you verify the subsystems involved as concretely as possible. Sometimes a high level pass of the subsystem simply isn’t enough.

#2 - No Errors = No Problems

When trouble starts and the resolution hasn’t been narrowed down, you might call in a team member from each area: storage, networking, application development, and site support. As each functional area signs in, most admins have already begun rationalizing why it isn’t their particular subsystem that’s at fault. This sort of system bias leads the admin to perform a less than ideal verification of their system.

Instead of verifying that things are working correctly, the admin simply confirms that no errors are being thrown. But this mode of verification assumes that the admin knows every type of failure mode of the system and how it manifests itself, an unlikely reality. This is what I call checking for errors instead of verifying success.

Checking for errors can be a bit superficial if you don’t believe your system is at fault. It exposes a kind of confirmation bias where you may dismiss anomalous, but not necessarily damning, information. Verifying success is a bit more thorough, but requires a level of preparation before the incident, as it will most likely require some automation. Verifying success may look like:

  • Scripted versions of API calls
  • Database connection tests from nodes in question
  • Synthetic transactions against a web interface
  • Response code verifications

Now you’ll notice that most of these sound like potential monitors that should be part of the system. That is true, but a lot of these monitors would probably be implemented in the system admin’s language of choice, versus the language the system is written in. (Think Python vs. Java) These success verification applications should try to mimic or mirror the languages and libraries used in the application to help reveal potential software incompatibilities.
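For a trivial illustration of what I mean by verifying success, here’s a sketch in Python using the requests library. The URL and the “ok” marker in the response body are assumptions you’d swap out for your own system:

import sys
import requests

def verify_success(url="https://example.com/api/health", expected_status=200):
    # Synthetic transaction: assert the response code AND that the payload looks sane,
    # rather than just grepping the logs for the absence of errors.
    try:
        resp = requests.get(url, timeout=5)
    except requests.RequestException as exc:
        print(f"FAIL: request did not complete: {exc}")
        return False

    if resp.status_code != expected_status:
        print(f"FAIL: expected {expected_status}, got {resp.status_code}")
        return False

    if "ok" not in resp.text.lower():
        print("FAIL: response code looked fine but the body did not")
        return False

    print("PASS")
    return True

if __name__ == "__main__":
    sys.exit(0 if verify_success() else 1)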

#3 - Finding the Root Cause is a Must

Root cause analysis is a tricky subject. In most of these overly complex systems, the idea of a root cause is a bedtime story we tell managers to help them sleep at night. The hard truth is that the more complicated our systems become, the more nuanced our failures are. Root cause is fast becoming a thing of the past. Failure looks more like various sub-systems entering a failed state, which in turn produces a system-level failure mode. Example:

  1. HTTP Requests come into a web server without a timeout value
  2. The HTTP requests result in database calls
  3. The database is saturated, so the queries take 2-3 seconds to respond.
  4. The service time of requests to the HTTP server is too long to handle the arrival rate of requests without queueing.
  5. The HTTP server runs out of available threads and makes the service completely unavailable.

Is the database the root cause? Or is it the fact that HTTP requests are allowed to execute without a timeout value? Or maybe the HTTP server doesn’t have a sufficient number of threads for peak traffic volume?

Any one of those could be a valid root cause, which means none of them are really the root cause. The lack of root cause does prevent a tidy answer for an incident report, but it does promote a more thorough understanding of your system and its various failure modes.

If your leadership insists on root cause analysis, I suggest you take that analysis as far back as possible. You may not like where it takes you.

#4 - It Should Be a Quick Fix

As problems arise, admins repeatedly underestimate the impact of the issue. The quick fix is always around the corner, which ultimately slows down the final resolution because of a sort of haphazard, gut-feeling approach to troubleshooting.

It’s easy to be undisciplined in your troubleshooting process during an outage. The pressure is on and you leap from one potential issue to another, a lot of times without any scientific evidence to back it up. It’s imperative however that we maintain control and process, even in the face of a management mob desperate for resolution.

This of course presupposes that you have a methodology. If you don’t have a process, I would highly suggest looking at the USE Method by Brendan Gregg. It’s a practical approach that allows you to break up a system into various components and then test those components for particular failures.

Gregg’s approach has a particular performance bent to it, but with a little bit of effort it can be adapted to suit analysis at any level of the application stack.
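To give a flavor of what that adaptation might look like, here’s a bare-bones sketch that walks a handful of resources and runs a utilization, saturation and errors check against each. The resource list and the commands are illustrative picks, not Gregg’s full checklist:

import subprocess

# USE method: for each resource, check Utilization, Saturation, and Errors.
# Swap in whatever commands make sense for your environment.
CHECKS = {
    "cpu":    {"utilization": "vmstat 1 2", "saturation": "uptime", "errors": "dmesg | tail -n 20"},
    "memory": {"utilization": "free -m", "saturation": "vmstat -s", "errors": "dmesg | tail -n 20"},
    "disk":   {"utilization": "iostat -x 1 2", "saturation": "iostat -x 1 2", "errors": "dmesg | tail -n 20"},
}

def run_use_checks():
    for resource, checks in CHECKS.items():
        for aspect, command in checks.items():
            print(f"=== {resource}: {aspect} ({command}) ===")
            subprocess.run(command, shell=True, check=False)

if __name__ == "__main__":
    run_use_checks()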

#Wrap Up

This list is far from exhaustive, but highlights some of the issues that teams fall victim to. Recognizing your mistakes is the first step to avoiding them. I hope to continue to expand on this list over time.

Importance != Priority

In today’s economy, workers in every industry are facing a common challenge: do more with less. The “less” part is a particularly interesting constraint, because it adds to the number of concurrent tasks per team member. You constantly hear about this situation and a person’s ability to “juggle multiple priorities”. But I personally reject this line of thinking on the basis that the statement itself is flawed, unproductive and ultimately impossible.

When we say a task is a priority, what we’re really saying is the task is important. You can have an unbounded number of important items on your list, but importance describes the task in its own space. Priority, however, describes a task in relation to other tasks and, as a result, you can truly only have a single priority at a time. That doesn’t reduce the importance of other tasks at hand, but it does clearly identify an item as the single most important thing on the docket. Priority is binary and global across the scope of work that you’re performing. Let’s use an example to help clarify.

You have a presentation that you’re giving to C-Level executives. It’s the type of presentation that careers are made of. You’ve been prepping all year for this one moment. Right before you’re about to go up, you get an email that says there is a huge problem with the financial system and your assistance is needed. Normally this item would cause you to drop everything and address it right away. But today, this board meeting is the priority. It doesn’t detract from the importance of the financial systems problem, but that item has to be delayed behind this board presentation. You’ve clearly identified the priority at the moment. You’re not going to check in on the progress of the financial problem during the meeting, are you? It gets tabled and/or delegated (and possibly made someone else’s priority) so that you can address the real priority, the board meeting.

Now sticking with that same example, you’re about to go on in front of the board when you get a phone call. Your significant other has been in a car accident. It’s pretty bad and they’re currently being air-lifted to a nearby medical facility. What’s the priority now? Are you going to deliver the board presentation from the car? From the hospital? I doubt it.

The example is extreme but it highlights a few things.

  1. Priorities are fluid
  2. Priorities are (usually) obvious

This means that sometimes important things slip. It’s the nature of the world. We like the idea of multiple priorities because it makes it easier to rationalize our choices. “I’m not saying work is more important than you, honey. You’re both my priority.” But that’s bullshit. You’re not at dinner or at the zoo with your kid; you’re in your office working.

The idea of a single priority forces us to stare the reality of our choices right in the eyes. You’ve chosen this over that. Own that choice and everything that it implies. Or re-evaluate your priority.

Refactoring Pet Peeves

Somewhere during my wild romps on the Internet I came across an interesting article that talked about the concept of “defactoring”. I think we’re all familiar with the idea of “refactoring”, but defactoring goes against many of the rules that have been ingrained in us as technologists. I think that’s why I love it so much.

Defactoring reduces the number of ways we can recombine the pieces of code we have…We’d make it less flexible.

In the technology sector we’re often reading about the latest and greatest technology, architecture/software pattern and how you’re basically a luddite if you don’t adopt these practices. Proper factoring of code is probably one of the first times I experienced this phenomenon. I rushed to make sure all of my functions were small and tight. I mastered the art of taking a 30 line shell script and exploding it into 120 lines of well organized bliss.

The problem that I think “defactoring” tries to address is that of complexity for the sake of complexity and cool kid points. While I love well organized code as much as the next guy, most of the reasons you would break code apart aren’t valid in a lot of the programs being written. (Especially on the Systems end of the house)

The biggest reasons to refactor code, in my experience, are:

  • Reusability of code
  • Testability of code
  • Isolating complexity

If refactoring your code doesn’t provide any of these benefits, then what’s the real purpose of doing it? It’s just another logical break that you have to jump to when reading the code. Does it add any value?

Before you move that for loop to its own function, ask yourself “What does this buy me?” If you can’t answer that question in 10 seconds or less, you probably don’t need to do it. Yes, people might judge your code, but if it’s not that, it’ll be something else. (Engineers are a persnickety bunch)
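Here’s a contrived example of the kind of extraction I’m talking about. It happens to be Python, but the language doesn’t matter; the second version adds no reuse, no isolated complexity and nothing new to test, just one more place to jump to while reading:

# Inline: the loop reads fine where it is.
def print_report(totals):
    for name, amount in totals.items():
        print(f"{name}: {amount:.2f}")

# "Refactored": extracting the body buys nothing here.
def print_line(name, amount):
    print(f"{name}: {amount:.2f}")

def print_report_refactored(totals):
    for name, amount in totals.items():
        print_line(name, amount)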

The world has enough complexity. Don’t add to it without good reason.

Airmail2 + Omnifocus

Full disclosure. I’ve had a few drinks and should probably wait till morning to write this. But waiting is for suckers. YOLO

I’m a big fan of Omnifocus and Airmail. Together they’re like peanut butter and jelly. But one thing that absolutely drives me insane is the way Airmail’s integration works with Omnifocus.

When I convert an email in Airmail to a task in Omnifocus, it creates a task with the task name being the subject. But the body of the message (the note) actually becomes a link to the Airmail message. That’s all fine and dandy, but I use Gmail as my mail provider. When I archive messages, Omnifocus seems to get confused about how to find the message based on the URL link. Rubbish. I want something simple and stupid. Enter AppleScript.

With AppleScript I was able to quickly write a tool that converts the body of the email message into a note in the Omnifocus task, which eliminates the need for me to keep the message around at all in my Inbox. Below is the script, but you can check out the Gist here.

tell application "Airmail 2"

    -- Grab the currently selected message and pull out its subject and HTML body
    set theMessage to selected message

    tell theMessage
        set theContent to htmlContent
        set theSubject to subject
    end tell

    tell application "OmniFocus"
        tell quick entry
            -- Convert the HTML body to plain text with textutil so it reads cleanly as a note
            set theRTFMessage to do shell script "echo " & quoted form of theContent & "|/usr/bin/textutil " & " -convert txt -stdin -stdout -format html"
            -- Create the inbox task: subject becomes the task name, converted body becomes the note
            make new inbox task with properties {name:theSubject, note:theRTFMessage}
            set note expanded of every tree to true
            open
        end tell
    end tell

end tell

The script is pretty vanilla, except for the line referencing /usr/bin/textutil. Textutil is an awesome little utility on OS X to convert text from various formats. It’s part of the Cocoa Framework so it should be available on all Macs running OS X. (Gotta get specific for people that still think Linux is a Desktop OS. OOOOH BUUURRRNNN)

Now you need to make the script useful.

  1. Open Script Editor on your Mac and copy pasta the script into it. Save it somewhere and make a note of the location.
  2. Launch Automator, and choose “Service” as the Document type.
  3. Open a Finder window, and drag your saved script onto the Automator build section.
  4. In the upper right hand section, change the “in any application” drop-down to Airmail 2. (You might have to click Other and browse for it)
  5. Save the Service via File -> Save

Now that you’ve created the service, you’ll want to create a shortcut for it in Airmail.

Launch System Preferences and go to Shortcuts. Go to App Shortcuts in the left hand bar. Click the “+” icon.

In the Application section, choose Airmail 2. In Menu Title, type the exact name of the service you created above. Choose a Keyboard Shortcut for the last field. I personally use CMD+SHIFT+, but YMMV. Choose what works for you.

Voila. Get better emails in your Omnifocus. Now I’m just waiting for everyone to tell me there was an easier way to do this. Because I CAN’T be the only one frustrated by this.

Can You Stomach Root Cause Analysis?

Lately I’ve become extremely interested in accident analysis techniques. These techniques are used largely in the manufacturing and transportation industries, but there has been a growing trend to adopt these types of practices in the technology arena. Think Kanban, Lean Startup, and the Theory of Constraints, to name a few.

But accident analysis and safety digs deep into the nature of failure within a system. Some of my favorite thinkers in the field like Sidney Dekker and Nancy Leveson have been forcing me to go beyond the surface of an issue and to dig deeper into the organizational issues that are equal contributors to failure.

Root Cause Analysis (RCA) is something that gets touted all the time in technology. When a system goes down, we’re desperately trying to find out what caused things to go bad. Despite our best efforts, we never seem to go far enough with RCAs.

One of my favorite mantras regarding root cause is that “root cause is simply where we stop looking.” We go far enough down the rabbit hole that we simply can’t explain further, don’t have the will to explain further, or we’ve reached a politically acceptable answer. (Leveson, Engineering a Safer World)

So why do we go through the theater of Root Cause Analysis in technology? Because we need to explain the unexplainable. Because if we can’t explain something, how can we possibly give assurances it won’t happen in the future?

The technology field has done a great job of pretending that everyone has their shit together. No one should ever have a failure that goes undetected. Anyone who isn’t alerted before a problem happens is an idiot. These are all worthwhile goals, but they are so far away from the reality of where we are in technology. But thanks to blogs, social media and the wisdom of hindsight, gaps in system and failure monitoring are largely seen as the result of unqualified staff. This belief is held by management, furthered by people in the industry who talk a good game, and then ultimately internalized by those with imposter syndrome. So we stress and we agonize over the root cause of a thing. And that’s not entirely a bad thing, but here’s the rub.

Let’s say we get to a point where we find that the root cause was setting X not being set to a reasonable or correct value. That’s it. We change setting X, explain how that change propagates up the chain of events, and show that everything would have been fine had X been set to a sane value. But why was X set to the value it was set at? Easy, the person before us was an idiot. But is that really the case? What factors went into that decision? What organizational pressures were present that forced a more conservative value? Could we not spend the money for extra hardware that would better utilize X? Was there no time to do performance testing so we settled on the low value for X? Did our predecessor not get the training necessary to understand the impact of X? These are all questions that need to be answered to truly be able to do root cause analysis. And the truth of the matter is, most organizations don’t have the stomach for it.

Companies have a hard time looking themselves in the mirror and assessing themselves in an honest light. How many projects weren’t given enough time to be done right? How many projects have to skip a vital phase of the testing process due to time constraints? These are the problems you run into throughout your career, across countless companies, leading one to believe it’s less about the company and more about the human condition. But regardless of the source of these problems, they are all things that contribute to the cause of failure in our systems. Organizations that have operational excellence are the orgs who aren’t afraid to look at themselves honestly and follow the root cause of failure, no matter where it takes them.

Next time you participate in an RCA, take it to the next level. Don’t stop at the 5th “why?” Go to the 100th, or the 1000th or however long it takes to be able to show organizationally where change needs to happen. Don’t absolve yourself of all responsibility, but make sure everyone knows that the failure is not yours, but the organization’s as a whole.

Feature Toggles in Puppet

When we’re performing Puppet changes, we try to work within the framework of some sort of SDLC. We’re in the process of migrating to Stash from SVN, so our processes are a bit in flux. But one problem we run into constantly is how to balance long running feature branches and separation of Puppet code that is still in development or testing.

The systems team works very closely with the developers, sometimes with our changes being dependent on one another. An example is when we move a file from being hosted on our web servers to being served by S3. It requires a coordinated change to the Apache config to handle the redirects and to the development code that automatically handles the population of S3. While this is being tested, we need to have two different copies of the Apache configuration for the site.

A) - Copy in production where files are located on the web server itself.

B) - Copy in test that handles a redirect to the S3 location.

The testing that takes place may take a day or it may take weeks. The longer testing takes, the more drift there is between my Puppet development branch and the master branch which is getting pushed to production regularly. I could just be a rebase monster while this is being tested, but sooner or later, I’ll fail in my responsibility and I’ll have some awful merge waiting to happen. I needed a better way and the best thing I could come up with was some form of Feature Toggle.

With a feature toggle, I have the ability to release some code, without all nodes receiving that code path. More specifically for my use case, I can commit code to master, without fear of it actually being executed. This is often leveraged in Continuous Integration environments to prevent incomplete code from impacting production.

With Puppet I decided to implement something very similar using if blocks and Puppet Enterprise console variables. When I’m developing something I put my resource declarations in a block like so

if $sitemap_redirect_feature == 'enabled' {
    # Puppet resource declarations
}
else {
    # Default activities
}

Then in the Puppet Enterprise console, I’ll assign the variable sitemap_redirect_feature a value of enabled. If you’re not a Puppet Enterprise customer or aren’t using the console, you could also specify it via a Hiera lookup with a default value.

hiera('sitemap_redirect_feature', 'disabled')

This makes it easier to assign to groups of servers based on your hiera configuration.

Because of the way Puppet variables are evaluated, any node that doesn’t explicitly set the variable will follow the else path.

The plus side to this is while you’re figuring out exactly how resources should be laid out, you can still commit to master without fear of breaking anything. (Just make sure you do all your static analysis so that your Puppet code is at least valid)

Once your testing is complete and you’re ready to push the changes to production, you simply remove your updated resource declarations from the if/else block so that they’re always executed. Delete the if/else block and push your code.

I’ve been using this pattern for a few weeks now and so far it is working out pretty well. I may refine the approach as I run into new hurdles.

Strategy vs Solution

In the technology arena, things are constantly changing and new technologies are being spun out at a rapid rate. The problem is that as technologists, we’re eager to try out the new hotness, with Docker being the new darling child. Just ask Google about the hype cycle behind Docker.

I’m not going to debate the anointed position of Docker. It is a very cool and incredibly useful technology. But what I do take issue with is using Docker for the sake of using Docker, without any real examination of the problems that are trying to be solved. Docker gets trotted out as a strategy, rather than taking its rightful place as a solution for a strategy.

Containerization of your application may or may not be a straightforward exercise. You could spend weeks getting things tuned and set up in a way so that you can now deploy your application via Docker. You’re living the dream of developing on your desktop and having that same container move all the way through your pipeline into production. But if your build still takes 90 minutes, is it worth the effort? Have you actually solved your pain point?

I’m not dismissing the other intangibles that Docker offers, but I’m a big fan of the Theory of Constraints. Optimizing for anything other than the bottleneck is just a waste.

It sounds like I’m picking on Docker, but it’s just an easy example because of its current popularity. But I’ll give an example closer to home.

I’m working on a Fantasy Football site in my spare time. One of my strategies is to collect information from all of the various sites that provide fantasy data projection.

Notice how my strategy is devoid of any specific technology or implementation. That’s how a strategy should be defined. In clear terms that don’t hint towards a specific solution or direction.

Well, I lost sight of that and immediately jumped to the solution. I wrote a series of scrapers to go out to various websites and pull down the information, without any thought to my actual strategy. I jumped to the solution because it’s an easy thing to do as an engineer.

Fast forward a few weeks and I’m spending more time fixing the scrapers and coding defensively against changes to the source website, instead of continuing development of my application. But if I think about my strategy I could probably come up with a few quick solutions.

  • Mechanical Turk - I could pay someone, probably for less than $10, to manually enter the data into a CSV document. Writing a CSV importer is a lot simpler than an HTML scraper. (A rough sketch of one follows this list.)
  • Fantasy Data - While a bit pricier, I could also pay for an API endpoint to provide me with a bunch of data. ESPN, CBS, and Yahoo all have similar services available at varying prices.
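To back up the claim that the CSV importer is the easy part, here’s roughly what one would look like in Python. The column names are guesses at what a hand-entered projection file might contain:

import csv

def load_projections(path):
    # Read hand-entered projections from a CSV with columns like player, position, points.
    with open(path, newline="") as handle:
        return [
            {"player": row["player"], "position": row["position"], "points": float(row["points"])}
            for row in csv.DictReader(handle)
        ]

# projections = load_projections("week1_projections.csv")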

Between the 3 options that I briefly described (Mechanical Turk, Fantasy Data and a custom scraper), the Mechanical Turk option makes the most sense for me. It’s inexpensive, delivers the value I’m looking for and requires the lowest amount of effort on my side, allowing me to focus on my core product.

The moral of the story is: remember to evaluate why you want to implement a technology. The strategy should be separate from the solution so that you can make sure you’re addressing your pain points.

My New Understanding of the MVC Pattern

I’m relatively new to the Rails community. I come from the Python/Django world, but I’ve been enjoying the transition, except for one minor part: models.

When I dig around looking for info on how to structure my code, I keep running into Best Practices that advocate for a skinny controller/fat model pattern. The idea being that the model contains most of the program logic. I feel like an ass-hat because I’m the new guy but this sounds crazy to me, and others definitely agree. Why limit ourselves to three class types?

I’ve started to move some of my logic into separate classes that are not connected to a model or a controller. They’re utility classes that deal with external sources of data that don’t need to be persisted and definitely don’t fit the role of the controller. In fact, their primary purpose is fetching of data from other sources, to be consumed elsewhere. With that use case in mind, I was a bit surprised when I mentioned this to a few programming buddies and it seemed like they hadn’t thought of it. While we couldn’t come up with a compelling reason why this was wrong, I was a little perturbed that it wasn’t something regularly done. So now I have a generically named folder ‘classes’ to house some of these items.

I’ve been doing some research on MVC purely as a design pattern and I realize that I’ve been making one fatal mistake that’s limiting my usage of the pattern.

Model != Persistence.

The problem I often run into is that my models are shaped based on how I store them in the database. But sometimes how I store an object isn’t necessarily how I want to interact with the object. I end up traversing a bunch of relationships via the ORM. But if my actual storage strategy changes, I suddenly have to update code everywhere that doesn’t necessarily care about how the data is saved. But in reading more about MVC, my model doesn’t have to mirror my storage, as long as the model knows how to persist the object.

I’ll be playing around with inserting an additional layer of abstraction for my models to allow me to interact with the object in its logical form, as opposed to its actual form in the database.
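To make that abstraction concrete, here’s a rough sketch in Python (my old stomping ground); the class and method names are made up for illustration and the db calls are placeholders, not anything from my actual codebase. The model exposes the logical shape I want to work with, while a separate repository owns the knowledge of how it’s actually stored:

from dataclasses import dataclass, field

@dataclass
class Article:
    # The logical shape of the object: what the rest of the app cares about.
    title: str
    author_name: str
    tags: list = field(default_factory=list)

class ArticleRepository:
    # The only place that has to change if the storage strategy changes
    # (different joins, a separate table, a document store, etc.).
    def __init__(self, db):
        self.db = db

    def find(self, article_id):
        row = self.db.fetch_article(article_id)  # placeholder for however storage works today
        return Article(row["title"], row["author"], row["tags"].split(","))

    def save(self, article):
        self.db.upsert_article(  # translate the logical form back into the storage form
            title=article.title,
            author=article.author_name,
            tags=",".join(article.tags),
        )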

We’ll see how it goes.

Why Everyone Should Attend a Conference

This has been a week of conference bliss for me. I attended Puppet Camp Chicago earlier in the week and spent the rest of the week at Linux Con. I’ve never been a big conference attendee in the professional aspect of my life, so it was a bit of a first. I have to tell you, it’s an awesome experience.

My experience has left me with a single question; Why are managers not pushing harder for employees to attend conferences? I’m paying for Linux Con out of my own pocket, but conference attendance is something bosses should embrace. It may seem like a scheme for employees to get a week off with paid expenses, but I assure you, it’s more than that.

The energy at a convention is like nothing you’ve experienced before. The space is filled with upbeat professionals that are tackling problems both incredibly similar and radically different from your own. The conference talks usually run the gamut in terms of experience levels. As an attendee you’d be hard-pressed not to find something you’re interested in. Here’s my lineup for Day 1 of the conference. This doesn’t include all of the talks I had to skip because of timing conflicts.

  • Linux Performance Tools - There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? This talk summarizes the three types of performance tools: observability, benchmarking, and tuning, providing a tour of what exists and why they exist. Advanced tools including those based on tracepoints, kprobes, and uprobes are also included: perf_events, ktap, SystemTap, LTTng, and sysdig. You’ll gain a good understanding of the performance tools landscape, knowing what to reach for to get the most out of your systems.

  • Tuning Linux for Your Database - Many operations folk know the many Linux filesystems like EXT4 or XFS, they know of the schedulers available, they see the OOM killer coming and more. However, appropriate configuration is necessary when you’re running your databases at scale. Learn best practices for Linux performance tuning for MySQL, PostgreSQL, MongoDB, Cassandra and HBase. Topics that will be covered include: filesystems, swap and memory management, I/O scheduler settings, using the tools available (like iostat/vmstat/etc), practical kernel configuration, profiling your database, and using RAID and LVM.

  • Solving the Package Problem - In the beginning there was RPM (and Debian packages) and it was good. Certainly, Linux packaging has solved many problems and pain points for system admins and developers over the years – but as software development and deployment have evolved, new pain points have cropped up that have not been solved by traditional packaging. In this talk, Joe Brockmeier will run through some of the problems that admins and developers have run into, and some of the solutions that organizations should be looking at to solve their issues with developing and deploying software. This includes Software Collections, Docker containers, OStree and rpm-ostree, Platform-as-a-Service, and more.

  • From MySQL Instance to Big Data - MySQL is the most popular database on the web but how do you grow from one instance on a single LAMP box to meets needs of high availability, big data, and/or ‘drinking from the fire hose’ without losing your sanity. This presentation covers best practices such as DRBD, read/write splitting, clustering, the new Fabric tool, and feeding Hadoop. 80% of Hadoop sites are fed from MySQL instances and it can be frustrating without guidance. MySQL’s Fabric will manage sharding and provide more flexibility for your data. And using the memcached protocol to access data as a key/value pair can be up to 9 time faster than SQL (but

All of these talks are items that can help my career and my employer today. It has given me a level of enthusiasm that I haven’t had in quite some time. Now imagine if you could give that level of education, motivation and enthusiasm to every member of your team.

My conference buddy and I have already identified several technologies we want to look at implementing, as well as developed contacts with people who are already using them. We met some great people at Puppet Labs, like Lindsey Smith, the Puppet Enterprise product owner, who listened to our real world problems and pain points. He also got us set up with the Puppet Labs Test Pilot Program so that we can be involved in the direction of Puppet Enterprise.

We grabbed a few beers with Morgan Tocker, the MySQL Community Manager at Oracle. We shared stories, talked about some of our struggles with MySQL, and generally had a good time, while getting a ton of insight into potential pain points in the future as well as features to leverage in upcoming releases.

When we get back to the office on Monday, we’ve got a ton of things to discuss, evaluate, re-evaluate and expand upon. That’s the power of conferences, and if you’re a manager, it’s why you should consider the next request for conference funds a little more carefully.

My Puppet Development Environment

I was in attendance at Puppet Camp Chicago today and had some really awesome conversations with people. It’s always worthwhile to hear how people are approaching similar problems to yours. It was also nice to get a chance to meet some of the developers of my favorite Puppet modules, but I digress.

One of the conversations that came up was what our local development process looked like for Puppet. Many people are attempting to find the right mixture of process and tools to help develop their infrastructure. With this in mind, I figured it might be worthwhile to share my developer setup. YMMV.

VIM - VIM is my editor of choice. Of course, saying you use VIM is like saying “I have a car”. Nobody just uses VIM these days. There are always some plugins that get mixed in there, and my setup is no different.

  • vim-ruby - VIM Ruby is a nice plugin for all types of fun, helpful bits. Check it out.

  • NERDTree - A great plugin that adds some file browsing capabilities to VIM. Well worth it to avoid buffer hell.

  • Powerline - A great add-on for VIM, zsh, and bash that adds an awesome status bar to your VIM interface. The git status in the toolbar is extra helpful.

  • tmux - Tmux isn’t really a VIM plugin, but it is essential to my workflow. Being able to create multiple windows, split panes and easily navigate amongst them with keyboard shortcuts is indispensable.

  • Custom VIM Functions - I have one main custom function that I use heavily for linting. The function determines whether the file is a Puppet (.pp), JSON (.json) or ERB (*.erb) file and runs the appropriate linter. Below is a copy of it.

function LintFile()
    let l:currentfile = expand('%:p')
    if &ft == 'puppet'
        let l:command = "puppet-lint " . l:currentfile
    elseif &ft == 'eruby.html'
        let l:command = "erb -P -x -T '-' " . l:currentfile . "| ruby -c"
    elseif &ft == 'json'
        let l:command = 'jsonlint -q ' . l:currentfile
    end
    silent !clear
    execute "!" . l:command . " " . bufname("%")
endfunction
map  :call LintFile()

##Virtual Machine Setup##

My local development environment consists of two virtual machines, a Puppet master and a Puppet client. I'm using [Virtualbox](https://www.virtualbox.org) for virtualization, but really any VM tool should be fine.

The nice thing with having a virtual Puppet client on your desktop is that you can snapshot it to get your VM back to an initial state. So before you do any development on the Puppet client, make sure you [take a snapshot](http://www.virtualbox.org/manual/ch01.html#idp55591632) so that you can get back to a clean starting point.

On the Puppet Master VM you'll want to [create a shared folder](https://www.virtualbox.org/manual/ch04.html#sharedfolders) in Virtualbox or your VM Manager of choice.  Point the shared folder to whatever folder holds your Puppet manifests on the local machine. Now [mount the shared folder](https://www.virtualbox.org/manual/ch04.html#sf_mount_manual) in your VM so that it's accessible within the Virtual Machine. You should now have access to your Puppet manifests on your local machine, via the Virtual Machine.
	
Last but not least, modify your [modulepath](https://docs.puppetlabs.com/references/latest/configuration.html#modulepath) in the virtual Puppet master and add the shared folder path to the modulepath. By adding the shared folder to your modulepath, you can develop your Puppet manifests on your local machine, with all your tools without the need to develop inside the VM or to sync files from your local machine to your VM.

##Remote Puppet Development##

Occasionally you might hit a use case that isn't testable on a local machine and you need to test it on a Puppet master in your pre-prod environment. (You do have a pre-prod environment right?) When this situation comes up it's nice to have [Puppet Environments](http://puppetlabs.com/blog/git-workflow-and-puppet-environments) setup. Most people use them in a dynamic fashion, but you can definitely use them statically. (And with SVN) After you've created the environments, it's just a matter of getting your files to the path on the remote server. Rsync is a great tool for this as it allows you to get your files to the remote server for testing, without the need to actually commit code that you're not sure will work yet. (Which in some environments might trigger a long, time consuming series of automated checks and builds)

That's pretty much it for my development environment. I should also mention that if you're working on a Mac, it might be worth checking out [Dash](http://kapeli.com/dash), which is an awesome developer documentation tool. It basically sucks down the Docsets of various programming languages and tools. (Puppet being one of them)

At some point I'll probably write a follow up post to detail our actual development and deployment workflow. Hope this helps some poor soul out there on the web.

Organizing Puppet Code

I feel like every team I talk to, at some point, decides they need to blow up their Puppet code base and apply all of the things they’ve learned to a new, awesome codebase. Well, we’re at that point in my shop and there’s a small debate going on about how to organize our Puppet modules.

This is really not meant to be a mind-blowing blog post, but more of a catalog of thoughts for me as I make my argument for separate repositories for each Puppet module. A few background items:

  • We’ll be using the Roles/Profiles pattern. What I’m calling “modules” are the generic implementations of technologies. These are the modules I’m suggesting go into separate repositories. I’m OK with profiles and roles co-existing in a single repository.
  • We’re coming from a semi-structured world where all modules lived in a single SVN repository. Our current deployment method for Puppet code is an svn up on the Puppet Master.
  • We’ll be migrating to Git (specifically Stash)
  • We’ll have multiple contributors to the code base in 2 different geographic locations. (Chicago and New York for now) The 2 groups are new to each other and haven’t been working together long.

I think that’s all the housekeeping bits. My reasons for keeping separate Git repositories per module are not at all revolutionary. They’re some of the same arguments people have been writing about on the web for a while now.

##Separate Version History##

As development on modules moves forward, the commits for these items will be interspersed between commit messages like “Updating AJP port for Tomcat in Hiera”. I know tools like Fisheye (which we use) can help eliminate some of the drudgery of flipping through commit messages, but you know what else would help? Having a separate repo where I can just look at the revision history for the module.

##Easier Module Versioning## With separate repositories, we can leverage version numbers for each release of a module. This allows us to pin particular profiles that leverage those modules to a specific version number until they can be addressed and updated. With two disparate teams, this allows one team to continue forward with potentially disruptive development, while other profiles have time to adapt to whatever breakage is occurring.

##Access to Tools## Tools like Librarian-Puppet and R10K are built around the assumption that you are keeping your Puppet modules in separate repositories. I haven’t done a deep dive on the tools yet, but from what I can tell, using them with a single monolithic repository is probably going to be a bit of a hurdle.

##Easier Merging/Rebasing and Testing## The Puppet code base is primarily supported by the Systems team. The world of VCS is still relatively new to the Systems discipline. As we get more comfortable with these tools, we tend to make some of the same mistakes developers make in their early years. The thing that comes to mind is commit size and waiting too long to merge upstream. (Or REBASE if that’s your thing) Keeping the modules in separate repositories tightens the problem space you’re coding for. If you need to create a new parameter for a Tomcat module, a systems guy will probably:

  1. Create the feature branch
  2. Modify the Tomcat module to accept parameters
  3. Modify the profiles to use the parameters
  4. Test the new profiles
  5. Make more tweaks to the tomcat module
  6. Make more tweaks to the profiles
  7. Test
  8. Commit
  9. Merge

With separate modules, the problem space gets shrunk to “Allow Tomcat to accept SHUTDOWN PORT as a parameter”. We’ve removed the profile from the problem scope for now and have just focused on Tomcat accepting the parameter. Which also means you need a new, independent way to test this functionality, which means tests local to the module. (rspec-puppet anyone?) This doesn’t even include the potential for merge hell that could occur when this finally does get committed and pushed to master. Now I’m not naive. I know that this could theoretically happen even with separate modules, but I’m hoping the extra work involved would serve as a deterrent. Not to mention that gut feeling you get when you’re approaching a problem the wrong way.

#In favor of a Single Repository# I don’t want to discount that there could be some value in managing all your code as a single repository. Here are the arguments I’ve heard so far.

##It complicates deployments of Puppet Code## True that. Nothing is easier than having to execute an svn up command……except for running a deploy_puppet command. Sure you’d have to spend cycles writing a deployment script of some sort, but if that’s a valid reason then we’re just being lazy. I might be being terribly optimistic but it doesn’t seem like a hard problem to solve.

In addition, I’ve always preferred the idea of delivering Puppet code (or any code for that matter) as some sort of artifact. Maybe we have a build process that delivers an RPM package that is your Puppet code. A simple rpm -iv or yum update and we’ve got the magic.

##It complicates module development## Sometimes when people are developing modules, their modules depend on other modules. I strongly dislike this approach, but it is a reality. You would now have to check out two separate modules and all of their dependencies in order to develop effectively.

Truth is, this sounds like bad workflow. A single module shouldn’t reach into other modules. In the rare event that it does (say your module leverages the concat module), then these dependencies shouldn’t just be inferred by an include statement. They should be managed by a tool like librarian-puppet, because in reality, that’s all they are: dependencies. Every other language has an approach to solving dependency management (pip, gradle, bundler and now librarian).

#What’s Next?# With my thought process laid out with pros and cons, I still feel pretty strongly about separate repositories for each module. Another solution might be to create some scripts that manage the modules in a way that can fool the users that care into thinking that they’re dealing with a single repository. But this tends to defeat some of the subliminal messaging I hope to gain from separate modules. (Even though that’s probably a pipe dream)

I’ll be sure to post back what becomes of all this and if the team has any other objections.

Video from Opening the Clubhouse Doors C2E2 Panel

I’m a little late in posting this, but better late than never. The panel that I moderated for C2E2 finally has video footage up. It was a great discussion with a bunch of really great people. Scott Snyder was so incredibly humble and gracious. He’s lucky I hadn’t started reading Batman Eternal before the panel. I’m so in love with that book I’m not sure I could have prevented myself from bear-hugging him. Anyways, I digress. Check out this great talk.

Rethinking Stability - Part 1 Review of In Search of Certainty

Stumbling through the web, I found this book club called ReadOps. It’s an amazing idea and our first book to read is In Search of Certainty by Mark Burgess, a seriously smart man. We’re reading the book in sections and with a bit of effort, I was able to get through part 1. Below is the writeup I did for ReadOps. If you’re in the IT Field, ReadOps might be worth checking out.


Part 1 of this book was rough, but I promise that it gets better in the later chapters. The principal issue I have is the amount of depth that Burgess goes into to set up his arguments. There are significant correlations between his work in physics and its history, but I don’t find the detail useful beyond the 2nd or 3rd paragraph. Now with that being said, there are a few points that I find absolutely stellar.

##How We Measure Stability## The sections on stability really challenged my thinking on how we measure it. In general, I’ve always measured stability through the ITSM Incident/Problem Management processes. But Burgess struck home for me when he said that what we’re actually measuring in these processes are catastrophes, not stability.

If I had the right models for stability, I’d recognize the erratic memory usage patterns of the Java Virtual Machine (JVM). Those swings and fluctuations would give me an idea of the stability of the JVM. (I’m also mixing in a concept he talked about in regards to scale, but that’s a whole different thing.) As it stands, I don’t pay attention to the small perturbations that lead up to the eventual OOM or long garbage collection pause. Instead, the OOM triggers an incident ticket, which then gets tracked and mapped against our stability, when truth be told, stability was challenged much earlier in the lifecycle.

I’m not sure if ITSM has controls to deal with these types of situations or not. In my specific case above, an incident ticket might not have been warranted, because memory usage can be self-healing through garbage collection. But without defined thresholds and historical data to trend against, it would be easy to miss an incident or situation where memory usage went from 40% -> 75% -> 44%. Sure, memory usage dropped significantly, but it’s still up 4% from where we started. What do I do with this information? I guess that’s where clearly defined thresholds come into play.
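As a toy illustration (the samples and the band are my numbers, not anything from the book), flagging those excursions takes only a few lines once a threshold band is actually defined:

    # Memory usage samples (percent), oldest first, and an arbitrary "normal" band.
    samples = [40, 75, 44]
    low, high = 20, 70

    # Flag every sample that leaves the band, even if the next one recovers.
    excursions = [(i, s) for i, s in enumerate(samples) if not low <= s <= high]
    print(excursions)  # [(1, 75)] -- the swing an incident ticket never captures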

##A Stability Measurement## With all that being said, I wonder if it’s possible to distill stability into some abstract value or number. (Maybe this has already been done and I’m late to the party?) I think a lot about Apdex and how much I love it as an idea. But for me, as a Systems guy, it’s at the wrong scale and it introduces components that I have no control over. (Namely the client browser and everything that happens inside it.) What would be incredibly useful, though, is some sort of ServerDex metric. I’m imagining taking a range of values, deciding which ones fall outside the desired thresholds and applying some sort of weighted decaying average to them. (I’m literally just spitballing here.) That could give Systems folk some sort of value to track against. It’d be nice to be able to take several of these measurements and combine them for an approximation of the stability of our system as a whole.
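To make the spitballing slightly more concrete, here’s one naive way such a score could be computed: an exponentially decaying weight applied to whether each sample stayed inside its threshold band. The function name, decay factor and sample data are all made up; this is a sketch of the shape of the idea, nothing more.

    # A hypothetical "ServerDex": 0-100, where recent in-threshold samples count
    # more heavily than older ones. Purely illustrative.
    def serverdex(samples, low, high, decay=0.9):
        score = total = 0.0
        weight = 1.0
        for sample in reversed(samples):      # newest sample weighs the most
            score += weight * (1.0 if low <= sample <= high else 0.0)
            total += weight
            weight *= decay                   # older samples fade away
        return 100.0 * score / total if total else 100.0

    print(serverdex([40, 52, 75, 44], low=20, high=70))  # ~74: one recent excursion hurts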

Ok, time for me to hit the hay.

C2E2 is Almost Here

Wow, I can’t believe that C2E2 is almost here! The con has definitely become one of my favorite events of the year and I’m glad that CNSC got the opportunity to work with them again.

This year I’m also excited to be moderating a panel entitled “Opening the Clubhouse Doors: Creating More Inclusive Geek Communities”. If you’re going to be at C2E2, for selfish reasons, I highly recommend you check it out.

Date: Friday April 25

Time: 6:30pm - 7:30pm

Location: S401CD

On the panel will be

  • Michi Trota, blogger, essayist, fire spinner and general bad ass.
  • Mary Robinette Kowal, Hugo Award-Winning Sci-Fi and Fantasy author.
  • Mary Anne Mohanraj founded the World Fantasy Award-winning and Hugo-nominated magazine, Strange Horizons.
  • Karlyn Meyer is an attorney who uses her vacation days to work at PAX and game development conferences. She studied intellectual property and the law’s application to gender and sexuality.
  • Scott Snyder is a comic book writer working for the likes of DC and Marvel as well as on creator-owned titles. You can see Scott do his thing on the pages of Batman and The Wake.

As you can tell, this is going to be an awesome panel. Some of our panelists were together last year at C2E2 for the Exorcising the Spectre of the Fake Geek Girl panel. Video link below.

Brew is Failing with Crazy Error

If you're on Mavericks and you're using Homebrew, you may have experienced a weird error message.


/System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:45:in `require': /usr/local/Library/Homebrew/download_strategy.rb:88: invalid multibyte escape: /^\037\213/ (SyntaxError)


This seems to be caused by Mavericks updating the system Ruby to version 2.0. All you need to do is make sure that Homebrew points specifically to the 1.8 version of Ruby.


Edit /usr/local/bin/brew


Change the first line from


#!/usr/bin/ruby


to


#!/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby


Run a brew update afterwards. If brew update works, you're good to go. If it fails, you might need to go nuclear on this bitch. Understand this will probably trash any custom formulas you have.


cd /usr/local
git reset --hard origin/master
git clean -df


Hopefully that solves your problem.


Postgres, Django and that Damn Clang Error

I'm migrating one of my Django projects to Postgres (from MySQL). I'm writing this more as a note for myself, but if someone else finds it useful, go for it. If you've done this on a Mac, you may have seen the following errors.


Error: pg_config executable not found.



If you've hit that error, you may not have Postgres installed. If you do have Postgres installed, make sure the install's bin directory is in your PATH. If you don't have Postgres installed, the easiest route is Postgres.app. After installing Postgres, drop to a terminal and add its bin directory to your PATH.


export PATH=$PATH:/Applications/Postgres.app/Contents/Versions/9.3/bin
pip install psycopg2


Don't feel discouraged when this fails again. Because it probably will. The message is extremely helpful if you've got your Ph.D.


clang: error: unknown argument: '-mno-fused-madd' [-Wunused-command-line-argument-hard-error-in-future]
clang: note: this will be a hard error (cannot be downgraded to a warning) in the future
error: command 'cc' failed with exit status 1



It may not be downgraded in the future, but today is not tomorrow. So let's hack this bad boy.


Things you need:


  1. If you don't have it already, download Xcode and install it.

  1a. Drop to a terminal and install the command line tools with

    xcode-select --install

  2. If you already had Xcode installed and its version is 4.2 or earlier, skip ahead to step 3. If you downloaded Xcode in step 1, you'll need to install some additional compiler tools that were removed from Xcode. The best way is to use Homebrew. (You are using Homebrew, RIGHT?)

    brew install apple-gcc42

  3. Once that's complete, try the install. If it works, you're done. If not (and it probably won't), move on to step 4.

    pip install psycopg2

  4. Set an environment flag to skip the BS compiler flags being used, then install again.

    export ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future
    pip install psycopg2



With any luck, that will result in a successful install.


IT and the Empathy Deficit

This post is REALLY late, but I think the topic is still relevant, even if the trigger events have faded from our memory.


The Information Technology field is completely devoid of any capacity for self-reflection. The whole damn thing, from companies to boards of directors, to developers, to system admins. How easily and quickly we wag our fingers when someone else fails, yet when we ourselves fall down, there’s a “perfectly logical explanation”.


In case you were under a rock last Friday, many of Google’s services went down for an extended outage. I know that in our fast-paced world of hyper-connectivity, 25 minutes without email or documents feels like the end of the world. There’s the entrepreneur who finally got his chance to pitch in front of a venture capital firm but couldn’t get to his presentation. The college kid who was trying to print his assignment before making a mad dash to beat the deadline. I get it; these services impact our lives in major ways.


But it’s alarming to see how the people who should understand most are the first to pile on. Yahoo just couldn’t help themselves and tweeted about the issue multiple times. They have since apologized, but honestly, at this point, who cares?


But as the Twitterverse collectively freaked out, everyone in my office was cool as a cucumber. Sure, we couldn’t access email, but we knew Google would fix the problem and be back up as soon as possible. How did we know?


Because it’s what we would do.


News flash. Sometimes people make mistakes. Sometimes processes fail. Sometimes gaps we didn’t know about are found. Sometimes test cases are missed. As a developer, tester or system admin, have you never made a mistake? Have you never let a bug slip into production? Have you never underestimated the impact of a change? If you’re perfect, then this message isn’t for you. But if you’re like the other 99.999% (see what I did there?) of people in our field, I’m sure we can agree on a few things.


  • Google’s uptime is pretty damn good.
  • Google is run by some pretty smart people.
  • Even smart people can be fallible.
  • Downtime is a human tragedy. We should treat it with respect.


That last one sounds crazy, but seriously: for someone on that Site Reliability Team, the outage wasn’t a laughing matter. It probably doesn’t feel good to know that the Internet is collectively dismayed and disgusted by a mistake you made, even though 50% of people wouldn’t understand the mistake if you explained it to them. Instead of ridicule, we should encourage open dialogue about how mistakes like this are made, so everyone, not just Google, can learn from them.


Outages are learning opportunities for everyone. Why did it happen? Was it a tools failure? I’m sure others would like to know if it’s a tool they use as well. Was it a process failure? Open dialogue about the failures of traditional IT Operations shops had a huge hand in forming the DevOps movement. Was it human error? Why did that person think the action they took was the right one? If it made sense to them, it will make sense to someone else, which means you might have a documentation or training issue.


All of these problems are correctable, but only if we feel comfortable talking about our failures. The constant ridicule and cynicism our industry displays when someone fails threatens the very dialogue we need.


Google has shared some details about the outage, and I’m happy to say that kind of openness seems to be a growing trend among companies. But what about at a lower, more personal level?


I challenge those in our field.


  • Be fallible.
  • Be open with your failures.
  • Get to the heart of why the failure happened. Don’t just call it a derp moment and move on.
  • Recognize when someone is trying to do these things and encourage it.
