Wednesday, October 18, 2017

WTF is DevOps?

In the late nineties, I can remember having a conversation with my future wife about the number of pillows on her bed.

“What are all these pillows for?”

I can understand the pillow for sleeping on. That I got. But there must have been at least ten pillows on the bed.

Patiently, she explained. “They are decorative pillows.”

“So you don’t sleep on them?”

Her patience was starting to wear thin. “No, they are decorative.”

The next week, we were sitting on the couch watching TV when a promo for “The Man Show” came on. Jimmy Kimmel was doing a monologue with just a clip for the promo. In the monologue he said, “Hey, ladies, what’s with all the pillows?”

I absolutely lost it; the timing of the promo, coming right after our conversation, was just uncanny. To end my laughter at her expense, Julie hit me in the head with a pillow. It only made me laugh harder, but in the same vein…

WHAT’S WITH ALL THESE ENVIRONMENTS?

The second most read thing I have ever written was titled “SharePoint is a Colossal Piece of Shit and Should Not Be Used by Anyone”. One of my chief complaints was that SharePoint does not have a good way to migrate code between environments. Most SharePoint implementations have a single environment: production.

Typically, when writing software, there are multiple environments. Code moves from a developer’s laptop to a development environment where the code is then mixed in with everyone else’s changes. Tests are performed. If everything passes, it is moved to a testing environment where business users can take a look at the new feature or verify a bug has been squashed. If everything goes well, there is a final move from testing to production. The goal is to verify along the way that the amount of change an end user sees is limited, that new code is thoroughly tested before it reaches an end user, and that nothing breaks during the deployment of the code itself.
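If it helps to picture it, here is a tiny, purely illustrative sketch of the same code base pointing at different environments. The hostnames and settings are invented for the example.

    # Illustrative only: one code base, three environments. Hostnames and
    # settings here are made up, not from any real system.
    ENVIRONMENTS = {
        "dev":  {"db_host": "db.dev.example.com",  "debug": True},
        "test": {"db_host": "db.test.example.com", "debug": True},
        "prod": {"db_host": "db.prod.example.com", "debug": False},
    }

    def get_config(env_name):
        """Return the settings for whichever environment this code is deployed to."""
        return ENVIRONMENTS[env_name]

    print(get_config("dev"))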

When it’s just one software engineer writing code all by herself, having multiple environments would probably be overkill. Our heroine could introduce a new feature or squash a bug on her laptop, test it, and then push it out to a production server. In fact, that’s what I have typically done on a lot of my hobby work. I have a copy that runs on my laptop and I have a server in Amazon Web Services (AWS) or Digital Ocean. I code, test, and push.

However, when working with a team, it is a bit more complicated. What if I make a change and it works, and the guy sitting across from me makes a change and it works, but when our code is merged together, weird stuff starts happening? On a big software project, the code base becomes a living organism that is constantly changing. It is not uncommon to start working on a feature and have it take over a week before the resulting changes are merged back into the main code base. During that week, numerous changes will probably have been made by other team members.

Although each individual developer should be responsible for unit testing their code, how all the changes are going to work together is usually an unknown. Sadly, software is usually so fragile that every change can cause a ripple effect. Therefore, every time a change is introduced, it is a best practice to run a suite of regression tests to verify that the stuff that used to work actually still does work after the change has been introduced.
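To make that concrete, here is a minimal sketch of what a regression test might look like. The function under test and the expected values are invented for the example; the point is only that the same checks run after every change.

    # Hypothetical regression suite. The function and values are made up;
    # the idea is that these checks run again after every change is merged.
    import unittest

    def calculate_invoice_total(line_items):
        """The behavior that already worked and must keep working."""
        return sum(line_items)

    class RegressionTests(unittest.TestCase):
        def test_existing_totals_still_add_up(self):
            self.assertEqual(calculate_invoice_total([10, 20, 30]), 60)

        def test_empty_invoice_is_zero(self):
            self.assertEqual(calculate_invoice_total([]), 0)

    if __name__ == "__main__":
        unittest.main()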

It is also at the point when new code moves from the developer’s laptop to the development environment that issues can arise, because the hardware and software the developer works on are different from the integrated environment. A developer may work on a Mac or a PC, but have their code run on a Linux box. Not only are all the changes now tested together, but the deployment mechanism and the differences between operating systems and hardware can be tested as well.

It is not uncommon for developers to test edge cases and create users like “Daffy Duck” in a development environment. The development environment exists to merge code together, test deployments, and make sure everything is working before getting the business users involved. The data quality, and a business user’s ability to actually use this environment, are usually fairly limited. Additionally, it is not uncommon to make multiple deployments to this environment before calling it done. Sure, every time a developer pushes code they really think they are done, but it never works that way.

In test, it is not uncommon to have more production-like data. Data can be sanitized for Personally Identifiable Information (PII) and pulled back into the test environment so that business users can run tests and have the system behave more like what they are used to in production. When a deployment goes from dev to test, the change set should have been thoroughly tested, and there should be high confidence that the code will work as expected. Of course, business users may change their minds, in which case development starts all over again on the engineer’s laptop, gets pushed to dev, and then to test before the business user sees it again.

Once the product owner or business users sign off, the code can move from test to prod. Usually this is done in a specific maintenance window, and the user will never know about all the work that went into writing and testing a feature. If this all seems like overkill, consider that plenty of bugs and unintended consequences are usually caught along the way, and that the process allows multiple developers to work on the same code base simultaneously.

LAPTOP OPS

There are services such as Heroku or AWS Elastic Beanstalk that allow a user to deploy an application straight from their laptop. If the user goes to the command line (where the real work is done) and types “git push heroku master”, the contents of the local git repository are teleported to a Heroku dyno. A few minutes later, any changes that were made are in effect in that environment.

This is not DevOps. It is laptop ops. It might work well for one lonely developer, beavering away on a code base by themselves, but… What if I make some changes to my application and the guy sitting a few desks away from me makes some changes? What happens if we type the magical “eb deploy” invocation at the same time? How would we know whose changes just stomped on top of the previous deploy? Short answer: we really wouldn’t. While it may be a fun way to prototype, way too much can go wrong when working with a team.


FINALLY, DEVOPS DEFINED

So I attended DevOps Days in Austin this year. The first thing every speaker did was try to define DevOps. Here is my personal checklist definition of DevOps:

  • All code is kept in a source control repository
  • Developers can clone the repository and get a local copy of the application working with minimal headache
  • All third party dependencies are clearly marked, kept out of source control, locked at a specific version, and can be quickly installed locally
  • Any change is tracked and fully auditable
  • Deployments of changes happen in a controlled and gated manner

In my dream DevOps world, no one would deploy code from their laptop. The more I thought about it, the more I liked the idea of a merge in git launching a build. Popular git platforms like the near-ubiquitous GitHub have controls built into them that prevent merging directly into a branch, only allow certain GitHub users to approve a merge, allow for private repositories, and offer other safety and security controls. At the time, I was working with a Frankenstein’s monster of Stash, Sonatype, Jenkins, and Rundeck, cobbled together by a crack team of architects, that took at least half an hour to do a simple deployment. After the deployment was approved by God Himself. And Congress.

THE CODE PUSH PYRAMID

On April 14, 2017, I was sitting with a small team late at night. It was Good Friday, and the rest of the company had gone home at 1:00. We were just getting started. Someone above my pay grade had drawn a line in the sand and declared that some level of feature parity would be available in the Data Warehouse that day.

The day started optimistically. A deployment was approved and a scant half hour later, it was in production. But it didn’t do what it was expected to do. I pulled together a script that set the production environment back to the way it was and waited for the next deployment. That didn’t work either. I ran my script. And so it went. Until 11:00 that night. It occurred to me somewhere around noon that every change made in dev barely touched the ground in test before it was shot straight to production. I had an epiphany. We were testing in production!


I tried my hardest to convey that this was a bad idea to a team of people unfazed by the fact that every change was going straight to production. I coined the phrase “The Code Push Pyramid (™)”. The idea is that lots of changes get introduced in dev. Once tested together, they can move to test. Nowhere near all of the changes that get pushed to dev will make it to test. Even fewer changes will make it from test to prod. That’s the DevOps way.

I grew increasingly frustrated with the highly inefficient processes governing every aspect of software development, from the tools and frameworks we could use, to the deployments using Frankenstein’s monster, to the micromanagement we encountered along the way. What made it all worse was that for all of our vaunted process, we TESTED IN PRODUCTION! I have personally dropped tables or restaged data too many times to count IN PRODUCTION. At some point, I offered that we could save a lot of money and time if we just got rid of all the lower environments and deployed straight to production, since that was where we did our testing anyway.

It came to my attention that we tested in production because we had PII data. With all the attention given to the Equifax data breach, protecting PII data seems reasonable enough. I immediately proposed that we should figure out which fields contain PII and replace it with mock data in the lower environments. That way, we could still test and protect PII. This practice is fairly common in the industry. A crack team of architects is still working on this solution, using AWS Kinesis (trust me, this is a great tool but makes absolutely no sense for this scenario) as we speak.
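For what it’s worth, here is a rough sketch of the kind of field-level masking I had in mind. The field names and mock values are invented for the example; nothing here is the architects’ actual solution.

    # Hypothetical PII scrubbing before production data is copied into a
    # lower environment. Field names and mock values are invented.
    import copy

    PII_FIELDS = {
        "ssn": "000-00-0000",
        "email": "user@example.com",
        "phone": "555-0100",
    }

    def sanitize(record):
        """Return a copy of the record with known PII fields replaced by mock data."""
        clean = copy.deepcopy(record)
        for field, mock_value in PII_FIELDS.items():
            if field in clean:
                clean[field] = mock_value
        return clean

    production_row = {"id": 42, "ssn": "123-45-6789", "email": "real@person.com"}
    print(sanitize(production_row))  # safe to load into dev or test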

#SYNERGY

Before we broke up the long-running and inefficiently used Docker application into Lambdas, my co-worker, The Kid, kept advocating that we should be using Lambdas. He was looking at it from a “this could solve all of our problems in a very cost effective and fast way” kind of perspective. I was looking at it from a “how in the world am I going to deploy this on Frankenstein’s monster” sort of way. I created a little cheer I used to say every time The Kid brought up the subject of Lambdas.

“You say Lambda, I say no!”

(pause)

“Lambda!”

(pause)

“No!”

“You say Lambda, I say no! Lambda! No!”

And repeat. Maybe you had to be there, but it was kind of funny. The entire time I was doing my “no lambda” chant, I was working like mad to find a way to deploy them. Something better than Frankenstein’s monster. From a laptop ops perspective, I loved using an open source (read: free) framework called serverless. My hope was to introduce this framework to the monster and make it suck just a little bit less. The crack team of architects said no.

So here I am, months later, and I have come up with my own framework for deploying things to AWS in a manner that meets my definition of DevOps. I have a GitHub webhook pointing to a Lambda. The Lambda reads the branch that has been merged, and the branch determines which environment it will be deployed to. If the branch does not match an entry in the config file, the service assumes it is a feature branch and no deployment is made. The Lambda then clones the repository and runs a deploy script kept within the repo. I also built a build board, deployed with the platform itself, which shows the current state of every repository and streams deployment output in real time to the browser so anyone who is interested can follow along at home. I started calling it “Synergy” and I guess the name stuck. I hope to have a demo up next week.
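To give a flavor of the idea, here is a rough sketch of how a Lambda behind that webhook could work. This is not the actual Synergy code: the branch-to-environment mapping, the deploy script name, and the assumption that a git binary is available in the Lambda runtime are all mine.

    # Rough sketch only, not the actual Synergy code. Assumes an API Gateway
    # event wrapping a GitHub push payload, a git binary in the runtime, and
    # a deploy.sh script kept inside each repository.
    import json
    import subprocess
    import tempfile

    # Hypothetical mapping; the real system reads this from a config file.
    BRANCH_TO_ENVIRONMENT = {
        "develop": "dev",
        "release": "test",
        "master": "prod",
    }

    def handler(event, context):
        payload = json.loads(event["body"])
        branch = payload["ref"].rsplit("/", 1)[-1]   # "refs/heads/develop" -> "develop"
        clone_url = payload["repository"]["clone_url"]

        environment = BRANCH_TO_ENVIRONMENT.get(branch)
        if environment is None:
            # Anything not in the config is treated as a feature branch: no deploy.
            return {"statusCode": 200, "body": f"No deployment for branch {branch}"}

        workdir = tempfile.mkdtemp()
        subprocess.run(["git", "clone", "--depth", "1", "--branch", branch,
                        clone_url, workdir], check=True)
        # The deploy script name is an assumption; each repo carries its own.
        subprocess.run(["./deploy.sh", environment], cwd=workdir, check=True)
        return {"statusCode": 200, "body": f"Deployed {branch} to {environment}"}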

Hopefully, anyone reading this far has a better understanding of what DevOps is and why we need all of those environments. Now, twenty years later, can someone please tell me why the ladies need all those pillows?
