Do I need Chaos Engineering on my environment? Trust me, you need it!

Before we begin, let us understand what chaos engineering is, when to use it, and why you should even think about using it. I faced these same questions when I embarked on this journey. Let us delve deeper into the need for chaos engineering and what it is about.

Going by the name alone, chaos engineering suggests that you are a chaotic engineer and a pain to work with. You know I am joking. It is not a chaos monkey that is going to engineer your environment, either. Chaos engineering is more like conducting controlled experiments on your environment to ensure you don’t end up in a chaotic situation.

The textbook definition from Wikipedia reads: “Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system’s capability to withstand turbulent and unexpected conditions.” There are various perturbation models and tools, such as Chaos Monkey, LitmusChaos, Chaos Kong, Chaos Gorilla, Security Monkey, Chaos Mesh, Gremlin, ChaosMachine, etc., that deliberately break parts of your environment or application and check that an adequate level of service remains available. This tends to expose the system’s weak points. The process of checking how your application or environment copes with failure is called resilience testing, and the property it measures is resiliency.
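To make the idea of an "experiment" concrete, here is a minimal sketch in Python. It assumes a toy, in-memory service (the ToyService class and every function name here are hypothetical, not part of any real chaos tool): state a steady-state hypothesis, inject a failure, then check whether the hypothesis still holds.

```python
import random


class ToyService:
    """Hypothetical in-memory stand-in for a replicated service."""

    def __init__(self, replicas):
        self.healthy = set(range(replicas))

    def kill_random_replica(self, rng):
        # Fault injection: take down one healthy replica, if any are left.
        if not self.healthy:
            return None
        victim = rng.choice(sorted(self.healthy))
        self.healthy.discard(victim)
        return victim

    def handle_request(self):
        # A request succeeds while at least one replica is still healthy.
        return len(self.healthy) > 0


def steady_state(service, requests=100):
    """Steady-state hypothesis: every request is answered."""
    return all(service.handle_request() for _ in range(requests))


def run_experiment(replicas, kills, seed=0):
    """Return True if the hypothesis still holds after the injected failures."""
    rng = random.Random(seed)
    service = ToyService(replicas)
    if not steady_state(service):
        raise RuntimeError("system must be healthy before injecting chaos")
    for _ in range(kills):
        service.kill_random_replica(rng)
    return steady_state(service)


if __name__ == "__main__":
    print(run_experiment(replicas=3, kills=2))  # True: one replica survives
    print(run_experiment(replicas=3, kills=3))  # False: nothing left to serve
```

Real tools do the same thing against real infrastructure, which is exactly why you start small and in a lower environment.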

As a staunch supporter of and believer in DevOps, I like to have chaos testing of my application as part of the development process itself. Teams in your organization may or may not have the skills to run a resilience test on the application, or they could be busy with deadlines, sprint issues, releasing the next build, etc. This is where tools come in handy, hence my case for the tools above. Chaos engineering can be used to orchestrate application, network and infrastructure failures, which builds the needed resiliency into the application.

Staying away from the history of how it all started at Netflix, let’s stay focused and look at the bigger picture of what you can do with it.

Let me cherry-pick one from the numerous perturbation models out there: LitmusChaos. LitmusChaos is the first toolset I experimented with in my projects at Maveric Systems Limited with my team of SRE Core Engineers. It is a CNCF project and a good fit for running chaos experiments on cloud-native workloads.

LitmusChaos mainly orchestrates chaos on Kubernetes to help SREs like us find our kryptonite. Personally, I was stunned by a question from my fatuous brain: what weakness would a DevOps/SRE engineer have, other than doing some deployments and using some frivolous tools for Continuous Integration & Continuous Delivery?

Oh really! Enter the cognitive brain, with epic and heroic cinematic background music: you silly fatuous brain, let me walk you down memory lane to the sleepless nights you spent troubleshooting application, infra and network issues. And don’t forget the countless cups of black coffee you sipped to stay awake while figuring out what the heck was going on, perplexed by how my system broke and how to fix it. Of course, I had all the infrastructure muscle to host my application, the network to sustain the traffic, and the list goes on.

That’s when I realised that DevOps/SRE teams use chaos engineering first in lower environments and eventually in production to find hidden bugs and vulnerabilities. Now, don’t you agree that fixing those weaknesses makes my system more resilient? I can have a peaceful night and binge-watch Netflix over my weekend, even on the weekend of a production release.

Perfect. Now that I have your attention and have shown you what tricks chaos engineering has up its sleeve, let me help you choose among some of the perturbation models, in layman’s terms.

Let us start with Chaos Monkey, which is famous because Netflix uses it to test the resilience of its IT infrastructure. It randomly terminates instances in production, so go for it if you want to confirm that your application survives losing individual servers. Chaos Gorilla and Chaos Kong are the ones to look at if you host your application on AWS and care about availability at a larger scale: Chaos Gorilla simulates the outage of an entire availability zone, and Chaos Kong the outage of an entire region. Latency Monkey injects artificial delays into the communication between services, so you can see how your application copes with a slow or degraded network.
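The questions these monkeys ask can be sketched in a few lines of Python. This is only a back-of-the-envelope model under stated assumptions, not how the Netflix tools are implemented: the placement dictionary, the zone names and all function names are hypothetical.

```python
def surviving_capacity(placement, failed_zone):
    """Instances left when every instance in failed_zone is gone."""
    return sum(count for zone, count in placement.items() if zone != failed_zone)


def zones_we_cannot_lose(placement, required):
    """Zones whose outage would leave fewer than `required` instances."""
    return [
        zone
        for zone in placement
        if surviving_capacity(placement, zone) < required
    ]


def within_timeout(base_latency_ms, injected_delay_ms, timeout_ms):
    """Would a call still finish in time after adding an artificial delay?"""
    return base_latency_ms + injected_delay_ms <= timeout_ms


if __name__ == "__main__":
    # Hypothetical placement: instance count per availability zone.
    lopsided = {"zone-a": 4, "zone-b": 1, "zone-c": 1}
    balanced = {"zone-a": 2, "zone-b": 2, "zone-c": 2}

    # Zone-outage question (Chaos Gorilla style): we need 3 instances to serve.
    print(zones_we_cannot_lose(lopsided, required=3))  # ['zone-a']
    print(zones_we_cannot_lose(balanced, required=3))  # []

    # Latency question (Latency Monkey style): 120 ms call, 500 ms timeout.
    print(within_timeout(120, 300, 500))  # True
    print(within_timeout(120, 400, 500))  # False
```

The same six instances pass or fail depending only on how they are spread out, which is the kind of weak point a zone-level experiment is meant to surface.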

The most important one, and the elephant in the room, is Security Monkey, my second favourite after LitmusChaos. You are in luck, as Security Monkey is open source and comes from Netflix. What does it do? It monitors your cloud configuration, and it alerts and reports to you if any anomalies are found. The best part is that you can use it across cloud providers such as AWS, Azure, GCP, etc.
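The "monitor, alert, report" idea can be illustrated with a toy check in Python. This is not Security Monkey’s API or rule set; the rule format, the list of sensitive ports and the function name are assumptions made up for this sketch.

```python
# Hypothetical policy: these ports should never be open to the whole internet.
SENSITIVE_PORTS = {22, 3389}


def find_anomalies(rules):
    """Return a human-readable finding for every risky firewall rule."""
    findings = []
    for rule in rules:
        open_to_world = rule["source"] == "0.0.0.0/0"
        if open_to_world and rule["port"] in SENSITIVE_PORTS:
            findings.append(
                f"{rule['name']}: port {rule['port']} is open to the internet"
            )
    return findings


if __name__ == "__main__":
    # Hypothetical firewall rules, as a monitoring job might collect them.
    rules = [
        {"name": "web", "port": 443, "source": "0.0.0.0/0"},
        {"name": "ssh-anywhere", "port": 22, "source": "0.0.0.0/0"},
        {"name": "ssh-office", "port": 22, "source": "10.0.0.0/8"},
    ]
    for finding in find_anomalies(rules):
        print(finding)  # ssh-anywhere: port 22 is open to the internet
```

A real monitor would run checks like this on a schedule and send the findings to whoever is on call.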

Now, there are many more models, around 25 to 30, each designed for specific needs. Cherry-pick the monkey that best keeps you from having a chaotic nightmare.

In the end, I found that this is the way I can know for sure how my system handles failures out in the real world. If you find problems at the very initial stage, it saves you a lot of trouble, be it cost, coding effort, rescaling the infrastructure, etc., before they cascade into a huge problem and break the system. Using chaos engineering will ultimately increase the resilience of your systems, so that failures have little or no impact on your application, environment, infrastructure or the main reason we developed our product in the first place: our customers.

Article by

Daniyal Rayn

DevOps, Site Reliability and Chaos Engineer