Although security chaos engineering may appear similar to penetration testing, the two are not the same. Both are proactive measures intended to find weaknesses or vulnerabilities in a system before they can have an impact. But while pen testing simulates an attack to discover weaknesses, chaos testing intentionally introduces errors, failures and random behavior to assess whether a system can detect and handle them correctly.
A security team requesting to potentially sabotage an application in production will, no doubt, raise plenty of eyebrows. However, these are controlled experiments and, when done correctly, will result in a far more resilient and secure IT environment and increase confidence that the system operates the way it was designed to.
How to use security chaos engineering to test security
An organization's first test should be run in a staging or test environment so team members can become familiar with the tools they are using and how the process works. Whichever environment is being tested, conduct just one experiment at a time, such as a port misconfiguration, so that whatever happens next, the root cause of the incident (the what, when and where) is known.
Note that the aim is never to launch a complex attack against an entire infrastructure. A narrow, controlled experiment is a very different scenario from a live, unexpected incident, in which it is much more difficult to determine where the incident began. It also means those involved are not under the intense pressure of a real attack and can learn more from the exercise. Avoid introducing too many variables; otherwise, there can be cascading issues that obscure visibility into what happened to the initial target.
Everyone who may be affected by the exercise or is needed to bring the system back to a stable state should be involved in the test. This could include software and network engineers and the incident response and security teams. It helps if everyone has some coding knowledge, so they understand how software is built and the complexities involved, particularly as infrastructure as code becomes more common.
Preplanning a test is also imperative. Clearly define what the test will involve, map out what should happen if systems and security controls perform as designed, and decide at what point the experiment will terminate. With port misconfiguration, for example, document which firewall rules should catch unauthorized traffic through the port, what events should be generated, where events should be logged and processed, and who should receive the alerts. This is an interesting exercise in itself, as software, network and solutions architects may have different models -- mental and diagrammatical -- of how they believe the overall system should function.
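One way to make such a plan concrete is to write it down as data before anything is injected. The following is a minimal Python sketch, not a prescribed format: the class names, field names, log destination, alert recipient and abort conditions are all hypothetical and only illustrate the kind of detail a plan for the port misconfiguration example might capture.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExpectedEvent:
    """One thing that should happen if the controls work as designed."""
    control: str     # which control should react, e.g. "firewall"
    outcome: str     # what it should do, e.g. "block"
    logged_to: str   # where the event should be recorded (hypothetical name)
    alert_to: str    # who should receive the alert (hypothetical name)


@dataclass
class ExperimentPlan:
    """A written-down hypothesis for a single injected fault."""
    name: str
    injected_fault: str
    expected_events: list[ExpectedEvent] = field(default_factory=list)
    abort_conditions: list[str] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """Return planning gaps that should be closed before the test runs."""
        problems = []
        if not self.expected_events:
            problems.append("no expected events defined")
        if not self.abort_conditions:
            problems.append("no abort conditions defined")
        return problems


# Illustrative values only; a real plan would use the organization's own names.
plan = ExperimentPlan(
    name="port-misconfiguration",
    injected_fault="open an unauthorized port on one staging instance",
    expected_events=[
        ExpectedEvent("firewall", "block", "central-log-store", "security-on-call"),
    ],
    abort_conditions=["unforeseen instability", "fault spreads beyond target"],
)

print(plan.gaps())  # an empty list means the plan covers the basics
```

A plan that reports gaps is a signal that the team has not yet agreed on what "working as designed" means, which is worth resolving before the experiment starts.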
Once a test is underway, record each expected and unexpected event and see how quickly the response team can fix the problem using just the information generated by the security controls. If the test starts to create unforeseen problems or instability, it should be terminated and the injected errors removed.
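Recording expected versus unexpected events can be as simple as comparing two sets. The sketch below assumes events have already been reduced to a control name and an action; the `Event` type, the sample events and the error-rate threshold used as a stop signal are hypothetical stand-ins for whatever the organization's own tooling provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    control: str  # e.g. "firewall"
    action: str   # e.g. "block"


def _ordered(events: set[Event]) -> list[Event]:
    """Sort events so results are stable and easy to read."""
    return sorted(events, key=lambda e: (e.control, e.action))


def compare_events(expected: set[Event], observed: list[Event]) -> dict[str, list[Event]]:
    """Split the outcome into confirmed, missing and unexpected events."""
    seen = set(observed)
    return {
        "confirmed": _ordered(expected & seen),   # expected and observed
        "missing": _ordered(expected - seen),     # expected but never observed
        "unexpected": _ordered(seen - expected),  # observed but not planned for
    }


def should_abort(error_rate: float, threshold: float = 0.05) -> bool:
    """Assumed stop rule: end the test if the error rate exceeds the threshold."""
    return error_rate > threshold


expected = {Event("firewall", "block"), Event("siem", "alert")}
observed = [Event("firewall", "block"), Event("ids", "alert")]

result = compare_events(expected, observed)
print(result["missing"])     # controls that stayed silent
print(result["unexpected"])  # behavior nobody predicted
print(should_abort(0.12))    # True: terminate and remove the injected error
```

The "missing" list is usually the most valuable output, because it names the controls that were supposed to react and did not.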
Once a test has concluded, review the outcomes and determine what changes are needed to ensure the system will pass the test the next time. Is there configuration drift between instances? Do log entries need more context? Alerts are fine, but not if they don't say exactly which instance generated them. Other test outcomes may show that some firewall rules are no longer relevant or effective. What did the security controls not do that they were designed and configured to do? Perhaps a shortage of people or skills caused the error to persist longer than it should have. The review will certainly inform the incident response team of its strengths and weaknesses.
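The question of whether alerts carry enough context can be checked mechanically during the review. This minimal sketch assumes alerts are available as dictionaries; the required field names and the sample alerts are hypothetical and should be replaced with whatever the response team actually needs to locate the source of an incident.

```python
# Assumed minimum context for an alert to be actionable (hypothetical fields).
REQUIRED_FIELDS = ("instance_id", "timestamp", "control", "rule_id")


def missing_context(alert: dict) -> list[str]:
    """Return the required fields that are absent or empty in an alert."""
    return [name for name in REQUIRED_FIELDS if not alert.get(name)]


alerts = [
    {"instance_id": "i-staging-01", "timestamp": "2024-01-01T10:00:00Z",
     "control": "firewall", "rule_id": "deny-unauthorized-port"},
    {"timestamp": "2024-01-01T10:00:05Z", "control": "firewall"},
]

for alert in alerts:
    gaps = missing_context(alert)
    if gaps:
        print("alert lacks context:", gaps)
```

Here the second alert is flagged because it does not say which instance or rule produced it, which is exactly the situation that slows a response team down.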
Once the different security controls have been checked in the test environment, security control validation can move on to the production environment. This is important, as there is always some configuration drift or difference between the two in a distributed system due to the speed, scale and complexity of changes as services and resources spin up and down.
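Before moving an experiment to production, it can help to see how far the two environments have drifted apart. The following sketch compares two flat configuration dictionaries; the setting names and values are invented for illustration, and real configurations would typically be exported from the organization's own infrastructure tooling.

```python
MISSING = "<missing>"  # marker for a setting present in only one environment


def config_drift(baseline: dict, other: dict) -> dict:
    """Map each differing setting to its (baseline, other) pair of values."""
    drift = {}
    for key in sorted(baseline.keys() | other.keys()):
        left = baseline.get(key, MISSING)
        right = other.get(key, MISSING)
        if left != right:
            drift[key] = (left, right)
    return drift


# Hypothetical settings for the two environments.
staging = {"open_ports": [22, 443], "log_level": "info", "waf_enabled": True}
production = {"open_ports": [22, 443, 8080], "log_level": "info"}

for setting, (in_staging, in_production) in config_drift(staging, production).items():
    print(f"{setting}: staging={in_staging} production={in_production}")
```

Any setting that appears in this output is a reason not to assume that a result from the test environment will hold in production.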
Available security chaos engineering tools
Although security chaos engineering is a relatively new discipline, there is a growing number of tools available that can help organizations set up and execute chaos tests, including the following:
- ChaoSlingr is focused primarily on AWS infrastructures, pushing failures into a system so security issues can be identified.
- Gremlin has an attack library to help users explore different scenarios and can target containerized services.
- Chaos Automation Platform is a tool developed and used by Netflix to interrogate the deployment pipeline for a user-specified service. It launches experiment and control clusters of that service and injects a failure scenario, and the results of the experiment are reported to the service owner.
- Verica is an automated experimentation platform that integrates with Kubernetes and Apache Kafka out of the box and is built by former members of the Chaos Automation Platform team.
- Chaos Toolkit is an open source project that enables developers to create their own experiments for specific use cases.
Test for potential failures instead of waiting
Given the consistently high number of security failures across the internet, security practices need to advance to handle more distributed, ephemeral and immutable systems. Spending more money on the same tools clearly isn't working, partly because it is difficult for humans to mentally model a modern application and its supporting infrastructure, and it's impossible to pen test such applications in every possible state. Security chaos engineering could be the answer to how organizations can increase the overall security of, and confidence in, their mission-critical applications.