Terraform at scale, and drift detection with Terradrift

As organizations adopt infrastructure as code (IaC) practices and tools like Terraform, keeping infrastructure in a consistent and predictable state becomes increasingly important. Managing Terraform at scale brings its own challenges, chief among them drift caused by manual changes made through the cloud console, or by changes made in Git branches that were never merged into the upstream branch. Such drift is hard to trace and can turn small, quick tasks into long troubleshooting sessions.

One of the main challenges of running Terraform at scale is drift. Drift occurs when the actual state of the infrastructure differs from the desired state defined in the Terraform code. This can happen for several reasons, including manual changes made to the infrastructure outside of Terraform, or changes made in Git branches that are not merged into the master branch. Drift can also occur when provider APIs introduce changes, such as new default fields that are not present in the HCL.

Managing drift at scale is tedious and time-consuming. It can be hard to trace the source of a drift and determine how and when it occurred, especially if you have hundreds of stacks (Terraform directories).

To address these challenges, the open-source tool Terradrift was created. Terradrift is a Terraform drift detection tool that uses terraform-exec under the hood to run terraform plan and report drift either immediately or as Prometheus metrics. This allows organizations to continuously monitor their infrastructure for drift and quickly identify and address issues as they arise.
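
The idea of detecting drift from a plan can be sketched in a few lines. The example below is not Terradrift's implementation (which goes through terraform-exec); it is a minimal illustration that assumes the terraform binary is on PATH and relies on the documented exit codes of `terraform plan -detailed-exitcode`. The function names are hypothetical.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
NO_CHANGES, PLAN_ERROR, CHANGES_PRESENT = 0, 1, 2


def build_plan_command(stack_dir):
    """Return the argv for a non-interactive plan of one stack directory."""
    return [
        "terraform",
        f"-chdir={stack_dir}",
        "plan",
        "-detailed-exitcode",  # 0 = no changes, 1 = error, 2 = changes present
        "-input=false",        # never prompt for missing variables
        "-no-color",
    ]


def has_drift(exit_code):
    """Map a plan exit code to True (drift) or False (clean); raise on error."""
    if exit_code == NO_CHANGES:
        return False
    if exit_code == CHANGES_PRESENT:
        return True
    raise RuntimeError(f"terraform plan failed with exit code {exit_code}")


def check_stack(stack_dir):
    """Run the plan for one stack (assumes `terraform init` was already run)."""
    result = subprocess.run(
        build_plan_command(stack_dir), capture_output=True, text=True
    )
    return has_drift(result.returncode)
```

Calling `check_stack("aws/core-production")` would return True when the plan proposes any change. Treating exit code 1 as an error rather than as "no drift" matters here, since a failed plan says nothing about the state of the infrastructure.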

Terradrift is designed to work in two modes: server and CLI.

Both modes scan all Terraform stacks (directories) in a given working directory and run terraform plan on each to detect drift. The difference is that terradrift-cli runs the scan once, prints the current state, and exits, while terradrift-server scans continuously on a defined schedule and exposes the results as Prometheus metrics on the /metrics endpoint. This gives you flexibility in how drift is reported: you can define Prometheus alerts based on how long a drift has been present, and build dashboards from the metrics stored in Prometheus or any other monitoring platform.

CLI mode

terradrift-cli discovers the stacks in the given workdir, runs terraform plan on each one, and detects drift from the plan output.
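
As a rough illustration of what scanning a workdir for stacks can look like, the sketch below walks a directory tree and treats every directory containing `.tf` files as a stack. This is an assumption made for the example: how Terradrift combines the workdir with its config file to name stacks is not covered here, and `discover_stacks` is a hypothetical helper.

```python
import os


def discover_stacks(workdir):
    """Return sorted relative paths of directories under workdir with *.tf files."""
    stacks = []
    for root, dirs, files in os.walk(workdir):
        # Skip Terraform's local cache so downloaded modules are not
        # mistaken for stacks.
        dirs[:] = [d for d in dirs if d != ".terraform"]
        if any(name.endswith(".tf") for name in files):
            stacks.append(os.path.relpath(root, workdir))
    return sorted(stacks)
```

Pruning `.terraform` in place (via `dirs[:]`) keeps `os.walk` from descending into it at all, which also makes the scan cheaper on large repositories.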

Example


$ terradrift-cli --workdir ./examples/ --config examples/config.yaml        
STACK-NAME      DRIFT   ADD     CHANGE  DESTROY PATH                    TF-VERSION 
api-production  false   0       0       0       gcp/api                 1.2.7     
api-staging     false   0       0       0       gcp/api                 1.2.7     
core-production true    0       0       1       aws/core-production     1.2.7     
core-staging    true    1       0       0       gcp/core-staging        1.0.6

See all details in terradrift-cli
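
If you want to act on the CLI output in a script, for example to fail a CI job when any stack has drifted, the table can be parsed with a few lines of Python. This is a sketch that assumes the column layout shown in the example output above and that no column value contains whitespace; `parse_drift_table` is a hypothetical helper, not part of Terradrift.

```python
def parse_drift_table(output):
    """Parse whitespace-aligned terradrift-cli output into a list of dicts."""
    lines = [line for line in output.splitlines() if line.strip()]
    header = [column.lower() for column in lines[0].split()]
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        row["drift"] = row["drift"] == "true"
        for key in ("add", "change", "destroy"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


# Sample copied from the example run above.
SAMPLE = """\
STACK-NAME      DRIFT   ADD     CHANGE  DESTROY PATH                    TF-VERSION
api-production  false   0       0       0       gcp/api                 1.2.7
api-staging     false   0       0       0       gcp/api                 1.2.7
core-production true    0       0       1       aws/core-production     1.2.7
core-staging    true    1       0       0       gcp/core-staging        1.0.6
"""

drifted = [row["stack-name"] for row in parse_drift_table(SAMPLE) if row["drift"]]
print(drifted)  # ['core-production', 'core-staging']
```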

Server mode (terradrift-server)

You can run the server as in the example below after setting the required environment variables for the GitHub token and the cloud provider.

Example

$ ./terradrift-server --repository https://github.com/username/reponame \
--git-token $GIT_TOKEN \
--config ./config.yaml \
--interval 600

Retrieving the drifts as Prometheus metrics

$ curl http://localhost:8080/metrics
# HELP terradrift_plan_add_resources Number of resources to be added based on tf plan
# TYPE terradrift_plan_add_resources gauge
terradrift_plan_add_resources{stack="api-staging"} 1
terradrift_plan_add_resources{stack="core-production"} 0
# HELP terradrift_plan_change_resources Number of resources to be changed based on tf plan
# TYPE terradrift_plan_change_resources gauge
terradrift_plan_change_resources{stack="api-staging"} 0
terradrift_plan_change_resources{stack="core-production"} 0
# HELP terradrift_plan_destroy_resources Number of resources to be destroyed based on tf plan
# TYPE terradrift_plan_destroy_resources gauge
terradrift_plan_destroy_resources{stack="api-staging"} 0
terradrift_plan_destroy_resources{stack="core-production"} 1

See terradrift-server for all the details on getting started, including deploying terradrift-server as a Helm chart or with any of the other available options.
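
In practice Prometheus scrapes and evaluates these metrics for you, but it can help to see how the three gauges combine into a per-stack drift signal. The sketch below assumes the exact metric names and the single `stack` label shown in the /metrics output above; `pending_changes` is a hypothetical helper, and a real scrape with extra labels would need a proper exposition-format parser.

```python
import re
from collections import defaultdict

# Matches lines such as: terradrift_plan_add_resources{stack="api-staging"} 1
METRIC_LINE = re.compile(
    r'^terradrift_plan_(add|change|destroy)_resources'
    r'\{stack="([^"]+)"\}\s+(\d+(?:\.\d+)?)$'
)


def pending_changes(metrics_text):
    """Sum add, change, and destroy counts per stack from /metrics text."""
    totals = defaultdict(int)
    for line in metrics_text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:  # comment lines (# HELP, # TYPE) never match
            _action, stack, value = match.groups()
            totals[stack] += int(float(value))
    return dict(totals)


# Sample copied from the curl output above.
SAMPLE = """\
# HELP terradrift_plan_add_resources Number of resources to be added based on tf plan
# TYPE terradrift_plan_add_resources gauge
terradrift_plan_add_resources{stack="api-staging"} 1
terradrift_plan_add_resources{stack="core-production"} 0
terradrift_plan_change_resources{stack="api-staging"} 0
terradrift_plan_change_resources{stack="core-production"} 0
terradrift_plan_destroy_resources{stack="api-staging"} 0
terradrift_plan_destroy_resources{stack="core-production"} 1
"""

print(pending_changes(SAMPLE))  # {'api-staging': 1, 'core-production': 1}
```

A stack with a total above zero has drifted; the same condition, written as a query over the three gauges, is what a Prometheus alert or dashboard panel would evaluate.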

Terradrift works with any cloud provider your existing Terraform code already targets. If your normal terraform commands require certain environment variables to be exported, all you have to do is include the --extra-backend-vars flag.

In conclusion, managing Terraform at scale presents several challenges, drift among them. Terradrift helps organizations address drift by continuously scanning their stacks and giving them up-to-date visibility into the state of the infrastructure. This allows organizations to maintain a consistent and predictable state of their infrastructure while reducing the risk of untracked changes.