At-Scale Infrastructure Testing with Sentinels

Automate testing over any set of Kubernetes clusters with the Sentinel resource

Validating an infrastructure change is a complex task, and it has no equivalent of the local unit test that works so well at the application layer. Slight differences between environments almost always require some degree of live integration testing. Multiply this across n Kubernetes clusters, for any large n, and automation becomes a necessity.

Sentinels provide a flexible abstraction for this problem. In particular, they let you bundle a sequence of checks that can:

Run terratest-based integration tests across any subset of your clusters and aggregate the results
Tail logs across any set of clusters using search filters, and analyze them with AI and git-sourced rules files
Deep-query a Kubernetes resource on a cluster and analyze its health with AI and git-sourced rules files
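As a rough illustration of the "run across a subset of clusters and aggregate the results" idea, the sketch below models per-cluster check results and folds them into a single summary. The `CheckResult` and `aggregate` names, the result fields, and the cluster names are all hypothetical; they do not reflect the actual Sentinel resource schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one check on one cluster (hypothetical shape)."""
    cluster: str
    check: str
    passed: bool
    detail: str = ""


def aggregate(results):
    """Fold per-cluster results into one summary.

    The run passes only if every check passed on every cluster.
    Failures are grouped by cluster so they are easy to triage.
    """
    failures = defaultdict(list)
    for r in results:
        if not r.passed:
            message = f"{r.check}: {r.detail}" if r.detail else r.check
            failures[r.cluster].append(message)
    return {
        "total": len(results),
        "failed": sum(len(msgs) for msgs in failures.values()),
        "passed": not failures,
        "failures": dict(failures),
    }


# Example: one check run against two (made-up) clusters.
results = [
    CheckResult("prod-us", "integration-tests", True),
    CheckResult("prod-eu", "integration-tests", False, "pod pending"),
]
summary = aggregate(results)
```

Here a single failing cluster fails the whole run, which matches the usual expectation for a regression gate; a real system might instead allow per-check thresholds.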

Once a sentinel is defined, it can be run on demand at any time via the API. A run can be triggered:

in our UI
in GitHub Actions or other CI systems
in Plural pipelines

Some common use cases that we find them particularly well suited for are:

Validating that Kubernetes upgrades do not introduce regressions
Cross-cutting Kubernetes operator changes (e.g. Istio upgrades)
Validating that network reconfigurations are safe

But there are likely many more.

The motivation behind all of these, and behind the use of AI, is that confirming infrastructure health often requires aggregating multiple textual data sources and interpreting them with some degree of discretion, which consumes significant engineering hours. That interpretation cannot be done deterministically, so a governed AI-based approach is needed. For deterministic correctness, a full terratest run can exercise common paths, such as validating that pods start, storage volumes can be mounted, and networking is enabled.
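To make the deterministic side concrete, here is a minimal sketch of a "pods start" check. It is a Python stand-in for illustration only: it is not terratest (which is a Go library) and not how Sentinels implement the check. It assumes the input is a parsed Kubernetes PodList, as produced by `kubectl get pods -o json`; the `unhealthy_pods` helper and the sample pod names are hypothetical.

```python
def unhealthy_pods(pod_list):
    """Return the names of pods that have not started cleanly.

    A pod counts as healthy if it has Succeeded (e.g. a completed job),
    or if it is Running and every container reports ready.
    """
    bad = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue
        containers = status.get("containerStatuses", [])
        ready = bool(containers) and all(c.get("ready", False) for c in containers)
        if phase != "Running" or not ready:
            bad.append(name)
    return bad


# Example input shaped like a (trimmed) PodList.
sample = {
    "items": [
        {
            "metadata": {"name": "api-0"},
            "status": {"phase": "Running", "containerStatuses": [{"ready": True}]},
        },
        {"metadata": {"name": "api-1"}, "status": {"phase": "Pending"}},
    ]
}
failing = unhealthy_pods(sample)
```

A check like this has a single right answer for a given input, which is what separates it from the log and resource analysis above, where the signal has to be interpreted.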