Open Source · Apache 2.0 · 419/419 Tests Passing

Turn an alert into
an investigation
in 60 seconds.

Tarka converts Prometheus/Alertmanager alerts into structured triage reports — evidence, confidence-ranked hypotheses, and copy-paste-ready commands. No guessing. Just signal.

→ Get Started (5 min) · View on GitHub


The Problem

Most alerts still require a senior engineer to decode them.

Every on-call rotation has a few people who carry the tribal knowledge. They know which PromQL to run, which namespace to check, and which logs actually matter. Everyone else waits.

The result: a high mean time to acknowledge (MTTA), senior burnout, and incidents that drag on because the first responder isn't sure what "normal" looks like.

Tarka encodes the first 60 seconds of every investigation so every engineer on your team can respond with confidence — not just the one who's been there since day one.

Before Tarka

✗ Slack: "hey @senior-sre can you look at this?"

✗ 10 browser tabs: Grafana, k9s, Kibana, Alertmanager…

✗ "Is this the same as last week's thing?"

✗ 15 minutes to find the relevant PromQL query

✗ Postmortem: "root cause: unknown"

After Tarka

✓ Alert fires → structured report in < 60s

✓ Evidence already gathered: K8s state, metrics, logs

✓ Similar past incidents surfaced from case memory

✓ Copy-paste kubectl and PromQL commands ready to run

✓ Explicit when evidence is missing, never guesses

How It Works

A structured investigation pipeline, every time.

The same 11-stage pipeline runs whether you invoke Tarka from the CLI, a webhook, or the web UI.

Alert Ingestion

Prometheus/Alertmanager alert arrives via CLI, webhook POST, or web UI. Labels extracted, target inferred automatically.
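
As a rough illustration of the ingestion step, the sketch below parses an Alertmanager webhook payload and infers a target from its labels. The payload shape follows Alertmanager's documented webhook format; the `Target` type, the `infer_target` and `parse_webhook` helpers, and the label precedence are assumptions made for illustration, not Tarka's actual code.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Target:
    kind: str                 # "pod", "workload", "namespace", or "unknown"
    namespace: Optional[str]
    name: Optional[str]


def infer_target(labels: dict) -> Target:
    """Pick the most specific target the labels describe (assumed precedence)."""
    namespace = labels.get("namespace")
    if labels.get("pod"):
        return Target("pod", namespace, labels["pod"])
    for key in ("deployment", "statefulset", "daemonset"):
        if labels.get(key):
            return Target("workload", namespace, labels[key])
    if namespace:
        return Target("namespace", namespace, None)
    return Target("unknown", None, None)


def parse_webhook(body: str) -> List[Tuple[str, Target]]:
    """Return (alertname, Target) pairs for every firing alert in the payload."""
    payload = json.loads(body)
    results = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no triage in this sketch
        labels = alert.get("labels", {})
        results.append((labels.get("alertname", "unknown"), infer_target(labels)))
    return results


example = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "KubePodCrashLooping", "namespace": "prod", "pod": "my-app"},
        "annotations": {},
    }],
})
print(parse_webhook(example))
```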

Multi-Source Evidence

Best-effort reads from Prometheus, Kubernetes API, and logs. Missing sources noted explicitly, never faked.
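
One way to picture "best-effort" collection: each source is tried independently, and a failure becomes a recorded gap rather than a fabricated value. The `Evidence` type, the `collect` helper, and the stand-in fetchers below are hypothetical; real fetchers would query Prometheus, the Kubernetes API, and a log store.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Evidence:
    data: Dict[str, Any] = field(default_factory=dict)
    gaps: List[str] = field(default_factory=list)  # notes on what could not be read


def collect(sources: Dict[str, Callable[[], Any]]) -> Evidence:
    """Call every source; a failure is recorded as a gap, never replaced by a guess."""
    evidence = Evidence()
    for name, fetch in sources.items():
        try:
            evidence.data[name] = fetch()
        except Exception as exc:  # broad on purpose: any failure is just a gap
            evidence.gaps.append(f"{name}: unavailable ({exc})")
    return evidence


def fetch_logs() -> Any:
    # Simulates an optional log backend that is not configured.
    raise ConnectionError("log backend not configured")


evidence = collect({
    "prometheus": lambda: {"restarts_10m": 3},
    "kubernetes": lambda: {"waiting_reason": "CrashLoopBackOff"},
    "logs": fetch_logs,
})
print(evidence.gaps)
```

The report can then print `evidence.gaps` verbatim, so the reader sees exactly which sources were missing.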

27+ Diagnostics

Crash loops, OOM kills, image pull failures, CPU saturation, change correlation. Each produces confidence-scored hypotheses.
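
A minimal sketch of how independent diagnostics could each emit a confidence-scored hypothesis that is then ranked. The `Hypothesis` type, the three diagnostic functions, the evidence keys, and every confidence value are illustrative assumptions, not Tarka's real diagnostics or scoring.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Hypothesis:
    label: str
    confidence: float  # 0.0 to 1.0, illustrative scale
    why: str


def diagnose_oom(ev: Dict) -> Optional[Hypothesis]:
    if ev.get("last_termination_reason") == "OOMKilled":
        return Hypothesis("oom_kill", 0.9, "container terminated with reason OOMKilled")
    return None


def diagnose_crash_loop(ev: Dict) -> Optional[Hypothesis]:
    if ev.get("waiting_reason") == "CrashLoopBackOff":
        restarts = ev.get("restarts_10m", 0)
        confidence = min(0.5 + 0.1 * restarts, 0.8)  # more restarts, more confidence, capped
        return Hypothesis("crash_loop", confidence, f"{restarts} restarts in 10m")
    return None


def diagnose_recent_change(ev: Dict) -> Optional[Hypothesis]:
    age = ev.get("last_deploy_age_hours")
    if age is not None and age < 1:
        return Hypothesis("recent_rollout", 0.7, f"deploy {age:.1f}h before alert")
    return None


DIAGNOSTICS: List[Callable[[Dict], Optional[Hypothesis]]] = [
    diagnose_oom,
    diagnose_crash_loop,
    diagnose_recent_change,
]


def run_diagnostics(ev: Dict) -> List[Hypothesis]:
    """Run every diagnostic and rank the hypotheses, highest confidence first."""
    found = [h for check in DIAGNOSTICS if (h := check(ev)) is not None]
    return sorted(found, key=lambda h: h.confidence, reverse=True)


ranked = run_diagnostics({
    "waiting_reason": "CrashLoopBackOff",
    "restarts_10m": 3,
    "last_termination_reason": "Error",
    "last_deploy_age_hours": 0.5,
})
for h in ranked:
    print(f"{h.label}: {h.why}")
```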

Actionable Report

One-line verdict: label + why + next. Evidence-backed bullets, ranked hypotheses, 3–7 copy-paste commands.

          
        

triage-report: KubePodCrashLooping / prod / my-app

## Triage
scope:         pod
discriminator: CrashLoopBackOff
impact:        pod unavailable (3 restarts in 10m)
## Why
- Container exiting with code 1 (OOMKilled: no)
- Last 3 log lines before exit: FATAL: database connection refused
- No recent rollout detected (last deploy: 6 days ago)
- Liveness probe failing: /healthz → 502 (upstream timeout)
## Next
kubectl logs -n prod my-app --previous --tail=50
kubectl describe pod -n prod my-app
kubectl get events -n prod --sort-by=lastTimestamp

Features

Everything you need for confident triage.

100% Read-Only

Never mutates cluster state. Every operation is a read — Prometheus queries, kubectl get, log fetches. Safe to run during a live incident.
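
A read-only guarantee can be enforced mechanically with an allowlist that fails closed. The sketch below is one possible guard, assuming commands take the form `kubectl <verb> ...`; the `READ_ONLY_VERBS` set and the `is_read_only` helper are hypothetical and are not taken from Tarka's source.

```python
import shlex

# Assumed allowlist of kubectl verbs that only read cluster state.
READ_ONLY_VERBS = {"get", "describe", "logs", "top", "events"}

# Characters that could chain or redirect commands if a shell were involved.
SHELL_METACHARACTERS = ";|&`$<>"


def is_read_only(command: str) -> bool:
    """Accept only `kubectl <read-verb> ...`; reject everything else (fail closed)."""
    if any(ch in command for ch in SHELL_METACHARACTERS):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes or similar parse errors
    return len(tokens) >= 2 and tokens[0] == "kubectl" and tokens[1] in READ_ONLY_VERBS


print(is_read_only("kubectl logs -n prod my-app --previous --tail=50"))
print(is_read_only("kubectl delete pod -n prod my-app"))
```

Requiring the verb in second position rejects some valid read commands (for example, ones that put `-n prod` before the verb), which is the intended trade-off for a guard that errs on the side of refusing.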

Works Without LLM

Deterministic base triage runs without any AI model. Add Vertex AI or Claude only when you want narrative enrichment on top.
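
The "optional LLM" idea can be sketched as a deterministic verdict with an enrichment hook that is allowed to be absent or to fail. The function names and the verdict format below are assumptions for illustration; the `enrich` callable stands in for whatever model client is configured.

```python
from typing import Callable, Optional


def base_verdict(label: str, why: str, next_step: str) -> str:
    """Deterministic one-line verdict: label + why + next."""
    return f"{label}: {why}. Next: {next_step}"


def build_summary(
    label: str,
    why: str,
    next_step: str,
    enrich: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the base verdict, optionally rewritten by an enrichment callable."""
    verdict = base_verdict(label, why, next_step)
    if enrich is None:
        return verdict  # no model configured: the deterministic text is the answer
    try:
        return enrich(verdict)
    except Exception:
        return verdict  # enrichment is optional; a failure never blocks the report


print(build_summary(
    "CrashLoopBackOff",
    "database connection refused",
    "kubectl logs -n prod my-app --previous --tail=50",
))
```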

Honest About Unknowns

When evidence is missing, Tarka says so explicitly. Scenarios A–D describe exactly what's blocked. No hallucinated root causes.

Copy-Paste Friendly

Every report ends with PromQL-first, kubectl-second commands you can run immediately. Designed for responders who need to act.

Case Memory

Stores every investigation in PostgreSQL. Surfaces similar past incidents during triage. Skills extracted from resolved cases inject relevant suggestions.
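
To make "similar past incidents" concrete, here is a small sketch that ranks stored cases by Jaccard similarity over their alert labels. It uses an in-memory dict as a stand-in for the PostgreSQL store; the similarity measure, the threshold, and the case IDs are assumptions, and Tarka's actual matching logic may differ.

```python
from typing import Dict, List, Tuple


def jaccard(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Share of label key/value pairs the two alerts have in common."""
    left, right = set(a.items()), set(b.items())
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def similar_cases(
    current: Dict[str, str],
    past: Dict[str, Dict[str, str]],
    threshold: float = 0.5,
) -> List[Tuple[float, str]]:
    """Return (score, case_id) pairs at or above the threshold, best match first."""
    scored = [(jaccard(current, labels), case_id) for case_id, labels in past.items()]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


current = {"alertname": "KubePodCrashLooping", "namespace": "prod", "app": "my-app"}
past = {
    "case-101": {"alertname": "KubePodCrashLooping", "namespace": "prod", "app": "my-app"},
    "case-087": {"alertname": "KubePodCrashLooping", "namespace": "staging", "app": "my-app"},
    "case-042": {"alertname": "HighCPU", "namespace": "prod", "app": "billing"},
}
print(similar_cases(current, past))
```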

3 Deployment Modes

CLI for laptop investigations. Webhook mode for in-cluster automation. React console for team-wide visibility. One codebase, three surfaces.

See It in Action

A triage console built for on-call engineers.

Case Inbox — all firing alerts in one place, scored by impact and confidence.

Integrations

Plugs into your existing stack.

Required: Prometheus/Alertmanager. Everything else is optional and degrades gracefully.

Core

Prometheus, Alertmanager, Kubernetes

Optional

Loki, VictoriaLogs, AWS CloudTrail, GitHub, Slack, Vertex AI, Anthropic Claude, PostgreSQL, NATS JetStream

When optional sources are unavailable, Tarka records the gap and continues. The report explicitly states what's missing.

Deployment

Run it your way.

mode: cli

Local Investigation

Run on your laptop in 5 minutes. Point at any Prometheus-compatible endpoint. No infrastructure required.

$ poetry install
$ python main.py --list-alerts
$ python main.py --alert 0

→ Quickstart Guide

mode: webhook (most common)

In-Cluster Automation

Alertmanager fires → webhook → NATS JetStream queue → worker pool → reports in S3. Zero human intervention required during incidents.

Alertmanager
  → FastAPI webhook
  → NATS JetStream
  → Worker pool
  → S3 + Web UI
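
The queue-and-worker shape of this flow can be sketched with the standard library alone. In the example below, `asyncio.Queue` stands in for NATS JetStream and a plain list stands in for S3; the `worker` and `run` functions are hypothetical, and a real worker would run the investigation pipeline where the comment indicates.

```python
import asyncio
from typing import Dict, List


async def worker(name: str, queue: asyncio.Queue, reports: List[Dict]) -> None:
    """Pull alerts off the queue forever; each one becomes a stored report."""
    while True:
        alert = await queue.get()
        try:
            # A real worker would run the investigation pipeline here.
            reports.append({"alertname": alert["alertname"], "worker": name})
        finally:
            queue.task_done()


async def run(alerts: List[Dict], pool_size: int = 3) -> List[Dict]:
    queue: asyncio.Queue = asyncio.Queue()
    reports: List[Dict] = []
    workers = [
        asyncio.create_task(worker(f"w{i}", queue, reports)) for i in range(pool_size)
    ]
    for alert in alerts:  # a webhook handler would enqueue alerts as they arrive
        queue.put_nowait(alert)
    await queue.join()    # block until every queued alert has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return reports


reports = asyncio.run(run([
    {"alertname": "KubePodCrashLooping"},
    {"alertname": "HighCPU"},
]))
print(sorted(r["alertname"] for r in reports))
```

Decoupling the webhook from the workers through a queue means a burst of alerts is buffered rather than dropped, and the pool size caps how many investigations run at once.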

→ Deployment Guide

mode: web-ui

Team Console

React-based case browser. See all investigations, drill into reports, chat with the agent, and review historical patterns across your team.

→ See Screenshots

Turn an alert into an investigation in 60 seconds.