Complete AI prompt library for DevOps engineers. Covers Kubernetes manifests, Helm charts, GitHub Actions CI/CD, Terraform IaC, Docker multi-stage builds, monitoring with Prometheus and Grafana, security scanning, and production deployment patterns.
DevOps & Kubernetes in 2026: Infrastructure as Code at Scale
Modern DevOps is defined by reproducibility — every environment identical, every deployment automated, every configuration in version control. AI dramatically accelerates Kubernetes manifest writing, Terraform module creation, and CI/CD pipeline design when prompted with the right production requirements. These prompts produce infrastructure code that passes a production readiness review — not just configurations that deploy without errors.
Picking the Right AI Model for DevOps + K8s Work
Infrastructure code has a high blast radius when it is wrong. The model comparison here focuses on which models produce YAML and HCL that actually work, versus output that looks plausible but hides subtle misconfigurations.
| Model | Best For (DevOps/K8s) | Weak Spot | When to Reach For It |
|---|---|---|---|
| Claude Sonnet | Helm chart architecture, Terraform module design, GitHub Actions complex workflows | Kubernetes API version drift — occasionally uses older apiVersions | Designing a multi-environment Terraform layout, reviewing a Helm values.yaml for security issues, planning the ArgoCD app-of-apps pattern |
| ChatGPT GPT-5.5 | Kubernetes manifest generation, Dockerfile writing, GitHub Actions job boilerplate | Missing resource limits/requests; outdated Kubernetes API versions (apps/v1beta1) | Generating a Deployment manifest from a description, writing GitHub Actions matrix builds |
| Gemini 3.5 Flash | Current Kubernetes release notes, cloud provider (GKE/EKS/AKS) specifics, SRE research | Less precise on Helm templating specifics | Researching GKE Autopilot vs Standard for your use case, checking current K8s version compatibility |
| Cursor | YAML completion with schema validation for K8s manifests | Terraform module variable propagation in complex module graphs | Writing K8s manifests with real-time schema validation, completing Helm templates |
| GitHub Copilot | Completing Dockerfile layers, GitHub Actions step patterns | Kubernetes YAML — generates syntactically valid but semantically wrong configs | Completing Dockerfile RUN layers, writing GitHub Actions with/env blocks |
| Grok | Direct infrastructure trade-off analysis: Kubernetes vs ECS, Terraform vs Pulumi | Less depth on Helm chart template specifics | Getting a direct opinion on whether Kubernetes is overkill for your team size and traffic |
| DeepSeek | Bash scripting, simple CI/CD steps, documentation | Kubernetes security configurations (PSA, NetworkPolicy, RBAC) are often incomplete | Writing CI/CD shell scripts, generating documentation for runbooks |
For Kubernetes specifically: always validate AI-generated YAML with kubectl apply --dry-run=client and a schema validator (kubeconform, or the older kubeval, which is no longer maintained) before applying to any cluster. The most common AI error is outdated apiVersions. Kubernetes deprecates and removes APIs on a predictable schedule, and models trained before a removal date generate removed APIs that the API server will reject. Check the K8s API deprecation guide for your cluster version.
1. Kubernetes Deployment with Production Best Practices
You are a senior Kubernetes engineer who has run production clusters at scale.
Write a complete Kubernetes Deployment manifest for a Node.js API service:
Deployment requirements:
- replicas: 3 (for HA)
- Strategy: RollingUpdate with maxUnavailable: 0, maxSurge: 1 (zero-downtime deployments)
- Resource requests: cpu: 100m, memory: 128Mi
- Resource limits: cpu: 500m, memory: 512Mi (REQUIRED — explain why missing limits cause node starvation)
- Image: specify imagePullPolicy: Always with SHA digest (not mutable tag)
Probes (explain the difference before writing each):
- startupProbe: GET /health, failureThreshold: 30, periodSeconds: 10 (slow start tolerance)
- readinessProbe: GET /health, failureThreshold: 3, periodSeconds: 5 (traffic control)
- livenessProbe: GET /health, failureThreshold: 3, periodSeconds: 15 (restart trigger)
Security context (REQUIRED on every production deployment):
- runAsNonRoot: true, runAsUser: 1001
- readOnlyRootFilesystem: true
- allowPrivilegeEscalation: false
- capabilities: drop: [ALL]
Also write: Service (ClusterIP), HorizontalPodAutoscaler (min:3, max:10, CPU 70%), PodDisruptionBudget (minAvailable: 2).
Add a comment on every field explaining why it exists.
Why it works: Requesting comments on every field forces the AI to explain the production reasoning — you get documentation that would normally take a Kubernetes expert's code review.
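For orientation, a response to this prompt might contain a Deployment along the following lines. This is a minimal sketch rather than a reviewed manifest: the name my-api, the registry path, container port 3000, and the emptyDir mount for /tmp are assumptions made for illustration, and the image digest is a placeholder that must be replaced.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api                      # hypothetical service name
  labels:
    app.kubernetes.io/name: my-api
spec:
  replicas: 3                       # three replicas for availability
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0             # never drop below desired capacity
      maxSurge: 1                   # add one extra pod at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: my-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: my-api
    spec:
      securityContext:              # pod-level defaults
        runAsNonRoot: true
        runAsUser: 1001
      containers:
        - name: api
          # placeholder digest: substitute the real sha256 of your image
          image: "registry.example.com/my-api@sha256:<digest>"
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 3000   # assumed application port
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          startupProbe:             # tolerates up to 300s of slow start
            httpGet:
              path: /health
              port: http
            failureThreshold: 30
            periodSeconds: 10
          readinessProbe:           # gates traffic from the Service
            httpGet:
              path: /health
              port: http
            failureThreshold: 3
            periodSeconds: 5
          livenessProbe:            # restarts the container when it hangs
            httpGet:
              path: /health
              port: http
            failureThreshold: 3
            periodSeconds: 15
          securityContext:          # container-level hardening
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: tmp             # writable scratch space (assumption)
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
```

With readOnlyRootFilesystem enabled, most Node.js services need at least one writable mount such as the /tmp volume shown here; whether yours does depends on the application.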
2. Helm Chart for Microservice
You are a Helm chart expert.
Create a production Helm chart for a Node.js microservice:
Chart structure:
charts/my-service/
Chart.yaml (name, version, appVersion, description)
values.yaml (all defaults)
templates/
deployment.yaml
service.yaml
ingress.yaml
hpa.yaml
pdb.yaml
configmap.yaml
secret.yaml (sealed-secrets or external-secrets reference)
serviceaccount.yaml
_helpers.tpl (common labels, selector labels, name truncation)
values.yaml defaults:
- replicaCount: 3
- image.repository, image.tag (overridden per environment)
- resources.requests and resources.limits
- autoscaling.enabled, minReplicas, maxReplicas, targetCPU
- ingress.enabled, ingress.hosts, ingress.tls
- env: {} (map of environment variables from ConfigMap or Secret)
- probes: liveness and readiness enabled by default
Environment overrides (values-prod.yaml):
- Higher resource limits, more replicas, different image tag, production ingress host
Provide:
1. All template files with proper Go templating
2. NOTES.txt explaining post-install steps
3. Helm install command for staging and production
4. How to use helm diff before upgrading
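As a rough picture of the values layout this prompt asks for, the sketch below shows a possible values.yaml followed by a values-prod.yaml override. The key names follow the prompt; the nesting under ingress and probes, the registry path, and the hostname are assumptions, and the two documents are shown in one block only for brevity (they would live in separate files).

```yaml
# values.yaml: chart defaults (hypothetical layout)
replicaCount: 3

image:
  repository: registry.example.com/my-service   # placeholder registry
  tag: ""                # overridden per environment with the git SHA

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPU: 70          # target average CPU utilization, in percent

ingress:
  enabled: false
  hosts: []
  tls: []

env: {}                  # map of environment variables

probes:
  liveness:
    enabled: true
  readiness:
    enabled: true
---
# values-prod.yaml: only the keys that differ in production
replicaCount: 5

resources:
  limits:
    cpu: "1"
    memory: 1Gi

ingress:
  enabled: true
  hosts:
    - api.example.com    # placeholder production host
```

Helm deep-merges the override file onto the defaults, so values-prod.yaml only needs the keys that change.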
3. GitHub Actions — Full CI/CD Pipeline
You are a GitHub Actions expert building production CI/CD pipelines.
Write a complete GitHub Actions workflow for a Node.js microservice deployed to Kubernetes:
Trigger: push to main, PRs to main, manual dispatch (workflow_dispatch)
Jobs:
test (every push and PR):
- ubuntu-latest
- Setup Node 22, pnpm cache by pnpm-lock.yaml hash
- pnpm install --frozen-lockfile
- TypeScript compile check, ESLint, Prettier
- Jest unit + integration tests (Testcontainers PostgreSQL via Docker-in-Docker)
- Upload coverage to Codecov
- Fail if coverage drops below 80%
security (parallel to test):
- npm audit --audit-level=high
- Trivy image scan: build the image in this job and scan it for OS vulnerabilities (CRITICAL severity fails the build); use trivy config for Dockerfile misconfigurations
- SARIF output uploaded to GitHub Security tab
build (after test passes, main branch only):
- Multi-stage Docker build with BuildKit cache
- Tag with: git SHA (immutable) + 'latest'
- Push to AWS ECR (OIDC auth — no long-lived secrets)
- Image size check: fail if > 200MB
deploy-staging (after build):
- Update Helm values: image.tag = git SHA
- helm upgrade --install my-service charts/my-service -f values-staging.yaml
- Wait for rollout: kubectl rollout status deployment/my-service
- Smoke test: curl staging.api.example.com/health — assert 200
deploy-production (manual approval gate, after deploy-staging):
- environment: production (with required reviewers in GitHub)
- Same helm upgrade to production namespace
- Slack notification: success with deploy URL, failure with rollback command
Secrets: use OIDC for AWS (no access keys), GitHub Environment Secrets for Slack.
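The skeleton below sketches how the test job, the OIDC-authenticated build job, and the production approval gate could fit together. It is deliberately incomplete: the security and deploy-staging jobs, the coverage gate, and cluster authentication are omitted, and the IAM role ARN, region, repository name my-service, pnpm version, and the pnpm test script are all assumptions.

```yaml
name: ci
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:

permissions:
  contents: read

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
        with:
          version: 9                 # assumed pnpm major version
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm                # cache keyed on pnpm-lock.yaml
      - run: pnpm install --frozen-lockfile
      - run: pnpm test               # assumes a "test" script exists

  build:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write                # lets the job request an OIDC token
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          # hypothetical role that trusts GitHub's OIDC provider
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-ecr
          aws-region: eu-west-1      # assumed region
      - id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push image tagged with the git SHA
        env:
          IMAGE: ${{ steps.ecr.outputs.registry }}/my-service:${{ github.sha }}
        run: |
          docker build -t "$IMAGE" .
          docker push "$IMAGE"

  deploy-production:
    needs: build                     # the full pipeline would need deploy-staging
    runs-on: ubuntu-latest
    environment: production          # required reviewers are set in repo settings
    steps:
      - run: echo "helm upgrade would run here after approval"
```

The approval gate comes from the environment setting: GitHub pauses the job until a configured reviewer approves, so no extra workflow logic is needed for it.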
4. Terraform Infrastructure
You are a Terraform expert for AWS infrastructure.
Write Terraform (1.8+) to provision a production Kubernetes application infrastructure on AWS:
Modules to create:
module/networking:
- VPC with public, private, and database subnets across 3 AZs
- NAT Gateway (one per AZ for HA)
- VPC flow logs to CloudWatch
module/eks:
- EKS cluster 1.30 with managed node groups
- Node groups: general (t3.medium, min:3, max:10), memory-optimised for databases
- IRSA (IAM Roles for Service Accounts) for pods to access AWS services
- aws-load-balancer-controller, cluster-autoscaler, external-dns via Helm
module/rds:
- PostgreSQL 16 Multi-AZ RDS instance
- Subnet group in private subnets
- Security group: only allow from EKS security group on 5432
- Automated backups 7 days, encryption at rest
- Parameter group for performance tuning (shared_buffers, work_mem)
module/elasticache:
- Redis 7 cluster mode disabled (single primary + replica)
- Private subnets only
- Auth token (password) stored in AWS Secrets Manager
Global:
- All resources tagged: Environment, Team, CostCenter, ManagedBy=terraform
- Remote state: S3 bucket + DynamoDB table for locking
- Variables with validation blocks for required values
- Outputs: cluster endpoint, RDS endpoint, ECR URLs
Output: all module files with variables.tf, main.tf, outputs.tf, and root configuration.
5. Monitoring with Prometheus & Grafana
You are a Kubernetes observability expert.
Set up production monitoring for a Node.js microservice on Kubernetes:
Prometheus scraping:
- ServiceMonitor (Prometheus Operator CRD) for the microservice
- Expose /metrics endpoint: express-prometheus-middleware for Node.js
- Custom metrics: http_request_duration_seconds (histogram), http_requests_total (counter by status), active_connections (gauge), background_job_duration_seconds
Prometheus Alert Rules (PrometheusRule CRD):
- Critical: error rate > 1% for 5 minutes (5xx / total)
- Critical: pod unavailable (all replicas down)
- Warning: P99 latency > 500ms for 10 minutes
- Warning: pod restarts > 3 in 5 minutes
- Info: deployment rollout in progress
Grafana Dashboard (JSON model for 4 panels):
- Request rate (requests per second by status code)
- Error rate percentage (5xx / total, red threshold at 1%)
- Latency percentiles (P50, P95, P99 on one graph)
- Infrastructure (CPU usage, memory usage, pod count)
AlertManager routing:
- Critical → PagerDuty (immediate)
- Warning → Slack #alerts channel
- Resolved notifications on both channels
Output: ServiceMonitor YAML, PrometheusRule YAML, AlertManager config, and Grafana dashboard JSON.
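To make the alert thresholds concrete, here is one way the ServiceMonitor and two of the alert rules could look. Treat it as a sketch under assumptions: it presumes the Prometheus Operator CRDs are installed, that the Service is named my-service with a port called http, that the counter carries a status label, and that the release label matches your Prometheus rule and monitor selectors.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  labels:
    release: kube-prometheus-stack   # assumption: matches the operator's selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-service
  endpoints:
    - port: http                     # named port on the Service
      path: /metrics
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-alerts
  labels:
    release: kube-prometheus-stack   # assumption: matches the operator's selector
spec:
  groups:
    - name: my-service.rules
      rules:
        - alert: HighErrorRate
          # ratio of 5xx responses to all responses over 5 minutes
          expr: |
            sum(rate(http_requests_total{job="my-service", status=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{job="my-service"}[5m]))
              > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: 5xx error rate above 1% for 5 minutes
        - alert: HighP99Latency
          # P99 computed from the histogram buckets, threshold 500ms
          expr: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket{job="my-service"}[5m]))
            ) > 0.5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: P99 latency above 500ms for 10 minutes
```

The severity labels are what the AlertManager routes in the prompt would match on to send critical alerts to PagerDuty and warnings to Slack.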
6. Docker Security Hardening
You are a container security expert.
Write a security-hardened multi-stage Dockerfile for a Node.js API:
Stage 1 (deps):
- node:22-alpine as base
- Set npm config loglevel=error, no-fund, no-audit (faster build)
- COPY package.json pnpm-lock.yaml
- RUN pnpm install --frozen-lockfile --prod (production only)
Stage 2 (build):
- COPY source, run tsc
- RUN pnpm install --frozen-lockfile (includes dev deps for build)
Stage 3 (production — most important):
- Base: gcr.io/distroless/nodejs22-debian12 (Google Distroless — no shell, no package manager, minimal CVE surface)
- Alternative if distroless is too restrictive: node:22-alpine with security updates applied
- COPY from deps stage: /app/node_modules
- COPY from build stage: /app/dist
- USER nonroot:nonroot (distroless built-in non-root user, UID 65532)
- EXPOSE port via ENV, not hardcoded
- No COPY of .env files, secrets, or source code
- HEALTHCHECK CMD ["/nodejs/bin/node", "-e", "require('http').get('http://localhost:3000/health', r => process.exit(r.statusCode === 200 ? 0 : 1))"]
Security scan: show Trivy command to scan the final image and interpret the report.
After Dockerfile: explain attack surface reduction, including why distroless images typically report far fewer CVEs than Ubuntu-based images (no shell, no package manager, fewer installed packages).
8. ArgoCD GitOps Deployment Pipeline
You are a senior DevOps engineer specializing in GitOps with ArgoCD.
Design a complete GitOps deployment pipeline using ArgoCD + Kubernetes:
Repository structure (app-of-apps pattern):
- argocd/apps/: Application CRDs for each service
- argocd/projects/: AppProject definitions with source/destination restrictions
- k8s/services/{service-name}/: Kubernetes manifests per service (not Helm — plain YAML for simplicity)
ArgoCD configuration:
- App-of-apps: root Application watches argocd/apps/, auto-syncs child Applications
- Sync policy: automated sync (every 3 minutes) with selfHeal=true, prune=true
- Sync waves: wave 0 = namespaces + secrets (from Vault), wave 1 = databases, wave 2 = services
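- Rollback strategy: manual rollback via argocd app rollback (auto-sync must be disabled on the app first); for automated rollback on health check failure, use Argo Rollouts, since Argo CD does not roll back by itself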
Health checks:
- Custom health check for CronJob: healthy if last run succeeded
- Custom health check for PVC: healthy if not in Pending state
Environment promotion:
- staging: auto-sync from main branch
- production: manual sync, requires approval gate (ArgoCD RBAC: only prod-deployers group can sync)
External secrets: ExternalSecret CRD pulling from AWS Secrets Manager
Output: app-of-apps Application YAML, AppProject with RBAC, 2 example service Application CRDs, and the ExternalSecret for DB credentials.
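A child Application for the staging environment might look roughly like this. The repository URL, project name, and service name are hypothetical; the sync-wave annotation places the service in wave 2 as described in the prompt.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-staging
  namespace: argocd                   # namespace where Argo CD runs
  annotations:
    argocd.argoproj.io/sync-wave: "2" # services sync after namespaces and databases
spec:
  project: staging                    # hypothetical AppProject
  source:
    repoURL: https://github.com/example-org/platform-gitops.git  # hypothetical repo
    targetRevision: main
    path: k8s/services/my-service
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: staging
  syncPolicy:
    automated:
      prune: true                     # delete resources removed from git
      selfHeal: true                  # revert manual drift in the cluster
```

For production, the same manifest without the automated block gives the manual sync behaviour the prompt asks for.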
9. Kubernetes Security Hardening
You are a Kubernetes security engineer.
Harden a production Kubernetes cluster namespace for a multi-tenant SaaS:
Pod Security Admission (PSA):
- Enforce restricted policy on all production namespaces
- Audit on staging namespaces (log violations but don't block)
- Configuration: label namespaces with pod-security.kubernetes.io/enforce: restricted
NetworkPolicy:
- Default deny all ingress and egress for every namespace
- Allow: pods to DNS (port 53), pods to specific services within namespace, ingress from NGINX namespace only
- Block: all cross-namespace traffic except explicitly allowed
RBAC:
- Service accounts: each deployment gets its own ServiceAccount (not default)
- Minimal permissions: no cluster-admin anywhere, no wildcards in Rules
- Role: app-reader (get/list/watch pods, configmaps in namespace), app-writer (above + create/update)
Resource limits (not checked by PSA restricted; enforce them with LimitRange and ResourceQuota):
- Every container: requests and limits for CPU and memory
- LimitRange: default limits if not specified (CPU: 100m request/500m limit, Memory: 128Mi/512Mi)
- ResourceQuota: per-namespace limits (total CPU: 10 cores, Memory: 20Gi, pods: 50)
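Security context (PSA restricted requires runAsNonRoot, allowPrivilegeEscalation: false, drop ALL, and a seccomp profile; runAsUser and readOnlyRootFilesystem are additional hardening):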
- runAsNonRoot: true, runAsUser: 1000, readOnlyRootFilesystem: true
- allowPrivilegeEscalation: false, capabilities: drop ALL
- seccompProfile: RuntimeDefault
Output: NetworkPolicy manifests (default-deny + allow rules), RBAC Role + RoleBinding, ResourceQuota, LimitRange, and a compliant Deployment example with all security contexts set.
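The PSA labels and the default-deny baseline from this prompt can be sketched as follows. The namespace name tenant-a is hypothetical, and the DNS rule assumes the cluster DNS pods run in kube-system with the label k8s-app: kube-dns, which is common but not universal.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a                      # hypothetical tenant namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Deny all ingress and egress for every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: tenant-a
spec:
  podSelector: {}                     # empty selector matches all pods
  policyTypes:
    - Ingress
    - Egress
---
# Re-allow DNS lookups, which default-deny would otherwise block
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns       # assumed label on the DNS pods
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

NetworkPolicies are additive, so each further allow rule (in-namespace services, ingress from the NGINX namespace) is a separate policy layered on top of the deny-all baseline. They only take effect if the cluster's CNI plugin enforces NetworkPolicy.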
End-to-End Workflow: New Service to Production
Taking a new microservice from Docker image to production Kubernetes with full observability:
- Dockerfile (Prompt 6 variant): "Write a multi-stage Dockerfile for a Go 1.22 binary: builder (golang:1.22-alpine, CGO_ENABLED=0), production (distroless/static), non-root UID 65534 (nobody), COPY only the binary, EXPOSE 8080, health check via HTTP /health."
- K8s manifests (Prompt 1 variant): "Write Kubernetes Deployment, Service, and HPA for the service. 3 replicas, PodDisruptionBudget minAvailable=2, resource limits (100m/500m CPU, 128Mi/512Mi memory), liveness probe /health 15s, readiness probe /ready 5s."
- Helm chart (Prompt 2 variant): "Wrap the manifests in a Helm chart. Values: image.repository, image.tag, replicaCount, resources, ingress.enabled, ingress.host. Use .Values.global.environment to set NODE_ENV."
- Monitoring (Prompt 5 variant): "Add Prometheus ServiceMonitor, an alert rule for error rate > 1%, and a Grafana dashboard panel for P99 latency using PromQL on the histogram metric from the service."
- ArgoCD (Prompt 8 variant): "Create an ArgoCD Application CRD for this service: source = this git repo at path k8s/services/my-service, destination = production cluster's 'production' namespace, automated sync with selfHeal."
Where AI Goes Wrong in DevOps + Kubernetes
- Outdated Kubernetes apiVersions. AI generates deprecated or removed API versions: networking.k8s.io/v1beta1 for Ingress (removed in K8s 1.22), batch/v1beta1 for CronJob (removed in 1.25). Always specify your cluster version and run kubectl --dry-run=client to validate.
- Missing resource limits and requests. AI-generated Deployments almost never include resources.requests and resources.limits. Without them, the scheduler can't make good placement decisions and you'll have noisy-neighbor problems. Always require "set CPU and memory requests and limits for every container."
- latest image tag in production configs. AI uses image: myapp:latest in Kubernetes Deployments. Latest is non-deterministic: rolling back becomes impossible, deployments are unpredictable. Always require "pin image tag to the specific git SHA."
- Secrets in YAML. AI generates Kubernetes Secret manifests with base64-encoded values hardcoded. Base64 is not encryption. Use External Secrets Operator pulling from AWS Secrets Manager or HashiCorp Vault. Never commit secret values to git, even base64-encoded.
- Single-replica deployments without PodDisruptionBudget. AI generates 1-replica deployments. Any node drain (rolling Kubernetes upgrade, spot instance termination) takes the service down. Require minReplicas=3 and a PodDisruptionBudget of minAvailable=2 for any production service.
- Terraform without state locking. AI generates Terraform with local state or S3 backend without DynamoDB locking. Concurrent applies without locking corrupt state. Always require "S3 backend with DynamoDB locking" and show the backend configuration.
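For the first item in the list above, it helps to know what the current API shapes look like, because the v1 schemas differ from the removed beta versions in more than the apiVersion line. The sketch below shows an Ingress and a CronJob on their stable APIs; the names, host, schedule, ingress class, and image are placeholders.

```yaml
apiVersion: networking.k8s.io/v1      # replaces networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: my-service
spec:
  ingressClassName: nginx             # assumed ingress class
  rules:
    - host: api.example.com           # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix          # required field in v1
            backend:
              service:                # v1 nests name and port under service
                name: my-service
                port:
                  number: 80
---
apiVersion: batch/v1                  # replaces batch/v1beta1
kind: CronJob
metadata:
  name: nightly-cleanup               # hypothetical job
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: registry.example.com/cleanup:1.4.2   # placeholder image
```

A model that emits serviceName and servicePort directly under backend is producing the old beta schema, even if it writes the v1 apiVersion at the top.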
7. Good vs Bad DevOps Prompts
| Task | ❌ Bad Prompt | ✅ Good Prompt |
|---|---|---|
| Kubernetes | "Deploy my app to Kubernetes" | "Write a Kubernetes Deployment for a Node.js API: 3 replicas, RollingUpdate maxUnavailable=0, resource requests cpu:100m/memory:128Mi and limits cpu:500m/memory:512Mi, readinessProbe + livenessProbe on /health, securityContext runAsNonRoot+readOnlyRootFilesystem+drop:ALL, HPA 3-10 pods at 70% CPU, PDB minAvailable:2." |
| CI/CD | "Set up GitHub Actions for my app" | "Write GitHub Actions for Node.js → EKS: test job (Jest+Testcontainers, 80% coverage gate), parallel security job (npm audit + Trivy), build job (Docker BuildKit + ECR push via OIDC), deploy-staging (Helm upgrade + rollout wait + smoke test), deploy-production (environment gate with required reviewers + Slack notification)." |
| Terraform | "Create AWS infrastructure" | "Write Terraform 1.8 modules for: VPC (3 AZ, public+private+DB subnets), EKS 1.30 (managed node groups, IRSA), RDS PostgreSQL 16 Multi-AZ (private subnets, encrypted), ElastiCache Redis 7 (auth token in Secrets Manager). S3+DynamoDB remote state. Variable validation blocks. Mandatory tags: Environment, Team, CostCenter." |
Before You Prompt: DevOps & Kubernetes Context Setup
Infrastructure code has higher blast radius than application code — a misconfigured Kubernetes manifest with missing resource limits can cause node exhaustion at 2 AM. AI-generated YAML frequently omits security contexts, uses deprecated API versions, and misses resource limits. This block enforces the production-readiness baseline:
Context for all DevOps/K8s prompts in this session:
- Kubernetes: 1.32+ — all apiVersions must be current
Never: apps/v1beta1, extensions/v1beta1 (removed years ago)
- Terraform: 1.9+ (OpenTofu compatible), always pin provider versions
- CI/CD: GitHub Actions with reusable workflows
- REQUIRED in every Kubernetes manifest:
resources.requests AND resources.limits (both CPU and memory — no exceptions)
livenessProbe AND readinessProbe (different purposes — explain the difference)
securityContext at pod AND container level:
runAsNonRoot: true, runAsUser: 1001
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
capabilities: drop: ["ALL"]
- Container images: never use latest tag — pin to semantic version or digest
- Secrets: never hardcode in YAML — reference from K8s Secret or external secrets operator
The securityContext block is the most commonly missing piece in AI-generated Kubernetes manifests. Most container images run as root by default. Most admission controllers in hardened clusters reject privileged containers. Running as root in production is a CIS Kubernetes Benchmark failure and a common lateral movement vector. Include the full security context block as a non-negotiable requirement in every Deployment prompt.
3 Common Mistakes When Prompting AI for DevOps & Kubernetes
Mistake 1: Missing resource limits on containers
AI generates Kubernetes Deployments with resources: {} — empty or omitted entirely. Without resource limits, a memory leak in one pod can consume all available node memory, triggering OOM kills across all pods on the node. Kubernetes schedulers also make poor placement decisions without resource requests. Specify: "every container must have explicit resources.requests AND resources.limits for both CPU and memory — document the reasoning for the chosen values." A Deployment without resource limits should fail code review.
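As a safety net for manifests that still slip through without resources, a namespace-level LimitRange can inject defaults. The sketch below uses the request and limit values mentioned earlier in this guide; the namespace name is an assumption.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: production        # assumed namespace
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:                 # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
```

This does not replace explicit values chosen per service; it only keeps an unreviewed container from running unbounded.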
Mistake 2: Running containers as root
Container images default to running as root (UID 0) unless explicitly overridden. AI generates manifests without securityContext because most examples in training data don't include it. A container running as root in a compromised pod has direct access to host files if the container runtime is misconfigured. Specify: "add pod-level and container-level securityContext: runAsNonRoot: true, readOnlyRootFilesystem: true, allowPrivilegeEscalation: false, drop ALL capabilities." Most CI security scanners (Trivy, Snyk) flag missing security contexts.
Mistake 3: Using the latest image tag
AI generates image: nginx:latest for every example. The latest tag has no rollback path (if something breaks, "latest" yesterday and "latest" today are different images), is non-deterministic across nodes (two nodes might pull different versions), and forces a registry lookup on every pod start, because Kubernetes defaults imagePullPolicy to Always for the latest tag. Specify: "pin all container images to a specific semantic version tag or digest, never latest." For production, prefer image digests (nginx@sha256:...) over tags for immutability.