Programming · Updated May 10, 2026 · 20 min read

DevOps & Kubernetes AI Prompts: Automate Infrastructure Faster in 2026

John Allick· AI Researcher
Quick Summary

Complete AI prompt library for DevOps engineers. Covers Kubernetes manifests, Helm charts, GitHub Actions CI/CD, Terraform IaC, Docker multi-stage builds, monitoring with Prometheus and Grafana, security scanning, and production deployment patterns.

#DevOps#Kubernetes#Docker#Terraform#GitHub Actions#CI/CD#AI Prompts

DevOps & Kubernetes in 2026: Infrastructure as Code at Scale

Modern DevOps is defined by reproducibility — every environment identical, every deployment automated, every configuration in version control. AI dramatically accelerates Kubernetes manifest writing, Terraform module creation, and CI/CD pipeline design when prompted with the right production requirements. These prompts produce infrastructure code that passes a production readiness review — not just configurations that deploy without errors.

Picking the Right AI Model for DevOps + K8s Work

Infrastructure code has a high blast radius when it is wrong. The AI model comparison here focuses on which models produce YAML and HCL that actually work versus output that looks plausible but hides subtle misconfigurations.

| Model | Best For (DevOps/K8s) | Weak Spot | When to Reach For It |
|---|---|---|---|
| Claude Sonnet | Helm chart architecture, Terraform module design, GitHub Actions complex workflows | Kubernetes API version drift: occasionally uses older apiVersions | Designing a multi-environment Terraform layout, reviewing a Helm values.yaml for security issues, planning the ArgoCD app-of-apps pattern |
| ChatGPT GPT-5.5 | Kubernetes manifest generation, Dockerfile writing, GitHub Actions job boilerplate | Missing resource limits/requests; outdated Kubernetes API versions (apps/v1beta1) | Generating a Deployment manifest from a description, writing GitHub Actions matrix builds |
| Gemini 3.5 Flash | Current Kubernetes release notes, cloud provider (GKE/EKS/AKS) specifics, SRE research | Less precise on Helm templating specifics | Researching GKE Autopilot vs Standard for your use case, checking current K8s version compatibility |
| Cursor | YAML completion with schema validation for K8s manifests | Terraform module variable propagation in complex module graphs | Writing K8s manifests with real-time schema validation, completing Helm templates |
| GitHub Copilot | Completing Dockerfile layers, GitHub Actions step patterns | Kubernetes YAML: generates syntactically valid but semantically wrong configs | Completing Dockerfile RUN layers, writing GitHub Actions with/env blocks |
| Grok | Direct infrastructure trade-off analysis: Kubernetes vs ECS, Terraform vs Pulumi | Less depth on Helm chart template specifics | Getting a direct opinion on whether Kubernetes is overkill for your team size and traffic |
| DeepSeek | Bash scripting, simple CI/CD steps, documentation | Kubernetes security configurations (PSA, NetworkPolicy, RBAC) are often incomplete | Writing CI/CD shell scripts, generating documentation for runbooks |

For Kubernetes specifically: always validate AI-generated YAML with kubectl apply --dry-run=client -f <file> and a schema validator (kubeval or kubeconform) before applying it to any cluster. The most common AI error is outdated apiVersions: Kubernetes deprecates and removes APIs on a predictable schedule, and models trained before the removal date generate removed APIs that the API server will reject. Check the K8s API deprecation guide for your cluster version.
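As a rough pre-flight check before kubectl ever runs, the apiVersion problem can be sketched in a few lines of Python. This is a minimal sketch, not a replacement for kubeconform: the REMOVED_APIS table is a hand-picked, partial list based on the Kubernetes deprecation guide, the function name find_removed_apis is hypothetical, and the regex only reads unquoted top-level apiVersion and kind fields.

```python
import re

# Partial, hand-maintained table (an assumption for illustration):
# (apiVersion, kind) -> first Kubernetes release that no longer serves it.
REMOVED_APIS = {
    ("extensions/v1beta1", "Deployment"): (1, 16),
    ("apps/v1beta1", "Deployment"): (1, 16),
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


def find_removed_apis(manifest_text, cluster_version):
    """Return one message per YAML document that uses an API removed
    at or before cluster_version, given as a (major, minor) tuple."""
    findings = []
    # Split a multi-document YAML file on its '---' separators.
    for doc in re.split(r"^---\s*$", manifest_text, flags=re.MULTILINE):
        api = re.search(r"^apiVersion:\s*(\S+)", doc, flags=re.MULTILINE)
        kind = re.search(r"^kind:\s*(\S+)", doc, flags=re.MULTILINE)
        if not api or not kind:
            continue
        removed_in = REMOVED_APIS.get((api.group(1), kind.group(1)))
        if removed_in and cluster_version >= removed_in:
            findings.append(
                f"{kind.group(1)} uses {api.group(1)}, "
                f"removed in {removed_in[0]}.{removed_in[1]}"
            )
    return findings


if __name__ == "__main__":
    sample = "apiVersion: batch/v1beta1\nkind: CronJob\n"
    for line in find_removed_apis(sample, (1, 32)):
        print(line)
```

A check like this is cheap enough to run on every AI response before the manifest reaches a real validator.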

1. Kubernetes Deployment with Production Best Practices

⌥ PROMPT
You are a senior Kubernetes engineer who has run production clusters at scale.

Write a complete Kubernetes Deployment manifest for a Node.js API service:

Deployment requirements:
- replicas: 3 (for HA)
- Strategy: RollingUpdate with maxUnavailable: 0, maxSurge: 1 (zero-downtime deployments)
- Resource requests: cpu: 100m, memory: 128Mi
- Resource limits: cpu: 500m, memory: 512Mi (REQUIRED — explain why missing limits cause node starvation)
- Image: specify imagePullPolicy: Always with SHA digest (not mutable tag)

Probes (explain the difference before writing each):
- startupProbe: GET /health, failureThreshold: 30, periodSeconds: 10 (slow start tolerance)
- readinessProbe: GET /health, failureThreshold: 3, periodSeconds: 5 (traffic control)
- livenessProbe: GET /health, failureThreshold: 3, periodSeconds: 15 (restart trigger)

Security context (REQUIRED on every production deployment):
- runAsNonRoot: true, runAsUser: 1001
- readOnlyRootFilesystem: true
- allowPrivilegeEscalation: false
- capabilities: drop: [ALL]

Also write: Service (ClusterIP), HorizontalPodAutoscaler (min:3, max:10, CPU 70%), PodDisruptionBudget (minAvailable: 2).

Add a comment on every field explaining why it exists.

Why it works: Requesting comments on every field forces the AI to explain the production reasoning — you get documentation that would normally take a Kubernetes expert's code review.
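The HPA and PodDisruptionBudget numbers in this prompt interact, and the arithmetic is worth checking by hand. The sketch below follows the scaling formula given in the Kubernetes HPA documentation (desired = ceil(current × currentMetric / targetMetric), with a default 10% tolerance); the function names and the simplified PDB calculation are assumptions for illustration, not the controller's actual code.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=70,
                     min_replicas=3, max_replicas=10, tolerance=0.1):
    """Approximate the HPA decision for a single CPU utilization metric."""
    ratio = current_cpu_pct / target_cpu_pct
    # Inside the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the min/max configured on the HPA object.
    return max(min_replicas, min(max_replicas, desired))


def allowed_disruptions(healthy_pods, min_available):
    """How many pods a node drain may evict under a minAvailable PDB."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    print(desired_replicas(3, 140))   # average CPU at twice the target
    print(allowed_disruptions(3, 2))  # 3 healthy pods, minAvailable: 2
```

With 3 replicas and minAvailable: 2, a drain can evict one pod at a time; with a single replica the same budget allows zero evictions, which is why the prompt pairs the PDB with a minimum of 3 replicas.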

2. Helm Chart for Microservice

⌥ PROMPT
You are a Helm chart expert.

Create a production Helm chart for a Node.js microservice:

Chart structure:
charts/my-service/
  Chart.yaml (name, version, appVersion, description)
  values.yaml (all defaults)
  templates/
    deployment.yaml
    service.yaml
    ingress.yaml
    hpa.yaml
    pdb.yaml
    configmap.yaml
    secret.yaml (sealed-secrets or external-secrets reference)
    serviceaccount.yaml
    _helpers.tpl (common labels, selector labels, name truncation)

values.yaml defaults:
- replicaCount: 3
- image.repository, image.tag (overridden per environment)
- resources.requests and resources.limits
- autoscaling.enabled, minReplicas, maxReplicas, targetCPU
- ingress.enabled, ingress.hosts, ingress.tls
- env: {} (map of environment variables from ConfigMap or Secret)
- probes: liveness and readiness enabled by default

Environment overrides (values-prod.yaml):
- Higher resource limits, more replicas, different image tag, production ingress host

Provide:
1. All template files with proper Go templating
2. NOTES.txt explaining post-install steps
3. Helm install command for staging and production
4. How to use helm diff before upgrading
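Outside the prompt itself, it helps to know what to expect when values-prod.yaml is layered over values.yaml. As far as the Helm documentation describes it, maps are merged key by key and later files win, while lists and scalars are replaced wholesale. The sketch below models only that behavior; merge_values is a hypothetical helper and it ignores Helm's null-deletes-key rule.

```python
import copy


def merge_values(base, override):
    """Recursively merge override into a copy of base.

    Nested dicts are merged key by key; anything else (scalars, lists)
    in override replaces the value from base.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    defaults = {
        "replicaCount": 3,
        "image": {"repository": "my-service", "tag": "dev"},
        "resources": {"limits": {"cpu": "500m", "memory": "512Mi"}},
    }
    prod = {
        "replicaCount": 6,
        "image": {"tag": "1.4.2"},
        "resources": {"limits": {"memory": "1Gi"}},
    }
    print(merge_values(defaults, prod))
```

The practical consequence: a production override file only needs the keys that differ, but overriding one entry of a list (for example ingress.hosts) means restating the whole list.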

3. GitHub Actions — Full CI/CD Pipeline

⌥ PROMPT
You are a GitHub Actions expert building production CI/CD pipelines.

Write a complete GitHub Actions workflow for a Node.js microservice deployed to Kubernetes:

Trigger: push to main, PRs to main, manual dispatch (workflow_dispatch)

Jobs:

test (every push and PR):
- ubuntu-latest
- Setup Node 22, pnpm cache by pnpm-lock.yaml hash
- pnpm install --frozen-lockfile
- TypeScript compile check, ESLint, Prettier
- Jest unit + integration tests (Testcontainers PostgreSQL via Docker-in-Docker)
- Upload coverage to Codecov
- Fail if coverage drops below 80%

security (parallel to test):
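- pnpm audit --audit-level=high (the project uses pnpm-lock.yaml, so npm audit has no lockfile to read)
- Trivy scans: trivy config on the Dockerfile for misconfigurations, trivy image on the built image for OS vulnerabilities (CRITICAL severity fails build)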
- SARIF output uploaded to GitHub Security tab

build (after test passes, main branch only):
- Multi-stage Docker build with BuildKit cache
- Tag with: git SHA (immutable) + 'latest'
- Push to AWS ECR (OIDC auth — no long-lived secrets)
- Image size check: fail if > 200MB

deploy-staging (after build):
- Update Helm values: image.tag = git SHA
- helm upgrade --install my-service charts/my-service -f values-staging.yaml
- Wait for rollout: kubectl rollout status deployment/my-service
- Smoke test: curl staging.api.example.com/health — assert 200

deploy-production (manual approval gate, after deploy-staging):
- environment: production (with required reviewers in GitHub)
- Same helm upgrade to production namespace
- Slack notification: success with deploy URL, failure with rollback command

Secrets: use OIDC for AWS (no access keys), GitHub Environment Secrets for Slack.
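The smoke-test step in deploy-staging is where a single curl often produces flaky pipelines, because the ingress may need a few seconds after the rollout completes. A retrying check might look like the sketch below. The function names, retry counts, and the injectable fetch and sleep parameters are all assumptions made for illustration and testability; only the standard library is used.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5):
    """Return the HTTP status code for url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError but still carry a status code.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def smoke_test(url, attempts=5, delay_seconds=3,
               fetch=http_status, sleep=time.sleep):
    """Return True as soon as url answers 200, False after all attempts."""
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return True
        if attempt < attempts:
            sleep(delay_seconds)
    return False


if __name__ == "__main__":
    ok = smoke_test("https://staging.api.example.com/health")
    raise SystemExit(0 if ok else 1)
```

Exiting non-zero on failure is what lets the workflow step fail the job and block the production gate.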

4. Terraform Infrastructure

⌥ PROMPT
You are a Terraform expert for AWS infrastructure.

Write Terraform (1.8+) to provision a production Kubernetes application infrastructure on AWS:

Modules to create:

module/networking:
- VPC with public, private, and database subnets across 3 AZs
- NAT Gateway (one per AZ for HA)
- VPC flow logs to CloudWatch

module/eks:
- EKS cluster 1.30 with managed node groups
- Node groups: general (t3.medium, min:3, max:10), memory-optimised for databases
- IRSA (IAM Roles for Service Accounts) for pods to access AWS services
- aws-load-balancer-controller, cluster-autoscaler, external-dns via Helm

module/rds:
- PostgreSQL 16 Multi-AZ RDS instance
- Subnet group in private subnets
- Security group: only allow from EKS security group on 5432
- Automated backups 7 days, encryption at rest
- Parameter group for performance tuning (shared_buffers, work_mem)

module/elasticache:
- Redis 7 cluster mode disabled (single primary + replica)
- Private subnets only
- Auth token (password) stored in AWS Secrets Manager

Global:
- All resources tagged: Environment, Team, CostCenter, ManagedBy=terraform
- Remote state: S3 bucket + DynamoDB table for locking
- Variables with validation blocks for required values
- Outputs: cluster endpoint, RDS endpoint, ECR URLs

Output: all module files with variables.tf, main.tf, outputs.tf, and root configuration.
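Before reviewing the generated networking module, it is useful to know what a sane subnet layout looks like so overlapping or undersized CIDRs stand out. The sketch below carves a VPC range into one block per tier and availability zone, similar in spirit to Terraform's cidrsubnet() function. The 10.0.0.0/16 range, the /20 subnet size, the AZ names, and the plan_subnets helper are assumptions chosen for illustration.

```python
import ipaddress


def plan_subnets(vpc_cidr, azs, tiers=("public", "private", "database"),
                 new_prefix=20):
    """Assign one non-overlapping subnet to every (tier, az) pair."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(vpc.subnets(new_prefix=new_prefix))
    needed = len(tiers) * len(azs)
    if needed > len(blocks):
        raise ValueError(
            f"{vpc_cidr} only fits {len(blocks)} /{new_prefix} subnets, "
            f"need {needed}"
        )
    plan = {}
    index = 0
    # Allocate tier by tier so each tier's subnets are contiguous.
    for tier in tiers:
        for az in azs:
            plan[(tier, az)] = str(blocks[index])
            index += 1
    return plan


if __name__ == "__main__":
    azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
    for (tier, az), cidr in plan_subnets("10.0.0.0/16", azs).items():
        print(f"{tier:9} {az}  {cidr}")
```

Nine of the sixteen available /20 blocks are used here, which leaves headroom for a fourth tier or an extra AZ later.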

5. Monitoring with Prometheus & Grafana

⌥ PROMPT
You are a Kubernetes observability expert.

Set up production monitoring for a Node.js microservice on Kubernetes:

Prometheus scraping:
- ServiceMonitor (Prometheus Operator CRD) for the microservice
- Expose /metrics endpoint: express-prometheus-middleware for Node.js
- Custom metrics: http_request_duration_seconds (histogram), http_requests_total (counter by status), active_connections (gauge), background_job_duration_seconds

Prometheus Alert Rules (PrometheusRule CRD):
- Critical: error rate > 1% for 5 minutes (5xx / total)
- Critical: pod unavailable (all replicas down)
- Warning: P99 latency > 500ms for 10 minutes
- Warning: pod restarts > 3 in 5 minutes
- Info: deployment rollout in progress

Grafana Dashboard (JSON model for 4 panels):
- Request rate (requests per second by status code)
- Error rate percentage (5xx / total, red threshold at 1%)
- Latency percentiles (P50, P95, P99 on one graph)
- Infrastructure (CPU usage, memory usage, pod count)

AlertManager routing:
- Critical → PagerDuty (immediate)
- Warning → Slack #alerts channel
- Resolved notifications on both channels

Output: ServiceMonitor YAML, PrometheusRule YAML, AlertManager config, and Grafana dashboard JSON.
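The P99 latency alert depends on how Prometheus estimates a quantile from histogram buckets, which is worth understanding before trusting the generated PromQL. The sketch below approximates the linear interpolation that the Prometheus documentation describes for histogram_quantile on classic histograms. The bucket boundaries and counts in the demo are made-up sample data, and the function is a simplified model rather than the server's implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    where the last upper_bound is math.inf (the +Inf bucket).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in +Inf: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


def error_rate(count_5xx, count_total):
    """Share of requests that returned a 5xx status."""
    return count_5xx / count_total if count_total else 0.0


if __name__ == "__main__":
    # Sample data: cumulative request counts per latency bucket (seconds).
    latency = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.99, latency) > 0.5)  # would the P99 warning fire?
    print(error_rate(12, 1000) > 0.01)              # would the 1% critical fire?
```

Because the estimate can never be finer than the bucket layout, a 500ms threshold is only meaningful if the histogram has a bucket boundary at or near 0.5 seconds.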

6. Docker Security Hardening

⌥ PROMPT
You are a container security expert.

Write a security-hardened multi-stage Dockerfile for a Node.js API:

Stage 1 (deps):
- node:22-alpine as base, with pnpm enabled via corepack (pnpm is not active in the image by default)
- Set npm config loglevel=error, no-fund, no-audit (faster build)
- COPY package.json pnpm-lock.yaml
- RUN pnpm install --frozen-lockfile --prod (production only)

Stage 2 (build):
- RUN pnpm install --frozen-lockfile (includes dev deps for build)
- COPY source, run tsc

Stage 3 (production — most important):
- Base: gcr.io/distroless/nodejs22-debian12 (Google Distroless — no shell, no package manager, minimal CVE surface)
- Alternative if distroless is too restrictive: node:22-alpine with security updates applied
- COPY from deps stage: /app/node_modules
- COPY from build stage: /app/dist
- USER nonroot:nonroot (distroless built-in non-root user, UID 65532)
- EXPOSE port via ENV, not hardcoded
- No COPY of .env files, secrets, or source code
- HEALTHCHECK CMD ["/nodejs/bin/node", "-e", "require('http').get('http://localhost:3000/health', r => process.exit(r.statusCode === 200 ? 0 : 1))"]

Security scan: show Trivy command to scan the final image and interpret the report.

After Dockerfile: explain attack surface reduction and why distroless images typically report far fewer CVEs than Ubuntu-based images (no shell, no package manager, fewer installed packages).
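Interpreting the Trivy report is easier to automate than to eyeball. The sketch below summarizes a report produced with trivy image --format json and decides whether the build should fail. It assumes the report has a top-level Results list whose entries carry an optional Vulnerabilities list with a Severity field, which matches Trivy's JSON output as I understand it; summarize_trivy is a hypothetical helper name, and the sample report is invented.

```python
import json
from collections import Counter


def summarize_trivy(report_json, fail_on=("CRITICAL",)):
    """Count vulnerabilities by severity and flag a failing build.

    Returns (counts_by_severity, should_fail).
    """
    report = json.loads(report_json)
    counts = Counter()
    # 'Results' and 'Vulnerabilities' may be missing or null on clean images.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    should_fail = any(counts[severity] > 0 for severity in fail_on)
    return dict(counts), should_fail


if __name__ == "__main__":
    # Invented sample report for illustration only.
    sample = json.dumps({
        "Results": [
            {"Target": "app (debian 12)", "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
                {"VulnerabilityID": "CVE-0000-0002", "Severity": "CRITICAL"},
            ]},
            {"Target": "node_modules", "Vulnerabilities": None},
        ]
    })
    print(summarize_trivy(sample))
```

In CI the same gate is usually expressed with Trivy's own --severity and --exit-code flags; a script like this is mainly useful when the counts need to be posted somewhere, such as a pull request comment.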

8. ArgoCD GitOps Deployment Pipeline

⌥ PROMPT
You are a senior DevOps engineer specializing in GitOps with ArgoCD.

Design a complete GitOps deployment pipeline using ArgoCD + Kubernetes:

Repository structure (app-of-apps pattern):
- argocd/apps/: Application CRDs for each service
- argocd/projects/: AppProject definitions with source/destination restrictions
- k8s/services/{service-name}/: Kubernetes manifests per service (not Helm — plain YAML for simplicity)

ArgoCD configuration:
- App-of-apps: root Application watches argocd/apps/, auto-syncs child Applications
- Sync policy: automated sync (every 3 minutes) with selfHeal=true, prune=true
- Sync waves: wave 0 = namespaces + secrets (from Vault), wave 1 = databases, wave 2 = services
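- Rollback strategy: manual rollback via argocd app rollback (auto-sync must be disabled on the Application first); for automated rollback on health check failure, use a progressive delivery controller such as Argo Rollouts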

Health checks:
- Custom health check for CronJob: healthy if last run succeeded
- Custom health check for PVC: healthy if not in Pending state

Environment promotion:
- staging: auto-sync from main branch
- production: manual sync, requires approval gate (ArgoCD RBAC: only prod-deployers group can sync)

External secrets: ExternalSecret CRD pulling from AWS Secrets Manager

Output: app-of-apps Application YAML, AppProject with RBAC, 2 example service Application CRDs, and the ExternalSecret for DB credentials.
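When reviewing the generated manifests, the sync-wave annotations are the part most worth tracing by hand, since a wrong wave number silently reorders the rollout. The sketch below groups resources by the argocd.argoproj.io/sync-wave annotation and lists them lowest wave first, treating a missing annotation as wave 0. It is a simplified model: real Argo CD also orders by sync phase and resource kind within a wave, and the resource names in the demo are hypothetical.

```python
from collections import defaultdict

SYNC_WAVE = "argocd.argoproj.io/sync-wave"


def sync_order(manifests):
    """Return [(wave, [kind/name, ...]), ...] sorted by ascending wave."""
    waves = defaultdict(list)
    for manifest in manifests:
        metadata = manifest.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        # Annotation values are strings; a missing annotation means wave 0.
        wave = int(annotations.get(SYNC_WAVE, "0"))
        waves[wave].append(f'{manifest["kind"]}/{metadata["name"]}')
    return [(wave, sorted(names)) for wave, names in sorted(waves.items())]


if __name__ == "__main__":
    resources = [
        {"kind": "Deployment",
         "metadata": {"name": "api", "annotations": {SYNC_WAVE: "2"}}},
        {"kind": "Namespace", "metadata": {"name": "prod"}},
        {"kind": "StatefulSet",
         "metadata": {"name": "postgres", "annotations": {SYNC_WAVE: "1"}}},
        {"kind": "ExternalSecret",
         "metadata": {"name": "db-creds", "annotations": {SYNC_WAVE: "0"}}},
    ]
    for wave, names in sync_order(resources):
        print(wave, names)
```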

9. Kubernetes Security Hardening

⌥ PROMPT
You are a Kubernetes security engineer.

Harden a production Kubernetes cluster namespace for a multi-tenant SaaS:

Pod Security Admission (PSA):
- Enforce restricted policy on all production namespaces
- Audit on staging namespaces (log violations but don't block)
- Configuration: label namespaces with pod-security.kubernetes.io/enforce: restricted

NetworkPolicy:
- Default deny all ingress and egress for every namespace
- Allow: pods to DNS (port 53), pods to specific services within namespace, ingress from NGINX namespace only
- Block: all cross-namespace traffic except explicitly allowed

RBAC:
- Service accounts: each deployment gets its own ServiceAccount (not default)
- Minimal permissions: no cluster-admin anywhere, no wildcards in Rules
- Role: app-reader (get/list/watch pods, configmaps in namespace), app-writer (above + create/update)

Resource limits (not checked by PSA restricted; enforce with LimitRange and ResourceQuota):
- Every container: requests and limits for CPU and memory
- LimitRange: default limits if not specified (CPU: 100m request/500m limit, Memory: 128Mi/512Mi)
- ResourceQuota: per-namespace limits (total CPU: 10 cores, Memory: 20Gi, pods: 50)

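Security context (PSA restricted requires runAsNonRoot, allowPrivilegeEscalation: false, drop ALL, and a seccompProfile; readOnlyRootFilesystem is additional hardening):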
- runAsNonRoot: true, runAsUser: 1000, readOnlyRootFilesystem: true
- allowPrivilegeEscalation: false, capabilities: drop ALL
- seccompProfile: RuntimeDefault

Output: NetworkPolicy manifests (default-deny + allow rules), RBAC Role + RoleBinding, ResourceQuota, LimitRange, and a compliant Deployment example with all security contexts set.
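A quick way to review the compliant Deployment example the model returns is to check its pod spec against the container-level rules of the restricted Pod Security Standard. The sketch below covers four of those rules only (privilege escalation, dropped capabilities, non-root, seccomp) and is an approximation of the documented profile, not the admission controller's logic. Treat readOnlyRootFilesystem and resource limits as separate requirements enforced elsewhere; restricted_violations is a hypothetical helper name.

```python
def restricted_violations(pod_spec):
    """List violations of four 'restricted' profile rules in a pod spec dict."""
    violations = []
    pod_sc = pod_spec.get("securityContext") or {}
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext") or {}

        if sc.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation must be false")

        dropped = (sc.get("capabilities") or {}).get("drop") or []
        if "ALL" not in dropped:
            violations.append(f"{name}: capabilities.drop must include ALL")

        # The container-level setting wins; otherwise the pod level applies.
        run_as_non_root = sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot"))
        if run_as_non_root is not True:
            violations.append(f"{name}: runAsNonRoot must be true")

        seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            violations.append(
                f"{name}: seccompProfile.type must be RuntimeDefault or Localhost"
            )
    return violations


if __name__ == "__main__":
    spec = {
        "securityContext": {"runAsNonRoot": True,
                            "seccompProfile": {"type": "RuntimeDefault"}},
        "containers": [{"name": "api", "securityContext": {
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        }}],
    }
    print(restricted_violations(spec))
```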

End-to-End Workflow: New Service to Production

Taking a new microservice from Docker image to production Kubernetes with full observability:

  1. Dockerfile (Prompt 6 variant): "Write a multi-stage Dockerfile for a Go 1.22 binary: builder (golang:1.22-alpine, CGO_ENABLED=0), production (distroless/static), non-root UID 65534 (nobody), COPY only the binary, EXPOSE 8080, health check via HTTP /health."
  2. K8s manifests (Prompt 1 variant): "Write Kubernetes Deployment, Service, and HPA for the service. 3 replicas, PodDisruptionBudget minAvailable=2, resource limits (100m/500m CPU, 128Mi/512Mi memory), liveness probe /health 15s, readiness probe /ready 5s."
  3. Helm chart (Prompt 2 variant): "Wrap the manifests in a Helm chart. Values: image.repository, image.tag, replicaCount, resources, ingress.enabled, ingress.host. Use .Values.global.environment to set NODE_ENV."
  4. Monitoring (Prompt 5 variant): "Add Prometheus ServiceMonitor, an alert rule for error rate > 1%, and a Grafana dashboard panel for P99 latency using PromQL on the histogram metric from the service."
  5. ArgoCD (Prompt 8 variant): "Create an ArgoCD Application CRD for this service: source = this git repo at path k8s/services/my-service, destination = production cluster's 'production' namespace, automated sync with selfHeal."

Where AI Goes Wrong in DevOps + Kubernetes

  • Outdated Kubernetes apiVersions. AI generates deprecated or removed API versions, such as networking.k8s.io/v1beta1 for Ingress (removed in K8s 1.22) or batch/v1beta1 for CronJob (removed in 1.25). Always specify your cluster version and run kubectl apply --dry-run=client -f <file> to validate.
  • Missing resource limits and requests. AI-generated Deployments often omit resources.requests and resources.limits. Without them, the scheduler can't make good placement decisions and you'll have noisy-neighbor problems. Always require "set CPU and memory requests and limits for every container."
  • latest image tag in production configs. AI uses image: myapp:latest in Kubernetes Deployments. The latest tag is non-deterministic: rollbacks become unreliable and deployments unpredictable. Always require "pin image tag to the specific git SHA."
  • Secrets in YAML. AI generates Kubernetes Secret manifests with base64-encoded values hardcoded. Base64 is not encryption. Use External Secrets Operator pulling from AWS Secrets Manager or HashiCorp Vault. Never commit secret values to git, even base64-encoded.
  • Single-replica deployments without PodDisruptionBudget. AI generates 1-replica deployments. Any node drain (rolling Kubernetes upgrade, spot instance termination) takes the service down. Require minReplicas=3 and a PodDisruptionBudget of minAvailable=2 for any production service.
  • Terraform without state locking. AI generates Terraform with local state or S3 backend without DynamoDB locking. Concurrent applies without locking corrupt state. Always require "S3 backend with DynamoDB locking" and show the backend configuration.
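The "Base64 is not encryption" point in the list above is easy to demonstrate: anyone with read access to the repository can recover the values with one standard-library call. In the sketch below, decode_secret_data is a hypothetical helper and the encoded value is a made-up example of what a Secret manifest's data field might contain.

```python
import base64


def decode_secret_data(data):
    """Decode the 'data' map of a Kubernetes Secret manifest.

    No key or password is involved: base64 is an encoding, not encryption.
    """
    return {key: base64.b64decode(value).decode("utf-8")
            for key, value in data.items()}


if __name__ == "__main__":
    # Made-up example of a value committed in a Secret manifest.
    committed = {"DB_PASSWORD": "aHVudGVyMg=="}
    print(decode_secret_data(committed))
```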

7. Good vs Bad DevOps Prompts

| Task | ❌ Bad Prompt | ✅ Good Prompt |
|---|---|---|
| Kubernetes | "Deploy my app to Kubernetes" | "Write a Kubernetes Deployment for a Node.js API: 3 replicas, RollingUpdate maxUnavailable=0, resource requests cpu:100m/memory:128Mi and limits cpu:500m/memory:512Mi, readinessProbe + livenessProbe on /health, securityContext runAsNonRoot+readOnlyRootFilesystem+drop:ALL, HPA 3-10 pods at 70% CPU, PDB minAvailable:2." |
| CI/CD | "Set up GitHub Actions for my app" | "Write GitHub Actions for Node.js → EKS: test job (Jest+Testcontainers, 80% coverage gate), parallel security job (pnpm audit + Trivy), build job (Docker BuildKit + ECR push via OIDC), deploy-staging (Helm upgrade + rollout wait + smoke test), deploy-production (environment gate with required reviewers + Slack notification)." |
| Terraform | "Create AWS infrastructure" | "Write Terraform 1.8 modules for: VPC (3 AZ, public+private+DB subnets), EKS 1.30 (managed node groups, IRSA), RDS PostgreSQL 16 Multi-AZ (private subnets, encrypted), ElastiCache Redis 7 (auth token in Secrets Manager). S3+DynamoDB remote state. Variable validation blocks. Mandatory tags: Environment, Team, CostCenter." |

Before You Prompt: DevOps & Kubernetes Context Setup

Infrastructure code has a higher blast radius than application code: a misconfigured Kubernetes manifest with missing resource limits can cause node exhaustion at 2 AM. AI-generated YAML frequently omits security contexts, uses deprecated API versions, and misses resource limits. This block enforces the production-readiness baseline:

⌥ PROMPT
Context for all DevOps/K8s prompts in this session:
- Kubernetes: 1.32+ — all apiVersions must be current
  Never: apps/v1beta1, extensions/v1beta1 (removed years ago)
- Terraform: 1.9+ (OpenTofu compatible), always pin provider versions
- CI/CD: GitHub Actions with reusable workflows
- REQUIRED in every Kubernetes manifest:
  resources.requests AND resources.limits (both CPU and memory — no exceptions)
  livenessProbe AND readinessProbe (different purposes — explain the difference)
  securityContext at pod AND container level:
    runAsNonRoot: true, runAsUser: 1001
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities: drop: ["ALL"]
- Container images: never use latest tag — pin to semantic version or digest
- Secrets: never hardcode in YAML — reference from K8s Secret or external secrets operator

The securityContext block is the most commonly missing piece in AI-generated Kubernetes manifests. Most container images run as root by default, and hardened clusters typically use Pod Security Admission or a policy engine to reject privileged or root containers. Running as root in production is flagged by the CIS Kubernetes Benchmark and is a common lateral movement vector. Include the full security context block as a non-negotiable requirement in every Deployment prompt.

3 Common Mistakes When Prompting AI for DevOps & Kubernetes

Mistake 1: Missing resource limits on containers

AI generates Kubernetes Deployments with resources: {} — empty or omitted entirely. Without resource limits, a memory leak in one pod can consume all available node memory, triggering OOM kills across all pods on the node. Kubernetes schedulers also make poor placement decisions without resource requests. Specify: "every container must have explicit resources.requests AND resources.limits for both CPU and memory — document the reasoning for the chosen values." A Deployment without resource limits should fail code review.

Mistake 2: Running containers as root

Container images default to running as root (UID 0) unless explicitly overridden. AI generates manifests without securityContext because most examples in training data don't include it. A container running as root in a compromised pod has direct access to host files if the container runtime is misconfigured. Specify: "add pod-level and container-level securityContext: runAsNonRoot: true, readOnlyRootFilesystem: true, allowPrivilegeEscalation: false, drop ALL capabilities." Most CI security scanners (Trivy, Snyk) flag missing security contexts.

Mistake 3: Using the latest image tag

AI generates image: nginx:latest for every example. The latest tag has no reliable rollback path (if something breaks, "latest" yesterday and "latest" today are different images), is non-deterministic across nodes (two nodes might pull different versions), and forces a registry check on every pod start because Kubernetes defaults imagePullPolicy to Always for :latest images. Specify: "pin all container images to a specific semantic version tag or digest, never latest." For production, prefer image digests (nginx@sha256:...) over tags for immutability.
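A review script can classify every image reference in a manifest by how strongly it is pinned. The sketch below distinguishes digest-pinned references, explicit version tags, and references that resolve to latest (including those with no tag at all). The image_pinning name and the three labels are assumptions for illustration, and the parsing handles common reference shapes rather than the full OCI reference grammar.

```python
import re

# A sha256 digest is 64 lowercase hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def image_pinning(image):
    """Classify an image reference as 'digest', 'tag', or 'latest'."""
    if DIGEST_RE.search(image):
        return "digest"
    # Only the last path segment can hold a tag; a colon earlier in the
    # reference belongs to a registry port such as registry:5000.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return "latest"  # an omitted tag resolves to :latest
    tag = last_segment.rsplit(":", 1)[1]
    return "latest" if tag == "latest" else "tag"


if __name__ == "__main__":
    for ref in ("nginx", "nginx:latest", "nginx:1.27.0",
                "registry.example.com:5000/team/api"):
        print(ref, "->", image_pinning(ref))
```

Failing the review on anything classified as latest, and warning on tag, matches the guidance above while leaving room for semantic version tags in lower environments.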


John Allick, AI Researcher · Updated May 10, 2026

John Allick is an AI researcher specializing in prompt engineering and large language model evaluation. He benchmarks models across ChatGPT, Claude, Gemini, Grok, and DeepSeek, focusing on practical techniques that produce reliable, production-ready outputs. Every guide on PromptPrepare is tested live on current model versions before publication.
