July 19, 2024 · 7 min read

Kubernetes Homelab: From Zero to GitOps in a Weekend

#kubernetes #infrastructure #homelab #gitops

Why Build a Homelab?

I've been running production Kubernetes for years. But there's a gap between "knowing K8s" and "deeply understanding K8s." Production clusters have:

  • Change approval processes
  • Limited experimentation budget
  • Actual users who get grumpy when you break things

A homelab removes those constraints. You can:

  • Learn K8s internals without cloud bills
  • Test infrastructure-as-code before production
  • Dogfood your own deployments
  • Break things at 3am and only disappoint yourself

The Case for Physical Hardware

Alternatives exist: kind clusters, local Docker Compose, cloud sandbox accounts. They're fine for learning the basics. But they miss:

| Feature | Cloud Sandbox | Physical Homelab |
| --- | --- | --- |
| Persistent storage | Ephemeral or expensive | Cheap NVMe drives |
| Network troubleshooting | Abstracted away | Real NICs, real problems |
| Multi-node networking | Typically single node | Actual cross-node traffic |
| Long-running workloads | Time/resource limited | Run for months |
| Cost over 6 months | $200-500+ | $0 (after hardware) |

I went with a mini PC cluster. Four nodes, 64GB total RAM, silent, low power.

Hardware Build

The Node Strategy

I didn't want a rack server. Too loud, too hot, too ugly for a home office. Mini PCs are the sweet spot:

| Component | Per Node | x4 Total Cost |
| --- | --- | --- |
| Intel N100 Mini PC | $150 | $600 |
| 16GB DDR5 SODIMM | $35 | $140 |
| 256GB NVMe SSD | $40 | $160 |
| **Total** | $225 | $900 |

Power draw: ~10W idle per node, about 40W for the whole cluster. Budget $12/month in electricity at the high end. The cloud equivalent would be $300+/month.
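A quick sanity check on that figure: at 40W the cluster burns about 29 kWh/month, so $12 assumes pricey power or sustained load above idle. The rates below are assumptions; plug in yours.

```python
def monthly_cost(watts: float, usd_per_kwh: float, hours: float = 730) -> float:
    """Average power draw -> monthly electricity cost in USD."""
    kwh = watts * hours / 1000  # watt-hours to kilowatt-hours
    return kwh * usd_per_kwh

# Four nodes idling at ~10W each
print(f"@ $0.15/kWh: ${monthly_cost(40, 0.15):.2f}/mo")  # $4.38
print(f"@ $0.40/kWh: ${monthly_cost(40, 0.40):.2f}/mo")  # $11.68
```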

Storage Architecture

Two approaches for persistent storage:

Option 1: Centralized NFS

  • One node exports NFS share
  • Other nodes mount it
  • Pros: Simple, one backup target
  • Cons: Single point of failure

Option 2: Distributed (Longhorn)

  • Each node contributes storage
  • Replicated across nodes
  • Pros: Fault tolerant, K8s-native
  • Cons: More complex, network overhead

I use both. NFS for media files (movies don't need replication), Longhorn for databases (need HA).
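That dual setup is just two StorageClasses, and each workload picks one by name. A sketch (the class names, NFS server address, and share path are illustrative; the provisioners shown are Longhorn's CSI driver and csi-driver-nfs):

```yaml
# storage/classes.yaml -- names and addresses are illustrative
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"     # survives one node failure
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-media
provisioner: nfs.csi.k8s.io  # csi-driver-nfs
parameters:
  server: 192.168.1.10       # the node exporting the share
  share: /export/media
```

A database PVC then requests `storageClassName: longhorn-replicated`; the media PVC requests `nfs-media`.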

Software Stack

The Foundation: K3s

K3s is Kubernetes without the bloat. Single binary, minimal dependencies, perfect for homelab.

```sh
# On the control plane node
curl -sfL https://get.k3s.io | sh -s - server \
  --cluster-init \
  --disable traefik \
  --write-kubeconfig-mode 644

# Get the join token for workers
cat /var/lib/rancher/k3s/server/node-token

# On each worker node
curl -sfL https://get.k3s.io | sh -s - agent \
  --server https://CONTROL_PLANE_IP:6443 \
  --token YOUR_TOKEN
```

Five minutes later, you have a cluster.

GitOps with ArgoCD

This is the game-changer. ArgoCD watches your Git repo and syncs cluster state.

```yaml
# bootstrap/argocd.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: homelab-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/YOUR_USERNAME/homelab
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

Deploy ArgoCD:

```sh
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
kubectl apply -f bootstrap/argocd.yaml

# Initial admin password for the UI
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d
```

Now everything in the apps/ directory in Git gets deployed automatically.
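For reference, a repo layout that matches the Application above and the file paths used throughout this post (names are illustrative):

```text
homelab/
β”œβ”€β”€ bootstrap/
β”‚   └── argocd.yaml                  # the Application above
β”œβ”€β”€ apps/                            # everything here auto-deploys
β”‚   β”œβ”€β”€ media/
β”‚   β”‚   └── plex.yaml
β”‚   β”œβ”€β”€ home-assistant.yaml
β”‚   β”œβ”€β”€ vaultwarden.yaml
β”‚   └── pihole.yaml
β”œβ”€β”€ infrastructure/
β”‚   └── traefik/
β”‚       └── values.yaml
└── monitoring/
    └── kube-prometheus-stack.yaml
```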

Ingress with Traefik

Traefik is a modern ingress controller that auto-discovers services. K3s bundles its own Traefik, which we disabled at install time (--disable traefik) so this one can be managed through GitOps like everything else.

```yaml
# infrastructure/traefik/values.yaml
ports:
  web:
    redirectTo: websecure
  websecure:
    tls:
      enabled: true

certificatesResolvers:
  cloudflare:
    acme:
      email: your@email.com
      dnsChallenge:
        provider: cloudflare
```

SSL with Let's Encrypt via DNS challenge. No port forwarding, no manual cert management.
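Exposing a service is then a standard Ingress; the only Traefik-specific piece is the router annotation selecting the resolver (the hostname and service name below are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  annotations:
    traefik.ingress.kubernetes.io/router.tls.certresolver: cloudflare
spec:
  rules:
    - host: grafana.example.com   # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 80
```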

Monitoring Stack

You can't run K8s blind. Prometheus + Grafana are essential.

```yaml
# monitoring/kube-prometheus-stack.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 45.x
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
```

This gives you:

  • Prometheus (metrics collection)
  • Grafana (visualization)
  • AlertManager (alerting)
  • Node Exporter (node metrics)
  • Kube-State-Metrics (K8s metrics)
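The stack also picks up your own PrometheusRule resources, which is how custom alerts get wired in. A sketch for the network case (the threshold is a guess to tune, and the release label must match your Helm release for the operator to discover it):

```yaml
# monitoring/rules/network.yaml -- threshold is a guess, tune it
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: network-health
  namespace: monitoring
  labels:
    release: monitoring   # must match your kube-prometheus-stack release
spec:
  groups:
    - name: network
      rules:
        - alert: HighTcpRetransmits
          expr: rate(node_netstat_Tcp_RetransSegs[5m]) > 10
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Elevated TCP retransmits on {{ $labels.instance }}"
```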

Real Workloads

What actually runs on this thing?

Media Stack

```yaml
# apps/media/plex.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: plex
spec:
  template:
    spec:
      containers:
        - name: plex
          image: plexinc/pms-docker:latest
          resources:
            requests:
              memory: "4Gi"
              cpu: "1"
          volumeMounts:
            - name: media
              mountPath: /media
            - name: config
              mountPath: /config
```

Plex with GPU passthrough for transcoding. Sonarr, Radarr, and Prowlarr handle the automation.
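The GPU part boils down to requesting the device as an extended resource. Assuming Intel's GPU device plugin is deployed (the N100's iGPU surfaces as gpu.intel.com/i915), the plex container's resources grow one line:

```yaml
# Under the plex container, assuming the Intel GPU device plugin is installed
resources:
  requests:
    memory: "4Gi"
    cpu: "1"
  limits:
    gpu.intel.com/i915: "1"   # hands the iGPU to the pod for hardware transcoding
```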

Home Automation

```yaml
# apps/home-assistant.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: home-assistant
spec:
  template:
    spec:
      hostNetwork: true   # pod-level field, not a container field
      containers:
        - name: home-assistant
          image: homeassistant/home-assistant:stable
          volumeMounts:
            - name: config
              mountPath: /config
```

Home Assistant with host networking for mDNS discovery.

Password Management

```yaml
# apps/vaultwarden.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vaultwarden
spec:
  template:
    spec:
      containers:
        - name: vaultwarden
          image: vaultwarden/server:latest
          env:
            - name: SIGNUPS_ALLOWED
              value: "false"
```

Bitwarden-compatible server. My passwords, my infrastructure.

DNS Ad Blocking

```yaml
# apps/pihole.yaml
apiVersion: v1
kind: Service
metadata:
  name: pihole
spec:
  type: LoadBalancer
  loadBalancerIP: 192.168.1.200
  selector:
    app: pihole
  ports:
    - { name: dns-udp, port: 53, protocol: UDP }
    - { name: dns-tcp, port: 53, protocol: TCP }
```

Pi-hole as the network DNS. Every device gets ad-blocking without configuration.

GitOps Workflow

The workflow is stupid simple:

```sh
# Make a change
vim apps/media/plex.yaml

# Commit and push
git add . && git commit -m "increase plex memory" && git push

# Wait 30 seconds
# ArgoCD syncs automatically
```

No kubectl apply. No manual cluster changes. Git is the single source of truth.

Handling Secrets

Never commit secrets. Use Sealed Secrets:

```sh
# Write the secret manifest locally (don't create it in the cluster)
kubectl create secret generic db-password \
  --from-literal=password=hunter2 \
  --dry-run=client -o yaml > db-password-secret.yaml

# Seal it (encrypt with the cluster's public key)
kubeseal --format=yaml < db-password-secret.yaml > sealed-secret.yaml

# Now safe to commit
git add sealed-secret.yaml && git commit -m "add db secret"
```

The sealed secret can only be decrypted by your cluster's controller.
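On the consuming side, nothing changes: once the controller unseals it into a regular Secret, a workload references it the usual way.

```yaml
# In the database Deployment's container spec
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-password
        key: password
```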

Lessons from Production Incidents

Incident 1: The Time I Deleted Everything

What happened: I ran kubectl delete all --all -n default --grace-period=0 thinking I was on the test cluster. I was on the production cluster. (Kubernetes refuses to delete the default namespace itself, but it will cheerfully delete everything inside it.)

Impact: Every workload gone. 50+ services.

Recovery time: 5 minutes.

How: ArgoCD noticed the drift and recreated everything from Git. The selfHeal: true setting saved my ass.

```yaml
syncPolicy:
  automated:
    prune: true
    selfHeal: true  # <-- THIS
```

Lesson: GitOps doesn't just make deployment easier. It makes recovery instant.

Incident 2: The Mysterious Network Latency

Symptoms: 2-second delays between services. Intermittent. Frustrating.

Without monitoring: I'd have been guessing for hours.

With Prometheus: Grafana dashboard showed elevated TCP retransmits on one node.

Root cause: One mini PC had a failing NIC. Not dead, just flaky.

Fix: Replaced the node. Prometheus confirmed latency dropped.

Lesson: Monitor before you need it. The metrics you ignore today are the debugging data you'll wish you had tomorrow.

Incident 3: Certificate Expiration

What happened: Let's Encrypt certs expired. Internal services unreachable.

Why: Cert-manager wasn't auto-renewing because of a misconfigured DNS challenge.

How I found it: Alerts from Prometheus via AlertManager to Slack.

Fix: Corrected Cloudflare API token permissions. Certs renewed within minutes.

Lesson: Alert on everything. Cert expiration at 3am should wake you up.
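An alert along these lines covers the cert case. This sketch leans on cert-manager's expiration metric; the 7-day window is my choice, and the rule belongs inside a PrometheusRule group like the rest of your alerts:

```yaml
# monitoring/rules/certs.yaml -- 7-day window is a judgment call
- alert: CertificateExpiringSoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 7 * 86400
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "Certificate {{ $labels.name }} expires in under 7 days"
```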

Cost Comparison: Cloud vs Homelab

| Resource | Cloud (Monthly) | Homelab |
| --- | --- | --- |
| 4 nodes, 64GB RAM | $350 (EKS + EC2) | $0 (paid upfront) |
| 500GB block storage | $50 (EBS) | $0 (local NVMe) |
| Load balancer | $18 (ALB) | $0 (Traefik) |
| 1TB data transfer | $90 | $0 (home internet) |
| Managed DNS | $0.50 (Route53) | $0 (Cloudflare free) |
| SSL certificates | $0 (ACM) | $0 (Let's Encrypt) |
| **Total monthly** | ~$500 | $12 (electricity) |

Break-even: ~2 months

Year 1 savings: ~$5,000 (after the $900 hardware and ~$144 in electricity)

Year 2+ savings: ~$5,850/year
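Netting out hardware and electricity, the arithmetic from the table works out like this (rounded inputs, so treat the outputs as approximate):

```python
# Rounded figures from the table above
cloud_monthly = 500
hardware = 900            # one-time build cost
electricity_monthly = 12

net_monthly = cloud_monthly - electricity_monthly  # 488
break_even_months = hardware / net_monthly         # ~1.8, call it 2
year1_savings = 12 * net_monthly - hardware        # 4956
year2_savings = 12 * net_monthly                   # 5856 per year thereafter

print(f"break-even: {break_even_months:.1f} months")
print(f"year 1: ${year1_savings}, year 2+: ${year2_savings}/yr")
```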

What's Next

The homelab evolves. Current experiments:

  • GPU scheduling: Adding an RTX 3060 for ML inference workloads
  • Multi-cluster: Setting up a staging cluster for pre-production testing
  • Disaster recovery: Velero for cross-cluster backups
  • Service mesh: Linkerd for zero-trust internal networking

Conclusion

A homelab isn't about being cheap. It's about having a sandbox where:

  • Mistakes are learning opportunities, not career-limiting incidents
  • You control the entire stack
  • Skills transfer directly to production environments

The Kubernetes knowledge that took me from "can deploy a pod" to "can architect multi-cluster GitOps platforms" came from hours of breaking and fixing my own cluster. That's education you can't buy.


Questions about building your own homelab? Hit me up on Twitter or check out my homelab repo.