This is a mono repository for my home infrastructure and Kubernetes cluster. I try to adhere to Infrastructure as Code (IaC) and GitOps practices using tools like Ansible, Terraform, Kubernetes, Flux, Renovate, and GitHub Actions.
My Kubernetes cluster is deployed with Talos. Storage is provided by multiple solutions including NFS via democratic-csi and local storage options.
Storage Architecture:
- NFS Storage: democratic-csi for shared filesystem access
- Local Storage: Talos local storage for high-performance workloads
- Legacy Storage: 2x Unraid VMs providing media storage (migration in progress)
- Network: Multi-tier network with 40Gb links for storage traffic
- Total Capacity: 476.96TB raw across 76 drives
There is a template over at onedr0p/cluster-template if you want to try and follow along with some of the practices used here.
Networking:
- cilium: eBPF-based CNI with kube-proxy replacement, L2 announcements, and advanced networking features.
- multus-cni: Multiple network interfaces per pod for IoT and legacy network integration.
- ingress-nginx: Dual ingress controllers (internal + external) for service routing.
- external-dns: Automatic DNS management (internal via UniFi, external via Cloudflare).
- k8s-gateway: Internal DNS server for cluster services.
- cloudflared: Secure Cloudflare tunnels for external access.
Storage:
- democratic-csi: NFS storage provisioner for shared workloads.
- volsync: PVC backup and recovery (a hedged ReplicationSource sketch follows this list).
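To give a rough idea of how volsync-based PVC backups are wired up, here is a minimal ReplicationSource sketch. The PVC name, namespace, schedule, and secret name are hypothetical and not taken from this repo; only the general shape of a restic-backed ReplicationSource is the point.

```yaml
# Hedged sketch only: a volsync ReplicationSource backing up a PVC with restic.
# "plex-config" and "plex-volsync-secret" are hypothetical names.
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: plex
  namespace: media
spec:
  sourcePVC: plex-config            # PVC to back up
  trigger:
    schedule: "0 * * * *"           # hourly backups
  restic:
    repository: plex-volsync-secret # Secret with the restic repo URL and credentials
    copyMethod: Snapshot            # snapshot the PVC before copying
    pruneIntervalDays: 7
    retain:
      daily: 7
      weekly: 4
```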
Security & Secrets:
- cert-manager: Automated SSL/TLS certificate management.
- external-secrets: Secrets management using Infisical (a hedged ExternalSecret sketch follows this list).
- sops: Encrypted secrets in Git.
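To show how external-secrets and Infisical fit together, here is a minimal ExternalSecret sketch. The store name `infisical`, the secret names, and the keys are hypothetical; the point is only how an ExternalSecret maps a remote value into a Kubernetes Secret.

```yaml
# Hedged sketch only: an ExternalSecret pulling a value from a ClusterSecretStore
# backed by Infisical. Store, secret, and key names are hypothetical.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: cloudflared
  namespace: network
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: infisical                    # ClusterSecretStore configured for Infisical
  target:
    name: cloudflared-secret           # Kubernetes Secret that gets created
  data:
    - secretKey: TUNNEL_TOKEN          # key in the generated Secret
      remoteRef:
        key: CLOUDFLARED_TUNNEL_TOKEN  # key in Infisical
```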
Observability:
- kube-prometheus-stack: Prometheus, Grafana, and Alertmanager.
- loki: Log aggregation and query.
- promtail: Log shipper for Loki.
- gatus: Service health monitoring and status page.
GPU & Hardware:
- nvidia-device-plugin: GPU support for 4x Tesla P100 GPUs (a pod sketch follows this list).
- intel-device-plugin: Intel hardware acceleration.
- node-feature-discovery: Automatic hardware capability detection.
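As an illustration of how the NVIDIA device plugin exposes the Tesla P100s to workloads, a pod simply requests the `nvidia.com/gpu` resource. The image and runtime class name below are assumptions, not manifests from this repo.

```yaml
# Hedged sketch only: a pod requesting one of the Tesla P100 GPUs exposed by
# nvidia-device-plugin. The image and runtimeClassName are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  runtimeClassName: nvidia          # assumes an NVIDIA container runtime class
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1         # schedules onto the GPU worker node
```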
GitOps & Automation:
- actions-runner-controller: Self-hosted GitHub runners.
- spegel: Stateless cluster-local OCI registry mirror.
- system-upgrade-controller: Automated Talos system upgrades.
Flux watches the clusters defined in my kubernetes folder (see Directories below) and makes changes to my clusters based on the state of my Git repository.
Flux works here by recursively searching the kubernetes/apps folder until it finds the top-most kustomization.yaml in each directory, then applying all of the resources listed in it. That kustomization.yaml will generally only contain a namespace resource and one or more Flux Kustomizations (ks.yaml). Those Flux Kustomizations in turn apply a HelmRelease or other resources related to the application.
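To make that structure concrete, here is a hedged sketch of the kind of top-level kustomization.yaml Flux finds in an app directory. The directory and file names are illustrative, not copied from this repo.

```yaml
# Hedged sketch only: a top-level kustomization.yaml that pulls in the namespace
# plus one or more Flux Kustomizations (ks.yaml). Paths are illustrative.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ./namespace.yaml   # the Namespace for this group of apps
  - ./plex/ks.yaml     # Flux Kustomization that applies the app's HelmRelease
```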
Renovate watches my entire repository for dependency updates; when one is found, a PR is automatically created. When a PR is merged, Flux applies the changes to my cluster.
This Git repository contains the following directories under Kubernetes.
📁 kubernetes
├── 📁 apps          # applications
├── 📁 bootstrap     # bootstrap procedures
├── 📁 components    # reusable components
└── 📁 flux          # flux system configuration

This is a high-level look at how Flux deploys my applications with dependencies. In most cases a HelmRelease will depend on other HelmReleases, in other cases a Kustomization will depend on other Kustomizations, and in rare situations an app can depend on both a HelmRelease and a Kustomization. The example below shows that plex won't be deployed or upgraded until the storage dependencies are installed and in a healthy state (a hedged dependsOn sketch follows the diagram).
graph TD
%% Styling
classDef kustomization fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
classDef helmRelease fill:#389826,stroke:#fff,stroke-width:2px,color:#fff
%% Nodes
A>Kustomization: democratic-csi]:::kustomization
B[HelmRelease: democratic-csi]:::helmRelease
C[Kustomization: democratic-csi-driver]:::kustomization
D>Kustomization: plex]:::kustomization
E[HelmRelease: plex]:::helmRelease
%% Relationships with styled edges
A -->|Creates| B
A -->|Creates| C
C -->|Depends on| B
D -->|Creates| E
E -->|Depends on| C
%% Link styling
linkStyle default stroke:#666,stroke-width:2px
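Here is a hedged sketch of how that ordering is expressed: the plex Flux Kustomization declares a dependsOn entry pointing at the democratic-csi-driver Kustomization, so Flux won't reconcile plex until the storage layer is ready. Names, paths, and intervals are illustrative, not copied from this repo.

```yaml
# Hedged sketch only: expressing the dependency from the diagram above.
# The plex Kustomization won't reconcile until democratic-csi-driver is ready.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: plex
  namespace: flux-system
spec:
  dependsOn:
    - name: democratic-csi-driver   # the storage Kustomization from the diagram
  path: ./kubernetes/apps/media/plex/app
  sourceRef:
    kind: GitRepository
    name: flux-system               # the repo Flux is already watching
  interval: 30m
  prune: true
  wait: true                        # block until resources are healthy
```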
My network spans two physical locations connected via a 60GHz wireless bridge, featuring a multi-tier switching architecture optimized for high-performance storage and compute workloads.
Physical Network Topology
graph TB
%% Styling
classDef router fill:#d83933,stroke:#fff,stroke-width:2px,color:#fff
classDef coreswitch fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
classDef distswitch fill:#389826,stroke:#fff,stroke-width:2px,color:#fff
classDef accessswitch fill:#f39c12,stroke:#fff,stroke-width:2px,color:#fff
classDef server fill:#8e44ad,stroke:#fff,stroke-width:2px,color:#fff
classDef wireless fill:#e74c3c,stroke:#fff,stroke-width:2px,color:#fff
subgraph House["House"]
OPN["OPNsense<br/>192.168.1.1<br/>Gateway"]:::router
ARUBA["Aruba S2500-48p<br/>192.168.1.26<br/>PoE Access Switch"]:::accessswitch
CLIENTS["Client Devices"]
WHOUSE["Mikrotik NRay60<br/>192.168.1.7"]:::wireless
OPN --- ARUBA
ARUBA --- CLIENTS
ARUBA --- WHOUSE
end
subgraph Bridge["⚡ 60GHz Bridge"]
WHOUSE -.1Gbps Wireless.-> WSHOP
end
subgraph Garage["Garage/Shop"]
WSHOP["Mikrotik NRay60<br/>192.168.1.8"]:::wireless
BROCADE["Brocade ICX6610<br/>192.168.1.20<br/>Core L3 Switch"]:::coreswitch
ARISTA["Arista 7050<br/>192.168.1.21<br/>40Gb Distribution"]:::distswitch
PX1["Proxmox-01<br/>4C/16GB"]:::server
PX2["Proxmox-02<br/>24C/196GB"]:::server
PX3["Proxmox-03<br/>64C/516GB"]:::server
PX4["Proxmox-04<br/>24C/196GB"]:::server
WSHOP --- BROCADE
BROCADE <-->|2x 40Gb LAG| ARISTA
PX1 -->|2x10Gb + 2x1Gb| BROCADE
PX2 -->|2x10Gb + 2x1Gb| BROCADE
PX3 -->|2x10Gb + 2x1Gb| BROCADE
PX4 -->|2x10Gb + 2x1Gb| BROCADE
PX1 -.40Gb Storage.-> ARISTA
PX2 -.40Gb Storage.-> ARISTA
PX3 -.40Gb Storage.-> ARISTA
PX4 -.40Gb Storage.-> ARISTA
end
style House fill:#ecf0f1,stroke:#34495e,stroke-width:3px
style Garage fill:#ecf0f1,stroke:#34495e,stroke-width:3px
style Bridge fill:#ffebee,stroke:#c62828,stroke-width:2px
Key Features:
- Dual locations connected via 60GHz wireless (1Gbps)
- Multi-tier switching: Core (Brocade), Distribution (Arista), Access (Aruba)
- Dedicated 40Gb storage network on Arista
- LACP bonding on server links for redundancy
Kubernetes Cluster
graph TB
%% Styling
classDef control fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
classDef worker fill:#389826,stroke:#fff,stroke-width:2px,color:#fff
classDef gpu fill:#e74c3c,stroke:#fff,stroke-width:2px,color:#fff
classDef vip fill:#f39c12,stroke:#fff,stroke-width:3px,color:#000
VIP["Kubernetes API VIP<br/>10.100.0.40:6443"]:::vip
subgraph ControlPlane["Control Plane Nodes"]
CP1["talos-control-1<br/>10.100.0.50<br/>4 CPU / 16GB"]:::control
CP2["talos-control-2<br/>10.100.0.51<br/>4 CPU / 16GB"]:::control
CP3["talos-control-3<br/>10.100.0.52<br/>4 CPU / 16GB"]:::control
end
subgraph Workers["Worker Nodes"]
GPU["talos-node-gpu-1<br/>10.100.0.53<br/>16 CPU / 128GB<br/>4x Tesla P100"]:::gpu
W1["talos-node-large-1<br/>10.100.0.54<br/>16 CPU / 128GB"]:::worker
W2["talos-node-large-2<br/>10.100.0.55<br/>16 CPU / 128GB"]:::worker
W3["talos-node-large-3<br/>10.100.0.56<br/>16 CPU / 128GB"]:::worker
end
VIP -.-> CP1
VIP -.-> CP2
VIP -.-> CP3
style ControlPlane fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
style Workers fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
Cluster Configuration:
- Total Nodes: 7 (3 control plane, 4 workers)
- Total Resources: 76 CPU cores, 560GB RAM, 4x Tesla P100 GPUs
- OS: Talos Linux
- CNI: Cilium with eBPF (10.69.0.0/16 pod CIDR); a hedged values sketch follows this list
- API VIP: 10.100.0.40 (shared across control plane)
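For context on how a Cilium deployment like this is typically configured, here is a hedged sketch of Helm values enabling kube-proxy replacement, native routing for the pod CIDR above, and L2 announcements. These are illustrative values, not this repo's actual values.yaml; the API VIP and pod CIDR are taken from the details above, everything else is an assumption.

```yaml
# Hedged sketch only: illustrative Cilium Helm values, not this repo's values.yaml.
# The API VIP (10.100.0.40) and pod CIDR (10.69.0.0/16) come from the cluster
# details above; all other values are assumptions.
kubeProxyReplacement: true          # eBPF replaces kube-proxy entirely
k8sServiceHost: 10.100.0.40         # Kubernetes API VIP
k8sServicePort: 6443
ipam:
  mode: kubernetes
routingMode: native                 # native routing instead of an overlay
ipv4NativeRoutingCIDR: 10.69.0.0/16 # pod CIDR
l2announcements:
  enabled: true                     # announce LoadBalancer IPs via L2
```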
VLAN & Network Segmentation
graph LR
%% Styling
classDef mgmt fill:#95a5a6,stroke:#fff,stroke-width:2px,color:#fff
classDef servers fill:#3498db,stroke:#fff,stroke-width:2px,color:#fff
classDef storage fill:#e74c3c,stroke:#fff,stroke-width:2px,color:#fff
classDef k8s fill:#2ecc71,stroke:#fff,stroke-width:2px,color:#fff
subgraph Physical["Physical Networks"]
V1["VLAN 1<br/>192.168.1.0/24<br/>Management<br/>MTU 1500"]:::mgmt
V100["VLAN 100<br/>10.100.0.0/24<br/>Servers/VMs<br/>MTU 1500"]:::servers
V150["VLAN 150<br/>10.150.0.0/24<br/>Storage Public<br/>MTU 9000"]:::storage
V200["VLAN 200<br/>10.200.0.0/24<br/>Storage Cluster<br/>MTU 9000"]:::storage
end
subgraph Kubernetes["Kubernetes Networks"]
POD["Pod Network<br/>10.69.0.0/16<br/>Cilium CNI"]:::k8s
SVC["Service Network<br/>10.96.0.0/16<br/>ClusterIP"]:::k8s
end
V100 -.Talos VMs.-> POD
V100 -.Talos VMs.-> SVC
V150 -.CSI Drivers.-> POD
style Physical fill:#ecf0f1,stroke:#34495e,stroke-width:2px
style Kubernetes fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
Network Details:
- VLAN 1: Switch management, IPMI, gateway: OPNsense
- VLAN 100: Kubernetes nodes, gateway: OPNsense, internet access
- VLAN 150: Storage client connections, jumbo frames, L2 only
- VLAN 200: Storage cluster traffic (40Gb dedicated), jumbo frames, L2 only
- Pod Network: Cilium eBPF-based CNI with native routing
- Service Network: Standard Kubernetes ClusterIP services
For detailed network documentation including switch configurations, see Network Architecture.
While most of my infrastructure and workloads are self-hosted, I do rely on the cloud for certain key parts of my setup. This saves me from having to worry about three things: (1) chicken-and-egg scenarios, (2) services I critically need whether my cluster is online or not, and (3) the "hit by a bus" factor - what happens to the critical apps my family relies on (e.g. email, password manager, photos) when I'm no longer around.
An alternative solution to the first two of these problems would be to host a Kubernetes cluster in the cloud and deploy applications like HCVault, Vaultwarden, ntfy, and Gatus; however, maintaining another cluster and monitoring another group of workloads would be more work and would likely cost as much as, or more than, the services listed below.
| Service | Use | Cost |
|---|---|---|
| Infisical | Secrets with External Secrets | Free |
| Cloudflare | Domain and S3 | Free |
| GCP | Voice interactions with Home Assistant over Google Assistant | Free |
| GitHub | Hosting this repository and continuous integration/deployments | Free |
| Migadu | Email hosting | ~$20/yr |
| Pushover | Kubernetes alerts and application notifications | $5 one-time purchase |
| UptimeRobot | Monitoring internet connectivity and external facing applications | Free |
| Total | | ~$2/mo |
In my cluster there are two instances of ExternalDNS running: one syncs private DNS records to my UDM Pro Max using the ExternalDNS webhook provider for UniFi, while the other syncs public DNS records to Cloudflare. This setup is managed by creating ingresses with two specific classes: internal for private DNS and external for public DNS. The external-dns instances then sync the DNS records to their respective platforms accordingly.
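As a hedged illustration of that split, here is a minimal Ingress using the internal class; swapping the class to external (plus an external-dns target annotation pointing at the Cloudflare tunnel) would publish it publicly instead. The hostnames and annotation value are placeholders, not records from my setup.

```yaml
# Hedged sketch only: an internally resolved Ingress. Hostnames and the
# annotation target are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: observability
  annotations:
    external-dns.alpha.kubernetes.io/target: internal.example.com  # DNS record target
spec:
  ingressClassName: internal        # picked up by the UniFi external-dns instance
  rules:
    - host: grafana.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 80
```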
| Device | CPU | RAM | Storage | Network | Function |
|---|---|---|---|---|---|
| Proxmox-01 | Intel Xeon E3-1230 V2 (4 cores @ 3.30GHz) | 16GB | 24x 4TB HDD | 1Gb IPMI<br/>2x 1Gb<br/>2x 10Gb<br/>2x 40Gb | Primary Storage |
| Proxmox-02 | 2x Intel Xeon X5680 (24 cores @ 3.33GHz) | 196GB | 2x 120GB ZFS mirror | 1Gb IPMI<br/>2x 1Gb<br/>2x 10Gb<br/>2x 40Gb | Kubernetes + Ceph |
| Proxmox-03 | 2x Intel Xeon E5-2697A v4 (64 cores @ 2.60GHz) | 516GB | 1x 3.92TB SSD + 1x 800GB<br/>4x Tesla P100 16GB | 1Gb IPMI<br/>2x 1Gb<br/>2x 10Gb<br/>2x 40Gb | Kubernetes + Ceph<br/>GPU Passthrough |
| Proxmox-04 | 2x Intel Xeon X5680 (24 cores @ 3.33GHz) | 196GB | 8x 10TB + 1x 3.84TB + 1x 800GB | 1Gb IPMI<br/>2x 1Gb<br/>2x 10Gb<br/>2x 40Gb | Kubernetes + Ceph |
Total Cluster Resources:
- CPU: 116 cores total
- RAM: 924GB total
- Storage: 476.96TB raw (76 drives across all hosts + JBOD)
- GPU: 4x NVIDIA Tesla P100 16GB (64GB VRAM total)
- Network: Multi-tier switching with 40Gb Ceph network
| VM Name | vCPU | RAM | Host | Role | Notes |
|---|---|---|---|---|---|
| talos-control-1 | 4 | 16GB | Proxmox-03 | Control Plane | |
| talos-control-2 | 4 | 16GB | Proxmox-04 | Control Plane | |
| talos-control-3 | 4 | 16GB | Proxmox-02 | Control Plane | |
| talos-node-gpu-1 | 16 | 128GB | Proxmox-03 | Worker | 4x P100 GPU passthrough |
| talos-node-large-1 | 16 | 128GB | Proxmox-03 | Worker | |
| talos-node-large-2 | 16 | 128GB | Proxmox-03 | Worker | |
| talos-node-large-3 | 16 | 128GB | Proxmox-03 | Worker | |
Kubernetes Cluster Totals: 76 vCPU, 560GB RAM, 4x Tesla P100 GPUs
| Device | Model | Location | Role | Specs |
|---|---|---|---|---|
| OPNsense Router | Custom (i3-4130T) | House | Gateway/Firewall | 2C/4T @ 2.90GHz, 16GB RAM, 2.5Gb ATT Fiber |
| Brocade ICX6610 | Enterprise Switch | Garage | Core L3 Switch | 48x 1/10Gb ports, 4x 40Gb QSFP+, VLAN routing |
| Arista 7050 | Data Center Switch | Garage | Distribution | 48x 10Gb SFP+, 4x 40Gb QSFP+ |
| Aruba S2500-48p | Access Switch | House | PoE Access | 48x 1Gb PoE+ ports |
| Mikrotik NRay60 | 60GHz Radio (x2) | Both | Wireless Bridge | 1Gbps point-to-point link |
Storage Distribution:
- Proxmox-01: 24x 4TB HDD (96TB) - Unraid VM
- Proxmox-03: 3x 4TB + 7x 10TB + 2x 12TB (106TB) - Unraid VM
- Proxmox-04: 8x 10TB + 1x 3.84TB + 1x 800GB (84.64TB)
- JBOD Shelf: 18x 10TB + 1x 3.84TB + 1x 800GB (184.64TB)
- Total Raw Capacity: 476.96TB across 76 drives
- Network: Dedicated 40Gb network for storage traffic
Thanks to all the people who donate their time to the Home Operations Discord community. Be sure to check out kubesearch.dev for ideas on how to deploy applications or get ideas on what you could deploy.
