
dapperdivers/dapper-cluster

🚀 My Home Operations Repository 🚧

... managed with Flux, Renovate, and GitHub Actions 🤖



💡 Overview

This is a mono repository for my home infrastructure and Kubernetes cluster. I try to adhere to Infrastructure as Code (IaC) and GitOps practices using tools like Ansible, Terraform, Kubernetes, Flux, Renovate, and GitHub Actions.


🌱 Kubernetes

My Kubernetes cluster is deployed with Talos. Storage is provided by multiple solutions including NFS via democratic-csi and local storage options.

Storage Architecture:

  • NFS Storage: democratic-csi for shared filesystem access
  • Local Storage: Talos local storage for high-performance workloads
  • Legacy Storage: 2x Unraid VMs providing media storage (migration in progress)
  • Network: Multi-tier network with 40Gb links for storage traffic
  • Total Capacity: 476.96TB raw across 76 drives
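
As a rough illustration of how the NFS tier described above is consumed by workloads, a PersistentVolumeClaim can target a democratic-csi-backed storage class. This is a minimal sketch; the class name `democratic-csi-nfs` and the claim details are placeholders, not necessarily what this cluster uses.

```yaml
# Sketch: claim shared NFS storage from a democratic-csi-backed storage class.
# The storage class name "democratic-csi-nfs" is a placeholder for illustration.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-cache
  namespace: default
spec:
  accessModes:
    - ReadWriteMany              # NFS allows shared access across nodes
  storageClassName: democratic-csi-nfs
  resources:
    requests:
      storage: 100Gi
```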

There is a template over at onedr0p/cluster-template if you want to try and follow along with some of the practices used here.

Core Components

Networking:

  • cilium: eBPF-based CNI with kube-proxy replacement, L2 announcements, and advanced networking features.
  • multus-cni: Multiple network interfaces per pod for IoT and legacy network integration.
  • ingress-nginx: Dual ingress controllers (internal + external) for service routing.
  • external-dns: Automatic DNS management (internal via UniFi, external via Cloudflare).
  • k8s-gateway: Internal DNS server for cluster services.
  • cloudflared: Secure Cloudflare tunnels for external access.
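
To give a flavour of the L2 announcement feature mentioned above, a policy like the following tells Cilium which nodes and interfaces may answer ARP for LoadBalancer IPs. This is a hedged sketch only; the interface regex and node selector are assumptions, not this cluster's actual values.

```yaml
# Sketch of a Cilium L2 announcement policy (cilium.io/v2alpha1).
# Interface list and node selector are illustrative assumptions.
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: l2-default
spec:
  loadBalancerIPs: true          # announce Service LoadBalancer IPs via ARP
  interfaces:
    - ^eth[0-9]+                 # regex of interfaces allowed to announce
  nodeSelector:
    matchLabels:
      kubernetes.io/os: linux
```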

Storage:

Security & Secrets:

Observability:

GPU & Hardware:

GitOps & Automation:

GitOps

Flux watches the clusters in my kubernetes folder (see Directories below) and makes the changes to my clusters based on the state of my Git repository.

The way Flux works for me here is that it recursively searches the kubernetes/apps folder until it finds the top-most kustomization.yaml per directory and then applies all the resources listed in it. That kustomization.yaml will generally only have a namespace resource and one or more Flux kustomizations (ks.yaml). Those Flux kustomizations, in turn, apply the HelmRelease or other resources related to the application.
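
As a sketch of that layout, a hypothetical app directory might look like the following. The app name, paths, and GitRepository name are made up for illustration; they are not copied from this repository.

```yaml
# kubernetes/apps/default/echo-server/kustomization.yaml (hypothetical app)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ./namespace.yaml   # the namespace resource
  - ./ks.yaml          # one or more Flux Kustomizations for the app
---
# kubernetes/apps/default/echo-server/ks.yaml (hypothetical)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: echo-server
  namespace: flux-system
spec:
  path: ./kubernetes/apps/default/echo-server/app
  sourceRef:
    kind: GitRepository
    name: home-kubernetes   # assumed GitRepository name
  interval: 30m
  prune: true
```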

Renovate watches my entire repository looking for dependency updates; when they are found, a PR is automatically created. When some PRs are merged, Flux applies the changes to my cluster.

Directories

This Git repository contains the following directories under Kubernetes.

πŸ“ kubernetes
β”œβ”€β”€ πŸ“ apps           # applications
β”œβ”€β”€ πŸ“ bootstrap      # bootstrap procedures
β”œβ”€β”€ πŸ“ components     # re-useable components
└── πŸ“ flux           # flux system configuration

Flux Workflow

This is a high-level look at how Flux deploys my applications with dependencies. In most cases a HelmRelease will depend on other HelmReleases, in other cases a Kustomization will depend on other Kustomizations, and in rare situations an app can depend on both a HelmRelease and a Kustomization. The example below shows that plex won't be deployed or upgraded until the storage dependencies are installed and in a healthy state.

graph TD
    %% Styling
    classDef kustomization fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
    classDef helmRelease fill:#389826,stroke:#fff,stroke-width:2px,color:#fff

    %% Nodes
    A>Kustomization: democratic-csi]:::kustomization
    B[HelmRelease: democratic-csi]:::helmRelease
    C[Kustomization: democratic-csi-driver]:::kustomization
    D>Kustomization: plex]:::kustomization
    E[HelmRelease: plex]:::helmRelease

    %% Relationships with styled edges
    A -->|Creates| B
    A -->|Creates| C
    C -->|Depends on| B
    D -->|Creates| E
    E -->|Depends on| C

    %% Link styling
    linkStyle default stroke:#666,stroke-width:2px
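
In manifest form, that ordering is typically expressed at the Kustomization level with spec.dependsOn. The snippet below is a sketch of the plex to democratic-csi-driver edge from the diagram above; the path and GitRepository name are assumptions, not copied from the repository.

```yaml
# Sketch: the plex Flux Kustomization waits for the democratic-csi-driver
# Kustomization to be Ready before it reconciles. Names follow the diagram above;
# the path and sourceRef name are assumptions.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: plex
  namespace: flux-system
spec:
  dependsOn:
    - name: democratic-csi-driver   # must be Ready before plex is applied
  interval: 30m
  path: ./kubernetes/apps/media/plex/app   # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: home-kubernetes            # assumed GitRepository name
```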

Networking

My network spans two physical locations connected via a 60GHz wireless bridge, featuring a multi-tier switching architecture optimized for high-performance storage and compute workloads.

Physical Network Topology
graph TB
    %% Styling
    classDef router fill:#d83933,stroke:#fff,stroke-width:2px,color:#fff
    classDef coreswitch fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
    classDef distswitch fill:#389826,stroke:#fff,stroke-width:2px,color:#fff
    classDef accessswitch fill:#f39c12,stroke:#fff,stroke-width:2px,color:#fff
    classDef server fill:#8e44ad,stroke:#fff,stroke-width:2px,color:#fff
    classDef wireless fill:#e74c3c,stroke:#fff,stroke-width:2px,color:#fff

    subgraph House["🏠 House"]
        OPN["OPNsense<br/>192.168.1.1<br/>Gateway"]:::router
        ARUBA["Aruba S2500-48p<br/>192.168.1.26<br/>PoE Access Switch"]:::accessswitch
        CLIENTS["Client Devices"]
        WHOUSE["Mikrotik NRay60<br/>192.168.1.7"]:::wireless

        OPN --- ARUBA
        ARUBA --- CLIENTS
        ARUBA --- WHOUSE
    end

    subgraph Bridge["⚡ 60GHz Bridge"]
        WHOUSE -.1Gbps Wireless.-> WSHOP
    end

    subgraph Garage["🏭 Garage/Shop"]
        WSHOP["Mikrotik NRay60<br/>192.168.1.8"]:::wireless
        BROCADE["Brocade ICX6610<br/>192.168.1.20<br/>Core L3 Switch"]:::coreswitch
        ARISTA["Arista 7050<br/>192.168.1.21<br/>40Gb Distribution"]:::distswitch

        PX1["Proxmox-01<br/>4C/16GB"]:::server
        PX2["Proxmox-02<br/>24C/196GB"]:::server
        PX3["Proxmox-03<br/>64C/516GB"]:::server
        PX4["Proxmox-04<br/>24C/196GB"]:::server

        WSHOP --- BROCADE
        BROCADE <-->|2x 40Gb LAG| ARISTA

        PX1 -->|2x10Gb + 2x1Gb| BROCADE
        PX2 -->|2x10Gb + 2x1Gb| BROCADE
        PX3 -->|2x10Gb + 2x1Gb| BROCADE
        PX4 -->|2x10Gb + 2x1Gb| BROCADE

        PX1 -.40Gb Storage.-> ARISTA
        PX2 -.40Gb Storage.-> ARISTA
        PX3 -.40Gb Storage.-> ARISTA
        PX4 -.40Gb Storage.-> ARISTA
    end

    style House fill:#ecf0f1,stroke:#34495e,stroke-width:3px
    style Garage fill:#ecf0f1,stroke:#34495e,stroke-width:3px
    style Bridge fill:#ffebee,stroke:#c62828,stroke-width:2px

Key Features:

  • Dual locations connected via 60GHz wireless (1Gbps)
  • Multi-tier switching: Core (Brocade), Distribution (Arista), Access (Aruba)
  • Dedicated 40Gb storage network on Arista
  • LACP bonding on server links for redundancy

Kubernetes Cluster
graph TB
    %% Styling
    classDef control fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
    classDef worker fill:#389826,stroke:#fff,stroke-width:2px,color:#fff
    classDef gpu fill:#e74c3c,stroke:#fff,stroke-width:2px,color:#fff
    classDef vip fill:#f39c12,stroke:#fff,stroke-width:3px,color:#000

    VIP["Kubernetes API VIP<br/>10.100.0.40:6443"]:::vip

    subgraph ControlPlane["Control Plane Nodes"]
        CP1["talos-control-1<br/>10.100.0.50<br/>4 CPU / 16GB"]:::control
        CP2["talos-control-2<br/>10.100.0.51<br/>4 CPU / 16GB"]:::control
        CP3["talos-control-3<br/>10.100.0.52<br/>4 CPU / 16GB"]:::control
    end

    subgraph Workers["Worker Nodes"]
        GPU["talos-node-gpu-1<br/>10.100.0.53<br/>16 CPU / 128GB<br/>4x Tesla P100"]:::gpu
        W1["talos-node-large-1<br/>10.100.0.54<br/>16 CPU / 128GB"]:::worker
        W2["talos-node-large-2<br/>10.100.0.55<br/>16 CPU / 128GB"]:::worker
        W3["talos-node-large-3<br/>10.100.0.56<br/>16 CPU / 128GB"]:::worker
    end

    VIP -.-> CP1
    VIP -.-> CP2
    VIP -.-> CP3

    style ControlPlane fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Workers fill:#e8f5e9,stroke:#388e3c,stroke-width:2px

Cluster Configuration:

  • Total Nodes: 7 (3 control plane, 4 workers)
  • Total Resources: 76 CPU cores, 560GB RAM, 4x Tesla P100 GPUs
  • OS: Talos Linux
  • CNI: Cilium with eBPF (10.69.0.0/16 pod CIDR)
  • API VIP: 10.100.0.40 (shared across control plane)
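
The shared API VIP is something Talos handles natively in its machine configuration. The fragment below is a hedged sketch only; the interface name and the node address shown are assumptions about these VMs rather than this cluster's actual config.

```yaml
# Sketch of a Talos control plane machine config fragment for the shared API VIP.
# The interface name (eth0) and node address are assumptions.
machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: false
        addresses:
          - 10.100.0.50/24      # the node's own address (talos-control-1 here)
        vip:
          ip: 10.100.0.40       # floating VIP shared by the control plane
```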

VLAN & Network Segmentation
graph LR
    %% Styling
    classDef mgmt fill:#95a5a6,stroke:#fff,stroke-width:2px,color:#fff
    classDef servers fill:#3498db,stroke:#fff,stroke-width:2px,color:#fff
    classDef storage fill:#e74c3c,stroke:#fff,stroke-width:2px,color:#fff
    classDef k8s fill:#2ecc71,stroke:#fff,stroke-width:2px,color:#fff

    subgraph Physical["Physical Networks"]
        V1["VLAN 1<br/>192.168.1.0/24<br/>Management<br/>MTU 1500"]:::mgmt
        V100["VLAN 100<br/>10.100.0.0/24<br/>Servers/VMs<br/>MTU 1500"]:::servers
        V150["VLAN 150<br/>10.150.0.0/24<br/>Storage Public<br/>MTU 9000"]:::storage
        V200["VLAN 200<br/>10.200.0.0/24<br/>Storage Cluster<br/>MTU 9000"]:::storage
    end

    subgraph Kubernetes["Kubernetes Networks"]
        POD["Pod Network<br/>10.69.0.0/16<br/>Cilium CNI"]:::k8s
        SVC["Service Network<br/>10.96.0.0/16<br/>ClusterIP"]:::k8s
    end

    V100 -.Talos VMs.-> POD
    V100 -.Talos VMs.-> SVC
    V150 -.CSI Drivers.-> POD

    style Physical fill:#ecf0f1,stroke:#34495e,stroke-width:2px
    style Kubernetes fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px

Network Details:

  • VLAN 1: Switch management, IPMI, gateway: OPNsense
  • VLAN 100: Kubernetes nodes, gateway: OPNsense, internet access
  • VLAN 150: Storage client connections, jumbo frames, L2 only
  • VLAN 200: Storage cluster traffic (40Gb dedicated), jumbo frames, L2 only
  • Pod Network: Cilium eBPF-based CNI with native routing
  • Service Network: Standard Kubernetes ClusterIP services
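
Where a pod (for example a CSI controller) needs a leg on the jumbo-frame storage VLAN, Multus attaches it via a NetworkAttachmentDefinition. The macvlan parent interface, IPAM type, and address range below are assumptions for illustration, not the cluster's actual configuration.

```yaml
# Hypothetical Multus attachment for the VLAN 150 storage network (MTU 9000).
# Parent interface, IPAM type, and subnet are assumptions.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: storage-vlan150
  namespace: kube-system
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "mode": "bridge",
      "mtu": 9000,
      "ipam": {
        "type": "host-local",
        "subnet": "10.150.0.0/24"
      }
    }
```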

For detailed network documentation including switch configurations, see Network Architecture.


😶 Cloud Dependencies

While most of my infrastructure and workloads are self-hosted, I do rely upon the cloud for certain key parts of my setup. This saves me from having to worry about three things: (1) dealing with chicken-and-egg scenarios, (2) services I critically need whether my cluster is online or not, and (3) the "hit by a bus" factor - what happens to critical apps (e.g. Email, Password Manager, Photos) that my family relies on when I am no longer around.

Alternative solutions to the first two of these problems would be to host a Kubernetes cluster in the cloud and deploy applications like HCVault, Vaultwarden, ntfy, and Gatus; however, maintaining another cluster and monitoring another group of workloads would be more work and would likely cost about the same as the services described below.

| Service | Use | Cost |
|---|---|---|
| Infisical | Secrets with External Secrets | Free |
| Cloudflare | Domain and S3 | Free |
| GCP | Voice interactions with Home Assistant over Google Assistant | Free |
| GitHub | Hosting this repository and continuous integration/deployments | Free |
| Migadu | Email hosting | ~$20/yr |
| Pushover | Kubernetes Alerts and application notifications | $5 OTP |
| UptimeRobot | Monitoring internet connectivity and external facing applications | Free |

Total: ~$2/mo
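
On the Infisical side, secrets reach the cluster through External Secrets resources. The sketch below shows the general shape; the ClusterSecretStore name, secret keys, and target names are all invented for illustration.

```yaml
# Sketch of an ExternalSecret pulling a value from Infisical via a
# ClusterSecretStore. Store name and secret keys are hypothetical.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: cloudflared-credentials
  namespace: network
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: infisical                   # assumed name of the Infisical-backed store
  target:
    name: cloudflared-credentials     # Kubernetes Secret created in-cluster
  data:
    - secretKey: TUNNEL_TOKEN
      remoteRef:
        key: cloudflared-tunnel-token # hypothetical Infisical secret name
```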

🌎 DNS

In my cluster there are two instances of ExternalDNS running. One syncs private DNS records to my UDM Pro Max using the ExternalDNS webhook provider for UniFi, while the other syncs public DNS records to Cloudflare. This setup is managed by creating ingresses with two specific classes: internal for private DNS and external for public DNS. The external-dns instances then sync the DNS records to their respective platforms.
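
In practice, choosing which DNS zone a hostname lands in is just a matter of the ingress class. The example below is a sketch of an internal-only ingress; the hostname, namespace, and service details are placeholders.

```yaml
# Sketch: an internal-only ingress. Using ingressClassName: external instead
# would let the Cloudflare-facing external-dns instance publish it publicly.
# Hostname and backend details are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: observability
spec:
  ingressClassName: internal     # picked up by the UniFi external-dns instance
  rules:
    - host: grafana.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 80
```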


⚙ Hardware

Compute Infrastructure

| Device | CPU | RAM | Storage | Network | Function |
|---|---|---|---|---|---|
| Proxmox-01 | Intel Xeon E3-1230 V2<br/>(4 cores @ 3.30GHz) | 16GB | 24x 4TB HDD | 1Gb IPMI<br/>2x 1Gb<br/>2x 10Gb<br/>2x 40Gb | Primary Storage |
| Proxmox-02 | 2x Intel Xeon X5680<br/>(24 cores @ 3.33GHz) | 196GB | 2x 120GB ZFS mirror | 1Gb IPMI<br/>2x 1Gb<br/>2x 10Gb<br/>2x 40Gb | Kubernetes + Ceph |
| Proxmox-03 | 2x Intel Xeon E5-2697A v4<br/>(64 cores @ 2.60GHz) | 516GB | 1x 3.92TB SSD + 1x 800GB<br/>4x Tesla P100 16GB | 1Gb IPMI<br/>2x 1Gb<br/>2x 10Gb<br/>2x 40Gb | Kubernetes + Ceph<br/>GPU Passthrough |
| Proxmox-04 | 2x Intel Xeon X5680<br/>(24 cores @ 3.33GHz) | 196GB | 8x 10TB + 1x 3.84TB + 1x 800GB | 1Gb IPMI<br/>2x 1Gb<br/>2x 10Gb<br/>2x 40Gb | Kubernetes + Ceph |

Total Cluster Resources:

  • CPU: 116 cores total
  • RAM: 924GB total
  • Storage: 476.96TB raw (76 drives across all hosts + JBOD)
  • GPU: 4x NVIDIA Tesla P100 16GB (64GB VRAM total)
  • Network: Multi-tier switching with 40Gb Ceph network

Kubernetes VMs (Talos Linux)

| VM Name | vCPU | RAM | Host | Role | Notes |
|---|---|---|---|---|---|
| talos-control-1 | 4 | 16GB | Proxmox-03 | Control Plane | |
| talos-control-2 | 4 | 16GB | Proxmox-04 | Control Plane | |
| talos-control-3 | 4 | 16GB | Proxmox-02 | Control Plane | |
| talos-node-gpu-1 | 16 | 128GB | Proxmox-03 | Worker | 4x P100 GPU passthrough |
| talos-node-large-1 | 16 | 128GB | Proxmox-03 | Worker | |
| talos-node-large-2 | 16 | 128GB | Proxmox-03 | Worker | |
| talos-node-large-3 | 16 | 128GB | Proxmox-03 | Worker | |

Kubernetes Cluster Totals: 76 vCPU, 560GB RAM, 4x Tesla P100 GPUs
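
Workloads land on the GPU node by requesting the extended resource that the NVIDIA device plugin exposes. The pod below is purely illustrative and assumes the device plugin (or GPU operator) advertises nvidia.com/gpu; the image tag is an example, not necessarily what this cluster runs.

```yaml
# Illustrative pod requesting one of the passed-through Tesla P100s.
# Assumes the NVIDIA device plugin advertises the nvidia.com/gpu resource.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1      # schedules onto talos-node-gpu-1
```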

Network Equipment

| Device | Model | Location | Role | Specs |
|---|---|---|---|---|
| OPNsense Router | Custom (i3-4130T) | House | Gateway/Firewall | 2C/4T @ 2.90GHz, 16GB RAM, 2.5Gb ATT Fiber |
| Brocade ICX6610 | Enterprise Switch | Garage | Core L3 Switch | 48x 1/10Gb ports, 4x 40Gb QSFP+, VLAN routing |
| Arista 7050 | Data Center Switch | Garage | Distribution | 48x 10Gb SFP+, 4x 40Gb QSFP+ |
| Aruba S2500-48p | Access Switch | House | PoE Access | 48x 1Gb PoE+ ports |
| Mikrotik NRay60 | 60GHz Radio (x2) | Both | Wireless Bridge | 1Gbps point-to-point link |

Storage

Storage Distribution:

  • Proxmox-01: 24x 4TB HDD (96TB) - Unraid VM
  • Proxmox-03: 3x 4TB + 7x 10TB + 2x 12TB (106TB) - Unraid VM
  • Proxmox-04: 8x 10TB + 1x 3.84TB + 1x 800GB (84.64TB)
  • JBOD Shelf: 18x 10TB + 1x 3.84TB + 1x 800GB (184.64TB)
  • Total Raw Capacity: 476.96TB across 76 drives
  • Network: Dedicated 40Gb network for storage traffic

🌟 Stargazers


πŸ™ Gratitude and Thanks

Thanks to all the people who donate their time to the Home Operations Discord community. Be sure to check out kubesearch.dev for ideas on how to deploy applications or get ideas on what you could deploy.
