
End-to-End Edge IIoT & MLOps Platform

This repository defines and operates a local Kubernetes-based IIoT and MLOps platform. The cluster is managed through GitOps: the desired state lives in this repository, ArgoCD reconciles that state into Kubernetes, and manual drift is corrected automatically.

The platform connects edge devices, MQTT ingestion, time-series storage, Grafana dashboards, model artifacts, and Prefect-driven inference jobs into one local end-to-end system. It is currently a test and development setup, but the layout deliberately follows a clean operating model: cluster bootstrap, application manifests, secrets, runtime images, and flow deployments are kept separate.

Start Here

If you return to this repository after a break, run these checks first:

git status --short --branch
git remote -v
kubectl -n argocd get applications.argoproj.io
kubectl -n prefect get pods
kubectl -n iot-playground get pods
PREFECT_API_URL=http://192.168.188.247:4200/api prefect deployment ls

Expected high-level state:

  • Git remote origin points to git.jdynamics.de.
  • ArgoCD applications use the HTTPS repo URL https://git.jdynamics.de/dev/iiot-edge-mlops-platform.git.
  • ArgoCD applications are Synced and Healthy.
  • The Prefect worker is running in namespace prefect.
  • IoT services are running in namespace iot-playground.
  • Prefect contains the active deployment Smart Factory Prediction/k3s-production-run.
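
The Synced/Healthy expectation can also be checked by a script. The following is a minimal sketch, not part of the repository: it assumes kubectl is on the PATH with a working kubeconfig, and the function names are hypothetical. The JSON fields it reads are the same ones used by the custom-columns query in this README (.status.sync.status and .status.health.status).

```python
import json
import subprocess


def unhealthy_apps(app_list: dict) -> list[str]:
    """Return one line per application that is not both Synced and Healthy."""
    problems = []
    for item in app_list.get("items", []):
        status = item.get("status", {})
        sync = status.get("sync", {}).get("status")
        health = status.get("health", {}).get("status")
        if sync != "Synced" or health != "Healthy":
            name = item["metadata"]["name"]
            problems.append(f"{name}: sync={sync}, health={health}")
    return problems


def fetch_apps() -> dict:
    """Read all ArgoCD applications from the cluster (needs kubectl access)."""
    result = subprocess.run(
        ["kubectl", "-n", "argocd", "get", "applications.argoproj.io", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling unhealthy_apps(fetch_apps()) on a workstation with cluster access returns an empty list when the expected state holds.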

Documentation

The detailed documentation lives in readmes/. Start with:

Topic                                     Document
Day-to-day kubectl usage                  readmes/kubectl.md
Local Helm rendering and chart checks     readmes/helm.md
Local GitOps validation before merge      readmes/gitops-local-validation.md
GitOps and ArgoCD                         readmes/gitops-argocd.md
Cluster monitoring with existing Grafana  readmes/monitoring.md
Traefik, MetalLB, local DNS, TLS          readmes/networking-traefik-metallb.md
Secrets, Infisical, SealedSecrets         readmes/secrets-infisical-sealedsecrets.md
Infisical Operator                        readmes/infisical-operator.md
Rathole Grafana tunnel                    readmes/rathole-grafana.md
Prefect flows and worker runtime          readmes/prefect.md
Forgejo runner edge deploy                readmes/forgejo-runner-edge-deploy.md
Raspberry Pi Smart Plant service          edge/pi-sensor-unit/systemd/README.md
Container images                          readmes/container-images.md
Local HTTPS certificates                  readmes/local-https-tls.md
Longhorn and storage notes                readmes/longhorn-storage.md
K3s/Proxmox update flow                   readmes/k3s-proxmox-update-guide.md
Proxmox Cloud-Init template               readmes/proxmox-cloud-init-template.md

System Overview

The cluster has two main paths: the sensor data path and the MLOps execution path.

Edge / Raspberry Pi
        |
        | MQTT: sensors/#
        v
Mosquitto  <--- Telegraf subscribes
        |
        v
InfluxDB <--- Grafana reads time series

Prefect Server
        |
        | schedules Kubernetes jobs through the Prefect Worker
        v
Prefect Worker --> Kubernetes Job --> clone this repo --> run flow
                                      |
                                      +--> InfluxDB: read live sensor data
                                      +--> MLflow: load model metadata/model
                                      +--> MinIO: load model artifacts via S3

Short version:

  1. Edge devices publish sensor values to MQTT topics under sensors/#.
  2. Mosquitto receives the MQTT messages.
  3. Telegraf subscribes to MQTT, parses JSON payloads, and writes metrics to InfluxDB.
  4. Grafana visualizes time-series data from InfluxDB.
  5. Prefect orchestrates Python flows as Kubernetes jobs.
  6. Each Prefect job pulls code from this repository and runs inside the private plant-worker image.
  7. MLflow stores model metadata in PostgreSQL and model artifacts in MinIO.
  8. ArgoCD keeps the Kubernetes manifests in sync with main.
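
To make steps 1 to 3 more concrete, here is a small sketch of what an edge payload and the sensors/# subscription filter look like. It is illustrative only: the payload field names (sensor_id, temperature_c, humidity_pct, ts) and both function names are assumptions, not the format used by the edge code or the Telegraf config. The matcher follows the standard MQTT wildcard rules for + and # and ignores the special handling of topics that start with $.

```python
import json
import time


def build_payload(sensor_id: str, temperature_c: float, humidity_pct: float) -> str:
    """Serialize one reading as JSON (hypothetical field names)."""
    return json.dumps({
        "sensor_id": sensor_id,
        "temperature_c": temperature_c,
        "humidity_pct": humidity_pct,
        "ts": int(time.time()),  # Unix timestamp in seconds
    })


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Check a concrete topic against an MQTT filter with + and # wildcards."""
    filter_parts = topic_filter.split("/")
    topic_parts = topic.split("/")
    for index, part in enumerate(filter_parts):
        if part == "#":
            # Multi-level wildcard: matches the parent level and everything below.
            return True
        if index >= len(topic_parts):
            return False
        if part != "+" and part != topic_parts[index]:
            return False
    return len(filter_parts) == len(topic_parts)
```

With this, a message published to a topic such as sensors/plant1/climate (a made-up topic name) is covered by the sensors/# filter, while anything outside the sensors/ tree is not.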

Repository Layout

cluster/
  bootstrap/                 ArgoCD app-of-apps definitions
  core/                      Cluster-level components such as ArgoCD, Traefik, MetalLB, Infisical
  apps/
    forgejo-runner/          Dedicated Forgejo Actions runner for edge deploys
    iot-playground/          MQTT, InfluxDB, Telegraf, Grafana, MinIO, MLflow, Rathole
    monitoring/              Cluster monitoring backend (kube-prometheus-stack)
    prefect/                 Prefect server and Prefect worker
  databases/                 Database Helm charts, currently PostgreSQL

docker/
  edge-deploy-runner/        Forgejo runner image with SSH/rsync for RPi deploys
  plant-worker/              Runtime image used by Prefect flow runs

edge/
  pi-sensor-unit/            Edge-side sensor code and RPi service reference

src/
  flows/                     Production Prefect flows
  connection/                Small connection test scripts
  training/                  Training/notebook artifacts

readmes/                     Centralized runbooks and operating notes
deploy_flow.py               Registers or updates the Prefect deployment
src/.env.example             Local environment template without real secrets

Service Reachability

MetalLB assigns local network IPs from this pool:

192.168.188.240-192.168.188.250
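
When planning a new LoadBalancer service, it can help to check a candidate address against this range. A minimal sketch using only the Python standard library; the constant and function names are hypothetical:

```python
import ipaddress

# MetalLB pool boundaries as documented in this README.
POOL_START = ipaddress.IPv4Address("192.168.188.240")
POOL_END = ipaddress.IPv4Address("192.168.188.250")


def in_pool(address: str) -> bool:
    """Return True if the address lies inside the MetalLB pool (inclusive)."""
    return POOL_START <= ipaddress.IPv4Address(address) <= POOL_END


def pool_size() -> int:
    """Number of addresses in the pool, counting both boundaries."""
    return int(POOL_END) - int(POOL_START) + 1
```

The range is inclusive on both ends, so the pool holds 11 addresses in total.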

Direct local LoadBalancer services:

Service         Purpose                                  LAN access
Traefik         HTTPS ingress for *.iot.local hostnames  192.168.188.240
ArgoCD          GitOps UI/API                            http://192.168.188.243 or https://192.168.188.243
Mosquitto       MQTT ingest for edge devices             192.168.188.245:1883
InfluxDB        time-series API/UI                       http://192.168.188.246:8086
Prefect Server  Prefect UI/API                           http://192.168.188.247:4200
Longhorn        storage UI                               http://192.168.188.242

Traefik is the shared HTTPS ingress endpoint. Point local DNS or /etc/hosts entries for concrete *.iot.local hostnames to 192.168.188.240; Traefik then routes by hostname. Plain /etc/hosts files usually do not support wildcard records, so add names one by one, for example:

192.168.188.240 grafana.iot.local mlflow.iot.local prefect.iot.local infisical.iot.local
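
Because each hostname has to be listed explicitly, a quick check for missing entries can be useful. The sketch below parses hosts-file text and reports which names are not yet mapped to the Traefik IP. The function name is hypothetical, and reading the real /etc/hosts is left to the caller.

```python
def missing_hosts(hosts_text: str, ip: str, names: list[str]) -> list[str]:
    """Return the names that are not mapped to ip in the given hosts-file text."""
    mapped = set()
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address and hostnames.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == ip:
            mapped.update(fields[1:])
    return [name for name in names if name not in mapped]
```

A typical call would pass the content of /etc/hosts, the address 192.168.188.240, and the list of *.iot.local names in use.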

On the LAN, Grafana is intentionally exposed only through Traefik:

https://grafana.iot.local

For the first public tunnel test, Grafana is also exposed through Rathole:

https://grafana.jdynamics.de

The public HTTPS endpoint and certificate are handled on the external server. Inside the cluster, Rathole forwards to:

grafana.iot-playground.svc.cluster.local:80

Inside Kubernetes, workloads should use service DNS instead of LAN IPs:

Service      In-cluster address
Mosquitto    mosquitto-service.iot-playground.svc.cluster.local:1883
InfluxDB     influxdb-service.iot-playground.svc.cluster.local:8086
Grafana      grafana.iot-playground.svc.cluster.local:80
MLflow       mlflow.iot-playground.svc.cluster.local:5000
MinIO        minio.iot-playground.svc.cluster.local:9000
Prefect API  prefect-server.prefect.svc.cluster.local:4200/api
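
Code that runs both locally and inside the cluster can pick the right address at runtime. This is a sketch under assumptions, not how the flows in src/ are written: it relies on the KUBERNETES_SERVICE_HOST variable that Kubernetes injects into pods, assumes plain http for the in-cluster InfluxDB endpoint, and uses hypothetical function names.

```python
import os
from typing import Mapping


def service_dns(service: str, namespace: str, port: int) -> str:
    """Build the cluster-internal DNS address of a Kubernetes service."""
    return f"{service}.{namespace}.svc.cluster.local:{port}"


def influxdb_url(environ: Mapping[str, str] = os.environ) -> str:
    """Use service DNS inside a pod, the MetalLB address on the LAN."""
    if "KUBERNETES_SERVICE_HOST" in environ:
        # Assumption: the in-cluster endpoint is served over plain http.
        return "http://" + service_dns("influxdb-service", "iot-playground", 8086)
    return "http://192.168.188.246:8086"
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.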

Where To Change What

Task                                                       Primary place
Add or remove an ArgoCD-managed app                        cluster/bootstrap/*.yaml
Change explicit namespace ownership                        cluster/core/namespaces/namespaces.yaml
Change MetalLB address pool                                cluster/core/metallb-config/config.yaml
Change Traefik K3s chart config or TLSStore                cluster/core/traefik/
Change MQTT broker deployment or config                    cluster/apps/iot-playground/mosquitto/
Change MQTT-to-InfluxDB ingestion                          cluster/apps/iot-playground/telegraf/
Change InfluxDB service/PVC/deployment                     cluster/apps/iot-playground/influxdb/
Change Grafana deployment, ingress, or persistence         cluster/apps/iot-playground/grafana/
Change Grafana Prometheus datasource or starter dashboard  cluster/apps/iot-playground/grafana/provisioning.yaml
Change Grafana Rathole tunnel                              cluster/apps/iot-playground/rathole-grafana/
Change cluster monitoring backend                          cluster/apps/monitoring/kube-prometheus-stack/
Change MLflow deployment or S3/PostgreSQL wiring           cluster/apps/iot-playground/mlflow/
Change MinIO deployment or service                         cluster/apps/iot-playground/minio/
Change Prefect server Helm values                          cluster/apps/prefect/prefect-server/values.yaml
Change Prefect worker image, command, or env               cluster/apps/prefect/prefect-worker/deployment.yaml
Change Prefect worker RBAC or image pull secret reference  cluster/apps/prefect/prefect-worker/rbac.yaml
Change Forgejo edge runner manifests                       cluster/apps/forgejo-runner/
Change Edge deploy workflow                                .forgejo/workflows/edge-deploy.yaml
Change flow deployment source, image, namespace, or env    deploy_flow.py
Change prediction logic                                    src/flows/predict_flow_prod.py
Change runtime image                                       docker/plant-worker/Dockerfile
Change edge deploy runner image                            docker/edge-deploy-runner/Dockerfile
Change runtime secret sync structure                       cluster/apps/iot-playground/secrets/infisical-static-secrets.yaml and the matching Infisical path
Change namespace ingress policy                            cluster/apps/iot-playground/policies/network-policies.yaml
Change local TLS certificate handling                      readmes/local-https-tls.md and the global-iot-tls Infisical path
Change local-only development variables                    src/.env based on src/.env.example

GitOps Rules

The root ArgoCD application is:

name: bootstrap
repo: https://git.jdynamics.de/dev/iiot-edge-mlops-platform.git
revision: main
path: cluster/bootstrap
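
For orientation, the four values above map onto an ArgoCD Application object roughly as follows. This is a sketch, not the manifest stored in cluster/bootstrap: the project name, the destination, and the syncPolicy block are assumptions chosen to match the self-healing behaviour described in this README. The structure is built in Python and printed as JSON, which kubectl accepts as well as YAML.

```python
import json

# Values documented in this README.
REPO_URL = "https://git.jdynamics.de/dev/iiot-edge-mlops-platform.git"


def bootstrap_application() -> dict:
    """Build an illustrative ArgoCD Application for the root app."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": "bootstrap", "namespace": "argocd"},
        "spec": {
            "project": "default",  # assumption: default ArgoCD project
            "source": {
                "repoURL": REPO_URL,
                "targetRevision": "main",
                "path": "cluster/bootstrap",
            },
            "destination": {  # assumption: deploy into the local cluster
                "server": "https://kubernetes.default.svc",
                "namespace": "argocd",
            },
            # Assumption: automated sync with pruning and self-healing.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


print(json.dumps(bootstrap_application(), indent=2))
```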

Operational rules:

  • Persistent infrastructure changes belong in Git.
  • Work on a branch, merge into main, then let ArgoCD reconcile.
  • Use direct kubectl changes only for checks, temporary debugging, or bootstrap tasks that cannot live safely in Git.
  • Do not commit plaintext secrets, tokens, local .env files, or private keys.
  • Keep generated/operator-owned secrets out of Git. Put secret values in Infisical.

Lean Clean State

The intended test-cluster baseline is deliberately small:

  • Git remains the source of truth for ArgoCD applications, app manifests, network policies, namespace ownership, Infisical sync objects, the Traefik HelmChartConfig, the shared Traefik TLSStore, and the MetalLB address pool.
  • Infisical remains the source of truth for runtime secret values.
  • SealedSecrets are only used for bootstrap secrets that are needed before Infisical can sync runtime secrets.
  • Kubernetes Dashboard is not installed.
  • Traefik serves local application ingress on 192.168.188.240 for concrete *.iot.local hostnames.
  • The Traefik dashboard is not exposed through the Traefik service; it is reachable only through a kubectl port-forward to pod port 8080.
  • Grafana has no dedicated MetalLB IP.
  • The Longhorn backup design is intentionally left open for now.

Quick verification:

kubectl -n argocd get applications.argoproj.io \
  -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status,REV:.status.sync.revision

kubectl get ns kubernetes-dashboard --ignore-not-found
kubectl get clusterrole,clusterrolebinding | grep -Ei 'dashboard|admin-user' || true

kubectl -n kube-system get svc traefik -o wide
kubectl -n kube-system get svc traefik \
  -o jsonpath='{range .spec.ports[*]}{.name}:{.port}->{.targetPort}{"\n"}{end}'

kubectl -n metallb-system get ipaddresspool,l2advertisement
kubectl -n iot-playground get pod -l app=rathole-grafana

Expected results:

  • all ArgoCD applications are Synced and Healthy;
  • no kubernetes-dashboard namespace, dashboard ClusterRole, or old admin-user ClusterRoleBinding exists;
  • Traefik has external IP 192.168.188.240;
  • the Traefik service exposes only web and websecure, not the dashboard entrypoint;
  • dashboard troubleshooting uses kubectl -n kube-system port-forward deployment/traefik 8080:8080;
  • MetalLB has the IPAddressPool default-pool and the L2Advertisement default;
  • the Rathole Grafana pod is Running if the public tunnel test is enabled.
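
The Traefik expectations in this list can be verified against the JSON form of the service. A minimal sketch, assuming the output of kubectl -n kube-system get svc traefik -o json has already been parsed into a dict; the function name is hypothetical.

```python
EXPECTED_PORTS = {"web", "websecure"}
EXPECTED_IP = "192.168.188.240"


def check_traefik_service(svc: dict) -> list[str]:
    """Return a list of findings; an empty list means the service looks as expected."""
    findings = []
    port_names = {port.get("name", "") for port in svc.get("spec", {}).get("ports", [])}
    if port_names != EXPECTED_PORTS:
        findings.append(f"unexpected ports: {sorted(port_names)}")
    ingress = svc.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    ips = [entry.get("ip") for entry in ingress]
    if EXPECTED_IP not in ips:
        findings.append(f"unexpected external IPs: {ips}")
    return findings
```

An extra port entry, for example a dashboard entrypoint added to the service, shows up as an "unexpected ports" finding.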

Known Technical Debt

  • This is a test setup. Some credentials and Helm chart values still reflect that.
  • The registry pull token uses read:package, which is not cleanly repository-scoped in Forgejo today.
  • Longhorn backup design is still open.
  • Longhorn installation and frontend service are not fully GitOps-managed yet.
  • Grafana dashboards and datasources are not fully managed as code.
  • The worker image is referenced by the tag v1. For stricter reproducibility, use immutable tags or pin an image digest.
  • The prediction flow requires recent sensor data. Without fresh MQTT/InfluxDB data, a failed prediction run is expected.
  • SealedSecrets can only be decrypted by the matching cluster key. If the cluster is rebuilt, preserve the SealedSecrets key or reseal all secrets.
  • MLflow installs a few Python dependencies at container start. A custom MLflow image would be cleaner.
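
Regarding the prediction flow's need for recent sensor data: one way to turn that into an explicit, readable failure is a freshness guard before inference. The sketch below is hypothetical and does not describe the logic in src/flows/predict_flow_prod.py; the 10-minute window and all names are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical threshold; the real flow may need a different window.
MAX_AGE = timedelta(minutes=10)


def has_fresh_data(
    latest_point: Optional[datetime],
    now: datetime,
    max_age: timedelta = MAX_AGE,
) -> bool:
    """Return True if the newest sensor reading is at most max_age old."""
    if latest_point is None:
        # No reading at all counts as stale.
        return False
    return now - latest_point <= max_age


# Example: a reading from three minutes ago, compared against the current time.
current_time = datetime.now(timezone.utc)
print(has_fresh_data(current_time - timedelta(minutes=3), current_time))
```

A flow could call such a guard right after querying InfluxDB and stop with a clear message instead of failing later in the prediction step.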