End-to-End Edge IIoT & MLOps Platform
This repository defines and operates a local Kubernetes-based IIoT and MLOps platform. The cluster is managed through GitOps: the desired state lives in this repository, ArgoCD reconciles that state into Kubernetes, and manual drift is corrected automatically.
The platform connects edge devices, MQTT ingestion, time-series storage, Grafana dashboards, model artifacts, and Prefect-driven inference jobs into one local end-to-end system. It is currently a test and development setup, but the layout deliberately follows a clean operating model: cluster bootstrap, application manifests, secrets, runtime images, and flow deployments are kept separate.
Start Here
If you return to this repository after a break, run these checks first:
git status --short --branch
git remote -v
kubectl -n argocd get applications.argoproj.io
kubectl -n prefect get pods
kubectl -n iot-playground get pods
PREFECT_API_URL=http://192.168.188.247:4200/api prefect deployment ls
Expected high-level state:
- Git remote origin points to git.jdynamics.de.
- ArgoCD applications use the HTTPS repo URL https://git.jdynamics.de/dev/iiot-edge-mlops-platform.git.
- ArgoCD applications are Synced and Healthy.
- The Prefect worker is running in namespace prefect.
- IoT services are running in namespace iot-playground.
- Prefect contains the active deployment Smart Factory Prediction/k3s-production-run.
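The ArgoCD part of this check can be scripted. The following is a minimal sketch, not part of the repository: it assumes the output of `kubectl -n argocd get applications.argoproj.io -o json` is available as text. The helper name `unhealthy_apps` and the sample application name `prefect-worker` are hypothetical.

```python
import json


def unhealthy_apps(kubectl_json: str) -> list[str]:
    """Return one entry per ArgoCD application that is not Synced and Healthy."""
    problems = []
    for app in json.loads(kubectl_json).get("items", []):
        status = app.get("status", {})
        sync = status.get("sync", {}).get("status", "Unknown")
        health = status.get("health", {}).get("status", "Unknown")
        if sync != "Synced" or health != "Healthy":
            problems.append(f"{app['metadata']['name']}: {sync}/{health}")
    return problems


# Hypothetical sample in the shape of `kubectl ... -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "bootstrap"},
         "status": {"sync": {"status": "Synced"}, "health": {"status": "Healthy"}}},
        {"metadata": {"name": "prefect-worker"},
         "status": {"sync": {"status": "OutOfSync"}, "health": {"status": "Healthy"}}},
    ]
})
print(unhealthy_apps(sample))
```

Against a live cluster, the JSON text could come from `subprocess.run([...], capture_output=True, text=True).stdout`; an empty result list would correspond to the expected state above.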
Documentation
The detailed documentation lives in readmes/. Start with:
| Topic | Document |
|---|---|
| Day-to-day kubectl usage | readmes/kubectl.md |
| Local Helm rendering and chart checks | readmes/helm.md |
| Local GitOps validation before merge | readmes/gitops-local-validation.md |
| GitOps and ArgoCD | readmes/gitops-argocd.md |
| Cluster monitoring with existing Grafana | readmes/monitoring.md |
| Traefik, MetalLB, local DNS, TLS | readmes/networking-traefik-metallb.md |
| Secrets, Infisical, SealedSecrets | readmes/secrets-infisical-sealedsecrets.md |
| Infisical Operator | readmes/infisical-operator.md |
| Rathole Grafana tunnel | readmes/rathole-grafana.md |
| Prefect flows and worker runtime | readmes/prefect.md |
| Forgejo runner edge deploy | readmes/forgejo-runner-edge-deploy.md |
| Raspberry Pi Smart Plant service | edge/pi-sensor-unit/systemd/README.md |
| Container images | readmes/container-images.md |
| Local HTTPS certificates | readmes/local-https-tls.md |
| Longhorn and storage notes | readmes/longhorn-storage.md |
| K3s/Proxmox update flow | readmes/k3s-proxmox-update-guide.md |
| Proxmox Cloud-Init template | readmes/proxmox-cloud-init-template.md |
System Overview
The cluster has two main paths: the sensor data path and the MLOps execution path.
Edge / Raspberry Pi
    |
    | MQTT: sensors/#
    v
Mosquitto  <--- Telegraf subscribes
    |
    v
InfluxDB   <--- Grafana reads time series

Prefect Server
    |
    | schedules Kubernetes jobs through the Prefect Worker
    v
Prefect Worker --> Kubernetes Job --> clone this repo --> run flow
                                                            |
                                                            +--> InfluxDB: read live sensor data
                                                            +--> MLflow: load model metadata/model
                                                            +--> MinIO: load model artifacts via S3
Short version:
- Edge devices publish sensor values to MQTT topics under sensors/#.
- Mosquitto receives the MQTT messages.
- Telegraf subscribes to MQTT, parses JSON payloads, and writes metrics to InfluxDB.
- Grafana visualizes time-series data from InfluxDB.
- Prefect orchestrates Python flows as Kubernetes jobs.
- Each Prefect job pulls code from this repository and runs inside the private plant-worker image.
- MLflow stores model metadata in PostgreSQL and model artifacts in MinIO.
- ArgoCD keeps the Kubernetes manifests in sync with main.
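To make the sensor path more concrete, here is a small sketch of what an edge device might publish and how the `sensors/#` subscription selects it. The topic name, the payload field names, and both helper functions are illustrative assumptions; the real payload schema is defined by the edge code and the Telegraf configuration. The matcher follows the standard MQTT wildcard rules (`+` for one level, `#` for the remaining levels) and ignores the special handling of `$`-prefixed topics.

```python
import json
import time


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Check an MQTT topic against a filter with + and # wildcards."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: only valid as the last filter level;
            # it matches the parent level and everything below it.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


def build_payload(sensor_id: str, temperature_c: float, humidity_pct: float) -> str:
    """Serialize one reading as JSON; the field names are illustrative only."""
    return json.dumps({
        "sensor_id": sensor_id,
        "temperature_c": temperature_c,
        "humidity_pct": humidity_pct,
        "timestamp": int(time.time()),
    })


topic = "sensors/plant-1/climate"  # hypothetical topic below sensors/
payload = build_payload("plant-1", 21.5, 48.0)
print(topic_matches("sensors/#", topic), payload)
```

An MQTT client library would then publish `payload` to `topic` on the Mosquitto endpoint; that call is left out here to keep the sketch free of third-party dependencies.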
Repository Layout
cluster/
  bootstrap/            ArgoCD app-of-apps definitions
  core/                 Cluster-level components such as ArgoCD, Traefik, MetalLB, Infisical
  apps/
    forgejo-runner/     Dedicated Forgejo Actions runner for edge deploys
    iot-playground/     MQTT, InfluxDB, Telegraf, Grafana, MinIO, MLflow, Rathole
    prefect/            Prefect server and Prefect worker
  databases/            Database Helm charts, currently PostgreSQL
docker/
  edge-deploy-runner/   Forgejo runner image with SSH/rsync for RPi deploys
  plant-worker/         Runtime image used by Prefect flow runs
edge/
  pi-sensor-unit/       Edge-side sensor code and RPi service reference
src/
  flows/                Production Prefect flows
  connection/           Small connection test scripts
  training/             Training/notebook artifacts
readmes/                Centralized runbooks and operating notes
deploy_flow.py          Registers or updates the Prefect deployment
src/.env.example        Local environment template without real secrets
Service Reachability
MetalLB assigns local network IPs from this pool:
192.168.188.240-192.168.188.250
Direct local LoadBalancer services:
| Service | Purpose | LAN access |
|---|---|---|
| Traefik | HTTPS ingress for *.iot.local hostnames | 192.168.188.240 |
| ArgoCD | GitOps UI/API | http://192.168.188.243 or https://192.168.188.243 |
| Mosquitto | MQTT ingest for edge devices | 192.168.188.245:1883 |
| InfluxDB | time-series API/UI | http://192.168.188.246:8086 |
| Prefect Server | Prefect UI/API | http://192.168.188.247:4200 |
| Longhorn | storage UI | http://192.168.188.242 |
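When a LoadBalancer IP is added or changed, it is easy to check that it still falls inside the MetalLB pool. The sketch below uses only the pool range and the LAN addresses from the table above; the variable and function names are hypothetical.

```python
import ipaddress

# MetalLB pool as stated above (both ends inclusive).
POOL_START = ipaddress.ip_address("192.168.188.240")
POOL_END = ipaddress.ip_address("192.168.188.250")

# LAN addresses copied from the service table.
SERVICE_IPS = {
    "Traefik": "192.168.188.240",
    "ArgoCD": "192.168.188.243",
    "Mosquitto": "192.168.188.245",
    "InfluxDB": "192.168.188.246",
    "Prefect Server": "192.168.188.247",
    "Longhorn": "192.168.188.242",
}


def in_pool(ip: str) -> bool:
    """Return True if the address lies inside the MetalLB pool."""
    return POOL_START <= ipaddress.ip_address(ip) <= POOL_END


outside = [name for name, ip in SERVICE_IPS.items() if not in_pool(ip)]
pool_size = int(POOL_END) - int(POOL_START) + 1
unlisted = pool_size - len(set(SERVICE_IPS.values()))

print(f"outside pool: {outside}")
print(f"pool size: {pool_size}, addresses not listed in the table: {unlisted}")
```

Addresses that are not listed in the table are not necessarily free; other services in the cluster may hold them.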
Traefik is the shared HTTPS ingress endpoint. Point local DNS or /etc/hosts
entries for concrete *.iot.local hostnames to 192.168.188.240; Traefik then
routes by hostname. Plain /etc/hosts files usually do not support wildcard
records, so add names one by one, for example:
192.168.188.240 grafana.iot.local mlflow.iot.local prefect.iot.local infisical.iot.local
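Because wildcard names do not work in a plain hosts file, the entry has to be regenerated whenever a hostname is added. A minimal sketch of that step, with a hypothetical helper name and the hostnames from the example line above:

```python
TRAEFIK_IP = "192.168.188.240"
HOSTNAMES = [
    "grafana.iot.local",
    "mlflow.iot.local",
    "prefect.iot.local",
    "infisical.iot.local",
]


def hosts_lines(ip: str, hostnames: list[str], one_per_line: bool = False) -> list[str]:
    """Build /etc/hosts lines that map concrete hostnames to one IP."""
    for name in hostnames:
        if "*" in name:
            # Plain hosts files have no wildcard support, so reject early.
            raise ValueError(f"wildcard hostname not allowed: {name}")
    if one_per_line:
        return [f"{ip} {name}" for name in hostnames]
    return [f"{ip} {' '.join(hostnames)}"]


for line in hosts_lines(TRAEFIK_IP, HOSTNAMES):
    print(line)
```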
Grafana is intentionally exposed only through Traefik locally:
https://grafana.iot.local
For the first public tunnel test, Grafana is also exposed through Rathole:
https://grafana.jdynamics.de
The public HTTPS endpoint and certificate are handled on the external server. Inside the cluster, Rathole forwards to:
grafana.iot-playground.svc.cluster.local:80
Inside Kubernetes, workloads should use service DNS instead of LAN IPs:
| Service | In-cluster address |
|---|---|
| Mosquitto | mosquitto-service.iot-playground.svc.cluster.local:1883 |
| InfluxDB | influxdb-service.iot-playground.svc.cluster.local:8086 |
| Grafana | grafana.iot-playground.svc.cluster.local:80 |
| MLflow | mlflow.iot-playground.svc.cluster.local:5000 |
| MinIO | minio.iot-playground.svc.cluster.local:9000 |
| Prefect API | prefect-server.prefect.svc.cluster.local:4200/api |
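All in-cluster addresses follow the Kubernetes pattern `<service>.<namespace>.svc.cluster.local:<port>`. The sketch below shows one way a script could choose between the in-cluster address and the LAN address for the Prefect API. The function names are hypothetical, the `http://` scheme for the in-cluster URL is an assumption, and the detection relies on the `KUBERNETES_SERVICE_HOST` variable that Kubernetes sets inside pods.

```python
import os


def service_address(service: str, namespace: str, port: int, path: str = "") -> str:
    """Build the cluster-internal DNS address of a Kubernetes service."""
    return f"{service}.{namespace}.svc.cluster.local:{port}{path}"


def prefect_api_url() -> str:
    """Use service DNS inside the cluster and the LAN IP everywhere else."""
    if "KUBERNETES_SERVICE_HOST" in os.environ:
        # Assumption: the in-cluster API is served over plain HTTP.
        return "http://" + service_address("prefect-server", "prefect", 4200, "/api")
    return "http://192.168.188.247:4200/api"


print(service_address("influxdb-service", "iot-playground", 8086))
print(prefect_api_url())
```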
Where To Change What
| Task | Primary place |
|---|---|
| Add or remove an ArgoCD-managed app | cluster/bootstrap/*.yaml |
| Change explicit namespace ownership | cluster/core/namespaces/namespaces.yaml |
| Change MetalLB address pool | cluster/core/metallb-config/config.yaml |
| Change Traefik K3s chart config or TLSStore | cluster/core/traefik/ |
| Change MQTT broker deployment or config | cluster/apps/iot-playground/mosquitto/ |
| Change MQTT-to-InfluxDB ingestion | cluster/apps/iot-playground/telegraf/ |
| Change InfluxDB service/PVC/deployment | cluster/apps/iot-playground/influxdb/ |
| Change Grafana deployment, ingress, or persistence | cluster/apps/iot-playground/grafana/ |
| Change Grafana Prometheus datasource or starter dashboard | cluster/apps/iot-playground/grafana/provisioning.yaml |
| Change Grafana Rathole tunnel | cluster/apps/iot-playground/rathole-grafana/ |
| Change cluster monitoring backend | cluster/apps/monitoring/kube-prometheus-stack/ |
| Change MLflow deployment or S3/PostgreSQL wiring | cluster/apps/iot-playground/mlflow/ |
| Change MinIO deployment or service | cluster/apps/iot-playground/minio/ |
| Change Prefect server Helm values | cluster/apps/prefect/prefect-server/values.yaml |
| Change Prefect worker image, command, or env | cluster/apps/prefect/prefect-worker/deployment.yaml |
| Change Prefect worker RBAC or image pull secret reference | cluster/apps/prefect/prefect-worker/rbac.yaml |
| Change Forgejo edge runner manifests | cluster/apps/forgejo-runner/ |
| Change Edge deploy workflow | .forgejo/workflows/edge-deploy.yaml |
| Change flow deployment source, image, namespace, or env | deploy_flow.py |
| Change prediction logic | src/flows/predict_flow_prod.py |
| Change runtime image | docker/plant-worker/Dockerfile |
| Change edge deploy runner image | docker/edge-deploy-runner/Dockerfile |
| Change runtime secret sync structure | cluster/apps/iot-playground/secrets/infisical-static-secrets.yaml and the matching Infisical path |
| Change namespace ingress policy | cluster/apps/iot-playground/policies/network-policies.yaml |
| Change local TLS certificate handling | readmes/local-https-tls.md and the global-iot-tls Infisical path |
| Change local-only development variables | src/.env based on src/.env.example |
GitOps Rules
The root ArgoCD application is:
name: bootstrap
repo: https://git.jdynamics.de/dev/iiot-edge-mlops-platform.git
revision: main
path: cluster/bootstrap
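For orientation, the four values above map onto an ArgoCD `Application` object roughly as follows. This is a sketch expressed as a Python dictionary and printed as JSON, not a copy of the manifest in cluster/bootstrap. The `project`, `destination`, and `syncPolicy` values are assumptions; the automated sync policy is inferred only from the statement that manual drift is corrected automatically.

```python
import json

root_app = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "bootstrap", "namespace": "argocd"},
    "spec": {
        "project": "default",  # assumption
        "source": {
            "repoURL": "https://git.jdynamics.de/dev/iiot-edge-mlops-platform.git",
            "targetRevision": "main",
            "path": "cluster/bootstrap",
        },
        "destination": {
            "server": "https://kubernetes.default.svc",  # assumption: same cluster
            "namespace": "argocd",  # assumption
        },
        # Assumption: automated sync with pruning and self-healing enabled.
        "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
    },
}

print(json.dumps(root_app, indent=2))
```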
Operational rules:
- Persistent infrastructure changes belong in Git.
- Work on a branch, merge into main, then let ArgoCD reconcile.
- Use direct kubectl changes only for checks, temporary debugging, or bootstrap tasks that cannot live safely in Git.
- Do not commit plaintext secrets, tokens, local .env files, or private keys.
- Keep generated/operator-owned secrets out of Git. Put secret values in Infisical.
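The rule about secrets can be backed by a simple check before committing. The sketch below is an illustration under stated assumptions, not an existing hook in this repository: it flags local `.env` files (but not `.env.example`) and any file whose content contains a PEM private key header. The helper name and the sample paths other than `src/.env` and `src/.env.example` are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Matches PEM headers such as "BEGIN PRIVATE KEY" or "BEGIN RSA PRIVATE KEY".
PRIVATE_KEY_MARKER = re.compile(r"-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----")


def is_forbidden(path: str, content: str = "") -> bool:
    """Return True if a file should not be committed under the rules above."""
    name = PurePosixPath(path).name
    if name == ".env":
        return True
    if name.startswith(".env.") and not name.endswith(".example"):
        return True
    return PRIVATE_KEY_MARKER.search(content) is not None


# Hypothetical staged files as (path, content) pairs.
staged = [
    ("src/.env", "INFLUX_TOKEN=..."),
    ("src/.env.example", "INFLUX_TOKEN="),
    ("cert/local.key", "-----BEGIN RSA PRIVATE KEY-----\n..."),
    ("sealed-secrets-public-key.pem", "-----BEGIN PUBLIC KEY-----\n..."),
]
print([path for path, content in staged if is_forbidden(path, content)])
```

A filename and header check like this does not detect tokens pasted into ordinary files, so it complements the rule rather than enforcing it.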
Lean Clean State
The intended test-cluster baseline is deliberately small:
- Git remains the source of truth for ArgoCD applications, app manifests, network policies, namespace ownership, Infisical sync objects, the Traefik HelmChartConfig, the shared Traefik TLSStore, and the MetalLB address pool.
- Infisical remains the source of truth for runtime secret values.
- SealedSecrets are only used for bootstrap secrets that are needed before Infisical can sync runtime secrets.
- Kubernetes Dashboard is not installed.
- Traefik serves local application ingress on 192.168.188.240 for concrete *.iot.local hostnames.
- The Traefik dashboard stays tunnel-only on pod port 8080.
- Grafana has no dedicated MetalLB IP.
- Longhorn backups are intentionally still open.
Quick verification:
kubectl -n argocd get applications.argoproj.io \
-o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status,REV:.status.sync.revision
kubectl get ns kubernetes-dashboard --ignore-not-found
kubectl get clusterrole,clusterrolebinding | grep -Ei 'dashboard|admin-user' || true
kubectl -n kube-system get svc traefik -o wide
kubectl -n kube-system get svc traefik \
-o jsonpath='{range .spec.ports[*]}{.name}:{.port}->{.targetPort}{"\n"}{end}'
kubectl -n metallb-system get ipaddresspool,l2advertisement
kubectl -n iot-playground get pod -l app=rathole-grafana
Expected results:
- all ArgoCD applications are Synced and Healthy;
- no kubernetes-dashboard namespace, dashboard ClusterRole, or old admin-user ClusterRoleBinding exists;
- Traefik has external IP 192.168.188.240;
- the Traefik service exposes only web and websecure, not the dashboard entrypoint;
- dashboard troubleshooting uses kubectl -n kube-system port-forward deployment/traefik 8080:8080;
- MetalLB has default-pool and default;
- the Rathole Grafana pod is Running if the public tunnel test is enabled.
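The Traefik port check lends itself to automation because the jsonpath query prints one `name:port->targetPort` line per service port. The sketch below parses that format and verifies that only `web` and `websecure` are exposed. The function names and the sample port numbers are hypothetical.

```python
def parse_ports(jsonpath_output: str) -> dict[str, str]:
    """Parse 'name:port->targetPort' lines into {name: 'port->targetPort'}."""
    ports = {}
    for line in jsonpath_output.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, mapping = line.partition(":")
        ports[name] = mapping
    return ports


def only_expected_entrypoints(jsonpath_output: str,
                              expected: frozenset = frozenset({"web", "websecure"})) -> bool:
    """Return True if the service exposes exactly the expected port names."""
    return set(parse_ports(jsonpath_output)) == set(expected)


# Hypothetical output of the jsonpath query above.
sample = "web:80->web\nwebsecure:443->websecure\n"
print(parse_ports(sample))
print(only_expected_entrypoints(sample))
```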
Known Technical Debt
- This is a test setup. Some credentials and Helm chart values still reflect that.
- The registry pull token uses read:package, which is not cleanly repository-scoped in Forgejo today.
- Longhorn backup design is still open.
- Longhorn installation and frontend service are not fully GitOps-managed yet.
- Grafana dashboards and datasources are not fully managed as code.
- The worker image is referenced as tag v1. For stricter reproducibility, use immutable tags or a digest.
- The prediction flow requires recent sensor data. Without fresh MQTT/InfluxDB data, a failed prediction run is expected.
- SealedSecrets can only be decrypted by the matching cluster key. If the cluster is rebuilt, preserve the SealedSecrets key or reseal all secrets.
- MLflow installs a few Python dependencies at container start. A custom MLflow image would be cleaner.
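The note on the worker image tag can be turned into a small lint: an image reference is reproducible only if it ends in a content digest. A minimal sketch, assuming hypothetical registry and image names:

```python
import re

# A digest-pinned reference ends in @sha256: followed by 64 hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference is pinned to a sha256 digest."""
    return DIGEST_RE.search(image_ref) is not None


# Hypothetical references; the real registry path is not shown in this README.
tagged = "registry.example/plant-worker:v1"
pinned = "registry.example/plant-worker@sha256:" + "a" * 64

print(is_digest_pinned(tagged))
print(is_digest_pinned(pinned))
```

A mutable tag such as `v1` can be repointed to a different image at any time, whereas a digest identifies exactly one image manifest.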