Projects — Improbability Labs Inc.

Software GPU slicing for Slurm

CUDA-level library interposition that enforces per-job GPU memory and compute limits on shared HPC partitions. Enables 2–4× oversubscription on any NVIDIA GPU without hardware MIG. Integrated with Slurm via prolog/epilog scripts and a custom job submission plugin. Enforced system-wide — users cannot bypass it. Open-source contribution by our founder.

C, CUDA 12, Slurm, CMake, Linux

Bare-metal HPC provisioning stack

CIS Level 2 hardened OS images deployed to bare metal via PXE network boot. Supports Slurm compute and login nodes, Kubernetes (RKE2) worker nodes, Ceph storage nodes, and Proxmox hypervisors — all from the same provisioning pipeline. Automated builds via CI/CD with SCAP security validation. Open-source contribution by our founder.

Warewulf 4, Docker, Slurm, RKE2, Ceph, Proxmox

Edge-to-HPC integration platform

A single backend that wires together Slurm, Kubeflow, GPU inference, edge compute, ArduPilot, Flask APIs, and sensor ingestion. Data comes in from endpoints (drones, robots, sensors, anything with a companion computer and a data link), is processed with HPC and ML, and results or commands go back out. The application layer on top determines the use case: autonomous operations, ISR (intelligence, surveillance, and reconnaissance), monitoring, or analytics. In development.

Slurm, Kubeflow, Inference, Flask, ArduPilot, Edge

HPC algorithmic trading platform

Multi-user trading system running on a Slurm cluster. Parallel backtesting across thousands of strategy permutations, live execution on multiple exchanges, automated portfolio management that promotes winning strategies and culls losers. GPU-accelerated technical analysis.

Slurm, Python, CUDA/cuDF, MySQL Cluster, Flask
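
One common way to fan a parameter sweep out across a Slurm cluster is a job array, where each array task picks its own parameter combination from its task index. The sketch below illustrates that pattern only; it is not taken from the platform itself. The parameter names, values, and function names are hypothetical. `SLURM_ARRAY_TASK_ID` is the environment variable Slurm sets for each array task.

```python
import itertools
import os

# Hypothetical strategy parameter grid; names and values are illustrative only.
PARAM_GRID = {
    "fast_window": [5, 10, 20],
    "slow_window": [50, 100, 200],
    "stop_loss_pct": [1.0, 2.0],
}


def all_combinations(grid):
    """Return every parameter combination as a list of dicts, in a stable order."""
    keys = sorted(grid)  # sort keys so the index-to-combination mapping is deterministic
    value_lists = [grid[k] for k in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


def params_for_task(grid, task_id):
    """Map a Slurm array task index to exactly one parameter combination."""
    combos = all_combinations(grid)
    if not 0 <= task_id < len(combos):
        raise IndexError(f"task_id {task_id} outside 0..{len(combos) - 1}")
    return combos[task_id]


if __name__ == "__main__":
    # Slurm sets SLURM_ARRAY_TASK_ID inside each array task; default to 0 for local runs.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    print(params_for_task(PARAM_GRID, task_id))
```

With this grid there are 3 × 3 × 2 = 18 combinations, so a submission such as `sbatch --array=0-17` would give each task one combination to backtest.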

HPC RAG platform

Retrieval-augmented generation chatbot built for a research computing organization. Multi-LLM routing (OpenAI, Anthropic, Groq, Ollama), vector search over converted documentation, WebSocket streaming, and Kubernetes deployment with TLS. Built and donated as open source by our founder.

RAGflow, LangChain, LLMs, Kubernetes, Python
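
The description does not say how requests are routed between providers, so the following is only a minimal sketch of one possible scheme: the caller passes a `provider/model` identifier and the router validates the provider part. The provider labels come from the list above. The identifier format, the default provider, and the function name are assumptions made for illustration.

```python
# Provider labels come from the project description; everything else is hypothetical.
KNOWN_PROVIDERS = {"openai", "anthropic", "groq", "ollama"}
DEFAULT_PROVIDER = "ollama"  # assumption: unprefixed models go to a locally hosted backend


def route(model_id):
    """Split a 'provider/model' identifier into (provider, model).

    Identifiers without a provider prefix are sent to DEFAULT_PROVIDER.
    Unknown providers and empty model names raise ValueError.
    """
    provider, sep, model = model_id.partition("/")
    if not sep:
        # No slash present, so treat the whole string as a model name.
        return DEFAULT_PROVIDER, model_id
    provider = provider.lower()
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    if not model:
        raise ValueError("missing model name")
    return provider, model
```

An explicit prefix avoids ambiguity when the same model family is served by more than one provider, which a match on model name alone could not resolve.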

GPU-accelerated technical analysis library

Ported a 130+ indicator technical analysis library from CPU-bound pandas to NVIDIA cuDF for GPU execution. Order-of-magnitude speedup for batch analysis across thousands of instruments. Used LLM-assisted code migration at scale.

CUDA, cuDF, Python, RAPIDS, GPU

Open OnDemand HPC web portals

Deployed and maintained Open OnDemand portals for multi-cluster HPC environments. OIDC authentication, interactive compute sessions (Jupyter, VS Code, MATLAB, RStudio), Globus data transfer integration, and custom dashboard apps. Open-source contributions by our founder.

Open OnDemand, OIDC, Slurm, Jupyter, Globus

Production AI platform on Kubernetes

Full Open WebUI deployment on RKE2 Kubernetes: PostgreSQL 16 with pgvector for hybrid vector search, Redis caching, Apache Tika document extraction, LDAP authentication, Whisper speech-to-text, web search integration. Autoscaling 2–10 pods with NFS-backed persistent storage. Deployed by our founder for a research computing group.

Kubernetes, pgvector, Redis, Whisper, LDAP

Large-scale infrastructure migration

Migrated 700+ virtual machines across four data centres in three countries with 15 minutes of total downtime. Built an automated VMware cluster deployment tool that provisioned 25+ standardized clusters from bare metal using custom Ansible playbooks and Kickstart templates.

VMware, Ansible, PowerCLI, Kickstart, Networking
