Heroically Excessive Inference Methodology (for) Data Analytics (on) Large Loads
A telemetry-to-insight pipeline for robotics and autonomous systems. Turns fleet telemetry into natural-language insights via GPU-accelerated data loading (cuDF + UVM), NVIDIA NIM on GKE for LLM inference, and model format selection for production deployment.
Named after Heimdall, the Norse guardian who watches from on high and sees and hears everything. Like him, this pipeline watches over your fleet telemetry from the "cloud" and turns it into insights.
Imagine a fleet of a thousand Waymo autonomous vehicles, Tesla Optimus or Unitree G1 humanoid units, or similar, each generating streams of telemetry. You need to identify which units had anomalous brake events or abnormal sensor measurements last week, or which vehicles exceeded a speed threshold in a given region, or which robots showed elevated motor temperatures during a deployment. Manually querying and cross-referencing that data across hundreds of assets does not scale.
H.E.I.M.D.A.L.L addresses this. You load your fleet telemetry into the pipeline, then ask natural-language questions such as "Which vehicles had brake pressure above 90% in the last 24 hours?" or "List robots with gyro z-axis variance exceeding 0.5." The system returns responses with vehicle or robot IDs, timestamps, and relevant metrics, enabling rapid insights and operational visibility across large fleets of cars, autonomous vehicles, and robots without writing complex queries.
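Under the hood, each natural-language question maps to a simple dataframe filter. A minimal sketch of the first example query in plain pandas (column names and data are illustrative, not the pipeline's actual schema); with `cudf.pandas` loaded, the identical code runs on the GPU:

```python
import pandas as pd

# Illustrative telemetry frame; the real pipeline loads fleet data via cuDF + UVM.
telemetry = pd.DataFrame({
    "vehicle_id": ["wm-001", "wm-002", "wm-003", "wm-001"],
    "timestamp": pd.to_datetime([
        "2024-05-01 08:00", "2024-05-01 09:30",
        "2024-04-28 12:00", "2024-05-01 10:15",
    ]),
    "brake_pressure_pct": [95.2, 41.0, 97.8, 92.1],
})

# "Which vehicles had brake pressure above 90% in the last 24 hours?"
cutoff = pd.Timestamp("2024-05-01 12:00") - pd.Timedelta(hours=24)
hits = telemetry[(telemetry["brake_pressure_pct"] > 90)
                 & (telemetry["timestamp"] >= cutoff)]
print(sorted(hits["vehicle_id"].unique()))  # → ['wm-001']
```

The NL layer's job is to generate and run this kind of filter for you, then phrase the result as an answer.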
- Introduction
- What You Need
- Choose Your Path
- Quick Start (Notebooks 01 & 02)
- Notebooks
- Setup for Notebook 03 (NIM on GKE)
- Architecture
- Results & Takeaways
- Troubleshooting
- Real Data
- Contributing
- Code of Conduct
| To run… | You need |
|---|---|
| Notebook 01 (Data Ingest) | Colab account, GPU runtime (L4/T4 recommended) |
| Notebook 02 (Local Inference) | Colab + GPU + Hugging Face token (for Gemma 2) |
| Notebook 03 (Full Pipeline) | Colab + GCP account (billing) + NGC API key + NIM on GKE |
┌─────────────────────────────────────────────────────────────────┐
│ New to the project? │
│ → Start with Notebook 01 (Data Ingest pandas/cuDF/cudf.pandas) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Want natural-language telemetry Q&A on your machine? │
│ → Run Notebook 02 (Local Inference) with Gemma 2 2B │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Need production-scale inference in the cloud? │
│ → Deploy NIM on GKE, then run Notebook 03 │
└─────────────────────────────────────────────────────────────────┘
Prerequisites: Google account (for Colab), GPU runtime (T4 is default; L4 works too).
- Open 01 Data Ingest (~5 min) and 02 Inference Pipeline (~10 min first run).
- Runtime → Change runtime type → Hardware accelerator: T4 GPU → Save. (L4 if available.)
- Notebook 02 only: Add your Hugging Face token so the notebook can load Gemma 2:
- Go to huggingface.co/google/gemma-2-2b-it and accept the license.
- Create a token at huggingface.co/settings/tokens (click New token, set role to Read).
- In Colab: click the key icon in the left sidebar → Add new secret → name: `HF_TOKEN`, value: paste your token → Save.
- Run all cells.
Run the data ingest notebook on Vertex AI with an NVIDIA L4 GPU, following the same flow as the Accelerated Data Analytics with GPUs Codelab.
- Go to Google Cloud Console
- Navigation menu → Vertex AI → Colab Enterprise
- Click Runtime templates → New template
- Under Runtime basics:
  - Display name: `gpu-l4-template`
  - Region: your preferred region (e.g. `us-central1`)
- Under Configure compute:
  - Machine type: `g2-standard-4` (1× NVIDIA L4 GPU)
  - Idle shutdown: 60 minutes (or as desired)
- Click Create
Note: If you see `NVIDIA_L4_GPUS exceeded`, your L4 quota is used up or too low. See Troubleshooting: L4 quota below. `g2-standard-8` (1× L4) and `g2-standard-16` (2× L4) are alternatives if you have quota.
- Click Runtimes → Create
- Under Runtime template, select `gpu-l4-template`
- Click Create and wait for the runtime to boot (a few minutes)
- Click My notebooks → Import
- Select URL and paste: https://github.com/KarthikSriramGit/H.E.I.M.D.A.L.L/blob/main/notebooks/01_data_ingest_benchmark.ipynb
- Click Import
- Open the imported notebook
- Click the Connect dropdown → Connect to a Runtime
- Select your `gpu-l4-template` runtime → Connect
- Run the setup cell (clone + pip install); cuDF will use the L4 GPU.
- Run all cells. The benchmark will execute on the L4.
To enable zero-code-change GPU acceleration for pandas:
```python
%load_ext cudf.pandas
import pandas as pd
```

Reference: Accelerated Data Analytics with Google Cloud and NVIDIA Codelab
For quick runs without GCP setup: open the notebook in Colab, then Runtime → Change runtime type → GPU (L4) when available (e.g. Colab Pro).
| Notebook | What it does | Requirements |
|---|---|---|
| 01 Data Ingest | cuDF + UVM loading, pandas vs cuDF vs cudf.pandas benchmark | GPU (L4) |
| 02 Inference Pipeline | Format selection, Gemma 2 2B local inference | GPU + HF token |
| 03 Query Telemetry | Full pipeline with NIM (Llama 3 8B on GKE) | NIM deployed (see below) |
Notebook 03 requires NIM running on GKE. Follow these steps.
- Google account with Google Cloud (billing enabled)
- NGC account for the API key
- Go to console.cloud.google.com
- Create a project or select an existing one
- Enable billing for the project
- Note your Project ID (e.g. `heimdall-487621`)
- Go to APIs & Services → Library
- Enable Kubernetes Engine API and Compute Engine API
- Go to ngc.nvidia.com and sign in
- Profile (top right) → Setup → Generate API Key
- Copy the key (starts with `nvapi-`)
Open Google Cloud Shell and run:

```bash
# Set your values
export PROJECT_ID="your-gcp-project-id"
export ZONE="us-central1-a"
export NGC_CLI_API_KEY="your-ngc-api-key"

# Configure gcloud
gcloud config set project $PROJECT_ID

# Create cluster and GPU node pool (skip if they already exist)
gcloud container clusters create nim-demo \
  --project=$PROJECT_ID \
  --location=$ZONE \
  --release-channel=rapid \
  --machine-type=e2-standard-4 \
  --num-nodes=1

gcloud container node-pools create gpupool \
  --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
  --project=$PROJECT_ID \
  --location=$ZONE \
  --cluster=nim-demo \
  --machine-type=g2-standard-16 \
  --num-nodes=1

# Connect to cluster
gcloud container clusters get-credentials nim-demo --zone=$ZONE --project=$PROJECT_ID

# Clone repo and deploy NIM
git clone -q https://github.com/KarthikSriramGit/H.E.I.M.D.A.L.L.git
cd H.E.I.M.D.A.L.L
bash src/deploy/gke/deploy_nim.sh

# Expose NIM with LoadBalancer
kubectl patch svc my-nim-nim-llm -n nim -p '{"spec": {"type": "LoadBalancer"}}'
kubectl get svc -n nim
```

If the cluster already exists, run only:
```bash
export PROJECT_ID="your-gcp-project-id"
export ZONE="us-central1-a"
export NGC_CLI_API_KEY="your-ngc-api-key"

gcloud config set project $PROJECT_ID
gcloud container clusters get-credentials nim-demo --zone=$ZONE --project=$PROJECT_ID

git clone -q https://github.com/KarthikSriramGit/H.E.I.M.D.A.L.L.git
cd H.E.I.M.D.A.L.L
bash src/deploy/gke/deploy_nim.sh

kubectl patch svc my-nim-nim-llm -n nim -p '{"spec": {"type": "LoadBalancer"}}'
kubectl get svc -n nim
```

- Wait for EXTERNAL-IP to appear (not `<pending>`): `kubectl get svc -n nim`
- Wait for the pod to be Running: `kubectl get pods -n nim`
- Note your NIM URL: `http://EXTERNAL_IP:8000` (replace `EXTERNAL_IP` with the IP from `kubectl get svc -n nim`)
Do not commit this URL. Keep it private.
- Open 03 Query Telemetry in Colab
- Store your NIM URL securely using Colab Secrets:
  - Click the key icon (Secrets) in the left sidebar
  - Add: Name = `NIM_BASE_URL`, Value = `http://YOUR_EXTERNAL_IP:8000`
  - Toggle Notebook access to ON
- Run all cells
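Notebook 03 talks to NIM through its OpenAI-compatible HTTP API. A hedged sketch of the round trip using only the standard library (the model id `meta/llama3-8b-instruct` and payload fields are assumptions; confirm the deployed model name via `GET /v1/models` on your endpoint):

```python
import json
from urllib import request

def build_chat_request(nim_base_url: str, question: str):
    """Build an OpenAI-style chat completion request for the NIM endpoint."""
    url = f"{nim_base_url}/v1/chat/completions"
    payload = {
        "model": "meta/llama3-8b-instruct",  # assumed id; confirm via GET /v1/models
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 256,
        "temperature": 0.2,
    }
    return url, payload

url, payload = build_chat_request(
    "http://EXTERNAL_IP:8000",  # your NIM URL (Colab secret NIM_BASE_URL)
    "Which vehicles had brake pressure above 90% in the last 24 hours?",
)

def ask_nim(url: str, payload: dict) -> str:
    """POST the request and return the model's answer text."""
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# print(ask_nim(url, payload))  # run once the LoadBalancer IP is reachable
```

The notebook's own client wraps the same endpoint; this sketch is just the shape of the request.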
To avoid ongoing GCP charges:

```bash
kubectl delete namespace nim
gcloud container clusters delete nim-demo --zone=$ZONE --project=$PROJECT_ID --quiet
```

```
Data Layer                      Inference Layer                Deployment Layer
----------------                ----------------               ----------------
Synthetic Generator    -->      cuDF + UVM Loader      ----->  Format Selector
Benchmark                       Inference Pipeline
(pandas, cuDF, cudf.pandas)     Metrics (p50, p90, TTFT)
Query Engine           <--      NIM Client             <--     NIM on GKE
```
Benchmark results from running the notebooks on Colab: L4 GPU for notebook 01, T4 GPU for notebook 02, and NIM (L4 on GKE) for notebook 03.
The benchmark produces a single figure with 3 plots (time comparison, memory comparison, cuDF speedup) and a summary table.
Run notebook 01 to regenerate this figure; it is saved to docs/assets/01_benchmark_pandas_vs_cudf.png.
Takeaway: cuDF gives ~5× faster load and ~10–13× faster groupby/sort, with negligible host memory (data stays in GPU VRAM via UVM spill). cudf.pandas achieves similar GPU performance using the same pandas API—zero code change required. Use cuDF for explicit control, cudf.pandas to accelerate existing pandas code.
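The benchmark's shape can be sketched in a few lines of plain pandas (synthetic columns, row count scaled down; an assumption-laden sketch, not the notebook's exact code). With `%load_ext cudf.pandas` in the first cell, this identical code is what runs on the GPU:

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000  # the notebook uses millions of rows; scaled down here
df = pd.DataFrame({
    "unit_id": rng.integers(0, 1_000, n),
    "motor_temp_c": rng.normal(55.0, 8.0, n),
})

# Time the groupby+sort pattern the benchmark measures
t0 = time.perf_counter()
summary = (df.groupby("unit_id")["motor_temp_c"]
             .agg(["mean", "max"])
             .sort_values("max", ascending=False))
elapsed = time.perf_counter() - t0

print(f"groupby+sort over {n:,} rows: {elapsed * 1000:.1f} ms")
print(summary.head())
```

Because `cudf.pandas` intercepts the pandas API, the CPU and GPU runs use the same source, which is what makes the comparison fair.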
Natural-language answers to telemetry queries on a T4 GPU. Typical run: ~21s total for 5 queries (~4.2s avg latency).
Takeaway: Local GGUF inference works for low-latency, offline telemetry Q&A. Good for prototyping; heavier workloads may need cloud scaling.
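Before generation, the pipeline has to turn query hits into a prompt the model can ground its answer in. A minimal sketch of that step (function name and template are hypothetical, not the notebook's actual prompt format):

```python
def build_prompt(question: str, rows: list) -> str:
    """Format telemetry hits as context lines, then append the user question."""
    context = "\n".join(
        f"- {r['unit_id']} @ {r['timestamp']}: {r['metric']}={r['value']}"
        for r in rows
    )
    return (
        "You are a fleet telemetry assistant. Answer using only this data:\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(
    "Which robots showed elevated motor temperatures?",
    [{"unit_id": "g1-07", "timestamp": "2024-05-01T10:15Z",
      "metric": "motor_temp_c", "value": 81.4}],
)
print(prompt)
```

The same prompt string can then be passed to the local Gemma 2 model or to NIM, which is what keeps notebooks 02 and 03 comparable.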
Same queries via Llama 3 8B on NVIDIA NIM (L4 on GKE). Longer answers, higher latency due to model size and network.
Takeaway: NIM on GKE scales inference for production. Use notebook 02 for fast local iteration and notebook 03 for production-style deployment.
| Issue | Solution |
|---|---|
| L4 quota exceeded | See L4 quota exceeded below. |
| cuDF fails to import | Ensure you selected a GPU runtime (L4 or T4). cuDF requires a GPU. |
| "No GPU" or cuDF falls back to CPU | Runtime → Change runtime type → GPU (L4 or T4) → Save, then restart runtime. |
| Out of memory (past UVM spill) | Reduce row count in the generate cell (e.g. ROWS = 500_000 instead of 2_000_000). |
If you see Quota 'NVIDIA_L4_GPUS' exceeded. Limit: 1.0 in region us-central1:
- Free up quota: Stop or delete any existing Colab Enterprise runtimes or other L4 instances in that region.
- Request more L4 quota (e.g. Iowa / us-central1):
  - Go to IAM & Admin → Quotas
  - In the filter box, enter `NVIDIA_L4_GPUS`
  - Select your project and Location: us-central1 (Iowa)
  - Check the box next to NVIDIA L4 GPUs
  - Click Edit quotas (top of page)
  - Enter the new limit (e.g. `2` if you want one more)
  - Add a short justification (e.g. "Need additional L4 for Colab Enterprise data analytics workloads")
  - Click Submit. Small increases are often approved quickly, sometimes automatically.
- Try another region: Create the runtime template in a different region (e.g. `us-east1`, `europe-west1`) where you may have quota. See L4 availability.
- Use T4 instead: Create a template with machine type `n1-standard-4` and Attach GPU: NVIDIA Tesla T4 (1). T4 works with cuDF; the notebook runs unchanged. Then import and connect as usual.
| Issue | Solution |
|---|---|
| "HF_TOKEN not set" or 401 | Add HF_TOKEN in Colab Secrets. Accept the Gemma 2 license first. |
| Out of memory (OOM) | Use a GPU runtime (T4 or better). Gemma 2 2B needs ~4–6 GB VRAM. |
| Model download is slow | First run downloads ~5 GB. Use a stable connection; subsequent runs use cache. |
| Issue | Solution |
|---|---|
| ConnectionRefusedError | NIM may not be ready or not reachable. See steps below. |
| EXTERNAL-IP stuck on <pending> | Wait 2–5 min. If it stays pending, check GCP quotas for forwarding rules. |
| Pod not Running | First deployment can take 10–20 min for model download. Run kubectl get pods -n nim -w to watch. |
If you get ConnectionRefusedError when calling NIM from Colab:
- Check the NIM pod is Running:

  ```bash
  kubectl get pods -n nim
  ```

  Wait until STATUS is `Running` and READY is `1/1`.

- Test from Cloud Shell (port-forward):

  ```bash
  kubectl port-forward svc/my-nim-nim-llm 8000:8000 -n nim &
  sleep 5
  curl -s http://localhost:8000/v1/models
  kill %1
  ```

  If this works, NIM is fine; the problem is external access.

- Check service endpoints:

  ```bash
  kubectl get endpoints -n nim
  ```

  `my-nim-nim-llm` should have at least one address. If it shows `<none>`, the pod is not ready.

- Verify LoadBalancer backends: In GCP Console → Network Services → Load balancing, open the load balancer for the NIM service and confirm the backends are healthy.

- Allow firewall (if needed):

  ```bash
  gcloud compute firewall-rules create allow-nim-8000 \
    --allow tcp:8000 \
    --source-ranges 0.0.0.0/0 \
    --target-tags $(kubectl get nodes -o jsonpath='{.items[0].metadata.labels.cloud\.google\.com\/gke-nodepool}' 2>/dev/null || echo "gke-nim-demo")
  ```
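The manual checks above can be wrapped in a small readiness poll. A sketch with the HTTP probe injectable so the retry logic is easy to test offline (helper names are hypothetical, not part of the repo):

```python
import json
import time
from urllib import error, request

def fetch_models(base_url: str) -> bool:
    """Return True if NIM's /v1/models endpoint answers with a model list."""
    try:
        with request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return bool(json.loads(resp.read()).get("data"))
    except (error.URLError, OSError, ValueError):
        return False

def wait_for_nim(base_url: str, probe=fetch_models,
                 attempts: int = 20, delay_s: float = 30.0) -> bool:
    """Poll until the endpoint responds or attempts run out."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        time.sleep(delay_s)
    return False

# Demo with a fake probe that succeeds on the third try:
calls = {"n": 0}
def fake_probe(_url):
    calls["n"] += 1
    return calls["n"] >= 3

ready = wait_for_nim("http://EXTERNAL_IP:8000", probe=fake_probe, delay_s=0.0)
print(ready, calls["n"])  # → True 3
```

In practice you would call `wait_for_nim` with the default probe right after `kubectl patch`, before opening notebook 03.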
| Issue | Solution |
|---|---|
| Colab disconnects or session dies | Notebooks auto-clone the repo; re-run the setup cell. For long runs, consider Colab Pro for longer sessions. |
| Image or asset not loading | Ensure you cloned the full repo and that docs/ is present. |
See data/README_data_sources.md for a plan to gather real telemetry from nuScenes, KITTI, CARLA, ROS2 bags, and OBD-II CAN data.
- Deploy Faster Generative AI models with NVIDIA NIM on GKE
- Intro to Inference: How to Run AI Models on a GPU
- Speed Up Data Analytics on GPUs
Contributions are welcome. To keep the codebase safe, direct pushes to main are not allowed. All changes must go through pull requests (PRs).
- Fork the repository to your GitHub account.
- Clone your fork locally.
- Create a branch for your change: `git checkout -b feature/your-feature-name`
- Make your changes, commit, and push to your fork.
- Open a pull request from your fork's branch to this repo's `main`.
- Wait for review; maintainers will review and merge after approval.
This workflow ensures every change is reviewed before it reaches main. Repository branch protection rules enforce this.
Idea areas: adapters for nuScenes/KITTI, ROS2 bag-to-Parquet scripts, NIM prompt templates, benchmark results on different GPU configurations.
This project adheres to a code of conduct that all contributors are expected to follow.
- Be respectful and inclusive. Welcome newcomers and diverse perspectives.
- Focus on constructive feedback. Critique ideas, not people.
- No harassment, trolling, or discriminatory behavior.
- Help keep the community safe and productive for everyone.
Violations can be reported to the maintainers. We reserve the right to remove contributions or block users who do not follow these guidelines.
Apache 2.0



