Skip to content

TheRealFREDP3D/Ollama-Colab

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Ollama Colab v2.5

Run Ollama (a tool for running large language models locally) on Google Colab with a public HTTPS endpoint via Cloudflare Tunnel or ngrok.

Built as a self-teaching project exploring LLM inference, tunneling, and remote API access.


What it does

  • Installs Ollama on a Colab GPU instance
  • Lets you browse and select any model from the Ollama library
  • Exposes the Ollama API over a public HTTPS tunnel
  • Supports the standard Ollama API and the Anthropic-compatible /v1/messages endpoint (Ollama v0.14.0+)

Quick Start

  1. Open the notebook in Google Colab
  2. Set GPU runtime: Runtime → Change runtime type → T4 GPU → Save
  3. Run Runtime → Run all, or execute each cell in order (Steps 1–10)
  4. Copy the public URL printed in Step 8
  5. Test with curl, Python, or a VS Code extension

Notebook Structure

Step What it does
1 — Install Dependencies Checks Python/GPU, installs missing packages, initialises shared globals
2 — Model Browser Interactive widget to browse/filter/select Ollama models and tags
3 — Configure Environment Memory management policy, tunnel type, loads Colab Secrets
4 — GPU Monitoring One-shot GPU snapshot; monitor_gpu() available for live stats
5 — Install Ollama Downloads and installs the Ollama binary via the official install script
6 — Pull Model Downloads the selected model with streamed progress output
7 — Start Ollama Server Launches ollama serve in the background, waits for the API to respond
8 — Start Tunnel Creates the public HTTPS endpoint (Cloudflare quick/named or ngrok)
9 — Test Endpoint Sends a configurable test prompt and prints the response
10 — Keep Server Alive Blocking loop; stops cell (■) to shut down cleanly

Tunnel Options

Option Credentials needed URL stability
Cloudflare quick tunnel None Changes every run
Cloudflare named tunnel cloudflare_token secret Fixed hostname
ngrok ngrok_authtoken secret Rotates every ~2 hours (free tier)

Add secrets via the 🔑 sidebar in Colab before running Step 3.

Secret name Where to get it
cloudflare_token Cloudflare dashboard → Networks → Tunnels → Create tunnel
ngrok_authtoken ngrok dashboard

API Usage

Replace https://your-url with the URL printed in Step 8.

curl

# Chat
curl -X POST https://your-url/api/chat \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen2.5:14b", "messages": [{"role": "user", "content": "Explain recursion"}]}'

# Anthropic-compatible endpoint (Ollama v0.14.0+)
curl -X POST https://your-url/v1/messages \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen2.5:14b", "max_tokens": 1024, "messages": [{"role": "user", "content": "Hello!"}]}'

Python

import requests

response = requests.post(
    'https://your-url/api/chat',
    json={
        'model':    'qwen2.5:14b',
        'messages': [{'role': 'user', 'content': 'Write a Flask API'}],
        'stream':   False,
    }
)
print(response.json()['message']['content'])

Structured Output

schema = {
    'type': 'object',
    'properties': {
        'name':   {'type': 'string'},
        'age':    {'type': 'number'},
        'skills': {'type': 'array', 'items': {'type': 'string'}},
    },
    'required': ['name', 'age'],
}

response = requests.post(
    'https://your-url/api/generate',
    json={'model': 'qwen2.5:14b', 'prompt': 'Generate a developer profile',
          'format': schema, 'stream': False},
)

Security

⚠️ Your Ollama instance is publicly accessible without authentication.

  • Use for short testing sessions only
  • Monitor GPU usage regularly in the Colab UI
  • Do not send sensitive or personal data through the public endpoint
  • For production use, add authentication (API key, OAuth, or a reverse proxy)

Emergency shutdown

for name, proc in _bg_processes.items():
    try:
        proc.terminate()
        print(f'Terminated: {name}')
    except Exception:
        pass

Colab Limits

Tier Session length GPU
Free A few hours T4 (16 GB VRAM)
Pro Up to 24 hours L4 (24 GB VRAM)
Pro+ Up to 24 hours A100 (40 GB VRAM)

Recommended Models by GPU

Model Size on disk Fits on T4?
qwen2.5:7b ~4 GB ✅ Comfortable
qwen2.5-coder:14b ~9 GB ✅ Comfortable
deepseek-r1:14b ~9 GB ✅ Comfortable
llama3.1:8b ~5 GB ✅ Comfortable
qwen2.5:32b ~20 GB ⚠️ May exceed T4 VRAM
deepseek-r1:32b ~20 GB ⚠️ May exceed T4 VRAM

Changelog

v2.5

  • Fixed: duplicate model browser definitions removed
  • Fixed: missing ollama serve step added (Step 7)
  • Fixed: missing Ollama install step added (Step 5)
  • Fixed: security advisory was in a code cell (now markdown only)
  • Fixed: _bg_processes and globals now initialised in Step 1 to prevent NameError on out-of-order execution
  • Fixed: GPU monitor HTML table now built as a single string before display() call
  • Fixed: step numbers now match the Table of Contents (1–10, no gaps or duplicates)
  • Improved: memory config widget auto-applies on change and persists to os.environ
  • Improved: type hints and docstrings throughout
  • Improved: check_model_memory_safety uses a single dict lookup instead of nested if-chains

v2.4

  • Initial public release
  • Interactive model browser with Ollama library scraping and static fallback
  • Cloudflare quick tunnel, named tunnel, and ngrok support
  • GPU monitoring widget
  • Memory management policy widget

Resources

About

Run Ollama on Google Colab with a public HTTPS endpoint via Cloudflare Tunnel or ngrok.

Topics

Resources

Stars

Watchers

Forks

Contributors