Docker 101

Table of Contents

Why Docker for Data Science?
Core Concepts
Installation & Setup
Dockerfile for ML Projects
Working with Images & Containers
Data & Model Management
GPU Support (CUDA)
Multi-Stage Builds
Docker Compose for ML Services
Model Serving with Docker
Private Registries & Sharing
Best Practices
Common Pitfalls
Cheat Sheet

1. Why Docker for Data Science?

The Problem

Every data scientist knows this workflow:

# New laptop / colleague's machine / cloud VM
git clone https://github.com/team/project.git
pip install -r requirements.txt
# ... next morning: 47 conflicts, 3 broken packages, CUDA mismatch

You’re fighting environment drift: Python version, CUDA toolkit, system libraries, incompatible transitive deps, OS differences.

The Solution

Docker packages your entire environment — OS, system libraries, Python version, CUDA, pip packages, and code — into a single immutable unit called an image. Anyone can spin up an identical copy (a container) on any machine.

Concrete Benefits for AI/ML

Problem	Docker Solution
“It works on my machine”	Same image → identical behavior everywhere
CUDA/cuDNN version hell	Pre-built CUDA images with pinned versions
Jupyter setup per project	One-liner `docker run -p 8888:8888`
Model serving dependencies	Immutable deployment artifact
Team onboarding	Single `docker pull` instead of hours of setup
CI/CD for ML	Test and deploy in the exact same environment
Reproducible research	`Dockerfile` + `requirements.txt` = executable paper

2. Core Concepts

Images vs Containers

┌─────────────────────────────────────────────┐
│                  DOCKER                      │
│                                              │
│  ┌──────────────┐      ┌──────────────┐     │
│  │    IMAGE     │ run  │  CONTAINER   │     │
│  │  (blueprint) │─────▶│  (running)   │     │
│  │              │      │              │     │
│  │  Read-only   │      │  Read-write  │     │
│  │  filesystem  │      │  layer on    │     │
│  │  + metadata  │      │  top of img  │     │
│  └──────────────┘      └──────────────┘     │
│         ▲                       │            │
│         │ build                 │ commit     │
│         │                       ▼            │
│  ┌──────────────┐      ┌──────────────┐     │
│  │  Dockerfile  │      │ New Image    │     │
│  │  (recipe)    │      │ (persisted)  │     │
│  └──────────────┘      └──────────────┘     │
└─────────────────────────────────────────────┘

Image: Immutable snapshot (≈ 1-10 GB). Think of it as a VM template.
Container: A running instance of an image. You can have many containers from one image.
Dockerfile: The recipe to build an image. Text file with instructions.
Registry: Storage for images. Docker Hub is the default public registry.
Volume: Persistent data storage that outlives containers.
Layer: Instructions that change the filesystem (RUN, COPY, ADD) each create a cached layer; other instructions only add metadata. Unchanged layers are reused, which makes rebuilds fast.

The Layering System

FROM python:3.11-slim        # Layer 1: ~120 MB (cached)
RUN apt-get update && apt-get install -y ...  # Layer 2: ~50 MB (cached)
COPY requirements.txt .       # Layer 3: ~1 KB (cached if file unchanged)
RUN pip install -r req...     # Layer 4: ~200 MB (cached if req.txt unchanged)
COPY . /app                   # Layer 5: ~5 MB (invalidated on code change)

Docker caches each layer. If you only change code (Layer 5), only Layer 5 rebuilds. If you add a dependency, Layer 3 onward rebuilds. This makes development iteration fast.
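
The invalidation rule above can be modeled in a few lines of Python. This is only a toy model of the ordering, not Docker's actual cache implementation; the function name `layers_to_rebuild` and the layer labels are made up for illustration.

```python
def layers_to_rebuild(layers, changed):
    """Return the layers that would be rebuilt, in order.

    Toy model: the first layer whose input changed misses the cache,
    and every layer after it is rebuilt as well.
    """
    for index, name in enumerate(layers):
        if name in changed:
            return layers[index:]
    return []  # nothing changed: every layer is served from cache


# Labels mirror the five-instruction example above (illustrative names only).
LAYERS = [
    "FROM",
    "RUN apt-get",
    "COPY requirements.txt",
    "RUN pip install",
    "COPY . /app",
]

print(layers_to_rebuild(LAYERS, {"COPY . /app"}))            # code change
print(layers_to_rebuild(LAYERS, {"COPY requirements.txt"}))  # new dependency
```

The second call returns three layers, which is why dependency changes cost far more rebuild time than code changes.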

3. Installation & Setup

macOS

# Download Docker Desktop from https://www.docker.com/products/docker-desktop/
# Or use Homebrew:
brew install --cask docker
# Open Docker.app to start the daemon

Linux

# Ubuntu / Debian
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER  # Run docker without sudo (log out & back in)
newgrp docker                   # Or just use this to refresh group

# Verify
docker --version
docker run hello-world

Windows

Use Docker Desktop with WSL 2 backend. Install WSL 2 first, then Docker Desktop.

Post-Installation Check

docker info                     # System-wide info
docker version                  # Client + server versions
docker run --rm hello-world    # Verify it works

NVIDIA Container Toolkit (GPU support)

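# Debian/Ubuntu hosts with the NVIDIA driver already installed.
# Note: this nvidia-docker apt repository and apt-key are deprecated; check NVIDIA's
# current Container Toolkit install guide if these commands fail.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list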

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

# Verify GPU access
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

4. Dockerfile for ML Projects

Minimal Python ML Dockerfile

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "train.py"]

Build: docker build -t my-ml-project .

Production-Grade ML Dockerfile

# ============================================================
# Production ML Dockerfile with best practices
# ============================================================

# Use specific version tags — never "latest"
FROM python:3.11-slim AS base

# Prevent Python from writing .pyc files
ENV PYTHONDONTWRITEBYTECODE=1
# Ensure Python output is sent straight to terminal (no buffering)
ENV PYTHONUNBUFFERED=1

# Set working directory
WORKDIR /app

# Install system deps needed by many ML libraries
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    libgomp1 \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy only what's needed for pip install (layer caching optimization)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application
COPY . .

# Non-root user for security
RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app
USER appuser

# Metadata
LABEL org.opencontainers.image.source="https://github.com/org/repo"
LABEL org.opencontainers.image.description="ML training pipeline"

# Default command (overridable at runtime)
CMD ["python", "train.py"]

requirements.txt for ML

torch==2.1.0
transformers==4.35.0
datasets==2.14.5
scikit-learn==1.3.2
pandas==2.1.3
numpy==1.24.3
matplotlib==3.8.0
tqdm==4.66.1
wandb==0.15.11
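
Because reproducibility depends on every entry being pinned, a small check can flag loose requirements before a build. The following is a minimal sketch, assuming plain `name==version` lines; the helper name `unpinned_requirements` is hypothetical, and the regular expression does not cover every form pip accepts (URLs, environment markers, and so on).

```python
import re

# Matches "name==version", optionally with extras such as "pkg[extra]==1.0".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,-]+\])?==[^=\s]+$")


def unpinned_requirements(text):
    """Return requirement lines that are not pinned to an exact version."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems


sample = """\
torch==2.1.0
pandas>=2.0
numpy
tqdm==4.66.1  # progress bars
"""
print(unpinned_requirements(sample))
```

Running it in CI before `docker build` is one way to keep the file honest.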

Jupyter Dockerfile

FROM python:3.11-slim

WORKDIR /home/jovyan

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt jupyter

EXPOSE 8888

CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]

# Run Jupyter with port mapping and volume for persistence
docker build -t ml-jupyter .
docker run -it --rm \
  -p 8888:8888 \
  -v "$(pwd)/notebooks:/home/jovyan/notebooks" \
  ml-jupyter

5. Working with Images & Containers

Building Images

# Basic build
docker build -t my-ml-image .

# Build with tag
docker build -t my-ml-image:v1.0 .

# Build with no cache (fresh install)
docker build --no-cache -t my-ml-image .

# Build from different Dockerfile
docker build -f Dockerfile.gpu -t my-ml-image:gpu .

# Build with build args
docker build --build-arg CUDA_VERSION=12.1 -t my-ml-image .

Running Containers

# Basic run (foreground)
docker run my-ml-image

# Interactive shell
docker run -it my-ml-image /bin/bash

# Run in background (detached)
docker run -d --name ml-training my-ml-image

# Port mapping (host:container)
docker run -p 8888:8888 my-jupyter-image

# Mount a volume (host:container)
docker run -v /path/to/data:/app/data my-ml-image

# Mount with read-only
docker run -v /path/to/data:/app/data:ro my-ml-image

# Set environment variables
docker run -e WANDB_API_KEY=xxx -e CUDA_VISIBLE_DEVICES=0 my-ml-image

# Resource limits
docker run --memory=8g --cpus=4 my-ml-image

# Remove container after it exits (cleanup)
docker run --rm my-ml-image

# GPU access
docker run --gpus all my-ml-image

# All together (typical ML run)
docker run --rm --gpus all \
  -v /mnt/data:/data \
  -v "$(pwd)/checkpoints:/app/checkpoints" \
  -e WANDB_API_KEY="$WANDB_API_KEY" \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  --memory=32g --cpus=8 \
  my-ml-image:latest python train.py --config configs/exp1.yaml
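
When the same long command is launched repeatedly (for example in a sweep), it can help to assemble the argument list in Python. This is a sketch under stated assumptions: `docker_run_args` is a hypothetical helper, it only covers the flags used above, and nothing is executed unless you pass the list to `subprocess.run` yourself.

```python
import shlex


def docker_run_args(image, command=(), gpus=None, volumes=(), env=(), memory=None, cpus=None):
    """Assemble a `docker run` argument list (nothing is executed here)."""
    args = ["docker", "run", "--rm"]
    if gpus:
        args += ["--gpus", gpus]
    for host_path, container_path in volumes:
        args += ["-v", f"{host_path}:{container_path}"]
    for name in env:
        # Passing only the name makes Docker copy the value from the caller's environment.
        args += ["-e", name]
    if memory:
        args += ["--memory", memory]
    if cpus:
        args += ["--cpus", str(cpus)]
    return args + [image, *command]


args = docker_run_args(
    "my-ml-image:v1.0",
    command=["python", "train.py", "--config", "configs/exp1.yaml"],
    gpus="all",
    volumes=[("/mnt/data", "/data")],
    env=["WANDB_API_KEY"],
    memory="32g",
    cpus=8,
)
print(shlex.join(args))
# To launch it for real: subprocess.run(args, check=True)
```

Building a list rather than a shell string avoids quoting problems with paths that contain spaces.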

Managing Containers & Images

# List running containers
docker ps

# List all containers (including stopped)
docker ps -a

# List images
docker images

# Stop a container
docker stop <container_id_or_name>

# Remove a container
docker rm <container_id_or_name>

# Remove an image
docker rmi <image_id_or_name>

# Remove unused images, containers, networks (cleanup)
docker system prune -a

# View container logs
docker logs -f <container_name>    # -f follows output

# Execute command in running container
docker exec -it <container_name> /bin/bash

# Copy files from container
docker cp <container_name>:/app/outputs ./local_outputs

# Copy files into container
docker cp ./config.yaml <container_name>:/app/config.yaml

# Inspect container metadata
docker inspect <container_name>

# View resource usage
docker stats

6. Data & Model Management

The Volume Pattern

Containers are ephemeral. When they’re deleted, their filesystem disappears. Use volumes or bind mounts for:

Input datasets (read-only)
Model checkpoints / outputs (read-write)
Configuration files
Cached Hugging Face datasets and models

# Bind mount (host directory → container path)
docker run -v /absolute/host/path:/container/path ...

# Named volume (managed by Docker)
docker volume create ml-data
docker run -v ml-data:/app/data ...

# Anonymous volume (auto-created, auto-deleted with --rm)
docker run -v /app/data ...

Common Volume Layout for ML Projects

project/
├── data/                # input data → mounted read-only
│   ├── raw/
│   └── processed/
├── checkpoints/         # model weights → mounted read-write
├── configs/             # config files → mounted or baked in
├── outputs/             # logs, metrics, predictions
├── Dockerfile
└── train.py

docker run --rm --gpus all \
  -v /mnt/data/datasets:/data:ro \
  -v "$(pwd)/checkpoints:/app/checkpoints" \
  -v "$(pwd)/configs:/app/configs:ro" \
  -v "$(pwd)/outputs:/app/outputs" \
  my-training-image \
  python train.py --config /app/configs/exp1.yaml
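
Inside the container, the training script only sees the mounted paths. The sketch below shows one way a script might write results into the outputs mount so they survive the container. It assumes a hypothetical `OUTPUT_DIR` environment variable (set it to `/app/outputs` to match the mount above); the temp-directory fallback exists only so the sketch also runs outside a container.

```python
import json
import os
import tempfile
from pathlib import Path


def save_metrics(metrics, output_dir):
    """Write metrics.json into output_dir atomically and return its path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / "metrics.json"
    # Write to a temp file first so a crash never leaves a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=output_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(metrics, handle)
    os.replace(tmp_name, target)
    return target


# OUTPUT_DIR is an assumed variable name, not something Docker sets for you.
output_dir = os.environ.get("OUTPUT_DIR") or tempfile.mkdtemp()
path = save_metrics({"epoch": 1, "loss": 0.42}, output_dir)
print(path)
```

The same pattern applies to checkpoints: write under the mounted directory, never under an unmounted path inside the container.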

Hugging Face Cache Volumes

# HF datasets and models can be gigabytes — cache them permanently
docker run --rm \
  -v huggingface-cache:/root/.cache/huggingface \
  -v "$(pwd)/data:/data" \
  my-hf-image

Temporary Data with tmpfs

For data that doesn’t need persistence (intermediate artifacts), use tmpfs — stored in RAM, super fast:

docker run --rm --gpus all \
  --tmpfs /app/tmp:size=4G \
  my-training-image

7. GPU Support (CUDA)

Choosing a Base Image

NVIDIA provides official CUDA base images. Pick the right one for your ML framework:

# PyTorch: the official image already ships with the CUDA and cuDNN runtimes
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime

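# Or build on NVIDIA's official image (it ships without Python, so install pip first)
FROM nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN python3 -m pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121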

# TensorFlow
FROM tensorflow/tensorflow:2.13.0-gpu

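# JAX (the nvidia/cuda image ships without Python; the extra is cuda12_pip in JAX 0.4.x
# and cuda12 in newer releases, so check the JAX install guide for your version)
FROM nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN python3 -m pip install --no-cache-dir "jax[cuda12_pip]"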

Dockerfile with GPU Support

FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime

WORKDIR /app

# Install additional packages
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "train.py"]

Running GPU Containers

# Single GPU (any one)
docker run --gpus 1 pytorch-image nvidia-smi

# Specific GPUs by index
docker run --gpus '"device=0,1"' pytorch-image python train.py

# All GPUs
docker run --gpus all pytorch-image python train.py

Verify GPU Inside Container

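# Inside container: check_gpu.py
import torch

available = torch.cuda.is_available()
print(f"CUDA available: {available}")
print(f"GPU count: {torch.cuda.device_count()}")
# Only query the device name when a GPU is visible; otherwise this call raises.
if available:
    print(f"GPU name: {torch.cuda.get_device_name(0)}")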

docker run --rm --gpus all pytorch-image python check_gpu.py
# Output: CUDA available: True
#         GPU count: 2
#         GPU name: Tesla V100-SXM2-32GB

Common CUDA Issues

Problem	Cause	Fix
`CUDA error: no kernel image is available`	Framework build has no kernels for the GPU's compute capability	Use a framework/CUDA build that supports your GPU architecture
`libcuda.so not found`	`--gpus all` flag missing	Add `--gpus all`
`CUDA driver version insufficient`	Driver too old for image CUDA	Upgrade driver or downgrade image
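CUDA out of memory	Batch or model exceeds GPU memory, or the GPU is shared	Reduce batch size or pick a free GPU via `--gpus` / `CUDA_VISIBLE_DEVICES` (`--memory` only limits host RAM)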
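
The "driver version insufficient" row comes down to a version comparison: the highest CUDA version the host driver supports (the "CUDA Version" field in the `nvidia-smi` header) should be at least the CUDA version of the image. A minimal sketch of that rule follows; the function names are hypothetical, and the check deliberately ignores NVIDIA's minor-version and forward-compatibility mechanisms.

```python
def parse_version(text):
    """Turn '12.2' or '12.2.0' into a comparable (major, minor) tuple."""
    parts = text.strip().split(".")
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (int(parts[0]), minor)


def driver_supports(driver_cuda, image_cuda):
    """True if the driver's maximum CUDA version covers the image's CUDA version."""
    return parse_version(driver_cuda) >= parse_version(image_cuda)


print(driver_supports("12.2", "12.1"))  # driver newer than image
print(driver_supports("11.8", "12.1"))  # driver too old for image
```

When the second case applies, either upgrade the host driver or choose an image tag built for an older CUDA release.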

8. Multi-Stage Builds

Keep final images small by separating build-time and runtime dependencies.

ML Training Image (Smaller Final Image)

# ============================================================
# Stage 1: Builder — installs all build dependencies
# ============================================================
FROM python:3.11-slim AS builder

WORKDIR /app
COPY requirements.txt .

# Install build tools, compile packages
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && pip install --no-cache-dir --user -r requirements.txt

# ============================================================
# Stage 2: Runtime — minimal image with only what's needed
# ============================================================
FROM python:3.11-slim AS runtime

WORKDIR /app

# Copy only installed Python packages from builder
COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH

# Install runtime-only system packages before copying code (keeps this layer cached)
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Copy application code
COPY . .

CMD ["python", "train.py"]

Model Serving Image (Ultra-Small)

# ============================================================
# Stage 1: Build model artifacts
# ============================================================
FROM python:3.11-slim AS builder

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Optional: generate or compile model artifacts
COPY scripts/optimize_model.py .
RUN python optimize_model.py --output /app/optimized_model

# ============================================================
# Stage 2: Production serving
# ============================================================
FROM python:3.11-slim

WORKDIR /app

COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH

# Copy only what's needed for serving
COPY --from=builder /app/optimized_model ./model
COPY serve.py .

EXPOSE 8000

CMD ["python", "serve.py"]

9. Docker Compose for ML Services

Docker Compose orchestrates multi-container setups. Perfect for ML pipelines that need:

Training service (GPU)
Database (PostgreSQL for experiment tracking)
Model registry (MinIO for artifact storage)
Monitoring (Grafana + Prometheus)
Jupyter (for exploration)

docker-compose.yml for ML Pipeline

version: "3.8"  # optional: Docker Compose v2 treats this field as obsolete

services:
  # ---- ML Training Service ----
  trainer:
    build:
      context: .
      dockerfile: Dockerfile.gpu
    image: ml-trainer:latest
    # GPU access is requested via deploy.resources.reservations.devices below
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - WANDB_API_KEY=${WANDB_API_KEY}
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    volumes:
      - ./data:/data:ro
      - ./checkpoints:/app/checkpoints
      - ./configs:/app/configs:ro
      - ./outputs:/app/outputs
    depends_on:
      - mlflow
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  # ---- MLflow Tracking Server ----
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.7.1
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow
    command: >
      mlflow server
      --host 0.0.0.0
      --backend-store-uri sqlite:////mlflow/mlflow.db
      --default-artifact-root /mlflow/artifacts

  # ---- Model Serving API ----
  api:
    build:
      context: .
      dockerfile: Dockerfile.serve
    image: ml-api:latest
    ports:
      - "8000"  # container port only; a fixed host port would collide across replicas
    volumes:
      - ./checkpoints:/app/checkpoints:ro
    environment:
      - MODEL_PATH=/app/checkpoints/best.pt
    depends_on:
      trainer:
        condition: service_completed_successfully
    deploy:
      replicas: 3  # Scale horizontally

  # ---- Jupyter Notebook ----
  jupyter:
    image: jupyter/datascience-notebook:latest
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work
      - ./data:/home/jovyan/data:ro
    environment:
      - JUPYTER_TOKEN=changeme

  # ---- MinIO (S3-compatible storage for artifacts) ----
  minio:
    image: minio/minio:latest
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio-data:/data
    command: server /data --console-address ":9001"

volumes:
  mlflow-artifacts:
  minio-data:
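
In the Compose file above, the short form of `depends_on` used by the trainer only orders container startup; it does not wait until the MLflow server actually accepts connections. A minimal sketch of a readiness poll the trainer could run first is shown here. The helper name `wait_for_port` is hypothetical, and a TCP connect is only a rough proxy for "service ready".

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until host:port accepts a TCP connection; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Name-resolution failures and refused connections are both OSError.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

In the trainer, a call such as `wait_for_port("mlflow", 5000)` before the first tracking request would use the service name and port from the Compose file. A Compose `healthcheck` combined with `condition: service_healthy` is an alternative that keeps this logic out of application code.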

Running Compose

# Start all services
docker compose up -d

# Start only specific services
docker compose up -d trainer mlflow

# Rebuild and start
docker compose up --build -d

# View logs
docker compose logs -f trainer

# Stop all
docker compose down

# Stop and remove volumes (careful!)
docker compose down -v

# Scale a service (it must not publish a fixed host port, or the replicas will collide)
docker compose up -d --scale api=3

Compose for Development

# docker-compose.dev.yml — overrides for development
version: "3.8"  # optional: Docker Compose v2 treats this field as obsolete
services:
  trainer:
    build:
      context: .
      dockerfile: Dockerfile.dev
    volumes:
      - .:/app  # Hot-reload: mount source code
    command: python -m debugpy --listen 0.0.0.0:5678 train.py
    ports:
      - "5678:5678"  # Debugger port

# Combine configs
docker compose -f docker-compose.yml -f docker-compose.dev.yml up -d

10. Model Serving with Docker

FastAPI Model Server

FROM python:3.11-slim

WORKDIR /app

COPY requirements-serve.txt .
RUN pip install --no-cache-dir -r requirements-serve.txt

COPY model.py .
COPY serve.py .

EXPOSE 8000

# Use gunicorn + uvicorn for production
CMD ["gunicorn", "-k", "uvicorn.workers.UvicornWorker", "serve:app", "--bind", "0.0.0.0:8000", "--workers", "4"]

# serve.py
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from model import load_model

app = FastAPI()
model = load_model()

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    confidence: float

# Plain def: FastAPI runs it in a thread pool, so blocking inference does not stall the event loop
@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    result = model.predict(req.text)
    return PredictResponse(label=result["label"], confidence=result["confidence"])

@app.get("/health")
async def health():
    return {"status": "ok"}
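
serve.py imports `load_model` from a `model.py` that is not shown. To try the container end to end before real weights exist, a stand-in like the following could be used. Everything in it is hypothetical: the class name, the keyword lists, and the scoring rule are placeholders, chosen only to return the `label` and `confidence` keys that serve.py reads.

```python
# model.py: hypothetical stand-in so serve.py can start without real weights.
class KeywordSentimentModel:
    """Toy classifier: counts positive and negative keywords."""

    POSITIVE = {"amazing", "great", "good", "love"}
    NEGATIVE = {"awful", "bad", "boring", "hate"}

    def predict(self, text):
        words = {word.strip(".,!?").lower() for word in text.split()}
        pos = len(words & self.POSITIVE)
        neg = len(words & self.NEGATIVE)
        if pos == neg:
            return {"label": "neutral", "confidence": 0.5}
        label = "positive" if pos > neg else "negative"
        confidence = max(pos, neg) / (pos + neg)
        return {"label": label, "confidence": confidence}


def load_model():
    """Return an object exposing predict(text) -> {"label", "confidence"}."""
    return KeywordSentimentModel()


print(load_model().predict("This movie was amazing!"))
```

A real implementation would load weights from the path the container is given (for example a mounted checkpoint) inside `load_model`.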

# Build and run
docker build -t model-api -f Dockerfile.serve .
docker run --rm -p 8000:8000 model-api

# Test
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "This movie was amazing!"}'

ONNX Runtime GPU Serving

FROM mcr.microsoft.com/onnxruntime/server:latest-gpu

WORKDIR /app
COPY model.onnx .
COPY serve.py .

EXPOSE 8001
CMD ["python", "serve.py"]

Triton Inference Server

For large-scale deployments, NVIDIA Triton supports multiple frameworks, dynamic batching, and model ensembles:

# Pull Triton image
docker pull nvcr.io/nvidia/tritonserver:23.10-py3

# Run with model repository
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repo:/models \
  nvcr.io/nvidia/tritonserver:23.10-py3 \
  tritonserver --model-repository=/models

11. Private Registries & Sharing

Pushing to Docker Hub

docker login -u yourusername
docker tag my-ml-image:latest yourusername/my-ml-image:v1.0
docker push yourusername/my-ml-image:v1.0

Pushing to a Private Registry

docker tag my-ml-image:latest registry.example.com/team/my-ml-image:v1.0
docker push registry.example.com/team/my-ml-image:v1.0

Sharing with Colleagues

# Save image to a tar file
docker save my-ml-image:latest | gzip > my-ml-image.tar.gz
# Transfer via SCP, S3, etc.

# Load on another machine
gunzip -c my-ml-image.tar.gz | docker load

12. Best Practices

Dockerfile Best Practices

Pin all versions — never use latest

# Bad
FROM python:latest
RUN pip install torch

# Good
FROM python:3.11-slim
RUN pip install torch==2.1.0

Optimize layer caching — copy requirements.txt before source code

COPY requirements.txt .
RUN pip install -r requirements.txt
# Code changes don't invalidate the pip layer
COPY . .

Minimize layers — combine related RUN commands

RUN apt-get update && apt-get install -y \
    pkg1 pkg2 \
    && rm -rf /var/lib/apt/lists/*

Use .dockerignore to exclude unnecessary files

__pycache__/
.git/
.env
*.pyc
.ipynb_checkpoints/
# Don't copy large datasets into images
data/
# Don't copy model weights into images
checkpoints/
*.tar.gz
notebooks/

Run as non-root user for security

RUN useradd -m -u 1000 appuser
USER appuser

Keep images small — use -slim variants, multi-stage builds

docker images | grep my-ml
# python:3.11     → ~900 MB
# python:3.11-slim → ~120 MB

Runtime Best Practices

Always use --rm for disposable containers to avoid accumulation
Mount data as volumes — never copy datasets into images
Use environment variables for secrets (API keys, etc.)
Set resource limits — especially for shared GPU servers
Log to stdout/stderr — Docker captures these automatically
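
For the last point, a training script can route all of its logging to stdout with the standard library. This is a minimal sketch; the function name `configure_logging` and the log format are arbitrary choices, not a required convention.

```python
import logging
import sys


def configure_logging(level=logging.INFO):
    """Send all log records to stdout so `docker logs` can show them."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    root = logging.getLogger()
    root.handlers.clear()  # avoid duplicate lines if called twice
    root.addHandler(handler)
    root.setLevel(level)
    return root


configure_logging()
logging.getLogger("train").info("epoch=1 loss=0.42")
```

Combined with `PYTHONUNBUFFERED=1`, lines appear in `docker logs -f` as soon as they are emitted.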

Security Best Practices

Never bake secrets into images

# Bad: secret baked into image
ENV WANDB_API_KEY=abc123

# Good: passed at runtime
docker run -e WANDB_API_KEY=$WANDB_API_KEY ...
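
On the application side, it helps to fail fast when a runtime secret is missing rather than crash deep inside training. A minimal sketch, assuming the key arrives as an environment variable as in the command above; `require_env` is a hypothetical helper name.

```python
import os


def require_env(name, env=None):
    """Return a required setting from the environment or fail with a clear message."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; pass it with `docker run -e {name}`")
    return value


# Demo with an explicit mapping so the sketch runs without real credentials.
print(require_env("WANDB_API_KEY", {"WANDB_API_KEY": "dummy-key"}))
```

Calling it once at startup keeps the secret out of the image and makes a forgotten `-e` flag obvious.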

Use Docker Scout or Trivy to scan images for vulnerabilities

docker scout quickview my-ml-image
docker scout recommendations my-ml-image

13. Common Pitfalls

Disk Space Bloat

# Check disk usage
docker system df

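# Clean unused images, containers, networks, and volumes
# (careful: depending on the Docker version, --volumes may also delete unused named
# volumes such as dataset or model caches)
docker system prune -a --volumes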

Prevention: Use .dockerignore, --no-cache-dir in pip, multi-stage builds.

Permissions Issues with Volumes

Symptom: Files created by container owned by root, can’t delete them from host.

Fix: Match UID inside container to host user.

ARG UID=1000
RUN useradd -m -u $UID appuser
USER appuser

docker build --build-arg UID=$(id -u) -t my-ml-image .
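
A quick check at container startup can surface a UID mismatch before hours of training are lost to a failed checkpoint write. The sketch below is one possible approach for Linux containers; `check_writable` is a hypothetical helper, and the temp directory in the demo stands in for a mounted path such as the checkpoints directory.

```python
import os
import tempfile


def check_writable(directory):
    """Return (ok, message) describing whether this process can write to directory."""
    if not os.path.isdir(directory):
        return False, f"{directory} does not exist (is the volume mounted?)"
    if not os.access(directory, os.W_OK | os.X_OK):
        owner = os.stat(directory).st_uid
        return False, (
            f"{directory} is owned by UID {owner}, "
            f"but this process runs as UID {os.getuid()}"
        )
    return True, f"{directory} is writable"


# Demo against the system temp directory; in a container, pass the mount path instead.
ok, message = check_writable(tempfile.gettempdir())
print(message)
```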

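Networking Between Containers

Issue: On the default bridge network, containers cannot resolve each other by name.

Fix: Attach containers to a user-defined network (docker network create), or use Docker Compose, which creates one automatically. Reserve --network host for cases where bridge overhead matters, since it removes network isolation.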

Container Name Conflicts

# Error: conflict — container name already in use
docker run --name trainer my-ml-image

# Fix: remove existing or use --rm
docker rm trainer
# or
docker run --rm --name trainer my-ml-image

14. Cheat Sheet

Quick Reference

# Build
docker build -t name:tag .                    # Build image
docker build --no-cache -t name:tag .         # Fresh build
docker build --build-arg VAR=val -t name .    # Build with args

# Run
docker run image                              # Run in foreground
docker run -d --name name image               # Run in background
docker run --rm image                         # Auto-remove on exit
docker run -it image /bin/bash                # Interactive shell
docker run -p 8080:80 image                   # Port mapping
docker run -v /host:/container image          # Volume mount
docker run --gpus all image                   # GPU access
docker run -e KEY=val image                   # Environment var

# Manage
docker ps                                     # Running containers
docker ps -a                                  # All containers
docker images                                 # List images
docker stop name                              # Stop container
docker rm name                                # Remove container
docker rmi image                              # Remove image
docker logs -f name                           # Follow logs
docker exec -it name /bin/bash                # Enter running container
docker cp name:/path ./                       # Copy from container
docker system prune -a                        # Clean everything

# Compose
docker compose up -d                          # Start services
docker compose down                           # Stop services
docker compose logs -f service                # Follow service logs
docker compose up --build -d                  # Rebuild and start
docker compose -f prod.yml up -d              # Use alternate compose file

# GPU
docker run --gpus all image nvidia-smi        # Verify GPU
docker run --gpus '"device=0"' image          # Specific GPU

# Registry
docker login                                  # Log in to registry
docker tag src dest                           # Tag image
docker push image                             # Push to registry
docker pull image                             # Pull from registry
docker save image | gzip > file.tar.gz        # Export image
gunzip -c file.tar.gz | docker load           # Import image

# Info
docker version                                # Version info
docker info                                   # System info
docker stats                                  # Live resource usage
docker inspect name                           # Container details
docker system df                              # Disk usage

Common ML Image References

Image	Size	Use Case
`python:3.11-slim`	~120 MB	Minimal base
`pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime`	~7 GB	PyTorch GPU training
`tensorflow/tensorflow:2.13.0-gpu`	~4 GB	TensorFlow GPU training
`nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04`	~2 GB	Custom CUDA setup
`jupyter/datascience-notebook:latest`	~3 GB	Jupyter for data science
`nvcr.io/nvidia/tritonserver:23.10-py3`	~8 GB	Production model serving
`mcr.microsoft.com/onnxruntime/server:latest-gpu`	~2 GB	ONNX inference

Quick Start Template

# 1. Create project structure
mkdir ml-docker-project && cd ml-docker-project
mkdir data checkpoints configs outputs notebooks

# 2. Create requirements.txt
cat > requirements.txt << 'EOF'
torch==2.1.0
transformers==4.35.0
scikit-learn==1.3.2
pandas==2.1.3
numpy==1.24.3
tqdm==4.66.1
EOF

# 3. Create Dockerfile
cat > Dockerfile << 'EOF'
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "train.py"]
EOF

# 4. Create .dockerignore
cat > .dockerignore << 'EOF'
__pycache__/
.git/
*.pyc
data/
checkpoints/
notebooks/
EOF

# 5. Build (add your train.py to the project root first)
docker build -t ml-project .

# 6. Run with GPU and data mounts
docker run --rm --gpus all \
  -v "$(pwd)/data:/data:ro" \
  -v "$(pwd)/checkpoints:/app/checkpoints" \
  -v "$(pwd)/configs:/app/configs:ro" \
  ml-project

Pro Tip: Commit your Dockerfile, docker-compose.yml, and .dockerignore to version control alongside your code. This makes your entire AI project fully reproducible with a single command: docker compose up --build.
