Caching Patterns for Tiny ML on SBCs: Memory, Disk, and Swap Tuning for the Raspberry Pi 5

cached
2026-02-14

Practical, system-level tuning for Tiny ML on Raspberry Pi 5: zram swap, tmpfs model caches, lazy weight streaming, and Python GC tips to cut latency.

When your Raspberry Pi 5 chokes on a 100MB model, it's not the chip, it's the cache

Running Tiny ML on single-board computers (SBCs) like the Raspberry Pi 5 in 2026 is normal for production edge inference — but the real friction comes from memory pressure, slow storage, GC pauses and unpredictable swap behavior. This deep-dive gives system-level and runtime recipes you can apply today: swap strategies, tmpfs model caching, lazy-loading/streaming weights, and Python GC and allocator tuning so your small-board ML runs predictably under load.

Executive summary — what to do first

  • Use zram as compressed swap, tune vm.swappiness low (5–10), and prefer zram over SD/eMMC swap to reduce wear and improve latency.
  • Mount a tmpfs for hot models (e.g., /var/tmp/models) and populate it at boot with a systemd service to cut model load time 3–10× versus SD cards.
  • Adopt streaming / memory-mapped weight loading (mmap, NumPy memmap, safetensors/mmap) and shard large models so you only touch layers you need.
  • Tune CPython GC and glibc malloc arena settings to reduce fragmentation and pause jitter during inference.
  • For fleets, serve weights with proper HTTP caching / Range support so devices can stream and reuse partial downloads; use Varnish/Redis at the gateway if you manage many SBCs.

The 2026 context — why this matters now

By late 2025 and into 2026 the SBC ecosystem matured: Pi 5 hardware plus low-cost NPUs (AI HATs) and better quantized toolchains (4-bit quantization, WebNN and optimized ONNX/TFLite backends) mean more models fit the edge. But model sizes are not shrinking as quickly — generative models and multistream sensor fusion still push working set sizes beyond the physical RAM on many Pi configurations (4–8GB). That makes caching and predictable memory management essential.

  • More devices use on-board NPUs or USB/PCIe accelerators; host memory still matters for pre- and post-processing.
  • Transfer optimizations (HTTP/2, QUIC) and server-side caching (Varnish, CDNs) let fleets stream weights on-demand rather than store everything locally.
  • Compressed swap (zram) and memory-mapped weight formats (safetensors, memmap) became widely supported in inference runtimes by 2025.

Part 1 — Swap strategy: zram, swappiness, and SSD vs SD tradeoffs

Swap is a safety net, not a performance feature. On SBCs you have three practical swap choices:

  1. No swap — fast but immediate OOM risk.
  2. Disk swap (SD/eMMC/SSD) — persists but slow and wears flash.
  3. zram (compressed RAM swap) — fast and avoids flash wear; recommended as primary swap for Pi 5 setups.

Install the generator and add a simple config. This uses RAM compression (lz4) and avoids SD wear. Adjust size to your memory — keep it below half of physical RAM for headroom.

# Install (Debian/Raspberry Pi OS)
sudo apt update && sudo apt install systemd-zram-generator

# Create /etc/systemd/zram-generator.conf
# Example: 2GB zram for a 4GB Pi (zram-size is evaluated in megabytes)
sudo tee /etc/systemd/zram-generator.conf > /dev/null <<'EOF'
[zram0]
zram-size = 2048
compression-algorithm = lz4
swap-priority = 100
EOF

# Reload units and start the generated swap (or simply reboot)
sudo systemctl daemon-reload
sudo systemctl start systemd-zram-setup@zram0.service
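
After a reboot (or after starting the unit), confirm that zram actually came up as the highest-priority swap device. Here is a tiny check in Python that reads the plain-text table in /proc/swaps (the script name is just a suggestion):

# verify_swap.py: list active swap devices with size, usage and priority
with open("/proc/swaps") as f:
    rows = f.read().splitlines()[1:]  # first line is the column header
for row in rows:
    name, _type, size_kb, used_kb, prio = row.split()
    print(f"{name}: {int(size_kb) // 1024} MiB total, {int(used_kb) // 1024} MiB used, priority {prio}")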

Swappiness and kernel knobs

Tune swappiness low so the kernel prefers reclaiming page cache over swapping out anonymous memory, which keeps inference working sets resident:

# Apply immediately
sudo sysctl -w vm.swappiness=10
# Make persistent across reboots
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf

# Lower vfs_cache_pressure so dentry/inode caches are reclaimed less aggressively
sudo sysctl -w vm.vfs_cache_pressure=50
echo 'vm.vfs_cache_pressure=50' | sudo tee -a /etc/sysctl.conf

Sizing rules (simple)

  • 4GB Pi: zram 1.5–2GB
  • 8GB Pi: zram 2–4GB
  • If you have an NVMe/SSD, add a small swapfile on the SSD as cold overflow with a lower swap priority than the zram device, but keep zram as primary. For SSD and NAND wear considerations see flash and caching strategies.

Part 2 — tmpfs for model caching: mount, populate, and persist

Mounting a tmpfs for hot models drastically reduces model load latency and avoids repeated read latency from SD cards. Strategy: keep a small, curated set of quantized models in tmpfs that your app uses frequently.

Mount and auto-populate a tmpfs on boot

Use a systemd service to copy artifacts from persistent storage (e.g., /srv/models) into a tmpfs (/var/tmp/models) at boot. That keeps your persistent storage unchanged and lets you rebuild the cache at every boot or on update. See practical local-first edge tooling for device-level workflows in local-first edge tools.

# /etc/fstab entry (required so the cache actually lives in RAM)
tmpfs /var/tmp/models tmpfs rw,nodev,nosuid,size=2G 0 0

# systemd service to populate tmpfs: /etc/systemd/system/modelcache-populate.service
sudo tee /etc/systemd/system/modelcache-populate.service > /dev/null <<'EOF'
[Unit]
Description=Populate model tmpfs
RequiresMountsFor=/var/tmp/models

[Service]
Type=oneshot
ExecStart=/usr/local/bin/populate-model-cache.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

# /usr/local/bin/populate-model-cache.sh
#!/bin/bash
set -e
mkdir -p /var/tmp/models
# abort if the tmpfs is not mounted; otherwise the copy would land on the SD card
mountpoint -q /var/tmp/models
# copy only the quantized artifacts configured for your workload
cp /srv/models/*.tflite /var/tmp/models/ || true
cp /srv/models/*.onnx /var/tmp/models/ || true
chown -R pi:pi /var/tmp/models

# enable the service
sudo chmod +x /usr/local/bin/populate-model-cache.sh
sudo systemctl daemon-reload
sudo systemctl enable --now modelcache-populate.service

Eviction policies and size limits

  • Keep tmpfs size conservative (20–40% of RAM) so the kernel still has headroom for inference allocations.
  • Use systemd timers or application signals to refresh or evict models during off-peak windows; a minimal eviction sketch follows below.
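
To make the size cap concrete, here is a minimal LRU-style eviction pass you could run from a systemd timer or an application hook. The cache path, the 1 GiB cap, and the use of access time as the recency signal are assumptions for this sketch, not part of any standard tool:

from pathlib import Path

CACHE_DIR = Path("/var/tmp/models")   # tmpfs mount from the populate service above
MAX_BYTES = 1 * 1024**3               # example cap; keep it well under the tmpfs size

def evict_lru(cache_dir=CACHE_DIR, max_bytes=MAX_BYTES):
    """Delete least-recently-accessed model files until the cache fits the cap."""
    files = [p for p in cache_dir.iterdir() if p.is_file()]
    total = sum(p.stat().st_size for p in files)
    # oldest access time first, so frequently used models stay resident
    for path in sorted(files, key=lambda p: p.stat().st_atime):
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()

if __name__ == "__main__":
    evict_lru()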

Part 3 — Lazy-loading and streaming weights

Lazy-loading avoids allocating the whole model at once. There are three practical approaches that work on Pi 5 in 2026:

  1. Memory-map the model file (mmap or numpy.memmap) so the OS loads pages on demand.
  2. Shard model files by layer or block and load only needed shards.
  3. Stream via HTTP Range requests from a nearby server or CDN that supports partial fetches.

Memory-map example (Python, generic)

For TFLite or custom binary tensors, use mmap to back the model without reading it entirely into RAM. This pattern keeps resident memory low:

import mmap
import os

path = '/var/tmp/models/large_model.bin'
with open(path, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # mm behaves like a bytes-like object; pass it to your runtime if supported
    # For custom loaders you can parse headers and mmap slices for tensors
    header = mm[:128]
    # When done
    mm.close()

NumPy memmap for large arrays

import numpy as np
arr = np.memmap('/var/tmp/models/weights.dat', dtype='float32', mode='r', shape=(1000000,))
# Access slices without full load
slice0 = arr[0:10000]

Streaming via HTTP Range (server-side: enable Range and caching)

On the server that serves model files, ensure Range requests and caching headers are enabled. This lets devices fetch only the layers they need and reuse cached ranges. Server-side caching and edge strategies overlap with fleet patterns explained in fleet and edge content strategies.

# Example curl partial fetch
curl -H "Range: bytes=1048576-2097151" https://models.example.local/large_model.safetensors -o part.bin

Sharding and lazy state_dict loading (PyTorch pattern)

Shard checkpoints into per-layer files and load them on first use. Many model conversion pipelines (2024–2026 toolchain) support exporting per-layer artifacts for better streaming.
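
A minimal sketch of the lazy-shard pattern, assuming the checkpoint has already been split into one small state_dict file per layer; the shard directory, the naming scheme, and the LazyShardedWeights class are illustrative, not a standard PyTorch API:

from pathlib import Path
import torch

SHARD_DIR = Path("/var/tmp/models/resnet_shards")  # hypothetical per-layer export

class LazyShardedWeights:
    """Load per-layer checkpoint shards on first access and cache them."""
    def __init__(self, shard_dir):
        self.shard_dir = shard_dir
        self._cache = {}

    def load_layer(self, layer_name):
        # each shard is a small state_dict saved with torch.save(), e.g. "conv1.pt"
        if layer_name not in self._cache:
            shard_path = self.shard_dir / f"{layer_name}.pt"
            self._cache[layer_name] = torch.load(shard_path, map_location="cpu")
        return self._cache[layer_name]

# Usage: materialize only the layers the current request needs, e.g.
# weights = LazyShardedWeights(SHARD_DIR)
# model.conv1.load_state_dict(weights.load_layer("conv1"))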

Part 4 — GC and allocator tuning for Python runtimes

Python GC and glibc memory arenas cause unpredictable latency. Tune them for a steady inference pipeline.

1) CPython GC: reduce pause jitter

Defaults are biased toward interactive workloads and can trigger collections at bad times during inference. Increase thresholds and perform explicit collections at safe points (e.g., after batch processing).

import gc

# Increase thresholds so automatic collections run less often
gc.set_threshold(2000, 20, 20)

# Run a manual collection at a safe point, e.g. after every N inferences
INFERENCES_PER_COLLECT = 100
_inference_count = 0

def maybe_collect():
    global _inference_count
    _inference_count += 1
    if _inference_count % INFERENCES_PER_COLLECT == 0:
        gc.collect()

2) MALLOC_ARENA_MAX and glibc fragmentation

Reduce the number of glibc malloc arenas to limit fragmentation and address-space bloat; multi-threaded inference processes otherwise create many arenas. Benchmark afterwards, since fewer arenas can increase lock contention:

export MALLOC_ARENA_MAX=1
# Put in /etc/environment for system-wide effect
echo 'MALLOC_ARENA_MAX=1' | sudo tee -a /etc/environment

3) Use a specialized allocator when beneficial

jemalloc or mimalloc can reduce fragmentation for long-running services. Test before deploying — gains depend on workload and build complexity. On Pi 5, prebuilt binaries for mimalloc are lightweight and often drop-in with LD_PRELOAD.

4) Runtime flags and interpreter choices

  • Set PYTHONDONTWRITEBYTECODE=1 to avoid .pyc churn.
  • Disable debug builds of Python in production; they use more memory.
  • Consider PyPy for long-running CPU-heavy pipelines where tracing JIT helps, but test NPU bindings and native libs for compatibility.

Part 5 — Deployment recipes, benchmarks and troubleshooting

Bootstrap recipe: fast checklist

  1. Enable zram with size ~= 25–50% RAM and lz4 compression.
  2. Set vm.swappiness=5–10 and vm.vfs_cache_pressure=50.
  3. Mount tmpfs for hot models (size 20–40% RAM) and populate at boot.
  4. Export MALLOC_ARENA_MAX=1 and PYTHONDONTWRITEBYTECODE=1.
  5. Use mmap/safetensors/memmap strategies for weight access; shard large models.
  6. Tune GC: gc.set_threshold higher and schedule explicit gc.collect during idle windows.

Benchmark expectations (realistic)

  • Model cold load from SD: 800ms–4s (depends on model size and SD speed)
  • Model load from tmpfs: 50ms–500ms — commonly 3–10× faster than SD
  • Swapping to zram keeps the system alive under memory pressure but is ~2–4× slower than direct RAM access; it is still far faster than SD swap thanks to compression and reduced I/O.
  • GC tuning reduces 95th-percentile latency spikes; measure before/after with perf or py-spy, or with the small timing harness sketched below.
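
A minimal load-time harness to collect those before/after numbers, assuming the model path below exists on your device; it only times whatever loader you drop into load_once, nothing more:

import statistics
import time

MODEL_PATH = "/var/tmp/models/model.tflite"  # hypothetical hot model in tmpfs

def load_once(path):
    # stand-in loader: replace with your real interpreter/session construction
    with open(path, "rb") as f:
        return f.read()

def benchmark(path, runs=20):
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        load_once(path)
        samples.append((time.perf_counter() - start) * 1000.0)
    # note: after the first iteration the file sits in the page cache, so later
    # runs measure the warm path; drop caches between runs to measure cold loads
    p95 = statistics.quantiles(samples, n=20)[18]  # 95th percentile
    print(f"median={statistics.median(samples):.1f}ms p95={p95:.1f}ms")

if __name__ == "__main__":
    benchmark(MODEL_PATH)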

Troubleshooting checklist

  • If you see OOM kills: check dmesg for OOM killer reasons, increase zram swap or reduce tmpfs size.
  • If model loads are slow but memory is free: ensure tmpfs is used and not falling back to disk via incorrect mounts.
  • If you get degraded throughput under concurrency: check MALLOC_ARENA_MAX and consider using a different allocator or reducing threads.
  • Excessive flash writes: inspect swap usage and move heavy swap to zram or SSD; for flash-wear mitigation strategies see flash and caching strategies for cheap NAND.

Part 6 — Fleet patterns: HTTP caching, Varnish/Redis and partial downloads

When you manage many Pi devices, serving pre-quantized model shards from a nearby edge server saves bandwidth and supports partial downloads.

Server-side: headers and range support

Ensure model artifacts are served with:

  • Cache-Control: public, max-age=31536000 (if immutable/hashed filename)
  • ETag and Last-Modified
  • Accept-Ranges: bytes to permit partial fetch

# Nginx sample snippet
location /models/ {
    add_header Cache-Control "public, max-age=31536000, immutable";
    add_header Access-Control-Allow-Origin "*";
}

Edge caches: Varnish and Redis

Use Varnish as an HTTP cache for model artifacts — it handles Range requests and offloads backend bandwidth. Redis is useful for storing metadata (version maps, shard manifests, device-config) so devices can quickly decide which shards to request.
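
For the metadata side, a hedged sketch using the redis-py client; the gateway hostname, key naming scheme, and manifest schema are assumptions specific to this example:

import json
import redis  # redis-py client

# hostname and key layout are illustrative, not a convention of Redis itself
r = redis.Redis(host="edge-gateway.local", port=6379, decode_responses=True)

def get_shard_manifest(model_name):
    """Fetch the shard manifest for a model version, if the gateway has one."""
    raw = r.get(f"model:{model_name}:manifest")
    return json.loads(raw) if raw else None

# Devices compare the manifest against what already sits in tmpfs and then
# request only the missing shards over HTTP Range.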

Client-side: manifest + Range streaming

Ship a small JSON manifest with shard offsets and checksums. The device uses Range to download needed shards into tmpfs and validates checksums before using them.

# manifest.json example
{
  "model": "large_model_v3.safetensors",
  "shards": [
    {"offset":0, "length":1048576, "sha256":"..."},
    {"offset":1048576, "length":1048576, "sha256":"..."}
  ]
}
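
A client-side sketch that consumes a manifest like the one above, assuming the requests library, a server that honours Range, and the URL shown (all illustrative):

import hashlib
import json
from pathlib import Path

import requests

BASE_URL = "https://models.example.local"  # illustrative model server
CACHE_DIR = Path("/var/tmp/models")        # tmpfs mount from Part 2

def fetch_shard(model, shard):
    start = shard["offset"]
    end = start + shard["length"] - 1
    resp = requests.get(
        f"{BASE_URL}/{model}",
        headers={"Range": f"bytes={start}-{end}"},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.content
    # validate the shard checksum before trusting the bytes
    if hashlib.sha256(data).hexdigest() != shard["sha256"]:
        raise ValueError(f"checksum mismatch at offset {start}")
    return data

def sync_model(manifest_path):
    manifest = json.loads(Path(manifest_path).read_text())
    target = CACHE_DIR / manifest["model"]
    with open(target, "wb") as out:
        for shard in manifest["shards"]:
            out.seek(shard["offset"])
            out.write(fetch_shard(manifest["model"], shard))
    return target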

Part 7 — Advanced: runtime integrations and model formats (2026)

By 2026 most inference runtimes on small boards support memory-mapped formats or have direct mmap options. Some recommended moves:

  • Use safetensors where possible — it's faster to parse and easier to mmap safely than some legacy formats.
  • Prefer quantized exports (4/8-bit) to reduce working set.
  • Use acceleration delegates in TFLite or ONNX Runtime that accept mmap-backed models or file descriptors to avoid copying into the process heap; a loading sketch follows this list. See storage-focused device guidance: storage considerations for on-device AI.
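
Whether the backend actually memory-maps the file depends on the runtime and delegate, but loading by path at least avoids an extra copy through Python. A minimal sketch, assuming the tflite-runtime wheel is installed and the model path is illustrative:

# package name differs from full TensorFlow; install the tflite-runtime wheel
from tflite_runtime.interpreter import Interpreter

MODEL_PATH = "/var/tmp/models/classifier_int8.tflite"  # hypothetical quantized model

# Passing a path (rather than model_content=<bytes you read yourself>) lets the
# runtime open and manage the file instead of copying it into the Python heap.
interpreter = Interpreter(model_path=MODEL_PATH, num_threads=2)
interpreter.allocate_tensors()
print(interpreter.get_input_details()[0]["shape"])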

Wrap-up: quick action plan

Follow this minimal rollout plan and measure at each step:

  1. Baseline: measure model load time and P90/P99 latency with current setup.
  2. Enable zram and lower swappiness; measure improvement in stability and tail latency.
  3. Mount tmpfs, populate one hot model and re-measure load times.
  4. Convert your heaviest model to a memory-mappable format (safetensors/memmap) and implement lazy load.
  5. Tune Python GC and MALLOC_ARENA_MAX and run a stress test.

Pro tip: Do not copy the entire model into a Python object unless you must. Keep the file mmapped and only materialize tensors required for the current inference step.
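
A small illustration of that tip: materialize a single tensor as a NumPy view over the mapped file instead of reading the whole model. The offset, element count and shape are placeholders you would normally read from your own header format:

import mmap

import numpy as np

PATH = "/var/tmp/models/large_model.bin"  # same file as the mmap example in Part 3
OFFSET, COUNT = 4096, 256 * 256           # placeholder tensor location and size

with open(PATH, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# np.frombuffer creates a read-only view over the mapped pages; only the pages
# this inference step actually touches are faulted into RAM.
weights = np.frombuffer(mm, dtype=np.float32, count=COUNT, offset=OFFSET).reshape(256, 256)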

Further reading & tooling (2024–2026): where the ecosystem landed

  • Use safetensors / memmap patterns for safe, fast weight access.
  • Leverage ONNX Runtime and TFLite backends optimized for ARM NPUs introduced in late 2024–2025 and matured by 2026.
  • Adopt edge HTTP caching with Varnish and Range support when you operate fleets.

Final takeaways

On Raspberry Pi 5 and similar SBCs, predictable Tiny ML performance is a systems problem — not just a model problem. Using zram, conservative tmpfs caching, memory-mapped and shardable model formats, and tuned Python GC/allocator settings will reduce latency, limit flash wear, and make behavior under load predictable.

Call to action

Try the checklist above on one Pi 5 device this week: enable zram, mount a tmpfs, and memory-map a model. Measure load times and tail latency, then iterate by tuning GC and swappiness. If you want a ready-made toolbox, clone our repo of systemd services, populate scripts and sample manifests (search: cached.space tiny-ml-pi5). Share your benchmark numbers or ask for a custom tuning plan — let's make your edge ML run reliably.
