TinyML – Machine Learning on Microcontrollers (2025 Hands-On Guide)

1. What Is TinyML & Why Now?

TinyML brings machine learning inference to microcontrollers (MCUs) that run on milliwatts—or even microwatts—of power. Unlike SBCs (e.g., Raspberry Pi 4), MCUs have kilobytes to a few megabytes of RAM and no operating system, yet can perform keyword spotting, anomaly detection, and simple vision tasks offline with real-time latency.

Benefits: instant response, privacy (no cloud), low bandwidth, long battery life, low BOM cost, and improved reliability in poor connectivity environments.

2. Hardware: Boards, MCUs & Sensors

Popular Boards

  • Arduino Nano 33 BLE Sense (nRF52840, 1 MB Flash, 256 KB RAM; on-board mic/IMU/temp)
  • ESP32-S3 (Xtensa dual-core with vector accel; Wi-Fi/BLE; more RAM via PSRAM)
  • STM32 (Cortex-M4/M7; DSP extensions; wide ecosystem)
  • Raspberry Pi Pico (RP2040) (dual-core M0+; PIO; add external sensors)

Sensor Choices

  • Audio mic (keyword spotting, cough detection)
  • IMU (gesture, activity, machinery vibration)
  • Environmental (temp, humidity, gas—contextual features)
  • Low-res camera (basic motion/object presence)

Selection Criteria

  • RAM/Flash headroom vs. model size
  • Vector/DSP support (CMSIS-NN, ESP-NN)
  • Power modes (deep sleep, ULP co-processor)
  • I/O & radios (BLE/Wi-Fi) for event streaming

3. Workflow: Data → Features → Model → Firmware

  1. Collect: Record representative data on the target sensor at production sampling rates.
  2. Label & Split: Train/val/test; keep a hold-out from a different day/device.
  3. Feature Engineering: MFCCs for audio; spectral stats for vibration; simple image downsampling (e.g., 96×96 grayscale).
  4. Model: Choose MCU-friendly architectures; keep parameters small.
  5. Optimize: int8 quantization; optional pruning/distillation.
  6. Package: Convert to TFLite Micro or ONNX and generate a C array for Flash.
  7. Deploy: Integrate with RTOS/loop; add ring buffers and state smoothing.
  8. Measure: RAM/Flash usage, latency, accuracy, and current draw.
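Step 6's "C array" is simply the model binary embedded in the firmware image so the linker can place it in Flash. A sketch of what a converter such as `xxd -i model.tflite` emits — the byte values here are illustrative and the body is elided:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical converter output: the model flatbuffer as a byte array.
// A real TFLite file carries the "TFL3" identifier at offset 4.
const unsigned char g_model[] = {
    0x1c, 0x00, 0x00, 0x00, 'T', 'F', 'L', '3',
    // ... remaining model bytes elided ...
};
const size_t g_model_len = sizeof(g_model);
```

The firmware then hands `g_model`/`g_model_len` to the runtime at init, with no filesystem or dynamic loading involved.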

4. Model Architectures That Fit in KBs

Classical ML

  • Logistic regression / linear SVM on MFCC or spectral features
  • Random Forest (shallow) for vibrations/IMU
  • K-NN with prototype compression
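To see how cheap these classical models are at inference time: a linear classifier over int8 features reduces to one fixed-point dot product plus a threshold. A minimal sketch (function names like `dot_i8` are illustrative, not a library API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A linear model (logistic regression / linear SVM) over int8 features is
// a single dot product. Accumulate in int32 to avoid overflow, as the
// CMSIS-NN kernels do.
int32_t dot_i8(const int8_t* w, const int8_t* x, size_t n, int32_t bias) {
    int32_t acc = bias;
    for (size_t i = 0; i < n; ++i) acc += int32_t(w[i]) * int32_t(x[i]);
    return acc;
}

// Decision rule: positive class when the score clears a tuned threshold.
bool classify(const int8_t* w, const int8_t* x, size_t n,
              int32_t bias, int32_t threshold) {
    return dot_i8(w, x, n, bias) > threshold;
}

// Worked example: w = {10, -5, 2}, x = {4, 2, 1} -> 40 - 10 + 2 = 32.
int32_t demo_score() {
    const int8_t w[3] = {10, -5, 2};
    const int8_t x[3] = {4, 2, 1};
    return dot_i8(w, x, 3, /*bias=*/0);
}
```

On an MFCC vector of a few dozen coefficients this is tens of multiply-accumulates per decision — effectively free next to the feature extraction itself.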

Tiny DNNs

  • DS-CNN for keyword spotting (depthwise-separable convs)
  • 1D CNN for IMU sequences
  • Tiny MobileNet for 96×96 grayscale vision
  • TCN (very small) for time-series patterns
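The parameter savings that make DS-CNNs practical are easy to verify: a depthwise-separable layer replaces a standard convolution's c_in·c_out·k weights with c_in·k (depthwise) + c_in·c_out (pointwise). A small sketch of the arithmetic:

```cpp
#include <cassert>
#include <cstdint>

// Parameter counts (ignoring biases) for a conv layer with kernel size k,
// c_in input channels and c_out output channels.
constexpr int64_t standard_conv_params(int64_t c_in, int64_t c_out, int64_t k) {
    return c_in * c_out * k;        // every filter spans every input channel
}

// Depthwise-separable: one k-tap filter per input channel (depthwise),
// then a 1x1 pointwise conv to mix channels.
constexpr int64_t ds_conv_params(int64_t c_in, int64_t c_out, int64_t k) {
    return c_in * k + c_in * c_out;
}
```

For a 64-to-64-channel layer with k = 3, that is 12,288 vs 4,288 weights — roughly a 3× saving per layer, and larger with bigger kernels.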

Anomaly Detection

  • Autoencoder on spectral features
  • One-Class SVM / Isolation Forest
  • Statistical thresholds with adaptive baselines

Rule of thumb: models with under ~100K parameters and int8 weights can run comfortably on many Cortex-M4F/M7-class MCUs.
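A minimal sketch of the third anomaly-detection approach — a statistical threshold with an adaptive baseline, tracking an exponentially weighted mean/variance and flagging samples more than k standard deviations out (the class name, alpha, and k are illustrative choices):

```cpp
#include <cassert>
#include <cmath>

// Adaptive-baseline detector for a scalar feature (e.g. vibration RMS).
// alpha controls how fast the baseline adapts; k sets the sensitivity.
struct AdaptiveThreshold {
    double mean = 0.0, var = 1.0;
    double alpha, k;
    AdaptiveThreshold(double alpha_, double k_) : alpha(alpha_), k(k_) {}

    // Returns true if x is anomalous relative to the current baseline.
    bool update(double x) {
        double d = x - mean;
        bool anomaly = std::fabs(d) > k * std::sqrt(var);
        // Only adapt on normal samples so an ongoing fault does not get
        // absorbed into the baseline.
        if (!anomaly) {
            mean += alpha * d;
            var  += alpha * (d * d - var);
        }
        return anomaly;
    }
};

// Demo: a quiet signal establishes the baseline; a large spike is flagged.
bool demo_flags_spike() {
    AdaptiveThreshold t(0.1, 4.0);
    for (int i = 0; i < 20; ++i) t.update(0.1 * (i % 2));
    return t.update(50.0);
}
```

This costs two state variables per feature, which is why it scales to MCUs with no room for a learned model at all.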

5. Quantization, Pruning & Distillation

  • Integer Quantization (int8): Converts weights/activations to 8-bit. Calibrate with a representative dataset.
  • Pruning: Remove small-magnitude weights, then fine-tune and re-quantize.
  • Knowledge Distillation: Train a small “student” to match a large “teacher.”
  • Operator Fusion & CMSIS-NN: Use optimized kernels for ARM; ESP-NN for ESP32.

Size/Speed Tradeoff (illustrative)

FP32 DS-CNN: 420 KB weights → int8: ~110 KB; latency ↓ ~3–5×; accuracy −0.5 to −1.5 pp
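The int8 scheme above is affine quantization, real ≈ scale × (q − zero_point), with scale and zero-point calibrated from a tensor's observed range. A simplified sketch of the round trip (TFLite's actual converter adds per-axis scales and other refinements):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct QParams { float scale; int32_t zero_point; };

// Derive scale/zero-point from a tensor's observed min/max (the
// "representative dataset" calibration step). The range must include zero.
QParams calibrate(float min_v, float max_v) {
    if (min_v > 0.0f) min_v = 0.0f;
    if (max_v < 0.0f) max_v = 0.0f;
    float range = max_v - min_v;
    float scale = (range > 0.0f) ? range / 255.0f : 1.0f;  // guard degenerate tensors
    int32_t zp = int32_t(std::round(-128.0f - min_v / scale));
    return {scale, zp};
}

int8_t quantize(float x, QParams p) {
    int32_t q = int32_t(std::round(x / p.scale)) + p.zero_point;
    if (q < -128) q = -128;
    if (q > 127)  q = 127;
    return int8_t(q);
}

float dequantize(int8_t q, QParams p) {
    return p.scale * float(int32_t(q) - p.zero_point);
}

// Round-trip error for a value in a calibrated [-1, 1] range.
float demo_roundtrip_error(float x) {
    QParams p = calibrate(-1.0f, 1.0f);
    return std::fabs(dequantize(quantize(x, p), p) - x);
}
```

The round-trip error is bounded by about half the scale, which is why calibrating on a representative range (rather than an inflated one) matters for accuracy.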

6. Deployment with TFLM, Edge Impulse & ONNX

TensorFlow Lite Micro (TFLM)

  • Compile model as a C array; no dynamic allocation by default.
  • Use arena sizing to control RAM; enable CMSIS-NN or ESP-NN.
  • Integrate with Arduino, Zephyr, FreeRTOS, or bare-metal loops.

Edge Impulse Studio

  • Collect/label data, design DSP blocks + models in GUI.
  • Auto-quantization and deployment to many boards.
  • Generates firmware and profiling reports.

ONNX Runtime for MCUs

  • Convert PyTorch/scikit-learn models; only a subset of operators is supported on embedded targets.
  • Static memory planning; mixed int8/FP16 on some targets.

7. Power, Latency & Memory Budgets

Typical budgets: latency ≤ 50 ms (audio); RAM 64–256 KB; Flash 256 KB–1 MB; sleep current in the µA range; active current in the mA range.
  • Duty Cycling: Wake on interrupt (sound level, IMU motion), process a short window, go back to sleep.
  • Cascaded Pipelines: Cheap heuristic gate → expensive model only on triggers.
  • Buffering: Use ring buffers for streaming windows; avoid heap fragmentation.
  • Fixed-point DSP: Prefer int16/int8 path; pre-compute FFT twiddles.
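The duty-cycling and cascading ideas combine naturally: run a cheap integer energy gate on each window and wake the expensive model only when it fires. A host-testable sketch (thresholds and names are illustrative; on real hardware the MCU would sleep between windows):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Mean-square energy of an int16 window; int64 accumulator avoids overflow
// and skipping the sqrt keeps the gate sqrt-free on-device.
int64_t window_energy(const int16_t* s, size_t n) {
    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i) acc += int64_t(s[i]) * int64_t(s[i]);
    return acc / int64_t(n);
}

// Returns true when the expensive stage (MFCC + model) should run.
bool gate(const int16_t* s, size_t n, int64_t threshold) {
    return window_energy(s, n) > threshold;
}

bool demo_loud()  { const int16_t s[4] = {1000, -1000, 1000, -1000}; return gate(s, 4, 1000); }
bool demo_quiet() { const int16_t s[4] = {1, -1, 1, -1};             return gate(s, 4, 1000); }
```

Because the gate rejects most windows, average current is dominated by the cheap stage plus sleep, not by inference.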

8. Privacy, Security & Updates

  • On-device inference: no raw data leaves device by default.
  • Signed firmware: Secure boot + OTA with signature checks.
  • Data retention: Store only features, not raw audio/video, when possible.
  • Shadow mode: Log predictions for tuning before enabling actions.

9. Use Cases & Project Blueprints

Keyword Spotting (Wake Word)

Sensors: MEMS mic • Features: MFCC (20–40 coeffs) • Model: DS-CNN (~60–100K params)

Pipeline: VAD gate → MFCC → int8 DS-CNN → smoothing (debounce 200 ms) → trigger GPIO/BLE event.
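The smoothing/debounce stage can be sketched as an m-of-n vote over recent frames plus a refractory period. Counts stand in for milliseconds here — on-device you would derive them from the hop size (e.g. 200 ms / 10 ms hop = 20 frames). The class is a hypothetical sketch, not a library type:

```cpp
#include <cassert>

// Fire when the keyword wins m of the last n frames, then suppress
// re-triggers for `refractory` frames.
struct Debouncer {
    int hits = 0, window = 0, cooldown = 0;
    int n, m, refractory;
    Debouncer(int n_, int m_, int refractory_)
        : n(n_), m(m_), refractory(refractory_) {}

    // Call once per inference frame; returns true when a trigger fires.
    bool push(bool keyword_won) {
        if (cooldown > 0) { --cooldown; return false; }
        if (keyword_won) ++hits;
        if (++window < n) return false;
        bool fire = hits >= m;
        hits = 0; window = 0;
        if (fire) cooldown = refractory;
        return fire;
    }
};

// Demo: 2-of-3 vote, 5-frame refractory; the second burst is suppressed.
int demo_triggers() {
    Debouncer d(3, 2, 5);
    const bool seq[9] = {true, true, false, true, true, true, false, false, false};
    int fires = 0;
    for (bool b : seq) if (d.push(b)) ++fires;
    return fires;
}
```

Smoothing like this is usually worth more accuracy in the field than another thousand parameters in the model.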

Vibration Anomaly (Predictive Maintenance)

Sensors: IMU/accelerometer • Features: spectral bands, RMS, kurtosis • Model: 1D CNN or autoencoder

Actions: BLE alert when anomaly score > threshold for N windows; store top-K FFT bins.
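Two of the listed features can be sketched directly. Float math is used for clarity; an int16 fixed-point version is the usual on-device form:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// RMS: overall vibration energy in the window.
double rms(const double* x, size_t n) {
    double acc = 0;
    for (size_t i = 0; i < n; ++i) acc += x[i] * x[i];
    return std::sqrt(acc / double(n));
}

// Kurtosis (m4 / m2^2): peakedness of the distribution. It is 3 for a
// Gaussian and rises with impulsive signals such as bearing faults.
double kurtosis(const double* x, size_t n) {
    double mean = 0;
    for (size_t i = 0; i < n; ++i) mean += x[i];
    mean /= double(n);
    double m2 = 0, m4 = 0;
    for (size_t i = 0; i < n; ++i) {
        double d = x[i] - mean;
        m2 += d * d;
        m4 += d * d * d * d;
    }
    m2 /= double(n); m4 /= double(n);
    return m4 / (m2 * m2);
}

double demo_rms()      { const double x[2] = {3, -4};       return rms(x, 2); }       // sqrt(12.5)
double demo_kurtosis() { const double x[4] = {1, -1, 1, -1}; return kurtosis(x, 4); } // square wave -> 1
```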

Gesture Recognition

Sensors: 6-axis/9-axis IMU • Features: sliding windows (100–200 ms) • Model: TCN/1D CNN

Simple Vision Events

Sensors: tiny grayscale camera • Preproc: downsample + normalize • Model: tiny MobileNet; or motion heuristics + classifier.
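The preprocessing step can be sketched as a 2× average-pool followed by the (pixel − 128) shift into the signed int8 range that quantized vision models typically expect:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Average-pool a w x h grayscale frame down to (w/2) x (h/2) and convert
// each pixel from uint8 [0,255] to the int8 [-128,127] domain.
void downsample2x_to_int8(const uint8_t* src, int w, int h, int8_t* dst) {
    for (int y = 0; y < h / 2; ++y) {
        for (int x = 0; x < w / 2; ++x) {
            int sum = src[(2 * y) * w + 2 * x] + src[(2 * y) * w + 2 * x + 1] +
                      src[(2 * y + 1) * w + 2 * x] + src[(2 * y + 1) * w + 2 * x + 1];
            dst[y * (w / 2) + x] = int8_t(sum / 4 - 128);
        }
    }
}

// Demo: a 2x2 frame {200,200,100,100} pools to 150, shifted to 22.
int demo_pixel() {
    const uint8_t src[4] = {200, 200, 100, 100};
    int8_t dst[1];
    downsample2x_to_int8(src, 2, 2, dst);
    return dst[0];
}
```

Matching this shift to the model's input zero-point (check the tensor metadata) avoids a silent accuracy loss at deployment.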

10. Example Code: C++ Inference Loop (TFLM)

Minimal inference loop with ring buffer and debounce
// Pseudo-code for Arduino/ESP32 using TFLite Micro
#include "model_data.h"     // const unsigned char g_model[]; size_t g_model_len;
#include "tflite_micro_all.h" // hypothetical wrapper around the TFLM setup/invoke API

constexpr int kAudioHz = 16000;
constexpr int kWinMs = 30, kHopMs = 10;
RingBuffer<int16_t, kAudioHz * 1> audio_rb; // 1s buffer
int8_t input_tensor[INPUT_BYTES];           // from model metadata
int8_t output_tensor[OUTPUT_BYTES];

void setup() {
  init_mic(kAudioHz);
  init_mfcc(/*coeffs=*/32, kWinMs, kHopMs);
  tflm_init(g_model, g_model_len);          // sets up arena, CMSIS-NN
}

void loop() {
  if (mic_available()) {
    int16_t s = mic_read();
    audio_rb.push(s);
  }
  if (audio_rb.ready_frame(kWinMs, kHopMs)) {
    mfcc_extract(audio_rb.latest_frame(), input_tensor);
    tflm_invoke(input_tensor, output_tensor);
    int cls = argmax(output_tensor);
    if (cls == KEYWORD && debounce_ms(200)) {
      trigger_event(); // GPIO/BLE/Wi-Fi
    }
  }
}
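The `RingBuffer` above is assumed, not a standard library type. A minimal fixed-capacity version might look like this — statically sized (no heap, so no fragmentation), overwriting the oldest sample when full; the frame-windowing helpers (`ready_frame`, `latest_frame`) are omitted for brevity:

```cpp
#include <cassert>
#include <cstddef>

// Single-producer/single-consumer ring buffer with compile-time capacity.
template <typename T, size_t N>
class RingBuffer {
public:
    // Append a sample, dropping the oldest one when the buffer is full.
    void push(T v) {
        buf_[head_] = v;
        head_ = (head_ + 1) % N;
        if (count_ < N) ++count_;
        else tail_ = (tail_ + 1) % N;  // overwrite: advance past the dropped sample
    }
    // Remove the oldest sample; returns false when empty.
    bool pop(T* out) {
        if (count_ == 0) return false;
        *out = buf_[tail_];
        tail_ = (tail_ + 1) % N;
        --count_;
        return true;
    }
    size_t size() const { return count_; }
private:
    T buf_[N] = {};
    size_t head_ = 0, tail_ = 0, count_ = 0;
};

// Demo: capacity 3, pushing 1..4 drops the 1, so the oldest sample is 2.
int demo_oldest_after_overflow() {
    RingBuffer<int, 3> rb;
    for (int i = 1; i <= 4; ++i) rb.push(i);
    int v = 0;
    rb.pop(&v);
    return v;
}
```

The compile-time capacity keeps all storage in .bss, so RAM usage is visible in the map file rather than discovered at runtime.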

11. Debugging & Benchmarking

  • Sanity checks: Overfit a tiny subset; ensure pipeline correctness.
  • Feature drift: Log mean/var of features on device; compare with training.
  • Profiling: Toggle GPIO before/after inference; measure with logic analyzer.
  • Memory: Print arena usage; binary size breakdown (map file).
  • Latency: Aim <50 ms for audio; <100 ms for IMU gestures.
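For the feature-drift check, Welford's online algorithm tracks a running mean/variance per feature in O(1) memory, so the device can periodically report statistics for comparison against the training set. A sketch:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Welford's online mean/variance: numerically stable, one pass, no buffer.
struct RunningStats {
    int64_t n = 0;
    double mean = 0.0, m2 = 0.0;

    void push(double x) {
        ++n;
        double d = x - mean;
        mean += d / double(n);
        m2 += d * (x - mean);   // uses the updated mean
    }
    double variance() const { return n > 1 ? m2 / double(n - 1) : 0.0; }
};

// Demo on a known dataset: mean 5, sample variance 32/7.
RunningStats demo_stats() {
    RunningStats s;
    const double v[8] = {2, 4, 4, 4, 5, 5, 7, 9};
    for (double x : v) s.push(x);
    return s;
}
```

Logging just these two numbers per feature is usually enough to catch a mispositioned sensor or a changed gain before accuracy quietly degrades.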

Benchmark Table (Example)

Task      | Board       | Model       | Flash  | RAM   | Latency | Accuracy
Keyword   | Nano 33 BLE | DS-CNN int8 | 120 KB | 60 KB | 22 ms   | 94%
Vibration | ESP32-S3    | 1D CNN int8 | 150 KB | 80 KB | 18 ms   | 95% AUC
Gesture   | RP2040      | TCN int8    | 90 KB  | 48 KB | 28 ms   | 92%
Note: Numbers above are illustrative—measure on your exact firmware, compiler flags, and kernels.

12. FAQs

Can TinyML train on-device?

Mostly no—training is compute-heavy. Some boards can fine-tune tiny models or thresholds; for full training, use desktop/cloud and deploy weights.

How do I update models in the field?

Use OTA with signed artifacts. Keep model and features backward-compatible; version your DSP blocks.

What about non-audio, non-IMU tasks?

Environmental anomaly detection, smart agriculture (soil moisture patterns), and simple presence detection are viable.

How do I avoid false triggers?

Use multi-stage detection: cheap VAD/motion gate → classifier → temporal smoothing and confidence thresholds.

