TinyML – Machine Learning on Microcontrollers
1. What Is TinyML & Why Now?
TinyML brings machine learning inference to microcontrollers (MCUs) that run on milliwatts—or even microwatts—of power. Unlike SBCs (e.g., Raspberry Pi 4), MCUs have kilobytes to a few megabytes of RAM and no operating system, yet can perform keyword spotting, anomaly detection, and simple vision tasks offline with real-time latency.
2. Hardware: Boards, MCUs & Sensors
Popular Boards
- Arduino Nano 33 BLE Sense (nRF52840, 1 MB Flash, 256 KB RAM; on-board mic/IMU/temp)
- ESP32-S3 (Xtensa dual-core with vector accel; Wi-Fi/BLE; more RAM via PSRAM)
- STM32 (Cortex-M4/M7; DSP extensions; wide ecosystem)
- Raspberry Pi Pico (RP2040) (dual-core M0+; PIO; add external sensors)
Sensor Choices
- Audio mic (keyword spotting, cough detection)
- IMU (gesture, activity, machinery vibration)
- Environmental (temp, humidity, gas—contextual features)
- Low-res camera (basic motion/object presence)
Selection Criteria
- RAM/Flash headroom vs. model size
- Vector/DSP support (CMSIS-NN, ESP-NN)
- Power modes (deep sleep, ULP co-processor)
- I/O & radios (BLE/Wi-Fi) for event streaming
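The first selection criterion, RAM/Flash headroom, can be checked with simple arithmetic before any firmware is written. A minimal sketch (the `Budget` struct and reserve figures are illustrative; real numbers come from your linker map file):

```cpp
#include <cstddef>

// Rough fit check: model weights live in Flash, the tensor arena in RAM.
// The reserve arguments account for application code, stacks, and buffers.
struct Budget { std::size_t flash_bytes, ram_bytes; };

constexpr bool fits(const Budget& mcu, std::size_t model_flash,
                    std::size_t arena_ram, std::size_t app_reserve_flash,
                    std::size_t app_reserve_ram) {
    return model_flash + app_reserve_flash <= mcu.flash_bytes &&
           arena_ram + app_reserve_ram <= mcu.ram_bytes;
}
```

For example, a 120 KB model with a 60 KB arena fits comfortably on a Nano 33 BLE Sense (1 MB Flash, 256 KB RAM) even with generous application reserves.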
3. Workflow: Data → Features → Model → Firmware
- Collect: Record representative data on the target sensor at production sampling rates.
- Label & Split: Train/val/test; keep a hold-out from a different day/device.
- Feature Engineering: MFCCs for audio; spectral stats for vibration; simple image downsampling (e.g., 96×96 grayscale).
- Model: Choose MCU-friendly architectures; keep parameters small.
- Optimize: int8 quantization; optional pruning/distillation.
- Package: Convert to TFLite Micro or ONNX and generate a C array for Flash.
- Deploy: Integrate with RTOS/loop; add ring buffers and state smoothing.
- Measure: RAM/Flash usage, latency, accuracy, and current draw.
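The "Package" step above usually means embedding the model binary as a C array so it lives in Flash. A sketch of the conversion, mimicking what `xxd -i` produces (the helper name `to_c_array` is our own):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Emits a C array literal from raw model bytes, ready to compile into Flash.
std::string to_c_array(const std::vector<uint8_t>& bytes,
                       const std::string& name) {
    std::string out = "const unsigned char " + name + "[] = {";
    char buf[8];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        std::snprintf(buf, sizeof(buf), "0x%02x", unsigned(bytes[i]));
        out += (i ? ", " : "") + std::string(buf);
    }
    out += "};";
    return out;
}
```

In practice you would run this (or `xxd -i model.tflite`) on the host and check the generated header into the firmware tree.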
4. Model Architectures That Fit in KBs
Classical ML
- Logistic regression / linear SVM on MFCC or spectral features
- Random Forest (shallow) for vibrations/IMU
- K-NN with prototype compression
Tiny DNNs
- DS-CNN for keyword spotting (depthwise-separable convs)
- 1D CNN for IMU sequences
- Tiny MobileNet for 96×96 grayscale vision
- TCN (very small) for time-series patterns
Anomaly Detection
- Autoencoder on spectral features
- One-Class SVM / Isolation Forest
- Statistical thresholds with adaptive baselines
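The last option, statistical thresholds with adaptive baselines, fits in a few dozen bytes of state. One possible sketch (the `alpha`/`z`/warm-up constants are illustrative and would be tuned per deployment): track an exponentially weighted mean and variance, and flag samples whose squared deviation exceeds z² times the variance.

```cpp
#include <algorithm>

// Adaptive baseline: EWMA mean/variance of a feature; a sample is anomalous
// when its deviation exceeds z standard deviations. Anomalous samples are
// excluded from the baseline so faults don't get "learned" as normal.
class AdaptiveThreshold {
public:
    AdaptiveThreshold(float alpha = 0.05f, float z = 4.0f,
                      int warmup = 10, float min_var = 1e-4f)
        : alpha_(alpha), z2_(z * z), warmup_(warmup), min_var_(min_var) {}

    // Feed one feature value; returns true if it looks anomalous.
    bool update(float x) {
        if (n_ < warmup_) { track(x); ++n_; return false; }
        float d = x - mean_;
        if (d * d > z2_ * std::max(var_, min_var_)) return true;
        track(x);
        return false;
    }

private:
    void track(float x) {
        if (n_ == 0) { mean_ = x; return; }
        float d = x - mean_;
        mean_ += alpha_ * d;
        var_ = (1.0f - alpha_) * (var_ + alpha_ * d * d);
    }
    float alpha_, z2_;
    int warmup_;
    float min_var_, mean_ = 0.0f, var_ = 0.0f;
    int n_ = 0;
};
```

The `min_var` floor keeps sensor quantization noise from making every tiny fluctuation look anomalous once the baseline has converged.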
Rule of thumb: models under ~100K parameters with int8 weights run comfortably on many Cortex-M4F/M7-class MCUs.
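Why depthwise-separable convolutions (as in DS-CNN) stay under that parameter budget is easy to see from the weight counts. A quick check, ignoring biases:

```cpp
#include <cstdint>

// Weight counts for a k×k convolution layer (biases ignored).
constexpr uint32_t conv_params(uint32_t k, uint32_t c_in, uint32_t c_out) {
    return k * k * c_in * c_out;        // standard conv: full k×k×Cin per output
}
constexpr uint32_t ds_conv_params(uint32_t k, uint32_t c_in, uint32_t c_out) {
    return k * k * c_in                 // depthwise: one k×k filter per channel
         + c_in * c_out;                // pointwise: 1×1 conv mixing channels
}
```

A 3×3 layer with 64 input and 64 output channels needs 36,864 weights as a standard conv but only 4,672 as a depthwise-separable pair, roughly an 8× reduction per layer.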
5. Quantization, Pruning & Distillation
- Integer Quantization (int8): Converts weights/activations to 8-bit. Calibrate with a representative dataset.
- Pruning: Remove small-magnitude weights, then fine-tune and re-quantize.
- Knowledge Distillation: Train a small “student” to match a large “teacher.”
- Operator Fusion & CMSIS-NN: Use optimized kernels for ARM; ESP-NN for ESP32.
Example: an FP32 DS-CNN with 420 KB of weights shrinks to ~110 KB after int8 quantization; latency drops ~3–5×; accuracy typically falls by 0.5–1.5 percentage points.
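The arithmetic behind int8 quantization is a simple affine map, real ≈ scale × (q − zero_point), with scale and zero-point chosen from the value range observed during calibration. A self-contained sketch (TFLite-style conventions; helper names are our own):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct QParams { float scale; int32_t zero_point; };

// Derive scale/zero-point from the calibration range [rmin, rmax];
// the range is widened to include zero so that 0.0 quantizes exactly.
QParams choose_qparams(float rmin, float rmax) {
    rmin = std::min(rmin, 0.0f);
    rmax = std::max(rmax, 0.0f);
    float scale = (rmax - rmin) / 255.0f;
    int32_t zp = static_cast<int32_t>(std::lround(-128.0f - rmin / scale));
    return {scale, std::clamp<int32_t>(zp, -128, 127)};
}

int8_t quantize(float x, QParams p) {
    int32_t q = static_cast<int32_t>(std::lround(x / p.scale)) + p.zero_point;
    return static_cast<int8_t>(std::clamp<int32_t>(q, -128, 127));
}

float dequantize(int8_t q, QParams p) {
    return p.scale * static_cast<float>(q - p.zero_point);
}
```

The round-trip error is bounded by half the scale, which is why calibration with a representative dataset matters: a range that is too wide wastes precision, and one that is too narrow clips activations.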
6. Deployment with TFLM, Edge Impulse & ONNX
TensorFlow Lite Micro (TFLM)
- Compile model as a C array; no dynamic allocation by default.
- Use arena sizing to control RAM; enable CMSIS-NN or ESP-NN.
- Integrate with Arduino, Zephyr, FreeRTOS, or bare-metal loops.
Edge Impulse Studio
- Collect/label data, design DSP blocks + models in GUI.
- Auto-quantization and deployment to many boards.
- Generates firmware and profiling reports.
ONNX Runtime for MCUs
- Convert PyTorch/scikit-learn models; only a subset of ops is supported on embedded targets.
- Static memory planning; mixed int8/FP16 on some targets.
7. Power, Latency & Memory Budgets
- Duty Cycling: Wake on interrupt (sound level, IMU motion), process a short window, go back to sleep.
- Cascaded Pipelines: Cheap heuristic gate → expensive model only on triggers.
- Buffering: Use ring buffers for streaming windows; avoid heap fragmentation.
- Fixed-point DSP: Prefer int16/int8 path; pre-compute FFT twiddles.
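The buffering bullet above can be made concrete with a fixed-capacity ring buffer: all storage is static, so streaming audio or IMU windows never touch the heap. A minimal sketch (interface is our own, simpler than the `RingBuffer` used later in the pseudo-code):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Fixed-capacity ring buffer. Once full, the oldest sample is overwritten,
// which is exactly the behavior a sliding analysis window needs.
template <typename T, std::size_t N>
class RingBuffer {
public:
    void push(T v) {
        buf_[head_] = v;
        head_ = (head_ + 1) % N;
        if (count_ < N) ++count_;
    }
    std::size_t size() const { return count_; }
    // i = 0 is the oldest retained sample.
    T at(std::size_t i) const {
        return buf_[(head_ + N - count_ + i) % N];
    }
private:
    std::array<T, N> buf_{};
    std::size_t head_ = 0, count_ = 0;
};
```

With N sized to one analysis window (e.g. 16,000 samples for 1 s of 16 kHz audio), the ISR pushes samples and the main loop reads windows without any allocation.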
8. Privacy, Security & Updates
- On-device inference: no raw data leaves device by default.
- Signed firmware: Secure boot + OTA with signature checks.
- Data retention: Store only features, not raw audio/video, when possible.
- Shadow mode: Log predictions for tuning before enabling actions.
9. Use Cases & Project Blueprints
Keyword Spotting (Wake Word)
Sensors: MEMS mic • Features: MFCC (20–40 coeffs) • Model: DS-CNN (~60–100K params)
Pipeline: VAD gate → MFCC → int8 DS-CNN → smoothing (debounce 200 ms) → trigger GPIO/BLE event.
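The smoothing/debounce stage of this pipeline can be sketched as a small state machine: require k consecutive keyword frames before firing, then suppress re-triggers for a refractory window (the 200 ms above). The class below is illustrative, not a library API:

```cpp
#include <cstdint>

// Temporal smoothing for keyword spotting. Timestamps are in milliseconds,
// e.g. from millis() on Arduino.
class Debouncer {
public:
    Debouncer(int k, uint32_t refractory_ms)
        : k_(k), refractory_ms_(refractory_ms) {}

    // Feed one classifier decision per frame; returns true exactly when a
    // debounced trigger should fire.
    bool feed(bool keyword_frame, uint32_t now_ms) {
        streak_ = keyword_frame ? streak_ + 1 : 0;
        if (streak_ >= k_ &&
            (!fired_ || now_ms - last_ms_ >= refractory_ms_)) {
            fired_ = true;
            last_ms_ = now_ms;
            streak_ = 0;
            return true;
        }
        return false;
    }
private:
    int k_, streak_ = 0;
    uint32_t refractory_ms_, last_ms_ = 0;
    bool fired_ = false;
};
```

Requiring several consecutive positive frames filters out single-frame classifier blips; the refractory window prevents one utterance from firing the GPIO/BLE event repeatedly.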
Vibration Anomaly (Predictive Maintenance)
Sensors: IMU/accelerometer • Features: spectral bands, RMS, kurtosis • Model: 1D CNN or autoencoder
Actions: BLE alert when anomaly score > threshold for N windows; store top-K FFT bins.
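Two of the features named above, RMS and kurtosis, are plain sample statistics over a vibration window. A sketch, assuming the window has already been captured as floats:

```cpp
#include <cmath>
#include <cstddef>

// Root-mean-square amplitude of a window.
float rms(const float* x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += double(x[i]) * x[i];
    return float(std::sqrt(s / double(n)));
}

// Sample kurtosis (m4 / m2^2); ~3.0 for Gaussian vibration, and it rises
// sharply on impulsive faults such as bearing damage.
float kurtosis(const float* x, std::size_t n) {
    double mean = 0.0;
    for (std::size_t i = 0; i < n; ++i) mean += x[i];
    mean /= double(n);
    double m2 = 0.0, m4 = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double d = x[i] - mean;
        m2 += d * d;
        m4 += d * d * d * d;
    }
    m2 /= double(n);
    m4 /= double(n);
    return float(m4 / (m2 * m2));
}
```

These two scalars per spectral band, plus a handful of FFT bin energies, are often enough input for the 1D CNN or autoencoder mentioned above.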
Gesture Recognition
Sensors: 6-axis/9-axis IMU • Features: sliding windows (100–200 ms) • Model: TCN/1D CNN
Simple Vision Events
Sensors: tiny grayscale camera • Preproc: downsample + normalize • Model: tiny MobileNet; or motion heuristics + classifier.
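The downsample step in this blueprint can be as simple as 2×2 average pooling, e.g. stepping a 192×192 capture toward the 96×96 grayscale input a tiny vision model expects. An illustrative sketch (assumes row-major 8-bit grayscale and even dimensions):

```cpp
#include <cstdint>
#include <vector>

// 2×2 average-pool downsample: each output pixel is the mean of a
// 2×2 block of input pixels. Halves both width and height.
std::vector<uint8_t> downsample2x(const std::vector<uint8_t>& img,
                                  int w, int h) {
    std::vector<uint8_t> out((w / 2) * (h / 2));
    for (int y = 0; y < h / 2; ++y) {
        for (int x = 0; x < w / 2; ++x) {
            int s = img[(2 * y) * w + 2 * x]
                  + img[(2 * y) * w + 2 * x + 1]
                  + img[(2 * y + 1) * w + 2 * x]
                  + img[(2 * y + 1) * w + 2 * x + 1];
            out[y * (w / 2) + x] = uint8_t(s / 4);
        }
    }
    return out;
}
```

On a device you would write into a preallocated static buffer instead of returning a `std::vector`, but the arithmetic is the same.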
10. Example Code: C++ Inference Loop (TFLM)
// Pseudo-code for Arduino/ESP32 using TFLite Micro
#include "model_data.h"       // const unsigned char g_model[]; size_t g_model_len;
#include "tflite_micro_all.h" // placeholder for the TFLM headers your port provides

constexpr int kAudioHz = 16000;
constexpr int kWinMs = 30, kHopMs = 10;

RingBuffer<int16_t, kAudioHz * 1> audio_rb;  // 1 s of audio
int8_t input_tensor[INPUT_BYTES];   // sizes come from model metadata
int8_t output_tensor[OUTPUT_BYTES];

void setup() {
  init_mic(kAudioHz);
  init_mfcc(/*coeffs=*/32, kWinMs, kHopMs);
  tflm_init(g_model, g_model_len);  // sets up the tensor arena, CMSIS-NN kernels
}

void loop() {
  if (mic_available()) {
    audio_rb.push(mic_read());      // drain mic samples into the ring buffer
  }
  if (audio_rb.ready_frame(kWinMs, kHopMs)) {
    mfcc_extract(audio_rb.latest_frame(), input_tensor);
    tflm_invoke(input_tensor, output_tensor);
    int cls = argmax(output_tensor);
    if (cls == KEYWORD && debounce_ms(200)) {
      trigger_event();              // GPIO/BLE/Wi-Fi
    }
  }
}
11. Debugging & Benchmarking
- Sanity checks: Overfit a tiny subset; ensure pipeline correctness.
- Feature drift: Log mean/var of features on device; compare with training.
- Profiling: Toggle GPIO before/after inference; measure with logic analyzer.
- Memory: Print arena usage; binary size breakdown (map file).
- Latency: Aim <50 ms for audio; <100 ms for IMU gestures.
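When profiling with a GPIO toggle and logic analyzer as suggested above, it helps to aggregate the per-inference timings on-device without storing every sample. A possible sketch (class name and units are our own):

```cpp
#include <cstdint>

// Running min/avg/max of inference latency in microseconds;
// feed one measurement per inference, print occasionally over UART.
class LatencyStats {
public:
    void add(uint32_t us) {
        if (n_ == 0 || us < min_) min_ = us;
        if (n_ == 0 || us > max_) max_ = us;
        sum_ += us;
        ++n_;
    }
    uint32_t min_us() const { return min_; }
    uint32_t max_us() const { return max_; }
    uint32_t avg_us() const { return n_ ? uint32_t(sum_ / n_) : 0; }
private:
    uint64_t sum_ = 0;
    uint32_t n_ = 0, min_ = 0, max_ = 0;
};
```

Watching max as well as average matters: a 22 ms average keyword-spotting latency can hide occasional slow frames that break real-time deadlines.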
Benchmark Table (Example)
| Task | Board | Model | Flash | RAM | Latency | Accuracy |
|---|---|---|---|---|---|---|
| Keyword | Nano 33 BLE | DS-CNN int8 | 120 KB | 60 KB | 22 ms | 94% |
| Vibration | ESP32-S3 | 1D CNN int8 | 150 KB | 80 KB | 18 ms | 95% AUC |
| Gesture | RP2040 | TCN int8 | 90 KB | 48 KB | 28 ms | 92% |
12. FAQs
Can TinyML train on-device?
Mostly no—training is compute-heavy. Some boards can fine-tune tiny models or thresholds; for full training, use desktop/cloud and deploy weights.
How do I update models in the field?
Use OTA with signed artifacts. Keep model and features backward-compatible; version your DSP blocks.
What about non-audio, non-IMU tasks?
Environmental anomaly detection, smart agriculture (soil moisture patterns), and simple presence detection are viable.
How do I avoid false triggers?
Use multi-stage detection: cheap VAD/motion gate → classifier → temporal smoothing and confidence thresholds.
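The cheap first stage of such a cascade can be a short-term energy gate that runs on every frame, so the classifier only wakes up on loud windows. A minimal sketch (the threshold is in raw sample units and would be tuned per microphone):

```cpp
#include <cstdint>

// Energy-based VAD gate: mean absolute amplitude of a frame vs threshold.
// Costs one add per sample, so it can run continuously at low power.
bool energy_gate(const int16_t* frame, int n, int32_t threshold) {
    int64_t acc = 0;
    for (int i = 0; i < n; ++i) {
        int32_t v = frame[i];
        acc += v < 0 ? -v : v;
    }
    return acc / n > threshold;
}
```

Only frames that pass this gate are handed to the int8 classifier, after which the temporal smoothing and confidence thresholds described above make the final decision.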