Jon Nordby jon@soundsensing.no
November 19, 2019
Now: Internet of Things specialist

Environmental Sound Classification on Microcontrollers using Convolutional Neural Networks

Want: 1 year lifetime for palm-sized battery
Need: <1mW system power
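The <1 mW budget follows from simple energy arithmetic, assuming a palm-sized cell of roughly 10 Wh (my assumption, e.g. a single 18650 lithium cell):

```python
# Battery-life arithmetic behind the <1 mW target.
# Assumption: a palm-sized lithium cell holds roughly 10 Wh.
battery_wh = 10.0
hours_per_year = 365 * 24  # 8760

# Average power draw that empties the cell in exactly one year
max_avg_power_mw = battery_wh / hours_per_year * 1000
print(f"Max average power for 1 year: {max_avg_power_mw:.2f} mW")

# At 10 mW (plain microcontroller inference) the same cell lasts:
lifetime_days_10mw = battery_wh / 10e-3 / 24
print(f"Lifetime at 10 mW: {lifetime_days_10mw:.0f} days")
```

So a ~10 mW system lasts weeks, not the year we want; hence the push toward ~1 mW.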

STM32L4 @ 80 MHz. Approx 10 mW.
Human presence detection. VGG8 on 64x64 RGB image, 5 FPS: 7 mW.
Audio ML approx 1 mW
2.9 TOPS/W. AlexNet, 1000 classes, 10 FPS: 41 mW.
Audio models probably < 1 mW.
Given an audio signal of environmental sounds, determine which class it belongs to.

With 50% of STM32L476 capacity:
eGRU: runs on an ARM Cortex-M0 microcontroller; 61% accuracy, but with a non-standard evaluation


Models in the literature use 95% analysis-window overlap or more: a 20x penalty in inference time, often for little accuracy benefit. Use 0% overlap (1x) or 50% (2x).
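The penalty is just the reciprocal of the hop between analysis windows; a sketch:

```python
def inference_multiplier(overlap):
    """How many more windows must be classified per unit time,
    relative to non-overlapping (0% overlap) windows."""
    hop = 1.0 - overlap  # hop size as a fraction of the window length
    return 1.0 / hop

print(round(inference_multiplier(0.0)))   # 1x
print(round(inference_multiplier(0.5)))   # 2x
print(round(inference_multiplier(0.95)))  # 20x
```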

Depthwise-separable convolutions (MobileNet, “Hello Edge”, AclNet). 3x3 kernel, 64 filters: 7.5x speedup

Spatially-separable convolutions (EffNet, LD-CNN). 5x5 kernel: 2.5x speedup
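The speedups follow from multiply-accumulate counts. A sketch of the theoretical ratios (my arithmetic, not from the slides; measured speedups such as the 7.5x above also depend on memory access patterns):

```python
# MACs per output position for a convolution layer with
# C input channels, N filters (output channels), k x k kernel.

def standard(k, C, N):
    return k * k * C * N

def depthwise_separable(k, C, N):
    # k x k depthwise conv per channel, then 1x1 pointwise conv
    return k * k * C + C * N

def spatially_separable(k, C, N):
    # k x 1 convolution followed by 1 x k convolution
    return 2 * k * C * N

# MobileNet/AclNet-style layer: 3x3 kernel, 64 filters
print(standard(3, 64, 64) / depthwise_separable(3, 64, 64))  # ~7.9x
# EffNet/LD-CNN-style layer: 5x5 kernel
print(standard(5, 64, 64) / spatially_separable(5, 64, 64))  # 2.5x
```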

Max-pooling is wasteful: computing convolutions, then throwing away 3/4 of the results!

Striding means fewer computations and “learned” downsampling
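The waste is easy to quantify for a 2x2 pooling window; a sketch:

```python
# Output positions convolved on an H x W feature map.
# Conv (stride 1) followed by 2x2 max-pool: compute H*W, keep (H/2)*(W/2).
# Strided conv (stride 2): compute only (H/2)*(W/2) in the first place.

H = W = 64
conv_then_pool = H * W           # 4096 positions convolved, 3/4 discarded
strided = (H // 2) * (W // 2)    # 1024 positions convolved, all kept
print(conv_then_pool // strided)  # 4x fewer convolutions with striding
```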


Inference can often use 8 bit integers instead of 32 bit floats
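A minimal sketch of symmetric linear quantization, assuming nothing about any particular framework's scheme:

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to int8 range."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

# Hypothetical weight values, just for illustration
weights = [0.12, -0.5, 0.33, 0.9, -0.77]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(max_err, 4))
```

Rounding to the nearest step bounds the error by half a quantization step, which is why 8-bit inference usually loses little accuracy.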
~10 mW power.
Can this be faster than the standard FFT? And still perform well?
Machine Hearing. ML on Audio
Machine Learning for Embedded / IoT
Thesis Report & Code
Email: jon@soundsensing.no




Foreground-only


Standard procedure for Urbansound8k

For each model: train and evaluate on each fold, then average across the folds.
The bookkeeping is repetitive, and the bugs can be hard to spot
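This bookkeeping is easy to get subtly wrong (e.g. re-shuffling samples across folds). A minimal sketch of the loop, with dummy stand-ins for hypothetical train_model/evaluate helpers:

```python
def crossvalidate(folds, train_model, evaluate):
    """Train on all-but-one fold, test on the held-out fold,
    and average the held-out scores. Never re-shuffle across folds."""
    scores = []
    for i in range(len(folds)):
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        model = train_model(train)
        scores.append(evaluate(model, folds[i]))
    return sum(scores) / len(scores)

# Dummy data: 10 folds of (sample, label) pairs, all labeled 0
folds = [[("a", 0), ("b", 0)] for _ in range(10)]
# Dummy "model": predict the most common training label
train_model = lambda data: max(set(y for _, y in data),
                               key=[y for _, y in data].count)
evaluate = lambda label, fold: sum(y == label for _, y in fold) / len(fold)

mean_accuracy = crossvalidate(folds, train_model, evaluate)
print(mean_accuracy)  # 1.0 on this trivial data
```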


Noise reduces health due to stress and loss of sleep, in Norway and across Europe
Simulation only, no direct measurements
