Jon Nordby jon@soundsensing.no
February 27, 2020
Internet of Things specialist
Environmental Sound Classification on Microcontrollers using Convolutional Neural Networks
The goals of this talk
That you, as developers, understand:
possibilities and applications of Audio ML
overall workflow of creating an Audio Classification solution
what Soundsensing provides in this area
Audio sub-fields
Examples
Expected: 10x increases in power efficiency.
Noise Classification in Urban environments. AKA “Environmental Sound Classification”
Given an audio signal of environmental sounds,
determine which class it belongs to
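The clip-level task above can be sketched minimally: split the clip into fixed-length windows, classify each window, and vote over the clip. A NumPy illustration, where `classify_window` is a toy stand-in for a trained model (an assumption for illustration, not part of the talk):

```python
import numpy as np

def frame_clip(signal, window_len, hop_len):
    """Split a 1-D audio signal into fixed-length analysis windows.
    Assumes len(signal) >= window_len."""
    n_windows = (len(signal) - window_len) // hop_len + 1
    return np.stack([signal[i * hop_len : i * hop_len + window_len]
                     for i in range(n_windows)])

def classify_clip(signal, classify_window, window_len=16000, hop_len=8000):
    """Classify each window, then majority-vote over the whole clip."""
    windows = frame_clip(signal, window_len, hop_len)
    votes = [classify_window(w) for w in windows]
    return max(set(votes), key=votes.count)

# Toy stand-in classifier: loud windows -> "siren", quiet -> "silence"
stub = lambda w: "siren" if np.abs(w).mean() > 0.1 else "silence"
clip = np.concatenate([np.zeros(16000), 0.5 * np.ones(32000)])
print(classify_clip(clip, stub))  # loud windows outvote the quiet one
```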
Classification simplifications
Key challenges
Best practice: Design the process, document in a protocol
Depends on problem difficulty
Targets
UrbanSound8K: 10 classes, 11 hours of annotated audio
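UrbanSound8K ships with predefined folds, and evaluation is normally done fold-wise rather than with a random split (random splits can leak slices of the same recording into both train and test). A pure-Python sketch, with toy metadata rows standing in for the dataset's real metadata file:

```python
# Toy metadata: (filename, fold, class) tuples standing in for the
# UrbanSound8K metadata file; the real dataset has 10 predefined folds.
metadata = [
    ("a.wav", 1, "siren"), ("b.wav", 1, "dog_bark"),
    ("c.wav", 2, "siren"), ("d.wav", 2, "drilling"),
    ("e.wav", 3, "children_playing"),
]

def fold_split(rows, test_fold):
    """Hold out one predefined fold for testing, train on the rest."""
    train = [r for r in rows if r[1] != test_fold]
    test = [r for r in rows if r[1] == test_fold]
    return train, test

train, test = fold_split(metadata, test_fold=2)
print(len(train), len(test))  # 3 2
```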
Challenge: keeping quality high and costs low
How to label
Try the standard audio pipeline; it often does OK.
All available as open source solutions.
Doable, but takes time!
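The standard audio pipeline is typically: frame the signal, window it, take the FFT, apply a mel filterbank, and take the log. A compact NumPy sketch of that pipeline (open-source libraries such as librosa provide tuned implementations; the parameters here are illustrative):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=32):
    """Standard pipeline: frame -> window -> FFT -> mel filterbank -> log."""
    # Frame and apply a Hann window
    n_frames = (len(signal) - n_fft) // hop + 1
    frames = np.stack([signal[i*hop : i*hop+n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    # Power spectrum
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m-1], bins[m], bins[m+1]
        fbank[m-1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m-1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(power @ fbank.T + 1e-10)  # shape: (frames, mel bands)

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
S = log_mel_spectrogram(sig)
print(S.shape)  # (61, 32): 61 frames x 32 mel bands
```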
So we have a model that performs Audio Classification on our PC.
But we want to monitor a real-world phenomenon.
How to deploy this?
We are building a partner network
Email: jon@soundsensing.no
Does this sound cool to work on?
Email: jon@soundsensing.no
Want to invest in a Machine Learning and Internet of Things startup?
Email: ole@soundsensing.no
Machine Hearing. ML on Audio
Machine Learning for Embedded / IoT
Thesis Report & Code
Soundsensing
Email: jon@soundsensing.no
Foreground-only
Harms health through stress and loss of sleep
In Norway
In Europe
Simulation only, no direct measurements
Want: 1 year lifetime on a palm-sized battery
Need: <1 mW system power
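The <1 mW target follows from simple arithmetic. Assuming roughly a 2000 mAh cell at 3.7 V nominal (illustrative numbers for a palm-sized battery):

```python
# Battery energy budget for one year of operation (illustrative numbers)
capacity_mah = 2000                   # assumed palm-sized Li-ion cell
voltage = 3.7                         # nominal cell voltage
energy_mwh = capacity_mah * voltage   # ~7400 mWh of stored energy
hours_per_year = 365 * 24             # 8760 h
budget_mw = energy_mwh / hours_per_year
print(f"{budget_mw:.2f} mW average power budget")  # ~0.84 mW -> need <1 mW
```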
STM32L4 @ 80 MHz. Approx 10 mW.
Human presence detection. VGG8 on 64x64 RGB image, 5 FPS: 7 mW.
Audio ML approx 1 mW
2.9 TOPS/W. AlexNet, 1000 classes, 10 FPS: 41 mW.
Audio models probably <1 mW.
With 50% of STM32L476 capacity:
eGRU: runs on an ARM Cortex-M0 microcontroller; 61% accuracy, with non-standard evaluation
Can this be faster than the standard FFT? And still perform well?
Models in the literature use 95% overlap or more: a 20x penalty in inference time!
Often little performance benefit. Use 0% (1x) or 50% (2x) instead.
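The penalty is easy to quantify: the number of analysis windows per clip scales roughly as 1/(1 - overlap), approaching the full 20x at 95% overlap for long clips. A small sketch with illustrative window and clip lengths:

```python
def windows_per_clip(clip_len, window_len, overlap):
    """Number of analysis windows given fractional overlap between them."""
    hop = window_len * (1.0 - overlap)
    return round((clip_len - window_len) / hop) + 1

clip, window = 10.0, 1.0  # seconds (illustrative)
for overlap in (0.0, 0.5, 0.95):
    n = windows_per_clip(clip, window, overlap)
    print(f"{overlap:.0%} overlap -> {n} windows")  # 10, 19, 181 windows
```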
MobileNet, “Hello Edge”, AclNet. 3x3 kernel, 64 filters: 7.5x speedup
EffNet, LD-CNN. 5x5 kernel: 2.5x speedup
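These speedups come from replacing a standard convolution with a depthwise-separable one; the cost ratio is roughly 1/C_out + 1/K². A sketch of the multiply-accumulate (MAC) counts, with an illustrative feature-map size:

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulates for a standard 2D convolution layer."""
    return h * w * k * k * c_in * c_out

def separable_macs(h, w, k, c_in, c_out):
    """Depthwise (k x k per channel) + pointwise (1x1) convolution."""
    depthwise = h * w * k * k * c_in
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

h = w = 32                  # illustrative feature-map size
k, c_in, c_out = 3, 64, 64  # 3x3 kernel, 64 filters
speedup = conv_macs(h, w, k, c_in, c_out) / separable_macs(h, w, k, c_in, c_out)
print(f"{speedup:.1f}x fewer MACs")  # ~7.9x for 3x3 kernels, 64 filters
```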
Wasteful? Computing convolutions, then throwing away 3/4 of the results!
Striding means fewer computations and “learned” downsampling
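A stride-2 convolution evaluates only 1/4 of the spatial positions, giving the same downsampled output size as convolving then 2x2-pooling, at a quarter of the cost. Quick MAC count with illustrative layer dimensions:

```python
h = w = 64                  # illustrative input feature-map size
k, c_in, c_out = 3, 32, 32  # illustrative conv layer
# MACs as a function of stride: output positions shrink quadratically
macs = lambda stride: (h // stride) * (w // stride) * k * k * c_in * c_out
print(macs(1) // macs(2))   # stride 2 -> 4x fewer multiply-accumulates
```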
Inference can often use 8-bit integers instead of 32-bit floats
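A minimal sketch of symmetric linear quantization to 8-bit integers, one common scheme for integer inference (the per-tensor scale choice here is illustrative):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization: map floats onto int8 with one scale."""
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.default_rng(0).normal(0, 0.1, 100).astype(np.float32)
q, scale = quantize_int8(weights)
error = np.abs(dequantize(q, scale) - weights).max()
print(q.dtype, f"max error {error:.4f}")  # int8; error bounded by scale/2
```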