Environmental Sound Classification on microcontrollers

Jon Nordby
jon@soundsensing.no
tinyML Summit 2021

Introduction

Environmental Noise Pollution

The environmental pollution that affects most people in Europe

13 million suffering from sleep disturbance
900’000 disability-adjusted life years (DALY) lost

Occupational Noise-induced Hearing Loss

The most prevalent occupational disease in the world

40 million affected by hearing loss from work
4 million disability-adjusted life years (DALY) lost

Noise Monitoring with Machine Learning

Wireless Audio Sensor Networks

Alternative A would be to record audio in the sensor and transmit it to the cloud. This is conceptually a very simple solution, and one could use a standard neural network in the cloud to do audio classification without many computational constraints on the model.

However, this would require a lot of data transfer, which is costly in terms of energy and data traffic in a cellular 4G system.

It would also be very poor for privacy, as potentially sensitive audio such as speech would have to be transported through the network and could be stored on a server.

Alternative B would be to preprocess the data in the sensor and classify it in the cloud. The data would have to be reduced enough to be privacy-friendly and to save considerable data traffic, but not so much that classification performance suffers, which can be a difficult trade-off.

But the best solution for both privacy and data traffic would be the TinyML solution: do all the processing on the sensor, and only transmit data about the classes to the server.

However, this means the entire model needs to fit the constraints of the sensor device.

Model Constraints

Example target: STM32L476 microcontroller. With 50% of capacity:

64 kB RAM
512 kB FLASH memory
4.5 M operations/second
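A rough feasibility check against this budget can be sketched as follows. The three budget numbers come from the slide; the `ModelCost` fields and the two example models are hypothetical and only illustrate the comparison.

```python
from dataclasses import dataclass

# Budget from the slide: 50% of the STM32L476 capacity.
RAM_BUDGET_BYTES = 64 * 1024
FLASH_BUDGET_BYTES = 512 * 1024
OPS_PER_SECOND_BUDGET = 4.5e6


@dataclass
class ModelCost:
    """Hypothetical summary of what a model needs on the device."""
    activations_bytes: int   # peak RAM used by activations
    weights_bytes: int       # FLASH used by parameters
    ops_per_second: float    # operations needed per second of audio


def fits_on_device(cost: ModelCost) -> bool:
    """True only if all three constraints are satisfied at once."""
    return (cost.activations_bytes <= RAM_BUDGET_BYTES
            and cost.weights_bytes <= FLASH_BUDGET_BYTES
            and cost.ops_per_second <= OPS_PER_SECOND_BUDGET)


# Made-up example models, for illustration only.
small = ModelCost(activations_bytes=40_000, weights_bytes=200_000, ops_per_second=3.0e6)
large = ModelCost(activations_bytes=40_000, weights_bytes=200_000, ops_per_second=50e6)
print(fits_on_device(small), fits_on_device(large))
```

A model is only feasible when RAM, FLASH and compute all fit together; here the second example fails on compute alone.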

Small models on Urbansound8K

Green: Feasible region on device. 2021 results not published.

Shrinking Convolutional Neural Networks for TinyML Audio

How did we make the model fit on device?

Pipeline

Typical audio pipeline. Spectrogram conversion, CNN on overlapped windows.

Reduce input dimensionality

Lower sample rate
Lower frequency range
Lower frequency resolution
Lower time duration in window
Lower time resolution

~10x reduction in compute. And easier to learn!
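How these choices shrink the input can be sketched by counting the values in one mel-spectrogram window. Both parameter sets below are illustrative assumptions, not the settings used in the talk, and `spectrogram_size` is a hypothetical helper.

```python
def spectrogram_size(sample_rate, hop_length, n_mels, window_seconds):
    """Number of values in one mel-spectrogram window (bands x frames)."""
    n_frames = int(window_seconds * sample_rate / hop_length)
    return n_mels * n_frames


# Illustrative assumption: a large input as often seen in desktop-class models.
baseline = spectrogram_size(sample_rate=44100, hop_length=1024,
                            n_mels=128, window_seconds=3.0)

# Illustrative assumption: lower sample rate, fewer bands, shorter window.
reduced = spectrogram_size(sample_rate=22050, hop_length=512,
                           n_mels=60, window_seconds=0.72)

print(baseline, reduced, round(baseline / reduced, 1))
```

With these assumed settings the input shrinks by roughly 9x, and the cost of the convolutional layers scales approximately with the input size.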

Reduce overlap

Models in literature use 95% overlap or more. 20x penalty in inference time!

High overlap often gives only a small performance benefit. Use 0% (1x) or 50% (2x).
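The multipliers quoted above follow from how far the analysis window advances between inferences. A minimal sketch, with `inference_cost_factor` as a hypothetical helper name:

```python
def inference_cost_factor(overlap_percent: int) -> float:
    """How many times more inferences are run compared to no overlap.

    With overlap p, each window advances by (100 - p)% of its length,
    so the number of windows per second grows by 100 / (100 - p).
    """
    if not 0 <= overlap_percent < 100:
        raise ValueError("overlap must be in the range 0..99 percent")
    return 100 / (100 - overlap_percent)


for overlap in (0, 50, 95):
    print(overlap, inference_cost_factor(overlap))
```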

Use a small model!

Depthwise-separable Convolution

MobileNet, “Hello Edge”, AclNet. 3x3 kernel, 64 filters: 7.5x speedup
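Where a speedup of this size comes from can be sketched by counting multiply-accumulates (MACs) per output position. This is an idealized count under the assumption of 64 input and 64 output channels; it gives about 7.9x, in the same range as the 7.5x quoted on the slide, and the exact figure depends on what is counted.

```python
def standard_conv_macs(kernel, c_in, c_out):
    """MACs per output position for a standard 2D convolution."""
    return kernel * kernel * c_in * c_out


def separable_conv_macs(kernel, c_in, c_out):
    """MACs per output position for a depthwise-separable convolution:
    a depthwise kernel x kernel step followed by a pointwise 1x1 step."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Assumption: 64 input channels as well as 64 filters.
standard = standard_conv_macs(3, 64, 64)
separable = separable_conv_macs(3, 64, 64)
print(standard, separable, round(standard / separable, 2))
```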

Downsampling using max-pooling

Wasteful? Computing convolutions, then throwing away 3/4 of results!

Downsampling using strided convolution

“Learned” downsampling. Striding 2x2: Approx 4x speedup
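The approximate 4x figure can be checked by counting MACs for the same layer at stride 1 (followed by 2x2 max-pooling) versus stride 2. The input size and channel counts below are made-up example values, and `conv_macs` is a hypothetical helper.

```python
import math


def conv_macs(height, width, kernel, c_in, c_out, stride=1):
    """MACs for a 'same'-padded 2D convolution over a height x width input."""
    out_h = math.ceil(height / stride)
    out_w = math.ceil(width / stride)
    return out_h * out_w * kernel * kernel * c_in * c_out


# Max-pooling variant: convolve at every position, then keep 1 in 4 results.
with_pooling = conv_macs(64, 32, kernel=3, c_in=16, c_out=16, stride=1)

# Strided variant: only compute the positions that are kept.
with_striding = conv_macs(64, 32, kernel=3, c_in=16, c_out=16, stride=2)

print(with_pooling / with_striding)
```

Both variants produce a 32x16 output, but the strided one never computes the three quarters of the results that pooling would discard.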

Quantization

Using int8 instead of float32.
4x reduction in size of weights (FLASH) and activations (RAM)
4.6x improvement in runtime using CMSIS-NN SIMD

Ref: “CMSIS-NN: Efficient Neural Network Kernels for ARM Cortex-M CPUs”
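The 4x memory saving can be illustrated with a generic symmetric int8 quantization of a few weights. This is a minimal sketch and an assumption for illustration; it is not necessarily the scheme used in the talk or by CMSIS-NN, and `quantize_int8` is a hypothetical helper.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q,
    with q an integer in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = array('b', (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


# 'f' stores 32-bit floats, 'b' stores signed 8-bit integers.
weights = array('f', [0.5, -1.0, 0.25, 0.0])
q, scale = quantize_int8(weights)

float_bytes = len(weights) * weights.itemsize
int8_bytes = len(q) * q.itemsize
print(float_bytes, int8_bytes, float_bytes // int8_bytes)

# Dequantize to see the approximation error introduced.
restored = [scale * v for v in q]
print(restored)
```

The runtime gain reported for CMSIS-NN comes from SIMD instructions operating on several small integers at once, which this sketch does not model.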

Latest developments

Binary network quantization
Neural Architecture Search
Streaming inference
Learned filterbanks
Hardware acceleration
Learned pooling

TinyML is very actively researched, with rapid improvements

Outro

Noise Monitoring example

Automated documentation of noise footprint with respect to regulations

Based on Noise Event Detection & Classification
Tested successfully at shooting range
Expanding now to Construction and Industry noise

Condition Monitoring example

Condition Monitoring of technical equipment using sound.
Developed based on experience from Noise Monitoring.

Conclusions

Audio classification of Environmental Noise can be done directly on sensor
Made possible with a range of efficient CNN techniques
Integrated into Soundsensing IoT sensors
Used for Noise Monitoring & Condition Monitoring

We are open to partners and pilot projects
Get in touch!
contact@soundsensing.no

Questions?

Bonus

Bonus slides after this point

Thesis results

All the info

Thesis: Environmental Sound Classification on Microcontrollers using Convolutional Neural Networks

Report & Code: https://github.com/jonnor/ESC-CNN-microcontroller

All models

Model comparison

List of results

Confusion

Grouped classification

Foreground-only

Unknown class

Thesis Methods

Standard procedure for Urbansound8K

Classification problem
4 second sound clips
10 classes
10-fold cross-validation, predefined
Metric: Accuracy

Training settings

Training

NVIDIA RTX 2060 GPU, 6 GB
10 models x 10 folds = 100 training jobs
100 epochs
3 jobs in parallel
36 hours total

Evaluation

For each fold of each model

Select best model based on validation accuracy
Calculate accuracy on test set

For each model

Measure CPU time on device
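The per-fold selection step can be sketched as below. The data layout (a list of validation/test accuracy pairs per checkpoint) and all numbers are hypothetical, and averaging the test accuracy over folds is an assumption based on the usual cross-validation procedure.

```python
from statistics import mean


def evaluate_fold(checkpoints):
    """Pick the checkpoint with the best validation accuracy and
    return its test accuracy.

    checkpoints: list of (validation_accuracy, test_accuracy) tuples,
    one per saved epoch (hypothetical layout).
    """
    best = max(checkpoints, key=lambda c: c[0])
    return best[1]


# Made-up numbers for two folds of one model.
folds = [
    [(0.60, 0.58), (0.70, 0.66), (0.68, 0.69)],
    [(0.65, 0.61), (0.64, 0.70)],
]

per_fold = [evaluate_fold(f) for f in folds]
model_accuracy = mean(per_fold)
print(per_fold, model_accuracy)
```

Note that the checkpoint is chosen on validation accuracy only; in the first fold a later epoch has a higher test accuracy but is not selected, which keeps the test set out of model selection.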

Mel-spectrogram

More resources

Machine Hearing. ML on Audio

github.com/jonnor/machinehearing

Machine Learning for Embedded / IoT

github.com/jonnor/embeddedml

Thesis Report & Code

github.com/jonnor/ESC-CNN-microcontroller