Audio Classification using Machine Learning

Jon Nordby @jononor

EuroPython 2019, Basel

Introduction

Jon Nordby

Internet of Things specialist

  • B.Eng in Electronics (2010)
  • 9 years as Software developer. Embedded + Web
  • M.Sc in Data Science (2019)

Today

  • Consulting on IoT + Machine Learning
  • CTO @ Soundsensing.no

This talk

Goal

a machine learning practitioner,

without prior knowledge of sound processing,

can solve basic Audio Classification problems

Outline

  • Introduction
  • Audio Classification pipeline
  • Tips & Tricks
  • Pointers to more information

Slides and more: https://github.com/jonnor/machinehearing

Applications

Audio sub-fields

  • Speech Recognition. Keyword spotting.
  • Music Analysis. Genre classification.
  • General / other

Examples

  • Eco-acoustics. Analyze bird migrations
  • Wildlife preservation. Detect poachers in protected areas
  • Manufacturing Quality Control. Testing electric car seat motors
  • Security. Highlighting CCTV feeds with verbal aggression
  • Medical. Detect heart murmurs

Digital sound primer

Audio Mixtures

Sounds mix together

Audio acquisition

Digital sound representation

  • Quantized in time (ex: 44100 Hz)
  • Quantized in amplitude (ex: 16 bit)
  • N channels. Mono/Stereo
  • Uncompressed formats: PCM .WAV
  • Lossless compression: .FLAC
  • Lossy compression: .MP3
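The representation above can be demonstrated with the Python standard library alone. A minimal sketch (file name and tone parameters are just examples):

```python
# Synthesize a 440 Hz tone and store it as uncompressed PCM WAV,
# illustrating time quantization (44100 Hz) and amplitude quantization (16 bit).
import math
import struct
import wave

SAMPLE_RATE = 44100   # samples per second (time quantization)
BIT_DEPTH = 16        # bits per sample (amplitude quantization)
DURATION = 0.5        # seconds

def write_tone(path, freq=440.0):
    n_samples = int(SAMPLE_RATE * DURATION)
    amplitude = 2 ** (BIT_DEPTH - 1) - 1  # max value for signed 16-bit
    with wave.open(path, "wb") as w:
        w.setnchannels(1)                 # mono
        w.setsampwidth(BIT_DEPTH // 8)    # 2 bytes per sample
        w.setframerate(SAMPLE_RATE)
        for i in range(n_samples):
            sample = int(amplitude * math.sin(2 * math.pi * freq * i / SAMPLE_RATE))
            w.writeframes(struct.pack("<h", sample))

write_tone("tone.wav")
with wave.open("tone.wav", "rb") as w:
    info = (w.getnchannels(), w.getsampwidth(), w.getframerate())
```

For real work, libraries like soundfile or librosa handle all the formats listed above.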

Spectrogram

Computed using Short-Time-Fourier-Transform (STFT)

A frog croaking with cicadas in the background
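The STFT can be sketched in plain NumPy: window the signal, take the FFT of each frame, and stack the magnitudes (libraries like librosa offer the same with more options):

```python
# Minimal magnitude spectrogram via the Short-Time Fourier Transform.
import numpy as np

def spectrogram(signal, n_fft=512, hop=256):
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames).T  # shape: (frequency bins, time frames)

# A pure 1000 Hz sine should peak near bin 1000 / (sr / n_fft)
sr = 16000
t = np.arange(sr) / sr
S = spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_bin = int(S.mean(axis=1).argmax())
```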

Practical example

Environmental Sound Classification

Given an audio signal of environmental sounds,

determine which class it belongs to

  • Widely researched. 1000 hits on Google Scholar
  • Open datasets. ESC-50, Urbansound8k (10 classes), AudioSet (632 classes)
  • 2017: Human-level performance (on ESC-50)

Urbansound8k

10 classes, ~8k samples, ~4s long. ~9 hours total

State-of-the-art accuracy: 79% - 82%

Basic Audio Classification pipeline

Pipeline

Analysis windows

Splitting audio stream into windows of fixed length, with overlap. Image: Sajjad2019
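The windowing step can be sketched in a few lines; window length and hop are free parameters, the values below are illustrative:

```python
# Split an audio array into fixed-length analysis windows with overlap.
import numpy as np

def analysis_windows(audio, window_length, hop_length):
    starts = range(0, len(audio) - window_length + 1, hop_length)
    return np.stack([audio[s:s + window_length] for s in starts])

audio = np.arange(10.0)
wins = analysis_windows(audio, window_length=4, hop_length=2)  # 50% overlap
```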

Mel-filters

Mel-scale triangular filters. Applied to linear spectrogram (STFT) => mel-spectrogram
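A NumPy-only sketch of the triangular filterbank, using the common HTK-style mel formula; `librosa.filters.mel` provides a production version:

```python
# Triangular mel-scale filters, to be applied to a linear STFT power
# spectrum: mel_spectrogram = filterbank @ power_spectrogram
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Filter edges equally spaced on the mel scale, mapped to FFT bins
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for b in range(left, center):            # rising slope
            fb[i, b] = (b - left) / max(center - left, 1)
        for b in range(center, right):           # falling slope
            fb[i, b] = (right - b) / max(right - center, 1)
    return fb

fb = mel_filterbank(n_mels=40, n_fft=512, sr=16000)
```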

Normalization

  • log-scale compression
  • Subtract mean
  • Standard scale
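The three steps above can be combined in one small function; the epsilon guard is an implementation detail to avoid log(0):

```python
# Normalization: log-scale compression of a mel-spectrogram,
# then mean subtraction and scaling to unit variance.
import numpy as np

def normalize(mel_spectrogram, eps=1e-10):
    log_mel = np.log(mel_spectrogram + eps)   # log-scale compression
    return (log_mel - log_mel.mean()) / (log_mel.std() + eps)

S = np.random.default_rng(0).uniform(0.1, 10.0, size=(40, 100))
N = normalize(S)
```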

Feature preprocessing

Convolutional Neural Network

1: Spectrograms are image-like

2: CNNs are best-in-class for image-classification

=> Will CNNs work well on spectrograms?

Yes!

A bit surprising?

SB-CNN

Salamon & Bello, 2016

Keras model

Aggregating analysis windows

GlobalAveragePooling -> “Probabilistic voting”
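The voting step itself is just mean pooling over the per-window class probabilities, then an argmax; a sketch with made-up numbers:

```python
# Combine per-window class probabilities into one clip-level prediction
# by mean pooling ("probabilistic voting"), as GlobalAveragePooling does.
import numpy as np

def aggregate_windows(window_probs):
    clip_probs = window_probs.mean(axis=0)    # average over analysis windows
    return int(clip_probs.argmax()), clip_probs

# 3 analysis windows, 4 classes
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.6, 0.2, 0.1, 0.1],
])
label, clip_probs = aggregate_windows(probs)
```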

Demo

Demo video

Environmental Sound Classification on Microcontrollers using Convolutional Neural Networks

Report & Code: https://github.com/jonnor/ESC-CNN-microcontroller

Tips and Tricks

Data Augmentation

  • Adding noise. Random/sampled
  • Mixup: Mixing two samples
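Mixup (Zhang et al.) blends two training samples and their labels with a weight drawn from a Beta distribution. A sketch, where inputs could be spectrograms and labels are one-hot vectors:

```python
# Mixup data augmentation: convex combination of two samples and labels.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2   # mixed input
    y = lam * y1 + (1.0 - lam) * y2   # mixed label
    return x, y, lam

rng = np.random.default_rng(42)
x1, y1 = np.ones(8), np.array([1.0, 0.0])
x2, y2 = np.zeros(8), np.array([0.0, 1.0])
x, y, lam = mixup(x1, y1, x2, y2, rng=rng)
```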

Transfer Learning from images

Transfer Learning from image data works

=> Can use models pretrained on ImageNet

Caveats:

  • If the model expects RGB input, should fill all 3 channels
  • Multi-scale
  • Usually need to fine-tune some or all layers
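For the first caveat: a single-channel spectrogram is commonly replicated across all three channels before feeding an ImageNet-pretrained model (or multi-scale variants are placed in separate channels). A minimal sketch:

```python
# Replicate a 1-channel spectrogram into the 3 channels expected
# by ImageNet-pretrained CNNs.
import numpy as np

def to_rgb(spectrogram):
    return np.stack([spectrogram] * 3, axis=-1)

S = np.random.default_rng(0).random((128, 64))  # (mel bands, time frames)
rgb = to_rgb(S)
```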

Audio Embeddings

  • Model pretrained for sound, feature-extracting only
  • Uses a CNN under the hood

Look, Listen, Learn (L³). 1-second window => 512-dimensional vector

import openl3
emb, ts = openl3.get_audio_embedding(audio, sr, embedding_size=512)


Annotating audio

Outro

Summary

Pipeline

  • Fixed-length analysis windows
  • log-mel spectrograms
  • ML model
  • Aggregate analysis windows

Models

  1. Audio Embeddings (OpenL3) + simple model (scikit-learn)
  2. Convolutional Neural Networks with Transfer Learning (ImageNet etc)
  3. … train simple CNN from scratch …

Data Augmentation

  1. Time-shift
  2. Time-stretch, pitch-shift, noise-add
  3. Mixup, SpecAugment

More learning

Slides and more: https://github.com/jonnor/machinehearing

Hands-on: TensorFlow tutorial, Simple Audio Recognition

Book: Computational Analysis of Sound Scenes and Events (Virtanen/Plumbley/Ellis, 2018)

Questions

Slides and more: https://github.com/jonnor/machinehearing

?

Interested in Audio Classification or Machine Hearing generally? Get in touch!

Twitter: @jononor

BONUS

Audio Event Detection

Return: time something occurred.

  • Ex: “Bird singing started”, “Bird singing stopped”
  • Classification-as-detection. Classifier on short time-frames
  • Monophonic: Returns most prominent event

Aka: Onset detection
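Classification-as-detection can be sketched as thresholding per-frame probabilities for one class and reporting where the event starts and stops:

```python
# Threshold per-frame class probabilities into (start, stop) event frames.
import numpy as np

def detect_events(frame_probs, threshold=0.5):
    active = frame_probs >= threshold
    events = []
    start = None
    for i, on in enumerate(active):
        if on and start is None:
            start = i                    # event onset
        elif not on and start is not None:
            events.append((start, i))    # event offset
            start = None
    if start is not None:
        events.append((start, len(active)))
    return events

probs = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.6, 0.3])
events = detect_events(probs)
```

In practice the threshold, and some smoothing of the probabilities, would be tuned on validation data.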

Segmentation

Return: sections of audio containing desired class

  • Post-processing on Event Detection time-stamps
  • Pre-processing for specialized classifiers

Tagging

Return: All classes/events that occurred in audio.

Approaches

  • separate classifiers per ‘track’
  • joint model: multi-label classifier

Streaming

Real-time classification

TODO: document how to do in Python

Mel-Frequency Cepstral Coefficients (MFCC)

  • MFCC = DCT(mel-spectrogram)
  • Popular in Speech Detection
  • Compresses: 13-20 coefficients
  • Decorrelates: Beneficial with linear models

On general audio, with a strong classifier, MFCC performs worse than the log mel-spectrogram
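The definition MFCC = DCT(mel-spectrogram) can be shown NumPy-only; `librosa.feature.mfcc` is the usual shortcut:

```python
# MFCCs as the orthonormal DCT-II of a log mel-spectrogram,
# keeping the first 13 coefficients.
import numpy as np

def dct2(x):
    # Orthonormal DCT-II along the first axis
    n = x.shape[0]
    k = np.arange(n)[:, None]
    basis = np.cos(np.pi * (2 * np.arange(n)[None, :] + 1) * k / (2 * n))
    basis *= np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)
    return basis @ x

def mfcc(log_mel, n_mfcc=13):
    return dct2(log_mel)[:n_mfcc]

log_mel = np.random.default_rng(0).random((40, 100))  # (mel bands, frames)
coeffs = mfcc(log_mel)
```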

End2End learning

Using the raw audio waveform as input features to a Deep Neural Network.

The model must then also learn the time-frequency decomposition normally performed by the spectrogram.

Actively researched using advanced models and large datasets.

TODO: link EnvNet