Jon Nordby @jononor
EuroPython 2019, Basel
Internet of Things specialist
Today
Goal
a machine learning practitioner
without prior knowledge about sound processing
can solve basic Audio Classification problems
Outline
Slides and more: https://github.com/jonnor/machinehearing
Audio sub-fields
Examples
Computed using the Short-Time Fourier Transform (STFT)
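A minimal sketch of computing a log-scaled spectrogram with librosa (the library used in the feature extraction code later); the filename and parameter values are placeholders.

import numpy
import librosa

# Load audio (placeholder filename) and compute the complex STFT
y, sr = librosa.load('example.wav', sr=22050)
S = librosa.stft(y, n_fft=1024, hop_length=256)
# Magnitude on a decibel scale: the image-like representation seen in spectrogram plots
log_S = librosa.amplitude_to_db(numpy.abs(S), ref=numpy.max)
print(log_S.shape)  # (frequency bins, time frames)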
Given an audio signal of environmental sounds,
determine which class it belongs to
State-of-the-art accuracy: 79% - 82%
import math
import numpy
import librosa

def load_audio_windows(path, samplerate, n_fft, hop_length, win_length,
                       n_mels, fmin, fmax, window_size, window_hop):
    y, sr = librosa.load(path, sr=samplerate)
    # Power spectrogram from the Short-Time Fourier Transform
    S = numpy.abs(librosa.core.stft(y, n_fft=n_fft,
            hop_length=hop_length, win_length=win_length))**2
    mels = librosa.feature.melspectrogram(y=y, sr=sr, S=S,
            n_mels=n_mels, fmin=fmin, fmax=fmax)

    # Truncate at end to only have windows with full data. Alternative: zero-pad
    start_frame = window_size
    end_frame = window_hop * math.floor(float(mels.shape[1]) / window_hop)
    windows = []
    for frame_idx in range(start_frame, end_frame, window_hop):
        # Slice out one analysis window of mel frames
        window = mels[:, frame_idx-window_size:frame_idx]
        # Log-scale, then normalize to zero mean and unit variance
        window = numpy.log(window + 1e-9)
        window -= numpy.mean(window)
        window /= numpy.std(window)
        assert window.shape == (n_mels, window_size)
        windows.append(window)
    return windows
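A hedged usage sketch of the function above; the filename and parameter values are placeholders, chosen only to match the 32 mel bands and 72-frame windows used by the models below.

windows = load_audio_windows('example.wav', samplerate=22050,
                             n_fft=1024, hop_length=256, win_length=1024,
                             n_mels=32, fmin=0, fmax=8000,
                             window_size=72, window_hop=36)
# Stack into (n_windows, bands, frames, channels) for the Keras models below
X = numpy.stack(windows)[..., numpy.newaxis]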
1: Spectrograms are image-like
2: CNNs are best-in-class for image classification
=> Will CNNs work well on spectrograms?
Yes!
A bit surprising?
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Activation, Flatten, Dropout, Dense
from keras.regularizers import l2

def build_model(bands, frames, channels, num_labels,
                filters, kernel, strides, pool, kernels_growth,
                fully_connected, dropout):
    block1 = [
        Convolution2D(filters, kernel, padding='same', strides=strides,
                      input_shape=(bands, frames, channels)),
        MaxPooling2D(pool_size=pool),
        Activation('relu'),
    ]
    block2 = [
        Convolution2D(filters*kernels_growth, kernel, padding='same', strides=strides),
        MaxPooling2D(pool_size=pool),
        Activation('relu'),
    ]
    block3 = [
        Convolution2D(filters*kernels_growth, kernel, padding='valid', strides=strides),
        Activation('relu'),
    ]
    backend = [
        Flatten(),
        Dropout(dropout),
        Dense(fully_connected, kernel_regularizer=l2(0.001)),
        Activation('relu'),
        Dropout(dropout),
        Dense(num_labels, kernel_regularizer=l2(0.001)),
        Activation('softmax'),
    ]
    layers = block1 + block2 + block3 + backend
    model = Sequential(layers)
    return model
from keras import Model
from keras.layers import Input, TimeDistributed, GlobalAveragePooling1D

def build_multi_instance(base, windows=6, bands=32, frames=72, channels=1):
    input = Input(shape=(windows, bands, frames, channels))
    x = input
    # Apply the base CNN to each analysis window independently
    x = TimeDistributed(base)(x)
    # Average the per-window class predictions
    x = GlobalAveragePooling1D()(x)
    model = Model(input, x)
    return model
GlobalAveragePooling -> “Probabilistic voting”
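A hedged usage sketch combining the two functions above; the hyperparameter values are placeholders chosen only for illustration, not the settings used in the talk.

base = build_model(bands=32, frames=72, channels=1, num_labels=10,
                   filters=24, kernel=(3, 3), strides=(1, 1), pool=(2, 2),
                   kernels_growth=2, fully_connected=64, dropout=0.5)
model = build_multi_instance(base, windows=6, bands=32, frames=72, channels=1)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()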
Transfer Learning from image data works
=> Can use models pretrained on ImageNet
Caveats:
Look, Listen, Learn (L³). 1 second, 512-dimensional vector
import openl3
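A minimal usage sketch, assuming the openl3 package and an audio file loaded with librosa (filename is a placeholder):

import librosa
import openl3

y, sr = librosa.load('example.wav', sr=None)
# One 512-dimensional embedding per ~1 second analysis window
embeddings, timestamps = openl3.get_audio_embedding(y, sr,
        content_type='env', embedding_size=512)
print(embeddings.shape)  # (n_windows, 512)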
Pipeline
Models
Data Augmentation
Hands-on: TensorFlow tutorial, Simple Audio Recognition
Book: Computational Analysis of Sound Scenes and Events (Virtanen/Plumbley/Ellis, 2018)
Slides and more: https://github.com/jonnor/machinehearing
Interested in Audio Classification or Machine Hearing generally? Get in touch!
Twitter: @jononor
Event detection
Return: time at which something occurred
Aka: onset detection
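An illustrative sketch using librosa's onset detector (one common approach, not necessarily the method referred to here; the filename is a placeholder):

import librosa

y, sr = librosa.load('example.wav')
# Times, in seconds, at which onsets were detected
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units='time')
print(onset_times)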
Segmentation
Return: sections of audio containing the desired class
Tagging
Return: all classes/events that occurred in the audio
Approaches
Real-time classification
Capture audio in small blocks, keep a rolling buffer of recent samples, then extract features and classify each full window (see the sketch below)
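A minimal sketch in Python, assuming the sounddevice library for capture and the feature extraction and model from earlier; sample rate and buffer sizes are placeholders.

import collections
import numpy
import sounddevice

samplerate = 22050   # placeholder settings
block_size = 1024    # samples delivered per audio callback

# Rolling buffer holding roughly the last second of audio
ring_buffer = collections.deque(maxlen=samplerate)

def audio_callback(indata, frames, time, status):
    # sounddevice calls this for every captured block
    ring_buffer.extend(indata[:, 0])
    if len(ring_buffer) == ring_buffer.maxlen:
        audio = numpy.array(ring_buffer)
        # compute a log mel-spectrogram window from `audio` (as above)
        # and run model.predict() on it

with sounddevice.InputStream(samplerate=samplerate, channels=1,
                             blocksize=block_size, callback=audio_callback):
    input('Classifying in real time, press Enter to stop...')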
On general audio, with a strong classifier, performs worse than log mel-spectrograms
Using the raw audio input as features with Deep Neural Networks.
The network must also learn the time-frequency decomposition normally performed by the spectrogram.
Actively researched using advanced models and large datasets.
Example: EnvNet (Tokozume & Harada, 2017)
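A minimal sketch, assuming Keras, of a 1D convolutional network operating directly on raw waveform samples. It is illustrative only (not the EnvNet architecture) and all layer sizes are placeholders.

from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, GlobalAveragePooling1D, Dense

def build_raw_audio_model(samples=16000, num_labels=10):
    # Strided 1D convolutions act as a learned filterbank,
    # replacing the fixed mel-spectrogram front-end
    return Sequential([
        Conv1D(32, 64, strides=8, activation='relu',
               input_shape=(samples, 1)),
        MaxPooling1D(4),
        Conv1D(64, 32, strides=4, activation='relu'),
        MaxPooling1D(4),
        Conv1D(128, 16, strides=2, activation='relu'),
        GlobalAveragePooling1D(),
        Dense(num_labels, activation='softmax'),
    ])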