Jon Nordby jon@soundsensing.no
PyCode 2019, Gdansk
Internet of Things specialist
Today
a Python programmer
without expertise in sound processing and with limited machine learning experience
can solve basic Audio Classification problems
Slides and more:
https://github.com/jonnor/machinehearing
Not included
Given an audio signal
of environmental sound
determine which class it belongs to
Classification simplifications
State-of-the-art accuracy: 79% - 82%
import math
import numpy
import librosa

def load_audio_windows(path, samplerate, n_fft, hop_length, win_length,
                       n_mels, fmin, fmax, window_size, window_hop):
    y, sr = librosa.load(path, sr=samplerate)
    S = librosa.core.stft(y, n_fft=n_fft,
                          hop_length=hop_length, win_length=win_length)
    mels = librosa.feature.melspectrogram(y=y, sr=sr, S=S,
                                          n_mels=n_mels, fmin=fmin, fmax=fmax)
    # Truncate at end to only have windows with full data. Alternative: zero-pad
    start_frame = window_size
    end_frame = window_hop * math.floor(float(mels.shape[1]) / window_hop)
    windows = []
    for frame_idx in range(start_frame, end_frame, window_hop):
        # Log-scale and normalize each window independently
        window = numpy.log(mels[:, frame_idx-window_size:frame_idx] + 1e-9)
        window -= numpy.mean(window)
        window /= numpy.std(window)
        assert window.shape == (n_mels, window_size)
        windows.append(window)
    return windows
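The windowing arithmetic in the loop above can be sanity-checked on a synthetic spectrogram, with no audio file or librosa needed. The parameter values here (32 bands, 72-frame windows, hop 36) are illustrative, not prescribed by the slides:

```python
import math
import numpy

# Illustrative parameters, not from the original pipeline
n_mels, window_size, window_hop = 32, 72, 36

# Fake mel-spectrogram: 32 bands x 300 frames of random positive energy
mels = numpy.random.rand(n_mels, 300) + 1e-3

windows = []
for frame_idx in range(window_size,
                       window_hop * math.floor(mels.shape[1] / window_hop),
                       window_hop):
    # Same per-window log-scaling and normalization as the loader
    window = numpy.log(mels[:, frame_idx-window_size:frame_idx] + 1e-9)
    window -= numpy.mean(window)
    window /= numpy.std(window)
    windows.append(window)

print(len(windows), windows[0].shape)  # 6 windows of shape (32, 72)
```

Each window is normalized independently to zero mean and unit variance, so quiet and loud clips produce features on the same scale.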
Img: Data Science Central, Albelwi2017
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Activation, Flatten, Dropout, Dense
from keras.regularizers import l2

def build_model(bands, frames, channels, filters, kernel, strides, pool,
                kernels_growth, fully_connected, dropout, num_labels):
    block1 = [
        Convolution2D(filters, kernel, padding='same', strides=strides,
                      input_shape=(bands, frames, channels)),
        MaxPooling2D(pool_size=pool),
        Activation('relu'),
    ]
    block2 = [
        Convolution2D(filters*kernels_growth, kernel, padding='same', strides=strides),
        MaxPooling2D(pool_size=pool),
        Activation('relu'),
    ]
    block3 = [
        Convolution2D(filters*kernels_growth, kernel, padding='valid', strides=strides),
        Activation('relu'),
    ]
    backend = [
        Flatten(),
        Dropout(dropout),
        Dense(fully_connected, kernel_regularizer=l2(0.001)),
        Activation('relu'),
        Dropout(dropout),
        Dense(num_labels, kernel_regularizer=l2(0.001)),
        Activation('softmax'),
    ]
    model = Sequential(block1 + block2 + block3 + backend)
    return model
from keras import Model
from keras.layers import Input, TimeDistributed, GlobalAveragePooling1D

def build_multi_instance(base, windows=6, bands=32, frames=72, channels=1):
    input = Input(shape=(windows, bands, frames, channels))
    x = TimeDistributed(base)(input)
    x = GlobalAveragePooling1D()(x)
    model = Model(input, x)
    return model
GlobalAveragePooling -> “Probabilistic voting”
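The "probabilistic voting" that GlobalAveragePooling performs can be illustrated in plain numpy: average the per-window class probabilities, then take the argmax. The numbers below are made up for illustration:

```python
import numpy

# Softmax outputs for 3 analysis windows over 4 classes (made-up values)
window_probs = numpy.array([
    [0.7, 0.1, 0.1, 0.1],   # window 1: confident in class 0
    [0.3, 0.4, 0.2, 0.1],   # window 2: weakly prefers class 1
    [0.6, 0.2, 0.1, 0.1],   # window 3: prefers class 0
])

# "Probabilistic voting": mean probability per class across windows
clip_probs = window_probs.mean(axis=0)
predicted = int(numpy.argmax(clip_probs))
print(clip_probs, predicted)  # class 0 wins despite one dissenting window
```

Averaging probabilities (rather than hard votes) lets confident windows outweigh uncertain ones, and the result is still a valid probability distribution.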
Mixup: mixing two samples, adjusting class labels
SpecAugment: masking spectrogram sections to augment
512-d vector
Try the standard audio pipeline shown!
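A minimal sketch of Mixup on spectrogram windows and one-hot labels. The mixing coefficient comes from a Beta distribution as in the Mixup paper; alpha=0.2 and the input shapes here are assumptions for illustration:

```python
import numpy

def mixup(x1, y1, x2, y2, alpha=0.2, rng=numpy.random.default_rng(0)):
    # Draw mixing coefficient lambda from Beta(alpha, alpha)
    lam = float(rng.beta(alpha, alpha))
    x = lam * x1 + (1.0 - lam) * x2   # blended inputs
    y = lam * y1 + (1.0 - lam) * y2   # soft (blended) labels
    return x, y

# Two fake spectrogram windows with one-hot labels
x1, y1 = numpy.ones((32, 72)), numpy.array([1.0, 0.0])
x2, y2 = numpy.zeros((32, 72)), numpy.array([0.0, 1.0])
x, y = mixup(x1, y1, x2, y2)
print(y.sum())  # soft label still sums to 1.0
```

The key point is that the labels are blended with the same coefficient as the inputs, so the model is trained on soft targets instead of hard classes.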
Start simple!
Use Data Augmentation!
Slides and more: https://github.com/jonnor/machinehearing
Hands-on: TensorFlow tutorial, Simple Audio Recognition
Book: Computational Analysis of Sound Scenes and Events (Virtanen/Plumbley/Ellis, 2018)
Environmental Sound Classification on Microcontrollers using Convolutional Neural Networks
Interested in Audio Classification or Machine Hearing generally? Get in touch!
Twitter: @jononor
Email: jon@soundsensing.no
Return: the time at which something occurred.
Aka: Onset detection
Return: sections of audio containing the desired class
Return: all classes/events that occurred in the audio.
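As a toy illustration of onset detection (not a method from the slides), one can find where short-term energy first exceeds a threshold. The signal, frame size, and threshold below are all made up:

```python
import numpy

sr = 16000
t = numpy.arange(sr) / sr             # 1 second of samples
signal = numpy.zeros(sr)
signal[sr//2:] = numpy.sin(2 * numpy.pi * 440 * t[sr//2:])  # tone starts at 0.5 s

frame = 512
# Mean energy per non-overlapping frame
energies = [float(numpy.mean(signal[i:i+frame]**2))
            for i in range(0, sr - frame, frame)]
# First frame whose energy crosses the (arbitrary) threshold
onset_frame = next(i for i, e in enumerate(energies) if e > 0.01)
onset_time = onset_frame * frame / sr
print(onset_time)  # close to the true onset at 0.5 s
```

Real onset detectors use spectral flux or learned features rather than raw energy, but the input/output contract (audio in, event time out) is the same.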
Approaches
Problem
When the audio volume is low, normalization will amplify the noise floor. This can easily cause spurious classifications.
Solution
Compute the RMS energy of the input. If the RMS is low, disregard the classifier output and mark the result as Silence instead.
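A possible implementation of that silence gate in numpy. The -60 dBFS threshold is an assumption and should be tuned for the input level of your deployment:

```python
import numpy

def is_silence(samples, threshold_db=-60.0):
    # RMS energy of the window, in dB relative to full scale (1.0)
    rms = numpy.sqrt(numpy.mean(samples**2))
    db = 20.0 * numpy.log10(rms + 1e-12)
    return bool(db < threshold_db)

quiet = numpy.random.uniform(-1e-4, 1e-4, 16000)         # near-silent noise
loud = 0.5 * numpy.sin(numpy.linspace(0, 100, 16000))    # clearly audible tone
print(is_silence(quiet), is_silence(loud))  # True False
```

Run this check on the raw samples before normalization, so the gate sees the true input level rather than the blown-up one.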
Real-time classification
Using the raw audio input as features with Deep Neural Networks.
The network must also learn the time-frequency decomposition normally performed by the spectrogram.
Actively researched using advanced models and large datasets.
On general audio, with a strong classifier, still performs worse than log mel-spectrograms.
TODO: link EnvNet
Convolutional Recurrent Neural Networks
Audio sub-fields
Examples