[Example] Simple Audio Detection

In this example, we show how to use ESP to create a simple audio classifier that recognizes sounds with distinct frequency spectra. If you haven't read the tutorial on gesture recognition using accelerometers, you should at least skim it to learn the general workflow.

The source code of this example can be found in the repository [link].

Video

Here is a short video demonstrating piano key press detection.

What you will need

A MacBook with ESP set up (see the installation guide in the README). We will use the MacBook's built-in microphone for this example. Optionally, you can plug in an external microphone. For example, an electret microphone can be hooked up to an Arduino, with this firmware code used to collect the audio data. In that case, update the input stream (line 4) to use SerialStream.

Background knowledge

FFT

The Fast Fourier Transform (FFT) is an efficient algorithm for computing the discrete Fourier transform, which converts a signal from its original domain (such as time) to a representation in the frequency domain.
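To make the time-to-frequency conversion concrete, here is a minimal sketch that computes a magnitude spectrum with a naive discrete Fourier transform. It is not the FFT implementation used by ESP; the function names (magnitudeSpectrum, makeSine, dominantBin) are hypothetical and chosen only for illustration. An FFT would give the same magnitudes, just faster.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) discrete Fourier transform. Returns the magnitude of each of
// the N frequency bins; an FFT yields the same values with less work.
std::vector<double> magnitudeSpectrum(const std::vector<double>& signal) {
    assert(!signal.empty());
    const std::size_t n = signal.size();
    const double pi = std::acos(-1.0);
    std::vector<double> magnitudes(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                -2.0 * pi * static_cast<double>(k * t) / static_cast<double>(n);
            sum += signal[t] * std::polar(1.0, angle);
        }
        magnitudes[k] = std::abs(sum);
    }
    return magnitudes;
}

// Hypothetical test signal: a unit sine completing `cycles` periods in n samples.
std::vector<double> makeSine(std::size_t n, int cycles) {
    const double pi = std::acos(-1.0);
    std::vector<double> samples(n);
    for (std::size_t t = 0; t < n; ++t) {
        samples[t] = std::sin(2.0 * pi * cycles * static_cast<double>(t) /
                              static_cast<double>(n));
    }
    return samples;
}

// Index of the strongest bin in the lower half of the spectrum (for a real
// signal the upper half mirrors the lower half).
std::size_t dominantBin(const std::vector<double>& magnitudes) {
    std::size_t best = 0;
    for (std::size_t k = 1; k <= magnitudes.size() / 2; ++k) {
        if (magnitudes[k] > magnitudes[best]) best = k;
    }
    return best;
}
```

For a 64-sample sine that completes 5 cycles, dominantBin(magnitudeSpectrum(makeSine(64, 5))) picks out bin 5. A sound with a distinct pitch shows up as a peak like this, which is what makes the spectrum a useful feature vector.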

SVM

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. Its discriminative nature makes it a good fit for audio classification. For more about SVMs, see the OpenCV SVM tutorial.
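As a rough illustration of what "defined by a separating hyperplane" means, the sketch below classifies a feature vector by which side of a hyperplane it falls on. The Hyperplane struct, the function names, and the weights used in the usage note are all hypothetical; a real SVM learns the weights and bias from training data, and this is not how ESP exposes its classifier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A hyperplane w . x + b = 0 in feature space (illustrative type).
struct Hyperplane {
    std::vector<double> weights;
    double bias;
};

// Signed score: positive on one side of the hyperplane, negative on the other.
double decisionValue(const Hyperplane& plane, const std::vector<double>& features) {
    assert(plane.weights.size() == features.size());
    double value = plane.bias;
    for (std::size_t i = 0; i < features.size(); ++i) {
        value += plane.weights[i] * features[i];
    }
    return value;
}

// Two-class decision: +1 on or above the hyperplane, -1 below it.
int classify(const Hyperplane& plane, const std::vector<double>& features) {
    return decisionValue(plane, features) >= 0.0 ? 1 : -1;
}
```

With the made-up hyperplane {{1.0, -1.0}, 0.0}, the point {3.0, 1.0} lands on the positive side and {1.0, 3.0} on the negative side. In this example the feature vector would be the FFT output rather than a two-dimensional point.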

Example code walkthrough

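void setup() {
    // Label the single input dimension.
    stream.setLabelsForAllDimensions({"audio"});

    // Two calibration steps: background noise (bias) and loudest input (range).
    calibrator.addCalibrateProcess("Bias", "Remain silent", backgroundCollected)
              .addCalibrateProcess("Range", "Shout as much as possible", shoutCollected);

    // Feature extraction: FFT over a sliding window, rectangular window function.
    pipeline.addFeatureExtractionModule(
        FFT(kFFT_WindowSize, kFFT_HopSize,
            DIM, FFT::RECTANGULAR_WINDOW, true, false));

    // Classifier: linear-kernel SVM.
    pipeline.setClassifier(
        SVM(SVM::LINEAR_KERNEL, SVM::C_SVC, true, true));

    // Smoothing: require 25 matching labels among the last 40 predictions.
    pipeline.addPostProcessingModule(ClassLabelFilter(25, 40));

    // Register the stream, calibrator, and pipeline with ESP.
    useInputStream(stream);
    useCalibrator(calibrator);
    usePipeline(pipeline);
}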

The code above first names the input data as "audio".

Then a calibrator is specified with two calibration processes:

Keep silent to collect data that represents the background noise.
Shout or make noises so that the range of this microphone can be measured.

This is reflected in the generated custom GUI (screenshot: Calibration).
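The bodies of the backgroundCollected and shoutCollected callbacks are not shown in this walkthrough. One plausible way to use the two calibration recordings, sketched here purely as an assumption, is to take the mean of the silent samples as the bias, take the largest deviation from that bias in the loud samples as the range, and then normalize incoming samples with both. The function names and the sample values in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Bias: mean of the samples recorded while the user stays silent.
double estimateBias(const std::vector<double>& silentSamples) {
    assert(!silentSamples.empty());
    double sum = 0.0;
    for (double sample : silentSamples) sum += sample;
    return sum / static_cast<double>(silentSamples.size());
}

// Range: largest absolute deviation from the bias while the user is loud.
double estimateRange(const std::vector<double>& loudSamples, double bias) {
    double range = 0.0;
    for (double sample : loudSamples) {
        const double deviation = std::fabs(sample - bias);
        if (deviation > range) range = deviation;
    }
    return range;
}

// Map a raw sample to roughly [-1, 1] using the calibration values.
double normalizeSample(double raw, double bias, double range) {
    assert(range > 0.0);
    return (raw - bias) / range;
}
```

For instance, silent readings of 510, 512, and 514 give a bias of 512; loud readings of 12, 512, and 1012 then give a range of 500, so a raw reading of 762 normalizes to 0.5.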

The pipeline uses the techniques described in the background section: FFT and SVM. You can see an example of the FFT output in the screenshot below (Pipeline and FFT).

A ClassLabelFilter post-processing module smooths the prediction results to guard against false detections. If 25 of the past 40 classifications are the same, we consider it a positive detection. The number 40 comes from a back-of-the-envelope calculation: we sample the audio at roughly 5 kHz and compute the FFT with a hop size of 128, so in 1 second we get 5000 / 128 ≈ 39 classification results. We rounded this up to 40. Ideally, 25 would be a tunable parameter so that the false positive rate can be controlled; in this example, we simply hard-code it.
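The voting rule and the arithmetic can be sketched as follows. This is an illustrative re-implementation of the idea, not the ClassLabelFilter source: the class name LabelVoteFilter and the helper resultsPerSecond are hypothetical, and label 0 is assumed to mean "no detection".

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>

// Number of classification results per second: one result per FFT hop.
int resultsPerSecond(double sampleRateHz, int hopSize) {
    assert(hopSize > 0);
    return static_cast<int>(sampleRateHz / hopSize);
}

// Sliding-window vote: report a label only when it fills at least
// minimumCount of the last bufferSize predictions.
class LabelVoteFilter {
public:
    LabelVoteFilter(std::size_t minimumCount, std::size_t bufferSize)
        : minimumCount_(minimumCount), bufferSize_(bufferSize) {
        assert(bufferSize > 0 && minimumCount <= bufferSize);
    }

    // Feed one raw prediction, get back the smoothed label (0 = no detection).
    int filter(int label) {
        buffer_.push_back(label);
        if (buffer_.size() > bufferSize_) buffer_.pop_front();

        // Count how often each non-null label occurs in the window.
        std::map<int, std::size_t> counts;
        for (int past : buffer_) {
            if (past != 0) ++counts[past];
        }

        // Pick the most frequent label and check it against the threshold.
        int bestLabel = 0;
        std::size_t bestCount = 0;
        for (const auto& entry : counts) {
            if (entry.second > bestCount) {
                bestLabel = entry.first;
                bestCount = entry.second;
            }
        }
        return bestCount >= minimumCount_ ? bestLabel : 0;
    }

private:
    std::size_t minimumCount_;
    std::size_t bufferSize_;
    std::deque<int> buffer_;
};
```

With LabelVoteFilter(25, 40), the first 24 predictions of label 1 are suppressed and the 25th is reported. Lowering the first argument makes detection more responsive at the cost of more false positives.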

Data Collection

With this custom user code, we launch the application and follow the procedure outlined in another tutorial (todo): collecting calibration samples, collecting training data, and naming each class. In the training tab, pressing f shows the training data alongside the corresponding feature vectors (here, the FFT output). Below is a screenshot from the demo shown in the video above:

Training data with FFT features

Future work and ideas

One possible idea is to use this pipeline to detect customized tangible input devices such as Lamello [1].

[1] Savage, Valkyrie, et al. "Lamello: Passive acoustic sensing for tangible input components." Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2015.
