Spotify creates playlist recommendations every day. The company has 70M users, 4M unique songs, and 40 features per song. A brute-force nearest-neighbor search would therefore require on the order of 70M x 4M x 40 ≈ 1.1 x 10^16 operations, i.e. more than 10 peta-operations. This is really challenging!
To cope with this, Spotify engineers implemented an approximate nearest neighbor approach and published it as an open-source package: Annoy. Annoy is very fast, but it does not guarantee to find the exact nearest neighbor; it only approximates it. In return, it reduces the query time complexity to roughly O(log n).
Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.
There are some other libraries to do nearest neighbor search. Annoy is almost as fast as the fastest libraries, but there is one feature that really sets Annoy apart: it has the ability to use static files as indexes. In particular, this means you can share an index across processes. Annoy also decouples creating indexes from loading them, so you can pass around indexes as files and map them into memory quickly. Another nice thing about Annoy is that it tries to minimize its memory footprint, so the indexes are quite small.
Why is this useful? If you want to find nearest neighbors and you have many CPUs, you only need to build the index once. You can also pass around and distribute static files to use in production environments, in Hadoop jobs, etc. Any process will be able to load (mmap) the index into memory and will be able to do lookups immediately.
We use it at Spotify for music recommendations. After running matrix factorization algorithms, every user/item can be represented as a vector in f-dimensional space. This library helps us search for similar users/items. We have many millions of tracks in a high-dimensional space, so memory usage is a prime concern.
Annoy was built by Erik Bernhardsson in a couple of afternoons during Hack Week.
- Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance
- Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))
- Works better if you don't have too many dimensions (like <100) but seems to perform surprisingly well even up to 1,000 dimensions
- Small memory usage
- Lets you share memory between multiple processes
- Index creation is separate from lookup (in particular you can not add more items once the tree has been created)
- Build index on disk to enable indexing big datasets that won't fit into memory (contributed by Rene Hollander)
- AnnoyIndex(f, metric) returns a new index that's read-write and stores vectors of f dimensions. Metric can be "angular", "euclidean", "manhattan", "hamming", or "dot".
- a.add_item(i, v) adds item i (any nonnegative integer) with vector v. Note that it will allocate memory for max(i)+1 items.
- a.build(n_trees, n_jobs=-1) builds a forest of n_trees trees. More trees give higher precision when querying. After calling build, no more items can be added. n_jobs specifies the number of threads used to build the trees; n_jobs=-1 uses all available CPU cores.
- a.save(fn, prefault=False) saves the index to disk and loads it (see next function). After saving, no more items can be added.
- a.load(fn, prefault=False) loads (mmaps) an index from disk. If prefault is set to True, it will pre-read the entire file into memory (using mmap with MAP_POPULATE). Default is False.
- a.get_nns_by_item(i, n, search_k=-1, include_distances=False) returns the n closest items. During the query it will inspect up to search_k nodes, which defaults to n_trees * n if not provided. search_k gives you a run-time tradeoff between accuracy and speed. If you set include_distances to True, it will return a 2-element tuple with two lists in it: the second one containing all corresponding distances.
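As a quick illustration, here is a minimal usage sketch of the API above (the vectors are random placeholders rather than real track embeddings):

```python
import random
from annoy import AnnoyIndex

f = 128  # dimensionality of the embedding vectors

# Build a small index with random placeholder vectors
index = AnnoyIndex(f, "angular")
for i in range(1000):
    v = [random.gauss(0, 1) for _ in range(f)]
    index.add_item(i, v)

index.build(10)           # 10 trees; more trees -> higher precision, slower build
index.save("tracks.ann")  # static file that other processes can mmap

# Any other process can load the same index instantly
index2 = AnnoyIndex(f, "angular")
index2.load("tracks.ann")
print(index2.get_nns_by_item(0, 10))  # 10 approximate nearest neighbors of item 0
```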
Annoy works by using random projections and building up a tree. At every intermediate node in the tree, a random hyperplane is chosen, which divides the space into two subspaces. This hyperplane is chosen by sampling two points from the subset and taking the hyperplane equidistant from them.
We do this k times so that we get a forest of trees. k has to be tuned to your needs, by looking at what tradeoff you have between precision and performance.
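A rough sketch of the splitting step (for illustration only, not Annoy's actual C++ implementation):

```python
import numpy as np

def split_node(points, rng):
    """Split a set of points with a hyperplane equidistant from two sampled points."""
    i, j = rng.choice(len(points), size=2, replace=False)
    p, q = points[i], points[j]
    normal = p - q                       # hyperplane normal
    midpoint = (p + q) / 2.0             # the plane passes through the midpoint
    side = (points - midpoint) @ normal  # signed distance (up to scale) to the plane
    return points[side >= 0], points[side < 0]

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 40))
left, right = split_node(points, rng)
print(len(left), len(right))  # the two subspaces; recurse on each to build a tree
```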
Hamming distance (contributed by Martin Aumüller) packs the data into 64-bit integers under the hood and uses built-in bit count primitives so it could be quite fast. All splits are axis-aligned.
Dot Product distance (contributed by Peter Sobot) reduces the provided vectors from dot (or "inner-product") space to a more query-friendly cosine space using a method by Bachrach et al., at Microsoft Research, published in 2014.
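The core idea of that reduction, sketched below under the assumption that it follows the Bachrach et al. transformation (this is an illustration, not Annoy's internal code): append one extra coordinate to every item vector so that all items share the same norm, which turns maximum inner-product search into a cosine/Euclidean nearest-neighbor problem.

```python
import numpy as np

def augment_items(items):
    """Map item vectors x -> [sqrt(M^2 - ||x||^2), x], where M is the largest norm."""
    norms = np.linalg.norm(items, axis=1)
    M = norms.max()
    extra = np.sqrt(M ** 2 - norms ** 2)
    return np.hstack([extra[:, None], items])

def augment_query(q):
    """Queries get a zero in the extra coordinate, so inner products are preserved."""
    return np.concatenate([[0.0], q])

items = np.random.randn(100, 8)
q = np.random.randn(8)
# Inner products are unchanged, but all augmented items now have norm M,
# so ranking by cosine similarity matches ranking by dot product.
assert np.allclose(augment_items(items) @ augment_query(q), items @ q)
```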
AudioSet provides 128-dimensional audio features extracted at 1 Hz. The audio features were extracted using a VGG-inspired acoustic model described in Hershey et al., trained on a preliminary version of YouTube-8M. The features were PCA-ed and quantized to be compatible with the audio features provided with YouTube-8M. They are stored as TensorFlow Record files. The model used to generate the features is available in the TensorFlow models GitHub repository.
This repository provides models and supporting code associated with AudioSet, a dataset of over 2 million human-labeled 10-second YouTube video soundtracks, with labels taken from an ontology of more than 600 audio event classes.
AudioSet was released in March 2017 by Google's Sound Understanding team to provide a common large-scale evaluation task for audio event detection as well as a starting point for a comprehensive vocabulary of sound events.
Frame-level features are stored as tensorflow.SequenceExample protocol buffers. A tensorflow.SequenceExample proto is reproduced here in text format:
context: {
  feature: {
    key : "video_id"
    value: {
      bytes_list: {
        value: [YouTube video id string]
      }
    }
  }
  feature: {
    key : "start_time_seconds"
    value: {
      float_list: {
        value: 6.0
      }
    }
  }
  feature: {
    key : "end_time_seconds"
    value: {
      float_list: {
        value: 16.0
      }
    }
  }
  feature: {
    key : "labels"
    value: {
      int64_list: {
        value: [1, 522, 11, 172] # The meaning of the labels can be found in class_labels_indices.csv (see below).
      }
    }
  }
}
feature_lists: {
  feature_list: {
    key : "audio_embedding"
    value: {
      feature: {
        bytes_list: {
          value: [128 8bit quantized features]
        }
      }
      feature: {
        bytes_list: {
          value: [128 8bit quantized features]
        }
      }
    }
    ... # Repeated for every second of the segment
  }
}
The total size of the features is 2.4 gigabytes. They are stored in 12,228 TensorFlow record files, sharded by the first two characters of the YouTube video ID, and packaged as a tar.gz file.
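For reference, here is a minimal sketch of reading one of these records in Python (the shard filename below is a placeholder; point it at any file you extracted from audioset_v1_embeddings/):

```python
import numpy as np
import tensorflow as tf

# Placeholder path; replace with an actual extracted shard.
record_path = "audioset_v1_embeddings/bal_train/_0.tfrecord"

for raw_record in tf.data.TFRecordDataset(record_path).take(1):
    example = tf.train.SequenceExample.FromString(raw_record.numpy())
    video_id = example.context.feature["video_id"].bytes_list.value[0].decode()
    labels = list(example.context.feature["labels"].int64_list.value)
    frames = example.feature_lists.feature_list["audio_embedding"].feature
    # Each frame is 128 bytes of 8-bit quantized features, one frame per second.
    embedding = np.array(
        [np.frombuffer(f.bytes_list.value[0], dtype=np.uint8) for f in frames]
    )
    print(video_id, labels, embedding.shape)  # e.g. (10, 128)
```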
The labels are stored as integer indices. They are mapped to sound classes via class_labels_indices.csv. The first line defines the column names:
index,mid,display_name
Subsequent lines describe the mapping for each class. For example:
0,/m/09x0r,"Speech"
which means that “labels” with value 0 indicate segments labeled with “Speech”.
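For example, a small illustrative helper that turns the integer labels from a record into display names (it assumes class_labels_indices.csv sits in the working directory):

```python
import csv

def load_label_names(path="class_labels_indices.csv"):
    """Map integer label index -> human-readable display name."""
    with open(path, newline="") as f:
        return {int(row["index"]): row["display_name"] for row in csv.DictReader(f)}

names = load_label_names()
print([names[i] for i in [1, 522, 11, 172]])  # label indices from the example above
```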
Download the audio dataset from AudioSet. Scroll down to the section that says "Manually download" and get the compressed tar.gz file. Double-click to uncompress it in your project repository. These are YouTube video soundtracks encoded in the same embedding format that the MAX Audio Embedding Generator produces.
This repository contains code to instantiate and deploy an audio embedding model. The model accepts a signed 16-bit PCM WAV file as input, generates embeddings, applies a PCA transformation/quantization, and outputs the result as arrays of 1-second embeddings. The model was trained on AudioSet. As described in the code, this model is intended to be used as an example and perhaps as a stepping stone for more complex models. See the Usage heading in the tensorflow/models GitHub page for more ideas about potential usages.
The model files are hosted on IBM Cloud Object Storage. The code in this repository deploys the model as a web service in a Docker container. This repository was developed as part of the IBM Code Model Asset Exchange and the public API is powered by IBM Cloud.
| Domain | Application | Industry | Framework | Training Data | Input Data Format |
| --- | --- | --- | --- | --- | --- |
| Audio | Embeddings | Multi | TensorFlow | Google AudioSet | signed 16-bit PCM WAV audio file |
Clone this repository locally. In a terminal, run the following command:
$ git clone https://github.com/IBM/MAX-Audio-Embedding-Generator.git
Change directory into the repository base folder:
$ cd MAX-Audio-Embedding-Generator
To build the Docker image locally, run:
$ docker build -t max-audio-embedding-generator .
All required model assets will be downloaded during the build process. Note that currently this Docker image is CPU only (we will add support for GPU images later).
To run the Docker image, which automatically starts the model serving API, run:
$ docker run -it -p 5000:5000 max-audio-embedding-generator
The API server automatically generates an interactive Swagger documentation page. Go to http://localhost:5000 to load it. From there you can explore the API and also create test requests.
Use the model/predict endpoint to load a signed 16-bit PCM WAV audio file (you can use the car-horn.wav file located in the samples folder) and get embeddings from the API.
You can also test it on the command line, for example:
$ curl -F "audio=@samples/car-horn.wav" -XPOST http://localhost:5000/model/predict
You should see a JSON response like the one below:
{
"status": "ok",
"embedding": [
[
158,
23,
150,
...
],
...,
...,
[
163,
29,
178,
...
]
]
}
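Equivalently, here is a short Python sketch of the same request (it assumes the requests package is installed and the server is running locally):

```python
import requests

with open("samples/car-horn.wav", "rb") as f:
    response = requests.post(
        "http://localhost:5000/model/predict",
        files={"audio": f},  # same multipart field as the curl example above
    )

result = response.json()
print(result["status"])            # "ok"
print(len(result["embedding"]),    # number of 1-second frames
      len(result["embedding"][0])) # 128 values per frame
```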
Once the model server is running, you can see how to use it by walking through the demo notebook.
This will start the notebook server. You can open the demo notebook by clicking on demo.ipynb.
- Download the data set audioset_v1_embeddings/ and class_labels_indices.csv from the AudioSet download page.
- Pre-processing: download the audio data set and convert it into JSON format by running all cells in AudioSet_Processing.ipynb.
- Training: run the cells in Spotify_Audio_Recommender.ipynb to train the Annoy recommender.
- Inference: run the cell in Spotify_Audio_Recommender.ipynb for prediction (a sketch of the whole flow follows this list):
  nns_index = annoy_index.get_nns_by_item(193, 10)
- Inference using a UI by uploading an audio file.
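To make the training and inference steps concrete, here is a hedged end-to-end sketch of what the recommender does. The file name audioset_embeddings.json and the variable names are illustrative assumptions; the actual notebooks may differ.

```python
import json
import numpy as np
from annoy import AnnoyIndex

# Hypothetical output of AudioSet_Processing.ipynb: a JSON file mapping each
# video id to its list of 1-second 128-dimensional embeddings.
with open("audioset_embeddings.json") as f:
    data = json.load(f)

video_ids = list(data.keys())
annoy_index = AnnoyIndex(128, "angular")
for i, vid in enumerate(video_ids):
    # Average the per-second frames into one 128-dim vector per track.
    annoy_index.add_item(i, np.mean(np.array(data[vid]), axis=0))
annoy_index.build(10)

# The 10 most similar tracks to item 193, as in the inference step above.
nns_index = annoy_index.get_nns_by_item(193, 10)
print([video_ids[i] for i in nns_index])
```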
[1] AudioSet Download: https://research.google.com/audioset/download.html
[2] MAX Audio Embedding Generator: https://github.com/IBM/MAX-Audio-Embedding-Generator
[3] Annoy slidedeck: https://www.slideshare.net/erikbern/approximate-nearest-neighbor-methods-and-vector-models-nyc-ml-meetup
[4] Annoy explained in more detail: https://www.youtube.com/watch?v=QkCCyLW0ehU
[5] Spotify Annoy: https://github.com/spotify/annoy
[6] Models for AudioSet: https://github.com/tensorflow/models/tree/master/research/audioset