Project Overview

In this project, Anton Biryukov and I leveraged work done by the VGG group at Oxford to dig into voice recognition and deep learning. We used one of VGG's state-of-the-art CNNs to find an embedding space whose metric encodes perceptual voice similarity. We kept most of the pretrained weights but fine-tuned the last layers for a voice-to-person classification problem. For training we used the VoxCeleb (v1) dataset (>40 GB), which contains audio recordings of 1,000+ celebrities, as well as links to their headshot images, which we scraped from Google. Once the network was trained and working, we built a Flask app that connects to the user's microphone, records their speech, and shows them who they sounded like over the course of the recording. If there's no immediate access to a microphone, one can explore examples run on out-of-sample recordings downloaded from YouTube (see the video demo).
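
As a rough illustration of the fine-tuning step, here is a minimal PyTorch sketch: freeze the pretrained convolutional layers and train only a new classification head. The `VGGVoxBackbone` stand-in, the layer sizes, and the weight file are illustrative assumptions, not the project's actual code; the real VGGVox network is larger and operates on spectrograms.

```python
import torch
import torch.nn as nn

class VGGVoxBackbone(nn.Module):
    """Simplified stand-in for the pretrained VGGVox CNN (not the real architecture)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(            # conv layers over spectrogram input
            nn.Conv2d(1, 96, kernel_size=7, stride=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.embedding = nn.Linear(256, 1024)     # embedding layer, kept trainable

    def forward(self, x):
        return self.embedding(self.features(x).flatten(1))

NUM_SPEAKERS = 1251  # VoxCeleb1 has 1,251 identities (the "1,000+ celebrities")

backbone = VGGVoxBackbone()
# backbone.load_state_dict(torch.load("vggvox_pretrained.pt"))  # hypothetical weight file

for p in backbone.features.parameters():  # freeze the pretrained conv layers
    p.requires_grad = False

# New classification head: embedding -> speaker identity
model = nn.Sequential(backbone, nn.ReLU(), nn.Linear(1024, NUM_SPEAKERS))

optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()  # standard choice for multi-class classification
```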
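
The server side of such an app can be reduced to a single endpoint: the browser captures audio (e.g. via getUserMedia), POSTs the clip, and receives the prediction timeline back. This is a hedged sketch only; the route name, request format, and `identify_over_time` helper are assumptions, not the project's actual code.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def identify_over_time(audio_bytes: bytes) -> list:
    """Placeholder: decode the clip and run the fine-tuned model over
    sliding windows, returning (time, name, confidence) tuples."""
    return []

@app.route("/identify", methods=["POST"])
def identify():
    audio_bytes = request.files["audio"].read()  # clip recorded in the browser
    return jsonify(identify_over_time(audio_bytes))
```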

Video Demo: (Warning: Explicit Language)

This video demonstrates the network's predictions, and its confidence in those predictions, over time. You can see the prediction change from Steve Carell to Zach Galifianakis as the conversation goes on.
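
A rough sketch of how such a timeline can be produced, assuming the fine-tuned network exposes a softmax over speakers for a fixed-length segment; the window and hop lengths and the `predict_segment` callable are illustrative assumptions.

```python
import numpy as np

def predictions_over_time(waveform: np.ndarray, sr: int, predict_segment,
                          win_s: float = 3.0, hop_s: float = 1.0):
    """Slide a window over the recording; yield (time_s, speaker_id, confidence)."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    for start in range(0, max(len(waveform) - win + 1, 1), hop):
        probs = predict_segment(waveform[start:start + win])  # softmax over speakers
        top = int(np.argmax(probs))
        yield start / sr, top, float(probs[top])
```

Plotting the confidence of the top few speakers against time is what makes the switch from one celebrity to another visible in the demo.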

Originally we hoped to work on diarisation, the process of partitioning an input audio stream into homogeneous segments according to speaker identity, rather than voice recognition. After exploring some of the existing work on diarisation and testing it out for ourselves, we found that it was still too sensitive to noise for our purposes and often estimated the incorrect number of speakers.

Link to Code