How to use Google’s YouTube speech recognition without uploading videos to YouTube

speech recognition, speech to text, video, youtube

I have a lot of lecture video content that I would like subtitles for. YouTube automatically generates subtitles for videos under certain conditions (those conditions are still somewhat of a mystery to me).

I would like to use this speech recognition technology outside of YouTube. I don't want to upload every video just to get the transcript (too time-consuming). Also, I don't think YouTube will do it for videos longer than about 30 minutes (most of mine are), and I don't think it will do it for videos that are not publicly listed (a problem, because this is premium content meant to be sold).

Perfect scenario: There is a program that I can run from my desktop to get the transcript out of these videos and it is of equal or better quality than YouTube's and has the time codes similar to an SRT or the XML that YouTube generates [How to get YouTube subtitles].

Acceptable scenario: There are some tricks I can do to force YouTube to transcribe the videos, whether set to private or public, and despite length.

Doable scenario: There is a library or something similar that I can use to code my own program. I am good with C# and okay with C++ (but I really prefer C#).

Best Answer

Google implemented the Web Speech API (for both speech recognition and synthesis) in Chrome, which you can use if you are a developer. It is presumably backed by the same recognition engine that YouTube uses to generate closed captions on some videos. Maybe you'll find code to interact with it.

The data flow would probably be:

A video file => extract and convert the audio => send it to the Google API => get the text => write it into an SRT file.
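As a rough illustration of the first and last steps of that flow, here is a minimal Python sketch. It only builds an ffmpeg command line for extracting mono audio (it does not run it) and formats already-recognized text as SRT. The recognition step itself is left out: the `Segment` type, the function names, and the 16 kHz sample rate are assumptions made for this example, not anything Google or YouTube prescribes.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognized chunk of speech (hypothetical type for this sketch)."""
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video
    text: str


def ffmpeg_extract_command(video_path, audio_path, sample_rate=16000):
    """Build (but do not run) an ffmpeg command that extracts mono audio.

    -vn drops the video stream, -ac 1 downmixes to one channel,
    -ar sets the audio sample rate. The 16 kHz default is an assumption.
    """
    return ["ffmpeg", "-i", video_path, "-vn",
            "-ac", "1", "-ar", str(sample_rate), audio_path]


def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as SRT text: index, time range, text, blank line."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{time_range}\n{seg.text.strip()}\n")
    return "\n".join(blocks)
```

The command list could be passed to `subprocess.run` once ffmpeg is installed, and the string returned by `to_srt` written to a `.srt` file. Where the segments and their time codes come from depends entirely on the recognition engine you end up using.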

EDIT: there doesn't seem to be an official API page, other than the W3C spec. So here are more links:

These examples are about using the API from inside Chrome, but you can also query Google's online speech recognition engine directly. For instance, Jasper, a speech-recognizing personal assistant for the Raspberry Pi, lets you choose Google as the speech recognition engine.