Using Colab to Generate SRT Subtitles with OpenAI's Whisper
In the era of large models, OpenAI’s Whisper has become very popular recently, and I found that its recognition quality is excellent. It can produce subtitles comparable to iFLYTEK’s, and I especially like that it automatically writes certain terms in their official style, such as “MacBook Pro,” “Type-C,” “HDMI,” and “USB3 Gen2,” rather than “macbook pro” or “usb3,” which I really appreciate. However, it does require a GPU machine. After checking my available resources, I thought of Google Colab. Wouldn’t it be great if I could make it process subtitles with just one click? (This article requires access to international content, but I won’t provide any instructions on how to get that access.) (English version translated by GPT-3.5.)
Introduction
I have been using Whisper for the past few days, and it is really convenient; its Chinese recognition accuracy is also quite high. iFLYTEK’s subtitle service is good too, but it is paid. It also happens that I help a friend with his captions whenever he publishes new content, so this was a good chance to try Whisper out. Here, “video” specifically refers to YouTube videos.
Then I came across the article Comparison: Which Speech-to-Text Tool is the Real King? - Baidu Baijia, which compares the accuracy of various speech-to-text tools (results not independently verified). The similarity scores for a 30-second video are as follows (data from that article); Whisper’s large-v1 model sits near the top, just behind Feishu Miaoji.
| Speech Recognition Model | Similarity to Correct Transcription |
| --- | --- |
| Feishu Miaoji | 0.9898 |
| Whisper’s large-v1 model | 0.9865 |
| Jianying | 0.9865 |
| Whisper’s large-v2 model | 0.9797 |
| Whisper’s large model | 0.9797 |
| BiJian | 0.9797 |
| Microsoft’s built-in speech recognition | 0.9695 |
| NetEase’s Workbench | 0.9662 |
| Whisper’s medium model | 0.9625 |
| Whisper’s small model | 0.9625 |
| Whisper’s base model | 0.8805 |
| Whisper’s tiny model | 0.8454 |
Idea
I want to reduce the amount of manual work.
- Provide a video link.
- Google Colab automatically downloads the video; I only need to modify a variable and run a code cell in Colab.
- Then I can do whatever I want, as long as the webpage stays open and active.
- Colab automatically saves the generated srt file to Google Drive.
- Bonus: It would be great if I could be notified once the caption processing is completed.
Steps
Step 1: yt-dlp Tool
I found the yt-dlp - GitHub tool on GitHub. It can parse YouTube links and download a specified resolution or format.
So the first step is clear: download the video, although in this case I only need the audio. For testing, I chose a video from a well-known YouTuber (I have no association with them, or with the friend mentioned earlier): Geekerwan - RTX3070TI首发评测:本来是张好显卡,等等吧 - YouTube.
We use the following command to list the video’s available formats:

```
yt-dlp -F "https://www.youtube.com/watch?v=5wF1YItz78Y"
```
For example:
```
[you.....] Extracting URL: https://www.you......com/watch?v=*******
```
Obviously, we don’t need the “video only” formats, since we only want the audio. So I picked the “audio only” format with format_id 140.
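For reference, here is an abbreviated, illustrative sketch of the format table that yt-dlp prints. The column layout follows yt-dlp’s output, but the file sizes and bitrates below are placeholders and vary from video to video:

```
ID   EXT   RESOLUTION    FILESIZE    TBR  PROTO  VCODEC          ACODEC
-----------------------------------------------------------------------
140  m4a   audio only      ~9MiB    129k  https  audio only      mp4a.40.2
137  mp4   1920x1080     ~150MiB   4400k  https  avc1.640028     video only
```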
Step 2: Downloading the Audio
We use the following command to download the audio. To keep the file extension consistent, I appended “-o download.m4a” at the end; otherwise yt-dlp would name the file after the video’s title.
```
yt-dlp -f 140 "https://www.youtube.com/watch?v=5wF1YItz78Y" -o download.m4a
```
Step 3: Installing the Whisper Dependencies
This step is straightforward. Just follow the instructions on Whisper - GitHub and run the command.
```
pip install openai-whisper
```
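One prerequisite worth mentioning: Whisper decodes audio through ffmpeg. Colab machines already ship with it, but if you run this on a bare server you would install it first, for example on Debian/Ubuntu:

```
sudo apt-get update && sudo apt-get install -y ffmpeg
```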
Step 4: Generating Subtitles
Following the instructions provided by Whisper, we write the following command:
```
whisper --model large-v2 --model_dir=./ --output_format srt --output_dir outsrt download.m4a
```
For the “model” parameter, we use the “large-v2” model. The available models are listed below; if the chosen model is not present locally, it will be downloaded automatically:

```
tiny.en, tiny, base.en, base, small.en, small, medium.en, medium, large-v1, large-v2, large (default: small)
```
The “model_dir” parameter specifies the directory where the model is stored. If you already have the model on your own server, pointing to it avoids a redundant download; the file must be named after the model, for example large-v2.pt. In my actual run, I omitted this parameter.
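For instance, if the weights already live somewhere on the machine, a copy along these lines (the source path is hypothetical) lets --model_dir=./ pick them up without re-downloading:

```
cp /path/to/models/large-v2.pt ./
```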
The “output_format” parameter specifies the output format as “srt”. The available options are:

```
txt, vtt, srt, tsv, json, all
```
The “output_dir” parameter specifies where to output the srt files. I chose to output them to the “outsrt” directory. If the directory does not exist, it will be created automatically.
Finally, we provide the path to the “download.m4a” file, which is the file we want to process.
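As an aside, the CLI is a thin wrapper; the same transcription can be done from Python with Whisper’s load_model/transcribe API. Here is a minimal sketch that writes the srt file by hand from the returned segments (the srt_timestamp helper is my own, not part of Whisper):

```python
import os
import whisper  # pip install openai-whisper

def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm style that SRT expects."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = whisper.load_model("large-v2")      # downloads the weights if absent
result = model.transcribe("download.m4a")   # language is auto-detected

os.makedirs("outsrt", exist_ok=True)
with open("outsrt/download.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n")
        f.write(f"{seg['text'].strip()}\n\n")
```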
Logic Summary and Steps
At this point, the logic is complete, and the srt file should have been generated in the output_dir directory. The srt file shares its base name with the m4a file, for example download.srt.
In Colab, a code cell line starting with “!” executes a shell command; lines without “!” are Python code. Each “!” line runs in its own throwaway shell starting from the notebook’s working directory (“/content” by default), so state such as the current directory does not carry over to the next line. For example:
```
!mkdir files   # command executed successfully
!cd files      # the cd only affects this one shell
!pwd           # still prints /content
```

This means that unless you chain the commands on a single line, such as “!cd files && pwd”, the “pwd” will not actually run inside the “files” directory.
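If you do need a directory change that sticks, IPython’s %cd magic (as opposed to !cd) changes the notebook’s working directory for all subsequent lines and cells:

```
%cd files
!pwd           # now prints /content/files
%cd /content   # go back
```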
Mounting Google Drive
I want the processed files to be saved to my Google Drive automatically, so I need to mount Google Drive first. Since mounting requires interactive authorization, I’d rather grant it up front than start processing and get an authorization prompt an hour later.
```python
from google.colab import drive
drive.mount('/content/gdrive')
```
After mounting, the contents of your Drive actually live under “/content/gdrive/MyDrive”.
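Before the long run starts, it’s worth creating the destination folder for the subtitles (srtFiles is the folder used later in this article) and checking the mount:

```
!mkdir -p /content/gdrive/MyDrive/srtFiles
!ls /content/gdrive/MyDrive
```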
Downloading and Installing yt-dlp
First, we download yt-dlp and make it executable. I define a variable for the video ID for convenience; note that it is set and used on the same “!” line, since each shell line is a separate process. Because the audio format_id has always been 140 in my experience, I hardcoded it. After the audio is downloaded, I delete the yt-dlp binary just in case there are any detection mechanisms (although it’s probably unnecessary).
```
!export YT_VIDEO="5wF1YItz78Y" && wget https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -O yt-dlp && chmod +x yt-dlp && ./yt-dlp -f 140 "https://www.youtube.com/watch?v=$YT_VIDEO" -o download.m4a && rm yt-dlp
```
Installing Dependencies
At this point we already have the “download.m4a” file, so the next step is installing the Whisper dependencies. I also install the “requests” library here, because I want to be notified once processing completes, and that requires an HTTP client.
```
!pip install openai-whisper requests
```
Start Processing
I want it to output “DONE” once the processing is complete.
```
!whisper --model large-v2 --output_format srt --output_dir outsrt download.m4a && echo "DONE!"
```
Uploading Files
Copy the generated srt file to Google Drive:

```
!cp outsrt/download.srt /content/gdrive/MyDrive/srtFiles
```
Testing the Entire Code
The complete code block looks like this:
```python
from google.colab import drive
drive.mount('/content/gdrive')

!export YT_VIDEO="5wF1YItz78Y" && wget https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -O yt-dlp && chmod +x yt-dlp && ./yt-dlp -f 140 "https://www.youtube.com/watch?v=$YT_VIDEO" -o download.m4a && rm yt-dlp
!pip install openai-whisper requests
!whisper --model large-v2 --output_format srt --output_dir outsrt download.m4a && echo "DONE!"
!cp outsrt/download.srt /content/gdrive/MyDrive/srtFiles
```
After running it, the authorization prompt will appear first.
```
Permit this notebook to access your Google Drive files?
```
Then, don’t close the webpage (Pro+ users can disregard this). You can have coffee or play games while the processing is being done. After it’s finished, the file will automatically appear in the specified folder in your Google Drive.
The Colab output begins like this, and the whole process took 5 minutes and 46 seconds:

```
Mounted at /content/gdrive
```
Final Step: Notification
We want to be notified once the processing is completed. As an iOS user, I use the Bark app to receive push notifications.
To get the notification address, install “Bark” from the App Store on your iPhone; it will give you a URL of this form:

```
https://api.day.app/<your_token>/<message_text>
```
So I wrote a few lines of Python with the installed “requests” library; a minimal version looks like this (YOUR_BARK_TOKEN is a placeholder for your own key):

```python
import requests

# Push a completion notice through Bark; spaces in the message must be URL-encoded.
requests.get("https://api.day.app/YOUR_BARK_TOKEN/srt%20file%20DONE")
```
The push notification then shows up on the phone. (Screenshot of the Bark notification.)
Final Notebook Summary
Putting everything together into one notebook cell (YOUR_BARK_TOKEN is again a placeholder):

```python
from google.colab import drive
drive.mount('/content/gdrive')

!export YT_VIDEO="5wF1YItz78Y" && wget https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -O yt-dlp && chmod +x yt-dlp && ./yt-dlp -f 140 "https://www.youtube.com/watch?v=$YT_VIDEO" -o download.m4a && rm yt-dlp
!pip install openai-whisper requests
!whisper --model large-v2 --output_format srt --output_dir outsrt download.m4a && echo "DONE!"
!cp outsrt/download.srt /content/gdrive/MyDrive/srtFiles

import requests
requests.get("https://api.day.app/YOUR_BARK_TOKEN/srt%20file%20DONE")
```
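To process another video, just change YT_VIDEO to the new video ID and rerun the cell; everything else stays the same.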