Using Colab to Generate SRT Subtitles with OpenAI's Whisper
In the era of large models, OpenAI’s Whisper has become very popular recently, and I found that its recognition quality is excellent. It can produce subtitles comparable to iFLYTEK’s, and I especially like that it automatically writes certain terms in their official style, such as “MacBook Pro,” “Type-C,” “HDMI,” and “USB3 Gen2,” rather than “macbook pro” or “usb3,” which I really appreciate. However, it does require a GPU machine. After checking my available resources, I thought of Google Colab. Wouldn’t it be great if I could make it process subtitles with just one click? (This article requires access to international content, but I won’t provide any instructions on how to get that access.) (English version translated by GPT-3.5.)
Introduction
I have been using Whisper for the past few days, and it is really convenient; its Chinese recognition accuracy is also quite high. iFLYTEK’s subtitle service is good too, but it is paid. It also happens that I help a friend with his captions whenever he publishes new content, so this was a good chance to try Whisper out. Here, “video” specifically refers to YouTube videos.
Then I came across the article Comparison: Which Speech-to-Text Tool is the Real King? - Baidu Baijia, which compares the accuracy of various speech-to-text tools (results not independently verified). The similarity scores for a 30-second video are as follows (data from that article); Whisper’s large-v1 model sits near the top, just behind Feishu Miaoji.
| Speech Recognition Model | Similarity to Correct Transcription |
| --- | --- |
| Feishu Miaoji | 0.9898 |
| Whisper’s large-v1 model | 0.9865 |
| Jianying | 0.9865 |
| Whisper’s large-v2 model | 0.9797 |
| Whisper’s large model | 0.9797 |
| BiJian | 0.9797 |
| Microsoft’s built-in speech recognition | 0.9695 |
| NetEase’s Workbench | 0.9662 |
| Whisper’s medium model | 0.9625 |
| Whisper’s small model | 0.9625 |
| Whisper’s base model | 0.8805 |
| Whisper’s tiny model | 0.8454 |
Idea
I want to reduce the amount of manual work.
- Provide a video link.
- Google Colab automatically downloads the video; I only need to modify a variable and run a code cell in Colab.
- Then I can do whatever I want, as long as the webpage stays open and active.
- Colab automatically saves the generated srt file to Google Drive.
- Bonus: It would be great if I could be notified once the caption processing is completed.
Steps
Step 1: yt-dlp Tool
I found the yt-dlp - GitHub tool on GitHub. It can parse YouTube links and download a specified resolution or format.
So the first step is clear: download the video, although in this case I only need the audio. For testing, I chose a video from a well-known YouTuber (I have no association with them, or with the friend mentioned earlier): Geekerwan - RTX3070TI首发评测:本来是张好显卡,等等吧 - YouTube.
We use the following command to list the video’s available formats:

```
yt-dlp -F "https://www.youtube.com/watch?v=5wF1YItz78Y"
```
For example:
```
[you.....] Extracting URL: https://www.you......com/watch?v=*******
```
Obviously, we don’t need the “video only” formats, since we only want the audio. So I picked the “audio only” format with format_id 140.
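For reference, here is an abbreviated, illustrative sketch of the format table that yt-dlp prints. The column layout follows yt-dlp’s output, but the file sizes and bitrates below are placeholders and vary from video to video:

```
ID   EXT   RESOLUTION    FILESIZE    TBR  PROTO  VCODEC          ACODEC
-----------------------------------------------------------------------
140  m4a   audio only      ~9MiB    129k  https  audio only      mp4a.40.2
137  mp4   1920x1080     ~150MiB   4400k  https  avc1.640028     video only
```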
Step 2: Downloading the Audio
We use the following command to download the audio. To keep the file extension consistent, I appended “-o download.m4a” at the end; otherwise yt-dlp would name the file after the video’s title.
```
yt-dlp -f 140 "https://www.youtube.com/watch?v=5wF1YItz78Y" -o download.m4a
```
Step 3: Installing the Whisper Dependencies
This step is straightforward. Just follow the instructions on Whisper - GitHub and run the command.
```
pip install openai-whisper
```
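One prerequisite worth mentioning: Whisper decodes audio through ffmpeg. Colab machines already ship with it, but if you run this on a bare server you would install it first, for example on Debian/Ubuntu:

```
sudo apt-get update && sudo apt-get install -y ffmpeg
```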
Step 4: Generating Subtitles
Following the instructions provided by Whisper, we write the following command:
```
whisper --model large-v2 --model_dir=./ --output_format srt --output_dir outsrt download.m4a
```
For the “model” parameter, we use the “large-v2” model. The available models are listed below; if the chosen model is not present locally, it will be downloaded automatically:

```
tiny.en, tiny, base.en, base, small.en, small, medium.en, medium, large-v1, large-v2, large (default: small)
```
The “model_dir” parameter specifies the directory where the model is stored. If you already have the model on your own server, pointing to it avoids a redundant download; the file must be named after the model, for example large-v2.pt. In my actual run, I omitted this parameter.
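For instance, if the weights already live somewhere on the machine, a copy along these lines (the source path is hypothetical) lets --model_dir=./ pick them up without re-downloading:

```
cp /path/to/models/large-v2.pt ./
```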
The “output_format” parameter specifies the output format as “srt”. The available options are:

```
txt, vtt, srt, tsv, json, all
```
The “output_dir” parameter specifies where to output the srt files. I chose to output them to the “outsrt” directory. If the directory does not exist, it will be created automatically.
Finally, we provide the path to the “download.m4a” file, which is the file we want to process.
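As an aside, the CLI is a thin wrapper; the same transcription can be done from Python with Whisper’s load_model/transcribe API. Here is a minimal sketch that writes the srt file by hand from the returned segments (the srt_timestamp helper is my own, not part of Whisper):

```python
import os
import whisper  # pip install openai-whisper

def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm style that SRT expects."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = whisper.load_model("large-v2")      # downloads the weights if absent
result = model.transcribe("download.m4a")   # language is auto-detected

os.makedirs("outsrt", exist_ok=True)
with open("outsrt/download.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n")
        f.write(f"{seg['text'].strip()}\n\n")
```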
Logic Summary and Steps
At this point, the logic is complete, and the srt file should have been generated in the output_dir directory. The srt file shares its base name with the m4a file, for example download.srt.
In Colab, a code cell line starting with “!” executes a shell command; lines without “!” are Python code. Each “!” line runs in its own throwaway shell starting from the notebook’s working directory (“/content” by default), so state such as the current directory does not carry over to the next line. For example:
```
!mkdir files   # command executed successfully
!cd files      # the cd only affects this one shell
!pwd           # still prints /content
```

This means that unless you chain the commands on a single line, such as “!cd files && pwd”, the “pwd” will not actually run inside the “files” directory.
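If you do need a directory change that sticks, IPython’s %cd magic (as opposed to !cd) changes the notebook’s working directory for all subsequent lines and cells:

```
%cd files
!pwd           # now prints /content/files
%cd /content   # go back
```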
Mounting Google Drive
I want the processed files to be saved to my Google Drive automatically, so I need to mount Google Drive first. Since mounting requires interactive authorization, I’d rather grant it up front than start processing and get an authorization prompt an hour later.
```python
from google.colab import drive
drive.mount('/content/gdrive')
```
After mounting, the contents of your Drive actually live under “/content/gdrive/MyDrive”.
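Before the long run starts, it’s worth creating the destination folder for the subtitles (srtFiles is the folder used later in this article) and checking the mount:

```
!mkdir -p /content/gdrive/MyDrive/srtFiles
!ls /content/gdrive/MyDrive
```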
Downloading and Installing yt-dlp
First, we download yt-dlp and make it executable. I define a variable for the video ID for convenience; note that it is set and used on the same “!” line, since each shell line is a separate process. Because the audio format_id has always been 140 in my experience, I hardcoded it. After the audio is downloaded, I delete the yt-dlp binary just in case there are any detection mechanisms (although it’s probably unnecessary).
```
!export YT_VIDEO="5wF1YItz78Y" && wget https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -O yt-dlp && chmod +x yt-dlp && ./yt-dlp -f 140 "https://www.youtube.com/watch?v=$YT_VIDEO" -o download.m4a && rm yt-dlp
```
Installing Dependencies
At this point we already have the “download.m4a” file, so the next step is installing the Whisper dependencies. I also install the “requests” library here, because I want to be notified once processing completes, and that requires an HTTP client.
```
!pip install openai-whisper requests
```
Start Processing
I want it to output “DONE” once the processing is complete.
```
!whisper --model large-v2 --output_format srt --output_dir outsrt download.m4a && echo "DONE!"
```
Uploading Files
Copy the generated srt file to Google Drive:

```
!cp outsrt/download.srt /content/gdrive/MyDrive/srtFiles
```
Testing the Entire Code
The complete code block looks like this:
```python
from google.colab import drive
drive.mount('/content/gdrive')

!export YT_VIDEO="5wF1YItz78Y" && wget https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -O yt-dlp && chmod +x yt-dlp && ./yt-dlp -f 140 "https://www.youtube.com/watch?v=$YT_VIDEO" -o download.m4a && rm yt-dlp
!pip install openai-whisper requests
!whisper --model large-v2 --output_format srt --output_dir outsrt download.m4a && echo "DONE!"
!cp outsrt/download.srt /content/gdrive/MyDrive/srtFiles
```
After running it, the authorization prompt will appear first.
```
Permit this notebook to access your Google Drive files?
```
Then, don’t close the webpage (Pro+ users can disregard this). You can have coffee or play games while the processing is being done. After it’s finished, the file will automatically appear in the specified folder in your Google Drive.
The Colab output begins like this, and the whole process took 5 minutes and 46 seconds:

```
Mounted at /content/gdrive
```
Final Step: Notification
We want to be notified once the processing is completed. As an iOS user, I use the Bark app to receive push notifications.
To get the notification address, install “Bark” from the App Store on your iPhone; it will give you a URL of this form:

```
https://api.day.app/<your_token>/<message_text>
```
So I wrote a few lines of Python with the installed “requests” library; a minimal version looks like this (YOUR_BARK_TOKEN is a placeholder for your own key):

```python
import requests

# Push a completion notice through Bark; spaces in the message must be URL-encoded.
requests.get("https://api.day.app/YOUR_BARK_TOKEN/srt%20file%20DONE")
```
The push notification then shows up on the phone. (Screenshot of the Bark notification.)
Final Notebook Summary
Putting everything together into one notebook cell (YOUR_BARK_TOKEN is again a placeholder):

```python
from google.colab import drive
drive.mount('/content/gdrive')

!export YT_VIDEO="5wF1YItz78Y" && wget https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -O yt-dlp && chmod +x yt-dlp && ./yt-dlp -f 140 "https://www.youtube.com/watch?v=$YT_VIDEO" -o download.m4a && rm yt-dlp
!pip install openai-whisper requests
!whisper --model large-v2 --output_format srt --output_dir outsrt download.m4a && echo "DONE!"
!cp outsrt/download.srt /content/gdrive/MyDrive/srtFiles

import requests
requests.get("https://api.day.app/YOUR_BARK_TOKEN/srt%20file%20DONE")
```
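To process another video, just change YT_VIDEO to the new video ID and rerun the cell; everything else stays the same.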