The software you can use is Vosk-api, a modern speech recognition toolkit based on neural networks. It supports 7+ languages and works on variety of platforms including RPi and mobile.
First you convert the file to the required format and then you recognize it:
ffmpeg -i file.mp3 -ar 16000 -ac 1 file.wav
Then install vosk-api with pip:
pip3 install vosk
Then use these steps:
git clone https://github.com/alphacep/vosk-api
cd vosk-api/python/example
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.3.zip
unzip vosk-model-small-en-us-0.3.zip
mv vosk-model-small-en-us-0.3 model
python3 ./test_simple.py test.wav > result.json
The result will be stored in json format.
The same directory also contains an srt subtitle output example, which is easier to evaluate and can be directly useful to some users:
python3 -m pip install srt
python3 ./test_srt.py test.wav
The example given in the repository says in perfect American English accent and perfect sound quality three sentences which I transcribe as:
one zero zero zero one
nine oh two one oh
zero one eight zero three
The "nine oh two one oh" is said very fast, but still clear. The "z" of the before last "zero" sounds a bit like an "s".
The SRT generated above reads:
1
00:00:00,870 --> 00:00:02,610
what zero zero zero one
2
00:00:03,930 --> 00:00:04,950
no no to uno
3
00:00:06,240 --> 00:00:08,010
cyril one eight zero three
so we can see that several mistakes were made, presumably in part because we have the understanding that all words are numbers to help us.
Next I also tried with the vosk-model-en-us-aspire-0.2
which was a 1.4GB download compared to 36MB of vosk-model-small-en-us-0.3
and is listed at https://alphacephei.com/vosk/models:
mv model model.vosk-model-small-en-us-0.3
wget https://alphacephei.com/vosk/models/vosk-model-en-us-aspire-0.2.zip
unzip vosk-model-en-us-aspire-0.2.zip
mv vosk-model-en-us-aspire-0.2 model
and the result was:
1
00:00:00,840 --> 00:00:02,610
one zero zero zero one
2
00:00:04,026 --> 00:00:04,980
i know what you window
3
00:00:06,270 --> 00:00:07,980
serial one eight zero three
which got one more word correct.
Tested on vosk-api 7af3e9a334fbb9557f2a41b97ba77b9745e120b3.
When you have to transcribe, i.e. type, content that comes from an audio file, you want to
- be able to control audio (stop, pause, play) via the keyboard while typing the text (so you do not have to leave the text editor and switch to an audio application to do just that).
- You might also want to adjust the audio speed (to speed up slow speech and slow down too fast speech).
- You want the audio position to be a little bit rewinded when you continue audio playback, just to be sure you didn't miss anything.
I found that a software called transcribe (launchpad) provides the solution I was looking for.
You type the text in a text editor while transcribe plays the audio file in another window. To pause or continue audio playback, you use any keyboard shortcut that you configured system-wide for audio in the system settings. I use F7 to pause, F8 to continue.
To install transcribe:
First, you have to add a PPA:
sudo add-apt-repository ppa:frederik-elwert/transcribe
sudo apt-get update
Then you can install it:
sudo apt-get install transcribe
Best Answer