The short version of the question: I am looking for speech recognition software that runs on Linux with decent accuracy and usability. Any license and price are fine. It should not be restricted to voice commands, as I want to be able to dictate text.
More details:
I have tried the following, without satisfactory results:
- CMU Sphinx
- CVoiceControl
- Ears
- Julius
- Kaldi (e.g., Kaldi GStreamer server)
- IBM ViaVoice (used to run on Linux but was discontinued years ago)
- NICO ANN Toolkit
- OpenMindSpeech
- RWTH ASR
- shout
- silvius (built on the Kaldi speech recognition toolkit)
- Simon Listens
- ViaVoice / Xvoice
- Wine + Dragon NaturallySpeaking + NatLink + dragonfly + damselfly
- https://github.com/DragonComputer/Dragonfire: only accepts voice commands
All the native Linux solutions listed above have poor accuracy and poor usability (and some only support voice commands, not free-text dictation). By poor accuracy, I mean accuracy significantly below that of the speech recognition software I use on other platforms, listed below. As for Wine + Dragon NaturallySpeaking, in my experience it keeps crashing, and unfortunately I don't seem to be the only one with such issues.
On Microsoft Windows I use Dragon NaturallySpeaking, on Apple Mac OS X I use Apple Dictation and DragonDictate, on Android I use Google speech recognition, and on iOS I use the built-in Apple speech recognition.
Baidu Research yesterday released the code for its speech recognition library, which uses Connectionist Temporal Classification and is implemented with Torch. Benchmarks from Gigaom are encouraging, as shown in the table below, but I am not aware of any good wrapper that makes it usable without quite a lot of coding (and a large training data set):
| System | Clean (94) | Noisy (82) | Combined (176) |
|---|---|---|---|
| Apple Dictation | 14.24 | 43.76 | 26.73 |
| Bing Speech | 11.73 | 36.12 | 22.05 |
| Google API | 6.64 | 30.47 | 16.72 |
| wit.ai | 7.94 | 35.06 | 19.41 |
| Deep Speech | 6.56 | 19.06 | 11.85 |

Table 4: Results (%WER) for 5 systems evaluated on the original audio. All systems are scored only on the utterances with predictions given by all systems. The number in parentheses next to each dataset, e.g. Clean (94), is the number of utterances scored.
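For readers unfamiliar with the metric: %WER is the word error rate, i.e. the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. Below is a minimal Python sketch of that definition. It is not the scoring script used for the table above; the function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration, and real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate in percent.

    WER = (substitutions + deletions + insertions) / words in reference.
    Tokenization is a plain whitespace split (an illustrative assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only two rows of the table.
    # prev[j] is the distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.2f}")
```

Run on its own, the example prints 16.67: one deletion over six reference words. Note that WER can exceed 100% when the hypothesis contains many inserted words.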
There exist some very alpha open-source projects:
- https://github.com/mozilla/DeepSpeech (part of Mozilla's Vaani project: http://vaani.io (mirror))
- https://github.com/pannous/tensorflow-speech-recognition
- Vox, a system to control a Linux system using Dragon NaturallySpeaking: https://github.com/Franck-Dernoncourt/vox_linux + https://github.com/Franck-Dernoncourt/vox_windows
- https://github.com/facebookresearch/wav2letter
- https://github.com/espnet/espnet
- http://github.com/tensorflow/lingvo (to be released by Google, mentioned at Interspeech 2018)
I am also aware of this attempt at tracking the state of the art and recent results (bibliography) in speech recognition, as well as this benchmark of existing speech recognition APIs.
I am aware of Aenea, which allows speech recognition via Dragonfly on one computer to send events to another, but it adds some latency.
I am also aware of these two talks exploring Linux options for speech recognition:
- 2016 – The Eleventh HOPE: Coding by Voice with Open Source Speech Recognition (David Williams-King)
- 2014 – Pycon: Using Python to Code by Voice (Tavis Rudd)
Best Answer
Right now I'm experimenting with KDE Connect in combination with Google speech recognition on my Android smartphone.
KDE Connect allows you to use your Android device as an input device for your Linux computer (there are also some other features). You need to install the KDE Connect app from the Google Play Store on your smartphone/tablet and install both kdeconnect and indicator-kdeconnect on your Linux computer. On Ubuntu systems, both can be installed through apt (for example `sudo apt install kdeconnect indicator-kdeconnect`, assuming your configured repositories provide both packages).
The downside of this installation is that it installs a bunch of KDE packages that you don't need if you don't use the KDE desktop environment.
Once you pair your Android device with your computer (they have to be on the same network), you can open the Android keyboard and tap the mic to use Google speech recognition. As you talk, text will appear wherever your cursor is active on your Linux computer.
As for the results, they are a bit mixed for me: I'm currently writing a technical astrophysics document, and Google speech recognition struggles with jargon that rarely shows up in everyday text. Also, forget about it figuring out punctuation or proper capitalization.