2023-04-02 • 8 min read • 434 views

Improving your Publishing Workflow with OpenAI Whisper Speech Recognition

AI OpenAI Python Blogging Workflow Whisper

If you do any sort of writing, whether it's for a blog post, an article or some other publication, you will know that writing anything of quality is never easy and that it takes time and effort. One of the main time sinks is simply typing out and reworking your ideas and phrasing during this process.

Enter Whisper

Whisper (see GitHub for code) is a neural net for speech recognition that approaches human-level robustness and accuracy on English. Whisper can also recognize other languages, but since I'm primarily interested in English speech recognition, we'll just focus on that. It should be noted that Whisper was created by OpenAI, the same folks who brought us ChatGPT.

Whisper Options & Variations

Whisper has several models (ranging from tiny to large) which offer increasingly sophisticated and accurate speech recognition. The sizes (and performance indicators) of these models are outlined in the table below:

Size	Parameters	English-only Model	Multilingual Model	Required VRAM	Relative Speed
tiny	39M	tiny.en	tiny	~1GB	~32x
base	74M	base.en	base	~1GB	~16x
small	244M	small.en	small	~2GB	~6x
medium	769M	medium.en	medium	~5GB	~2x
large	1550M	N/A	large	~10GB	1x

I am sure that there are relevant differences between these models, but during my tests any differences I observed were very minor. In some cases the only difference was that some extra punctuation was inserted into the generated text. This seems to suggest that even the small model is good enough for most purposes.
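If you do have a working GPU, the table above can be turned into a small lookup that suggests the largest model likely to fit. This is only a sketch: the VRAM figures are the approximate values from the table, and the function name `largest_model_for` is made up for this example.

```python
# (model name, approximate required VRAM in GB), copied from the table above.
MODELS = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large", 10)]


def largest_model_for(vram_gb: float, english_only: bool = True) -> str:
    """Suggest the largest model whose approximate VRAM need fits in vram_gb."""
    fitting = [name for name, needed in MODELS if needed <= vram_gb]
    if not fitting:
        raise ValueError("even the tiny model needs about 1 GB of VRAM")
    name = fitting[-1]  # the list is ordered from smallest to largest
    # Per the table, every size except large has an English-only ".en" variant.
    if english_only and name != "large":
        name += ".en"
    return name
```

With 4 GB of VRAM this suggests `small.en`, which matches my experience that the small model is a sensible default.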

Installing Whisper

Installing Whisper is actually quite easy. The following instructions are based on Linux (Ubuntu 22.04), but given that Whisper is exposed through a Python API, and assuming that you have a current version of Python 3 installed, the same command should also work on Windows or on a Mac (disclaimer: I don't have Windows or a Mac, so this is an untested assumption):

pip3 install git+https://github.com/openai/whisper.git

This command will download a whole bunch of libraries (don't ask me what they do, I have no idea) and install Whisper. Assuming you have an audio file called audio.wav, you'll be able to run the following command:

whisper --model small audio.wav

or, if you want to save the recognized text to a file:

whisper --model small audio.wav > extracted.txt

You can see that I pass a --model flag to Whisper. The first time you use a model it will be downloaded; after that, it can just be used. Please also note that you can pass an MP3 file to Whisper; I just happened to have saved my recorded audio as WAV and didn't want to bother converting it. If you're wondering how to record audio, Audacity is a free, cross-platform audio recorder which will do everything you need (and more).
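Since Whisper is exposed through a Python API, the same transcription can also be done from a script. The following is a minimal sketch, not the only way to do it: `whisper.load_model` and `model.transcribe` are the calls shown in the Whisper README, while the helper names `transcript_path` and `transcribe_to_file` are made up for this example.

```python
from pathlib import Path


def transcript_path(audio_path: str) -> Path:
    """Return the text file path next to the audio file (audio.wav -> audio.txt)."""
    return Path(audio_path).with_suffix(".txt")


def transcribe_to_file(audio_path: str, model_name: str = "small") -> Path:
    """Transcribe audio_path with Whisper and save the text (hypothetical helper)."""
    # Imported here so the module still loads where Whisper is not installed.
    import whisper

    model = whisper.load_model(model_name)  # downloaded on first use
    result = model.transcribe(audio_path)   # a dict; "text" holds the transcript
    out_path = transcript_path(audio_path)
    out_path.write_text(result["text"].strip() + "\n", encoding="utf-8")
    return out_path
```

Calling `transcribe_to_file("audio.wav")` should then leave the transcript in `audio.txt` next to the recording.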

If you're on Windows, there is a nice article here which details installing Whisper on Windows along with some nice related setup tricks.

Problems

First of all, let me make it clear that I usually don't need great GPU performance for my day-to-day work. A solid desktop which lets me get work done is all I need; whether I have full GPU support or not is not really relevant to me. As such, the problems I'm about to describe here are most probably:

Linux specific
Unresolved due to my laziness in tinkering around until I find a solution

The main issue for me (which is more of an annoyance than a problem) is that Whisper (and below that in the software stack, PyTorch) refuses to use my GPU (clinfo sees it, but Whisper doesn't). This means that I can't use the GPU (specifically CUDA) and that Whisper has to use the CPU to work its magic. No big deal, I have a 16-core CPU, so it doesn't take too long, but it's an annoyance since the GPU would probably be significantly faster. But let's not complain about minor annoyances; the fact that this works at all with this little effort is already a pleasant surprise.
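A quick way to see what PyTorch itself thinks of your GPU is to ask it directly. Note that clinfo lists OpenCL devices, which is not the same thing PyTorch checks for. The sketch below uses `torch.cuda.is_available()` and the `device` argument of `whisper.load_model`; the helper names `pick_device` and `load_model_on_best_device` are made up for this example.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to ask, so assume CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_model_on_best_device(model_name: str = "small"):
    """Load a Whisper model on the GPU if possible (hypothetical helper)."""
    import whisper  # imported lazily so pick_device() works without Whisper

    return whisper.load_model(model_name, device=pick_device())
```

If `pick_device()` returns `cpu` on a machine with a GPU, the problem sits below Whisper, in the PyTorch build or the drivers.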

How good is it?

I'm sure everybody reading this is on the edge of their seats now, wondering how good Whisper really is. Well, to get right to the point, it is really good, much better than I expected. To test Whisper, I recorded about 2 minutes of audio, speaking normally without any special attention given to my pronunciation or talking speed. I then had Whisper perform its magic: my CPU monitor lit up across all 16 CPU cores and eventually Whisper finished its task. The results were excellent: even when using the small model, there was only 1 incorrect word in the generated text. The medium model had exactly the same result and the large model had the same 1 word error, but inserted additional (but correct) punctuation into the generated text. As such, at least according to my test, there is no need to use the large or medium model. The difference in extraction quality is tiny and since you'll have to check and massage the output anyway (if you're like me, what you say is not the same as what you would want to write), the small model is probably good enough. If you find that the extraction quality is not good enough, you can always run the same audio file through a larger/better model.
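If you want to compare the output of two models yourself without reading both transcripts side by side, a word-level diff that ignores case and punctuation does the job. This is a rough sketch using Python's difflib; the function names are made up, and the sample sentences in the usage note are illustrative, not my actual test recording.

```python
import difflib
import string


def normalize(text: str) -> list[str]:
    """Lowercase the text, strip punctuation and split it into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_differences(a: str, b: str) -> list[tuple[str, str]]:
    """Return (words_in_a, words_in_b) pairs where two transcripts disagree."""
    words_a, words_b = normalize(a), normalize(b)
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
            )
    return differences
```

Two transcripts that differ only in punctuation, such as "writing is never easy and it takes time" and "Writing is never easy, and it takes time.", produce an empty list, while a misheard word shows up as a pair.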

Performance

Transcribing 2 minutes of audio took about 1.5 minutes on the small model and about 8 minutes on the large model. Not very fast, but keep in mind that I'm using the CPU for this, which is probably completely unoptimized for such a task when compared to a GPU. I'd expect a modern GPU to be an order of magnitude (or even more) faster than the CPU, so performance should not be an issue. Even at this speed, though, extracting the text from audio with the small model is still faster (and requires less effort) than typing it all out. Be advised though that if you have to resort to CPU-based processing and you have a quad-core CPU (or something of that sort), the extraction speed will be painfully slow.
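To get a feel for what these numbers mean for longer recordings, here is a back-of-the-envelope calculation. It assumes that processing time scales linearly with audio length, which I have not verified, and it only uses the two rough measurements from my 16-core CPU given above.

```python
# model name -> (audio minutes, processing minutes), from the measurements above.
MEASURED = {"small": (2.0, 1.5), "large": (2.0, 8.0)}


def realtime_factor(model: str) -> float:
    """Processing minutes needed per minute of audio (lower is faster)."""
    audio_minutes, processing_minutes = MEASURED[model]
    return processing_minutes / audio_minutes


def estimate_minutes(model: str, audio_minutes: float) -> float:
    """Estimate processing time, assuming it grows linearly with audio length."""
    return realtime_factor(model) * audio_minutes
```

Under that assumption, a 10 minute recording would take roughly 7.5 minutes with the small model and roughly 40 minutes with the large model on my machine.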

Ubuntu 23.04 will be released soon; I will then try again to install the GPU drivers and set up Whisper to use the GPU. For now, I'm just happy that this works.

Using Whisper for Blogging & Writing

During my tests I discovered that just starting to speak without planning was difficult because I had to make it up as I went along, which initially resulted in a somewhat incoherent train of thought. On my second attempt, I wrote an outline of what I wanted to talk about as bullet points. Things were then much smoother and I had no more (well, at least far fewer) problems giving a coherent talk. The result was good enough that all it would need is some minor editing to turn it into something useful.

My conclusion is that Whisper can save you lots of typing and let you focus on your content, although the process is not the same as when you're writing. But especially for people who are not very proficient at typing, Whisper should be a significant improvement in the speed at which they can produce content and save them some time and frustration.

Using Whisper through an Online Service

Given the current explosion of services based on OpenAI, there will be (and probably already are) a ton of services which use Whisper to provide audio transcription. If you intend to use Whisper seriously and regularly for your work, then given the time it saves you, it is probably well worth paying a subscription fee to such a service. You'll then be able to focus on generating your content rather than messing around with software, which for most people (emphasis on 'most', we software developers are a weird bunch) is probably the right decision to make.

Final Thought

Are you ready to perform some magic on your PC with minimal effort? OpenAI/Whisper is at your service, to be your perfect scribe, never tiring, never complaining and never making an error. It's almost like magic. OK, magic might be a bit of an exaggeration, but Whisper's capabilities are really impressive. Also, when you see it maxing out all your CPU cores, it gives you new respect for the work your brain does effortlessly all day long with minimal energy use. Whisper and your brain are probably not very comparable, but you probably/hopefully get the point I'm trying to make here ...