Can a computer analyze audio quicker than real-time playback?

audio, digital-audio, transcription

So let’s say that your computer is transcribing audio (of someone speaking) to text. Because it’s looking at the digital values of the audio, does it “render” the transcription quicker than the time it takes to play it in real time? I would imagine that it is not “listening” like a human would, rather it processes it digitally. Am I right in this assumption?

The same question would apply to analyzing video.

My confusion is: When playing audio back at a faster rate, the words become unclear, so how does the computer compensate for that? Excuse me if I am missing something basic and fundamental here.


Edit: When I use the term “real time” in this question, I don't mean at the time of recording, and then transcribing in real time. Rather, I mean playback at 1x speed (or real time playback speed). It seems some people didn't catch what I meant.

Best Answer

Yes. Absolutely.

Algorithms can process data as fast as they can read it and get it through the CPU.

If the data is on disk, for example, a modern NVMe drive can read at 5+ GB/s, which is far faster than the bit rates normally used to store voice data. Of course, the actual algorithm being applied can be more or less complex, so we cannot guarantee that the audio will be processed at the maximum read speed, but nothing inherent limits such analysis to real-time speed.
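To put rough numbers on that gap, here is a back-of-the-envelope sketch. The 5 GB/s figure comes from above; the two audio formats (16 kHz 16-bit mono for speech, 44.1 kHz 16-bit stereo for CD-quality audio) and the helper name `realtime_multiple` are my own assumptions for illustration, and the result is only an upper bound set by read speed, not by the transcription algorithm.

```python
def realtime_multiple(read_bytes_per_s: float,
                      sample_rate_hz: int,
                      bytes_per_sample: int,
                      channels: int) -> float:
    """Seconds of uncompressed PCM audio that can be read per wall-clock second."""
    audio_bytes_per_s = sample_rate_hz * bytes_per_sample * channels
    return read_bytes_per_s / audio_bytes_per_s


NVME_READ = 5e9  # 5 GB/s, the figure quoted in the answer

# Assumed formats, chosen only as typical examples.
speech = realtime_multiple(NVME_READ, 16_000, 2, 1)    # 16 kHz, 16-bit, mono
cd_audio = realtime_multiple(NVME_READ, 44_100, 2, 2)  # 44.1 kHz, 16-bit, stereo

print(f"speech PCM: about {speech:,.0f}x real time")
print(f"CD-quality PCM: about {cd_audio:,.0f}x real time")
```

Even for uncompressed audio, the disk can deliver tens of thousands of seconds of sound per second, so in practice the limit is how much CPU time the analysis itself needs.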

The same principle applies to video, but video requires much more throughput because of the huge amount of data in such files. How much depends on the resolution, the frame rate and the complexity of the analysis. It is actually difficult to perform sophisticated video analysis in real time, because analysis is almost always done on decompressed video: the processor must decode and analyze each block within a short time budget and keep data flowing, so that by the time one analysis step is done, the next block of video is already decoded and in memory. This is something I worked on for almost a decade.
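For a sense of scale, the following sketch estimates the data rate of decompressed video and the time available per frame. The 1080p/30 fps and 4K/60 fps cases and the 3 bytes per pixel (8-bit RGB) are assumptions for illustration; decoders often output YUV 4:2:0 instead, which is roughly half that size.

```python
def raw_video_bytes_per_s(width: int, height: int, fps: int,
                          bytes_per_pixel: int = 3) -> int:
    """Data rate of decompressed video (assumes 3 bytes per pixel, i.e. 8-bit RGB)."""
    return width * height * bytes_per_pixel * fps


def frame_budget_ms(fps: int) -> float:
    """Time available to decode and analyze one frame when keeping up with real time."""
    return 1000.0 / fps


full_hd = raw_video_bytes_per_s(1920, 1080, 30)
uhd = raw_video_bytes_per_s(3840, 2160, 60)

print(f"1080p at 30 fps: {full_hd / 1e6:.1f} MB/s, {frame_budget_ms(30):.1f} ms per frame")
print(f"4K at 60 fps: {uhd / 1e6:.1f} MB/s, {frame_budget_ms(60):.1f} ms per frame")
```

Compared with the tens of kilobytes per second of speech audio, decompressed video runs to hundreds of megabytes per second, which is why keeping the decode and analysis pipeline fed is the hard part.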

When you play audio back faster, the words become unclear to you, but the data is exactly the same. The speed at which the audio is processed does not affect the algorithm's ability to understand it, because the software knows exactly how much time each audio sample represents.
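A minimal sketch of that last point, using a synthetic signal rather than real speech: the position of an event in "audio time" is just the sample index divided by the sample rate, so it comes out the same however quickly the loop runs. The 16 kHz rate, the 25 ms frame size, the tone burst at 2.5 s and the energy threshold are all made-up values for illustration; a real transcriber is far more complex than this energy detector.

```python
import math
import time

SAMPLE_RATE = 16_000  # samples per second (assumed)
DURATION_S = 10       # length of the synthetic recording
FRAME = 400           # 400 samples = 25 ms at 16 kHz


def make_signal():
    """Quiet 440 Hz tone with a loud burst from 2.5 s to 3.0 s of audio time."""
    samples = []
    for n in range(SAMPLE_RATE * DURATION_S):
        t = n / SAMPLE_RATE  # time this sample represents
        amplitude = 0.8 if 2.5 <= t < 3.0 else 0.01
        samples.append(amplitude * math.sin(2 * math.pi * 440 * t))
    return samples


def first_loud_frame_time(samples, threshold=0.1):
    """Return the audio-time offset (seconds) of the first frame whose RMS exceeds threshold."""
    for start in range(0, len(samples) - FRAME + 1, FRAME):
        frame = samples[start:start + FRAME]
        rms = math.sqrt(sum(x * x for x in frame) / FRAME)
        if rms > threshold:
            return start / SAMPLE_RATE  # sample index -> seconds
    return None


signal = make_signal()
t0 = time.perf_counter()
onset = first_loud_frame_time(signal)
elapsed = time.perf_counter() - t0

print(f"burst starts at {onset:.3f} s of audio time")
print(f"scanned {DURATION_S} s of audio in {elapsed:.4f} s of wall-clock time")
```

The reported onset depends only on the sample index, while the wall-clock time depends only on how fast the machine is; the two are unrelated, which is why speeding up the processing does not "blur" anything for the algorithm.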
