Challenges with Encoding
To convert speech data into text, we needed a lossless encoding. One popular option is FLAC (Free Lossless Audio Codec). We started with libflac.js to capture the audio in FLAC format, and then we hit a browser issue: in Firefox the buffers would not clear properly, which caused the browser to hang on long responses. So we went back to WAV, which is the native format supported by most browsers. It was much easier to handle and got us past the memory leak we were facing with FLAC. On the flip side, since no compression was involved, the files were 5 to 6 times bigger. You win some, you lose some, I guess.
Problems with Large audio files
We persisted with the WAV format because of the browser issues, and because we lacked the time and ability to dig into why FLAC would not work reliably. But larger files meant we had to transfer more data over the wire. This was challenging where bandwidth was limited, and users soon started raising issues. We had to move fast.
Data chunking
To overcome this, we started sending the data in smaller chunks, with a limit of about 1 MB each. A response would now arrive as 10 to 20 files of ~1 MB each, depending on the length of the user's response. The next challenge was to merge them into a single audio file.
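The chunking step can be sketched roughly as follows. This is a minimal illustration in Python rather than our browser code, and it assumes each chunk is written as a standalone WAV file with its own header. The function name `pcm_to_wav_chunks` and the audio parameters (mono, 16-bit, 16 kHz) are hypothetical values picked for the example.

```python
import io
import wave
from typing import List

CHUNK_BYTES = 1024 * 1024  # ~1 MB of audio data per chunk


def pcm_to_wav_chunks(pcm: bytes, channels: int = 1, sample_width: int = 2,
                      frame_rate: int = 16000,
                      chunk_bytes: int = CHUNK_BYTES) -> List[bytes]:
    """Split raw PCM audio into standalone WAV files of about chunk_bytes each.

    The default audio parameters are assumptions made for this sketch.
    """
    frame_size = channels * sample_width
    # Round the chunk size down to whole frames so no sample is cut in half.
    step = max(frame_size, (chunk_bytes // frame_size) * frame_size)
    chunks = []
    for start in range(0, len(pcm), step):
        buf = io.BytesIO()
        with wave.open(buf, "wb") as writer:
            writer.setnchannels(channels)
            writer.setsampwidth(sample_width)
            writer.setframerate(frame_rate)
            writer.writeframes(pcm[start:start + step])
        # Each chunk carries its own WAV header in front of the audio data.
        chunks.append(buf.getvalue())
    return chunks
```

Under these assumed parameters one second of audio takes 32,000 bytes, so a 1 MB chunk holds roughly 33 seconds of speech.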
Merging audio files
Our application is built on Google Cloud, so we decided to rely on the merge feature of Google Cloud Storage. The file merge was quick to implement, but although the files were merged, the result was not a complete audio file because of the WAV format. As the layout of a WAV file shows, each file starts with its own header describing the audio data that follows, so you simply can't get a valid merged WAV file just by appending bytes. To overcome this, we needed an audio library that can merge two audio files into one. We used the SoX library for that.
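To illustrate what a proper merge has to do, here is a minimal sketch that uses Python's standard `wave` module as a stand-in for SoX. It is not our production code, and the function name `merge_wav_chunks` is made up for the example. The idea is to read the audio frames out of every chunk and write them back under a single header whose length field covers all of them.

```python
import io
import wave
from typing import Iterable


def merge_wav_chunks(chunks: Iterable[bytes]) -> bytes:
    """Merge standalone WAV files into one WAV file with a single header."""
    out = io.BytesIO()
    writer = None
    expected = None
    for chunk in chunks:
        with wave.open(io.BytesIO(chunk), "rb") as reader:
            fmt = (reader.getnchannels(), reader.getsampwidth(),
                   reader.getframerate())
            if writer is None:
                # The first chunk decides the format of the merged file.
                expected = fmt
                writer = wave.open(out, "wb")
                writer.setnchannels(fmt[0])
                writer.setsampwidth(fmt[1])
                writer.setframerate(fmt[2])
            elif fmt != expected:
                raise ValueError(f"chunk format {fmt} differs from {expected}")
            # Copy only the audio frames; the chunk's own header is dropped.
            writer.writeframes(reader.readframes(reader.getnframes()))
    if writer is None:
        raise ValueError("no chunks to merge")
    writer.close()  # finalizes the header with the total data length
    return out.getvalue()
```

With the SoX command line, passing several input files followed by one output file, as in `sox chunk1.wav chunk2.wav merged.wav`, concatenates them in a similar way.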
On-demand Audio Processing VM
Installing the audio library on our application fleet would mean taking a hard dependency on it. We knew that at some point we would try to solve this problem through streaming, so this was going to be a short-term solution. So we launch an on-demand audio processing instance through a Cloud Function. The audio processor has one primary job: to merge the chunk files into a single audio file. It then also does the speech-to-text processing. The diagram below explains the details, with these components:
- InterviewParrot Core app
- Cloud Storage for saving chunk files and the merged file
- Message Queue for sending the merge instructions
- Cloud function to listen to merge instruction and launch audio processing VM
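The hand-off between the message queue and the Cloud Function could look roughly like the sketch below. Everything in it is hypothetical: the JSON message fields (`bucket`, `chunk_prefix`, `output_name`), the `MergeInstruction` type and the injected `launch_vm` callable are assumptions made for illustration, not our actual message schema or any Google Cloud API.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MergeInstruction:
    """Hypothetical merge instruction carried on the message queue."""
    bucket: str        # bucket holding the chunk files
    chunk_prefix: str  # common name prefix of the chunks to merge
    output_name: str   # object name for the merged audio file


def parse_merge_instruction(payload: bytes) -> MergeInstruction:
    """Decode a JSON payload and check that all assumed fields are present."""
    body = json.loads(payload.decode("utf-8"))
    required = ("bucket", "chunk_prefix", "output_name")
    missing = [key for key in required if not body.get(key)]
    if missing:
        raise ValueError(f"merge instruction is missing: {', '.join(missing)}")
    return MergeInstruction(body["bucket"], body["chunk_prefix"],
                            body["output_name"])


def handle_merge_message(payload: bytes,
                         launch_vm: Callable[[MergeInstruction], str]) -> str:
    """Entry point sketch: parse the instruction, then start the processor.

    launch_vm is injected so the VM start-up logic stays out of this sketch;
    it is assumed to return an identifier for the launched instance.
    """
    instruction = parse_merge_instruction(payload)
    return launch_vm(instruction)
```

Passing the launcher in as a parameter keeps the handler easy to test without starting a real VM.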
This is by no means the best way to solve this problem, but it is how we solved it, and we strive to make it better. The good thing is that it is working for us in production and has been stable so far, a benchmark we like to follow.