Meta’s AI-assisted audio codec claims 10x compression rate compared to MP3s


Encodec uses artificial intelligence to maximize efficiency and save bandwidth.

TL;DR: Encodec is a new-generation audio codec built on a neural network design, a system that can squeeze a lot of sound into minimal storage space. The codec is intended to support the metaverse and to optimize mobile phone calls.

Thanks to its high efficiency and built-in support in iconic products such as the legendary Winamp player, the MP3 codec became the de facto standard for sharing audio files on the Internet from the nineties onward. Now Meta's new codec aims to make history again, offering even greater efficiency gains and bandwidth savings. The secret lies in an artificial-intelligence algorithm capable of “hyper-compressing” audio streams.

Meta researchers have conceptualized Encodec as a potential solution to support “current and future” high-quality experiences in the metaverse. The new technology is a neural network trained to “push the boundaries of what is possible” in audio compression for online applications. The system can achieve “approximately 10x compression” compared to the MP3 standard.

Meta trained the AI end to end to hit a specific target size after compression. The encoder can compress a 64 kbit/s MP3 stream down to 6 kbit/s, which works out to roughly 750 bytes (yes, bytes) per second of audio, while maintaining quality comparable to the original. Researchers also say the codec can compress 48 kHz stereo audio for speech, a first in the industry.
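To make the numbers concrete, here is the back-of-envelope arithmetic behind those bitrates (assuming the conventional 1 kbit = 1,000 bits; Encodec's exact framing may differ):

```python
def bytes_per_second(kbit_per_s: float) -> float:
    """Convert a bitrate in kbit/s to bytes per second (1 kbit = 1,000 bits)."""
    return kbit_per_s * 1000 / 8

mp3_rate = 64  # kbit/s, the reference MP3 stream
enc_rate = 6   # kbit/s, the Encodec target

print(bytes_per_second(mp3_rate))  # 8000.0 bytes per second of audio
print(bytes_per_second(enc_rate))  # 750.0 bytes per second of audio
print(mp3_rate / enc_rate)         # roughly a 10x reduction
```

At 6 kbit/s, a full minute of audio fits in about 45 kilobytes.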

The AI-based approach can compress and decompress audio in real time while achieving state-of-the-art size reduction, with potentially impressive results, as an example published on Meta's AI blog shows. Classical codecs, such as MP3, Opus or EVS, decompose the signal into different frequency bands and encode them as efficiently as possible using psychoacoustics (the study of human perception of sound). Encodec, by contrast, is built around a scheme consisting of three parts: an encoder, a quantizer and a decoder.

The encoder takes uncompressed data and converts it into a higher-dimensional representation at a lower frame rate. The quantizer compresses this stream to the target size, while preserving the most important information for restoring the original signal. Finally, the decoder converts the compressed signal into a waveform that is “as similar as possible to the original.”
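The three-stage pipeline can be sketched in a few lines of toy code. This is emphatically not Encodec's neural network — the real encoder, quantizer and decoder are all learned models — but a minimal stand-in (block averaging, uniform 8-bit quantization, upsampling) that shows how the stages chain together:

```python
import numpy as np

FRAME = 4  # samples per frame: the encoder lowers the frame rate 4x

def encode(signal: np.ndarray) -> np.ndarray:
    """Encoder stand-in: lower-frame-rate representation (mean of each frame)."""
    return signal.reshape(-1, FRAME).mean(axis=1)

def quantize(latent: np.ndarray) -> np.ndarray:
    """Quantizer stand-in: map each value in [-1, 1] to one of 256 levels."""
    return np.round((latent + 1) / 2 * 255).astype(np.uint8)

def decode(codes: np.ndarray) -> np.ndarray:
    """Decoder stand-in: dequantize and upsample back to a waveform."""
    latent = codes.astype(np.float64) / 255 * 2 - 1
    return np.repeat(latent, FRAME)

t = np.linspace(0, 1, 64, endpoint=False)
original = np.sin(2 * np.pi * 3 * t)  # toy "audio": a 3 Hz sine wave
reconstructed = decode(quantize(encode(original)))

print(reconstructed.shape)                     # same length as the input
print(np.abs(original - reconstructed).max())  # reconstruction error
```

The compressed stream here is 16 one-byte codes for 64 input samples; in Encodec, each stage is a trained network and the quantizer allocates its budget to the perceptually important information rather than quantizing uniformly.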

The Encodec machine learning model identifies sound changes that are inaudible to humans, using discriminators to improve the perceived quality of the generated samples. Meta described this process as a “cat-and-mouse game” in which the discriminator tries to distinguish the original samples from the reconstructed ones. The end result is excellent audio compression for low-bitrate speech (from 1.5 kbit/s to 12 kbit/s).

According to Meta, Encodec can encode and decode audio in real time on a single CPU core, and there is still room for improvement toward even smaller files. Besides supporting next-generation metaverse experiences over today's Internet connections, the new model could also deliver higher-quality phone calls in areas where mobile coverage is far from optimal.

