Meta Introduces the 1st Self-Supervised Algorithm for Speech, Vision & Text
Meta announced data2vec, the first high-performance self-supervised algorithm that learns in the same way across multiple modalities, including speech, vision and text. Most machines today learn exclusively from labelled data.
However, through self-supervised learning, machines are able to learn about the world just by observing it and then figuring out the structure of images, speech or text. This is a more scalable and efficient approach for machines to tackle new complex tasks, such as understanding text for more spoken languages.
Self-supervised learning algorithms for images, speech, text and other modalities function in very different ways, which has limited researchers' ability to apply them more broadly. Because an algorithm designed for understanding images can't be directly applied to reading text, it's difficult to push several modalities ahead at the same rate. With data2vec, Meta has developed a unified way for models to predict their own representations of the input data, regardless of whether it's images, speech or text. By focusing on these representations rather than on pixels, waveforms or words, a single algorithm can work with completely different types of input.
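To make the idea concrete, here is a minimal, hedged sketch of a data2vec-style training step in PyTorch: a student network regresses the latent representations that a teacher network (an exponential moving average of the student) produces for masked parts of the input. This is an illustration of the general technique, not Meta's released implementation; all class names, hyperparameters and the simple zero-out masking are assumptions made for the example, and details such as averaging targets over several teacher layers are omitted.

```python
# Illustrative sketch of a data2vec-style objective: predict the teacher's
# latent representations at masked positions. Names and settings here are
# hypothetical, not Meta's actual code.
import copy
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Generic Transformer encoder over pre-embedded inputs of any modality."""
    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        return self.blocks(x)

def ema_update(teacher, student, decay=0.999):
    """Update teacher weights as an exponential moving average of the student."""
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1 - decay)

def training_step(student, teacher, x, mask_ratio=0.15):
    """One self-supervised step: regress teacher representations at masked positions."""
    # The teacher sees the full, unmasked input and provides the targets.
    with torch.no_grad():
        targets = teacher(x)

    # The student sees a masked version of the same input.
    mask = torch.rand(x.shape[:2]) < mask_ratio      # (batch, sequence)
    x_masked = x.clone()
    x_masked[mask] = 0.0                             # simple zero-out masking (assumption)
    preds = student(x_masked)

    # Regression loss only at the masked positions.
    return nn.functional.smooth_l1_loss(preds[mask], targets[mask])

# Usage: the same loop works whether `x` holds embedded image patches,
# speech frames or text tokens, because the target is a representation,
# not pixels, waveforms or words.
student = Encoder()
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

x = torch.randn(2, 32, 256)                          # (batch, sequence, embedding dim)
loss = training_step(student, teacher, x)
loss.backward()
ema_update(teacher, student)
print(f"loss: {loss.item():.4f}")
```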
With data2vec, Meta is closer to building machines that learn about different aspects of the world around them without having to rely on labelled data. This paves the way for more general self-supervised learning and brings us closer to a world where AI might use videos, articles, and audio recordings to learn about complicated subjects, such as the game of soccer or different ways to bake bread. Data2vec should also help develop more adaptable AI, able to perform tasks beyond what's possible today.
If you're a researcher interested in building upon this work, you can access the open-source code and pretrained models released on GitHub.
News Source: Meta Newsroom