New Deep Learning Strategy Could Enhance Computer Vision

A deep learning system takes textual hints from the context of images to describe them without the need for prior human annotations.

Since its humble beginnings at the turn of the millennium, deep learning, as both a scientific discipline and an industry, has come a long way.

From smartphone assistants to pattern recognition software, security solutions, and other applications, deep learning is becoming a multi-billion dollar business poised for great growth over the few next years.

Self-Supervised Deep Learning

The power and appeal of deep learning is all about their ability to recognize different types of patterns like faces, voices, objects, images, and codes.

AI software doesn’t understand what these things really are, and all they see is digital data, and they’re pretty good at that.

The great computer vision capability of deep learning algorithms enable them to tell these things apart, categorize, and classify them.

To do so, however, this software needs to be supervised.

They require human manual input in the form of annotations to guide them before they generalize and build on what they learned into new, similar situations.

Building and labeling large datasets is a complicated and time-consuming task.

Unsupervised machines will be completely autonomous as all they need is data taken directly from their environment. From there, they would take the information to make predictions and yield the expected results.

To design unsupervised, or self-supervised deep learning systems, computer scientists take inspirations from how human intelligence works.

Now, an international team of computer vision scientists has devised a method to enable deep learning software to learn the visual features of images without the need for annotated examples.

Researchers from Carnegie Mellon University (U.S.), Universitat Autonoma de Barcelona (Spain), and the International Institute of Information Technology (India), worked on the study,

Unsupervised Computer Vision Algorithms, it’s a Matter of Semantics

In the study, the team built computational models that use textual information about images found on websites, like Wikipedia, and linked them to the visual features of these images.

“We aim to give computers the capability to read and understand textual information in any type of image in the real world,” said Dimosthenis Karatzas, a research team member.

In the next step, researchers used the models to train deep learning algorithms to pick adequate visual features that textually describe images.

Instead of labeled information about the content of a particular image, the algorithm takes non-visual cues from the semantic textual information found around the image.

“Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally-supervised approaches,” wrote researchers in the paper.

This is not a fully unsupervised system as algorithms still need models to train on, but the technique shows that deep learning algorithms can tap into the internet to enhance their unsupervised learning abilities.

“We will continue our work on the joint-embedding of textual and visual information,” said Karatzas. “looking for novel ways to perform semantic retrieval by tapping on noisy information available in the Web and Social Media.”