AI Vision Is Biased Toward Texture, Not Shape

The human brain instantly recognizes objects in rich, cluttered scenes, whether in real life, in images, or in videos. Spotting the fake ones, however, is another story.

Recognizing objects isn’t as easy for artificial intelligence. It takes a lot of data and training time (and carbon emissions!) for object detection algorithms to develop the ability to identify objects.

While the field of object detection technology has grown significantly in recent years, there’s still a lot to be learned about AI vision.

How do deep learning algorithms see things, anyway?

The Texture Bias of AI Vision Algorithms

Deep learning models, like convolutional neural networks (CNNs), are generally thought to detect and recognize objects by learning to tell increasingly complex shapes apart.

If you want a CNN to recognize cats, you have to feed it a large dataset of images showing cats in every imaginable posture until it reliably recognizes the shape of a cat.

However, the slightest noise in an image, such as a change in the color of a small block of pixels, or a deliberate adversarial attack, can leave the neural network scratching its head as to what it’s looking at.
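To make the idea of a small pixel-block change concrete, here is a minimal sketch in plain Python. It is only an illustration: the helper name perturb_block, the 8x8 gray test image, and the color shift are all assumptions made up for this example, and it does not reproduce any specific attack.

```python
import copy

def perturb_block(image, top, left, size, delta):
    """Return a copy of `image` with `delta` added to one size x size block.

    `image` is a list of rows; each row is a list of (r, g, b) tuples in 0..255.
    (Hypothetical helper, written only for this illustration.)
    """
    out = copy.deepcopy(image)
    height, width = len(out), len(out[0])
    for y in range(top, min(top + size, height)):
        for x in range(left, min(left + size, width)):
            # Shift each channel and clamp the result to the valid range.
            out[y][x] = tuple(
                max(0, min(255, c + d)) for c, d in zip(out[y][x], delta)
            )
    return out

# A flat gray 8x8 "image" stands in for a real photo (assumption).
image = [[(128, 128, 128)] * 8 for _ in range(8)]
noisy = perturb_block(image, top=2, left=2, size=2, delta=(60, -20, 0))

changed = sum(
    1 for y in range(8) for x in range(8) if noisy[y][x] != image[y][x]
)
print(changed, "of 64 pixels changed")
```

Only a handful of pixels differ between the two images, which is the kind of change a person would barely notice but a classifier may react to.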

A group of MIT students fooled Google’s cloud-based AI vision system into mistaking a cat for a bowl of guacamole, and a 3D-printed turtle for a rifle.

But why are CNNs that easy to trick? German researchers think they have the answer.

A research study from the University of Tübingen in Germany suggests that image textures could play a more prominent role in AI vision than the shapes of the objects themselves.

For their study, the team devised a series of tests for AI vision models and humans to see how they would fare with images with a “texture-shape cue conflict.” Ninety-seven human participants and four CNNs took part in the experiment to identify the objects and animals shown in a series of images.

Both the human observers and the CNNs correctly recognized the objects in almost all the undistorted images. But the CNNs fared worse than the human participants on images with the textures removed, struggling to work from an object’s shape alone. When shape and texture pointed to different answers, the networks tended to follow the texture: a cat with an elephant texture is simply an elephant to a neural network.
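One simple way to score a cue-conflict test like this is to count how often an observer's answer matches the shape label and how often it matches the texture label. The sketch below assumes that scoring scheme. The function name shape_bias and every trial in it are invented for illustration and are not data from the study.

```python
def shape_bias(trials):
    """Fraction of shape decisions among trials decided by shape or texture.

    Each trial is (shape_label, texture_label, predicted_label), where the
    two cue labels differ. Predictions matching neither cue are ignored.
    """
    shape_hits = sum(1 for s, t, p in trials if p == s)
    texture_hits = sum(1 for s, t, p in trials if p == t)
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")

# Made-up responses for a hypothetical CNN and a hypothetical human observer.
cnn_trials = [
    ("cat", "elephant", "elephant"),  # follows the texture
    ("car", "clock", "clock"),        # follows the texture
    ("bear", "bottle", "bear"),       # follows the shape
    ("dog", "bicycle", "bicycle"),    # follows the texture
    ("chair", "knife", "boat"),       # matches neither cue, ignored
]
human_trials = [
    ("cat", "elephant", "cat"),
    ("car", "clock", "car"),
    ("bear", "bottle", "bear"),
    ("dog", "bicycle", "bicycle"),
    ("chair", "knife", "chair"),
]

cnn_score = shape_bias(cnn_trials)
human_score = shape_bias(human_trials)
print(f"CNN shape bias:   {cnn_score:.2f}")
print(f"Human shape bias: {human_score:.2f}")
```

A score near 1 means the observer mostly follows shape, and a score near 0 means it mostly follows texture.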

In the paper, submitted to the International Conference on Learning Representations (ICLR), the authors wrote:

“ImageNet-trained CNNs are strongly biased towards recognizing textures rather than shapes, which is in stark contrast to human behavioral evidence and reveals fundamentally different classification strategies. We then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on ‘Stylized-ImageNet,’ a stylized version of ImageNet.”

According to the researchers, their experiment highlights the benefit of shape-based representation and its potential in helping build object detection and image recognition software that can’t be easily thrown off by image distortions.

More efficient and robust AI vision systems are critical in dynamic situations, such as a self-driving car on the road, where textures are constantly changing around moving and stationary objects.

“Current state-of-the-art CNNs are very susceptible to random noise such as rain or snow in the real world, a problem for autonomous driving. The fact that the shape-based CNN that I trained turned out to be much more robust on nearly all tested sorts of noise seems like a promising result on the way to more robust models,” Robert Geirhos, coauthor of the paper, told The Register.