Cow-Sharks: exploring Shape vs. Texture biases in Deep Neural Networks

Christian Wolf
Nov 30, 2020 · 4 min read

The context for this post is a set of recent papers providing evidence that the prediction performance of deep networks trained for image classification is largely due to texture biases, i.e. networks take decisions based on textures / small patches:

I then stumbled across this nice image on Twitter, shared by https://twitter.com/SolTight.

What follows is a summary of a Twitter thread I started.

I passed this image through a pretrained ResNet-101 with PyTorch, and at first sight the results seem counterintuitive. It predicts:

Tiger shark 23%
Hammerhead 21%
Great white shark 16%
Gar, garfish 11%
Sturgeon 3%

So … at first sight, shape > [ texture + mountain context ], at least for this example. Note that we do not really know what the answer should be … the species does not exist. But for ResNet-101, the shape of the object seems to matter more than its color, its texture, and the fact that the background is a mountain landscape and not under water.
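For readers who want to reproduce this kind of top-5 prediction, here is a minimal sketch using torchvision's pretrained ResNet-101. The file name and the printed class indices are placeholders (mapping indices to ImageNet label strings is done separately), not the exact code I used:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing: resize, center crop, normalize
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet101(pretrained=True).eval()

img = Image.open("cow_shark.jpg").convert("RGB")   # hypothetical file name
x = preprocess(img).unsqueeze(0)                   # add batch dimension

with torch.no_grad():
    probs = torch.softmax(model(x), dim=1)[0]      # class probabilities

top5 = torch.topk(probs, 5)
for p, idx in zip(top5.values, top5.indices):
    print(f"class {idx.item()}: {p.item():.1%}")   # look up ImageNet label for idx
```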

A nice baseline is to pass the same image through BagNet, which roughly divides the image into small patches, creates a representation per patch, and then puts all these representations into a “bag”, i.e. a structure which does not model spatial relationships between the patches. This kind of neural network has a strong inductive bias towards texture, since shape requires spatial relationships, at least when the scale of the shape is larger than the patches of the decomposition.
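To make the idea concrete, here is a toy sketch of the bag-of-local-features principle: a convnet whose receptive field is limited to small patches produces per-patch class evidence, which is then averaged over spatial positions. This is only an illustration of the mechanism, not the actual BagNet-33 architecture (which is a ResNet-50 variant with a restricted receptive field):

```python
import torch
import torch.nn as nn

class ToyBagNet(nn.Module):
    """Toy illustration of the BagNet idea: classify small patches
    independently, then average the per-patch class evidence.
    Not the real BagNet-33 model."""
    def __init__(self, num_classes=1000, patch=33):
        super().__init__()
        # A small convnet whose receptive field is limited to `patch` pixels,
        # applied densely over the image (stride 8 chosen arbitrarily here).
        self.patch_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=patch, stride=8),
            nn.ReLU(),
            nn.Conv2d(64, num_classes, kernel_size=1),  # per-patch class logits
        )

    def forward(self, x):
        logits_per_patch = self.patch_encoder(x)        # (B, C, H', W')
        # "Bag" step: average over spatial positions, discarding where each
        # patch came from, i.e. no spatial relationships between patches.
        return logits_per_patch.mean(dim=(2, 3))        # (B, C)
```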

So, BagNet provides the following results:

Sorrel 81%
Ibizan hound, Ibizan Podenco 11%
Basenji 4%
Gazelle 2%
Ox 1%

This is not surprising: the prediction is now clearly biased towards textures, which is all BagNet actually sees.

But if we investigate this a bit further and check the results of the ResNet-101 model used above, we find that our intuition (shape > texture) might also be wrong for this example. Let’s check the results for two different cropped images:

ResNet-101
Hammerhead 52%
Great white shark 25%
Tiger shark 16%
Killer whale 6%
Sea lion 0.1%

BagNet-33
Electric ray 10%
Sturgeon 9%
Sorrel 7%
Tiger shark 7%
Stingray 5%

And:

ResNet-101
Sorrel 92%
Ox 6%
Rhodesian ridgeback 0.4%
Ibizan hound 0.2%
Boxer 0.2%

BagNet-33
Sorrel 24%
Gyromitra 16%
Ox 6%
Hog, pig 4%
Flute, transverse flute 4%

Clearly, cropping more and more information from the image brings the prediction of the more powerful ResNet-101 closer to that of the texture-biased BagNet. In particular, the fin on the back of the creatures seems to be a killer feature.
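Cropping and re-running the models is straightforward; here is a sketch, reusing the `preprocess` pipeline and `model` from the first snippet. The crop boxes below are placeholders, not the exact regions shown above:

```python
import torch
from PIL import Image

img = Image.open("cow_shark.jpg").convert("RGB")   # hypothetical file name

# PIL crop boxes are (left, upper, right, lower) in pixels; placeholder values.
crop_fin  = img.crop((200, 50, 500, 300))    # e.g. region around the dorsal fin
crop_body = img.crop((100, 250, 600, 500))   # e.g. region around the body

for name, crop in [("fin", crop_fin), ("body", crop_body)]:
    x = preprocess(crop).unsqueeze(0)         # same preprocessing as before
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]
    print(name, torch.topk(probs, 5).indices.tolist())
```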

A sanity check of the model outputs on a cow:

ResNet-101
Ox 93%
Oxcart 3%
Gazelle 0.8%
Water buffalo, water ox 0.8%
Ram, tup 0.7%

BagNet-33
Ox 98%
Oxcart 0.5%
Arabian camel, dromedary, Camelus dromedarius 0.3%
Water buffalo, water ox, Asiatic buffalo, Bubalus bubalis 0.1%
Bison 0.1%

And finally, one more variant of the original image:

ResNet-101
Hammerhead 53%
Great white shark 14%
Triceratops 8%
Airship 6%
Tiger shark 3%

BagNet-33
Albatross, mollymawk 35%
Goose 15%
Killer whale 7%
Drake 6%
Whippet 4%
