Cow-Sharks: exploring Shape vs. Texture biases in Deep Neural Networks
The context for this post is a set of recent papers providing evidence that the prediction performance of deep networks trained for image classification is largely due to texture biases, i.e. the networks base their decisions on textures / small patches:
- Geirhos et al., ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, ICLR 2019.
- Hermann et al., The Origins and Prevalence of Texture Bias in Convolutional Neural Networks, NeurIPS 2020.
- ???, Shape or Texture: Understanding Discriminative Features in CNNs, submitted to ICLR 2021.
- Brendel et al., Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet, ICLR 2019. (“BagNet”).
I then stumbled across this nice image on Twitter, shared by https://twitter.com/SolTight.
What follows is a summary of a Twitter thread I started.
I passed this image through a pretrained ResNet-101 with PyTorch, and at first sight the results seem counterintuitive. It predicts:
Tiger shark 23%
Great white shark 16%
Gar, garfish 11%
So … at first sight, shape > [ texture + mountain context ], at least for this example. Note that we do not really know what the correct answer should be, since the species does not exist. But for ResNet-101, the shape of the object seems to matter more than color, texture, and the fact that the background is mountains rather than water.
A nice baseline is to pass the same image through BagNet, which roughly works as follows: it divides the image into small patches, creates a representation per patch, and then puts all these representations into a “bag”, i.e. a structure that discards the spatial relationships between patches. This kind of neural network has a strong inductive bias towards texture, since recognizing shape requires spatial relationships, at least when the scale of the shape is larger than the patches of the decomposition.
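The real BagNet weights are available from the authors' repository; the bag-of-local-features idea itself can be sketched in a few lines. The toy per-patch classifier below is a hypothetical stand-in, not the actual BagNet architecture — the point is only that the per-patch scores are averaged, so patch order (and hence global shape) is invisible to the model:

```python
import torch
import torch.nn as nn


class ToyBagNet(nn.Module):
    """Classify each small patch independently, then average the
    per-patch class scores: spatial arrangement of patches is lost."""

    def __init__(self, patch=33, stride=8, num_classes=1000):
        super().__init__()
        self.patch, self.stride = patch, stride
        # Tiny per-patch classifier (stand-in for BagNet's local features)
        self.local = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def forward(self, x):
        # Extract all patch x patch windows: (N, 3*p*p, L) with L patches
        patches = nn.functional.unfold(x, self.patch, stride=self.stride)
        n, _, l = patches.shape
        patches = patches.transpose(1, 2).reshape(n * l, 3, self.patch, self.patch)
        logits = self.local(patches).reshape(n, l, -1)
        return logits.mean(dim=1)  # the "bag": order-free average over patches


model = ToyBagNet().eval()
out = model(torch.randn(1, 3, 224, 224))  # (1, 1000) class scores
```

The 33-pixel patch size mirrors the BagNet-33 variant from the paper; anything larger than the patch scale (such as a shark silhouette) simply cannot be represented.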
So, BagNet provides the following results:
Ibizan hound, Ibizan Podenco 11%
This is not surprising: the prediction is now clearly biased towards textures, which is what BagNet actually sees.
But if we investigate a bit further and check the results of the original ResNet-101 model, we find that our intuition (shape > texture) might be wrong for this example as well. Let’s check the results for two different cropped images:
First crop:
Great white shark 25%
Tiger shark 16%
Killer whale 6%
Sea lion 0.1%
Second crop:
Electric ray 10%
Tiger shark 7%
Rhodesian ridgeback 0.4%
Ibizan hound 0.2%
Hog, pig 4%
Flute, transverse flute 4%
Clearly, cropping away more and more information from the image brings the prediction of the more powerful ResNet-101 closer to that of the texture-biased BagNet. In particular, the fin on the back of the creatures seems to be a killer feature.
A sanity check of the model output on a cow:
Water buffalo, water ox 0.8%
Ram, tup 0.7%
Arabian camel, dromedary, Camelus dromedarius 0.3%
water buffalo, water ox, Asiatic buffalo, Bubalus bubalis 0.1%
And finally, one more variant of the original image:
Great white shark 14%
Tiger shark 3%
Albatross, mollymawk 35%
Killer whale 7%