Mental Model
You’ve got visual data (like an image) that you want a neural network to extract features from, but you don’t want to use a standard Artificial Neural Network (ANN).
Why? Because an ANN flattens the image into a flat list of pixels and feeds it through dense layers, connecting every pixel to every unit. Nothing in that structure says which pixels are neighbors, so every pixel is treated as equally related to every other pixel.
Transclude of convolutional-neural-networks-2026-04-02-21.30.57.excalidraw
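A minimal NumPy sketch of that problem, assuming a made-up 4x4 grayscale image and a dense layer with 3 units (both hypothetical, picked only for illustration). If you shuffle the pixels and shuffle the weight columns the same way, the dense layer produces exactly the same output, which suggests it has no built-in notion of which pixels sit next to each other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4x4 grayscale image and a dense layer with 3 output units.
image = rng.random((4, 4))
flat = image.reshape(-1)                # shape (16,): the 2D layout is gone
weights = rng.random((3, flat.size))    # one weight per (unit, pixel) pair

dense_out = weights @ flat

# Shuffle the pixels, and shuffle the weight columns in the same way.
perm = rng.permutation(flat.size)
shuffled_out = weights[:, perm] @ flat[perm]

# Same result: nothing in the layer encodes which pixels are neighbors.
same_output = np.allclose(dense_out, shuffled_out)
num_weights = weights.size              # 3 units * 16 pixels = 48 weights
```

Here `same_output` comes out `True`, and the weight count grows with (number of units) x (number of pixels), which gets large quickly for real images.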
Images don’t work like that. Pixels close to each other form meaningful structures (like edges), while distant pixels are largely independent. So how do you exploit this spatial structure?
The Exploitation
Instead of connecting every unit in the current layer to every pixel in the previous layer, you only connect it to the pixels within its neighborhood (the ones surrounding the same position).
Transclude of convolutional-neural-networks-2026-04-02-21.14.32.excalidraw
This is what a filter/kernel does.
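A convolutional layer computes the value of each pixel in its feature map by looking at the neighborhood around the same spot in the previous layer and applying its filter to it, i.e. taking a weighted sum of those pixels using the filter’s weights. The same filter is reused at every position.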
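The sliding-filter idea can be sketched in plain NumPy. This is only an illustration under a few assumptions: no padding, stride 1, a single channel, and a hand-picked vertical-edge kernel (in a real CNN the kernel values are learned). The helper name `conv2d_valid` is made up for this note. Strictly speaking it computes a cross-correlation (the kernel is not flipped), which is what deep learning libraries typically do under the name "convolution".

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]            # local neighborhood
            feature_map[i, j] = np.sum(patch * kernel)   # weighted sum with the filter
    return feature_map

# Toy image with a vertical edge: left half dark (0), right half bright (1).
image = np.array([
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
], dtype=float)

# Hand-picked kernel that responds to dark-to-bright transitions from left to right.
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

feature_map = conv2d_valid(image, kernel)
# Each row of feature_map is [0, 3, 3, 0]:
# zero over the flat regions, positive where the window covers the edge.
```

Note that each output value only depends on a 3x3 patch, and the same 9 weights are used everywhere in the image.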
Mental Model
A CNN is a tool that, given a raw image, extracts its features and then uses them to make predictions.
Thus, what it does can be split into 2 phases:
- feature extraction phase
- prediction phase
Its prediction phase is identical to a regular NN, and thus the interesting phase is the feature extraction one.
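The two-phase split can be sketched roughly as below. Everything here is an assumption made for illustration: the function names `extract_features` and `predict`, the 8x8 input, 2 random 3x3 kernels, and 3 output classes. The weights are random and untrained, so the predicted probabilities are meaningless. The point is only the shape of the pipeline.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

def extract_features(image, kernels):
    """Phase 1: apply each kernel over the image, then ReLU, then flatten."""
    windows = sliding_window_view(image, kernels.shape[1:])       # (H', W', kh, kw)
    feature_maps = np.einsum("ijkl,fkl->fij", windows, kernels)   # one map per kernel
    return np.maximum(feature_maps, 0).reshape(-1)

def predict(features, weights, bias):
    """Phase 2: an ordinary dense layer followed by softmax."""
    logits = weights @ features + bias
    exp = np.exp(logits - logits.max())    # subtract max for numerical stability
    return exp / exp.sum()

image = rng.random((8, 8))                          # hypothetical grayscale input
kernels = rng.standard_normal((2, 3, 3))            # 2 untrained 3x3 filters
features = extract_features(image, kernels)         # 2 maps * 6 * 6 = 72 values

weights = rng.standard_normal((3, features.size))   # 3 hypothetical classes
bias = np.zeros(3)
probs = predict(features, weights, bias)            # class probabilities, sum to 1
```

`predict` never sees the image, only the feature vector, which is why that phase looks like any regular NN.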
Feature Extraction Phase
Given an NxM inp
Usually, an activation function is applied to the feature maps of a layer, sometimes followed by a pooling step.
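A small sketch of those two steps, assuming ReLU as the activation and non-overlapping 2x2 max pooling (common choices, but not the only ones). The input values are made up, and `relu` and `max_pool2d` are names chosen for this note.

```python
import numpy as np

def relu(feature_map):
    """Activation step: keep positive responses, zero out negative ones."""
    return np.maximum(feature_map, 0)

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/cols that don't fill a window are dropped."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))         # max inside each size x size block

# Made-up 4x4 feature map, e.g. the output of a convolution.
feature_map = np.array([
    [ 1., -2.,  3.,  0.],
    [-1.,  5., -3.,  2.],
    [ 0., -4.,  1., -1.],
    [ 2.,  1., -2., -6.],
])

activated = relu(feature_map)
pooled = max_pool2d(activated, size=2)
# pooled is [[5, 3],
#            [2, 1]]
```

Pooling shrinks the 4x4 map to 2x2 while keeping the strongest response in each region.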
Transclude of convolutional-neural-networks-2026-04-02-00.24.51.excalidraw
Transclude of convolutional-neural-networks-2026-04-02-00.44.21.excalidraw…
Transclude of convolutional-neural-networks-2026-04-02-14.30.00.excalidraw
Connections
Raw Thoughts
A CNN is a neural network where I decided not to extract the input features with dense layers, since I found a better way to do so, a spatial way: kernels. CNNs use spatial kernels to extract features from raw images. The interesting part about CNNs is how they extract features, not how they classify.
CNNs are explicitly designed to exploit the fact that in visual data, local neighborhoods of pixels are highly correlated, while distant pixels are largely independent. They drop the “fully connected” assumption in ANNs.
This is similar to how Bayesian networks drop the assumption that all nodes affect one another and instead build a hierarchical cause-effect structure.