
linear classifiers

Rania Hashim
May 1, 2024 · 5 min read

Linear classifiers are another classification algorithm, this time based on the parametric approach. A classifier is called a linear classifier if its decision boundary is a linear function of the input. Recall that for KNNs, the decision boundaries are instead determined by the value of K (the number of nearest neighbours) and the distance metric.

In essence, this is what a linear classifier does:



You apply a function F(x, W) to the input image, which spits out a score for each category. For instance, on the CIFAR10 dataset, which has 10 classes, it would give us 10 numbers. The higher the number, the higher the classifier's confidence that the image belongs to that label.

As it is a LINEAR classifier, the function is:

F(x, W) = Wx + b ; where W is the weight matrix, x is the (flattened) pixels of the input image and b is a bias vector that offsets the scores.
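To make that concrete, here's a minimal numpy sketch of the scoring step, using random placeholder weights and a random stand-in image rather than real CIFAR10 data:

```python
import numpy as np

# Minimal sketch: scoring one CIFAR10-sized image with a linear classifier.
# Values are random placeholders; only the shapes matter here.
num_classes = 10
num_pixels = 32 * 32 * 3                              # CIFAR10 images flatten to 3072 values

W = 0.01 * np.random.randn(num_classes, num_pixels)   # weight matrix
b = np.zeros(num_classes)                             # bias vector
x = np.random.rand(num_pixels)                        # stand-in for a flattened input image

scores = W @ x + b                                    # F(x, W) = Wx + b -> 10 class scores
print(scores.argmax())                                # index of the highest-scoring class
```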

This can be understood from three different perspectives:

  1. Algebraic view (manipulation of matrices)
  2. Geometric view (hyperplanes = decision boundaries)
  3. Visual view (one "template" per category)

Let's explore each of these:

algebraic view

Consider a 2x2 pixel grayscale image that we are trying to classify. It has a total of 4 pixels, which are stretched into a column matrix of order 4 x 1.



Let's say that the picture can be classified under 3 categories. Based on the equation F = Wx + b, this can be represented algebraically as:



The weight matrix is of order 3x4, meaning that each row holds one weight per pixel for a particular class. Thus, every row of W is a classifier for one of the classes.

The final product is a 3x1 matrix, with each element being the class score for the class of the corresponding row.
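As a rough sketch, here is the same algebraic picture in numpy, with made-up pixel and weight values chosen only to show the shapes involved:

```python
import numpy as np

# Algebraic view with concrete (made-up) numbers: 4 pixels, 3 classes.
x = np.array([[56.], [231.], [24.], [2.]])     # 4x1 column of pixel values

W = np.array([[ 0.2, -0.5,  0.1,  2.0],        # row 0: weights for class 0
              [ 1.5,  1.3,  2.1,  0.0],        # row 1: weights for class 1
              [ 0.0,  0.25, 0.2, -0.3]])       # 3x4 weight matrix

b = np.array([[ 1.1], [ 3.2], [-1.2]])         # 3x1 bias column

scores = W @ x + b                             # 3x1 column of class scores
print(scores)                                  # one score per class; the largest wins
```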

To further simplify this representation, we can use the bias trick to fold the weights and bias into a single parameter. To do this, we add an additional column to the weight matrix (new order = 3 x 5) containing the bias values, and add an additional row to the pixel matrix with the constant element 1 (new order = 5 x 1). Multiplying these matrices gives us the same final matrix of class scores. The bias trick is illustrated in the image below:



An important insight this gives us is that predictions are linear, i.e.:

F(x, W) = Wx

F(cx, W) = c · Wx = c · F(x, W)
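A small numpy check of both ideas, using random values with the same toy shapes (4 pixels, 3 classes) as the example above:

```python
import numpy as np

# Made-up numbers, toy shapes: 4 pixels, 3 classes.
x = np.array([[56.], [231.], [24.], [2.]])      # 4x1 pixel column
W = np.random.randn(3, 4)                       # 3x4 weights
b = np.random.randn(3, 1)                       # 3x1 biases

# Bias trick: fold b into W as an extra column, append a constant 1 to x.
W_ext = np.hstack([W, b])                       # 3x5
x_ext = np.vstack([x, [[1.]]])                  # 5x1
assert np.allclose(W_ext @ x_ext, W @ x + b)    # same class scores either way

# Linearity of the bias-free scores: scaling the image scales every score.
c = 0.5
assert np.allclose(W @ (c * x), c * (W @ x))    # F(cx, W) = c * F(x, W)
```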

geometric view

As the image is stretched into a column matrix, it exists as a single point in a vector space. The dataset can hence be visualised as a set of points in 4D space (since our 2x2 grayscale image has 4 pixels).

Similarly, each class score, being a linear function of the pixels, is a linear function over this space. In higher dimensions, the decision boundaries between classes are hyperplanes. This helps us understand what a linear classifier can and cannot separate.

While it's difficult to draw a 4D space here, we can imagine squashing all those dimensions into just 2 dimensions to better understand what our classifier is doing, as shown in CS231N:



In this 2D representation, each image is a single point. Each line represents all points in space (i.e. all images) that get a score of 0 for that label. For instance, the red line represents all images that get a score of 0 for the car class. The arrow indicates the direction of increase: scores increase linearly as we move in that direction.

Moreover, if we change one of the rows of W, the corresponding line in pixel space rotates, changing the decision boundary. The geometric interpretation is also useful for understanding the role of the bias terms: without them, every decision boundary would be forced to pass through the origin regardless of the weights. The biases thus provide an offset.
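Here's a tiny made-up 2D example of that last point:

```python
import numpy as np

# Toy 2D illustration of the bias: the set of points scoring 0 for a class is a
# line, and without a bias that line is forced through the origin.
w = np.array([1.0, -2.0])     # one row of W, squashed down to 2 "pixels"
b = 3.0

origin = np.zeros(2)
print(w @ origin)             # 0.0 -> with no bias, the origin always scores 0
print(w @ origin + b)         # 3.0 -> the bias shifts the zero-score line off the origin
```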

visual view

Yet another interpretation of how linear classifiers work is the visual view: each row of weights in W is thought of as a "template" for its class. The template is a prototype for the class.

The classifier compares the input image to the template for each class by computing the dot product between that row of weights and the input image vector. This measures how well the template matches the features present in the image. The dot products give the scores for each class.

Here, what we're essentially doing is breaking the weight matrix up into 3 separate row vectors, one template per class, and multiplying each with the pixel matrix.
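A quick numpy sketch of this template-matching view, again with made-up toy shapes, showing that the per-template dot products reproduce the matrix product:

```python
import numpy as np

# Visual view as code: treat each row of W as a class template and score the
# image by its dot product with each template. Random toy values, 4 pixels, 3 classes.
x = np.random.rand(4)                 # flattened 2x2 grayscale image
W = np.random.randn(3, 4)             # one 4-pixel template per class
b = np.random.randn(3)

scores = np.array([W[i] @ x for i in range(3)]) + b   # one dot product per template
assert np.allclose(scores, W @ x + b)                 # identical to the matrix form
```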

It is also important to note that these templates may not necessarily look like any image in the dataset - rather, a template can be an epic crossover between different images. For instance, look at the template images learnt on the CIFAR10 dataset:



The horse template appears to be two-headed, which is definitely not normal and probably not part of the dataset. Rather, what has happened here is that the linear classifier has merged various images of horses facing different directions into a single template.
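For reference, here's a rough sketch of how such template images are typically visualized: each row of the weight matrix is reshaped back into an image and rescaled for display. W below is a random placeholder, not actual learned weights.

```python
import numpy as np
import matplotlib.pyplot as plt

# Reshape each row of a CIFAR10 weight matrix (no bias column) into a 32x32x3
# image and rescale it to [0, 1] for display.
class_names = ['plane', 'car', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
W = np.random.randn(10, 32 * 32 * 3)            # placeholder for learned weights

fig, axes = plt.subplots(1, 10, figsize=(15, 2))
for i, ax in enumerate(axes):
    template = W[i].reshape(32, 32, 3)
    template = (template - template.min()) / (template.max() - template.min())
    ax.imshow(template)                          # float RGB in [0, 1]
    ax.set_title(class_names[i], fontsize=8)
    ax.axis('off')
plt.show()
```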

why linear classifiers >>>

Linear classifiers present many advantages in comparison with KNNs, most notably:

  • Speed: classifying a test image involves simple matrix operations (multiplication, addition), unlike KNNs, which compare the test image to every training image. Linear classifiers are therefore much faster at test time.
  • Memory: the training data is not memorised; it is only used to learn the parameters W and b. Once this is done, the training set can be discarded, and any new image can be classified with the learnt parameters.

But how? How are we selecting the weights and biases? In the next post, I explore this question further, so check it out here (link to be added lol).



deep learning for biomedical imaging

notes breaking down the fundamentals as i work through UMichigan's deep learning course (by Justin Johnson) to build in ai x biomed imaging