A simple neural network written in plain C++
No Python, no PyTorch, no NumPy, … just 260 lines of plain C++ code to train, evaluate and test a simple binary classifier that distinguishes between the handwritten digits 0 and 1.
Reading Bishop’s book on deep learning recently, I was reminded of how simple the backpropagation algorithm actually is: it fits on a single page. On another note, I have not touched C++ for a while, although I have always enjoyed working with it. So why not implement a neural network in C++? The result is a model that trains in ~10s and achieves ~99% accuracy. The code is available on GitHub.
Data
For a binary classifier a dataset with two different classes is required. Here we simply take all the images of 0s and 1s from the MNIST dataset. The images have a size of 28×28 pixels. By flattening them we get vectors with 784 elements.
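The flattening itself is straightforward. Here is a minimal sketch, assuming the image has already been loaded into a 28×28 array of raw pixel values; the FloatType/FloatArray aliases and the scaling to [0, 1] are assumptions, and the loader in the repo may look different:

#include <array>
#include <vector>

using FloatType = double;                  // assumption: the repo may use float instead
using FloatArray = std::vector<FloatType>;

// flatten a 28x28 image into a vector with 784 elements, scaling each pixel to [0, 1]
FloatArray flatten(const std::array<std::array<unsigned char, 28>, 28>& image) {
    FloatArray flat;
    flat.reserve(28 * 28);
    for (const auto& row : image) {
        for (unsigned char pixel : row) {
            flat.push_back(static_cast<FloatType>(pixel) / 255);
        }
    }
    return flat;
}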
Model
The model is structured into layers, and each layer contains units (“neurons”):
- Input layer: these units do not compute anything; they just hold the 784 pixel values of the input image. While such a “dummy” layer is not strictly required, it makes the code more consistent.
- Hidden layers: each unit in a hidden layer receives inputs from all units of the previous layer. These inputs are linearly combined into a single value, and finally a nonlinear function is applied to it to get the output of the unit.
- Last layer: the final layer contains only a single unit, which outputs the prediction of the model. It works like the units in the hidden layers, except that no nonlinear function is applied (a small sketch of a single unit’s computation follows this list).
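In other words, every non-input unit computes a weighted sum of its inputs plus a bias, and a hidden unit additionally passes this sum through a ReLU. As a minimal standalone sketch (unit_output is an illustrative helper reusing the FloatType/FloatArray aliases; the forward method shown below does exactly this, unit by unit):

#include <algorithm> // std::max

// illustrative: the computation performed by one unit, given the outputs of the previous layer
FloatType unit_output(const FloatArray& inputs, const FloatArray& weights,
                      FloatType bias, bool has_activation) {
    FloatType pre_activation = bias;
    for (size_t j = 0; j < inputs.size(); ++j) {
        pre_activation += inputs[j] * weights[j];
    }
    // ReLU for hidden units, identity for the output unit
    return has_activation ? std::max<FloatType>(0, pre_activation) : pre_activation;
}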
Code: building the model
The model is defined in the RegressionModel class. Yes, to keep things simple the classification is done via regression: the model is trained to output 0.0 for the digit 0 and 1.0 for the digit 1, and the classification is done by checking whether the output is smaller or larger than 0.5.
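Turning the regression output into a class label is then a one-liner (classify is a hypothetical helper, not part of the RegressionModel class):

// hypothetical helper: map the regression output to a class label
int classify(FloatType output) {
    return output > 0.5 ? 1 : 0; // larger than 0.5 means digit 1
}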
The constructor creates the layers and puts them into the layer list (std::vector, to be more precise). The input layer contains units whose sole purpose is to hold the pixel values of the currently fed image. Then the hidden layers are created, and finally the last layer with its single unit is put into the list of layers.
// create a model with given number of units in input layer, number of units in hidden layers, and number of layers
RegressionModel(size_t num_input, size_t num_hidden, size_t num_layer) {
    // input layer to store the inputs for each pass
    layer.push_back(Layer());
    for (size_t i = 0; i < num_input; ++i) {
        layer.back().push_back(Unit(0));
    }
    // hidden layers
    if (num_layer >= 2) {
        for (size_t i = 0; i < num_layer - 1; ++i) {
            layer.push_back(Layer());
            for (size_t j = 0; j < num_hidden; ++j) {
                layer.back().push_back(Unit(i == 0 ? num_input : num_hidden, true));
            }
        }
    }
    // last layer contains one single unit
    layer.push_back(Layer());
    layer.back().push_back(Unit(num_hidden, false));
}
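For example, a model with 784 input units, two hidden layers of 16 units each, and the single output unit could be constructed like this (the hyperparameters are illustrative; the values used in the actual program may differ):

// num_layer = 3 gives two hidden layers plus the final layer with its single unit
RegressionModel model(28 * 28, 16, 3);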
Each layer contains units (neurons). For the input layer only a value (the pixel value) is set when the forward method is called, while for the other layers a weight list and a bias value are required to process the incoming data. The last few variables are needed for the backward method in which the gradients are computed.
// Unit is the main building block of the model, aka "neuron"
struct Unit {
    Unit(size_t a_num_input, bool a_has_activation)
        : has_activation(a_has_activation),
          weight(random_weight(a_num_input)) {
    }
    explicit Unit(FloatType a_value)
        : value(a_value) {
    }
    bool has_activation{};
    FloatArray weight;
    FloatType bias{};
    FloatType value{};
    FloatType delta{};
    FloatArray grad_weight;
    FloatType grad_bias{};
};
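The helper random_weight, which is not shown here, initializes the weight vector of a unit. A plausible implementation draws each of the num_input weights uniformly at random and scales the range by the fan-in, roughly like the sketch below; the actual initialization in the repo may differ:

#include <cmath>
#include <random>

// hypothetical sketch of a weight initializer: small random values scaled by the number of inputs
FloatArray random_weight(size_t num_input) {
    static std::mt19937 rng{42};
    const FloatType range = num_input > 0 ? 1 / std::sqrt(static_cast<FloatType>(num_input)) : 0;
    std::uniform_real_distribution<FloatType> dist(-range, range);
    FloatArray weights(num_input);
    for (FloatType& w : weights) {
        w = dist(rng);
    }
    return weights;
}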
Code: training the model
The following steps are executed to train the model:
- forward pass: feed the image through the model and compute its prediction
- compute the loss; for regression this is simply (true value − predicted value)² (see the small sketch after this list)
- backward pass: for each parameter, compute the derivative of the loss with respect to that parameter, which tells us in which direction a change of the parameter increases or decreases the loss
- update step: we do a gradient descent step by slightly adjusting every parameter in the direction that decreases the loss
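The error signal that is later handed to the backward method is the derivative of this loss with respect to the model output. As a small sketch (loss and loss_derivative are illustrative helpers; whether the factor of 2 is kept or folded into the learning rate is a detail of the actual program):

// squared error loss and its derivative with respect to the model output
FloatType loss(FloatType prediction, FloatType target) {
    const FloatType error = prediction - target;
    return error * error;
}

FloatType loss_derivative(FloatType prediction, FloatType target) {
    return 2 * (prediction - target); // this is the error_signal passed to backward()
}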
The forward method takes a single input image at a time and puts its values into the first layer. Then, the loop takes care of feeding the data from layer to layer, until we get to the last layer. The output value (model prediction) of the single unit in the last layer is returned.
// compute forward pass
FloatType forward(const FloatArray& input_data) {
    // store inputs in first layer, as they are required in the backward pass
    for (size_t i = 0; i < input_data.size(); ++i) {
        layer[0][i].value = input_data[i];
    }
    // feed forward
    for (size_t i = 1; i < layer.size(); ++i) {
        Layer& curr_layer = layer[i];
        const Layer& input_layer = layer[i - 1];
        for (Unit& curr_unit : curr_layer) {
            FloatType pre_activation = curr_unit.bias;
            for (size_t j = 0; j < input_layer.size(); ++j) {
                pre_activation += input_layer[j].value * curr_unit.weight[j];
            }
            curr_unit.value = curr_unit.has_activation ? std::max<FloatType>(0, pre_activation) : pre_activation;
        }
    }
    // return output of single unit in last layer
    return layer.back()[0].value;
}
The backward method computes the gradients of the model parameters. The incoming error signal (the loss derivative from above) is computed in the main function. For the details of backpropagation I suggest having a look at a textbook, e.g. Bishop or Goodfellow.
// compute backward pass
void backward(FloatType error_signal) {
    // set the error signal (derivative of loss wrt the last unit's pre-activation) for single unit in last layer
    layer.back()[0].delta = error_signal;
    // reversed index r counts from back, i is to be used to index the vector
    for (size_t r = 0; r < layer.size() - 1; ++r) {
        const size_t i = layer.size() - r - 1; // index to be used for the vector
        // going over all units of current layer
        Layer& curr_layer = layer[i];
        for (size_t j = 0; j < curr_layer.size(); ++j) {
            Unit& curr_unit = layer[i][j];
            // for all but the last layer compute delta (incoming error signal)
            if (r > 0) {
                curr_unit.delta = 0;
                const Layer& next_layer = layer[i + 1]; // closer to the output
                for (size_t k = 0; k < next_layer.size(); ++k) {
                    const Unit& next_unit = next_layer[k];
                    // weight j of the next unit is its connection to the current unit j
                    curr_unit.delta += next_unit.delta * next_unit.weight[j];
                }
                // apply the ReLU derivative: no gradient flows through an inactive unit
                if (curr_unit.has_activation && curr_unit.value <= 0) {
                    curr_unit.delta = 0;
                }
            }
            // compute gradient for weights and for bias (derivative of loss wrt each parameter)
            FloatArray grad;
            const Layer& prev_layer = layer[i - 1];
            for (size_t k = 0; k < prev_layer.size(); ++k) {
                const Unit& prev_unit = prev_layer[k];
                grad.push_back(curr_unit.delta * prev_unit.value);
            }
            curr_unit.grad_weight = grad;
            curr_unit.grad_bias = curr_unit.delta;
        }
    }
}
Finally, we apply the gradient descent step, updating each parameter a little in the direction that decreases the loss.
// do a small step in direction of the negative gradient, as this reduces the loss
void step(FloatType lr) {
    for (size_t i = 1; i < layer.size(); ++i) {
        Layer& curr_layer = layer[i];
        for (Unit& unit : curr_layer) {
            for (size_t j = 0; j < unit.weight.size(); ++j) {
                unit.weight[j] -= lr * unit.grad_weight[j];
            }
            unit.bias -= lr * unit.grad_bias;
        }
    }
}
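Putting the three methods together, one training epoch could look roughly like the sketch below. The containers for the dataset, the learning rate and the number of epochs are illustrative assumptions; the actual main function in the repo may handle these details differently:

#include <vector>

// illustrative training loop: images are flattened 784-pixel FloatArrays, targets are 0.0 or 1.0
void train(RegressionModel& model,
           const std::vector<FloatArray>& images,
           const std::vector<FloatType>& targets,
           size_t num_epoch, FloatType lr) {
    for (size_t epoch = 0; epoch < num_epoch; ++epoch) {
        for (size_t n = 0; n < images.size(); ++n) {
            const FloatType prediction = model.forward(images[n]);
            const FloatType error_signal = 2 * (prediction - targets[n]); // derivative of the squared loss
            model.backward(error_signal);
            model.step(lr);
        }
    }
}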
Results
After a few seconds the model is fully trained and gives ~99% accuracy on the test set. The program also prints sample images together with the model’s predictions; here are two of them.
X
XXXXXXXXXX
XXXXXXXXXXXX
XXX XXXX
XX XXX
XX XXX
XX XXX
XX XX
XX XX
XX XX
XX XX
XX XX
XX XXX
XX XX
XX XXX
XX XXX
XX XXXX
XXXXXXXXXXXXX
XXXXXXXXXX
XXXXXXX
Predicted: 0 (0.155461) Target: 0
X
XXX
XXX
XXX
XXX
XXXX
XXX
XXX
XXX
XXX
XXX
XXX
XXX
XXX
XXX
XX
XXX
XXX
XXX
X
Predicted: 1 (0.962643) Target: 1