A simple neural network written in plain C++
No Python, no PyTorch, no NumPy, … just 260 lines of plain C++ code to train, evaluate and test a simple binary classifier that distinguishes between the handwritten digits 0 and 1.
Reading Bishop’s book on deep learning recently, I was reminded of how simple the backpropagation algorithm actually is: it fits on a single page. On another note, I have not touched C++ for a while, although I have always enjoyed working with it. So why not implement a neural network in C++? The result is a model that trains in ~10s and achieves ~99% accuracy. The code is available on GitHub.
Data
For a binary classifier a dataset with two different classes is required. Here we simply take all the images of 0s and 1s from the MNIST dataset. The images have a size of 28×28 pixels. By flattening them we get vectors with 784 elements.
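The flattening itself is straightforward. Here is a minimal sketch, assuming the image has already been loaded into a 28×28 array of raw pixel values; the FloatType/FloatArray aliases and the scaling to [0, 1] are assumptions, and the loader in the repo may look different:

#include <array>
#include <vector>

using FloatType = double;                  // assumption: the repo may use float instead
using FloatArray = std::vector<FloatType>;

// flatten a 28x28 image into a vector with 784 elements, scaling each pixel to [0, 1]
FloatArray flatten(const std::array<std::array<unsigned char, 28>, 28>& image) {
    FloatArray flat;
    flat.reserve(28 * 28);
    for (const auto& row : image) {
        for (unsigned char pixel : row) {
            flat.push_back(static_cast<FloatType>(pixel) / 255);
        }
    }
    return flat;
}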
Model
The model is structured into layers, and each layer contains units (“neurons”):
- Input layer: these units do not compute anything; they just hold the 784 pixel values of the input image. While such a “dummy” layer is not strictly required, it makes the code more consistent.
- Hidden layers: each unit in a hidden layer receives inputs from all units of the previous layer. These inputs are linearly combined into a single value, and finally a nonlinear function is applied to it to get the output of the unit.
- Last layer: the final layer contains only a single unit, which outputs the prediction of the model. It works like the units in the hidden layers, except that no nonlinear function is applied (a small sketch of a single unit’s computation follows this list).
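In other words, every non-input unit computes a weighted sum of its inputs plus a bias, and a hidden unit additionally passes this sum through a ReLU. As a minimal standalone sketch (unit_output is an illustrative helper reusing the FloatType/FloatArray aliases; the forward method shown below does exactly this, unit by unit):

#include <algorithm> // std::max

// illustrative: the computation performed by one unit, given the outputs of the previous layer
FloatType unit_output(const FloatArray& inputs, const FloatArray& weights,
                      FloatType bias, bool has_activation) {
    FloatType pre_activation = bias;
    for (size_t j = 0; j < inputs.size(); ++j) {
        pre_activation += inputs[j] * weights[j];
    }
    // ReLU for hidden units, identity for the output unit
    return has_activation ? std::max<FloatType>(0, pre_activation) : pre_activation;
}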
Code: building the model
The model is defined in the RegressionModel class. Yes, to keep things simple the classification is done via regression: the model is trained to output 0.0 for the digit 0 and 1.0 for the digit 1, and the classification is done by checking whether the output is smaller or larger than 0.5.
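Turning the regression output into a class label is then a one-liner (classify is a hypothetical helper, not part of the RegressionModel class):

// hypothetical helper: map the regression output to a class label
int classify(FloatType output) {
    return output > 0.5 ? 1 : 0; // larger than 0.5 means digit 1
}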
The constructor creates the layers and puts them into the layer list (std::vector, to be more precise). The input layer contains units whose sole purpose is to hold the pixel values of the currently fed image. Then the hidden layers are created, and finally the last layer with its single unit is put into the list of layers.
// create a model with given number of units in input layer, number of units in hidden layers, and number of layers
RegressionModel(size_t num_input, size_t num_hidden, size_t num_layer) {
    // input layer to store the inputs for each pass
    layer.push_back(Layer());
    for (size_t i = 0; i < num_input; ++i) {
        layer.back().push_back(Unit(0));
    }
    // hidden layers
    if (num_layer >= 2) {
        for (size_t i = 0; i < num_layer - 1; ++i) {
            layer.push_back(Layer());
            for (size_t j = 0; j < num_hidden; ++j) {
                layer.back().push_back(Unit(i == 0 ? num_input : num_hidden, true));
            }
        }
    }
    // last layer contains one single unit
    layer.push_back(Layer());
    layer.back().push_back(Unit(num_hidden, false));
}
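For example, a model with 784 input units, two hidden layers of 16 units each, and the single output unit could be constructed like this (the hyperparameters are illustrative; the values used in the actual program may differ):

// num_layer = 3 gives two hidden layers plus the final layer with its single unit
RegressionModel model(28 * 28, 16, 3);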
Each layer contains units (neurons). For the input layer only a value (the pixel value) is set when the forward method is called, while for the other layers a weight list and a bias value are required to process the incoming data. The last few variables are needed for the backward method in which the gradients are computed.
// Unit is the main building block of the model, aka "neuron"
struct Unit {
    Unit(size_t a_num_input, bool a_has_activation)
        : has_activation(a_has_activation),
          weight(random_weight(a_num_input)) {
    }
    explicit Unit(FloatType a_value)
        : value(a_value) {
    }
    bool has_activation{};
    FloatArray weight;
    FloatType bias{};
    FloatType value{};
    FloatType delta{};
    FloatArray grad_weight;
    FloatType grad_bias{};
};
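The helper random_weight, which is not shown here, initializes the weight vector of a unit. A plausible implementation draws each of the num_input weights uniformly at random and scales the range by the fan-in, roughly like the sketch below; the actual initialization in the repo may differ:

#include <cmath>
#include <random>

// hypothetical sketch of a weight initializer: small random values scaled by the number of inputs
FloatArray random_weight(size_t num_input) {
    static std::mt19937 rng{42};
    const FloatType range = num_input > 0 ? 1 / std::sqrt(static_cast<FloatType>(num_input)) : 0;
    std::uniform_real_distribution<FloatType> dist(-range, range);
    FloatArray weights(num_input);
    for (FloatType& w : weights) {
        w = dist(rng);
    }
    return weights;
}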
Code: training the model
The following steps are executed to train the model:
- forward pass: feed the image through the model and compute its prediction
- compute the loss; for regression this is simply (true value − predicted value)² (see the small sketch after this list)
- backward pass: for each parameter, compute the derivative of the loss with respect to that parameter, which tells us in which direction a change of the parameter increases or decreases the loss
- update step: we do a gradient descent step by slightly adjusting every parameter in the direction that decreases the loss
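The error signal that is later handed to the backward method is the derivative of this loss with respect to the model output. As a small sketch (loss and loss_derivative are illustrative helpers; whether the factor of 2 is kept or folded into the learning rate is a detail of the actual program):

// squared error loss and its derivative with respect to the model output
FloatType loss(FloatType prediction, FloatType target) {
    const FloatType error = prediction - target;
    return error * error;
}

FloatType loss_derivative(FloatType prediction, FloatType target) {
    return 2 * (prediction - target); // this is the error_signal passed to backward()
}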
The forward method takes a single input image at a time and puts its values into the first layer. Then, the loop takes care of feeding the data from layer to layer, until we get to the last layer. The output value (model prediction) of the single unit in the last layer is returned.
// compute forward pass
FloatType forward(const FloatArray& input_data) {
    // store inputs in first layer, as they are required in the backward pass
    for (size_t i = 0; i < input_data.size(); ++i) {
        layer[0][i].value = input_data[i];
    }
    // feed forward
    for (size_t i = 1; i < layer.size(); ++i) {
        Layer& curr_layer = layer[i];
        const Layer& input_layer = layer[i - 1];
        for (Unit& curr_unit : curr_layer) {
            FloatType pre_activation = curr_unit.bias;
            for (size_t j = 0; j < input_layer.size(); ++j) {
                pre_activation += input_layer[j].value * curr_unit.weight[j];
            }
            curr_unit.value = curr_unit.has_activation ? std::max<FloatType>(0, pre_activation) : pre_activation;
        }
    }
    // return output of single unit in last layer
    return layer.back()[0].value;
}
The backward method computes the gradients of the model parameters. The incoming error signal (the loss derivative from above) is computed in the main function. For the details of backpropagation I suggest having a look at a textbook, e.g. Bishop or Goodfellow.
// compute backward pass
void backward(FloatType error_signal) {
    // set the error signal (derivative of loss wrt the last unit's pre-activation) for single unit in last layer
    layer.back()[0].delta = error_signal;
    // reversed index r counts from back, i is to be used to index the vector
    for (size_t r = 0; r < layer.size() - 1; ++r) {
        const size_t i = layer.size() - r - 1; // index to be used for the vector
        // going over all units of current layer
        Layer& curr_layer = layer[i];
        for (size_t j = 0; j < curr_layer.size(); ++j) {
            Unit& curr_unit = layer[i][j];
            // for all but the last layer compute delta (incoming error signal)
            if (r > 0) {
                curr_unit.delta = 0;
                const Layer& next_layer = layer[i + 1]; // closer to the output
                for (size_t k = 0; k < next_layer.size(); ++k) {
                    const Unit& next_unit = next_layer[k];
                    // weight j of the next unit is its connection to the current unit j
                    curr_unit.delta += next_unit.delta * next_unit.weight[j];
                }
                // apply the ReLU derivative: no gradient flows through an inactive unit
                if (curr_unit.has_activation && curr_unit.value <= 0) {
                    curr_unit.delta = 0;
                }
            }
            // compute gradient for weights and for bias (derivative of loss wrt each parameter)
            FloatArray grad;
            const Layer& prev_layer = layer[i - 1];
            for (size_t k = 0; k < prev_layer.size(); ++k) {
                const Unit& prev_unit = prev_layer[k];
                grad.push_back(curr_unit.delta * prev_unit.value);
            }
            curr_unit.grad_weight = grad;
            curr_unit.grad_bias = curr_unit.delta;
        }
    }
}
Finally, we apply the gradient descent step, updating each parameter a little in the direction that decreases the loss.
// do a small step in direction of the negative gradient, as this reduces the loss
void step(FloatType lr) {
    for (size_t i = 1; i < layer.size(); ++i) {
        Layer& curr_layer = layer[i];
        for (Unit& unit : curr_layer) {
            for (size_t j = 0; j < unit.weight.size(); ++j) {
                unit.weight[j] -= lr * unit.grad_weight[j];
            }
            unit.bias -= lr * unit.grad_bias;
        }
    }
}
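Putting the three methods together, one training epoch could look roughly like the sketch below. The containers for the dataset, the learning rate and the number of epochs are illustrative assumptions; the actual main function in the repo may handle these details differently:

#include <vector>

// illustrative training loop: images are flattened 784-pixel FloatArrays, targets are 0.0 or 1.0
void train(RegressionModel& model,
           const std::vector<FloatArray>& images,
           const std::vector<FloatType>& targets,
           size_t num_epoch, FloatType lr) {
    for (size_t epoch = 0; epoch < num_epoch; ++epoch) {
        for (size_t n = 0; n < images.size(); ++n) {
            const FloatType prediction = model.forward(images[n]);
            const FloatType error_signal = 2 * (prediction - targets[n]); // derivative of the squared loss
            model.backward(error_signal);
            model.step(lr);
        }
    }
}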
Results
After a few seconds the model is fully trained and gives ~99% accuracy on the test set. The program also prints sample images together with the model’s predictions; here are two of them.
X
XXXXXXXXXX
XXXXXXXXXXXX
XXX XXXX
XX XXX
XX XXX
XX XXX
XX XX
XX XX
XX XX
XX XX
XX XX
XX XXX
XX XX
XX XXX
XX XXX
XX XXXX
XXXXXXXXXXXXX
XXXXXXXXXX
XXXXXXX
Predicted: 0 (0.155461) Target: 0
X
XXX
XXX
XXX
XXX
XXXX
XXX
XXX
XXX
XXX
XXX
XXX
XXX
XXX
XXX
XX
XXX
XXX
XXX
X
Predicted: 1 (0.962643) Target: 1