
Implementing a flexible neural network with backpropagation from scratch

Implementing your own neural network can be hard, especially if, like me, you come from a computer science background: math equations and notation make you dizzy, and you understand things better when you see actual code.

Today I’ll show you how easy it is to implement a flexible neural network and train it using the backpropagation algorithm. I’ll be implementing this in Python using only NumPy as an external library.

After reading this post, you should understand the following:

  • How to feed forward inputs to a neural network.
  • How to use the backpropagation algorithm to train a neural network.
  • How to use the neural network to solve a problem.

In this post, we’ll use our neural network to solve a very simple problem: Binary AND.
The source code of the implementation is available here.

Background knowledge

In order to easily follow and understand this post, you’ll need to know the following:

  • The basics of Python / OOP.
  • Some linear algebra and calculus (e.g. dot products, derivatives).
  • An idea of neural networks.
  • (Optional) How to work with NumPy.

Rest assured though, I’ll try to explain everything I do/use here.

Neural networks

Before throwing ourselves into our favourite IDE, we must understand what exactly neural networks are (or more precisely, feedforward neural networks).

A feedforward neural network (also called a multilayer perceptron) is an artificial neural network in which consecutive layers are connected but the connections never form a cycle. This means the network is not recurrent and has no feedback connections.

An example of a Feedforward Neural Network

These neural networks try and approximate a function f(X) where X are the inputs that get fed forward through the network to give us an output.

The lines connecting the network’s nodes (neurons) carry weights, which are numbers (floats); in this post they start out as random values between 0 and 1.
Also, each neuron has a bias (here also a float that starts between 0 and 1) that helps shift the results.
Together, the weights and biases are called the parameters of the network.

An example of a network’s parameters

When inputs are fed forward through the network, each layer will calculate the dot product between its weights and the inputs, add its bias then activate the result using an activation function (e.g. sigmoid, tanh).

    \[A(X\cdot W_l + \beta_l)\]

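These activation functions are used to introduce non-linearity: without them, any stack of layers would collapse into a single linear transformation of the inputs.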

Implementing a flexible neural network

In order for our implementation to be called flexible, we should be able to add/remove layers without changing the code. This means that hard-coding weights and layers is a no go.

First of all, let’s import NumPy and set a seed for us to get the same results when generating random numbers:
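A minimal sketch of that setup. The seed value 100 is an arbitrary assumption; any fixed integer gives reproducible random numbers.

```python
import numpy as np

# The seed value is arbitrary (an assumption). Fixing it makes the randomly
# generated weights and biases identical from one run to the next.
np.random.seed(100)
```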


We’ll create a class that represents our network’s hidden and output layers.


In the class’s constructor __init__ we’ll initialize the layer’s weights and bias using the length of the input and the number of neurons this layer has. Specifying the weights and bias is optional as they will be randomly generated if not provided.
N.B: The layer’s input is not always our initial input X, it can also be the output of a previous layer.
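The constructor described above could look roughly like the sketch below. The parameter names (n_input, n_neurons, activation, weights, bias) and the use of np.random.rand are assumptions made for illustration.

```python
import numpy as np

np.random.seed(100)  # arbitrary seed, only for reproducibility


class Layer:
    """One hidden or output layer of the network (illustrative sketch)."""

    def __init__(self, n_input, n_neurons, activation=None, weights=None, bias=None):
        # One row per input, one column per neuron of this layer.
        self.weights = weights if weights is not None else np.random.rand(n_input, n_neurons)
        # One bias value per neuron.
        self.bias = bias if bias is not None else np.random.rand(n_neurons)
        # Name of the activation function, e.g. 'tanh' or 'sigmoid' (None = no activation).
        self.activation = activation
        # Filled in later, when inputs are fed through the layer.
        self.last_activation = None


hidden_layer_1 = Layer(3, 4)
print(hidden_layer_1.weights.shape)  # (3, 4)
print(hidden_layer_1.bias.shape)     # (4,)
```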

If we take the network shown in the figures above, we can represent its first hidden layer (3×4) as follows: hidden_layer_1 = Layer(3, 4).
As a result, the weights and bias will be (rounded to two decimals):

    \[ W_l = \begin{bmatrix} 0.54 & 0.27 & 0.42 & 0.84 \\ 0.00 & 0.12 & 0.67 & 0.82 \\ 0.13 & 0.57 & 0.89 & 0.20 \end{bmatrix} , \beta_l = \begin{bmatrix} 0.18 \\ 0.10 \\ 0.21 \\ 0.97 \end{bmatrix} \]

The columns in the weights matrix W_l represent our layer’s neurons and each row is the weights between one input neuron and all our layer’s neurons.


As the inputs get fed forward through our network, each layer must calculate the output using its weights and the received inputs, apply an activation function (if chosen) on the output then return the result to us.

Here’s an explanation of the process:

  1. The input goes through our activate function.
  2. It calculates X\cdot W_l + \beta_l.
  3. Applies the chosen activation function A(r).
  4. Saves the result in last_activation (I’ll explain why later).
  5. Returns the result.

N.B: I only implemented two activation functions here, tanh (2 \cdot \sigma \left( 2 x \right) - 1) and sigmoid (\frac{1}{1 + e^{-x}}), but there are many more available.

If we want to activate the input [1, 2, 3] for example, we’ll write the following: layer.activate([1, 2, 3]), which results in: [1.14829064 2.35516451 4.65967912 4.10271179], a vector of length 4 (our number of neurons).

    \[ r = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} \cdot \begin{bmatrix} 0.54 & 0.27 & 0.42 & 0.84 \\ 0.00 & 0.12 & 0.67 & 0.82 \\ 0.13 & 0.57 & 0.89 & 0.20 \end{bmatrix} + \begin{bmatrix} 0.18 & 0.10 & 0.21 & 0.97 \end{bmatrix} \\ = \begin{bmatrix} 1.15 & 2.35 & 4.66 & 4.10 \end{bmatrix} \]

The dot product between a vector V(N) and a matrix A(N\times M) is simply a new vector R(M) made of the dot products between V and each column of A. You can find more information about this here.

Don’t forget that we didn’t specify any activation function, so the result we’re seeing here is only the dot product.
If we did actually specify one, let’s say sigmoid, then the result would be the following (note that all the values are now between 0 and 1):

    \[ \sigma( \begin{bmatrix} 1.15 \\ 2.35 \\ 4.66 \\ 4.10 \end{bmatrix} ) = \begin{bmatrix} 0.76 \\ 0.91 \\ 0.99 \\ 0.98 \end{bmatrix} \]
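The activation steps listed above can be sketched as follows. This is a simplified stand-in in which the weights and bias are always passed in explicitly; the helper name _apply_activation is an assumption. It uses the rounded weights and bias from the example, so the printed values match the figures above only to about one decimal place.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class Layer:
    """Simplified layer: weights and bias are given explicitly."""

    def __init__(self, weights, bias, activation=None):
        self.weights = np.asarray(weights, dtype=float)
        self.bias = np.asarray(bias, dtype=float)
        self.activation = activation
        self.last_activation = None

    def activate(self, x):
        # Step 2: weighted sum of the inputs plus the bias.
        r = np.dot(x, self.weights) + self.bias
        # Steps 3 and 4: apply the activation and remember the result.
        self.last_activation = self._apply_activation(r)
        # Step 5: hand the result back to the caller.
        return self.last_activation

    def _apply_activation(self, r):
        if self.activation == 'tanh':
            return np.tanh(r)
        if self.activation == 'sigmoid':
            return sigmoid(r)
        return r  # no activation chosen


# Rounded weights and bias taken from the example matrices.
W = [[0.54, 0.27, 0.42, 0.84],
     [0.00, 0.12, 0.67, 0.82],
     [0.13, 0.57, 0.89, 0.20]]
b = [0.18, 0.10, 0.21, 0.97]

print(Layer(W, b).activate([1, 2, 3]))             # plain X . W + bias
print(Layer(W, b, 'sigmoid').activate([1, 2, 3]))  # squashed into (0, 1)
```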

Neural networks

We’ll also create a class to represent a neural network:

The code is pretty straightforward: we have a constructor that initializes an empty list _layers and a function add_layer that appends layers to that list (of course, the layers are of type Layer, the class we created earlier).

Feed forward

As previously explained, inputs travel forward through the network.
We’ll go ahead and implement it in our NeuralNetwork class:

  • The function feed_forward takes the input X and feeds it forward through all our layers.
  • Each next layer will take the output of the previous one as the new input.
  • The function predict chooses the winning probability and returns its index (interpreted as the winning class).
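Put together, the network class with add_layer, feed_forward and predict might look like the sketch below. The Layer class here is a compact stand-in, and the two-layer network at the bottom is only a hypothetical usage example.

```python
import numpy as np


class Layer:
    """Compact stand-in for the layer class described earlier."""

    def __init__(self, n_input, n_neurons, activation=None):
        self.weights = np.random.rand(n_input, n_neurons)
        self.bias = np.random.rand(n_neurons)
        self.activation = activation
        self.last_activation = None

    def activate(self, x):
        r = np.dot(x, self.weights) + self.bias
        if self.activation == 'tanh':
            r = np.tanh(r)
        elif self.activation == 'sigmoid':
            r = 1.0 / (1.0 + np.exp(-r))
        self.last_activation = r
        return r


class NeuralNetwork:
    def __init__(self):
        self._layers = []

    def add_layer(self, layer):
        self._layers.append(layer)

    def feed_forward(self, X):
        # The output of each layer becomes the input of the next one.
        for layer in self._layers:
            X = layer.activate(X)
        return X

    def predict(self, X):
        # The index of the largest output is the winning class.
        ff = self.feed_forward(X)
        if ff.ndim == 1:              # a single sample
            return np.argmax(ff)
        return np.argmax(ff, axis=1)  # one prediction per row


np.random.seed(100)  # arbitrary seed
nn = NeuralNetwork()
nn.add_layer(Layer(2, 3, 'tanh'))
nn.add_layer(Layer(3, 2, 'sigmoid'))
print(nn.feed_forward(np.array([0, 1])))
print(nn.predict(np.array([[0, 0], [1, 1]])))
```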

Our neural network

Let’s now create a neural network that we will use throughout the post.

Representation of the neural network

Our neural network’s goal is to be able to perform the Binary AND operation.
Although at the moment, it sucks at it:

A | B | A ∧ B | Predicted


Now we’re at the most important step of our implementation, the backpropagation algorithm.
Simply put, backpropagation is a method that calculates gradients, which are then used to train neural networks by modifying their weights for better results/predictions.
The algorithm is supervised, meaning that we’ll need to provide examples (inputs and targets) of how it should work in order for it to actually help us.

In this implementation, we’ll use half the sum of squared errors as the loss function for the backpropagation algorithm, where y_i is the target value and \hat{y}_i is the predicted value.

    \[ \frac{1}{2}\sum_i{(y_i - \hat{y}_i)}^2 \]

How does it work?

Let’s represent our neural network the following way:

Simpler representation of our neural network

Where hl_1 and hl_2 are the outputs of the hidden layer 1 and the hidden layer 2.
We can then see that the output is simply a chain of functions:

    \[ o = O(hl_2(hl_1(X))) \]

Calculating the gradients and changing our weights is now a step closer: all we have to do is apply the chain rule and update every single weight we have.
If you want to read more about the exact math behind this, I highly recommend this article.
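As a rough illustration of the chain rule for the output layer only: the symbols W_O, \beta_O and z_O (the output layer's weights, bias and pre-activation sum) and the loss symbol L are introduced here as assumptions, A is the activation function, and \times denotes an element-wise product.

    \[ L = \frac{1}{2}\sum_i (y_i - o_i)^2, \qquad o = A(z_O), \qquad z_O = hl_2 \cdot W_O + \beta_O \]

    \[ \frac{\partial L}{\partial W_O} = hl_2^T \cdot \left( \frac{\partial L}{\partial o} \times A'(z_O) \right) = - hl_2^T \cdot \left( (y - o) \times A'(z_O) \right) \]

Moving the weights against this gradient therefore means adding a multiple of hl_2^T \cdot ((y - o) \times A'(z_O)) to W_O. The same pattern repeats for the earlier layers, with (y - o) replaced by the error passed back from the layer that follows.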

Calculating the errors and the deltas

For each layer starting from the last layer (output layer):

  • Calculate the error E:
    For the output layer: y - o where o is the output that we get from feed_forward(X).
    For the hidden layers: W_{l+1} \cdot \delta_{l+1} where l+1 is the next layer and \delta is its delta.
  • Calculate the delta: E \times A'(o_i) where A' is the derivative of our activation function and o_i is the last activation that we stored previously.

First, let’s add the derivatives of our activation functions:
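A possible sketch of those derivatives, written as a standalone function (the name apply_activation_derivative and its signature are assumptions). Both derivatives are expressed in terms of the activation's output, which is why the last activation needs to be stored during the forward pass.

```python
import numpy as np


def apply_activation_derivative(activation, a):
    """Derivative of the activation, written in terms of its output a = A(r)."""
    a = np.asarray(a, dtype=float)
    if activation == 'tanh':
        return 1.0 - a ** 2     # tanh'(r) = 1 - tanh(r)^2
    if activation == 'sigmoid':
        return a * (1.0 - a)    # sigma'(r) = sigma(r) * (1 - sigma(r))
    return np.ones_like(a)      # no activation: A(r) = r, so A'(r) = 1


print(apply_activation_derivative('sigmoid', 0.5))  # 0.25
print(apply_activation_derivative('tanh', 0.0))     # 1.0
```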

Then calculate the errors and the deltas:
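The backward pass described above can be sketched as a small function. To keep it short, this version assumes every layer uses sigmoid and that each layer object already holds its weights and last_activation from a forward pass; the function name compute_deltas and the tiny hand-made layers are illustrative only.

```python
import numpy as np
from types import SimpleNamespace


def sigmoid_derivative(a):
    # Derivative of sigmoid expressed through its output a.
    return a * (1.0 - a)


def compute_deltas(layers, y):
    """Fill in layer.error and layer.delta, walking from the output layer backwards."""
    for i in reversed(range(len(layers))):
        layer = layers[i]
        if i == len(layers) - 1:
            # Output layer: compare the target with the network's output.
            layer.error = y - layer.last_activation
        else:
            # Hidden layer: pass the next layer's delta back through its weights.
            next_layer = layers[i + 1]
            layer.error = np.dot(next_layer.weights, next_layer.delta)
        layer.delta = layer.error * sigmoid_derivative(layer.last_activation)


# Hand-made state standing in for the result of a forward pass.
hidden = SimpleNamespace(weights=np.ones((2, 2)),
                         last_activation=np.array([0.5, 0.5]))
output = SimpleNamespace(weights=np.array([[1.0], [-1.0]]),
                         last_activation=np.array([0.5]))

compute_deltas([hidden, output], y=np.array([1.0]))
print(output.error, output.delta)
print(hidden.error, hidden.delta)
```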

Updating the weights

All we have to do now is update our weight matrices using the calculated deltas and a learning rate.
The learning rate is a small positive number (typically between 0 and 1) that controls how much we adjust the weights. A learning rate that is too big may cause you to overshoot the minimum, while one that is too small might take too long to converge.

  1. Loop over the layers, from first to last.
  2. Choose the input to use (X or o_{l-1}, depending on the current layer).
  3. Update the layer’s weights using W_l = W_l + \delta_l \times X^T \times \alpha.

Our backpropagation function becomes:
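A sketch of one full backpropagation step, combining the forward pass, the errors and deltas, and the weight update. It is written as a free function over a list of layers and uses sigmoid in every layer, both of which are simplifications assumed here. Only the weights are updated, following the steps listed above.

```python
import numpy as np


class Layer:
    def __init__(self, n_input, n_neurons):
        self.weights = np.random.rand(n_input, n_neurons)
        self.bias = np.random.rand(n_neurons)
        self.last_activation = None
        self.error = None
        self.delta = None

    def activate(self, x):
        # Sigmoid everywhere, to keep the sketch short.
        r = np.dot(x, self.weights) + self.bias
        self.last_activation = 1.0 / (1.0 + np.exp(-r))
        return self.last_activation


def backpropagation(layers, X, y, learning_rate):
    """One training step on a single sample X with target y."""
    # Forward pass: stores last_activation in every layer.
    output = X
    for layer in layers:
        output = layer.activate(output)

    # Backward pass: errors and deltas, from the last layer to the first.
    for i in reversed(range(len(layers))):
        layer = layers[i]
        if i == len(layers) - 1:
            layer.error = y - output
        else:
            next_layer = layers[i + 1]
            layer.error = np.dot(next_layer.weights, next_layer.delta)
        layer.delta = layer.error * layer.last_activation * (1.0 - layer.last_activation)

    # Weight update, from the first layer to the last.
    for i, layer in enumerate(layers):
        # The first layer sees X; every other layer sees the previous layer's output.
        input_to_use = np.atleast_2d(X if i == 0 else layers[i - 1].last_activation)
        # input_to_use.T has shape (n_input, 1) and delta has shape (n_neurons,),
        # so the product broadcasts to the (n_input, n_neurons) shape of the weights.
        layer.weights += layer.delta * input_to_use.T * learning_rate


def squared_error(layers, x, target):
    out = x
    for layer in layers:
        out = layer.activate(out)
    return float(np.sum((target - out) ** 2))


np.random.seed(100)  # arbitrary seed
layers = [Layer(2, 3), Layer(3, 1)]
x, target = np.array([1.0, 1.0]), np.array([1.0])

before = squared_error(layers, x, target)
for _ in range(50):
    backpropagation(layers, x, target, learning_rate=0.5)
after = squared_error(layers, x, target)
print(before, after)
```

If the biases should learn as well, one common choice (not part of the steps above) is to add layer.delta * learning_rate to layer.bias in the same update loop.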

Training the neural network

Finally, we need to train our neural network using the backpropagation function we implemented above.
We will use the stochastic gradient descent algorithm, where we update the weights with each row of our input (you can also take mini-batches instead).

We repeat the process for max_epochs epochs (also called cycles or runs).
At every 10th epoch, we will print out the Mean Squared Error and save it in mses which we will return at the end.

Here’s the complete implementation:

Training the neural network to solve the Binary AND problem

The learning rate 0.3 and max_epochs 290 were chosen by trial and error, nothing fancy.
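For reference, here is a compact end-to-end sketch that trains on Binary AND with those two settings. Several details are assumptions rather than facts from this post: the 2-3-3-2 layer sizes and their activations, the one-hot encoding of the targets, and the seed. Biases are left untouched, matching the weight-only update described earlier.

```python
import numpy as np


class Layer:
    def __init__(self, n_input, n_neurons, activation=None):
        self.weights = np.random.rand(n_input, n_neurons)
        self.bias = np.random.rand(n_neurons)
        self.activation = activation
        self.last_activation = None
        self.error = None
        self.delta = None

    def activate(self, x):
        r = np.dot(x, self.weights) + self.bias
        if self.activation == 'tanh':
            r = np.tanh(r)
        elif self.activation == 'sigmoid':
            r = 1.0 / (1.0 + np.exp(-r))
        self.last_activation = r
        return r

    def activation_derivative(self, a):
        # Derivative expressed in terms of the stored activation a.
        if self.activation == 'tanh':
            return 1.0 - a ** 2
        if self.activation == 'sigmoid':
            return a * (1.0 - a)
        return np.ones_like(a)


class NeuralNetwork:
    def __init__(self):
        self._layers = []

    def add_layer(self, layer):
        self._layers.append(layer)

    def feed_forward(self, X):
        for layer in self._layers:
            X = layer.activate(X)
        return X

    def predict(self, X):
        return np.argmax(self.feed_forward(X), axis=-1)

    def backpropagation(self, X, y, learning_rate):
        output = self.feed_forward(X)
        # Errors and deltas, from the output layer backwards.
        for i in reversed(range(len(self._layers))):
            layer = self._layers[i]
            if i == len(self._layers) - 1:
                layer.error = y - output
            else:
                nxt = self._layers[i + 1]
                layer.error = np.dot(nxt.weights, nxt.delta)
            layer.delta = layer.error * layer.activation_derivative(layer.last_activation)
        # Weight update, from the first layer forwards.
        for i, layer in enumerate(self._layers):
            inp = np.atleast_2d(X if i == 0 else self._layers[i - 1].last_activation)
            layer.weights += layer.delta * inp.T * learning_rate

    def train(self, X, y, learning_rate, max_epochs):
        mses = []
        for epoch in range(max_epochs):
            # Stochastic gradient descent: one update per training row.
            for row, target in zip(X, y):
                self.backpropagation(row, target, learning_rate)
            if epoch % 10 == 0:
                mse = np.mean(np.square(y - self.feed_forward(X)))
                mses.append(mse)
                print('Epoch: #%d, MSE: %f' % (epoch, mse))
        return mses


np.random.seed(100)  # arbitrary seed (assumption)

# Assumed architecture: 2 inputs, two hidden layers of 3 neurons, 2 outputs.
nn = NeuralNetwork()
nn.add_layer(Layer(2, 3, 'tanh'))
nn.add_layer(Layer(3, 3, 'sigmoid'))
nn.add_layer(Layer(3, 2, 'sigmoid'))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
# Assumed one-hot targets: column 0 means "A AND B is 0", column 1 means "it is 1".
y = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])

mses = nn.train(X, y, learning_rate=0.3, max_epochs=290)
print('Predictions:', nn.predict(X))
```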
Here’s the result of our training:

Changes in the MSE

Our neural network can successfully do binary AND operations now!

Fun note


Aside from being able to do binary AND operations, the neural network seems to always predict 1 when a positive number is used and 0 when a negative number is used.
Sadly, the reason remains a mystery to us…


Implementing a neural network can be challenging at first, especially since a lot of articles make it seem like it’s hard and you need to be a math genius to get it done. In reality, it’s fairly simple. All you need is some basic knowledge of calculus and machine learning.

I hope this post helped you understand how a neural network functions and especially how it can be trained using the Backpropagation algorithm. If you have any suggestions, improvements, questions or whatever feel free to comment below!


Leave a Reply


This article is really good, but I have one problem that I have never found the answer to.
Every example that I see uses the AND problem. What I want to know is: is it possible to predict multiple possibilities from one set of data?
What I mean is, let’s say I have one row of data, but I want to predict the result among 5 different possibilities. Should I repeat the process for each possibility, or is it possible to make the output layer give a prediction for all 5 possibilities?

wancong zhang

Very cool article and implementation.

I just want to point out one mistake in your matrix multiplication. In your diagram and description you multiplied a 3×4 matrix (W) with a 3×1 matrix (X). Instead it should be the product of 4×3 matrix and 3×1 matrix.


Which Python or Anaconda version are you using?


Great article. But I don’t understand how to generalize backpropagation for any loss function. Is there any simple tutorial as this one?


Great article, but please don’t stop: show how to create a CNN, LSTM, RNN, Decision Tree, SVM and Random Forest. Thank you 🙂 By the way, you can create a universal derivative function based on the rule df(x) ≈ (f(x + h) − f(x)) / h.


Your implementation never solves the XOR problem. There is something wrong with the code you published. Changing the training data to
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
Accuracy: 50.00%
no matter how many iterations. I thought it was suspicious you used AND as an example instead of XOR.


Why are there two output neurons for AND operation? Shouldn’t there be just one?


Because the output is 2 values that we interpret as the probabilities of the result being 0 or 1, and we then pick the highest (argmax).


Although if I would redo the code, I would only use 1 output node (if it’s only for a binary classification).


Ok. Thanks. Got me confused there for a bit.


I’ve solved all the problems so far. Really quality content here. There is just one thing left: I just can’t seem to see where the biases are adjusted. How should their adjustment be implemented?


Hello, can you tell me why I get an error when I change the layers to the following?

nn.add_layer(Layer(3, 8, 'sigmoid'))
nn.add_layer(Layer(8, 3, 'sigmoid'))

Jakub Svoboda

this has been a really helpful article, but there seems to be a mistake in that the biases are never updated. Could you provide us with the line of code to compute the delta of biases, if you know how?

