Implementing your own neural network can be hard, especially if, like me, you come from a computer science background: math equations and notation make you dizzy, and you understand things better through actual code.

Today I’ll show you how easy it is to implement a flexible neural network and train it using the backpropagation algorithm. I’ll be implementing this in Python using only NumPy as an external library.

After reading this post, you should understand the following:

• How to feed inputs forward through a neural network.
• How to use the backpropagation algorithm to train a neural network.
• How to use the neural network to solve a problem.

In this post, we’ll use our neural network to solve a very simple problem: Binary AND.
The source code of the implementation is available here.

## Background knowledge

In order to easily follow and understand this post, you’ll need to know the following:

• The basics of Python / OOP.
• Some linear algebra and calculus (e.g. dot products, derivatives).
• An idea of neural networks.
• (Optional) How to work with NumPy.

Rest assured though, I’ll try to explain everything I do/use here.

## Neural networks

Before throwing ourselves into our favourite IDE, we must understand what exactly neural networks are (or more precisely, feedforward neural networks).

A feedforward neural network (also called a multilayer perceptron) is an artificial neural network whose layers are connected without forming a cycle. This means that the network is not recurrent and there are no feedback connections.

These neural networks try to approximate a function $y = f(x)$, where $x$ are the inputs that get fed forward through the network to give us an output $y$.

The lines connecting the network’s nodes (neurons) are called weights. They are real numbers (floats); in this post they start out as random values between 0 and 1.
Also, each neuron has a bias unit (a float, also between 0 and 1 here) that helps shift the results.
Together, these are called the parameters of the network.

When inputs are fed forward through the network, each layer calculates the dot product between its weights and the inputs, adds its bias, then activates the result using an activation function (e.g. sigmoid, tanh). These activation functions are used to introduce non-linearity.
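To make this concrete, here’s a minimal sketch of what a single layer computes. The function name `layer_forward` and the sample numbers are made up for illustration; they aren’t part of the implementation built in this post.

```python
import numpy as np


def sigmoid(r):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-r))


def layer_forward(x, weights, bias, activation=None):
    # Dot product between the inputs and the weights, plus the bias.
    r = np.dot(x, weights) + bias
    # Apply the activation function only if one was given.
    return activation(r) if activation is not None else r


# Hypothetical layer: 2 inputs, 3 neurons.
x = np.array([1.0, 2.0])
weights = np.array([[0.5, -1.0, 0.0],
                    [0.25, 0.5, 0.0]])
bias = np.zeros(3)

print(layer_forward(x, weights, bias))           # values 1, 0 and 0
print(layer_forward(x, weights, bias, sigmoid))  # the same values squashed into (0, 1)
```

Without an activation the layer is just a linear map; the sigmoid call is what makes the output non-linear.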

## Implementing a flexible neural network

In order for our implementation to be called flexible, we should be able to add/remove layers without changing the code. This means that hard-coding weights and layers is a no go.

First of all, let’s import NumPy and set a seed for us to get the same results when generating random numbers:
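A minimal version of that setup could look like this. The seed value 100 is an arbitrary choice for this sketch; any fixed integer gives reproducible random numbers.

```python
import numpy as np

# Fix the seed so the randomly generated weights and biases
# are identical on every run. The value 100 is an assumption.
np.random.seed(100)
```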

### Layers

We’ll create a class that represents our network’s hidden and output layers.

#### Initialisation

In the class’s constructor, `__init__`, we’ll initialize the layer’s weights and bias using the length of the input and the number of neurons this layer has. Specifying the weights and bias is optional, as they will be randomly generated if not provided.
N.B: The layer’s input is not always our initial input $X$; it can also be the output of a previous layer.
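Here’s a sketch of what such a constructor might look like. The parameter names (`n_input`, `n_neurons`, `activation`, `weights`, `bias`) are assumptions made for this example, and `np.random.rand` is assumed as the source of the random values.

```python
import numpy as np


class Layer:
    """A single fully connected layer (constructor only, as a sketch)."""

    def __init__(self, n_input, n_neurons, activation=None, weights=None, bias=None):
        # One row per input and one column per neuron; random floats
        # in [0, 1) unless explicit values are passed in.
        self.weights = weights if weights is not None else np.random.rand(n_input, n_neurons)
        # One bias value per neuron.
        self.bias = bias if bias is not None else np.random.rand(n_neurons)
        # Name of the activation to use, e.g. 'tanh' or 'sigmoid' (None = no activation).
        self.activation = activation


hidden_layer_1 = Layer(3, 4)
print(hidden_layer_1.weights.shape)  # (3, 4)
print(hidden_layer_1.bias.shape)     # (4,)
```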

If we take the network shown in the figures above, we can represent its first hidden layer (3×4) as follows: hidden_layer_1 = Layer(3, 4).
As a result, the weights will be a 3×4 matrix and the bias a vector of length 4, both filled with random values.

The columns in the weights matrix represent our layer’s neurons, and each row holds the weights between one input neuron and all our layer’s neurons.

#### Activation

As the inputs get fed forward through our network, each layer must calculate the output using its weights and the received inputs, apply an activation function (if chosen) on the output then return the result to us.

Here’s an explanation of the process:

1. The input goes through our activate function.
2. It calculates $r = X \cdot W + b$.
3. Applies the chosen activation function $f(r)$, if any.
4. Saves the result in last_activation (I’ll explain why later).
5. Returns the result.

N.B: I only implemented 2 activation functions here, tanh and sigmoid, but there are many more available.
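The five steps above can be sketched as follows. The helper name `_apply_activation` and the string-based choice of activation are assumptions for this example; `activate` and `last_activation` are the names used in the steps.

```python
import numpy as np


class Layer:
    def __init__(self, n_input, n_neurons, activation=None, weights=None, bias=None):
        self.weights = weights if weights is not None else np.random.rand(n_input, n_neurons)
        self.bias = bias if bias is not None else np.random.rand(n_neurons)
        self.activation = activation
        self.last_activation = None

    def activate(self, x):
        # Step 2: weighted sum of the inputs plus the bias.
        r = np.dot(x, self.weights) + self.bias
        # Steps 3 and 4: apply the activation and remember the result.
        self.last_activation = self._apply_activation(r)
        # Step 5: return it.
        return self.last_activation

    def _apply_activation(self, r):
        # Hypothetical helper: dispatches on the activation's name.
        if self.activation is None:
            return r
        if self.activation == 'tanh':
            return np.tanh(r)
        if self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-r))
        raise ValueError('Unknown activation: %s' % self.activation)


layer = Layer(3, 4, activation='sigmoid')
print(layer.activate([1, 2, 3]))  # four values, each between 0 and 1
```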

If we want to activate the input [1, 2, 3] for example, we’ll write the following: `layer.activate([1, 2, 3])`, which results in: [1.14829064 2.35516451 4.65967912 4.10271179], a vector of length 4 (our number of neurons). The dot product between a vector $X$ and a matrix $W$ is simply a new vector made of the dot products between $X$ and each column of $W$. You can find more information about this here.
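Here’s a small worked example of that dot product. The weight matrix is made up (small integers so the arithmetic is easy to check by hand) and the bias is left out.

```python
import numpy as np

x = np.array([1, 2, 3])
# Hypothetical 3x4 weight matrix: 3 inputs (rows), 4 neurons (columns).
W = np.array([[1, 0, 2, 0],
              [0, 1, 0, 2],
              [1, 1, 1, 1]])

result = np.dot(x, W)
print(result)  # [4 5 5 7]

# The same values, computed one neuron (column) at a time.
per_neuron = [int(np.dot(x, W[:, j])) for j in range(W.shape[1])]
print(per_neuron)  # [4, 5, 5, 7]
```

Note the shapes: a length-3 input times a 3×4 matrix gives a length-4 output, one value per neuron.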

Don’t forget that we didn’t specify any activation function, so the result we’re seeing here is only the dot product.
If we did actually specify one, let’s say sigmoid, then every value in the result would be squashed between 0 and 1.

### Neural networks

We’ll also create a class to represent a neural network:

The code is pretty straightforward: we have a constructor that initializes an empty list `_layers` and a function `add_layer` that appends layers to that list (of course, the layers are of type `Layer`, the one we created earlier).

### Feed forward

As previously explained, inputs travel forward through the network.
We’ll go ahead and implement it in our NeuralNetwork class:

• The function feed_forward takes the input and feeds it forward through all our layers.
• Each next layer will take the output of the previous one as the new input.
• The function predict chooses the winning probability and returns its index (interpreted as the winning class).
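A sketch of the class with these two functions might look like this. The `Layer` here is a compact stand-in so the example runs on its own, and using `argmax` over the last axis in `predict` is an assumption about how the winning class is picked.

```python
import numpy as np


class Layer:
    """Compact stand-in for the Layer class described earlier."""

    def __init__(self, n_input, n_neurons, activation=None, weights=None, bias=None):
        self.weights = weights if weights is not None else np.random.rand(n_input, n_neurons)
        self.bias = bias if bias is not None else np.random.rand(n_neurons)
        self.activation = activation
        self.last_activation = None

    def activate(self, x):
        r = np.dot(x, self.weights) + self.bias
        if self.activation == 'tanh':
            r = np.tanh(r)
        elif self.activation == 'sigmoid':
            r = 1 / (1 + np.exp(-r))
        self.last_activation = r
        return r


class NeuralNetwork:
    def __init__(self):
        self._layers = []

    def add_layer(self, layer):
        self._layers.append(layer)

    def feed_forward(self, X):
        # The output of each layer becomes the input of the next one.
        for layer in self._layers:
            X = layer.activate(X)
        return X

    def predict(self, X):
        # Index of the largest output; axis=-1 handles one sample or a batch.
        return np.argmax(self.feed_forward(X), axis=-1)


# Identity weights and zero bias, so the output equals the input.
nn = NeuralNetwork()
nn.add_layer(Layer(2, 2, weights=np.eye(2), bias=np.zeros(2)))
print(nn.predict(np.array([[0.2, 0.8], [0.9, 0.1]])))  # [1 0]
```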

### Our neural network

Let’s now create a neural network that we will use throughout the post.
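Here’s one way such a network could be put together. The two hidden layers and the two output neurons follow the rest of the post; the hidden layer sizes (3 and 3), the choice of activations and the seed are assumptions for this sketch, and the classes are compact stand-ins so the example runs on its own.

```python
import numpy as np

np.random.seed(100)  # arbitrary seed, an assumption


class Layer:
    def __init__(self, n_input, n_neurons, activation=None):
        self.weights = np.random.rand(n_input, n_neurons)
        self.bias = np.random.rand(n_neurons)
        self.activation = activation

    def activate(self, x):
        r = np.dot(x, self.weights) + self.bias
        if self.activation == 'tanh':
            return np.tanh(r)
        if self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-r))
        return r


class NeuralNetwork:
    def __init__(self):
        self._layers = []

    def add_layer(self, layer):
        self._layers.append(layer)

    def predict(self, X):
        for layer in self._layers:
            X = layer.activate(X)
        return np.argmax(X, axis=-1)


nn = NeuralNetwork()
nn.add_layer(Layer(2, 3, 'tanh'))     # hidden layer 1: 2 inputs (A and B)
nn.add_layer(Layer(3, 3, 'sigmoid'))  # hidden layer 2
nn.add_layer(Layer(3, 2, 'sigmoid'))  # output layer: one neuron per class (0 or 1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(nn.predict(X))  # untrained, so don't expect correct answers yet
```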

Our neural network’s goal is to be able to perform the Binary AND operation.
Although at the moment, it sucks at it:

| A | B | A ∧ B | Predicted |
|---|---|-------|-----------|
| 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 |

### Backpropagation

Now we’re at the most important step of our implementation, the backpropagation algorithm.
Simply put, backpropagation is a method that calculates gradients, which are then used to train neural networks by modifying their weights for better results/predictions.
The algorithm is supervised, meaning that we’ll need to provide examples (inputs and targets) of how it should work in order for it to actually help us.

In this implementation, we’ll use a squared-error loss (the Mean Squared Sum Loss) as the loss function for the backpropagation algorithm, computed from the difference between the target value $y$ and the predicted value $\hat{y}$.

#### How does it work?

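Let’s represent our neural network the following way:

$$o_1 = f_1(X \cdot W_1 + b_1), \quad o_2 = f_2(o_1 \cdot W_2 + b_2), \quad \hat{y} = f_3(o_2 \cdot W_3 + b_3)$$

where $o_1$ and $o_2$ are the outputs of the hidden layer 1 and the hidden layer 2.
We can then see that the output is simply a chain of functions:

$$\hat{y} = f_3(f_2(f_1(X \cdot W_1 + b_1) \cdot W_2 + b_2) \cdot W_3 + b_3)$$

Calculating the gradients and changing our weights is now a step closer: all we have to do is use the Chain Rule and update every single weight we have.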

#### Calculating the errors and the deltas

For each layer, starting from the last layer (output layer):

• Calculate the error $E_i$:
For the output layer: $E = y - \hat{y}$, where $\hat{y}$ is the output that we get from `feed_forward(X)`.
For the hidden layers: $E_i = W_{i+1} \cdot \delta_{i+1}$, where $i+1$ is the next layer and $\delta_{i+1}$ is its delta.
• Calculate the delta: $\delta_i = E_i \times f'(o_i)$, where $f'$ is the derivative of our activation function and $o_i$ is the last activation that we stored previously.

First, let’s add the derivatives of our activation functions:
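A sketch of the two derivatives could look like this. The function names are assumptions; the important detail is that both take the already activated output (the stored last activation), not the raw weighted sum.

```python
import numpy as np


def sigmoid(r):
    return 1 / (1 + np.exp(-r))


def sigmoid_derivative(o):
    # o is sigmoid(r); the derivative with respect to r is o * (1 - o).
    return o * (1 - o)


def tanh_derivative(o):
    # o is tanh(r); the derivative with respect to r is 1 - o^2.
    return 1 - o ** 2


print(sigmoid_derivative(sigmoid(0.0)))  # 0.25
print(tanh_derivative(np.tanh(0.0)))     # 1.0
```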

Then calculate the errors and the deltas:
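Here’s a standalone sketch of that backward pass, written as a plain function over lists of arrays rather than as a method. The name `compute_deltas`, its arguments, and the use of sigmoid in every layer are assumptions made to keep the example short.

```python
import numpy as np


def sigmoid_derivative(o):
    return o * (1 - o)


def compute_deltas(weights, outputs, y):
    """weights[i]: (n_inputs, n_neurons) matrix of layer i.
    outputs[i]: stored activation of layer i (sigmoid assumed everywhere).
    y: target for the output layer."""
    n = len(weights)
    errors, deltas = [None] * n, [None] * n
    # Walk backwards, starting from the output layer.
    for i in reversed(range(n)):
        if i == n - 1:
            errors[i] = y - outputs[i]
        else:
            # Push the next layer's delta back through its weights.
            errors[i] = np.dot(weights[i + 1], deltas[i + 1])
        deltas[i] = errors[i] * sigmoid_derivative(outputs[i])
    return errors, deltas


# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron.
weights = [np.zeros((3, 2)), np.array([[1.0], [-1.0]])]
outputs = [np.array([0.5, 0.5]), np.array([0.5])]
errors, deltas = compute_deltas(weights, outputs, y=np.array([1.0]))

print(deltas[1])  # output layer: (1 - 0.5) * 0.25 = 0.125
print(deltas[0])  # hidden layer: 0.03125 and -0.03125
```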

#### Updating the weights

All we have to do now is update our weight matrices using the calculated deltas and a learning rate.
The learning rate is a number, typically between 0 and 1, that controls how much we adjust the weights. A big learning rate may cause you to overshoot the minimum, while a small learning rate might be too slow to converge.

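1. Loop over the layers (from first to last).
2. Choose the input to use ($X$ or $o_{i-1}$, the output of the previous layer), depending on the current layer.
3. Update the layer’s weights using $W_i = W_i + \eta \cdot \text{input}^T \cdot \delta_i$, where $\eta$ is the learning rate.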

Our backpropagation function becomes:
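Putting the pieces together, a sketch of the whole step might look like this. It’s written as a free function over a list of layers to keep it short; the method name `activation_derivative` is an assumption. The bias update at the end is not part of the steps described above: it’s an addition in this sketch that reuses the same delta.

```python
import numpy as np


class Layer:
    def __init__(self, n_input, n_neurons, activation=None):
        self.weights = np.random.rand(n_input, n_neurons)
        self.bias = np.random.rand(n_neurons)
        self.activation = activation
        self.last_activation = None
        self.error = None
        self.delta = None

    def activate(self, x):
        r = np.dot(x, self.weights) + self.bias
        if self.activation == 'tanh':
            r = np.tanh(r)
        elif self.activation == 'sigmoid':
            r = 1 / (1 + np.exp(-r))
        self.last_activation = r
        return r

    def activation_derivative(self, o):
        if self.activation == 'tanh':
            return 1 - o ** 2
        if self.activation == 'sigmoid':
            return o * (1 - o)
        return np.ones_like(o)


def backpropagation(layers, x, y, learning_rate):
    # Forward pass: every layer stores its last activation.
    output = x
    for layer in layers:
        output = layer.activate(output)

    # Backward pass: errors and deltas, from the output layer down.
    for i in reversed(range(len(layers))):
        layer = layers[i]
        if i == len(layers) - 1:
            layer.error = y - output
        else:
            next_layer = layers[i + 1]
            layer.error = np.dot(next_layer.weights, next_layer.delta)
        layer.delta = layer.error * layer.activation_derivative(layer.last_activation)

    # Update: the first layer uses x, the others use the previous layer's output.
    for i, layer in enumerate(layers):
        input_to_use = np.atleast_2d(x if i == 0 else layers[i - 1].last_activation)
        layer.weights += learning_rate * input_to_use.T * layer.delta
        # Addition in this sketch: adjust the bias with the same delta.
        layer.bias += learning_rate * layer.delta


np.random.seed(100)  # arbitrary seed
layers = [Layer(2, 3, 'tanh'), Layer(3, 2, 'sigmoid')]
backpropagation(layers, np.array([1.0, 0.0]), np.array([0.0, 1.0]), learning_rate=0.3)
print(layers[-1].delta)
```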

### Training the neural network

Finally, we need to train our neural network using the backpropagation function we implemented above.
We will use the stochastic gradient descent algorithm, where we update the weights with each row of our input (you can also take mini-batches instead).

We repeat the process for max_epochs epochs (also called cycles or runs).
At every 10th epoch, we will print out the Mean Squared Error and save it in mses which we will return at the end.

Here’s the complete implementation:
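The training loop itself can be sketched as a function that works with any object offering `backpropagation` and `feed_forward`. The class `SingleLayerNetwork` below is a hypothetical stand-in (one sigmoid layer with a single output) that exists only so the example runs on its own.

```python
import numpy as np


def train(network, X, y, learning_rate, max_epochs):
    mses = []
    for epoch in range(max_epochs):
        # Stochastic gradient descent: one update per row of the input.
        for j in range(len(X)):
            network.backpropagation(X[j], y[j], learning_rate)
        # Every 10th epoch, record and print the Mean Squared Error.
        if epoch % 10 == 0:
            mse = np.mean(np.square(y - network.feed_forward(X)))
            mses.append(mse)
            print('Epoch: #%d, MSE: %f' % (epoch, mse))
    return mses


class SingleLayerNetwork:
    """Hypothetical stand-in: one sigmoid layer, enough to exercise train()."""

    def __init__(self, n_input, n_output):
        self.weights = np.zeros((n_input, n_output))
        self.bias = np.zeros(n_output)

    def feed_forward(self, X):
        return 1 / (1 + np.exp(-(np.dot(X, self.weights) + self.bias)))

    def backpropagation(self, x, y, learning_rate):
        o = self.feed_forward(x)
        delta = (y - o) * o * (1 - o)
        self.weights += learning_rate * np.atleast_2d(x).T * delta
        self.bias += learning_rate * delta


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [0], [0], [1]], dtype=float)
mses = train(SingleLayerNetwork(2, 1), X, y, learning_rate=0.3, max_epochs=50)
```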

## Training the neural network to solve the Binary AND problem

The learning rate 0.3 and max_epochs 290 were chosen by trial and error, nothing fancy.
Here’s the result of our training:
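To try a run like this yourself, here’s a compact end-to-end sketch. Several things in it are assumptions rather than the post’s exact code: the seed, the hidden layer sizes, the one-hot encoding of the targets (one output neuron per class), and the bias update. The printed numbers depend on the seed, so none are quoted here.

```python
import numpy as np

np.random.seed(100)  # arbitrary seed, an assumption


class Layer:
    def __init__(self, n_input, n_neurons, activation=None):
        self.weights = np.random.rand(n_input, n_neurons)
        self.bias = np.random.rand(n_neurons)
        self.activation = activation
        self.last_activation = None
        self.delta = None

    def activate(self, x):
        r = np.dot(x, self.weights) + self.bias
        if self.activation == 'tanh':
            r = np.tanh(r)
        elif self.activation == 'sigmoid':
            r = 1 / (1 + np.exp(-r))
        self.last_activation = r
        return r

    def activation_derivative(self, o):
        if self.activation == 'tanh':
            return 1 - o ** 2
        if self.activation == 'sigmoid':
            return o * (1 - o)
        return np.ones_like(o)


class NeuralNetwork:
    def __init__(self):
        self._layers = []

    def add_layer(self, layer):
        self._layers.append(layer)

    def feed_forward(self, X):
        for layer in self._layers:
            X = layer.activate(X)
        return X

    def predict(self, X):
        return np.argmax(self.feed_forward(X), axis=-1)

    def backpropagation(self, x, y, learning_rate):
        output = self.feed_forward(x)
        # Errors and deltas, from the output layer down.
        for i in reversed(range(len(self._layers))):
            layer = self._layers[i]
            if i == len(self._layers) - 1:
                error = y - output
            else:
                next_layer = self._layers[i + 1]
                error = np.dot(next_layer.weights, next_layer.delta)
            layer.delta = error * layer.activation_derivative(layer.last_activation)
        # Weight update (plus a bias update, an addition in this sketch).
        for i, layer in enumerate(self._layers):
            inp = np.atleast_2d(x if i == 0 else self._layers[i - 1].last_activation)
            layer.weights += learning_rate * inp.T * layer.delta
            layer.bias += learning_rate * layer.delta

    def train(self, X, y, learning_rate, max_epochs):
        mses = []
        for epoch in range(max_epochs):
            for j in range(len(X)):
                self.backpropagation(X[j], y[j], learning_rate)
            if epoch % 10 == 0:
                mses.append(np.mean(np.square(y - self.feed_forward(X))))
        return mses


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 0, 0, 1])  # A AND B
y = np.eye(2)[labels]            # one-hot targets, an assumption

nn = NeuralNetwork()
nn.add_layer(Layer(2, 3, 'tanh'))
nn.add_layer(Layer(3, 3, 'sigmoid'))
nn.add_layer(Layer(3, 2, 'sigmoid'))

mses = nn.train(X, y, learning_rate=0.3, max_epochs=290)
accuracy = np.mean(nn.predict(X) == labels)
print('First MSE: %f, last MSE: %f' % (mses[0], mses[-1]))
print('Accuracy: %.2f%%' % (accuracy * 100))
```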

Our neural network can successfully do binary AND operations now!

## Fun note

ABPrediction
201
0-10
452031
-2100
0851
0-3280

Aside from being able to do binary AND operations, the neural network seems to always predict 1 when a positive number is used and 0 when a negative number is used.
Sadly, this remains a mystery to us…

## Conclusion

Implementing a neural network can be challenging at first, especially since a lot of articles make it seem like it’s hard and you need to be a math genius to get it done. In reality, it’s fairly simple to do. All you need is some basic knowledge of calculus and machine learning.

I hope this post helped you understand how a neural network functions and especially how it can be trained using the Backpropagation algorithm. If you have any suggestions, improvements, questions or whatever feel free to comment below!

Published in Algorithms, Deep Learning

## Comments

foolish_programmer

Every example that I see always uses the AND problem. What I want to know is: is it possible to predict multiple possibilities from 1 set of data?
What I mean is, let’s say I have 1 row of data, but I want to predict the result with 5 different possibilities. Should I repeat the process for each possibility, or is it possible to make the output layer give a prediction for all 5 possibilities?

wancong zhang

Very cool article and implementation.

I just want to point out one mistake in your matrix multiplication. In your diagram and description you multiplied a 3×4 matrix (W) with a 3×1 matrix (X). Instead it should be the product of a 4×3 matrix and a 3×1 matrix.

Rutuja

Which Python or Anaconda version are you using?

Alexander

Great article. But I don’t understand how to generalize backpropagation for any loss function. Is there a simple tutorial like this one?

Suleyman

Great article, but don’t stop please. Show how to create CNN, LSTM, RNN, Decision Tree, SVM, Random Forest. Thank you 🙂 By the way, you can create a universal derivative function based on the rule df(x) = (f(x + h) - f(x)) / h

Oskarooo

Your implementation never solves the XOR problem. There is something wrong with the code you published. Changing the training data to
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
yields
Accuracy: 50.00%
no matter how many iterations. I thought it was suspicious you used AND as an example instead of XOR.

Oskarooo

Why are there two output neurons for the AND operation? Shouldn’t there be just one?

zHaytam

Because the output is 2 values that we interpret as the probability of the result being 0 or 1, from which we then pick the highest (argmax).

zHaytam

Although if I were to redo the code, I would only use 1 output node (if it’s only for binary classification).

Oskarooo

Ok. Thanks. Got me confused there for a bit.

Oskarooo

I’ve solved all the problems so far. Really quality content here. There is just one thing left. I just can’t seem to see where the biases are adjusted. How should adjusting them be implemented?

wildan

Hello, can you tell me why I get an error when I change the layers?

Jakub Svoboda

Hi,
this has been a really helpful article, but there seems to be a mistake in that the biases are never updated. Could you provide us with the line of code to compute the delta of biases, if you know how?

Thanks