In this blog post, we will talk about object detection, take a look at our data, and understand and create a simple convolutional neural network to classify six different “natural scenes.”
Object detection is a computer vision method in which an AI is trained to identify and locate objects in an image or video. This method can be broken into two major steps: object localization and object recognition.
Object localization is locating the presence of objects in an image and finding a bounding box for that object. Object recognition is classifying the objects that the model found.
Understanding and preprocessing the data
In this blog post, we will be using the Intel Image Classification dataset, which contains images of natural scenes from around the world. Our goal is to make a model that can take in an image and accurately return one of six classes: buildings, forest, glacier, mountain, sea, or street. The first step in creating a machine learning model is to explore the data: find its patterns and trends and visualize it.
Here is my Kaggle Notebook if you would like to follow along: https://www.kaggle.com/krish45732/intel-image-classification-notebook. (The notebook's output and the output in this blog post may differ.)
We will start by importing the libraries that we will need later on.
The data is separated into six folders, one for each class. We will make a get_images function to extract our images into one array. First, we make empty lists to store the images (x_list) and their labels (y_list). Next, the outer loop cycles through each class (0–5), and the inner loop cycles through each image in that class's folder. We use the cv2 library to read each image, resize it to (150, 150, 3), and append it to x_list. We also append the class number to y_list. Lastly, we convert the Python lists into NumPy arrays, shuffle the data, and return the arrays. The
get_images_pred function is similar to the one we just made, but it has only one loop and doesn't return labels, since the prediction set is unlabeled. The to_onehot function will be used later to convert the y values to one-hot arrays.
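The full code is in the notebook linked above; the shuffle and to_onehot steps can be sketched in NumPy like this (the key detail is shuffling images and labels with the same permutation so each image stays paired with its label):

```python
import numpy as np

def shuffle_together(x_list, y_list):
    """Shuffle images and labels with the same permutation,
    so each image stays paired with its label."""
    x, y = np.array(x_list), np.array(y_list)
    order = np.random.permutation(len(x))
    return x[order], y[order]

def to_onehot(y, num_classes=6):
    """Turn class numbers like [0, 2] into rows like [1,0,0,0,0,0]."""
    onehot = np.zeros((len(y), num_classes))
    onehot[np.arange(len(y)), y] = 1
    return onehot

# Tiny stand-in data: "image" [n] carries label n-1, so pairing is checkable.
x, y = shuffle_together([[1], [2], [3]], [0, 1, 2])
print(to_onehot(y))
```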
Next, we create a dictionary that maps each class number to its name. Then, we use the get_images function to get the training and validation images. The validation dataset will be used to check whether our model is overfitting, or “memorizing,” the training dataset.
Next, we will visualize the data using matplotlib. This lets us confirm that we got the right images with the right classes and preview the images we will be using.
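A sketch of that preview grid, with random arrays standing in for the images that get_images loads in the notebook:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

class_names = {0: "buildings", 1: "forest", 2: "glacier",
               3: "mountain", 4: "sea", 5: "street"}

# Random placeholder "images"; in the notebook these come from get_images.
x_train = np.random.randint(0, 256, size=(16, 150, 150, 3), dtype=np.uint8)
y_train = np.random.randint(0, 6, size=16)

fig, axes = plt.subplots(4, 4, figsize=(8, 8))
for ax, img, label in zip(axes.flat, x_train, y_train):
    ax.imshow(img)                      # show the image
    ax.set_title(class_names[label])    # label it with its class name
    ax.axis("off")
fig.savefig("preview.png")
```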
The output images will vary since we shuffled the images before, but confirm that the labels match their images. Here is my output:
Now that we have taken a look at our dataset, we can get started with creating our model. First, we convert our label arrays into one-hot arrays using the to_onehot function we previously made.
Artificial neural networks (ANN)
Artificial neural networks are essentially complex math functions, loosely based on the neural networks in the human brain. Here is a broad overview of ANNs. An artificial neural network begins with an input layer, where the input data (what the neural network is going to learn from) is fed in. Next, there are one or more hidden layers. These hidden layers make up the bulk of the math function and find key features in the input. Each hidden layer adds even more complexity to the function.

In the image below, i1 and i2 are the two input nodes. Each input is multiplied by a weight (w1, w2, w3, w4) and a bias value (b1) is added on. This value is passed through an activation function such as Sigmoid, ReLU, or TanH. These functions normalize the output of the node, typically into a 0-to-1 or -1-to-1 range. The result is then passed on to the nodes in the next layer, and the process repeats. For the final output layer (o1 and o2), the values are passed through a different activation function and, depending on the type of problem, the network returns a classification, a probability, or something similar.

When training, lots of input data is fed into the neural network. The output is compared to the expected value to find the error of the model. Then, the model undergoes back-propagation, in which the weights and biases of the neural network are adjusted to reduce the error.
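The single-node computation described above (weighted sum, plus bias, through an activation) can be written out in a few lines, here with the sigmoid activation and made-up weight and bias values:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the 0-to-1 range."""
    return 1 / (1 + np.exp(-z))

def node_output(inputs, weights, bias):
    """One node: weighted sum of the inputs, plus a bias,
    passed through the activation function."""
    return sigmoid(np.dot(inputs, weights) + bias)

i = np.array([0.5, -1.0])   # i1, i2
w = np.array([0.8, 0.2])    # w1, w2 (made-up values)
b = 0.1                     # b1
print(node_output(i, w, b))  # sigmoid(0.5*0.8 - 1.0*0.2 + 0.1) = sigmoid(0.3)
```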
Convolutional neural networks (CNNs)
A convolutional neural network is an algorithm commonly used for object detection and object recognition. CNNs usually perform better on images than plain multilayer perceptrons. A CNN essentially reduces the image size while maintaining the image's key features, making processing easier. As shown below, convolutional neural networks can have convolution layers, max-pooling layers, flatten layers, dense (fully connected) layers, and more.
In the GIF on the left, the leftmost image is the input, the middle image is the convolution filter, and the rightmost image is the output. In this example, the 3x3 filter is multiplied element-wise with each 3x3 section of the image; the values are summed up and put in the output image. Things that can change include the filter size, the stride (how many pixels the filter moves by), and padding (which can keep the image the same size). To learn more about convolution layers, you can read this article.
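That multiply-and-sum operation can be sketched directly in NumPy; the 4x4 input and the vertical-edge kernel below are made up for illustration:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and
    sum the element-wise products at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([[1, 2, 3, 0],
                  [4, 5, 6, 1],
                  [7, 8, 9, 2],
                  [1, 0, 1, 3]], dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)  # a simple vertical-edge filter
print(conv2d(image, kernel))  # a 4x4 input and 3x3 filter give a 2x2 output
```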
Max-pool layers downsample the image while keeping its key features, reducing the computation needed. A max-pool layer slides a filter over the image and takes the maximum value in each window. The flatten layer takes the output from the last convolution/pooling layer and flattens it into a one-dimensional array to be used in the dense layers.
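A 2x2 max-pool can be sketched the same way, with made-up input values:

```python
import numpy as np

def max_pool(image, size=2):
    """Downsample by taking the max of each size x size window."""
    oh, ow = image.shape[0] // size, image.shape[1] // size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = image[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = np.array([[1, 3, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]], dtype=float)
pooled = max_pool(image)
print(pooled)  # the 4x4 image shrinks to 2x2, keeping each window's max
```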
Creating a convolutional neural network
First, we create a Sequential model, since the data will pass through the layers one after another. Conv2D layers are the convolution layers. MaxPool2D is a max-pooling layer that downsamples the image to highlight its key features. After passing the data through a couple of convolution and max-pooling layers, we flatten it so it can be used in the dense layers. Dense layers are fully connected neural network layers. The dropout layer helps reduce overfitting, or “memorizing,” the training images. The last dense layer outputs an array that sums to one, holding the probability for each class. Next, we compile the model and print out a summary of it. The summary shows the output shape of each layer, the parameters per layer, and the total number of parameters. In this model, we have over 1.1 million parameters.
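The exact architecture lives in the notebook; a Keras model along these lines (the layer sizes below are illustrative, not necessarily the notebook's) would look like:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(150, 150, 3)),
    layers.MaxPool2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPool2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),                     # helps reduce overfitting
    layers.Dense(6, activation="softmax"),   # probabilities for the 6 classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # matches the one-hot labels
              metrics=["accuracy"])
model.summary()  # prints output shapes and parameter counts per layer
```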
Training the model
Now that the model is created and compiled, we can train it using Keras's simple fit function. We pass in the training data, the number of epochs, and the validation data. The number of epochs is the number of times the model will pass through the entire dataset; in this case, 25. I suggest you train this using a GPU from Google Colab, Kaggle Kernels, or something similar to speed up the process. For me, it took roughly 25 minutes to train using Kaggle.
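The fit call takes the training arrays, the number of epochs, and the validation data. To keep this sketch runnable on its own, it fits a tiny dense model on random stand-in data for 2 epochs instead of the real images for 25:

```python
import numpy as np
from tensorflow.keras import layers, models

# Random stand-in data, not real images: 32 samples of 8 features each.
x_train = np.random.rand(32, 8)
y_train = np.eye(6)[np.random.randint(0, 6, 32)]   # random one-hot labels
x_val = np.random.rand(8, 8)
y_val = np.eye(6)[np.random.randint(0, 6, 8)]

model = models.Sequential([layers.Dense(6, activation="softmax", input_shape=(8,))])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(x_train, y_train,
                    epochs=2,                       # the post uses 25
                    validation_data=(x_val, y_val),
                    verbose=0)
# history.history records loss/accuracy per epoch for both datasets.
print(sorted(history.history.keys()))
```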
Analyzing the results
Now that the training is complete, we can take a look at the results. We will plot the training and validation accuracy. We will also plot a line for the best validation epoch and its accuracy.
Note that your graph will vary. Using the graph, we can see that both the training and validation accuracy increased a lot in the first few epochs. Near the end, the training accuracy was still going up but started to flatten out, while the validation accuracy was going down. This is called overfitting: the model starts to memorize the training set, so the training accuracy keeps rising while the accuracy on the validation set, which the model is not trained on, drops. According to the print statement and the graph, epoch 15 was the best epoch, with a validation accuracy of 83.33%. In future articles, we will talk about how to reduce overfitting and increase accuracy.
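A sketch of that plot and the best-epoch lookup, with made-up accuracy values standing in for what history.history records in the notebook:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Illustrative values only; in the notebook these come from
# history.history["accuracy"] and history.history["val_accuracy"].
train_acc = [0.55, 0.70, 0.78, 0.82, 0.85, 0.87, 0.89, 0.90]
val_acc   = [0.60, 0.72, 0.78, 0.81, 0.83, 0.82, 0.81, 0.80]

best_epoch = int(np.argmax(val_acc)) + 1   # epochs are 1-indexed on the plot
print(f"Best epoch: {best_epoch}, val accuracy: {max(val_acc):.2%}")

epochs = range(1, len(train_acc) + 1)
plt.plot(epochs, train_acc, label="training accuracy")
plt.plot(epochs, val_acc, label="validation accuracy")
plt.axvline(best_epoch, linestyle="--", label="best validation epoch")
plt.xlabel("epoch"); plt.ylabel("accuracy"); plt.legend()
plt.savefig("accuracy.png")
```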
Now, we can use our model to predict images it has not seen before. First, we load the prediction dataset using the get_images_pred function we made before. Then, we use the predict_classes function from Keras to predict the classes. Lastly, we plot 16 images from the prediction set with their labels.
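Note that predict_classes was removed in newer versions of Keras; the equivalent is taking the argmax of model.predict's softmax output. A NumPy sketch with made-up probabilities:

```python
import numpy as np

class_names = {0: "buildings", 1: "forest", 2: "glacier",
               3: "mountain", 4: "sea", 5: "street"}

# Made-up softmax outputs for three images; each row sums to 1.
probs = np.array([[0.05, 0.80, 0.05, 0.04, 0.03, 0.03],
                  [0.10, 0.05, 0.60, 0.15, 0.05, 0.05],
                  [0.02, 0.02, 0.02, 0.02, 0.02, 0.90]])

# The highest-probability class per row is what predict_classes returned.
pred_classes = probs.argmax(axis=1)
print([class_names[c] for c in pred_classes])
```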
The output can vary, but compare the labels to what you think each image is. Here is my output:
In this article, we created a simple CNN to recognize different “natural scenes.” We learned how to load the images, plot them, create and compile a model, train the model, analyze the results, and make predictions! I hope you had fun and learned something new! In future articles, we will look at how to reduce overfitting and increase accuracy, as well as transfer learning.
Krish Ranjan is a Student Ambassador in the Inspirit AI Student Ambassadors Program. Inspirit AI is a pre-collegiate enrichment program that exposes curious high school students globally to AI through live online classes. Learn more at https://www.inspiritai.com/.