MXNet as simple as possible

This article describes our recent experience with the MXNet deep learning framework. It is a step-by-step guide showing how to use a CNN with pre-trained models to solve image recognition tasks (for instance, recognizing pizza types), how to prepare data using MXNet tools, and how to fine-tune the weights of a pre-trained network. I’ll also try to cover some problems that occurred during development. I hope you enjoy the journey.

If you want to start diving into neural networks, you’ll discover that the most popular framework is Google TensorFlow. It has a huge community and plenty of open-source projects that help you build the solutions you need in a short time. Apache MXNet is currently an incubating project, but it has detailed documentation and is easy to use, with the ability to choose between imperative and symbolic programming styles, making it a great candidate for both beginners and experienced engineers.

Some pros of MXNet:

  • Support for multiple GPUs (with optimized computations and fast context switching)
  • Clean, easily maintainable code (Python, R, Scala and other APIs)
  • Fast problem-solving ability (vital for newcomers to deep learning, like me)

To get started using MXNet for research, check out the crash course.

Problem Definition

The task I’ll solve in this article is recognizing a pizza type from a photo. First, I’ll show you how to collect and prepare data for training, validation and testing, how to work with images, and how to select a model and fine-tune a new model on top of a pre-trained network by modifying the network architecture. Finally, I’ll show you how to test the resulting model.


Environment overview

As you understand, we are going to work with images, and that’s where the GPU comes in. We have NVIDIA graphics cards on a Linux machine; to get GPU info, use the command ‘nvidia-smi’.

Here is an output:
As you can see, we have two NVIDIA GTX 1080 cards.
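If you prefer to check the GPUs from Python, here is a minimal sketch that wraps ‘nvidia-smi’. The helper names are my own for this example, and the sketch assumes that ‘nvidia-smi’ supports the ‘--query-gpu’ and ‘--format’ options, as current NVIDIA drivers do. If the tool is not installed, it simply returns an empty list.

```python
import shutil
import subprocess

# Ask nvidia-smi for one CSV line per GPU: "<name>, <total memory>".
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"]


def parse_gpu_lines(text):
    """Turn 'name, memory' CSV lines into (name, memory) tuples."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, memory = line.partition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


def list_gpus():
    """Return the GPUs reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_lines(result.stdout)
```

Calling list_gpus() on a machine without an NVIDIA driver gives an empty list, which is a convenient signal to fall back to the CPU.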
You can also use your CPU, but keep in mind that training and validation on a CPU take much longer.
It’s very easy to switch between CPU and GPU in MXNet code; we will review this option later.
Also, we use Python 3 and MXNet 1.1.
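To give an idea of how the CPU/GPU switch looks, here is a small sketch. The function names are hypothetical; only ‘mx.cpu()’ and ‘mx.gpu(i)’ come from MXNet. MXNet is imported inside the function so that the pure helper can be tried without it.

```python
def device_specs(num_gpus):
    """Describe the devices to use as (kind, index) pairs; fall back to the CPU."""
    if num_gpus <= 0:
        return [("cpu", 0)]
    return [("gpu", i) for i in range(num_gpus)]


def make_contexts(num_gpus):
    """Build the list of MXNet contexts for the requested number of GPUs."""
    import mxnet as mx  # imported lazily; requires MXNet to be installed

    return [mx.gpu(index) if kind == "gpu" else mx.cpu()
            for kind, index in device_specs(num_gpus)]
```

With two cards, make_contexts(2) would give the list that is later passed to the module as its context; make_contexts(0) switches everything to the CPU.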


We collected a dataset using a Python crawler application that pulls images from the Google Images service based on the 10 pizza types we initially selected. We collected 10k unfiltered images of 299×299 size.

Data preparation

Data quality is crucial for any machine learning task, so let’s spend some time on making it really useful for the task. It’s also useful to clean and sort data manually (thanks to my colleagues), because one pizza type can look different depending on the restaurant or recipe. Make sure that images with the same label look similar. After the cleanup process we ended up with 2000 images in the data set, which means approximately 200 pizza images per type.
I have already divided the data into 3 folders: test, train and validation (train contains 60% of the original data set, validation 20% and test 20%). Each folder contains 10 sub-folders, one for each pizza type.
(test folder content)
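The 60/20/20 split can be sketched in a few lines. This is only an illustration of the idea, not the script we actually used: the function name, the fixed seed and the file names in the usage note are assumptions. In practice you would run it once per pizza type, so that every class keeps the same proportions.

```python
import random


def split_dataset(paths, train=0.6, val=0.2, seed=0):
    """Shuffle the paths and split them into train/validation/test lists.

    Whatever is left after the train and validation parts goes to test.
    """
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

For 200 images of one pizza type this gives 120 training, 40 validation and 40 test images.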
MXNet works with different data types and has mechanisms that iterate over your data, combine it into batches and apply various modifications (like cropping or mirroring images).
I’ll use ‘’ for image iteration. More details regarding this iterator workflow can be found here.

Before launching the iteration process I need to create two specific files, ‘.rec’ and ‘.lst’. The latter contains a list of all images in your data set (list item format: index label path), while the ‘.rec’ file contains a binary representation of the images.
MXNet provides the im2rec tool, which helps you create these files.

(after the script is invoked, im2rec creates ‘.lst’ and ‘.rec’ files)

The script recursively walks through the folders and lists all images in the ‘.lst’ file.
It also resizes images to 299×299 (a requirement of the Inception architecture), saves them at 90% quality, and writes a binary representation of each image into the ‘.rec’ file.
(im2rec output)
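Since the original screenshots are not reproduced here, the two im2rec invocations can be sketched roughly as follows. The location of ‘im2rec.py’, the prefix and the folder name are assumptions for this example; the flags ‘--list’, ‘--recursive’, ‘--resize’ and ‘--quality’ belong to the im2rec tool. Depending on the MXNet version, the boolean flags may need an explicit value (for example ‘--list=True’), so check ‘--help’ for your installation.

```python
import sys


def im2rec_commands(prefix, root, im2rec="im2rec.py", size=299, quality=90):
    """Build the two im2rec command lines: first the '.lst', then the '.rec'.

    `im2rec` is the path to the script shipped with MXNet (assumed here to
    be in the current directory).
    """
    # Step 1: walk `root` recursively and write <prefix>.lst
    list_cmd = [sys.executable, im2rec, "--list", "--recursive", prefix, root]
    # Step 2: read <prefix>.lst, resize and re-encode images into <prefix>.rec
    pack_cmd = [sys.executable, im2rec,
                "--resize", str(size), "--quality", str(quality),
                prefix, root]
    return list_cmd, pack_cmd
```

Each returned list can be passed to subprocess.run(cmd, check=True), for instance with the prefix ‘data-val’ and the validation folder as the root.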

Some additional parameters can also be useful: ‘--center-crop’ crops images to a centered square, and ‘--color’ is used for gray images or for forcing colors. A list of all parameters is here.

Don’t worry if your data set has not been divided into training and test sets. This can be done by specifying ‘--test-ratio’ and ‘--train-ratio’. In that case the im2rec tool will create two separate ‘.rec’ files for training and validation.

(a part of data-val.lst, data-val.rec is unreadable)

index_0 label_0 image_path_0
index_1 label_1 image_path_1
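To check a ‘.lst’ file before packing it, you can read it back with a few lines of Python. This sketch assumes the usual im2rec layout with one tab-separated line per image and a single numeric label; the function names are my own.

```python
from collections import Counter


def parse_lst_line(line):
    """Split one '.lst' line into (index, label, path)."""
    index, label, path = line.rstrip("\n").split("\t", 2)
    return int(index), float(label), path


def count_labels(lines):
    """Count how many images each label has; handy for spotting imbalance."""
    return Counter(parse_lst_line(line)[1] for line in lines if line.strip())
```

Passing an open file, as in count_labels(open("data-val.lst")), shows quickly whether each pizza type is represented by a similar number of images.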

Working with Images

It’s time to create iterators for reading the batches of images. Iterators divide all images into batches that the GPU will process; the batch size is limited by the available GPU memory. Two iterators are needed for training: a validation iterator and a train iterator. There are a couple of parameters that need clarification:

Shape – describes the feature vector of one image, the size of which is obtained with the formula

data_shape = channels*height*width

In our dataset, all images are in RGB color, so the number of channels is 3, one for each color.
Height and width are deduced empirically. You can resize your images and take a look at them to understand how they are transformed. All data should have the same shape.
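As a worked example for our 299×299 RGB images, the arithmetic looks like this. Note that the iterators take the shape as a (channels, height, width) tuple, while the formula above gives the number of values in one image.

```python
# Shape of one input image: 3 color channels, 299 x 299 pixels.
CHANNELS, HEIGHT, WIDTH = 3, 299, 299

data_shape = (CHANNELS, HEIGHT, WIDTH)   # what the iterator expects
num_values = CHANNELS * HEIGHT * WIDTH   # size of the feature vector

print(data_shape, num_values)  # (3, 299, 299) 268203
```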

Batch size – the number of images processed in one iteration. Try to find the batch size per GPU first, as it depends on GPU memory size. As mentioned above, I used 2 GPUs, so the batch size doubled. For example, with 128 I got “Out Of Memory”; my suggestion is to start with powers of two to find a suitable value.
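The search for a suitable value can be sketched as a simple halving loop. Here ‘fits’ is a hypothetical callback that you would implement yourself, for example by running one training step with the given batch size and returning False when MXNet reports “Out Of Memory”.

```python
def find_batch_size(fits, start=128, minimum=1):
    """Halve a power-of-two batch size until `fits(batch_size)` is True."""
    batch_size = start
    while batch_size >= minimum:
        if fits(batch_size):
            return batch_size
        batch_size //= 2  # try the next smaller power of two
    raise ValueError("no batch size between %d and %d fits" % (minimum, start))


def total_batch_size(per_gpu, num_gpus):
    """The iterator batch is split across GPUs, so scale it by their count."""
    return per_gpu * max(num_gpus, 1)
```

If, say, 32 images fit on one card, total_batch_size(32, 2) gives 64 for a two-GPU machine.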

If you forgot to shuffle or crop images with the im2rec tool, don’t worry: you can simply add the needed transformation flags, like ‘shuffle’ or ‘rand_crop’, to the iterator.
You can also apply augmentation with the ‘kwargs’ parameter. The list of available augmenters can be found here.

(source code of iterators)
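The original listing is not reproduced here, so below is a sketch of what such iterators might look like. I’m assuming ‘mx.io.ImageRecordIter’ as the iterator class and ‘data-train.rec’ as the training file name; ‘data-val.rec’ is the validation file mentioned earlier. The settings are collected in a plain dictionary first, so they are easy to inspect.

```python
def iterator_kwargs(rec_path, batch_size, data_shape=(3, 299, 299), training=True):
    """Collect the ImageRecordIter arguments for one data set."""
    kwargs = dict(
        path_imgrec=rec_path,        # the '.rec' file created by im2rec
        data_name="data",
        label_name="softmax_label",
        batch_size=batch_size,
        data_shape=data_shape,       # (channels, height, width)
    )
    if training:
        # Augmentation and shuffling only make sense for the training set.
        kwargs.update(shuffle=True, rand_crop=True, rand_mirror=True)
    return kwargs


def make_iterators(train_rec, val_rec, batch_size):
    """Create the train and validation iterators (requires MXNet)."""
    import mxnet as mx  # imported lazily

    train = mx.io.ImageRecordIter(**iterator_kwargs(train_rec, batch_size))
    val = mx.io.ImageRecordIter(
        **iterator_kwargs(val_rec, batch_size, training=False))
    return train, val
```

A call such as make_iterators("data-train.rec", "data-val.rec", 64) returns both iterators ready to be passed to training.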

Transfer learning

The pizza dataset used in this example is small. Training a network from scratch on a small dataset will not produce representative weights, so fine-tuning on top of a pre-trained model is a good option.
We chose the Inception architecture empirically, after achieving low accuracy with ResNet 256; the choice was based purely on hands-on experience.
Taking the pre-trained Inception model, we remove the last fully-connected layer; in other words, we copy everything prior to this layer. It is also important to define the activation function to be used. We use ReLU, which is the most widely used in similar tasks. We set dropout to 0.7 to prevent the neural network from overfitting. Then a new fully-connected layer sized to our classes and a softmax output layer are added. One important issue in transfer learning is copying the weights from the original pre-trained model.

(finetune method)
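A fine-tuning helper along these lines could look as follows. It is a sketch modeled on the common MXNet fine-tuning pattern, not a copy of our method: the layer names ‘flatten’ and ‘fc1’ are assumptions that depend on the pre-trained model you load, and the function names are my own.

```python
def keep_pretrained_weights(arg_params, replaced_layer="fc1"):
    """Copy every pre-trained weight except those of the replaced classifier.

    `replaced_layer` is the assumed name prefix of the old last layer.
    """
    return {name: value for name, value in arg_params.items()
            if not name.startswith(replaced_layer)}


def build_finetune_symbol(symbol, num_classes, layer_name="flatten", dropout=0.7):
    """Cut the network after `layer_name` and attach a new classifier head."""
    import mxnet as mx  # imported lazily; requires MXNet

    # Everything up to and including `layer_name` is kept as is.
    net = symbol.get_internals()[layer_name + "_output"]
    net = mx.sym.Activation(data=net, act_type="relu", name="relu_new")
    net = mx.sym.Dropout(data=net, p=dropout, name="dropout_new")
    # New fully-connected layer with one output per pizza type.
    net = mx.sym.FullyConnected(data=net, num_hidden=num_classes, name="fc1")
    return mx.sym.SoftmaxOutput(data=net, name="softmax")
```

The new symbol and the filtered weights are then passed to training together; the weights of the new layer are missing on purpose and get initialized from scratch.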

All these parameters and network configurations combined will be used in the training and validation of the new model.
(training and validation source code)

As mentioned earlier, choosing the MXNet context is rather simple. The number of GPUs is 2, so the ‘devs’ list contains 2 contexts. Let’s also define the essential metrics here: ‘accuracy’ and ‘cross-entropy’.
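To make the two metrics concrete, here is a plain-Python sketch of what they measure. MXNet computes them for you (they can be requested by the names ‘acc’ and ‘ce’), so this code is for illustration only; the numbers in the usage note are made up.

```python
import math


def accuracy(probabilities, labels):
    """Share of samples whose most probable class equals the true label."""
    correct = 0
    for probs, label in zip(probabilities, labels):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += predicted == label
    return correct / len(labels)


def cross_entropy(probabilities, labels, eps=1e-12):
    """Average negative log-probability assigned to the true class."""
    total = sum(math.log(probs[label] + eps)
                for probs, label in zip(probabilities, labels))
    return -total / len(labels)
```

For three samples with predictions [0.9, 0.1], [0.2, 0.8], [0.6, 0.4] and true labels 0, 1, 1, two of the three are classified correctly. Cross-entropy additionally penalizes the confident mistake on the last sample.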
The optimisation function used is SGD with a learning rate of 0.01. Be careful when changing the learning rate, as it controls how strongly the weights of your neurons are adjusted during the backward pass. To train and validate the new model, we use the fit method of the Module object. For the parameters we re-use the iterators created previously, as well as the pre-trained model settings, the optimiser function and the callbacks for the checkpoints. One training epoch produces one checkpoint. These callbacks can be overridden to suit your needs. The number of epochs is determined experimentally.
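Put together, the training call might look like the sketch below. The function names and the choice of the Xavier initializer for the new layer are my assumptions; ‘Module’, ‘fit’ and ‘do_checkpoint’ are regular MXNet APIs.

```python
def training_settings(learning_rate=0.01):
    """Settings described in the text: SGD, two metrics, new layer allowed."""
    return {
        "optimizer": "sgd",
        "optimizer_params": {"learning_rate": learning_rate},
        "eval_metric": ["acc", "ce"],   # accuracy and cross-entropy
        "allow_missing": True,          # the new layer has no pre-trained weights
    }


def train(symbol, arg_params, aux_params, train_iter, val_iter,
          contexts, prefix, num_epoch):
    """Fine-tune `symbol` and save one checkpoint per epoch (requires MXNet)."""
    import mxnet as mx  # imported lazily

    module = mx.mod.Module(symbol=symbol, context=contexts)
    module.fit(
        train_iter,
        eval_data=val_iter,
        num_epoch=num_epoch,
        arg_params=arg_params,
        aux_params=aux_params,
        initializer=mx.init.Xavier(),   # used only for the missing weights
        epoch_end_callback=mx.callback.do_checkpoint(prefix),
        **training_settings())
    return module
```

The ‘prefix’ argument is the file name prefix of the checkpoints, so a run with a hypothetical prefix ‘pizza’ leaves one parameter file per epoch next to the script.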
We achieved 93% classification accuracy. This is not a production-ready value, but this is a great starting point.

Testing model

Since the model is saved at each epoch, you can later check each checkpoint on the test data.
(Testing models for each checkpoint)
Create an iterator for the test data, then load the model for a given epoch. After that, use the Module’s bind method to bind the parameters used for executor construction (executors start the computation in parallel on different devices).
The Module’s score method computes the metrics, which can later be easily visualized.
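The checkpoint loop described above can be sketched like this. The helper names and the prefix in the usage note are hypothetical; ‘load_checkpoint’, ‘bind’, ‘set_params’ and ‘score’ are the MXNet calls mentioned in the text. The file naming follows what ‘do_checkpoint’ writes.

```python
def checkpoint_files(prefix, epoch):
    """Names of the files that hold the network and its weights for an epoch."""
    return "%s-symbol.json" % prefix, "%s-%04d.params" % (prefix, epoch)


def best_epoch(scores_by_epoch, metric="accuracy"):
    """Pick the epoch with the highest value of `metric`."""
    return max(scores_by_epoch, key=lambda epoch: scores_by_epoch[epoch][metric])


def score_checkpoints(prefix, epochs, test_iter, contexts):
    """Evaluate every saved epoch on the test iterator (requires MXNet)."""
    import mxnet as mx  # imported lazily

    results = {}
    for epoch in epochs:
        symbol, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
        module = mx.mod.Module(symbol=symbol, context=contexts)
        module.bind(for_training=False,
                    data_shapes=test_iter.provide_data,
                    label_shapes=test_iter.provide_label)
        module.set_params(arg_params, aux_params)
        # score returns (name, value) pairs, e.g. ('accuracy', ...)
        results[epoch] = dict(module.score(test_iter, ["acc", "ce"]))
    return results
```

For example, checkpoint_files("pizza", 3) gives ("pizza-symbol.json", "pizza-0003.params"), and best_epoch applied to the result of score_checkpoints tells you which checkpoint to keep.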


MXNet is a convenient framework for rapid prototyping and structuring of deep learning projects. It has a fast, configurable environment (context), various approaches to programming and rich functionality. People who previously used TensorFlow can easily jump into the MXNet framework and benefit from its versatile options.
Training our model took less time than training with TensorFlow, and the resulting model was smaller.
The drawback of MXNet is the relatively small community around it, so it can be challenging at times to get answers to the questions that arise. The framework is relatively young and a lot of features are constantly being added, so it is advisable to keep an eye on its development blog and Amazon Machine Learning topics.
Check out the source code on GitHub.