Pizza type recognition using MXNet and TensorFlow

In the last few years, TensorFlow has become the industry standard for any task related to neural networks and deep learning. But are there any credible alternatives, or has Google attained a full monopoly, leaving nothing new to invent? Let’s find out!

Our team conducted a little investigation and chose Apache MXNet as the best competitor to compare with TensorFlow.

The main problem with TensorFlow is speed. So, let’s see what MXNet has to offer on that front:

  • Device placement: With MXNet, it’s easy to specify where each data structure should reside.
  • Multi-GPU training: MXNet makes it easy to scale computation with the number of available GPUs.
  • Automatic differentiation: MXNet automates the derivative calculations that once bogged down neural network research.
  • Optimized predefined layers: While you can code up your own layers in MXNet, the predefined layers are optimized for speed, outperforming competing libraries.

We were especially interested in the last point, which affects user experience in real-world applications. Switching frameworks in production is a very risky idea, so we decided to test this claim in battle conditions by building two models, one with TensorFlow/Keras and one with MXNet.

We chose pizza type recognition as the domain because it is complex enough to call for some advanced techniques and common enough that a dataset can be assembled in a short time. At least, so we thought. The domain turned out to be challenging, as even human beings can confuse different toppings.


This is what we had at the start:

  • A dataset of 5k images
  • Good hands-on experience with TensorFlow
  • A desire to get into MXNet, which everyone is talking about
  • 2 GPUs

Let’s go!

Algorithms and approaches:

InceptionV3 & ResNet: we tested both architectures for the best results and topped out at 56% accuracy for ResNet and at 94% (MXNet) / 82% (TensorFlow) for InceptionV3. As a result, we decided to take Inception to production.

MXNet:
import mxnet as mx
import logging
 
head = '%(asctime)-15s %(message)s'
logging.basicConfig(level=logging.DEBUG, format=head)
 
num_classes = 10
batch_per_gpu = 16
num_gpus = 2
batch_size = batch_per_gpu * num_gpus
 
 
def get_iterators(batch_size, data_shape=(3, 299, 299)):
    train = mx.io.ImageRecordIter(
        path_imgrec='./data-train.rec',
        data_name='data',
        label_name='softmax_label',
        batch_size=batch_size,
        data_shape=data_shape,
        shuffle=True,
        rand_crop=True,
        rand_mirror=True)
    val = mx.io.ImageRecordIter(
        path_imgrec='./data-val.rec',
        data_name='data',
        label_name='softmax_label',
        batch_size=batch_size,
        data_shape=data_shape,
        rand_crop=True,
        rand_mirror=True)
    return train, val
 
 
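def do_finetune(symbol, arg_params):
    # Cut the pre-trained network at the output of its 'flatten' layer,
    # dropping the original classifier.
    all_layers = symbol.get_internals()
    net = all_layers["flatten_output"]
    net = mx.symbol.Activation(data=net, name='relu1', act_type="relu")
    # mode='always' keeps dropout active at inference time as well as in training.
    net = mx.symbol.Dropout(data=net, p=0.7, name='dp', mode='always')
    # New output layer sized for our classes, followed by softmax.
    net = mx.symbol.FullyConnected(data=net, num_hidden=num_classes, name='fc1')
    net = mx.symbol.SoftmaxOutput(data=net, name='softmax')
    # Keep every pre-trained parameter except those belonging to 'fc1'.
    new_args = {k: arg_params[k] for k in arg_params if 'fc1' not in k}
    return net, new_args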
 
 
def fit(symbol, arg_params, aux_params, train, val):
    devs = [mx.gpu(i) for i in range(num_gpus)]
    mod = mx.mod.Module(symbol=symbol, context=devs)
    metrics = mx.metric.create(['ce', 'acc'])
    mod.fit(train, val,
            num_epoch=100,
            arg_params=arg_params,
            aux_params=aux_params,
            allow_missing=True,
            epoch_end_callback=mx.callback.do_checkpoint("Inception", 1),
            kvstore='device',
            optimizer='sgd',
            optimizer_params={'learning_rate': 0.01, 'wd': 0.0005, 'momentum': 0.9},
            initializer=mx.init.Xavier(rnd_type='gaussian', factor_type="in", magnitude=2),
            eval_metric=metrics,
            validation_metric=metrics)
 
 
# Load epoch 0 of the pre-trained 'Inception-BN' checkpoint.
sym, arg_params, aux_params = mx.model.load_checkpoint('Inception-BN', 0)
(train, val) = get_iterators(batch_size)
(new_sym, new_args) = do_finetune(sym, arg_params)
fit(new_sym, new_args, aux_params, train, val)

TensorFlow:
import argparse
import os
import os.path
 
from PIL import ImageFile
from keras import optimizers
from keras.applications.inception_v3 import InceptionV3
from keras.callbacks import ModelCheckpoint, EarlyStopping, LearningRateScheduler
from keras.layers import Dense, Dropout, Flatten, AveragePooling2D
from keras.models import Model
from keras.preprocessing.image import ImageDataGenerator
from keras.regularizers import l2
from keras_sequential_ascii import keras2ascii
 
ImageFile.LOAD_TRUNCATED_IMAGES = True
 
num_classes = 10
batch_size = 16
epochs = 100
 
 
def get_num_of_files(root_dir):
    total = 0
    for root, dirs, files in os.walk(root_dir):
        total += len(files)
    return total
 
 
def retrain(train_data_dir, validation_data_dir):
    img_width, img_height = 299, 299
 
    nb_train_samples = get_num_of_files(train_data_dir)
    nb_validation_samples = get_num_of_files(validation_data_dir)
 
    base_model = InceptionV3(weights='imagenet', include_top=False, input_shape=(img_width, img_height, 3))
 
    x = base_model.output
 
    # New classification head on top of the InceptionV3 feature maps (Keras 2 argument names).
    x = Dense(128, activation='relu', kernel_initializer='glorot_uniform', kernel_regularizer=l2(.0005))(x)
    x = Dropout(0.7)(x)
 
    x = AveragePooling2D(pool_size=(8, 8))(x)
    x = Dropout(.7)(x)
    x = Flatten()(x)
    predictions = Dense(num_classes, kernel_initializer='glorot_uniform', kernel_regularizer=l2(.0005), activation='softmax')(x)
 
    model = Model(inputs=base_model.input, outputs=predictions)
 
    print(keras2ascii(model))
 
    opt = optimizers.SGD(lr=.01, momentum=.9)
 
    def schedule(epoch):
        # Step decay: 0.01 for epochs 0-19, 0.001 for epochs 20-34, 0.0005 afterwards.
        if epoch < 20:
            return 0.01
        if epoch < 35:
            return 0.001
        return 0.0005
 
    lr_scheduler = LearningRateScheduler(schedule)
 
    model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
 
    train_datagen = ImageDataGenerator(
        rescale=1. / 255,
        shear_range=0.3,
        horizontal_flip=True,
        fill_mode="nearest",
        zoom_range=0.3,
        width_shift_range=0.3,
        height_shift_range=0.3,
        rotation_range=70
    )
 
    train_generator = train_datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_height, img_width),
        batch_size=batch_size,
        class_mode="categorical")
 
    test_datagen = ImageDataGenerator(
        rescale=1. / 255,
    )
 
    validation_generator = test_datagen.flow_from_directory(
        validation_data_dir,
        target_size=(img_height, img_width),
        batch_size=256,
        class_mode="categorical")
 
    checkpoint = ModelCheckpoint(filepath='model.{epoch:02d}-{val_loss:.2f}-{val_acc:.2f}.hdf5', monitor='val_acc',
                                 verbose=2, save_best_only=True,
                                 save_weights_only=False,
                                 mode='auto', period=1)
 
    early = EarlyStopping(monitor='val_acc', min_delta=0, patience=50, verbose=1, mode='auto')
 
    model.fit_generator(
        train_generator,
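        # Whole number of batches per epoch, rounded up.
        steps_per_epoch=(nb_train_samples + batch_size - 1) // batch_size,
        # The validation generator is built with batch_size=256.
        validation_steps=(nb_validation_samples + 255) // 256,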
        epochs=epochs,
        use_multiprocessing=True,
        validation_data=validation_generator,
        callbacks=[lr_scheduler, early, checkpoint])
 
 
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("train")
    parser.add_argument("test")
 
    args = parser.parse_args()
    retrain(args.train, args.test)
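
Two small pieces of arithmetic in the Keras listing are easy to get wrong: the number of generator steps per epoch and the step-wise learning rate schedule. The sketch below restates both in plain Python so they can be checked in isolation. `steps_for` is a hypothetical helper name, and the sample counts are illustrative values only, since the actual train/validation split is not stated here.

```python
import math


def steps_for(num_samples, batch_size):
    """Number of batches needed to see every sample once (hypothetical helper)."""
    return math.ceil(num_samples / batch_size)


def schedule(epoch):
    """Step decay as in the listing: 0.01, then 0.001, then 0.0005."""
    if epoch < 20:
        return 0.01
    if epoch < 35:
        return 0.001
    return 0.0005


# Illustrative sample counts only; the real split is an assumption.
train_steps = steps_for(4000, 16)     # training generator uses batch_size=16
val_steps = steps_for(1000, 256)      # validation generator uses batch_size=256
print(train_steps, val_steps)
print([schedule(e) for e in (0, 19, 20, 34, 35, 99)])
```

Note that the validation step count has to be derived from the validation generator's own batch size, which differs from the training batch size in the listing.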

Transfer learning (layer freezing)
Most existing deep learning algorithms for computer vision require huge datasets for training. The most popular way around this problem is to first pre-train a deep network on a large-scale dataset and then, given a new dataset, start from these pre-trained weights when training on the new task. There are many variations of transfer learning. With layer freezing, the initial neural network is used only as a feature extractor: we freeze every layer prior to the output layer and simply learn a new output layer.
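
To make the idea of layer freezing concrete, here is a toy sketch in plain Python rather than MXNet or Keras. A fixed "pre-trained" layer turns inputs into features and is never updated, while a small logistic output layer is trained on top of it. Every name (`FROZEN_W`, `extract_features`, `predict`, `log_loss`) and the data itself are invented for illustration; this is not the training code from this post.

```python
import math
import random

random.seed(0)

# "Pre-trained" feature extractor: these weights are frozen (never updated).
FROZEN_W = [[0.9, -0.4], [-0.3, 0.8], [0.5, 0.5]]


def extract_features(x):
    """Frozen layer: a fixed linear map followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]


def predict(head_w, head_b, feats):
    """New output layer: logistic regression on top of the frozen features."""
    z = sum(w * f for w, f in zip(head_w, feats)) + head_b
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(head_w, head_b, data):
    """Mean binary cross-entropy of the head over the dataset."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(head_w, head_b, extract_features(x))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


# Toy two-class data: the label is 1 when the first coordinate exceeds the second.
data = []
for _ in range(200):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    data.append((x, 1 if x[0] > x[1] else 0))

head_w, head_b = [0.0, 0.0, 0.0], 0.0
frozen_before = [row[:] for row in FROZEN_W]
loss_before = log_loss(head_w, head_b, data)

# Plain SGD that updates only the head; FROZEN_W is never touched.
lr = 0.5
for _ in range(50):
    for x, y in data:
        feats = extract_features(x)
        err = predict(head_w, head_b, feats) - y   # gradient of the loss w.r.t. z
        head_w = [w - lr * err * f for w, f in zip(head_w, feats)]
        head_b -= lr * err

loss_after = log_loss(head_w, head_b, data)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

In a real framework the same effect is obtained by excluding the pre-trained parameters from the optimizer's updates instead of hard-coding them.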

MXNet:
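def do_finetune(symbol, arg_params):
    # Cut the pre-trained network at the output of its 'flatten' layer,
    # dropping the original classifier.
    all_layers = symbol.get_internals()
    net = all_layers["flatten_output"]
    net = mx.symbol.Activation(data=net, name='relu1', act_type="relu")
    # mode='always' keeps dropout active at inference time as well as in training.
    net = mx.symbol.Dropout(data=net, p=0.7, name='dp', mode='always')
    # New output layer sized for our classes, followed by softmax.
    net = mx.symbol.FullyConnected(data=net, num_hidden=num_classes, name='fc1')
    net = mx.symbol.SoftmaxOutput(data=net, name='softmax')
    # Keep every pre-trained parameter except those belonging to 'fc1'.
    new_args = {k: arg_params[k] for k in arg_params if 'fc1' not in k}
    return net, new_args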

TensorFlow:
...
    base_model = InceptionV3(weights='imagenet', include_top=False, input_shape=(img_width, img_height, 3))
 
    x = base_model.output
 
    # New classification head on top of the InceptionV3 feature maps (Keras 2 argument names).
    x = Dense(128, activation='relu', kernel_initializer='glorot_uniform', kernel_regularizer=l2(.0005))(x)
    x = Dropout(0.7)(x)
 
    x = AveragePooling2D(pool_size=(8, 8))(x)
    x = Dropout(.7)(x)
    x = Flatten()(x)
    predictions = Dense(num_classes, kernel_initializer='glorot_uniform', kernel_regularizer=l2(.0005), activation='softmax')(x)
 
    model = Model(inputs=base_model.input, outputs=predictions)
 
 
    ...
    model.fit_generator(
        train_generator,
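        # Whole number of batches per epoch, rounded up.
        steps_per_epoch=(nb_train_samples + batch_size - 1) // batch_size,
        # The validation generator is built with batch_size=256.
        validation_steps=(nb_validation_samples + 255) // 256,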
        epochs=epochs,
        use_multiprocessing=True,
        validation_data=validation_generator,
        callbacks=[lr_scheduler, early, checkpoint])

Hyperparameters tuning
The same machine learning model might require different constraints, weights, or learning rates for different data patterns. These settings are called hyperparameters, and they usually have to be tuned for the model to solve the machine learning problem well. Hyperparameter optimization searches for a tuple of hyperparameters that yields an optimal model, that is, one that minimizes a predefined loss function on given independent data.
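
As a minimal sketch of what such a search can look like, the snippet below runs an exhaustive grid search over two hyperparameters. Grid search is only one simple strategy and is used here for illustration, not as a description of how the values in this post were found. `validation_loss` is a hypothetical stand-in for "train the model with these settings and return its loss on held-out data", and the toy loss surface is arbitrarily centred on the learning rate and weight decay that appear in the MXNet listing.

```python
import itertools


def validation_loss(learning_rate, weight_decay):
    """Hypothetical stand-in for training a model and measuring validation loss.

    A real version would train the network with these settings; this toy
    surface simply has its minimum at learning_rate=0.01, weight_decay=0.0005.
    """
    return (learning_rate - 0.01) ** 2 + (weight_decay - 0.0005) ** 2


learning_rates = [0.1, 0.01, 0.001]
weight_decays = [0.005, 0.0005, 0.00005]

# Exhaustive grid search: evaluate every combination and keep the best one.
best_params = min(
    itertools.product(learning_rates, weight_decays),
    key=lambda params: validation_loss(*params),
)
print(best_params)
```

The key point is that the loss being minimized is measured on data the model was not trained on; otherwise the search simply rewards overfitting.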

MXNet:
def fit(symbol, arg_params, aux_params, train, val):
    devs = [mx.gpu(i) for i in range(num_gpus)]
    mod = mx.mod.Module(symbol=symbol, context=devs)
    metrics = mx.metric.create(['ce', 'acc'])
    mod.fit(train, val,
            num_epoch=100,
            arg_params=arg_params,
            aux_params=aux_params,
            allow_missing=True,
            epoch_end_callback=mx.callback.do_checkpoint("Inception", 1),
            kvstore='device',
            optimizer='sgd',
            optimizer_params={'learning_rate': 0.01, 'wd': 0.0005, 'momentum': 0.9},
            initializer=mx.init.Xavier(rnd_type='gaussian', factor_type="in", magnitude=2),
            eval_metric=metrics,
            validation_metric=metrics)

TensorFlow:
    train_datagen = ImageDataGenerator(
        rescale=1. / 255,
        shear_range=0.3,
        horizontal_flip=True,
        fill_mode="nearest",
        zoom_range=0.3,
        width_shift_range=0.3,
        height_shift_range=0.3,
        rotation_range=70
    )
 
    train_generator = train_datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_height, img_width),
        batch_size=batch_size,
        class_mode="categorical")
 
    test_datagen = ImageDataGenerator(
        rescale=1. / 255,
    )
 
    validation_generator = test_datagen.flow_from_directory(
        validation_data_dir,
        target_size=(img_height, img_width),
        batch_size=256,
        class_mode="categorical")
 
    checkpoint = ModelCheckpoint(filepath='model.{epoch:02d}-{val_loss:.2f}-{val_acc:.2f}.hdf5', monitor='val_acc',
                                 verbose=2, save_best_only=True,
                                 save_weights_only=False,
                                 mode='auto', period=1)
 
    early = EarlyStopping(monitor='val_acc', min_delta=0, patience=50, verbose=1, mode='auto')
 
    model.fit_generator(
        train_generator,
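        # Whole number of batches per epoch, rounded up.
        steps_per_epoch=(nb_train_samples + batch_size - 1) // batch_size,
        # The validation generator is built with batch_size=256.
        validation_steps=(nb_validation_samples + 255) // 256,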
        epochs=epochs,
        use_multiprocessing=True,
        validation_data=validation_generator,
        callbacks=[lr_scheduler, early, checkpoint])

Results comparison

Once we had trained the networks and tuned the hyperparameters, we got the following metrics:

Training & validation metrics:

[Figures for MXNet and TensorFlow: training and validation accuracy and loss]

We then tested both networks against a test dataset that was part of neither the training nor the validation set. Let’s take a look at how the resulting networks actually perform:

Test metrics:

[Figures for MXNet and TensorFlow: test accuracy and loss]
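
The accuracy and loss values are standard classification metrics: the MXNet listing requests them as 'acc' and 'ce', and the Keras listing as accuracy and categorical cross-entropy. As a rough sketch of what they measure, assuming the model outputs one softmax probability vector per image, they can be computed as follows. The probabilities and labels are made-up numbers, not our results.

```python
import math


def accuracy(probabilities, labels):
    """Fraction of samples whose highest-probability class equals the label."""
    correct = sum(
        1 for probs, label in zip(probabilities, labels)
        if max(range(len(probs)), key=probs.__getitem__) == label
    )
    return correct / len(labels)


def cross_entropy(probabilities, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    return -sum(
        math.log(probs[label] + eps) for probs, label in zip(probabilities, labels)
    ) / len(labels)


# Made-up softmax outputs for four images and three classes.
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
]
labels = [0, 1, 2, 1]   # the last image is misclassified

print(accuracy(probabilities, labels))
print(round(cross_entropy(probabilities, labels), 3))
```

Accuracy only counts whether the top class is right, while cross-entropy also penalizes low confidence in the correct class, which is why the two curves can move differently.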

Performance comparison

And here are some performance metrics for the trained networks:

Cold cache:
MXNet: 0.637 s, TensorFlow: 34 s.
We have a definite winner here: MXNet shows really good performance.
[Figure: cold-cache profile]

Hot cache:
MXNet: 0.405 s, TensorFlow: 0.37 s.
This one is a tie.
[Figure: hot-cache profile]
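
How the cold and hot timings were taken is not shown here. One common reading, assumed in the sketch below, is that "cold" is the first prediction in a fresh process (including one-off loading and initialization) and "hot" is a later prediction on the already initialized model. `LazyModel` and `timed` are hypothetical names, and the loading cost is simulated with a sleep rather than a real framework.

```python
import time


class LazyModel:
    """Hypothetical stand-in for a model that loads its weights on first use."""

    def __init__(self, load_seconds=0.2):
        self._load_seconds = load_seconds
        self._loaded = False

    def predict(self, image):
        if not self._loaded:
            time.sleep(self._load_seconds)   # simulates the one-off loading cost
            self._loaded = True
        return sum(image) % 10               # dummy "class id"


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


model = LazyModel()
image = list(range(299))                     # dummy input

_, cold = timed(model.predict, image)        # first call pays the loading cost
_, hot = timed(model.predict, image)         # later calls reuse the loaded state
print(f"cold: {cold:.3f}s, hot: {hot:.3f}s")
```

Measuring the two cases separately matters because a single averaged number would hide exactly the start-up difference that shows up in the cold-cache results.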

Conclusions

As for the last question: did we take MXNet to production? Well… our management team liked the demo and the mobile app. So yes, we did.

Is there any significant difference between the two frameworks? This is a much harder question.

  • The MXNet-trained model performs much better on a cold run. This becomes crucial on mobile devices, where applications need to boot up every time.
  • The MXNet-trained model has slightly better accuracy. This makes a big difference for enterprises and robotic CV systems.
  • TensorFlow has an established community with lots of ready solutions and relevant tutorials, while the MXNet documentation was outdated.
  • TensorFlow has become synonymous with machine learning in today’s world, so it is much easier to sell.

In my opinion, we are seeing good competition between a techy, ambitious challenger and an established champion.
I would take the challenger’s side this time 🙂

Once we had the actual working models, we had to solve one more problem: how to present the results to management in the most suitable way. So we asked our React Native developer to spare one day of his time and create a nice cross-platform mobile app for both Android and iOS.
Mobile app
Mobile app scripts on GitHub
Training scripts on GitHub