Using AR and Cricket Memorabilia to Recreate Moments in Australian Sporting History – Part 2

Mitch McCaffrey

What if you could take a physical set of cricket stumps lying around in your backyard and recreate a moment of sporting history?

With augmented reality, this is possible, and it is something I recently explored as part of a prototype I developed for an AR cricket experience.

In part 1 of this two-part series, we looked at how I attempted to use Apple’s ARKit to localise a set of cricket stumps. In this post, we’ll explore other methods I tested and how I was able to successfully detect the stumps, integrate my models in ARKit, and create an AR cricket experience prototype.

Exploring Other Localisation Methods for Cricket Stump Detection

After it was clear that I needed another method for localising the cricket stumps, I decided to investigate the field of deep learning based object detection.

A recent advancement in deep learning object detectors comes in the form of single network object detectors. Unlike their predecessors, these object detectors run a single neural network to find objects in an image. This means they are fast (fast enough to run on mobile) while still remaining accurate.

The two single network detectors we’ll be looking at in this post are the Single Shot Multibox Detector (SSD) and the You Only Look Once (YOLO) detector.

Aside: Non-max suppression

There are many techniques that go into making deep learning object detectors successful. One of these is non-max suppression. Because these detectors return thousands of candidate predictions, only a few of which are the ones you actually want, techniques like non-max suppression are essential for getting usable results. Non-max suppression works by selecting the bounding box that the neural network considers most likely to be correct, then removing the other predictions that overlap this high-probability box (see below). The amount of overlap above which a prediction is removed in favour of a higher-confidence prediction is called the non-max suppression threshold. We’ll look at tweaking this value for better results with the stumps later in this post.

Before (left) and after (right) removing low confidence predictions and applying non-max suppression. Source: YOLO.
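
To make the idea concrete, here is a minimal sketch of greedy non-max suppression in plain Python. It is not the code used by either detector in this post; the box format, the function names and the default thresholds are assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: List[Box], scores: List[float],
                        iou_threshold: float = 0.5,
                        score_threshold: float = 0.25) -> List[int]:
    """Return the indices of the boxes to keep, most confident first."""
    # Drop low confidence predictions, then visit the rest from most to least confident.
    order = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if its overlap with every already kept box is within the threshold.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: a near duplicate, a low confidence box and a slight overlap.
boxes = [(10, 10, 110, 210), (15, 12, 112, 208), (300, 40, 380, 200), (90, 10, 190, 210)]
scores = [0.92, 0.80, 0.10, 0.60]
print(non_max_suppression(boxes, scores, iou_threshold=0.5))    # [0, 3]
print(non_max_suppression(boxes, scores, iou_threshold=0.001))  # [0]
```

With a threshold of 0.5, the near duplicate is removed but the slightly overlapping box survives. With a very small threshold, any overlap at all is enough to remove the lower-confidence box.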

SSD with Tensorflow

To investigate the SSD, I went with the Tensorflow open source implementation as it offered many different configuration options and, along with the tf-coreml utility, could be converted to Core ML to deploy to iOS devices.

After getting a model running on a device (see below), there were a few workflow issues that I noticed. The first major one was tf-coreml, which doesn’t currently support the full SSD network. This manifested itself mostly in the post-processing nodes that the Tensorflow implementation used for things like non-max suppression and anchor boxes (a technique used to help the detection of overlapped objects).

This meant a few things needed to be implemented in Swift or Objective-C before the results of the network could be used. Luckily, this GitHub project offers an implementation of these to use with a Tensorflow model, but it’s not the cleanest approach, so a more native solution would be preferred.

SSD network trained with Tensorflow running on an iPad Pro.

A few niceties do exist with Tensorflow, mainly in the training workflow, which allows you to create checkpoints so you can stop training early without losing progress. Also, with Tensorboard, you can visualise the training process (see below).

Training results can be visualised when using Tensorflow with Tensorboard.

Ultimately, due to the workflow issues, I wanted to explore other options for the object detection network.

Aside: A brief investigation of object detection labelling tools

A big part of data-driven methods for computer vision, like neural networks, is the ability to collect data. For object detection, you must have a labelled dataset of objects and their bounds in each image. As both techniques explored in this series make use of transfer learning (which allows you to use a pre-trained network to bootstrap the training process), a dataset of only a few hundred labelled objects is needed.

With this in mind, I opted to do the labelling locally with a few hundred images of my test stumps, which allowed me to prototype my idea.

To do the labelling, I looked at four tools:

  1. Labelimg – An open source Python program that allows you to draw bounding boxes in images and export them to XML.
  2. VoTT – An open source Electron app created by Microsoft for labelling both images and videos. It supports both XML and exporting directly to the Tensorflow TFRecord format.
  3. LabelMe – A web-based image labeller created by MIT that allows multiple people to add to a single labelled dataset.
  4. RectLabel – A native macOS utility for image labelling that supports exporting to XML.

Both Labelimg and VoTT currently support drawing only bounding boxes or rectangles. As we’ll see later in this post, I needed to label arbitrary polygons, so these didn’t suit my purpose. LabelMe allows you to create any shape you want; however, it is meant more for research, and its licence reflects this. This left RectLabel (see below), a neat little tool that happened to tick all of my boxes, so this is what I used for this project.

RectLabel used to label the exact bounds of the stumps.

YOLO with Turi Create

Similar to Tensorflow, Turi Create is a Python library for training machine learning models. I found that it had a few benefits over Tensorflow if your target platform is iOS since it was created by Apple.

Unlike Tensorflow, Turi Create has a single object detection option in the form of the YOLO architecture. This means (for better or worse) the choice of what network to use is made for you; for a prototype like this, I found little difference in the end result.

After training a model and deploying to iOS, here are some pros and cons I discovered for Turi Create compared to Tensorflow.

Turi Create Pros:

  • Allows exporting to a Core ML model natively, meaning no conversion tool is necessary.
  • Has some nice interaction with iOS’s Vision Framework, making it easier to run the models on iOS (e.g. in iOS 12+ this means no need to implement non-max suppression).
  • Extremely simple Python API allows you to start training in a few lines of code.

Turi Create Cons:

  • No real way to visually track the progress of training (just a simple loss metric printed to the console).
  • No checkpoints means training can’t be interrupted without loss of progress.

Below, you can see the results of the YOLO model running on an iPad Pro.

YOLO network trained with Turi Create running on an iPad Pro.

For a prototype like this where I was only deploying to iOS, the Turi Create to iOS workflow worked well, allowing me to iterate on ideas a little quicker than when using Tensorflow.

Aside: Optimising the Turi Create export for a single object detector

When exporting from Turi Create, there is an option called iou_threshold. This controls the amount of overlap (intersection over union) that non-max suppression uses as its threshold for removing overlapping detections. As I had only one object class to predict, I set this to a very small value to eliminate any overlapping predictions. The result in the image above uses an iou_threshold of 0.001, which led to fewer false positives and, ultimately, better results.

Landmark Detection with Keras

So far, we’ve explored two options for detecting an object in a 2D reference frame. Next, I needed to think about how to use this in ARKit.

The solution I came up with required taking the object detection results from the Turi Create model and creating a tight 2D mesh around the stumps. This tight mesh then allowed me to use hit testing to project into the AR world and find both the centre of the stumps and their rotation (see below).

After the exact bounds of the stumps were detected, I performed a hit test to project them into the AR world. These world points were then used to find the position and rotation of the stumps.
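
As a rough illustration of that last step, the sketch below estimates a centre and a rotation from four world points. It is not the prototype's code: it assumes each corner's hit test has already produced a 3D point in metres, that y is up (as in ARKit), that the corners arrive in the order top-left, top-right, bottom-right, bottom-left, and that only the rotation about the vertical axis matters. The stump_pose function is a hypothetical name.

```python
import math
from typing import List, Tuple

Point3 = Tuple[float, float, float]  # (x, y, z) in metres, y is up (assumed)


def stump_pose(corners: List[Point3]) -> Tuple[Point3, float]:
    """Estimate the centre and the yaw (radians) of the stumps.

    Assumed corner order: top-left, top-right, bottom-right, bottom-left.
    """
    n = len(corners)
    # The centre is the mean of the four world points.
    centre = tuple(sum(p[i] for p in corners) / n for i in range(3))
    tl, tr, br, bl = corners
    # Average the top and bottom edges to get the left-to-right direction.
    dx = ((tr[0] - tl[0]) + (br[0] - bl[0])) / 2
    dz = ((tr[2] - tl[2]) + (br[2] - bl[2])) / 2
    # Rotating the x axis by an angle t about +y gives (cos t, 0, -sin t),
    # so the yaw is recovered with atan2(-dz, dx).
    yaw = math.atan2(-dz, dx)
    return centre, yaw


# Hypothetical world points for stumps turned a quarter turn about the vertical axis.
corners = [(0.0, 0.7, 0.1), (0.0, 0.7, -0.1), (0.0, 0.0, -0.1), (0.0, 0.0, 0.1)]
centre, yaw = stump_pose(corners)
print(centre, math.degrees(yaw))
```

Averaging the top and bottom edges is one simple way to reduce the effect of a single noisy hit test result; a more robust fit would be needed if the points were unreliable.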

To accomplish this, I created a simple convolutional neural network with Keras, a Python API for creating neural networks. Keras works as an interface for other deep learning libraries. For this, I used a Tensorflow backend because it enabled me to use coremltools to convert the model to Core ML.

As I wanted to detect the exact bounds of the stumps, I decided to borrow the architecture from the field of face landmark detection. With this in mind, I based the model on this implementation of a landmark detector, which makes use of convolutional layers with both max pooling and dropout to reduce overfitting.

The model used for the stump landmark detection. The input is a 96×96 RGB image containing a single set of cricket stumps. The output is the coordinates of the stumps’ corners.

After exporting the model for iOS with coremltools, I integrated the landmark detector with the YOLO detector from Turi Create. To do this, I took the output of the YOLO detector and cropped the image to the detected bounds. This cropped image was then passed to the landmark detector to get the four corners of the stumps.

The output of the YOLO object detector (red) was used to find the landmarks (blue) with the custom Keras model.
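
The hand-off between the two models comes down to some coordinate bookkeeping, sketched below in plain Python. This is an illustration under assumptions rather than the app's code: it assumes the detector returns a box normalised to the range 0 to 1 with a top-left origin, and that the landmark model returns corner coordinates normalised to the cropped image. A framework that uses a different origin would need a flip of the y axis first. Both function names are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)
Point = Tuple[float, float]


def to_pixel_rect(norm_box: Rect, image_w: int, image_h: int) -> Rect:
    """Convert a normalised box (0..1, top-left origin) to pixels, clamped to the image."""
    x, y, w, h = norm_box
    x0 = max(0.0, x * image_w)
    y0 = max(0.0, y * image_h)
    x1 = min(float(image_w), (x + w) * image_w)
    y1 = min(float(image_h), (y + h) * image_h)
    return (x0, y0, x1 - x0, y1 - y0)


def landmarks_to_image(landmarks: List[Point], crop: Rect) -> List[Point]:
    """Map landmarks normalised to the crop (0..1) back to full image pixels."""
    cx, cy, cw, ch = crop
    return [(cx + u * cw, cy + v * ch) for u, v in landmarks]


# Hypothetical detection on a 1920x1080 frame.
crop = to_pixel_rect((0.25, 0.5, 0.125, 0.25), 1920, 1080)
print(crop)  # (480.0, 540.0, 240.0, 270.0)

# Hypothetical landmark output: the four corners of the crop and its midpoint.
points = landmarks_to_image([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)], crop)
print(points)
```

Because the landmarks are expressed relative to the crop, the crop can be resized to the 96×96 input of the landmark model without affecting how the results are mapped back to the camera image.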

The Final AR Cricket Experience Prototype

After successfully detecting the stump bounds with the Turi Create object detector and the stump landmarks with the Keras model, the last thing I needed to do was integrate my models into ARKit. I did this by taking the camera feed from ARKit and running Vision requests with my models. When the user starts the experience, a hit test is performed, which in turn creates an AR anchor to use as the origin of the experience.

Here’s a look at the final result:

Once the stumps are recognised, the user can tap the screen and a cricket player will walk up to the stumps and perform a batting routine.

Conclusion and Looking Forward

Hopefully, this post has been informative if you’re looking to use object detection for AR experiences. Looking to the future, it would be interesting to see whether the two models I used for object and landmark detection could be merged into a single network. This might help with efficiency, as the detection models currently run at under 30 fps on a 2017 iPad Pro. This doesn’t impact the experience at the moment because the tracking is still performed by ARKit, but it would be interesting to test.

If you would like to know more about the concept or the motivation behind creating this experience, get in touch: