Modern-day object detection for football
06/06/18 | SciSports
At SciSports and BallJames, we are interested in real-time localization and tracking of football players and the ball using video footage. Deep-learning-based approaches have dramatically improved object detection in video over the past few years, thanks to the availability of large datasets and the effectiveness of convolutional networks at learning image features that represent the objects in those images.
We use a detection-based approach to localize players on the pitch. Specifically, we use YOLO, an open-source object detection framework built on top of Darknet, a deep learning library by Joseph Redmon. In the rest of this blog, we will take you through some of the key features of this framework, as well as the domain-specific training scheme we came up with to customize YOLO for football.
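To give a flavour of how YOLO works: it frames detection as a single forward pass that predicts, for every cell of a grid laid over the image, box coordinates, an objectness score, and class probabilities. Below is a minimal sketch of decoding such a grid output, simplified to one box per cell (real YOLO versions predict multiple anchor boxes per cell); this is illustrative code, not the SciSports implementation:

```python
import numpy as np

def decode_yolo_grid(preds, grid_size, num_classes, conf_threshold=0.5):
    """Decode a simplified YOLO-style output grid into detections.

    preds: array of shape (S, S, 5 + C) where each cell holds
    (x, y, w, h, objectness) followed by C class scores, all in [0, 1].
    x, y are offsets within the cell; w, h are relative to the image.
    """
    detections = []
    S = grid_size
    for row in range(S):
        for col in range(S):
            cell = preds[row, col]
            objectness = cell[4]
            if objectness < conf_threshold:
                continue
            # Convert the in-cell offset to image-relative center coordinates
            cx = (col + cell[0]) / S
            cy = (row + cell[1]) / S
            w, h = cell[2], cell[3]
            cls = int(np.argmax(cell[5:]))
            score = objectness * cell[5 + cls]
            detections.append((cx, cy, w, h, score, cls))
    return detections
```

Because the whole image is processed in one pass rather than via thousands of region proposals, this formulation is what makes YOLO fast enough for real-time use.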
Why we like YOLO
One of the most important aspects of the BallJames system is the ability to detect and track the players and the ball in real time using deep learning. This enables applications in match tactics, live monitoring of player fitness, fan engagement in the media, and the betting industry. However, deep learning models are notorious for the sheer number of computations required to process every video frame. This adds up quickly for a tracking system that needs to process 25-50 video frames per second (FPS) from multiple high-resolution cameras.
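To make "adds up quickly" concrete, here is a back-of-the-envelope calculation of the raw pixel rate, using the camera figures quoted elsewhere in this post (purely illustrative):

```python
def pixel_throughput(num_cameras, fps, width, height):
    """Raw pixels per second a tracking system must ingest."""
    return num_cameras * fps * width * height

# 14 cameras, each at 25 FPS in 4K (3840x2160)
rate = pixel_throughput(14, 25, 3840, 2160)
print(rate)  # 2903040000 — roughly 2.9 billion pixels per second
```

Every one of those pixels has to pass through the network, which is why model depth and input resolution matter so much for real-time systems.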
YOLO was created with such real-time applications in mind. At a standard video resolution of 640×480 pixels, a 31-layer YOLO model can run at 25 FPS on a reasonable consumer GPU. In our case, 14 cameras each generate video at 3840×2160 pixels. To get YOLO running in real time on this amount of data, we rely on massive parallelization with an elastic GPU cluster in the cloud. But even with such parallelization, we need YOLO models with fewer than 31 layers to scale up the processing of the BallJames video streams.
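One generic way to feed 4K footage to a detector trained on smaller inputs is to split each frame into overlapping tiles and run the tiles in parallel across GPUs. The sketch below shows such a tiler; it is a common pattern, not a description of the actual BallJames pipeline:

```python
def make_tiles(frame_w, frame_h, tile_w=640, tile_h=480, overlap=64):
    """Compute (x, y, w, h) crop windows that cover a frame.

    Tiles overlap so that an object straddling a tile border is still
    seen whole by at least one tile. Assumes the frame is at least as
    large as one tile.
    """
    tiles = set()
    step_x = tile_w - overlap
    step_y = tile_h - overlap
    y = 0
    while y < frame_h:
        x = 0
        while x < frame_w:
            # Clamp edge tiles so they stay inside the frame
            tx = max(0, min(x, frame_w - tile_w))
            ty = max(0, min(y, frame_h - tile_h))
            tiles.add((tx, ty, tile_w, tile_h))
            x += step_x
        y += step_y
    return sorted(tiles)
```

A 3840×2160 frame decomposes into a few dozen 640×480 crops this way, each of which can be dispatched to a worker in the GPU cluster; detections then only need their coordinates offset back by the tile origin.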
BallJames - YOLO
We work with a tiny version of YOLO, custom-built for our object-detection tasks: a 15-layer model that runs at 60 FPS on 640×480 video. The interesting aspect of our YOLO model is how we have defined the semantic classes for detection. One of our best-performing models has five classes, allowing detection of the front and back of a player, of player occlusions, and of the ball. A final class covers people who are not part of the match, such as trainers, coaches, managers, and photographers. Such a semantic distinction between classes is made feasible by the quality of our labeled datasets. We believe that, with the amount of labeled data we keep acquiring every day, we will over time be able to detect even finer-grained semantic information directly through object detection.
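As an illustration of such a class scheme, here is how a five-class mapping and a filter for match-relevant detections might look. The label names are invented for this post to mirror the scheme described above; they are not the actual BallJames taxonomy:

```python
# Hypothetical class IDs mirroring the five-class scheme described above
CLASSES = {
    0: "player_front",
    1: "player_back",
    2: "player_occluded",
    3: "ball",
    4: "non_participant",  # trainers, coaches, managers, photographers
}

def on_pitch_detections(detections):
    """Keep only detections relevant to match analysis, dropping
    bystanders such as trainers and photographers."""
    return [d for d in detections
            if CLASSES[d["class_id"]] != "non_participant"]
```

Having the detector itself separate bystanders from players means the tracking stage never has to reason about people who will never touch the ball.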
In the coming posts, we will give more insight into our system, and how we keep pushing the boundaries of existing deep learning approaches to aid the football industry. Stay tuned!