About the Event
It’s an exciting time in computer vision. We’re rapidly making progress on fundamental problems such as object recognition and human pose estimation. When I started my Ph.D. in 2007, the best object detection system could achieve a mean average precision (mAP) of only 21% on our standard benchmark dataset (PASCAL VOC 2007). In this talk I’ll describe two systems, one developed during my Ph.D. and the other during my postdoc, that have more than doubled object detection performance (to 54% mAP) over the last seven years.
The first system, Deformable Part Models (or DPM), is based on an elegant framework in which object categories are represented by a type of context-free grammar. These grammars allow object detectors to be specified recursively in terms of parts and subparts. Grammars can also naturally model object classes with variable structure and distinct subclasses. I will describe how we systematically improved object detection performance by increasing the structural sophistication of our detectors within this framework.
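To make the recursive idea concrete, here is a minimal sketch, not the actual DPM implementation. It assumes a one-dimensional image, a precomputed appearance score per position, and a quadratic deformation penalty. The `Part` class, its fields, and the `score` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """A symbol in a toy object grammar (hypothetical structure)."""
    name: str
    appearance: dict           # position -> appearance score (assumed precomputed)
    offset: int = 0            # ideal displacement from the parent's position
    deform_cost: float = 0.0   # penalty weight for straying from the ideal spot
    subparts: list = field(default_factory=list)


def score(part, anchor, positions):
    """Best score for placing `part` relative to `anchor`, recursing into subparts."""
    ideal = anchor + part.offset
    best = float("-inf")
    for pos in positions:
        # Appearance evidence minus a quadratic deformation penalty.
        s = part.appearance.get(pos, 0.0) - part.deform_cost * (pos - ideal) ** 2
        # Each subpart is placed independently, anchored at this part's position.
        s += sum(score(sub, pos, positions) for sub in part.subparts)
        best = max(best, s)
    return best


# Toy example: a "body" with one "wheel" that prefers to sit one step to its left.
wheel = Part("wheel", appearance={1: 2.0}, offset=-1, deform_cost=0.5)
body = Part("body", appearance={2: 3.0}, subparts=[wheel])
best_score = score(body, anchor=0, positions=range(5))
```

In this toy setup the root part has no deformation cost, so maximizing over its positions plays the role of scanning the detector across the image. Adding subparts to `wheel` would deepen the recursion without changing `score`.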
In the second part of the talk, I will describe a new approach to object detection that is already achieving remarkable results. This approach, Regions with Convolutional Neural Network Features (or R-CNN), applies a large convolutional neural network to image regions generated by a bottom-up segmentation algorithm. The key insight behind this work is that one can train a ConvNet on a large-scale image classification dataset (ImageNet) and then transfer the learned representation to the problem of object detection, where we are typically short on labeled training data.
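The pipeline described above can be sketched in a few lines, under heavy simplifying assumptions. Random boxes stand in for the bottom-up segmentation proposals, a two-number summary of each crop stands in for the features of a ConvNet pretrained on ImageNet, and a fixed linear scorer stands in for the detector trained on the smaller detection dataset. Every function name and parameter below is hypothetical, and none of this is the R-CNN code itself.

```python
import random


def propose_regions(width, height, n=5, seed=0):
    """Stand-in for region proposals: n random non-empty boxes (x0, y0, x1, y1)."""
    rng = random.Random(seed)
    boxes = []
    for _ in range(n):
        x0 = rng.randrange(0, width - 1)
        y0 = rng.randrange(0, height - 1)
        x1 = rng.randrange(x0 + 1, width + 1)
        y1 = rng.randrange(y0 + 1, height + 1)
        boxes.append((x0, y0, x1, y1))
    return boxes


def crop(image, box):
    """Cut the box out of an image stored as a list of pixel rows."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def extract_features(patch):
    """Stand-in for pretrained ConvNet features: mean and max intensity."""
    pixels = [p for row in patch for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def detect(image, boxes, weights, bias):
    """Score each proposed region with a linear model; highest score first."""
    detections = []
    for box in boxes:
        feats = extract_features(crop(image, box))
        s = sum(w * f for w, f in zip(weights, feats)) + bias
        detections.append((box, s))
    return sorted(detections, key=lambda d: d[1], reverse=True)


# Toy 8x8 image with a bright 3x3 "object" in the middle.
image = [[1.0 if 3 <= x <= 5 and 3 <= y <= 5 else 0.0 for x in range(8)]
         for y in range(8)]
boxes = propose_regions(8, 8)
detections = detect(image, boxes, weights=[1.0, 0.5], bias=-0.5)
```

The split mirrors the transfer idea: `extract_features` stays fixed, as if learned elsewhere on a large classification dataset, and only the small scorer in `detect` would need to be fit on the scarce detection labels.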