[TL;DR] TagMe annotates objects in videos using GPS data, without any human annotators, reducing annotation cost by up to 110x.
Training high-accuracy object detection models requires large and diverse annotated datasets. However, creating these datasets is time-consuming and expensive because it relies on human annotators. We design, implement, and evaluate TagMe, a new approach for automatic object annotation in videos that uses GPS data. When the GPS trace of an object is available, TagMe matches the object's motion derived from its GPS trace against the pixel motions in the video to identify the pixels belonging to the object and produce bounding box annotations. TagMe works via passive data collection and can continuously generate new object annotations from outdoor video streams without any human annotators. We evaluate TagMe on a dataset of 100 video clips and show that it produces high-quality object annotations in a fully automatic and low-cost way. Compared with the traditional human-in-the-loop solution, TagMe can produce the same amount of annotations at up to 110x lower cost.
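The core idea of matching GPS-derived motion against pixel motions can be illustrated with a minimal sketch. This is not TagMe's actual implementation: it assumes the GPS trace has already been projected into image-plane motion vectors, uses a simple cosine-similarity score between each pixel's motion time series and the object's motion, and takes the bounding box of pixels whose motion agrees with the trace. The function name, array shapes, and threshold are illustrative assumptions.

```python
import numpy as np

def match_motion(pixel_flow, gps_motion, threshold=0.9):
    """Find pixels whose motion matches the GPS-derived object motion.

    pixel_flow: (T, H, W, 2) per-pixel motion vectors (e.g. optical flow)
    gps_motion: (T, 2) object motion projected into the image plane
                (projection step assumed done elsewhere)
    Returns a bounding box (x_min, y_min, x_max, y_max), or None if
    no pixel's motion matches the trace.
    """
    T, H, W, _ = pixel_flow.shape
    # Flatten each pixel's motion time series into one vector: (H*W, T*2)
    px = pixel_flow.transpose(1, 2, 0, 3).reshape(H * W, T * 2)
    g = gps_motion.reshape(T * 2)
    # Cosine similarity between each pixel's motion series and the GPS series
    num = px @ g
    denom = np.linalg.norm(px, axis=1) * np.linalg.norm(g) + 1e-9
    sim = num / denom
    mask = (sim > threshold).reshape(H, W)
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

On a synthetic clip where a small block of pixels moves with the object's trace and the background stays still, the returned box covers exactly that block; a real system would additionally need GPS-to-image projection, optical flow estimation, and noise handling.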
Demo: We show a demo of TagMe below, including the input (videos and GPS traces) and the output bounding box annotations produced by TagMe. We use three objects: a person, a cyclist, and a car. The videos are captured from three different positions with different camera tilt angles.