We would like to complement the MIT Technology Review headline "Google Unveils Neural Network with 'Superhuman' Ability to Determine the Location of Almost Any Image" with information about NNN (Non Neural Network) approaches with similar properties.
This blogpost provides a comparison between the DVEM (Distinctive Visual Element Matching) approach, introduced by our recent arXiv manuscript (currently under review):
Xinchao Li, Martha A. Larson, and Alan Hanjalic. Geo-distinctive Visual Element Matching for Location Estimation of Images. arXiv:1601.07884, submitted 28 Jan 2016. http://arxiv.org/abs/1601.07884
and the PlaNet approach, introduced by the arXiv manuscript covered in the MIT Technology Review article:
Tobias Weyand, Ilya Kostrikov, and James Philbin. PlaNet—Photo Geolocation with Convolutional Neural Networks. arXiv:1602.05314, submitted 17 Feb 2016. http://arxiv.org/abs/1602.05314
We also include, at the end, a bit of history on the problem of automatically "determining the location of images", which is also known as geo-location prediction, geo-location estimation as in [3], or, colloquially, "placing" after [4].
Our DVEM approach is a search-based approach to predicting the geo-location of an image. Search-based approaches treat the target image (the image whose geo-coordinates are to be predicted) as a query. They then carry out content-based image search (i.e., query-by-image) on a large training set of images labeled with geo-coordinates (referred to as the "background collection"). Finally, they process the search results in order to predict the geo-coordinates of the target image. The most basic algorithm, Visual Nearest Neighbor (VisNN), simply adopts the geo-coordinates of the image at the top of the search-results list as the geo-coordinates of the target image. Our DVEM algorithm uses local image features for retrieval, then creates geo-clusters in the list of image search results and adopts the top-ranked cluster, using a method that we previously introduced [5, 6]. The special magic of our DVEM approach is the way that it ranks the clusters in the results list: it validates the visual match at the cluster level (rather than at the level of an individual image) using a geometric verification technique for object/scene matching that we previously proposed in [7], and it leverages the occurrence of visual elements that are discriminative for specific locations.
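To make the search-based paradigm concrete, here is a minimal sketch of the retrieval-plus-geo-clustering skeleton in Python. It illustrates the general paradigm rather than our actual DVEM implementation: `image_search` is a hypothetical content-based retrieval function, the grid cells stand in for our geo-clustering, and a plain sum of retrieval scores stands in for DVEM's cluster-level geometric verification and geo-distinctive element weighting.

```python
from collections import defaultdict

def predict_location(query_image, image_search, cell_size=0.01):
    """Sketch of search-based geo-prediction via geo-clustering.

    image_search: hypothetical function returning a ranked list of
        ((lat, lon), score) pairs from the background collection.
    cell_size: grid resolution in degrees used to form geo-clusters
        (0.01 degrees is roughly 1 km at the equator).
    """
    results = image_search(query_image, top_k=1000)

    # VisNN baseline: adopt the coordinates of the top search result.
    visnn_prediction = results[0][0]

    # Geo-clustering: group the search results into grid cells.
    clusters = defaultdict(list)
    for (lat, lon), score in results:
        cell = (round(lat / cell_size), round(lon / cell_size))
        clusters[cell].append(((lat, lon), score))

    # Rank clusters; a plain score sum replaces DVEM's cluster-level
    # geometric verification and geo-distinctiveness weighting.
    best = max(clusters.values(), key=lambda c: sum(s for _, s in c))

    # Predict the score-weighted centroid of the winning cluster.
    total = sum(s for _, s in best)
    lat = sum(p[0] * s for p, s in best) / total
    lon = sum(p[1] * s for p, s in best) / total
    return (lat, lon), visnn_prediction
```

Note that in this paradigm the indexed background collection itself is the "model", which is why the model size in the table below grows with the number of training images.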
The PlaNet approach divides the surface of the globe into cells with an algorithm that adapts to the number of training images labeled with geo-coordinates for each location, i.e., a region with more photos is divided into finer cells. Each cell is treated as a class, and a CNN classifier is trained to predict, for a given image, the cell in which it was taken.
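The adaptive subdivision can be illustrated as follows. PlaNet builds its cells using Google's S2 geometry on the sphere; the sketch below substitutes a simple recursive split over latitude/longitude, and the thresholds are hypothetical, but it conveys the key property that densely photographed regions end up covered by finer cells.

```python
def partition(photos, bounds=(-90.0, 90.0, -180.0, 180.0),
              max_photos=10000, min_photos=50, depth=0, max_depth=20):
    """Recursively split a cell until no cell holds too many photos.

    photos: list of (lat, lon) training coordinates inside `bounds`.
    Returns a list of (bounds, photos) cells; each cell becomes one
    class for the classifier. Cells with fewer than min_photos images
    are discarded, mirroring the removal of sparsely photographed areas.
    """
    if len(photos) <= max_photos or depth == max_depth:
        return [(bounds, photos)] if len(photos) >= min_photos else []

    lat0, lat1, lon0, lon1 = bounds
    lat_m, lon_m = (lat0 + lat1) / 2, (lon0 + lon1) / 2
    cells = []
    for la0, la1 in ((lat0, lat_m), (lat_m, lat1)):
        for lo0, lo1 in ((lon0, lon_m), (lon_m, lon1)):
            # Half-open intervals; points on the global upper border
            # are ignored in this simplified sketch.
            subset = [(la, lo) for la, lo in photos
                      if la0 <= la < la1 and lo0 <= lo < lo1]
            cells.extend(partition(subset, (la0, la1, lo0, lo1),
                                   max_photos, min_photos,
                                   depth + 1, max_depth))
    return cells
```

At prediction time, the CNN's softmax over these cells is what yields the probability distribution over the whole world that we return to below.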
Further comparison of the way the algorithms were trained and tested in the two papers:
| | DVEM | PlaNet |
| --- | --- | --- |
| Training set size | 5M images (train), 2K (validation) | 91M (train), 34M (validation) |
| Training set selection | CC Flickr images with geo-locations (MediaEval 2015 Placing Task) | Web images with Exif geo-locations |
| Training time | 1 hour on 1,500 cores (indexing and feature extraction for 5M photos) | 2.5 months on 200 CPU cores |
| Test set size | ca. 1M images | 2.3M images |
| Test set selection | CC Flickr images (MediaEval 2015) | Flickr images with 1-5 tags |
| Train/test de-duplication | train and test sets mutually exclusive w.r.t. uploading user | near-duplicate images eliminated |
| Data set availability | via MM Commons on AWS | not specified |
| Model size | 100GB for 5M images | 377MB |
| Baselines | GVR [6], MediaEval 2015 | IM2GPS [8] |
From this table, we see that the training and test data for the two algorithms are different, and for this reason we cannot compare their measured accuracies directly. However, the numbers at the 1 km level (i.e., street level) suggest that DVEM and PlaNet are playing in the same ballpark. PlaNet reports correct predictions for 3.6% of the images on its 2.3M-image test set, and for 8.4% on the IM2GPS data set (237 images). Our DVEM approach achieves around 8% correct predictions on our 1M-image test set, and is surprisingly robust to the exact choice of parameters. DVEM gains 12% relative performance over VisNN and 5% over our own previous GVR. Note that [6] provides evidence that GVR outperforms IM2GPS [8]. PlaNet also reports that it outperforms IM2GPS, but the numbers are not directly comparable because IM2GPS uses 14x less training data than PlaNet.
The downside of search-based approaches is prediction time, as pointed out by the PlaNet authors in their discussion of IM2GPS. DVEM requires 88 hours on a Hadoop-based cluster of 1,500 cores to make predictions for 1M images. For applications requiring offline prediction this may be fine; however, we assume that online geo-prediction is also important. We point out that with enough memory, or with an efficient index-compression method, we would not need Hadoop and could do the prediction on a single core at about 2s per query. Further, the question of how runtime scales is closely related to the question of how many images are actually needed in the background collection. Our DVEM approach uses 18x less training data than the PlaNet algorithm: if the two approaches are indeed in the same ballpark, this result calls into question the assumption that prediction accuracy will not saturate after a certain number of training images.
We mention a couple of reasons why DVEM might ultimately turn out to outperform PlaNet. First, the PlaNet authors point out that the discretization of the globe into cells hurts accuracy in some cases. DVEM, in contrast, creates candidate locations "on the fly", and so has the ability to make a geo-prediction at an arbitrarily fine geo-resolution.
Second, the DVEM test set is possibly more challenging than the PlaNet test set because it does not eliminate images without tags. We assume that the presence of a tag is at least a weak indicator of care on the part of the user. A careless user might also engage in careless photography, producing images that are low quality and/or are not framed to clearly depict their subject matter. A test set containing images taken by relatively more careful users could be expected to yield a higher accuracy.
Third, we assume that when near duplicates were eliminated from the PlaNet test/training set, these were near duplicates from the same location. Eliminating images that are very close visual matches with images from other locations would, of course, artificially simplify the problem. However, it may also turn out that the elimination artificially makes the problem more difficult. In real life, many people simply do take the same picture, for example, of the Leaning Tower of Pisa. A priori, it is not clear how near duplicates should be eliminated so that the testing setup maximally resembles an operational setting.
The PlaNet paper was a pleasure to read, the name "PlaNet" is truly cool, and we are enthused about the small size of the resulting model. We are interested by the fact that PlaNet produces a probability distribution over the whole world, although we also remark that DVEM is capable of producing top-N location predictions. We also liked the idea of exploiting sequence information, but think that considering temporal neighborhoods rather than temporal sequences might also be helpful. Extending DVEM with either temporal sequences or neighborhoods would be straightforward.
We hope that the PlaNet authors will run their approach using the MediaEval 2015 Placing Task data set so that we are able to directly compare the results. In any case, they will want to revisit their assertion that "...previous approaches only recognize landmarks or perform approximate matching using global image descriptors" in the light of the MediaEval 2015 Placing Task results, including our DVEM algorithm.
We would like to point out that work on algorithms able to predict the location of almost any image has been ongoing in full public visibility for a number of years. (Although, given our field, we also enjoy the delicious jolt of a headline beginning "Google unveils...") The starting point can be seen as Mapping the World's Photos [9] in 2009. The MediaEval Multimedia Evaluation benchmark has been developing solutions to the problem since 2010, as chronicled in [10]. The most recent contribution was the MediaEval 2015 Placing Task [11]; cf. the contributions that use visual approaches to the task [12, 13]. The MediaEval 2015 data set is part of the larger, publicly available YFCC100M data set, part of Multimedia Commons, and recently featured in Communications of the ACM [14]. MediaEval 2016 will offer a further edition of the Placing Task, which is open to participation by any research team that signs up.
We close by returning to the importance of NNN (Non Neural Network) approaches. The strength of DVEM relative to PlaNet demonstrates that there is reason for the research community to retain a balance in its engagement with NN and NNN approaches. One appealing aspect of NNN approaches, and in particular of search-based geo-location prediction, is the relative transparency of how the data is connected to the prediction. It may sound like science fiction from today's perspective, but one could imagine a future in which the person who took an image receives a micro-fee every time that image is used to predict geo-location metadata for someone else. Such a system would encourage people to take images that are useful for geo-location, and move us forward as a whole.
We would like to thank the organizers of the MediaEval Placing Task for making the data set available for our research. A big thanks also to SURFsara for the HPC infrastructure without which our work would not be possible.
[1] Xinchao Li, Martha A. Larson, and Alan Hanjalic. Geo-distinctive Visual Element Matching for Location Estimation of Images. arXiv:1601.07884, submitted 28 Jan 2016. http://arxiv.org/abs/1601.07884
[2] Tobias Weyand, Ilya Kostrikov, and James Philbin. PlaNet—Photo Geolocation with Convolutional Neural Networks. arXiv:1602.05314, submitted 17 Feb 2016. http://arxiv.org/abs/1602.05314
[3] Jaeyoung Choi and Gerald Friedland. 2015. Multimodal Location Estimation of Videos and Images. Springer.
[4] P. Serdyukov, V. Murdock, R. van Zwol. 2009. Placing Flickr photos on a map. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '09), ACM, New York, pp. 484–491.
[5] Xinchao Li, Martha Larson, and Alan Hanjalic. 2013. Geo-visual ranking for location prediction of social images. In Proceedings of the 3rd ACM International Conference on Multimedia Retrieval (ICMR '13). ACM, New York, pp. 81-88.
[6] Xinchao Li, Martha Larson, and Alan Hanjalic. 2015. Global-Scale Location Prediction for Social Images Using Geo-Visual Ranking. IEEE Transactions on Multimedia, vol. 17, no. 5, pp. 674-686.
[7] Xinchao Li, Martha Larson, Alan Hanjalic. 2015. Pairwise Geometric Matching for Large-scale Object Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '15), pp. 5153-5161.
[8] J. Hays and A. A. Efros. 2008. IM2GPS: estimating geographic information from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), Anchorage, AK, pp. 1-8.
[9] David J. Crandall, Lars Backstrom, Daniel Huttenlocher, and Jon Kleinberg. 2009. Mapping the world's photos. In Proceedings of the 18th International Conference on World Wide Web (WWW '09). ACM, New York, pp. 761-770.
[10] Martha Larson, Pascal Kelm, Adam Rae, Claudia Hauff, Bart Thomee, Michele Trevisiol, Jaeyoung Choi, Olivier Van Laere, Steven Schockaert, Gareth J.F. Jones, Pavel Serdyukov, Vanessa Murdock, and Gerald Friedland. 2015. The Benchmark as a Research Catalyst: Charting the Progress of Geo-prediction for Social Multimedia. In [3].
[11] Jaeyoung Choi, Claudia Hauff, Olivier Van Laere, Bart Thomee. The Placing Task at MediaEval 2015. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015, CEUR-WS.org, online ceur-ws.org/Vol-1436/Paper6.pdf
[12] Lin Tzy Li, Javier A.V. Muñoz, Jurandy Almeida, Rodrigo T. Calumby, Otávio A. B. Penatti, Ícaro C. Dourado, Keiller Nogueira, Pedro R. Mendes Júnior, Luís A. M. Pereira, Daniel C. G. Pedronette, Jefersson A. dos Santos, Marcos A. Gonçalves, Ricardo da S. Torres. RECOD @ Placing Task of MediaEval 2015. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015, CEUR-WS.org, online ceur-ws.org/Vol-1436/Paper49.pdf
[13] Giorgos Kordopatis-Zilos, Adrian Popescu, Symeon Papadopoulos, Yiannis Kompatsiaris. CERTH/CEA LIST at MediaEval Placing Task 2015. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015, CEUR-WS.org, online ceur-ws.org/Vol-1436/Paper58.pdf
[14] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The New Data in Multimedia Research. Communications of the ACM, vol. 59, no. 2, pp. 64-73.