Saturday, June 18, 2016

Multimedia analysis out of the box: new applications and domains

(Image credit: Tom Magliery, Flickr)
This blogpost summarizes the panel on the third day of the 14th International Workshop on Content-based Multimedia Indexing (CBMI 2016). It includes both statements by the panelists and comments from the audience. I was the panel moderator, and was also taking notes as people were speaking (any error in reproducing what people said here is strictly my own).

The panel was structured into three rounds roughly related to the past, present, and future of multimedia analysis research. Each round had an “opener” that the panelists were asked to respond to, and then continued in free form, with the audience also contributing.

First round: The panelists were asked to discuss, “A past vision (that you have had during the last 20 years) for a multimedia analysis application that came to be.”

The early work of GroupLens started a user revolution. It was great to have recommender systems break onto the scene. Their introduction shifted the focus of the community of researchers, including those studying information/multimedia access, from pure computation to involving users. This shift was possible because computers could collect user interactions, providing researchers with large sets of interactions to work with. Recommender systems introduced the key idea that users can benefit from other users, and this idea has come into its own.

Historically, multimedia indexing started with spoken content indexing. (This statement carried the “footnote” that the panelists and the panel moderator all have a speech background.) In recent years, we have seen the maturation of speech and language technology. Now we are on the brink of systems that index all spoken information in multimedia. (But let’s keep breathing in the meantime.)

The panel noticed that it is easier to name past visions that still have not completely come to be. Examples were:

First person video: In the late 1990s, video life logging started. The goal was to summarize daily life, and to aid memory and remembering. Privacy is a real stumbling block for this vision. However, now we are seeing first person cameras like GoPro: so perhaps video life logging is here, but it is not exactly what we thought it was going to be.

Users: Ten years ago we were developing algorithms for applications, but there was a sense that they would never be put to use. The field of multimedia analysis is now more user centered, although not yet completely so: we are on our way. Sometimes it’s not a 5% gain in MAP that makes a product usable. Instead, we need to think along different lines.

Education: The panel was in agreement that we have yet to see multimedia reach its potential as a tool for education. This could and should be the century of education!

In the early 1990s, multimedia retrieval and spoken content retrieval were intended to support education. Today, we see that education is still mainly about books. MOOCs and online learning resources are growing in popularity, but we are still waiting for multimedia indexing to really contribute to education at large scales.

We used to have the vision that kids should be able to play with information and to communicate with each other as part of studying and learning: These types of applications were fun. What happened to this kind of work? It is a shame that this hasn’t really been put to mainstream use: Is this the responsibility of the multimedia people?

Well, yes. We are all teachers in a way: Why don’t we eat our own dogfood? Looking at this conference, our presentations are all text-heavy sets of PowerPoint slides!

Why is the willingness of teachers and journalists to use multimedia tools so low? Do we need to wait until everyone in the world becomes tech friendly to have our research put to use?

Maybe we just don’t have the tools necessary to allow multimedia indexing to come into its own in support of education. We need the tools in order to engage teachers.

We don’t have the time to do education-related research. You can’t just do a 10 minute experiment with data from 30 people: people are complex, kids are complex! We haven’t been willing to take the time to work with teachers: we haven’t had funding for a 5-10 year sustained effort in this area. But it’s a worthwhile goal.

We need to understand the nature of education. There is a relationship between student and teacher: it is a human relationship. A machine might not be able to motivate the student.

This observation about student behavior stands in contrast to the success of video games in motivating kids. Games appear to motivate kids more than their parents are able to do. However, today’s games are too simplistic to be an education tool. They don’t reflect real breadth.

Final note of the first round: It seems that multimedia analysis researchers don’t talk about “killer applications” anymore. The way we see our success is more diffuse, and maybe that is also OK.

Second round: Panel members were asked to discuss “A current (widely-held) vision for a multimedia analysis application that is doomed.”

Our panelists jumped on the opportunity to be controversial.

Is lifelogging doomed?
Multimedia researchers of course love the huge amounts of data that life logging delivers. But do people really want their lives to be logged? Why would I want all of those pictures? Are we just recording without a real application?

When we are healthy and in good shape we have perhaps no reason to record our lives. But when we become older or find ourselves in a situation in which we need to manage an illness, things change. In this case, lifelogging applications are tremendously interesting. For elderly people living alone, this technology can be a real help: although it does not replace human company.

Why don’t we see this technology being widely used? The problem is not the market. The problem is that we are not marketing or business people: we need someone else to put this technology on the market. The process for doing so is a mess! We develop nice applications, but then we need to move on, and the business development never gets done.

Is virtual reality doomed?
We are not in a virtual space having a virtual conference. We are here. Virtual meeting rooms have not come to be and video conferencing fatigue is real. Virtual reality works great in games. Perhaps also in demonstrating things. But in general, augmented reality appears to be the more promising path.

Is multimedia analysis of broadcast television doomed? 
Analysis of news, sports, movies, in fact, any produced content is over. If someone can produce the content, they can also dedicate the effort to annotate it.

A less extreme version of that position is probably, however, more appropriate. When we carry out multimedia research, often produced content is the only content we have. Not every content producer has the resources to create annotations. Finally (as noted by the moderator), some types of annotations are against the business interests of the people producing multimedia content: Do film producers really want audiences to have a fine-grained breakdown of the violence in a film?

The panel agreed that analysis of produced content is very important for knowledge extraction and summaries of large, heterogeneous collections. You can extract knowledge and facts: for example, the president needs a 20 minute summary.

Professionals or specific applications often need detailed summaries: there would be value in summaries for studying, for example, the soccer moves of a certain player for practice or for strategy purposes.

Personal content often needs summarization: parents like highlights of school games or performances that feature their own children.

Are standards doomed? 
Standards make sense for compression and communication, but standards have been pushed too far. Many researchers identify with this situation: you barely know what you’re doing, and yet you make a standard for it. However, the activity that takes place around the production of standards gives rise to new ideas. The fact that descriptors were encoded in MPEG-7 gave rise to a lot of further work on descriptors.

Perhaps a more direct way of achieving the same effects is via reference implementations and toolkits. OpenCV is effectively, although not formally, a standard. These kinds of efforts are very important.

Third round: Panel members were asked for “A future vision for a multimedia analysis application that we should strive for.”

The opening comment was interesting and unexpected: As an early-career researcher in multimedia one is drawn to problems that one likes, and that attract and hold one’s attention. However, as a late-career researcher, one looks back and starts to regret not having considered the contribution that one’s career was making to society.

Multimedia for medicine: Young multimedia researchers should consider “joining the doctors”: the field of medicine needs us.

Human rights: Another area with enormous potential social impact is multimedia for human rights. We need algorithms that will allow us to find evidence of violations: examples are the analysis of aerial photos to search for hidden destruction and the reconstruction of events using social media.

We need (footnote by moderator) technology that is able to verify the extent to which multimedia reflects the reality that it claims to capture: and, in particular, identify multimedia created with the intent to deceive.

Low quality content is key: Interestingly, some of the most highly socially relevant applications for multimedia involve processing some of the worst images. Multimedia researchers need to be brave enough to venture into areas where content is of poor quality, difficult to obtain, and (footnote by moderator) where evaluation of success is highly challenging.

User intent: Multimedia information retrieval has recently experienced the “intent revolution”: the change from focusing on the nature of the items that users are trying to find, to the tasks that users are trying to achieve. Supporting people in their daily lives is not as obviously socially relevant as education, medical or human rights applications. However, it has an important contribution to make.

Affective computing: We look forward to multimedia systems that support us in the emotional aspects of communicating with multimedia: sharing and mutual remembering. Humans are social creatures (isolation causes us to suffer). Shared experiences allow us to build relationships, share values, and keep the connections needed for social and psychological well-being. Regretfully, current research on affect and sentiment simplifies the emotional aspects of multimedia to the extent that it may be “trivial”. We need to work towards understanding both multimedia and the mind: a key question is: What pieces need to come together in order for someone to experience the reproduction of a memory or an experience?

Hardware and energy consumption: We should not forget that multimedia analysis is possible because of the devices that capture, store and process multimedia. We are ever dependent on hardware. Processing of multimedia costs energy: and future work should also keep energy efficiency in mind.

Closing comments:
When we study multimedia, we study communicating with multimedia. Moving forward it is important to keep the human in human communication.

Is there an end to multimedia? Can we foresee that it might be replaced by something completely different?

We see multimedia as an “everlasting field” encompassing applications that have not yet been invented. However, we should continue to call it “multimedia”, because continuity of what we call it will allow us to build on the past.

Currently, we see more and more other communities doing multimedia: examples are the computer vision community and the speech and language processing community. Having a distinct identity will allow the other fields to avoid reinventing the wheel.

We saw during the first round of the panel that looking back over the past 20 years, we did not do so well in formulating predictions which came true: the technologies that we anticipated have not achieved mainstream uptake (with a few notable exceptions). It’s not dramatic to be wrong in our predictions. However: it is important that we learn from our mistakes.

In general, we do not expect all early-career multimedia researchers to connect to socially relevant applications by “joining the doctors”. But it is good to have a larger vision. When you are writing a paper, embed your ideas within an overall picture of their potential. Embrace the larger meaning of your work and imbue multimedia research with a sense of mission.

A big thank you to our panelists and to the members of the audience who contributed to the discussion.

Panelists:
Guillaume Gravier, IRISA, France
Alexander Hauptmann, Carnegie Mellon University, USA
Bernard Merialdo, EURECOM, France

Audience contributors:
Jenny Benois Pineau, University of Bordeaux, France
Bogdan Ionescu, University Politehnica of Bucharest, Romania
Georges Quénot, LIG, France
Stéphane Marchand Maillet, University of Geneva, Switzerland
Mathias Lux, Klagenfurt University, Austria

Thursday, April 21, 2016

Horizons: Multimedia Technologies that Protect Privacy

The Survey on Future Media for the new H2020 Work Programme gave me 500 characters each to answer a series of critical questions. I’m listing questions and my answers below. I'm taking this as my chance to pull out all the stops: extreme caution meets idealism. Did I use my characters wisely?

Describe which area the new research and innovation work programme of H2020 should look at when addressing the future of Media.

Non-Obvious Relationship Awareness (NORA) is a set of data mining techniques that find relationships between people and events in data that no one would expect to exist. European citizens sharing images or videos online have no way of knowing what sorts of information they are revealing about themselves. We need innovative research on media processing techniques that protect people's privacy by warning them when they are sharing information, and that obfuscate media, making it safe for sharing.

What difference would projects in the area you propose make for Europe's society and citizens?

Projects in this area would contribute to safeguarding the fundamental right of European citizens to privacy and protection of personal data. Today, privacy protection focuses on protecting "obvious" personal information. This protection means nothing when personal information is obtainable in "non-obvious" form. European citizens need tools to understand the dangers of sharing media in cyberspace, and tools that can support them in making informed decisions and protecting themselves.

What are the main technological and Media ecosystem related breakthroughs to achieve the foreseen scenario?

The Media ecosystem in question is the whole of cyberspace. The breakthrough that we need is techniques to predict the impact of data that we have not yet seen entering the system. We need techniques that are able to obfuscate images and videos in ways that defeat sophisticated machine learning algorithms, such as deep learning techniques. These technologies must be designed from the beginning in a way that is understandable and acceptable to the general population: protection only works if used.

What kind of technology(ies) will be involved?

Technologies involved are image, text, audio, and video processing algorithms. These algorithms will re-synthesize users' multimedia content so that it still fulfills its intended function, but with a reduced risk of leaking private information. Technology must go beyond big data to be aware of hypothetical future data. Yet unheard of: technology capable of protecting users' privacy against inference of non-obvious relations must be understandable by the people whom it is intended to serve.

Describe your vision on the future of Media in 5 years' time?

People will begin to worry about large companies claiming to own (and attempting to sell them back) digital versions of their past selves, forgotten on distant servers. The realization will grow that it is not enough to have a device that takes amazing images and videos: you also need a device that allows you to save and enjoy those images in years to come. An understanding will emerge that a rich digital media chronicle of one's own life contributes to health, happiness and wellbeing.

Describe your vision on the future of Media in 10 years' time?

Social images circling the globe will give people unprecedented insight into the human condition. People living in both developed and developing countries will rebel at anyone in the human race living under conditions of constant fear, and threat of constant hunger. The world will change. If protecting privacy means that people need to stop sharing images and videos altogether, the opportunity to fulfill this idealistic vision is missed. The future of Media is bright, but only if it can be kept safe.

At the end of the day, multimedia is about making the world healthy, happy, and complete. At the end of this exercise I have concluded that the horizon stretches even further than 2020.

Sunday, April 3, 2016

Starting to RUN

Thank you for the emails, tweets and texts about my new appointment at Radboud University Nijmegen. I'm happy that other people realize what a special day it was for me, and share my excitement about new opportunities and new challenges. I appreciate the warm reception at Radboud University. The "Welcome!" was unmistakable: actually written on my whiteboard when I walked into my office in the Centre for Language Studies for the first time.

My appointment is as "Professor of Multimedia Information Technology" at the Faculty of Science, Institute for Computing and Information Sciences (iCIS). It involves a double affiliation (50/50) between iCIS and the Faculty of Arts, Centre for Language Studies (CLS). In this way, it brings together my background (pre-1990 in Math and EE; 1990-2000 in Formal Linguistics; and since 2000 in Computer Science, i.e., audio-visual search engines). It is a natural extension of this background that I will be working to bridge the information access research occurring in the two faculties.

A press release about my appointment appeared on 31 March on the Radboud University homepage. I was very happy about the publicity for the MediaEval Multimedia Evaluation Benchmark. MediaEval is an initiative aimed at driving the development of new multimedia access technologies by offering shared tasks to the community. Instead of being centrally organized, it is grassroots in nature. My role is that of the bass player who, in a band, helps to link the different parts together and keep the music moving forward on tempo. The success of the benchmark comes from the dedication and efforts of the task organizers and the participants. (MediaEval is offering a great lineup of tasks in 2016, and signup is now open on the MediaEval 2016 website. The MediaEval 2016 workshop will be held 20-21 October 2016, right after ACM Multimedia 2016 in Amsterdam.)

Starting January 2017, Radboud University will be my main university (4 days per week), but I will maintain an affiliation with Delft University of Technology (1 day a week).

Currently, my main affiliation remains the Multimedia Computing Group at Delft University of Technology. However, I am at Radboud University Nijmegen for two days a week to get started at CLS. My first act is to teach Intelligent Information Tools, a course for first and second year undergraduate students in Communication and Information Science. The students learn about the nature of information, the structure of the internet, how search, recommendation, and other information tools work, and also how to think critically about these tools.

At TU Delft I continue teaching and pursuing my research. The main focus of my research at this time is recommender systems, within the context of the EC FP7 project CrowdRec "Fusion of active information for next generation recommender systems". It is a privilege to serve the CrowdRec consortium as the scientific coordinator. Current highlights are: the NewsREEL news recommendation challenge at CLEF 2016, the ACM RecSys 2016 job recommendation challenge, and the Workshop on Deep Learning for Recommender Systems, also at ACM RecSys 2016. I look forward to a successful conclusion of the project in September 2016, and also to future collaborations.

Seven years ago, nearly to the day, I wrote the first post on this blog. I had read an article advising "kill your blog" as an answer to blogposts getting lost in a sea of mainstream information. My post pointed out that it is strange to suggest that bloggers must change, without mentioning the role or responsibility of search engines.

Now, I am more convinced than ever of the value of information within small circles. Search needs to support exploitation of that value. The readership of this blog is intended to be future versions of myself, and also a limited number of people interested in a deep dive into reflections on various search-related topics. As I move to a new university, and the number of people I teach or collaborate with grows, I would like to remember that. I'll probably have less time to write blog posts, but I have decided that I will wait a few more years before moving away from occasionally blogging.

Creating information is a way in which we help ourselves think. Intense conversations also refine thought. But the model in which everyone talks to everyone about everything does not always make sense. Instead, we need room for reflection with a relatively small set of individuals. Search should support that.

What's blocking the road? Maybe we feel that small scale search is a success because Google now displays calendar events in our search results. Maybe facing the personal is somehow more laborious or painful. In any case, we are currently far from understanding the aggregated impact of thousands of local dialogues, or from evaluating the success of small-scale search that helps us exchange ideas with our past selves, and our closest colleagues. The future holds no lack of challenges.


Saturday, March 5, 2016

A Non Neural Network algorithm with "superhuman" ability to determine the location of almost any image

Martha Larson and Xinchao Li

We would like to complement the MIT Technology Review headline Google Unveils Neural Network with “Superhuman” Ability to Determine the Location of Almost Any Image with information about NNN (Non Neural Network) approaches with similar properties.

This blogpost provides a comparison between the DVEM (Distinctive Visual Element Matching) approach, introduced by our recent arXiv manuscript (currently under review): 

Xinchao Li, Martha A. Larson, Alan Hanjalic Geo-distinctive Visual Element Matching for Location Estimation of Images (Submitted on 28 Jan 2016) (http://arxiv.org/abs/1601.07884)

and the PlaNet approach, introduced by the arXiv manuscript covered in the MIT Technology Review article:

Tobias Weyand, Ilya Kostrikov, James Philbin PlaNet—Photo Geolocation with Convolutional Neural Networks (Submitted on 17 Feb 2016) (http://arxiv.org/abs/1602.05314)

We also include, at the end, a bit of history on the problem of automatically "determining the location of images",  which is also known as geo-location prediction, geo-location estimation as in [3], or, colloquially, "placing" after [4].

Our DVEM approach is a search-based approach to the prediction of the geo-location of an image. Search-based approaches consider the target image (the image whose geo-coordinates are to be predicted) as a query. They then carry out content-based image search (i.e., query-by-image) on a large training set of images labeled with geo-coordinates (referred to as the "background collection"). Finally, they process the search results in order to make a prediction of the geo-coordinates of the target image. The most basic algorithm, Visual Nearest Neighbor (VisNN), simply adopts the geo-coordinates of the image at the top of the search results list as the geo-coordinates of the target image.  Our DVEM algorithm uses local image features for retrieval, and then creates geo-clusters in the list of image search results. It adopts the top ranked cluster, using a method that we previously introduced [5, 6]. The special magic of our DVEM approach is the way that it reranks the clusters in the results list: it validates the visual match at the cluster level (rather than at the level of an individual image) using a geometric verification technique for object/scene matching we previously proposed in [7], and it leverages the occurrence of visual elements that are discriminative for specific locations.
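To make the search-based pipeline concrete, here is a minimal sketch in Python. It is purely illustrative: the content-based image search, the cluster scoring used by DVEM, and all coordinates and scores below are stand-ins, not the actual components from [1].

```python
from collections import defaultdict

def geo_cluster(results, cell=0.01):
    """Group ranked search results (lat, lon, visual score) into coarse geo cells."""
    clusters = defaultdict(list)
    for lat, lon, score in results:
        clusters[(round(lat / cell), round(lon / cell))].append((lat, lon, score))
    return clusters

def predict_location(results):
    """VisNN would simply return results[0][:2]; here we score whole clusters."""
    clusters = geo_cluster(results)
    # Stand-in cluster score: sum of visual-match scores. DVEM instead validates
    # the visual match at the cluster level and weights geo-distinctive elements.
    best = max(clusters.values(), key=lambda c: sum(s for _, _, s in c))
    return (sum(p[0] for p in best) / len(best),
            sum(p[1] for p in best) / len(best))

# Hypothetical ranked results for a target image of the Eiffel Tower.
results = [(48.8584, 2.2945, 0.9), (48.8583, 2.2946, 0.8), (40.7580, -73.9855, 0.85)]
print(predict_location(results))  # the Paris cluster wins despite the strong NYC match
```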

The PlaNet approach divides the surface of the globe into cells with an algorithm that adapts to the number of images in its training set that are labeled with geo-coordinates for that location, i.e., a location that has more photos will be divided into finer cells. Each cell is considered a class, and is used to train a CNN classifier.
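The toy recursion below illustrates the adaptive-partitioning idea: cells with many photos are subdivided further, and each leaf cell becomes one class. Note that the paper itself subdivides cells of Google's S2 hierarchy rather than flat lat/lon boxes, and the thresholds here are invented for illustration.

```python
def split(cell, photos, max_photos=2, min_size=0.5):
    """Recursively split a (lat0, lon0, lat1, lon1) cell while it holds too many photos."""
    lat0, lon0, lat1, lon1 = cell
    inside = [(la, lo) for la, lo in photos if lat0 <= la < lat1 and lon0 <= lo < lon1]
    if len(inside) <= max_photos or (lat1 - lat0) <= min_size:
        return [cell]  # this cell becomes one class for the CNN classifier
    mid_la, mid_lo = (lat0 + lat1) / 2, (lon0 + lon1) / 2
    cells = []
    for quad in [(lat0, lon0, mid_la, mid_lo), (lat0, mid_lo, mid_la, lon1),
                 (mid_la, lon0, lat1, mid_lo), (mid_la, mid_lo, lat1, lon1)]:
        cells.extend(split(quad, inside, max_photos, min_size))
    return cells

# Dense regions (three Paris photos) end up with finer cells than sparse ones.
photos = [(48.85, 2.29), (48.86, 2.30), (48.87, 2.31), (40.75, -73.99)]
print(len(split((-90.0, -180.0, 90.0, 180.0), photos)))
```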

Further comparison of the way the algorithms were trained and tested in the two papers:


|                           | DVEM                                                                   | PlaNet                               |
|---------------------------|------------------------------------------------------------------------|--------------------------------------|
| Training set size         | 5M images train, 2K validation                                         | 91M train, 34M validation            |
| Training set selection    | CC Flickr images with geo-locations (MediaEval 2015 Placing Task)      | Web images with Exif geolocations    |
| Training time             | 1 hour on 1,500 cores for 5M photos (indexing and feature extraction)  | 2.5 months on 200 CPU cores          |
| Test set size             | ca. 1M images                                                          | 2.3M images                          |
| Test set selection        | CC Flickr images (MediaEval 2015)                                      | Flickr images with 1-5 tags          |
| Train/test de-duplication | train/test sets mutually exclusive wrt uploading user                  | CNN trained on near-duplicate images |
| Data set availability     | via MM Commons on AWS                                                  | not specified                        |
| Model size                | 100GB for 5M images                                                    | 377MB                                |
| Baselines                 | GVR [6], MediaEval 2015                                                | IM2GPS [8]                           |

From this table, we see that the training and test data for the algorithms are different, and for this reason, we cannot compare the accuracy measured for the two approaches directly. However, the numbers at the 1 km level (i.e., street level) suggest that DVEM and PlaNet are playing in the same ballpark. PlaNet reports correct prediction for 3.6% of the images on its 2.3M image test set and 8.4% on the IM2GPS data set (237 images). Our DVEM approach achieves around 8% correct predictions on our 1M image test set, and is surprisingly robust to the exact choice of parameters. DVEM gains 12% relative performance over VisNN, and 5% over our own previous GVR. Note that [6] provides evidence that GVR outperforms IM2GPS [8]. PlaNet also reports that it outperforms IM2GPS, but the numbers are not directly comparable because 14x less training data is used.

The downside of search-based approaches is prediction time, as pointed out by the PlaNet authors in their discussion of IM2GPS. DVEM requires 88 hours on a Hadoop-based cluster containing 1,500 cores to make predictions for 1M images. For applications requiring offline prediction, this may be fine; however, we assume that online geo-prediction is also important. We point out that with enough memory or an efficient index compression method, we would not need Hadoop, and we would be able to do the prediction on a single core with about 2s per query. Further, the question of how runtime scales is closely related to the question of the number of images that are actually needed in the background collection. Our DVEM approach uses 18x less training data than the PlaNet algorithm: if we are indeed in the same ballpark, this result calls into question the assumption that prediction accuracy will not saturate after a certain number of training images.

We mention a couple of reasons why DVEM might ultimately turn out to out-perform PlaNet. First, the PlaNet authors point out that the discretization hurts accuracy in some cases. DVEM, in contrast, creates candidate locations "on the fly". As such, DVEM has the ability to make a geo-prediction at an arbitrarily small geo-resolution.

Second, the test set used to test DVEM is possibly more challenging than the PlaNet test set because it does not eliminate images without tags. We assume that the presence of a tag is at least a weak indicator of care on the part of the user. A careless user might also engage in careless photography, producing images that are low quality and/or are not framed to clearly depict their subject matter. A test set containing images taken by relatively more careful users could be expected to yield a higher accuracy.

Third, we assume that when near duplicates were eliminated from the PlaNet test/training set, these were near duplicates from the same location. Eliminating images that are very close visual matches with other locations would, of course, artificially simplify the problem. However, it may also turn out that the elimination artificially makes the problem more difficult. In real life, a lot of people simply do take the same picture, for example, of the leaning tower of Pisa. A priori it is not clear how near duplicates should be eliminated to ensure the testing setup maximally resembles an operational setting.

The PlaNet paper was a pleasure to read, the name "PlaNet" is truly cool, and we are enthused about the small size of the resulting model. We are interested by the fact that PlaNet produces a probability distribution over the whole world, although we also remark that DVEM is capable of producing top-N location predictions. We also liked the idea of exploiting sequence information, but think that considering temporal neighborhoods rather than temporal sequences might also be helpful. Extending DVEM with either temporal sequences or neighborhoods would be straightforward.

We hope that the PlaNet authors will run their approach using the MediaEval 2015 Placing Task data set so that we are able to directly compare the results. In any case, they will want to revisit their assertion that "...previous approaches only recognize landmarks or perform approximate matching using global image descriptors" in the light of the MediaEval 2015 Placing Task results, including our DVEM algorithm.

We would like to point out that work on algorithms able to predict the location of almost any image has been ongoing in full public visibility for a number of years. (Although given our field, we also enjoy the delicious jolt of a headline beginning "Google unveils...") The starting point can be seen as Mapping the World's Photos [9] in 2009. The MediaEval Multimedia Evaluation benchmark has been developing solutions to the problem since 2010, as chronicled in [10]. The most recent contribution was the MediaEval 2015 Placing task [11], cf. the contributions that use visual approaches to the task [12,13]. The MediaEval 2015 data set is part of the larger, publicly available YFCC100M data set, part of Multimedia Commons, and recently featured in Communications of the ACM [14]. MediaEval 2016 will offer a further edition of the Placing Task, which is open to participation for any research team who signs up.

We close by returning to comment on the importance of NNN (Non Neural Network) approaches. This example of the strength of DVEM vs. PlaNet provides a demonstration that there is reason for the research community to retain a balance in their engagement in NN and NNN approaches. One appealing aspect of NNN approaches, and, in particular, of search-based geo-location prediction, is the relative transparency of how the data is connected to the prediction. It may sound like science fiction from today's perspective, but one could imagine a future in which the person who took an image would receive a micro fee every time their image was used for the purpose of predicting geo-location metadata for someone else. Such a system would encourage people to take images that are useful for geo-location, and move us forward as a whole.

We would like to thank the organizers of the MediaEval Placing task for making the data set available for our research. Also a big thanks to SURF SARA for the HPC infrastructure without which our work would not be possible.

[1] Xinchao Li, Martha A. Larson, Alan Hanjalic Geo-distinctive Visual Element Matching for Location Estimation of Images (Submitted on 28 Jan 2016) (http://arxiv.org/abs/1601.07884)

[2] Tobias Weyand, Ilya Kostrikov, James Philbin PlaNet—Photo Geolocation with Convolutional Neural Networks (Submitted on 17 Feb 2016) (http://arxiv.org/abs/1602.05314)
[3] Jaeyoung Choi and Gerald Friedland. 2015. Multimodal Location Estimation of Videos and Images. Springer.
[4] P. Serdyukov, V. Murdock, R. van Zwol. 2009. Placing Flickr photos on a map. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '09), ACM, New York, pp. 484–491.
[5] Xinchao Li, Martha Larson, and Alan Hanjalic. 2013. Geo-visual ranking for location prediction of social images. In Proceedings of the 3rd ACM International Conference on Multimedia Retrieval (ICMR '13). ACM, New York, NY, USA, 81-88.
[6] Xinchao Li, Martha Larson, and Alan Hanjalic. Global-Scale Location Prediction for Social Images Using Geo-Visual Ranking. IEEE Transactions on Multimedia, vol. 17, no. 5, pp. 674-686, May 2015.
[7] Xinchao Li, Martha Larson, Alan Hanjalic. 2015. Pairwise Geometric Matching for Large-scale Object Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '15), pp. 5153-5161.
[8] J. Hays and A. A. Efros, "IM2GPS: estimating geographic information from a single image," Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, Anchorage, AK, 2008, pp. 1-8.
[9] David J. Crandall, Lars Backstrom, Daniel Huttenlocher, and Jon Kleinberg. 2009. Mapping the world's photos. In Proceedings of the 18th International Conference on World Wide Web (WWW '09). ACM, New York, 761-770.
[10] Martha Larson, Pascal Kelm, Adam Rae, Claudia Hauff, Bart Thomee, Michele Trevisiol, Jaeyoung Choi, Olivier Van Laere, Steven Schockaert, Gareth J.F. Jones, Pavel Serdyukov, Vanessa Murdock, Gerald Friedland. 2015. The Benchmark as a Research Catalyst: Charting the Progress of Geo-prediction for Social Multimedia. In [3].
[11] Jaeyoung Choi, Claudia Hauff, Olivier Van Laere, Bart Thomee. The Placing Task at MediaEval 2015. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015, CEUR-WS.org, online ceur-ws.org/Vol-1436/Paper6.pdf
[12] Lin Tzy Li, Javier A.V. Muñoz, Jurandy Almeida, Rodrigo T. Calumby, Otávio A. B. Penatti, Ícaro C. Dourado, Keiller Nogueira, Pedro R. Mendes Júnior, Luís A. M. Pereira, Daniel C. G. Pedronette, Jefersson A. dos Santos, Marcos A. Gonçalves, Ricardo da S. Torres. RECOD @ Placing Task of MediaEval 2015. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015, CEUR-WS.org, online ceur-ws.org/Vol-1436/Paper49.pdf
[13] Giorgos Kordopatis-Zilos, Adrian Popescu, Symeon Papadopoulos, Yiannis Kompatsiaris. CERTH/CEA LIST at MediaEval Placing Task 2015. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015, CEUR-WS.org, online ceur-ws.org/Vol-1436/Paper58.pdf
[14] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, Li-Jia Li. YFCC100M: The New Data in Multimedia Research. Communications of the ACM, Vol. 59 No. 2, Pages 64-73.

Sunday, February 7, 2016

MediaEval 2015: Insights from last year's experiences in multimedia benchmarking



This blogpost is a list of bullet points concerning MediaEval 2015. It represents the "meta-themes" of MediaEval that I perceived to be the strongest during the MediaEval 2015 season, which culminated with the MediaEval 2015 Workshop in Wurzen, Germany (14-15 September 2015). I'm putting them here, so we can look back later and see how they are developing.
  1. How not to re-invent the wheel? Providing task participants with reading lists of related work and with baseline implementations helps ensure that it is as easy as possible for them to develop algorithms that extend the state of the art.
  2. Reproducibility and replication: How can we encourage participants to share information about their approaches so that their results can be reproduced or replicated? How can we emphasize the importance of reproduction and replication and at the same time push for innovation, and forward movement in the state of the art (and avoid re-inventing the wheel as just mentioned)? One answer that arose this year was to reinforce student participation. Students should feel welcome at the workshop, even if they “just” reproduced an existing workflow.
  3. Development of evaluation metrics for new tasks: Innovating a new task may involve developing a new evaluation metric. All tasks face the challenge of ensuring that they are using an evaluation metric that faithfully reflects usefulness to users within an evaluation scenario.
  4. How to make optimal use of leaderboards in evaluation: Participants should be able to check on their progress over the course of the benchmark, and aspire to ever-greater heights. However, it is important that leaderboards not discourage participants from submitting final runs to the benchmark. It is possible that an innovative new approach does very badly on the leaderboard, but is still valuable.
  5. Understanding the relationship between the conceptual formulation of the task, and the dataset that is chosen for use in the task: Are the two compatible? Are there assumptions that we are making about the dataset that do not hold? How can we keep task participants on track: solving the conceptual formulation from the task, and not leveraging some incidental aspect of the dataset?
  6. Disruption: Tasks are encouraged to innovate from year to year. However, 2015 was the first year that organizers started planning far ahead for “disruption” that would take the task to the next level in the next year.
  7. Using crowdsourcing for evaluation: How to make sure that everyone is aware of and applies best practices? How to ensure that the crowd is reflective of the type of users in the use scenario of the task?
  8. Engineering: Task organization involves an enormous amount of time and dedication to engineering work. We continuously seek ways to structure organizer teams and to recruit new organizers and task auxiliaries to make sure that no one feels that their scientific output suffered in a year where they spent time handling the engineering aspects of MediaEval task organization.
  9. Defining tasks and writing task descriptions: We repeatedly see that the process of defining a new task and of writing task descriptions must involve a large number of people. If people with a lot of multimedia benchmarking experience contribute, they can help to make sure that the task definition is well grounded in the existing literature. If people with very little experience in multimedia benchmarking contribute, they can help to make sure that the task definition is understandable even to new participants. We try to write task descriptions such that a master student planning to write a thesis on a multimedia related topic would easily understand what was required for the task.

In order to round this off to a nice "10" points let me mention another issue that is constantly on my mind, namely, the way that the multimedia community treats the word "subjective".

"Subjective" is something that one feels oneself as a subject (and cannot be directly felt by another person---pain is the classic example). In MediaEval tasks, such as Violent Scene Detection, we would like to respect the fact that people are entitled to their own opinions about what constitutes a concept. Note that people can communicate very well concerning acts of violence, without all having an exactly identical idea of what constitutes "violence". Because the concept "works" in the face of the existence of person perspectives, we can consider the task "subjective". 

So often researchers reason in the sequence, "This task is subjective, therefore it is difficult for automatic multimedia analysis algorithms to address". That reasoning simply does not follow. Consider this example: Classifying a noise source as painful is the ultimate "subjective task". You as a subject are the only one who knows that you are in pain. However: Create a device that signals "pain" when noise levels reach 100 decibels, and you have a solution to the task. Easy as pie. "Subjective" tasks are not inherently difficult. 
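The noise-source example from the text, expressed as code: a "subjective" task addressed by a trivial fixed rule. Purely illustrative; the 100 decibel threshold is the one named above, and nothing here is learned from data.

```python
def signals_pain(noise_level_db):
    """Signal "pain" when the noise level reaches 100 decibels."""
    return noise_level_db >= 100

print(signals_pain(85))   # False: below the threshold
print(signals_pain(110))  # True: the device signals "pain"
```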

Instead: whether a task is difficult to address with automatic methods depends on the stability of content-based features across different target labels. 

The whole point of machine learning is to generalize across not only obvious cases, but also across cases in which no stability of features is apparent to a human observer. If we stuck to tasks that "looked" easy to a researcher browsing through the data, (exaggerating a bit for effect) we might as well handcraft rule-based recognizers. So my point 10 is to try to figure out a way to keep researchers from being scared off from tasks just because they are "subjective", without giving the matter a second thought. Multimedia research needs to tackle "subjective" tasks in order to make sure that it remains relevant to the real-world needs of users---once you understand subjectivity, you start to realize that it is actually all over the place.

In 2014, we noticed that the discussion of such themes was becoming more systematic, and that members of the MediaEval community were interested in having a venue in which they could publish their thoughts. For this reason, in 2015, we added a MediaEval Letters section to the MediaEval Working Notes Proceedings dedicated to short considerations of themes related to the MediaEval workshop. The Letter format allows researchers to publish their thoughts already as they are developing, even before they are mature enough to appear in a mainstream venue.

The concept of MediaEval Letters was described in the following paper, in the 2015 MediaEval Working Notes Proceedings:

Larson, M., Jones, G.J.F., Ionescu, B., Soleymani, M., Gravier, G. Recording and Analyzing Benchmarking Results: The Aims of the MediaEval Working Notes Papers. Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015, CEUR-WS.org, online http://ceur-ws.org/Vol-1436/Paper90.pdf


Look for MediaEval Letters to be continued in 2016.

Monday, December 21, 2015

Features, machine learning, and explanations

Selected Topics in Multimedia Computing is a master-level seminar taught at the TU Delft. This post is my answer to the question from one of this year's students, asked in the context of our discussion of the survey paper that he wrote for the seminar on emotion recognition for music. I wrote an extended answer, since questions related to this one come up often, and my answer might be helpful for other students as well. 

Here is the text of the question from the student's email:


I also have a remark about something you said yesterday during our Skype meeting. You said that the premise of machine learning is that we don't exactly know why the features we use give an accurate result (for some definition of accuracy :)), hence that sentence about features not making a lot of sense could be nuanced, but does that mean it is accepted in the machine learning field that we don't explain why our methods work? I think admitting that some of the methods that have been developed simply cannot be explained, or at least won't be for some time, would undermine the point of the survey. Also, in that case, it would seem we should at least have a good explanation for why we are unable to explain why the unexplained methods work so well (cf. Deutsch, The beginning of infinity, particularly ch. 3).

Your survey tackles the question of missing explanations. In light of your topic, yes, I do agree that it would undermine your argument to identify methods that cannot be explained as valuable in moving the field forward.

My comment deals specifically with this sentence: "For some of the features, it seems hard to imagine how they could have a significant influence on the classification, and the results achieved do not outrank those of other approaches by much."

I'll start off by saying that a great deal more important than this specific sentence is a larger point that you are making in the survey, namely that meaningful research requires paying attention to whether what you think that you are doing and what you are actually doing are aligned. You point out the importance of the "horse" metaphor of:

Sturm, B.L., "A Simple Method to Determine if a Music Information Retrieval System is a “Horse”," IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1636-1644, Oct. 2014.

I couldn't agree more on that.

But here, let's think about the specific sentence above. My point is that it would help the reader to express what you are thinking here more fully. If you put two thoughts into one sentence like this, the reader will jump to the conclusion that one explains the other. You want to avoid assuming (or implying that you assume) that the disappointing results could have been anticipated by choosing features that a priori could be "imagined" to have significant influence.

(Note that there are interpretations of this sentence — i.e., if you read "and" as a logical conjunction — that do not imply this. As computer scientists, we are used to reading code, and these interpretations are, I have the impression, relatively more natural to us than to other audiences. So it is safer to assume that in general your readers will not spend a lot of time picking the correct interpretation of "and", and need more help from your side as the author :-))

As a recap: I said that Machine Learning wouldn't be so useful if humans could look at a problem, and tell which features should be used in order to yield the best performance.  I don't want to go as far as claiming it is the premise, in fact, I rather hope I didn't actually use the word "premise" at all.

Machine learning starts with a feature engineering step where you apply an algorithm to select features that will be used, e.g., by your classifier. After this step, you can "see" which features were selected. So it's not the case that you have no idea why machine learning works.

My point is that you need to be careful about limiting your input to feature selection a priori. If you assume you yourself can predict which features will work, you will miss something. When you use deep learning, you don't necessarily do feature selection, but you do have the possibility of inspecting the hidden layers of the neural network, and these can shed some light on why it works.

This is not to say that human intelligence should not be leveraged for feature engineering. Practically, you do need to make design decisions to limit the possible number of choices that you are considering. Well-motivated choices will land you with a system that is probably also "better", also along the line of Deutsch's thinking (I say "probably" because I have not read the book that you cite in detail).

In any case: careful choices of features are necessary to prevent you from developing a classifier that works well on the data set that you are working on because there is an unknown "leak" between a feature and the ground truth, i.e., for some reason one (or more) of the features is correlated with the ground truth. If you have such a leak, your methods will not generalize, i.e., their performance will not transfer to the case of unseen data. The established method for preventing this problem is ensuring that you carry out your feature engineering step on separate data (i.e., separate from both your training and your test sets). A more radical approach that can help when you are operationalizing a system is to discard features that work "suspiciously" well. A good dose of common sense is very helpful, but note that you should not try to replace good methodology and feature engineering with human intelligence (which I mention for completeness, and not because I think you had any intention in this direction).
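A minimal sketch of one standard way to keep feature selection from "seeing" the held-out data, using scikit-learn. The data, the selector, and the classifier here are all illustrative stand-ins; the point is only that the selection step sits inside the pipeline, so it is re-fit on the training portion of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 100 features, of which only 10 are actually informative.
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)

pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),    # feature selection step
    ("clf", LogisticRegression(max_iter=1000)),  # downstream classifier
])

# Because selection happens inside the pipeline, each fold's held-out data
# never influences which features are chosen: no leak into the evaluation.
scores = cross_val_score(pipeline, X, y, cv=5)
print("Mean accuracy: %.3f" % scores.mean())
```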

It is worth pointing out that there are plenty of problems out there that can indeed be successfully addressed by a classifier that is based on rules that were defined by humans, "by hand". If you are trying to solve such a problem, you shouldn't opt for an approach that would require more data or computational resources, merely for the sake of using a completely data-driven algorithm. The issue is that it is not necessarily easy to know whether or not you are trying to solve such a problem.

In sum, I am trying to highlight a couple of points that we tend to forget sometimes when we are using machine learning techniques: You should resist the temptation to look at a problem and declare it unsolvable because you can't "see" any features that seem like they would work. A second related temptation that you should resist is using sub-optimal features because you make your own assumptions about what the best features must be a priori.

A few further words on my own perspective:

There are algorithms that are explicitly designed to make predictions and create human-interpretable explanations simultaneously. This is a very important goal for intelligent systems that are being used by users who don't have the technical training to understand what is going on "under the hood."

Personally, I hold the rather radical position that we should aspire to creating algorithms that are effective, but yet so simple that they can be understood by anyone who uses their output. The classic online shopping recommender "People who bought this item also bought ...." is an example that hopefully convinces you such a goal is not completely impossible. A major hindrance is that we may need to sacrifice some of the fun and satisfaction we derive from cool math.
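A minimal sketch of the idea behind "people who bought this item also bought...": count item co-occurrences across purchase baskets and recommend the most frequent co-purchases. The baskets and item names are invented for illustration; the point is that the logic is simple enough for any shopper to understand.

```python
from collections import Counter
from itertools import permutations

baskets = [
    {"camera", "sd_card", "tripod"},
    {"camera", "sd_card"},
    {"camera", "camera_bag"},
    {"tripod", "camera_bag"},
]

co_counts = {}  # item -> Counter of items bought together with it
for basket in baskets:
    for a, b in permutations(basket, 2):
        co_counts.setdefault(a, Counter())[b] += 1

def also_bought(item, n=2):
    """Top-n items most often bought together with `item`."""
    return [other for other, _ in co_counts.get(item, Counter()).most_common(n)]

print(also_bought("camera"))  # e.g. ['sd_card', 'tripod']
```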

Stepping back yet further:

Underlying the entire problem is the danger that the data you use to train your learner has properties that you have overlooked or did not anticipate, and that your resulting algorithm gives you, well, let me just come out and call it "B.S." without your realizing it. High profile cases get caught (http://edition.cnn.com/2015/07/02/tech/google-image-recognition-gorillas-tag/). However, these cases should also prompt us to ask the question: Which machine learning flaws go completely unnoticed?

Before you start even thinking about features, you need to explain the problem that you are trying to address, and also explain how and why you chose the data that you will use to address it.

Luckily, it's these sorts of challenges that make our field so fascinating.

Friday, October 30, 2015

Compressing a complicated situation: Making videos that support search

Mohammad Soleymani asked me to make a video for ASM 2015, the Workshop on Affect and Sentiment in Multimedia, held at ACM Multimedia 2015 on 30 October in Brisbane, Australia. He wanted to present during the workshop the views of different disciplines outside of computer science on sentiment. Since my PhD is in linguistics (semantics/syntax interface), I was the natural choice for "The Linguist". The result of my efforts was a video entitled "A Linguist Talks about Sentiment".

In this post, I briefly discuss my thoughts upon making this video. I wanted to give viewers with no linguistics background the confidence they needed in order to attempt to understand semantics, as it is studied by linguists, and leverage these insights in their work. Ultimately, such a "support for search" video has the goal of addressing people starting with the background that they already have, and giving them just enough knowledge in order to support them in searching for more information themselves.


Mohammad gave me four minutes of time for the video, and I pushed it beyond the limit: the final video runs over six minutes. I realized that what I needed to do was not to convey all possible relevant information, but rather to show where a complicated situation is hiding behind something that might seem at first glance simple. The effect of the video is to convince the viewer that it's worth searching for more information on linguistics, and to give just enough terminology to support that search.

My original motivation to make the video was a strong position that I hold: At one level, I agree that anyone can study anything that they want. However, without a precisely formulated, well-grounded definition of what we are studying, we are in danger of leaving vast areas of the problem unexplored, and of prematurely declaring the challenge a solved problem.

After making this video, I realized that one can consider "support for search" videos a new genre of videos. These videos allow people to get their foot in the door, and provide the basis for search. A good support-for-search video needs to address specific viewer segments "where they are", i.e., given their current background knowledge. It must simplify the material without putting people off on completely the wrong track. Finally, it must admit this simplification, so that viewers realize that there is more to the story.

When I re-watched my video after the workshop, I found a couple of places that made me cringe. Would a fellow linguist accept these simplifications, or do they distort too much? I make no mention of the large range of different psychological verbs, or of the difference between syntactic and semantic roles. I put a lot of emphasis on the places where I myself have noticed that people fail to understand. On the whole, the goal is that the video allows the viewer to go from a state of being overwhelmed by a complex topic to having the handholds necessary in order to formulate information needs and support search.

Are "support for search" videos a new genre on the rise? If this "genre" is indeed a trend, it is a welcome one.

While browsing recently, I hit on a video entitled "Iraq Explained -- ISIS, Syria and War":


This video assumes of the viewer a particular level of background knowledge (little) and focuses on introducing the major players and a broad sketch of the dynamics. There are points in the video where I find myself saying "Well, OK" (the choice of music gives the impression of things happening with a sense of purpose that I do not remember from the time). This video is clearly a "support for search" video since it ends with the following information:

"We did our best to compress a complicated situation in a very short video, so please don’t be mad that we had to simplify things. There is sooo much stuff that we couldn’t talk about in the detail it deserved…But if you didn’t know much about the situation this video hopefully helps making researching easier."

On the whole, the video succeeds in the goal of giving viewers the information that they need to start sorting out a complicated situation. It gives them a picture of what they don't yet understand, so that they can then start looking for more information.