Sunday, February 26, 2017

Shared-tasks for multimedia research: Bans, benchmarks, and being effective in 2017

Last week, I officially resigned from contributing as an organizer to the TRECVid Video Retrieval Evaluation, which is sponsored by NIST, a US government agency in Gaithersburg, Maryland. In 2016, I was part of the Video Hyperlinking task, and contributed by defining the year's relevance criteria, creating queries, and helping to design the crowdsourcing-based assessment. It has been a very difficult decision, so I would like to record here in this blog post why I have made it.

Ultimately, we make such decisions ourselves, and everyone navigates these difficult processes alone. However, it takes a lot of time and energy to search for the relevant information, and to weigh the considerations. For this reason, I think that for some it may be helpful to know more details about my own process.

Benchmarking Multimedia Technologies
Since 2008, I have been involved in benchmarking new multimedia technology. Benchmarking is the process of systematically comparing technologies by assessing their performance on standardized tasks. The process makes it possible to quantify the degree to which one algorithm outperforms another. Quantification is necessary in order to understand if a new algorithm has succeeded in improving over the state of the art, defined by the performance of existing algorithms.
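To make the idea of quantification concrete, here is a minimal sketch (using made-up run data, not results from any actual benchmark) of how two hypothetical systems might be compared on the same standardized task using mean average precision, a metric commonly used in retrieval benchmarks.

```python
# A minimal sketch (illustration only): quantifying how much one hypothetical
# retrieval system improves over another on the same standardized task,
# using (mean) average precision as the evaluation metric.

def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list against relevance judgments."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs, qrels):
    """Mean of per-query average precision over all queries in the benchmark."""
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)

# Toy relevance judgments and two hypothetical systems' rankings (made-up data).
qrels = {"q1": {"d2", "d5"}, "q2": {"d1"}}
baseline = {"q1": ["d1", "d2", "d3", "d5"], "q2": ["d4", "d1"]}
new_system = {"q1": ["d2", "d5", "d1", "d3"], "q2": ["d1", "d4"]}

print("baseline MAP:  ", mean_average_precision(baseline, qrels))
print("new system MAP:", mean_average_precision(new_system, qrels))
```

Because both systems are scored on the same queries, judgments, and metric, the difference between the two MAP values can be attributed to the algorithms rather than to the evaluation setup.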

The strength of benchmarking lies in the degree to which a benchmark succeeds in achieving open participation. If a new algorithm is compared to some, but not all, existing algorithms, the results of the benchmark less clearly reflect a true improvement over the state of the art.

My emphasis in benchmarking is on tasks that focus on the human and social aspects of multimedia access and retrieval. In other words, I am interested in people producing and consuming video, images, and audio content in their daily lives, and in how we can create algorithms that give them back usefulness and value from these activities. It is difficult to pack these aspects into quantitative metrics, so I am also committed to research that develops new evaluation methodologies and new metrics.

Due to this emphasis, it is not surprising that most of my contribution has been channeled through the MediaEval Benchmark for Multimedia Evaluation. (I coordinate the MediaEval "umbrella", which synchronizes the otherwise autonomous tasks.) However, the strength of the benchmarking paradigm is weakened if a single benchmark, with a limited spectrum of topics, becomes all-dominant. Instead, we need to act to prevent a single effort from "taking over the market". We need to work towards ensuring that a broad range of different types of problems are investigated by the research community. Fostering breadth means offering not only multiple tasks, but multiple benchmarks. This year, I am again involved in MediaEval, but also, as last year, in contributing to the organization of the NewsREEL task at CLEF (where my role is to contribute to design, documentation, and reporting).

Open Participation in Benchmarks
Both MediaEval and CLEF are open participation benchmarks in three aspects:
  • First, anyone can propose a task (there is an open call for tasks). CLEF chooses its tasks by multi-institutional committee, cf. 2017 CLEF Call for Task Proposals. MediaEval also chooses its tasks by multi-institutional committee. However, the committee checks only for viability. The ultimate choice lies in the hands of all community members, including organizers and participants, cf. MediaEval 2017 Call for Task Proposals. The goal of an open call for tasks is to promote innovation---constantly evolving tasks prevent the community from "locking in" on certain topics and becoming satisfied with incremental progress.
  • Second, anyone can sign up to participate. Participants submit working notes papers, which go through a review process (emphasizing completeness, clarity, and technical soundness). MediaEval and CLEF both publish open access working notes proceedings.
  • Third, for both MediaEval and CLEF, workshop registration is open to anyone, and requires only the payment of a fee to cover costs. For MediaEval, the fee covers the costs of the workshop, and also of hosting the website and organizer teleconferences. People/organizations contribute time to cover the rest of workshop organization.
Like MediaEval and CLEF, TRECVid also pursues the mission of offering an open research venue. Historically, both TRECVid and CLEF grew from TREC (also, of course, organized by NIST), so the commitment to the common cause is unsurprising in this sense. However, TRECVid does not offer open participation in all three of the above aspects. Specifically, there is no publicly circulated call for task proposals, and the workshop is closed. (The stated policy is that the workshop is only open to task participants, and "to selected government personnel from sponsoring agencies and data donors", cf. TRECVid 2017 Call for Participation.) Technically, TRECVid is not able to welcome all participants. The US does not maintain diplomatic relations with Iran, and US Government employees cannot answer email from Iran. It is important to understand that this is a historical challenge, and is not new with the current US Republican Administration.

Defining Priorities and Making Decisions
Considerations related to open participation made me hesitant to get deeply involved in TRECVid. However, over the years, I have been very open to exchange. TRECVid originally reached out to me to give an invited talk back in 2009, when MediaEval was still VideoCLEF. (There are some musings on my blog from that trip.) The idea was to learn from each other. We hope this year to reciprocate with a TRECVid speaker at CLEF/MediaEval.

In 2016, I contributed to the Video Hyperlinking organization, since the move of Video Hyperlinking from MediaEval to TRECVid represented a spread of the emphasis on the human aspects of multimedia retrieval, and it was important to me to support that explicitly.

All in all, it has taken a lot of time to decide where to invest my resources in 2017 in order to most effectively support multimedia benchmarking efforts that provide venues that are open, and therefore effective as benchmarks.

With the new Republican Administration in the US, two considerations grew to dominate my decision-making process. The first is how to contribute to the movement whose goal is to demonstrate the relevance and importance of science to the public and to policy makers (https://www.forceforscience.org). TRECVid, by virtue of being a benchmark, is certainly at the forefront of this movement (just by doing the same thing it has done for years). We need to support our US-based colleagues in their efforts to be a force for science, and hope that they would support us as well, should we land in a similar situation.

The second is how to react to the travel ban, which would prevent scientists of certain countries from entering the US. The first-order effects of the travel ban have been constrained by court rulings. However, the future plans of the administration are uncertain, and there is a range of second-order effects that a court cannot undo, e.g., people self-selecting out of participation because they are worried about their visas being held up by additional processing steps (and granted, for example, only after the workshop has occurred). These secondary effects effectively prevent people from attending a US-based event even though technically they may be able to get a visa.

We are not alone in our thinking, but we are guided by the large number of organizations that have issued public statements on the importance of openness for science (Statement of the International Council for Science, Statement of the American Association for the Advancement of Science), including professional organizations that we belong to (Statement of the ACM, Statement of ACM SIGMM, Statement of IEEE) and European universities (Statement of the European University Association, Combined statement of all the universities in the Netherlands, Statement of Radboud University).

There is much power in making an open statement of values---more than one might think. However, we should avoid assuming that statements are enough and that the situation will go back to where it was before the current Republican Administration. In other words, the days in which we could dedicate relatively little time to protecting and upholding the values of openness in science are gone. Instead, we need to think explicitly about where our effort can be best dedicated in 2017.

TREC/TRECVid celebrated their 25th anniversary in 2016. The event has been a constant through many changes of US administration, and it is heartening that the 2017 event will, in all probability, look (from the inside at least) pretty much like all the other events over the past 25 years.

However, 2017 is the first year in which people will be in the streets, in the US and around the world, marching for science: https://www.marchforscience.com. The large-scale sense of urgency tells us that 2017 is not just business as usual. For this reason, it is important in 2017 to reexamine the idea that the US should be such a strong attractor within the map of scientific research in the world.

On top of the merit and can-do attitude that attract people from around the world to US institutions, we as scientists (because we study systems and networks) know that another force is at play. Specifically, we know that US institutions enjoy preferential attachment, meaning that past success is a determiner of future success. This effect translates into the reality that new or small events (e.g., research topics or benchmarking workshops) need a lot of extra time and attention to establish or maintain themselves in the field. 2017 is the year in which we need to think carefully about to what extent we want to contribute to this non-linear feedback loop that strengthens the pull towards US-based events, and to what extent we want to build counterweights.
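To illustrate what that feedback loop means, here is a minimal sketch of a preferential attachment process; the venue names and starting sizes are hypothetical and purely for illustration, not a model of actual participation data. Each new participant joins a venue with probability proportional to its current size, so an initial head start compounds over time.

```python
# A minimal sketch (illustration only) of preferential attachment:
# a "rich get richer" process in which each new participant chooses a venue
# with probability proportional to that venue's current size.
import random

random.seed(42)

# Hypothetical starting sizes: one established venue, one newcomer.
venues = {"established_venue": 10, "new_venue": 1}

for _ in range(500):
    total = sum(venues.values())
    r = random.uniform(0, total)
    cumulative = 0
    for name, size in venues.items():
        cumulative += size
        if r <= cumulative:
            venues[name] += 1  # the chosen venue grows, increasing its future pull
            break

print(venues)  # the initial head start compounds into a large final gap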

I consciously use the word "counterweights" since I am referring to a balancing act. We stand in complete solidarity with our US-based colleagues; providing counterweights in no way detracts from that fact. For multimedia research, counterweights include region-based initiatives, and benchmarks that allow anyone to propose a task. A network of diverse benchmarks makes benchmarking as a whole stronger, and makes us internationally more robust.

My personal decision is that time spent promoting and preserving diversity is, in 2017, a more effective way to achieve the larger goals of benchmarking, than time spent reinforcing the connection between benchmarking and Gaithersburg, Maryland. I was born in Maryland, outside of DC, but Maryland is not where I am needed now. TRECVid will be fine without extra help from Europe, but what can (and does) suffer is the availability to the research community of non-US-based benchmarks.

Recommendations to TRECVid
The intention is for my resignation to be a positive decision for something, not a negative decision against something. Reasoning that my reflections on these topics are probably helpful to NIST, I distilled my thinking into a set of three recommendations. Interestingly, these recommendations are relatively independent of the situation in the US caused by the current Republican Administration:
  • First, TRECVid is an open research venue. I recommend stating this explicitly on the website. An example is the ACM Open Participation statement. 
  • Second, TRECVid is supported by NIST. I recommend a clearer statement on the website of the source and the distribution of the funding. People familiar with the benchmark know that NIST is the powerhouse behind its success, but this is not clear to newcomers. Critically, it is currently not clear in which cases defense funding supports TRECVid. This is important to people who personally, or whose institutions, have a commitment to pursue research for civil purposes only. For example, many German institutions have a Zivilklausel by which they commit themselves to pursuing exclusively research for civilian purposes. Even if participation is nominally open, unclarity about defense funding can scare people away, and the benchmark is effectively not as open as it would otherwise aspire to be. (For completeness: at least one colleague assumed I received NIST funding for my work on Video Hyperlinking. I did not. The unclarity in the funding causes confusion.)
  • Third, attention should be devoted to the archival status of the proceedings. As a good next step, they should be indexed by mainstream search engines. Moving forward, attention should be paid to maintaining a historical record of TRECVid in case, at some point in the future, NIST is not able to continue to support open participation and open access in the way it does now.
If you have read all the way to the end of this blog post, let me finish by thanking you: both for your dedication to open participation in scientific research, which is so essential to benchmarking, and for taking the time to read about my personal struggle. It has been a long path.

Don't miss the March for Science on 22 April. Inspire and be inspired.

Or find another march around the world here: https://www.marchforscience.com/satellite-marches

Wednesday, February 8, 2017

Bytes and pixels meet the challenges of human media interpretation

Back in June, I gave a talk at the Communication Science Department here at Radboud University Nijmegen. Today, I presented a version of that talk to my colleagues in the Language and Speech Technology Research Meeting. The abstract is below, together with the slides, which are on SlideShare. During the discussion it became clear that many problems in natural language processing and information retrieval face the issue that human interpretations vary. It is important to find ways to move forward, although it may not be possible to pack our challenges into neat classification or ranking problems with a single set of consensus ground-truth labels. One way forward is to look to other disciplines for theories of how people understand and use media, and let these theories inform what we design our systems to do and the ways that we measure success. (A small illustrative sketch of this point follows the abstract below.)

Within computer science, "Multimedia" is a field of research that investigates how computers can support people in communication, information finding, and knowledge/opinion building. Multimedia content is defined broadly. It includes not only video, but also images accompanied by text and other information (for example, a geo-location). It can be professionally produced, or generated by users for online sharing. Computer scientists historically have a “love-hate” relationship with multimedia. They “love” it because of the richness of the data sources and the wealth of available data, which leads to interesting problems to tackle with machine learning. They “hate” it because multimedia is a diffuse and moving target: the interpretation of multimedia differs from person to person, and changes over time in the course of its use as a communication medium. This talk gives a view onto ongoing research in the area of multimedia information retrieval algorithms, which help people find multimedia. We look at a series of topics that reveal how pattern recognition, text processing, and crowdsourcing tools are used in multimedia research, and discuss both their limitations and their potential.
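As a small illustration of the point about consensus ground truth, the sketch below (with hypothetical annotations, not real data) contrasts a forced majority label with the full label distribution per item; when interpretations genuinely differ, the distribution carries information that a single consensus label discards.

```python
# A minimal sketch (illustration only, made-up labels): when human interpretations
# differ, a single "consensus" ground-truth label can hide real disagreement.
# Here we report the per-item label distribution alongside the majority label.
from collections import Counter

# Hypothetical interpretations of three multimedia items by four annotators.
annotations = {
    "video_1": ["relevant", "relevant", "relevant", "relevant"],
    "video_2": ["relevant", "not_relevant", "relevant", "not_relevant"],
    "video_3": ["not_relevant", "not_relevant", "relevant", "not_relevant"],
}

for item, labels in annotations.items():
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    agreement = majority_count / len(labels)
    print(f"{item}: majority={majority_label}, "
          f"agreement={agreement:.2f}, counts={dict(counts)}")
```

For video_2 the "consensus" is a coin flip: an evaluation that keeps only the majority label would treat it exactly like video_1, even though the annotators clearly did not interpret it the same way.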