META-EVAL-2020: Evaluating Evaluation of AI Systems (AAAI 2020 Workshop)
Hilton New York Midtown, New York, NY, United States, February 7, 2020

Conference website: http://eval.how/aaai-2020
Submission link: https://easychair.org/conferences/?conf=reais19
Submission deadline: November 27, 2019
Notification date: September 22, 2019
This workshop is part of the HCOMP 2019 conference.
The last decade has seen massive progress in AI research, powered by crowdsourced datasets and benchmarks such as ImageNet, Freebase, and SQuAD, as well as widespread adoption and increasing use of AI in deployed systems. A crucial ingredient has been the role of crowdsourcing in operationalizing empirical ways of evaluating, comparing, and assessing this progress.
Crowdsourced datasets for evaluating AI systems’ success at tasks such as image labeling and question answering have proven to be powerful enablers for research. However, adoption of such datasets is typically driven by the mere existence and size of a dataset, without proper scrutiny of its scope, quality, and limitations. While crowdsourcing has enabled a burst of published work on specific problems, we cannot determine whether that work constitutes real progress without a deeper understanding of how a dataset supports the scientific or performance claims of the AI systems it is used to evaluate. This workshop will provide a forum for growing our collective understanding of what makes a dataset good, the value of improved datasets and collection methods, and how to decide when to invest in more robust data acquisition.
We invite scientific contributions and position papers on the following topics:
- META-EVALUATION: Quality of evaluation approaches, datasets, and benchmarks
  - Characteristics of a ‘good’ dataset or benchmark
  - Shortcomings of existing evaluation approaches, datasets, and benchmarks
  - Building new metrics and improving existing ones
  - Measuring trustworthiness, interpretability, and fairness of crowdsourced benchmark datasets
  - Measuring the added value of improvements over previous versions of benchmark datasets
  - Comparative evaluations of mainstream AI systems, e.g., recommenders, voice assistants, etc.
  - Measuring the quality of guidelines for content moderation, search evaluation, etc.
  - Comparison of results between offline (e.g., crowdsourced) and online (e.g., A/B testing) evaluations
  - Open questions and challenges in meta-evaluation
- TRANSPARENCY: Making the quality and characteristics of (crowdsourced) benchmark datasets transparent and explainable
  - Reproducibility of crowdsourced datasets
  - Replicability of crowdsourced evaluations of AI systems
  - Explainability of crowdsourced evaluations to different stakeholders, e.g., users, scientists, and developers
- RESOURCE BUILDING: Making existing evaluation methodologies, raw data, and outcomes discoverable, fully documented, and available for public scrutiny
  - How do we make evaluations and related datasets archival and discoverable?
  - What can we learn from other systematic evaluation efforts and communities, such as TREC, SIGIR, etc.?
Submission Guidelines
KEY DATES
- September 8, 2019: Full papers due
- September 22, 2019: Notification of acceptance
- October 1, 2019: Final camera-ready papers due
Authors are invited to submit papers of up to 6 pages, plus any number of additional pages containing references only.
All submitted papers must represent original work, not previously published or under simultaneous peer-review for any other peer-reviewed, archival conference or journal.
Papers must be formatted in AAAI two-column, camera-ready style; see the AAAI style guide (https://www.aaai.org/Publications/Templates/Original_AAAI_Style.zip) for details. Papers must be in trouble-free, high-resolution PDF format, formatted for US Letter (8.5″ x 11″) paper, using Type 1 or TrueType fonts. The AAAI copyright block is not required on submissions but must be included on final accepted versions.
Papers must be submitted electronically through the HCOMP-19 EasyChair submission site on or before the deadlines listed above; we cannot accept submissions by e-mail or fax. Authors will receive confirmation of receipt of their abstracts or papers, including an ID number, shortly after submission. HCOMP will contact authors again only if problems are encountered with their papers.
At least one author of each accepted paper must register for the conference to present the work, or acceptance will be withdrawn.
List of Workshop Activities
The focus of this workshop is not on evaluating AI systems, but on evaluating the quality of evaluations of AI systems. When these evaluations rely on crowdsourced datasets or methodologies, we are interested in the meta-questions around characterization of those methodologies. Some of the expected activities in the workshop include:
- Asking the question "what makes an evaluation good?"
- Defining "what good looks like" in evaluations of different types of AI systems (image recognition, recommender systems, search, voice assistants, etc.)
- Collecting, examining, and sharing current evaluation efforts, whether comprehensive evaluations of a single system or competitive evaluations of multiple systems, with the goal of critically evaluating the evaluations themselves
- Developing an open repository of existing evaluations with methodology fully documented and raw data and outcomes available for public scrutiny
Committees
Program Committee
- Matt Lease, UT Austin
- Paul Tepper, Nuance
- Sid Suri, Microsoft
- Danna Gurari, UT Austin
- Anbang Xu, IBM
- Chris Welty, Google
- Lora Aroyo, Google
- Omar Alonso, Microsoft
- Walter Lasecki, Michigan
- Sarah Luger, Orange
- Alex Quinn, Purdue
- Brad Klingenberg, StitchFix
- Ka Wong, Google
- Panos Ipeirotis, NYU
Organizing committee
- Praveen Paritosh
- Kurt Bollacker
Venue
The workshop will be held at the Seventh AAAI Conference on Human Computation and Crowdsourcing.
Contact
All questions about submissions should be emailed to rigorous-evaluation@googlegroups.com.