META-EVAL-2020: Evaluating Evaluation of AI Systems (AAAI 2020 Workshop)
Hilton New York Midtown, New York, NY, United States, February 7, 2020

Conference website: http://eval.how/aaai-2020
Submission link: https://easychair.org/conferences/?conf=reais19
Submission deadline: November 27, 2019
Notification date: September 22, 2019
This workshop is part of the HCOMP 2019 conference.
The last decade has seen massive progress in AI research, powered by crowdsourced datasets and benchmarks such as ImageNet, Freebase, and SQuAD, as well as widespread adoption and increasing use of AI in deployed systems. A crucial ingredient has been the role of crowdsourcing in operationalizing empirical ways of evaluating, comparing, and assessing this progress.
Crowdsourced datasets for evaluating AI systems’ success at tasks such as image labeling and question answering have proven to be powerful enablers for research. However, adoption of such datasets is typically driven by the mere existence and size of a dataset, without proper scrutiny of its scope, quality, and limitations. While crowdsourcing has enabled a burst of published work on specific problems, we cannot determine whether that work constitutes real progress without a deeper understanding of how a dataset supports the scientific or performance claims of the AI systems it is used to evaluate. This workshop will provide a forum for growing our collective understanding of what makes a dataset good, the value of improved datasets and collection methods, and how to decide when to invest in more robust data acquisition.
We invite scientific contributions and position papers on the following topics:
- META-EVALUATION: Quality of evaluation approaches, datasets, and benchmarks
  - Characteristics of a ‘good’ dataset or benchmark
  - Shortcomings of existing evaluation approaches, datasets, and benchmarks
  - Building new metrics and improving existing ones
  - Measuring trustworthiness, interpretability, and fairness of crowdsourced benchmark datasets
  - Measuring the added value of improvements over previous versions of benchmark datasets
  - Comparative evaluations of mainstream AI systems, e.g., recommenders, voice assistants, etc.
  - Measuring the quality of guidelines for content moderation, search evaluation, etc.
  - Comparison of results between offline (e.g., crowdsourced) and online (e.g., A/B testing) evaluations
  - Open questions and challenges in meta-evaluation
- TRANSPARENCY: Making the quality and characteristics of (crowdsourced) benchmark datasets transparent and explainable
  - Reproducibility of crowdsourced datasets
  - Replicability of crowdsourced evaluations of AI systems
  - Explainability of crowdsourced evaluations to different stakeholders, e.g., users, scientists, and developers
- RESOURCE BUILDING: Making existing evaluation methodologies, raw data, and outcomes discoverable, fully documented, and available for public scrutiny
  - How do we make evaluations and related datasets archival and discoverable?
  - What can we learn from other systematic evaluation efforts and communities, such as TREC, SIGIR, etc.?
Submission Guidelines
KEY DATES
- September 8, 2019: Full papers due
- September 22, 2019: Notification of acceptance
- October 1, 2019: Final camera-ready papers due
Authors are invited to submit papers of up to 6 pages, plus any number of additional pages containing references only.
All submitted papers must represent original work, not previously published or under simultaneous peer-review for any other peer-reviewed, archival conference or journal.
Papers must be formatted in AAAI two-column, camera-ready style; see the AAAI style guide (https://www.aaai.org/Publications/Templates/Original_AAAI_Style.zip) for details. Papers must be in trouble-free, high-resolution PDF format, formatted for US Letter (8.5″ x 11″) paper, using Type 1 or TrueType fonts. The AAAI copyright block is not required on submissions but must be included on final accepted versions.
Papers must be submitted electronically through the HCOMP-19 EasyChair submission site on or before the deadlines listed above; we cannot accept submissions by e-mail or fax. Authors will receive confirmation of receipt of their abstracts or papers, including an ID number, shortly after submission. HCOMP will contact authors again only if problems are encountered with their papers.
At least one author of each accepted paper must register for the conference to present the work, or acceptance will be withdrawn.
List of Workshop Activities
The focus of this workshop is not on evaluating AI systems, but on evaluating the quality of evaluations of AI systems. When these evaluations rely on crowdsourced datasets or methodologies, we are interested in the meta-questions around characterization of those methodologies. Some of the expected activities in the workshop include:
- Asking the question "what makes an evaluation good?"
- Defining "what good looks like" in evaluations of different types of AI systems (image recognition, recommender systems, search, voice assistants, etc.)
- Collecting, examining, and sharing current evaluation efforts, whether comprehensive evaluations of a single system or competitive evaluations of multiple systems, with the goal of critically evaluating the evaluations themselves
- Developing an open repository of existing evaluations with methodology fully documented and raw data and outcomes available for public scrutiny
Committees
Program Committee
- Matt Lease, UT Austin
- Paul Tepper, Nuance
- Sid Suri, Microsoft
- Danna Gurari, UT Austin
- Anbang Xu, IBM
- Chris Welty, Google
- Lora Aroyo, Google
- Omar Alonso, Microsoft
- Walter Lasecki, Michigan
- Sarah Luger, Orange
- Alex Quinn, Purdue
- Brad Klingenberg, StitchFix
- Ka Wong, Google
- Panos Ipeirotis, NYU
Organizing committee
- Praveen Paritosh
- Kurt Bollacker
Venue
The workshop will be held at the Seventh AAAI Conference on Human Computation and Crowdsourcing.
Contact
All questions about submissions should be emailed to rigorous-evaluation@googlegroups.com.