REAIS2020: HCOMP 2020 Workshop on Rigorous Evaluation for AI Systems
Human Computation 2020 (virtual), October 25, 2020
Conference website: http://eval.how/reais-2020
Submission link: https://easychair.org/conferences/?conf=reais2020
Abstract registration deadline: October 14, 2020
Submission deadline: October 14, 2020
NOTE: The date for the workshop has changed to Sunday, October 25, 2020 to allow joint participation in the Data Excellence Workshop (DEW), held the following day, which has a separate call for papers.
Call For Papers
This workshop is part of the HCOMP 2020 conference. We intend to build a community interested in using rigorous evaluation to tackle real-world problems that may be otherwise inaccessible.
The last decade has seen massive progress in AI research powered by crowdsourced datasets and benchmarks such as ImageNet, Freebase, and SQuAD, as well as widespread adoption and increasing use of AI in deployed systems. A crucial ingredient has been the role of crowdsourcing in operationalizing empirical ways of evaluating, comparing, and assessing that progress.
Crowdsourced techniques for evaluating AI systems’ success at tasks such as image labeling and question answering have proven to be powerful enablers for research. However, adoption of such approaches is typically driven by the mere existence and size of crowdsourced contributions, without proper scrutiny of their scope, quality, and limitations. While crowdsourcing has enabled a burst of published work on specific problems, determining whether that work has resulted in real progress requires a deeper understanding of how datasets and benchmarks support the scientific or performance claims of the AI systems they are used to evaluate. This workshop will provide a forum for growing our collective understanding of what makes an evaluation good, the value of improved datasets and collection methods, and how to inform decisions about when to invest in more robust data acquisition.
This Year’s Focus: Third-Party and Independent Meta-Evaluation
Often AI systems and datasets are evaluated by measures that have only mathematical and theoretical components, while the real, physical world is messy, irregular, and subject to constant change. Some systems and approaches that perform well under the scrutiny of clean, elegant theory may still fail quite spectacularly in real-world applications simply because the theory did not match reality. Traditionally, approaches such as crowdsourcing, "human in the loop" decision systems, HCI mechanisms, and other human-centered sources of ground truth are powerful complements used to fill in that "real-world" gap. However, because they are often part of the same construction by the same creators, this human component of measurement and evaluation carries the same correlated biases and omissions as the rest of the system. To overcome these biases and omissions, an independent layer of third-party human-centered evaluation, or "meta-evaluation", may be needed. If this separate, external scrutiny is crafted from the perspective of actual users and consumers of such datasets and systems, rather than from the expectations of the system/dataset creators, it might be used to more accurately measure real-world performance and value.
This year, we will focus on how we can use human computation to craft external, independent evaluations of AI datasets and systems, especially as applied to:
- Building the appropriate level of trust and confidence, especially for systems that have direct safety or economic impact on people
- Providing guidance for choosing between, or making resource commitments to, real-world instances of systems
- Detecting fraudulent claims about systems and manipulation of data
- Compensating for gaming, deception, and other malicious use of AI systems
- Constructing adversarial scrutiny of datasets and systems
In the context of these applications, we invite scientific contributions and position papers on the following topics:
- META-EVALUATION: Quality of evaluation approaches, datasets, and benchmarks
  - What are the characteristics of a 'good' dataset or benchmark?
  - What are the shortcomings of existing evaluation approaches, datasets, and benchmarks?
  - Building new metrics and improving existing ones
  - Measuring trustworthiness, interpretability, and fairness of crowdsourced benchmarks
  - Measuring the added value of improvements over previous versions of benchmark datasets
  - Comparing results between offline (e.g., crowdsourced) and online (e.g., A/B testing) evaluations
- AVAILABILITY: Making the quality and characteristics of (crowdsourced) benchmarks, datasets, and systems explainable, discoverable, fully documented, and available for public scrutiny
  - Replicability of crowdsourced evaluations of AI systems
  - Explainability of crowdsourced evaluations to different stakeholders, e.g., users, scientists, developers
  - How do we make evaluations and related datasets archival and discoverable?
Submission Guidelines
Key Dates
October 14, 2020: Full papers due
October 18, 2020: Notification of acceptance
October 20, 2020: Final camera-ready papers due
Authors are invited to submit extended abstracts (2-4 pages) or short papers (4-6 pages), plus any number of additional pages containing references only.
All submitted papers must represent original work that has not been previously published and is not under simultaneous peer review for any other archival conference or journal.
Papers must be formatted in AAAI two-column, camera-ready style; please refer to the AAAI style guide (https://www.aaai.org/Publications/Templates/Original_AAAI_Style.zip) for details. Papers must be in trouble-free, high-resolution PDF format, formatted for US Letter (8.5″ x 11″) paper, using Type 1 or TrueType fonts. The AAAI copyright block is not required on submissions, but must be included on final accepted versions.
Electronic paper submission through the HCOMP-20 EasyChair paper submission site is required on or before the deadlines listed above. We cannot accept submissions by e-mail or fax. Authors will receive confirmation of receipt of their abstracts or papers, including an ID number, shortly after submission. HCOMP will contact authors again only if problems are encountered with their papers.
At least one author of each accepted paper must register for the conference to present the work or acceptance will be withdrawn.
List of Workshop Activities
The focus of this workshop is not on evaluating AI systems, but on evaluating the quality of evaluations of AI systems. When these evaluations rely on crowdsourced datasets or methodologies, we are interested in the meta-questions around characterization of those methodologies. Some of the expected activities in the workshop include:
- Asking the question "what makes an evaluation good?"
- Defining "what good looks like" in evaluations of different types of AI systems (image recognition, recommender systems, search, voice assistants, etc.)
- Collecting, examining, and sharing current evaluation efforts, whether comprehensive (of one system) or competitive (of multiple systems), with the goal of critically evaluating the evaluations themselves
- Developing an open repository of existing evaluations, with methodology fully documented and raw data and outcomes available for public scrutiny
Organizing committee
- Bernease Herman
- Sarah Luger
- Kurt Bollacker
- Maria Stone
Venue
The workshop will be held at the Eighth AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2020) on Sunday, October 25, 2020, from 2:00-6:30 PM CET (UTC+1) / 9:00 AM-1:30 PM EDT (UTC-4).
Contact
All questions about submissions should be emailed to rigorous-evaluation@googlegroups.com.