Academia.edu

Automatic Evaluation

1,245 papers
47 followers
About this topic
Automatic evaluation refers to the use of algorithms and computational methods to assess the quality or performance of systems, models, or outputs, particularly in fields such as natural language processing, machine learning, and educational assessment, enabling objective and efficient measurement without human intervention.

Key research themes

1. How can automated methods accurately and robustly evaluate subjective and open-ended written responses?

This theme investigates computational approaches to automatically assess the quality of subjective textual answers, such as essays or short answers, focusing on techniques that handle the complexity and variability of natural language in educational contexts. It addresses challenges in modeling semantic similarity, handling rater bias, multilingual support, and feedback provision to enhance evaluation accuracy and instructional utility.

Key finding: This work established that early automated essay scoring systems, such as Project Essay Grader (PEG), successfully correlated surface textual features (e.g., average word length, essay length) with human scores (up to R=.78),... Read more
Key finding: This study revealed that human rater bias, manifested in subjective comments, systematically affects automated essay scoring (AES) models trained to mimic such scores. Using lexicon-based analyses and subjectivity measures on... Read more
Key finding: GradeAid integrates lexical and semantic features analyzed via state-of-the-art regression models to automatically score short student answers across multiple languages and heterogeneous datasets. Its robust validation,... Read more
Key finding: This paper proposed an unsupervised two-stage system combining text summarization to extract key information and advanced neural language models (BERT, XLNET) fine-tuned on challenging datasets to evaluate subjective answers.... Read more
Key finding: Pioneering a bilingual AES system for Spanish and Basque, this work developed NLP pipelines integrating spell checking, lexical variability, and discourse features leveraging language-specific resources. A client-server... Read more

2. How can implicit user interaction data and dialogue act modeling enable automatic evaluation of intelligent assistants across diverse tasks?

This research area focuses on developing scalable, consistent automatic evaluation frameworks for voice-activated intelligent assistants that perform multiple, heterogeneous tasks (e.g., voice commands, web search, chat). It leverages implicit user feedback derived from user-system interaction logging, and models dialog actions in a task-independent manner to predict user satisfaction and key system components' performance, enabling cost-effective and continuous quality assessment without human annotations.

Key finding: This paper introduced a novel evaluation model that predicts user satisfaction with intelligent assistants by classifying user-system interactions into task-independent dialog actions using a Markov model over action... Read more

3. What are the critical considerations for fairness, transparency, and interpretability in automatic evaluation metrics across AI systems?

This theme explores challenges in the representativeness, bias, and interpretability of automatic evaluation metrics in AI, including fairness concerns in scoring and evaluation transparency. Research addresses how aggregate metrics may mask critical performance disparities, the impact of biased training data on evaluation fairness, and proposes methodological innovations for transparent, interpretable reporting and fair scoring frameworks that consider social and ethical dimensions.

Key finding: This tutorial synthesized challenges and frameworks for responsible scoring of data items, highlighting that fairness is multi-faceted and context-dependent with competing definitions. It emphasized that biased social data... Read more
Key finding: The authors argued that prevailing AI evaluation practices relying on aggregate metrics impede nuanced understanding of system capabilities and failure modes. They demonstrated that lack of availability of instance-level... Read more
Key finding: This paper designed a rubric-based human evaluation protocol for image captioning that separately quantifies precision, recall, fluency, conciseness, and inclusiveness, revealing critical gaps in standard automatic metrics... Read more
Key finding: While focused on pedagogy, this study underscored the importance of timely, automated feedback mechanisms to motivate learners and improve outcomes in programming education. It demonstrated that computer-supported tools... Read more

All papers in Automatic Evaluation

This paper describes an algorithm for automatically extracting multiword expressions (MWEs) from a corpus. The algorithm is node-based, i.e., it extracts MWEs that contain the item specified by the user, using a fixed window-size around the... more
One of the main reasons for student failure in (introductory) programming courses is a lack of motivation, which impairs the knowledge acquisition process and affects learning results. As soon as students face the... more
) and 12 texts calibrated for positive emotional valence (joy and pleasant surprise) and negative emotional valence (fear, anger, disgust, sadness, and unpleasant surprise). Both types of tests carried out confirm the psychological relevance of EMOVAL.... more
This paper studies how granularity of machine translation evaluation can be extended from sentence to document level. While most state-of-the-art evaluation metrics focus on the sentence level, we emphasize the importance of document... more
Having students express their understanding of difficult, new material in their own words is an effective method to deepen their comprehension and learning. Summary Street® is a computer tutor that offers a supportive context for students... more
Question Answering (QA) is a specialized area in the field of Information Retrieval (IR). QA systems are concerned with providing relevant answers in response to questions posed in natural language. QA is therefore composed of... more
Building Accessible Custom UI Controls: A Comprehensive Guide for 508 Compliance addresses the critical need for developing inclusive web applications through customized user interface components. This technical article explores the... more
In this paper, we describe a News Story Gisting system that generates a 10-word short summary of a news story. This system uses a machine learning technique to combine linguistic, statistical and positional information in order to... more
In this paper, we present the HybridTrim system which uses a machine learning technique to combine linguistic, statistical and positional information to identify topic labels for headlines in a text. We compare our system with the Topiary... more
The relationship between QT interval dispersion and dipyridamole-induced, transient myocardial ischemia was assessed in 32 male patients with ischemic heart disease. A standardized, high dose dipyridamole-ECG stress test was used as... more
Human assessment is often considered the gold standard in evaluation of translation systems. But in order for the evaluation to be meaningful, the rankings obtained from human assessment must be consistent and repeatable. Recent analysis... more
Graph and tree transducers have been applied in many NLP areas-among them, machine translation, summarization, parsing, and text generation. In particular, the successful use of tree rewriting transducers for the introduction of syntactic... more
Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU, METEOR and the related NIST metric, are becoming increasingly important in MT research and development. This paper presents a significance test-driven... more
Shabar mantras hold a distinctive place in Indian religious and spiritual traditions and are deeply rooted in folk practices and tantric beliefs. Unlike the form of classical Sanskrit mantras, these mantras are composed in regional languages... more
Abstract: This work focuses on how we can improve automatic evaluation based on guidelines inspection throughout the life cycle of Web applications by mapping guideline concepts to different artifacts produced during the development... more
We present the application of Principal Component Analysis for data acquired during the design of a natural gesture interface. We investigate the concept of an eigengesture for motion capture hand gesture data and present the... more
World Wide Web content continuously grows in size and importance. Furthermore, users ask Web search engines to satisfy increasingly disparate information needs. New techniques and tools are constantly developed aimed at assisting users in... more
Nowadays, the evaluation of the accessibility of Enterprise Web Information Systems is based on the Web Content Accessibility Guidelines (WCAG 2.0), created by the World Wide Web Consortium (W3C) in 2008 and adopted in 2012 by the... more
Life cycle monitoring of civil infrastructure such as bridges and buildings is critical to the long-term operational cost and safety of aging structures. The widespread use of Structural Health Monitoring (SHM) systems is limited due to... more
Abstract: This paper uses a neural theory of emotional consciousness to develop a novel account of conscience and moral intuition. Emotions are both cognitive appraisals and somatic perceptions, performed simultaneously by interacting... more
Considerable attention has been given to the problem of Multiword Expression (MWE) identification and treatment, for NLP tasks like parsing and generation, to improve the quality of results. Statistical methods have been often employed... more
This study delves into the relatively unexplored domain of natural language processing for the Kazakh language, a language with limited computational resources. The paper dissects the effectiveness of diffusion models and transformers in... more
Paraphrasing is an important aspect of language competence; however, EFL learners have long had difficulty paraphrasing in their writing owing to their limited language proficiency. Therefore, automatic paraphrase suggestion systems can... more
For a given piece of music, there often exist multiple versions belonging to the symbolic (e.g., MIDI representations), acoustic (audio recordings), or visual (sheet music) domain. Each type of information allows for applying specialized,... more
Over the past few years, tissue microarray (TMA) technology has been established as a standard method for assessing the expression of proteins or genes across large sets of tissue specimens. It is being adopted increasingly among leading... more
This work presents a system for automatically evaluating the interaction that exists between the atmosphere and the ocean's surface. Monitoring and evaluating the ocean's carbon exchange process is a function that requires working with a... more
Web search evaluation is the process of measuring the effectiveness of a Web search system. Such an evaluation helps in identifying the most effective one and helps the users to find the required information with less effort. Web search... more
This paper reports an overview of the evaluation campaign results of the IWSLT 2005 workshop. The BTEC corpus, which consists of typical travel domain phrases, was used. Data for the five language pairs Arabic/Chinese/Japanese/Korean to... more
Automatic lexical alignment is a vital step for empirical machine translation, and although good results can be obtained with existing models (e.g. Giza++), more precise alignment is still needed for successfully handling complex... more
The Web Content Accessibility Guidelines (WCAG) from the W3C consist of a set of 65 checkpoints or specifications that Web pages should satisfy in order to be accessible to people with disabilities or using alternative browsers. Many of these... more
The principles of the Kohonen and counterpropagation artificial neural network (K-ANN and CP-ANN) learning strategy are described. The use of both methods (with the emphasis on CP-ANNs) is explained through several examples from analytical... more
Web accessibility is the characteristic that allows anyone, regardless of their circumstances, to access the content of websites. The use of automatic validators makes it possible to carry out a first analysis of... more
Some previous studies (e.g. that carried out by Van Bruggen et al. in 2004) have pointed to a need for additional research in order to firmly establish the usefulness of LSA (latent semantic analysis) parameters for automatic evaluation... more
We extend the original entity-based coherence model (Barzilay and Lapata, 2008) by learning from more fine-grained coherence preferences in training data. We associate multiple ranks with the set of permutations originating from the same... more
In this research the outcome of an affective priming experiment is shown to critically depend on the frequency of occurrence of the target words used. Low frequency target words (5.7 occurrences per million words) resulted in an affective... more
Statistical N-gram language modeling is used in many domains, such as spelling and syntactic verification, speech recognition, machine translation, character recognition, and others. This paper describes a system for sentence structure... more
Many students have difficulty understanding and developing algorithms. To try to help these students, the SICAS environment was created, an individual work environment based on the animation and simulation of algorithms. In this... more
Ergonomics plays an important role in the design of school chairs. Awkward sitting posture resulting from unsuitable chair design can contribute to negative effects on children's health. This issue... more
From classic theory and research in psychology, we distill a broad theoretical statement that evaluative responding can be immediate, unintentional, implicit, stimulus based, and linked directly to approach and avoidance motives. This... more
We investigate the use of web search queries for detecting errors in non-native writing. Distinguishing a correct sequence of words from a sequence with a learner error is a baseline task that any error detection and correction system... more
In this paper we present a system for automatic correction of errors made by learners of English. The system has two novel aspects. First, machine-learned classifiers trained on large amounts of native data and a very large language model... more
In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is... more
In this paper we discuss our participation in the 2013 Semeval Semantic Textual Similarity task. Our core features include (i) a set of metrics borrowed from automatic machine translation, originally intended to evaluate automatic against... more
Breasts are composed of a mixture of fibrous and glandular tissue as well as adipose tissue, and breast density describes the prevalence of fibroglandular tissue as it appears on a mammogram. Over the past few years, evaluation and... more