Commonly referred to as predictive modeling, the use of machine learning and statistical methods to improve healthcare outcomes has recently gained traction in biomedical informatics research. Given the vast opportunities enabled by large Electronic Health Records (EHR) data and powerful resources for conducting predictive modeling, we argue that it is nevertheless crucial to first carefully examine the prediction task and then choose predictive methods accordingly. Specifically, we argue that there are at least three distinct prediction tasks that are often conflated in biomedical research: 1) data imputation, where a model fills in the missing values in a dataset, 2) future forecasting, where a model projects the development of a medical condition for a known patient based on existing observations, and 3) new-patient generalization, where a model transfers the knowledge learned from previously observed patients to newly encountered ones. Importantly, the latter two tasks (future forecasting and new-patient generalization) tend to be more difficult than data imputation, as they require predictions to be made on potentially out-of-sample data, i.e., data following a different predictable pattern from what has been learned by the model. Using hearing loss progression as an example, we investigate three regression models and show that the modeling of latent clusters is a robust method for addressing the more challenging prediction scenarios. Overall, our findings suggest that there exist significant differences between various kinds of prediction tasks and that it is important to evaluate the merits of a predictive model relative to the specific purpose of a prediction task.
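The abstract leaves the latent-cluster model unspecified; the following is only a minimal sketch of the general idea, assuming synthetic longitudinal data, two latent clusters, and scikit-learn's KMeans. All names, parameters, and the two-cluster setup are illustrative assumptions, not the paper's actual model.

```python
# A minimal sketch of latent-cluster regression for progression forecasting.
# Assumptions (not from the paper): synthetic data, two latent clusters,
# per-patient linear trajectories, scikit-learn's KMeans.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic longitudinal data: each patient has a baseline and a slope
# drawn from one of two latent clusters (slow vs. fast progression).
n_patients, n_visits = 100, 5
times = np.arange(n_visits)
cluster = rng.integers(0, 2, n_patients)
slopes = np.where(cluster == 0, 0.5, 2.0) + rng.normal(0, 0.1, n_patients)
baselines = rng.normal(20, 3, n_patients)
y = baselines[:, None] + slopes[:, None] * times + rng.normal(0, 0.5, (n_patients, n_visits))

# Step 1: summarize each patient's observed trajectory by a fitted line.
fitted = np.polyfit(times, y.T, deg=1)  # shape (2, n_patients): slope, intercept
features = fitted.T

# Step 2: discover latent clusters of progression patterns.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Step 3: forecast a future visit for a known patient via their cluster's
# center (robust when the patient's own record is short or noisy); a new
# patient would be routed through km.predict on their trajectory features.
def forecast(patient_idx, future_t):
    c = km.labels_[patient_idx]
    slope, intercept = km.cluster_centers_[c]
    return intercept + slope * future_t

print(forecast(0, future_t=8))
```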
Forming an accurate representation of a task environment often takes place incrementally as the information relevant to learning the representation only unfolds over time. This incremental nature of learning poses an important problem: it is usually unclear whether a sequence of stimuli consists of only a single pattern, or multiple patterns that are spliced together. In the former case, the learner can directly use each observed stimulus to continuously revise its representation of the task environment. In the latter case, however, the learner must first parse the sequence of stimuli into different bundles, so as to not conflate the multiple patterns. We created a video-game statistical learning paradigm and investigated (1) whether learners without prior knowledge of the existence of multiple "stimulus bundles" — subsequences of stimuli that define locally coherent statistical patterns — could detect their presence in the input and (2) whether learners are capable of constructing a rich representation that encodes the various statistical patterns associated with bundles. By comparing human learning behavior to the predictions of three computational models, we find evidence that learners can handle both tasks successfully. In addition, we discuss the underlying reasons for why the learning of stimulus bundles occurs even when such behavior may seem irrational.
When we read or listen to language, we are faced with the challenge of inferring intended messages from noisy input. This challenge is exacerbated by considerable variability between and within speakers. Focusing on syntactic processing (parsing), we test the hypothesis that language comprehenders rapidly adapt to the syntactic statistics of novel linguistic environments (e.g., speakers or genres). Two self-paced reading experiments investigate changes in readers' syntactic expectations based on repeated exposure to sentences with temporary syntactic ambiguities (so-called "garden path sentences"). These sentences typically lead to a clear expectation violation signature when the temporary ambiguity is resolved to an a priori less expected structure (e.g., based on the statistics of the lexical context). We find that comprehenders rapidly adapt their syntactic expectations to converge towards the local statistics of novel environments. Specifically, repeated exposure to a priori unexpected structures can reduce, and even completely undo, their processing disadvantage (Experiment 1). The opposite is also observed: a priori expected structures become less expected (even eliciting garden paths) in environments where they are hardly ever observed (Experiment 2). Our findings suggest that, when changes in syntactic statistics are to be expected (e.g., when entering a novel environment), comprehenders can rapidly adapt their expectations, thereby overcoming the processing disadvantage that mistaken expectations would otherwise cause. Our findings take a step towards unifying insights from research in expectation-based models of language processing, syntactic priming, and statistical learning.
Crowdsourcing platforms are a popular choice for researchers to gather text annotations quickly at scale. We investigate whether crowdsourced annotations are useful when the labeling task requires medical domain knowledge. Comparing a sentence classification model trained with expert-annotated sentences to the same model trained on crowd-labeled sentences, we find the crowdsourced training data to be just as effective as the manually produced dataset. We can improve the accuracy of the crowd-fueled model without collecting further labels by filtering out worker labels applied with low confidence.
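The abstract does not describe the exact filtering procedure; as a rough sketch of confidence-based filtering, one might drop low-confidence worker labels before aggregating by majority vote. The threshold value and the per-label confidence field below are assumptions for illustration.

```python
# Minimal sketch of confidence-based filtering of crowd labels before
# majority vote. The threshold and the per-worker confidence score are
# illustrative assumptions, not the paper's exact procedure.
from collections import Counter

def aggregate(labels, min_confidence=0.7):
    """labels: list of (worker_label, worker_confidence) for one sentence."""
    kept = [lab for lab, conf in labels if conf >= min_confidence]
    if not kept:                           # fall back to all labels if
        kept = [lab for lab, _ in labels]  # filtering removes everything
    return Counter(kept).most_common(1)[0][0]

print(aggregate([("relevant", 0.9), ("irrelevant", 0.4), ("relevant", 0.8)]))
# -> "relevant": the low-confidence dissenting label is ignored
```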
Proceedings of the National Academy of Sciences, Sep 22, 2014
The order in which stimuli are presented in an experiment has long been recognized to influence behavior. Previous accounts have often attributed the effect of stimulus order to the mechanisms with which people process information. We propose that stimulus order influences cognition because it is an important cue for learning the underlying structure of a task environment. In particular, stimulus order can be used to infer a "stimulus bundle": a sequence of consecutive stimuli that share the same underlying latent cluster. We describe a clustering model that successfully explains the perception of streak shooting in basketball games, along with two other cognitive phenomena, as the outcome of finding the statistically optimal bundle representation. We argue that the perspective of viewing stimulus order as a cue may hold the key to explaining behaviors that seemingly deviate from normative theories of cognition, and that in task domains where the assumption of stimulus bundles is intuitively appropriate, it can improve the explanatory power of existing models.
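The abstract does not spell out the clustering model, but the core comparison can be illustrated with a toy computation: score a basketball shot sequence under a single hit rate versus a split into two consecutive bundles with their own rates, using Beta-Bernoulli marginal likelihoods. The Beta(1,1) prior and the example sequence are assumptions, not the paper's model.

```python
# Sketch of the "stimulus bundle" intuition on a basketball shot sequence:
# does a single hit rate explain the data better than two consecutive
# bundles, each with its own rate? Uses Beta(1,1)-Bernoulli marginal
# likelihoods; an illustration of the idea, not the paper's model.
from math import lgamma

def log_marginal(hits, n):
    """log P(data) under a Bernoulli rate with a Beta(1,1) prior."""
    # Beta-function identity: B(h+1, n-h+1) / B(1, 1)
    return lgamma(hits + 1) + lgamma(n - hits + 1) - lgamma(n + 2)

shots = [1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0]  # a "hot" then "cold" stretch

single = log_marginal(sum(shots), len(shots))
best_split = max(
    (log_marginal(sum(shots[:k]), k)
     + log_marginal(sum(shots[k:]), len(shots) - k), k)
    for k in range(1, len(shots))
)
print("single-bundle logp:", round(single, 2))
print("best two-bundle logp:", round(best_split[0], 2), "split at", best_split[1])
# The two-bundle parse wins: the "streak" is the statistically optimal
# bundle representation of this sequence.
```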
Knowing how to pronounce a word is important for automatic speech recognition and synthesis. Previous approaches have either employed trained persons to manually generate pronunciations, or have used letter-to-phoneme (L2P) rules, which were either hand-crafted or machine-learned from a manually transcribed corpus (Elovitz et al., 1976; Dietterich, 2002). The first approach is expensive; the second can be of variable quality, depending on the skill of the experts or the size and quality of the transcribed data.
We report on the large-scale acquisition of class attributes with and without the use of lists of representative instances, as well as the discovery of unary attributes, such as those typically expressed in English through prenominal adjectival modification. Our method employs a system based on compositional language processing, as applied to the British National Corpus.
Non-native English speakers often have problems determining the exact form of an idiomatic expression while having only a vague idea about the key words in it. In this paper, we describe a system called Webtionary that allows users to check idiomatic usage by entering a questionable expression. Webtionary uses web search to find candidate corrections and suggests expressions that are commonly used in writing and semantically related to the user query.
Learning an accurate representation of the environment is a difficult task for both animals and humans, because the causal structures of the environment are unobservable and must be inferred from the observable input. In this article, we argue that this difficulty is further increased by the multi-context nature of realistic learning environments. When the environment undergoes a change in context without explicit cueing, the learner must detect the change and employ a new causal model to predict upcoming observations correctly. We discuss the problems and strategies that a rational learner might adopt and existing findings that support such strategies. We advocate hierarchical models as an optimal structure for retaining causal models learned in past contexts, thereby avoiding relearning familiar contexts in the future.
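As a rough illustration of the advocated strategy, consider a learner that keeps one simple predictive model per context, switches back to a stored context when prediction error spikes and an old model fits, and spawns a new context otherwise. The running-mean models and the error threshold are illustrative assumptions, not the article's hierarchical model.

```python
# Toy sketch of multi-context learning with a memory of past contexts:
# one running-mean model per context; on a prediction-error spike, reuse
# whichever stored context explains the observation (avoiding relearning)
# or spawn a new one. The threshold is an illustrative assumption.
class ContextLearner:
    def __init__(self, threshold=3.0):
        self.contexts = [[0.0, 1]]  # [mean, count] per stored context
        self.current = 0
        self.threshold = threshold

    def observe(self, x):
        mean, n = self.contexts[self.current]
        if abs(x - mean) > self.threshold:
            # Change detected: reuse a stored context that explains x,
            # otherwise create a new one.
            fits = [i for i, (m, _) in enumerate(self.contexts)
                    if abs(x - m) <= self.threshold]
            if not fits:
                self.contexts.append([float(x), 1])
                self.current = len(self.contexts) - 1
                return
            self.current = fits[0]
            mean, n = self.contexts[self.current]
        self.contexts[self.current] = [(mean * n + x) / (n + 1), n + 1]

learner = ContextLearner()
for x in [0, 1, 0, 1, 10, 11, 10, 0, 1, 0]:
    learner.observe(x)
print(learner.contexts)  # two contexts retained; the first was reused
```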
One major aspect of successful language acquisition is the ability to generalize from properties of experienced items to novel items. We present a computational study of artificial language learning, where the generalization patterns of three generative models are compared to those of human learners across 10 experiments. Results suggest that an explicit representation of word categories is the best model for capturing the generalization patterns of human learners across a wide range of learning environments. We discuss the representational assumptions implied by these models.
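A toy contrast between item-based and category-based generalization can make the representational difference concrete: a learner that groups words into categories by shared distributional contexts accepts novel combinations that an item-based learner rejects. The mini language below is illustrative, not the experiments' stimuli or the paper's generative models.

```python
# Words sharing a context are grouped into one category; a category-based
# learner then accepts novel combinations for any category member, while
# an item-based learner accepts only literally experienced pairs.
training = [("ka", "alt"), ("ka", "bim"), ("alt", "po"), ("bim", "po"), ("tiz", "po")]

# Category inference: words that occur before "po" form one category.
category = {w1 for w1, w2 in training if w2 == "po"}   # {'alt', 'bim', 'tiz'}

def accepts(pair, use_categories):
    if pair in training:                  # direct experience
        return True
    w1, w2 = pair
    if use_categories and w1 == "ka":     # extend "ka X" to X's whole category
        return w2 in category
    return False

print(accepts(("ka", "tiz"), use_categories=False))  # False: never experienced
print(accepts(("ka", "tiz"), use_categories=True))   # True: tiz shares alt/bim's category
```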
Recent years have seen a surge in accounts motivated by information theory that consider language production to be partially driven by a preference for communicative efficiency. Evidence from discourse production (i.e., production beyond the sentence level) has been argued to suggest that speakers distribute information across discourse so as to hold the conditional per-word entropy associated with each word constant, which would facilitate efficient information transfer (Genzel & Charniak, 2002). This hypothesis implies that the conditional (contextualized) probabilities of linguistic units affect speakers' preferences during production. Here, we extend this work in two ways. First, we explore how preceding cues are integrated into contextualized probabilities, a question which so far has received little to no attention. Specifically, we investigate how a cue's maximal informativity about upcoming words (the cue's effectiveness) decays as a function of the cue's recency. Based on properties of linguistic discourses as well as properties of human memory, we analytically derive a model of cue effectiveness decay and evaluate it against cross-linguistic data from 12 languages. Second, we relate the information theoretic accounts of discourse production to well-established mechanistic (activation-based) accounts: We relate contextualized probability distributions over words to their relative activation in a lexical network given preceding discourse.
We formally derive a mathematical model for evaluating the effect of context relevance in language production. The model is based on the principle that distant contextual cues tend to gradually lose their relevance for predicting upcoming linguistic signals. We evaluate our model against a hypothesis of efficient communication (Genzel and Charniak's Constant Entropy Rate hypothesis). We show that the development of entropy throughout discourses is described significantly better by a model with cue relevance decay than by previous models that do not consider such factors.
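Neither abstract states the derived functional form; purely as an illustration of cue-effectiveness decay, one might weight each preceding cue's maximal informativity by a power-law function of its recency, a shape broadly consistent with human memory. The exponent and cue values below are assumptions, not the papers' fitted model.

```python
# Illustrative sketch of cue-effectiveness decay: weight each preceding
# cue's maximal informativity (in bits) by a power-law function of its
# distance, yielding its effective contribution to contextualized
# predictions. Functional form and parameters are assumptions.
import numpy as np

def cue_weights(distances, alpha=0.8):
    """Power-law decay of cue effectiveness with distance (in sentences)."""
    return np.asarray(distances, dtype=float) ** (-alpha)

# Cues at 1, 2, 5, and 10 sentences back, each with some maximal
# informativity about the upcoming word.
distances = [1, 2, 5, 10]
max_info = np.array([2.0, 1.5, 1.5, 1.0])

effective_info = max_info * cue_weights(distances)
print(dict(zip(distances, effective_info.round(3))))
# Distant cues contribute progressively less to the prediction.
```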
We describe ScriptTranscriber, an open source toolkit for extracting transliterations in comparable corpora from languages written in different scripts. The system includes various methods for extracting potential terms of interest from raw text, for providing guesses on the pronunciations of terms, and for comparing two strings as possible transliterations using both phonetic and temporal measures. The system works with any script in the Unicode Basic Multilingual Plane and is easily extended to include new modules. Given comparable corpora, such as newswire text, in a pair of languages that use different scripts, ScriptTranscriber provides an easy way to mine transliterations from the comparable texts. This is particularly useful for under-resourced languages, where training data for transliteration may be lacking and it is thus hard to train good transliterators. ScriptTranscriber provides an open source package that allows for ready incorporation of more sophisticated modules (e.g., a trained transliteration model for a particular language pair). ScriptTranscriber is available as part of the nltk contrib source tree at http://code.google.com/p/nltk/
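As a toy illustration of the string-comparison step (not ScriptTranscriber's actual modules), one can normalize two scripts into a common rough phonetic code and score candidate pairs by string similarity; the tiny character mapping below is a deliberate simplification.

```python
# Toy phonetic comparison of a candidate transliteration pair: map one
# script into the other's rough phonetic space, then score similarity.
# The Cyrillic-to-Latin table is an illustrative assumption; the real
# toolkit uses richer phonetic and temporal measures.
from difflib import SequenceMatcher

CYR2LAT = str.maketrans("москва", "moskva")

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(similarity("moskva", "москва".translate(CYR2LAT)))  # -> 1.0
```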
In this paper we investigate the manner in which the human language comprehension system adapts to shifts in probability distributions over syntactic structures, given experimentally controlled experience with those structures. We replicate a classic reading experiment and present a model of the behavioral data that implements a form of Bayesian belief update over the course of the experiment.
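A minimal sketch of the kind of belief update involved, assuming a Beta-Bernoulli model over two competing structures; the prior counts and exposure sequence are illustrative, not the paper's fitted model.

```python
# Beta-Bernoulli belief update over two parses (e.g., an a priori
# expected structure A vs. a dispreferred structure B). Prior counts
# and the trial sequence are illustrative assumptions.
a, b = 8.0, 2.0  # prior pseudo-counts: structure A expected 80% a priori

for outcome in [0, 0, 0, 0, 0]:  # five exposures resolve to structure B
    a += outcome
    b += 1 - outcome
    print(f"P(structure A) = {a / (a + b):.2f}")
# Expectations for A drift toward the local statistics with each exposure.
```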
Speakers have been hypothesized to organize discourse content so as to achieve communicative efficiency. Previous work has focused on indirect tests of the hypothesis that speakers aim to keep per-word entropy constant across discourses to achieve communicative efficiency (Genzel & Charniak, 2002). We present novel and more direct evidence by examining the role of topic shift in discourse planning. If speakers aim for constant per-word entropy, they should encode less unconditional per-word entropy (as estimated based on only sentence-internal cues) following topic shifts, as there is less relevant context to condition on. Applying latent topic modeling to a large set of English texts, we find that speakers are indeed sensitive to the recent topic structure in the predicted way.
Recent work proposes that language production is organized to facilitate efficient communication by means of transmitting information at a constant rate. However, evidence has almost exclusively come from English. We present new results from Mandarin Chinese supporting the hypothesis that Constant Entropy Rate is observed cross-linguistically, and may be a universal property of the language production system. We show that this result holds even if several important confounds that previous work failed to address are controlled for. Finally, we present evidence that Constant Entropy Rate is observed at the syllable level as well as the word level, suggesting findings do not depend on the chosen unit of observation.
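The entropy estimates in these analyses follow a standard recipe: per-word (or per-syllable) entropy is the average negative log-probability of a sentence's units under a language model. Below is a toy version with a unigram model; real studies train n-gram models on large corpora, and the miniature corpus here is an illustrative assumption.

```python
# Per-word entropy as average negative log2-probability under a language
# model. A unigram model stands in for the "sentence-internal cues only"
# estimate; assumes every test word appears in the training corpus
# (real studies use trained, smoothed n-gram models).
import math
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the log".split()
unigram = Counter(corpus)
total = sum(unigram.values())

def per_word_entropy(sentence):
    words = sentence.split()
    logps = [math.log2(unigram[w] / total) for w in words]
    return -sum(logps) / len(words)  # bits per word

print(per_word_entropy("the cat sat"))
```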
We describe the use of a weakly supervised bootstrapping algorithm in discovering contrasting semantic categories from a source lexicon with little training data. Our method primarily exploits the patterns in sentential contexts where different categories of words may appear. Experimental results are presented showing that such automatically categorized terms tend to agree with human judgements.
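A minimal sketch of the bootstrapping loop, assuming toy seeds and a toy corpus (not the paper's actual setup): collect the sentential contexts of current category members, then admit the unlabeled word that shares the most contexts. Note how contrasting terms (here hot/cold) surface together precisely because they occupy the same sentential contexts.

```python
# Weakly supervised bootstrapping from seed words: grow a category by
# admitting the candidate that shares the most sentential contexts with
# current members. Seeds, corpus, and scoring are toy assumptions.
from collections import Counter

corpus = [
    "the weather was hot today",
    "the weather was cold today",
    "the soup is hot now",
    "the soup is cold now",
    "it felt warm inside",
    "it felt chilly inside",
]

def contexts(word):
    """Collect (left, right) neighbor pairs for a word across the corpus."""
    out = Counter()
    for sent in corpus:
        toks = sent.split()
        for i, t in enumerate(toks):
            if t == word and 0 < i < len(toks) - 1:
                out[(toks[i - 1], toks[i + 1])] += 1
    return out

def bootstrap(seeds, candidates, rounds=1):
    category = set(seeds)
    for _ in range(rounds):
        pattern = Counter()
        for w in category:
            pattern += contexts(w)
        # Admit the candidate sharing the most contexts with the category.
        best = max(candidates - category,
                   key=lambda w: sum((contexts(w) & pattern).values()))
        category.add(best)
    return category

print(bootstrap({"hot"}, {"cold", "warm", "chilly", "today"}))
# -> {'hot', 'cold'}: the contrasting term shares hot's contexts
```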