Papers by Alexander Genkin
A clustering neural network model of insect olfaction
2017 51st Asilomar Conference on Signals, Systems, and Computers

A key step in insect olfaction is the transformation of a dense representation of odors in a small population of neurons - projection neurons (PNs) of the antennal lobe - into a sparse representation in a much larger population of neurons - Kenyon cells (KCs) of the mushroom body. What computational purpose does this transformation serve? We propose that the PN-KC network implements an online clustering algorithm which we derive from the k-means cost function. The vector of PN-KC synaptic weights converging onto a given KC represents the corresponding cluster centroid. KC activities represent attribution indices, i.e. the degree to which a given odor presentation is attributed to each cluster. Remarkably, such a clustering view of the PN-KC circuit naturally accounts for several of its salient features. First, attribution indices are nonnegative, thus rationalizing rectification in KCs. Second, the constraint on the total sum of attribution indices for each presentation is enforced by ...
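The clustering interpretation above can be sketched as a plain online k-means loop, in which each centroid stands in for one KC's synaptic weight vector and the winning unit gives the cluster attribution for that presentation. This is a generic, hypothetical illustration (the function name, the first-k initialization, and the hard winner-take-all step are ours, not the paper's derivation):

```python
import numpy as np

def online_kmeans(X, k):
    """Online k-means sketch: each centroid plays the role of one KC's
    synaptic weight vector; the winning unit for each presentation is
    the cluster that input is attributed to."""
    centroids = X[:k].astype(float).copy()   # initialize from the first k inputs
    counts = np.zeros(k)
    labels = np.empty(len(X), dtype=int)
    for t, x in enumerate(X):
        j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))  # nearest centroid wins
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]  # running-mean update of the winner
        labels[t] = j
    return centroids, labels
```

Each centroid update is an exact running mean of the inputs attributed to it, so no learning rate needs tuning.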

Many neurons in the brain, such as place cells in the rodent hippocampus, have localized receptive fields, i.e., they respond to a small neighborhood of stimulus space. What is the functional significance of such representations and how can they arise? Here, we propose that localized receptive fields emerge in similarity-preserving networks of rectifying neurons that learn low-dimensional manifolds populated by sensory inputs. Numerical simulations of such networks on standard datasets yield manifold-tiling localized receptive fields. More generally, we show analytically that, for data lying on symmetric manifolds, optimal solutions of objectives, from which similarity-preserving networks are derived, have localized receptive fields. Therefore, nonnegative similarity-preserving mapping (NSM) implemented by neural networks can model representations of continuous manifolds in the brain.
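A minimal way to see what a nonnegative similarity-preserving mapping does is projected gradient descent on ||XᵀX − YᵀY||²_F with the output Y constrained to be nonnegative (rectification). The sketch below is a toy stand-in for the networks described above; the function name and the batch optimization scheme are assumptions, not the paper's online algorithm:

```python
import numpy as np

def nsm(X, k, lr=1e-3, steps=1000, seed=0):
    """Projected gradient descent on ||X.T@X - Y.T@Y||_F^2 over Y >= 0:
    output similarities are fit to input similarities under rectification."""
    rng = np.random.default_rng(seed)
    G = X.T @ X                                  # input similarity matrix
    Y = 0.1 * rng.random((k, X.shape[1]))        # small nonnegative initialization
    errs = []
    for _ in range(steps):
        R = G - Y.T @ Y                          # similarity mismatch
        errs.append(np.linalg.norm(R))
        Y = np.maximum(Y + 4.0 * lr * Y @ R, 0.0)  # descent step, then rectify
    return Y, errs

# inputs sampled from a 1-D manifold (a half-circle in 2-D)
theta = np.linspace(0.0, np.pi, 20)
X = np.vstack([np.cos(theta), np.sin(theta)])
Y, errs = nsm(X, k=10)
```

On such manifold data, rows of Y tend toward localized bumps over neighboring stimuli, which is the manifold-tiling behavior the abstract describes.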
Sparse Bayesian Classifiers for Text Categorization (U) Alexander Genkin
ABSTRACT This paper empirically compares the performance of different Bayesian models for text categorization. In particular, we examine so-called "sparse" Bayesian models that explicitly favor simplicity. We present empirical evidence that these models retain good predictive capabilities while offering significant computational advantages.
Large-Scale Bayesian Logistic Regression for Text Categorization
Technometrics: A Journal of Statistics for the Physical, Chemical, and Engineering Sciences, 2007
Special Issue on Statistics in Information Technology: Large-Scale Bayesian Logistic Regression for Text Categorization
Set covering submodular maximization: An optimal algorithm for data mining in bioinformatics and medical informatics
Journal of Intelligent & Fuzzy …, 2002
Abstract. In this paper we show how several problems in different areas of data mining and knowledge discovery can be viewed as finding the optimal covering of a finite set. Many such problems arise in biomedical and bioinformatics research. For example, protein functional annotation ...
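For orientation, the textbook greedy baseline for minimum set cover looks like this. It is shown only as the standard reference point; the paper above concerns an optimal algorithm, which greedy is not:

```python
def greedy_set_cover(universe, subsets):
    """Textbook greedy set cover: repeatedly pick the subset that covers
    the most still-uncovered elements, until the universe is covered."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & set(s)))  # largest marginal gain
        if not uncovered & set(best):
            raise ValueError("remaining elements cannot be covered")
        cover.append(set(best))
        uncovered -= set(best)
    return cover
```

Greedy achieves the classic ln(n) approximation guarantee, which is why it is the usual point of comparison for exact methods.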
DIMACS at the TREC 2005 Genomics Track
TREC, 2005
DIMACS TR: 2005-42 Simulated Entity Resolution by Diverse Means: DIMACS Work on the KDD Challenge of 2005
Large-Scale Bayesian Logistic Regression for Text Categorization
JMLR, 2004
Large-Scale Bayesian Logistic Regression for Text Categorization. Alexander Genkin, DIMACS, Rutgers University, Piscataway, NJ 08854 (alexgenkin@iname.com); David D. Lewis, David D. Lewis Consulting, Chicago, IL 60614 (tmpaper06@DavidDLewis.com) ...
on two of the groups of entity resolution problems, ER1 and ER2 for the KDD Challenge in 2005. We presume that the situation is intended to mimic, using abstracts and author information from the life sciences, some real world problem, in which it is important to recognize the identity of an individual, even though he may share that name with other individuals (ER1), or may actively seek to hide his identity by removing his own name from a work, or replacing it with an alias (ER2a, and ER2b,c). Thus specific problems investigated include author resolution, finding a missing author of a paper, and detecting a false author of a paper. The methods used to attack these problems include combinatorial cluster analysis, fusion of methods, penalized logistic regression / maximum entropy approaches, and dependency modeling.

DIMACS participated in the text categorization and ad hoc retrieval tasks of the TREC 2004 Genomics track. For the categorization task, we tackled the triage and annotation hierarchy subtasks. and biology of the laboratory mouse. In particular, the Mouse Genome Database (MGD) contains information on the characteristics and functions of genes in the mouse, and on where this information appeared in the scientific literature. Human curators encode this information using controlled vocabulary terms from the Gene Ontology (GO), and provide citations to documents that report each piece of information. GO consists of three structured networks (Biological Process (BP), Molecular Function (MF), and Cellular Component (CC)) of terms describing attributes of genes and gene products. The TREC 2004 Genomics track defined a categorization task with three subtasks based on simplified versions of this curation process. DIMACS participated in two of those subtasks, triage and annotation h...

2013 Asilomar Conference on Signals, Systems and Computers, 2013
A neuron is a basic physiological and computational unit of the brain. While much is known about the physiological properties of a neuron, its computational role is poorly understood. Here we propose to view a neuron as a signal processing device that represents the incoming streaming data matrix as a sparse vector of synaptic weights scaled by an outgoing sparse activity vector. Formally, a neuron minimizes a cost function comprising a cumulative squared representation error and regularization terms. We derive an online algorithm that minimizes such cost function by alternating between the minimization with respect to activity and with respect to synaptic weights. The steps of this algorithm reproduce well-known physiological properties of a neuron, such as weighted summation and leaky integration of synaptic inputs, as well as an Oja-like, but parameter-free, synaptic learning rule. Our theoretical framework makes several predictions, some of which can be verified by the existing data, while others require further experiments. Such a framework should allow modeling the function of neuronal circuits without necessarily measuring all the microscopic biophysical parameters, as well as facilitate the design of neuromorphic electronics.
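The alternating-minimization picture above can be sketched for a single linear neuron: activity is a weighted sum of the input, and the synaptic weights follow an Oja-like update whose step size is set by cumulative squared activity rather than by a tuned learning rate. This is a generic sketch of that idea, not the paper's exact algorithm:

```python
import numpy as np

def single_neuron_online(X):
    """One pass over a stream of inputs: activity y is a weighted sum of
    the input, and weights follow an Oja-like rule with step size
    1 / (cumulative squared activity) -- parameter-free in the sense
    that no learning rate is tuned."""
    w = np.ones(X.shape[1]) / np.sqrt(X.shape[1])  # initial synaptic weights
    y2_sum = 1e-12                                  # cumulative squared activity
    for x in X:
        y = w @ x                                   # weighted summation of inputs
        y2_sum += y * y
        w += (y * x - y * y * w) / y2_sum           # Oja-like synaptic update
    return w
```

As with Oja's rule, the weight vector converges toward the dominant direction of the input stream with unit norm; the diminishing 1/Σy² step size plays the role the hand-tuned learning rate plays in the classic rule.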
Individuals have distinctive ways of speaking and writing, and there exists a long history of linguistic and stylistic investigation into authorship attribution. In recent years, practical applications for authorship attribution have grown in areas such as intelligence (linking intercepted messages to each other and to known terrorists), criminal law (identifying writers of ransom notes and harassing letters), civil law (copyright

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '06, 2006
Supervised learning approaches to text classification are in practice often required to work with small and unsystematically collected training sets. The alternative is usually viewed as building classifiers by hand, using an expert's understanding of what features of the text are related to the class of interest. This is expensive, requires a degree of computational and linguistic sophistication, and makes it difficult to use combinations of weak predictors. We propose instead combining domain knowledge with training examples in a Bayesian framework. Domain knowledge is used to specify a prior distribution for parameters of a logistic regression model, and labeled training data is used to produce a posterior distribution, whose mode we find. We show on three text categorization data sets that this approach can rescue what would otherwise be disastrously bad training situations, producing much more effective classifiers.
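A minimal sketch of the prior-plus-data idea: MAP estimation for logistic regression with a Gaussian prior whose mean encodes domain knowledge (e.g. a positive prior mean for a word believed to indicate the class). The function below uses plain gradient ascent and is an illustrative toy, not the authors' implementation:

```python
import numpy as np

def map_logistic(X, y, prior_mean, prior_var, lr=0.1, steps=2000):
    """MAP logistic regression with a Gaussian prior N(prior_mean, prior_var)
    on the weights; labels y are in {0, 1}. Gradient ascent on the
    log-posterior = log-likelihood + log-prior."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    m = np.asarray(prior_mean, float)
    w = m.copy()                                     # start at the prior mean
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))           # predicted probabilities
        grad = X.T @ (y - p) - (w - m) / prior_var   # data term plus prior pull
        w += lr * grad
    return w
```

With few labeled examples, the prior term dominates and the classifier falls back on the encoded domain knowledge; with more data, the likelihood term takes over.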

Because of the specifics of neuronal architecture, imaging must be done with very high resolution and throughput. While Electron Microscopy (EM) achieves the required resolution in the transverse directions, its depth resolution is a severe limitation. Computed tomography (CT) may be used in conjunction with electron microscopy to improve the depth resolution, but this severely limits the throughput since several tens or hundreds of EM images need to be acquired. Here, we exploit recent advances in signal processing to obtain high depth resolution EM images computationally. First, we show that the brain tissue can be represented as a sparse linear combination of local basis functions that are thin membrane-like structures oriented in various directions. We then develop reconstruction techniques inspired by compressive sensing that can reconstruct the brain tissue from very few (typically 5) tomographic views of each section. This enables tracing of neuronal connections across layers and, hence, high-throughput reconstruction of neural circuits to the level of individual synapses.
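The generic workhorse behind this kind of reconstruction is sparse recovery, e.g. iterative shrinkage-thresholding (ISTA) for min 0.5·||Ax − b||² + λ·||x||₁, where A maps the sparse coefficients to the few tomographic measurements. The sketch below illustrates only the principle; it assumes a generic measurement matrix, not the paper's membrane-like basis functions:

```python
import numpy as np

def ista(A, b, lam, steps=500):
    """ISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1: a gradient step on
    the quadratic term followed by soft-thresholding, which promotes
    sparsity in the recovered coefficients."""
    L = np.linalg.norm(A, 2) ** 2                   # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        g = x - A.T @ (A @ x - b) / L               # gradient step on 0.5*||Ax - b||^2
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft threshold
    return x
```

When A is the identity, ISTA reduces to a single soft-thresholding of b, which makes the shrinkage behavior easy to check by hand.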
Large-Scale Bayesian Logistic Regression for Text Categorization
Technometrics, 2007
... to logistic regression that avoids overfitting, has classification effectiveness similar to that of the best published methods, and is efficient both during fitting and at prediction time. ... In Section 3 we present the basics of our Bayesian approach to logistic regression, and in ...