We use the Expectation-Maximization (EM) algorithm to classify 3D aerial lidar scattered height data into four categories: road, grass, buildings, and trees. To do so we use five features: height, height variation, normal variation, lidar return intensity, and image intensity. We also use only lidar-derived features to organize the data into three classes (the road and grass classes are merged). We apply and test our results using ten regions taken from lidar data collected over an area of approximately eight square miles, obtaining higher than 94% accuracy. We also apply our classifier to our entire dataset, and present visual classification results both with and without uncertainty. We use several approaches to evaluate the parameter and model choices possible when applying EM to our data. We observe that our classification results are stable and robust over the various subregions of our data which we tested. We also compare our results here with previous classification efforts using this data.
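The abstract gives no code, but a minimal sketch of EM-based soft classification of per-point feature vectors might look like the following. The placeholder feature matrix X, the use of scikit-learn's GaussianMixture as the EM implementation, and the four-component choice standing in for the four classes are all assumptions, not the authors' pipeline.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# columns: height, height variation, normal variation, lidar return intensity, image intensity
X = np.random.rand(1000, 5)                        # placeholder for the real per-point features
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)                            # hard class label per point
posteriors = gmm.predict_proba(X)                  # per-class probabilities (one notion of uncertainty)
```

The per-class posteriors could serve as the kind of per-point uncertainty the abstract's visualizations refer to, though the paper's exact uncertainty measure may differ.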
Building footprints have been shown to be extremely useful in urban planning, infrastructure development, and roof modeling. Current methods for creating these footprints are often highly manual and rely largely on architectural blueprints or skilled modelers. In this work we use aerial LIDAR data to generate building footprints automatically. Existing automatic methods have been mostly unsuccessful due to large amounts of noise around building edges. We present a novel Bayesian technique for automatically constructing building footprints from a pre-classified LIDAR point cloud. Our algorithm first computes a bounded-error approximate building footprint using an application of the shortest path algorithm. We then determine the most probable building footprint by maximizing the posterior probability using linear optimization and simulated annealing techniques. We have applied our algorithm to more than 300 buildings in our data set and observe that we obtain accurate building footprints compared to the ground truth. Our algorithm is automatic and can be applied to other man-made shapes such as roads and telecommunication lines with minor modifications.
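As a rough sketch of the simulated-annealing step the abstract mentions, a generic annealing loop over candidate footprints could look like the following. The footprint representation, the log_posterior and propose callables, the cooling schedule, and the step count are all assumptions; the paper's actual optimization details are not reproduced here.

```python
import math
import random

def simulated_annealing(footprint, log_posterior, propose, steps=10000, t0=1.0):
    """Hill-climb with occasional downhill moves toward a footprint of high posterior probability."""
    current = best = footprint
    for k in range(steps):
        temp = t0 / (1 + k)                        # simple cooling schedule
        cand = propose(current)                    # perturb the current footprint
        delta = log_posterior(cand) - log_posterior(current)
        if delta > 0 or random.random() < math.exp(delta / temp):
            current = cand
            if log_posterior(current) > log_posterior(best):
                best = current
    return best
```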
Trained musicians intuitively produce expressive variations that add to their audience's enjoyment. However, there is little quantitative information about the kinds of strategies used in different musical contexts. Since the literal synthesis of notes from a score is bland and unappealing, there is an opportunity for learning systems that can automatically produce compelling expressive variations. The ESP (Expressive Synthetic Performance) system generates expressive renditions using hierarchical hidden Markov models trained on the stylistic variations employed by human performers. Furthermore, the generative models learned by the ESP system provide insight into a number of musicological issues related to expressive performance.
Neural Information Processing Systems, Nov 29, 1999
Recent interpretations of the AdaBoost algorithm view it as performing a gradient descent on a potential function. Simply changing the potential function allows one to create new algorithms related to AdaBoost. However, these new algorithms are generally not known to have the formal boosting property. This paper examines the question of which potential functions lead to new algorithms that are boosters. The two main results are general sets of conditions on the potential; one set implies that the resulting algorithm is a booster, while the other implies that the algorithm is not. These conditions are applied to previously studied potential functions, such as those used by LogitBoost and Doom II.
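A small sketch of the potential-function view the abstract builds on: the example weights at each boosting round come from the potential's derivative evaluated at each example's margin. The helper names and the specific gradients shown are assumptions used only for illustration; they are not the paper's conditions or analysis.

```python
import numpy as np

def example_weights(margins, potential_grad):
    """Distribution over training examples induced by a potential's gradient at each margin."""
    w = -potential_grad(margins)                   # steepest-descent direction on the potential
    return w / w.sum()

adaboost_grad = lambda z: -np.exp(-z)              # derivative of the AdaBoost potential exp(-z)
logit_grad = lambda z: -1.0 / (1.0 + np.exp(z))    # derivative of log(1 + exp(-z)), as in LogitBoost
```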
Neural Information Processing Systems, Nov 27, 1995
We analyze and compare the well-known Gradient Descent algorithm and a new algorithm, called the Exponentiated Gradient algorithm, for training a single neuron with an arbitrary transfer function. Both algorithms are easily generalized to larger neural networks, and the generalization of Gradient Descent is the standard back-propagation algorithm. In this paper we prove worst-case loss bounds for both algorithms in the single neuron case. Since local minima make it difficult to prove worst-case bounds for gradient-based algorithms, we must use a loss function that prevents the formation of spurious local minima. We define such a matching loss function for any strictly increasing differentiable transfer function and prove worst-case loss bounds for any such transfer function and its corresponding matching loss. For example, the matching loss for the identity function is the square loss and the matching loss for the logistic sigmoid is the entropic loss. The different structure of the bounds for the two algorithms indicates that the new algorithm outperforms Gradient Descent when the inputs contain a large number of irrelevant components.
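For intuition, the two single-neuron update rules can be sketched as below, using the fact that the matching loss has gradient (f(w·x) − y)·x with respect to the weights. The learning rate eta and the particular normalization of the EG weights are assumptions here, not the paper's exact parameterization.

```python
import numpy as np

def gd_update(w, x, y, f, eta):
    """Additive (gradient descent) update for one example under the matching loss."""
    return w - eta * (f(w @ x) - y) * x

def eg_update(w, x, y, f, eta):
    """Multiplicative (exponentiated gradient) update, keeping the weights normalized."""
    v = w * np.exp(-eta * (f(w @ x) - y) * x)
    return v / v.sum()
```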
We study the problem of deterministically predicting boolean values by combining the boolean predictions of several experts. Previous on-line algorithms for this problem predict with the weighted majority of the experts' predictions. These algorithms give each expert an exponential weight β^m, where β is a constant in [0, 1) and m is the number of mistakes made by the expert in the past. We show that it is better to use sums of binomials as weights. In particular, we present a deterministic algorithm using binomial weights that has a better worst-case mistake bound than the best deterministic algorithm using exponential weights. The binomial weights naturally arise from a version space argument. We also show how both exponential and binomial weighting schemes can be used to make prediction algorithms robust against noise.
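The following sketch shows only the exponential-weight baseline the abstract contrasts against; the paper's binomial weighting scheme would replace the weight function, and is not reproduced here. The function name and the choice beta=0.5 are assumptions.

```python
import numpy as np

def weighted_majority_predict(expert_preds, mistakes, beta=0.5):
    """Deterministic weighted-majority vote over boolean expert predictions in {0, 1}."""
    w = beta ** np.asarray(mistakes, dtype=float)                 # exponential weight beta**m
    vote = np.dot(w, np.where(np.asarray(expert_preds) == 1, 1.0, -1.0))
    return 1 if vote >= 0 else 0
```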
The standard techniques for online learning of combinatorial objects perform multiplicative updates followed by projections into the convex hull of all the objects. However, this methodology can be expensive if the convex hull contains many facets. For example, the convex hull of n-symbol Huffman trees is known to have exponentially many facets. We get around this difficulty by exploiting extended formulations, which encode the polytope of combinatorial objects in a higher dimensional "extended" space with only polynomially many facets. We develop a general framework for converting extended formulations into efficient online algorithms with good relative loss bounds. We present applications of our framework to online learning of Huffman trees and permutations. The regret bounds of the resulting algorithms are within a factor of O(log(n)) of the state-of-the-art specialized algorithms for permutations, and depending on the loss regimes, improve on or match the state-of-the-art for Huffman trees. Our method is general and can be applied to other combinatorial objects.
We present an on-line investment algorithm which achieves almost the same wealth as the best constant-rebalanced portfolio determined in hindsight from the actual market outcomes. The algorithm employs a multiplicative update rule derived using a framework introduced by Kivinen and Warmuth. Our algorithm is very simple to implement and requires only constant storage and computing time per stock in each trading period. We tested the performance of our algorithm on real stock data from the New York Stock Exchange accumulated during a 22-year period. On this data, our algorithm clearly outperforms the best single stock as well as Cover's universal portfolio selection algorithm. We also present results for the situation in which the investor has access to additional side information.
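A minimal sketch of a multiplicative (exponentiated-gradient-style) portfolio update of the kind the abstract describes is shown below; the learning rate eta and the function name are assumptions, and the paper's exact update and tuning may differ.

```python
import numpy as np

def eg_portfolio_update(w, x, eta=0.05):
    """w: current portfolio weights (nonnegative, summing to 1); x: this period's price relatives."""
    growth = w @ x                                 # wealth factor achieved this trading period
    v = w * np.exp(eta * x / growth)               # multiplicative update per stock
    return v / v.sum()                             # renormalize to a valid portfolio
```

The per-period cost is a constant number of operations per stock, which matches the abstract's claim of constant storage and computing time per stock.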
We study the {0, 1}-loss version of adaptive adversarial multi-armed bandit problems with α (≥ 1) lossless arms. For this problem, we show a tight bound of K − α − Θ(1/T) on the minimax expected number of mistakes (1-losses), where K is the number of arms and T is the number of rounds.
We study dropout and weight decay applied to deep networks with rectified linear units and the quadratic loss. We show how using dropout in this context can be viewed as adding a regularization penalty term that grows exponentially with the depth of the network, whereas the more traditional weight decay penalty grows only polynomially. We then show how this difference affects the inductive bias of algorithms using one regularizer or the other: we describe a random source of data that dropout is unwilling to fit, but that is compatible with the inductive bias of weight decay, and another source that is compatible with the inductive bias of dropout, but not weight decay. We also show that, in contrast with the case of generalized linear models, when used with deep networks with rectified linear units and the quadratic loss, the regularization penalty of dropout (a) is not only a function of the marginals on the independent variables, but also depends on the response variables, and (b) can be negative. Finally, the dropout penalty can drive a learning algorithm to use negative weights even when trained with monotone training data.
We study learning of initial intervals in the prediction model. We show that for each distribution D over the domain, there is an algorithm A_D whose probability of a mistake in round m is at most (1/2 + o(1)) · 1/m. We also show that the best possible bound that can be achieved in the case in which the same algorithm A must be applied for all distributions D is strictly worse by a constant factor. Informally, "knowing" the distribution D enables an algorithm to reduce its error rate by a constant factor strictly greater than 1. As advocated in prior work, knowledge of D can be viewed as an idealized proxy for a large number of unlabeled examples.
International Conference on Artificial Intelligence, 2009
We present and explore the effectiveness of several variations on the All-Moves-As-First (AMAF) heuristic in Monte-Carlo Go. Our results show that:
• Random play-outs provide more information about the goodness of moves made earlier in the play-out.
• AMAF updates are not just a way to quickly initialize counts; they are useful after every play-out.
• Updates even more aggressive than AMAF can be even more beneficial.
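For readers unfamiliar with AMAF, the bookkeeping after one play-out can be sketched roughly as follows: every move that appears later in the play-out also credits the statistics kept at the starting position, as if it had been played first. The data structures and function signature are assumptions for illustration only, not the variations studied in the paper.

```python
from collections import defaultdict

amaf_plays = defaultdict(int)
amaf_wins = defaultdict(int)

def amaf_update(playout_moves, winner):
    """playout_moves: list of (player, move) pairs from one play-out; winner: the winning player."""
    for player, move in set(playout_moves):        # credit every move as if it were played first
        amaf_plays[(player, move)] += 1
        if player == winner:
            amaf_wins[(player, move)] += 1
```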
Dropout is a simple but effective technique for learning in neural networks and other settings. A sound theoretical understanding of dropout is needed to determine when dropout should be applied and how to use it most effectively. In this paper we continue the exploration of dropout as a regularizer pioneered by Wager et al. We focus on linear classification where a convex proxy to the misclassification loss (i.e. the logistic loss used in logistic regression) is minimized. We show:
• when the dropout-regularized criterion has a unique minimizer,
• when the dropout-regularization penalty goes to infinity with the weights, and when it remains bounded,
• that the dropout regularization can be non-monotonic as individual weights increase from 0, and
• that the dropout regularization penalty may not be convex.
This last point is particularly surprising because the combination of dropout regularization with any convex loss proxy is always a convex function. In order to contrast dropout regularization with L2 regularization, we formalize the notion of when different random sources of data are more compatible with different regularizers. We then exhibit distributions that are provably more compatible with dropout regularization than L2 regularization, and vice versa. These sources provide additional insight into how the inductive biases of dropout and L2 regularization differ. We provide some similar results for L1 regularization.
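To make the "dropout-regularization penalty" concrete, one can estimate it numerically as the gap between the dropout-averaged loss and the plain loss of the same linear classifier. The sketch below assumes labels y in {−1, +1}, a retain probability q, inverted-dropout rescaling, and a Monte Carlo estimate; these are illustrative choices, not the paper's definitions or proofs.

```python
import numpy as np

def logistic_loss(margins):
    return np.mean(np.logaddexp(0.0, -margins))    # mean of log(1 + exp(-margin))

def dropout_penalty(w, X, y, q=0.5, samples=2000, seed=0):
    """Monte Carlo estimate of (dropout-regularized criterion) - (plain logistic loss)."""
    rng = np.random.default_rng(seed)
    plain = logistic_loss(y * (X @ w))
    total = 0.0
    for _ in range(samples):
        mask = rng.random(X.shape[1]) < q          # keep each feature with probability q
        total += logistic_loss(y * ((X * mask / q) @ w))   # inverted-dropout rescaling
    return total / samples - plain
```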
We analyze algorithms for approximating a function f(x) = Φx mapping ℝ^d to ℝ^d using deep linear neural networks, i.e. networks that learn a function h parameterized by matrices Θ_1, ..., Θ_L and defined by h(x) = Θ_L Θ_{L-1} ··· Θ_1 x. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix Φ, in the case where the initial hypothesis Θ_1 = ... = Θ_L = I has excess loss bounded by a small enough constant. On the other hand, we show that gradient descent fails to converge for Φ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If Φ is symmetric positive definite, we show that an algorithm that initializes Θ_i = I learns an ε-approximation of f using a number of updates polynomial in L, the condition number of Φ, and log(d/ε). In contrast, we show that if the least squares matrix Φ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that Φ satisfies u^⊤Φu > 0 for all u but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant u^⊤ Θ_L Θ_{L-1} ··· Θ_1 u > 0 for all u, and another that "balances" Θ_1, ..., Θ_L so that they have the same singular values.
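A bare-bones sketch of the setting (without the paper's regularizers or guarantees): gradient descent on the end-to-end Frobenius loss of a deep linear network starting from identity layers. The depth L, step size eta, and step count are assumptions chosen only to make the sketch runnable.

```python
import numpy as np

def train_deep_linear(Phi, L=3, eta=0.05, steps=500):
    """Gradient descent on ||Theta_L ... Theta_1 - Phi||_F^2 from identity initialization."""
    d = Phi.shape[0]
    thetas = [np.eye(d) for _ in range(L)]
    for _ in range(steps):
        prods = [np.eye(d)]
        for T in thetas:
            prods.append(T @ prods[-1])            # prods[i] = Theta_i ... Theta_1
        err = prods[-1] - Phi                      # residual of the end-to-end map
        new_thetas = []
        for i in range(L):
            right = np.eye(d)
            for T in thetas[i + 1:]:
                right = T @ right                  # Theta_L ... Theta_{i+1}
            grad = right.T @ err @ prods[i].T      # gradient w.r.t. this layer (up to a factor of 2)
            new_thetas.append(thetas[i] - eta * grad)
        thetas = new_thetas
    return thetas
```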
In this paper we consider the problem of tracking a subset of a domain (called the target) which changes gradually over time. A single (unknown) probability distribution over the domain is used to generate random examples for the learning algorithm and to measure the speed at which the target changes. Clearly, the more rapidly the target moves, the harder it is for the algorithm to maintain a good approximation of the target. Therefore we evaluate algorithms based on how much movement of the target can be tolerated between examples while predicting with accuracy ε. Furthermore, the complexity of the class H of possible targets, as measured by d, its VC-dimension, also affects the difficulty of tracking the target concept. We show that if the problem of minimizing the number of disagreements with a sample from among concepts in a class H can be approximated to within a factor k, then there is a simple tracking algorithm for H which can achieve a probability ε of making a mistake if the target movement rate is at most a constant times ε²/(k(d + k) ln(1/ε)), where d is the Vapnik-Chervonenkis dimension of H. Also, we show that if H is properly PAC-learnable, then there is an efficient (randomized) algorithm that with high probability approximately minimizes disagreements to within a factor of 7d + 1, yielding an efficient tracking algorithm for H which tolerates drift rates up to a constant times ε²/(d² ln(1/ε)). In addition, we prove complementary results for the classes of halfspaces and axis-aligned hyperrectangles showing that the maximum rate of drift that any algorithm (even with unlimited computational power) can tolerate is a constant times ε²/d.
This paper considers a variant of the classical online learning problem with expert predictions. Our model's differences and challenges stem from the lack of any direct feedback on the loss each expert incurs at each time step t. We propose an approach that uses peer prediction and identify conditions under which it succeeds. Our techniques revolve around a carefully designed peer score function s() that scores experts' predictions based on the peer consensus. We show a sufficient condition, which we call peer calibration, under which standard online learning algorithms using loss feedback computed by the carefully crafted s() have bounded regret with respect to the unrevealed ground truth values. We then demonstrate how suitable s() functions can be derived for different assumptions and models.
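Purely as an illustration of what a peer score function could look like (not the paper's s() or its peer-calibration condition), one simple choice scores each expert's prediction by its squared distance from the leave-one-out average of the other experts.

```python
import numpy as np

def peer_scores(predictions):
    """Score each expert's prediction against the leave-one-out average of its peers."""
    p = np.asarray(predictions, dtype=float)
    consensus = (p.sum() - p) / (len(p) - 1)       # each expert's peers, averaged
    return (p - consensus) ** 2                    # smaller score = closer to peer consensus
```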
Adversarial attacks add perturbations to the input features with the intent of changing the classification produced by a machine learning system. Small perturbations can yield adversarial examples which are misclassified despite being virtually indistinguishable from the unperturbed input. Classifiers trained with standard neural network techniques are highly susceptible to adversarial examples, allowing an adversary to create misclassifications of their choice. We introduce a new type of network unit, called MWD (max of weighted distance) units, that have built-in resistance to adversarial attacks. These units are highly non-linear, and we develop the techniques needed to train them effectively. We show that simple interval techniques for propagating perturbation effects through the network enable the efficient computation of robustness (i.e., accuracy guarantees) for MWD networks under any perturbations, including adversarial attacks. MWD networks are significantly more robust to input perturbations than ReLU networks. On permutation-invariant MNIST, when test examples can be perturbed by 20% of the input range, MWD networks provably retain accuracy above 83%, while the accuracy of ReLU networks drops below 5%. The provable accuracy of MWD networks is superior even to the observed accuracy of ReLU networks trained with the help of adversarial examples. In the absence of adversarial attacks, MWD networks match the performance of sigmoid networks, and have accuracy only slightly below that of ReLU networks.
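The generic interval-propagation idea mentioned above can be sketched for an ordinary affine + ReLU layer as follows; this is not the paper's MWD computation, and the function name and bound representation are assumptions. Given elementwise input bounds, it returns guaranteed output bounds that hold for every perturbation inside those bounds.

```python
import numpy as np

def interval_affine_relu(lo, hi, W, b):
    """Propagate elementwise input bounds [lo, hi] through an affine layer followed by ReLU."""
    center, radius = (lo + hi) / 2.0, (hi - lo) / 2.0
    out_center = W @ center + b
    out_radius = np.abs(W) @ radius                # worst-case spread through the weights
    return np.maximum(out_center - out_radius, 0.0), np.maximum(out_center + out_radius, 0.0)
```

Chaining such bounds layer by layer gives certified output ranges, which is the flavor of robustness guarantee the abstract reports for MWD networks.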
The goal of object detection is to find objects in an image. An object detector accepts an image and produces a list of locations as (x, y) pairs. Here we introduce a new concept: location-based boosting. Location-based boosting differs from previous boosting algorithms because it optimizes a new spatial loss function to combine object detectors, each of which may have marginal performance, into a single, more accurate object detector. A structured representation of object locations as a list of (x, y) pairs is a more natural domain for object detection than the spatially unstructured representation produced by classifiers. Furthermore, this formulation allows us to take advantage of the intuition that large areas of the background are uninteresting and it is not worth expending computational effort on them. This results in a more scalable algorithm because it does not need to take measures to prevent the background data from swamping the foreground data, such as subsampling or applying an ad hoc weighting to the pixels. We first present the theory of location-based boosting, and then motivate it with empirical results on a challenging data set.