Poster Presentation by Ankit Shah
Sound waves pervade the entire universe, yet as humans our hearing permits us to detect only a limited set of these waves. Machines, however, do not have this limitation. Imagine a machine that can sense a knock on the door or someone breaking into your home, detect accidents on its own, and make decisions by sensing the different sounds in its environment!
Our project, Never Ending Learning of Sound, aims to make the machine continuously learn the sounds that exist by crawling the entire web. This will help the machine understand, sense, categorize, and model the relationships between different sounds. It is an effort to build one of the largest structured sound databases.
This is a difficult problem for three reasons:
1. The amount of sound data that must be processed
2. Interference between sounds
3. Validation of the results with minimal human intervention
Papers by Ankit Shah

Pipelined implementation of high radix adaptive CORDIC as a coprocessor
2015 International Conference on Computing and Network Communications (CoCoNet), 2015
The Coordinate Rotational Digital Computer (CORDIC) algorithm allows computation of trigonometric, hyperbolic, natural log, and square root functions. This iterative algorithm uses only shift and add operations to converge. Multiple fixed-radix variants of the algorithm have been implemented in hardware; these have demonstrated faster convergence at the expense of reduced accuracy. High-radix adaptive variants of CORDIC also exist in the literature. These allow faster convergence, at the expense of hardware multipliers in the datapath, without compromising the accuracy of the results. This paper proposes a 12-stage deep pipeline architecture to implement a high-radix adaptive CORDIC algorithm. It employs floating-point multipliers in place of the conventional shift-and-add architecture of fixed-radix CORDIC. The design has been synthesised on an FPGA board to act as a coprocessor. The paper also studies the power, latency, and accuracy of this implementation.
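As an aside, the fixed-radix shift-and-add iteration that this paper's pipeline generalizes can be sketched in a few lines. The snippet below is a minimal floating-point simulation of radix-2 CORDIC in rotation mode (function and variable names are illustrative; real hardware would use fixed-point arithmetic and barrel shifters rather than floating-point multiplies):

```python
import math

def cordic_sin_cos(theta, iterations=24):
    """Radix-2 CORDIC in rotation mode: returns (cos(theta), sin(theta))
    using only shifts, adds, and a small table of arctangent constants.
    Converges for |theta| <= sum(atan(2^-i)) ~ 1.74 rad."""
    # Precomputed arctan(2^-i) micro-rotation angles and the CORDIC gain K.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    K = 1.0
    for i in range(iterations):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = K, 0.0, theta  # start on the x-axis, pre-scaled by the gain
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0  # rotate toward the residual angle
        # The multiplications by 2^-i model what hardware does with shifts.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x, y  # (cos(theta), sin(theta))
```

Each iteration rotates the vector by a fixed ±atan(2⁻ⁱ), which is exactly the multiplier-free property that the high-radix adaptive variants trade away (reintroducing multipliers) in exchange for faster convergence.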

arXiv (Cornell University), May 22, 2023
Learning with reduced labeling standards, such as noisy labels, partial labels, and multiple label candidates, which we generically refer to as imprecise labels, is a commonplace challenge in machine learning tasks. Previous methods tend to propose specific designs for every emerging imprecise label configuration, which is usually unsustainable when multiple configurations of imprecision coexist. In this paper, we introduce imprecise label learning (ILL), a framework for the unification of learning with various imprecise label configurations. ILL leverages expectation-maximization (EM) to model the imprecise label information, treating the precise labels as latent variables. Instead of approximating the correct labels for training, it considers the entire distribution of all possible labelings entailed by the imprecise information. We demonstrate that ILL can seamlessly adapt to partial label learning, semi-supervised learning, noisy label learning, and, more importantly, a mixture of these settings. Notably, ILL surpasses the existing specified techniques for handling imprecise labels, marking the first unified framework with robust and effective performance across various challenging settings. We hope our work will inspire further research on this topic, unleashing the full potential of ILL in wider scenarios where precise labels are expensive and complicated to obtain.

arXiv (Cornell University), Apr 10, 2022
Lately, there has been a global effort by multiple research groups to detect COVID-19 from voice. Different researchers use different kinds of information from the voice signal to achieve this. Various types of phonated sounds and the sounds of cough and breath have all been used with varying degrees of success in automated voice-based COVID-19 detection apps. In this paper, we show that detecting COVID-19 from voice does not require custom-made, nonstandard features or complicated neural network classifiers; rather, it can be done successfully with just standard features and simple binary classifiers. In fact, we show that the latter are not only more accurate and interpretable but also more computationally efficient, in that they can be run locally on small devices. We demonstrate this on a human-curated dataset of over 1000 subjects, collected and calibrated in clinical settings.
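The paper's argument, that standard features plus a simple binary classifier suffice, can be illustrated with a toy sketch. Everything below is an assumption for illustration: synthetic feature vectors stand in for standard acoustic features, and a plain logistic regression trained by gradient descent stands in for the paper's classifiers.

```python
import numpy as np

# Synthetic stand-in for standard fixed-length acoustic feature vectors:
# two classes drawn from shifted Gaussians (dimensions are arbitrary).
rng = np.random.default_rng(0)
n, d = 200, 8
X_pos = rng.normal(loc=1.0, size=(n, d))   # "positive" class features
X_neg = rng.normal(loc=-1.0, size=(n, d))  # "negative" class features
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n), np.zeros(n)])

# A simple binary classifier: logistic regression via gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
    w -= 0.5 * (X.T @ (p - y)) / len(y)     # gradient step on weights
    b -= 0.5 * np.mean(p - y)               # gradient step on bias

acc = np.mean(((X @ w + b) > 0) == (y == 1))  # training accuracy
```

A model this small has a closed, interpretable decision rule (a single weighted sum), which is what makes it cheap enough to run locally on small devices.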
ArXiv, 2021
This paper reflects on the effect of several categories of medical conditions on human voice, focusing on those that may be hypothesized to have effects on voice, but for which the changes themselves may be subtle enough to have eluded observation in standard analytical examinations of the voice signal. It presents three categories of techniques that can potentially uncover such elusive biomarkers and allow them to be measured and used for predictive and diagnostic purposes. These approaches include proxy techniques, model-based analytical techniques and data-driven AI techniques.
ArXiv, 2016
In this paper we present our work on Task 1 Acoustic Scene Classification and Task 3 Sound Event Detection in Real Life Recordings. Among our experiments we have low-level and high-level features, classifier optimization and other heuristics specific to each task. Our performance for both tasks improved on the DCASE baseline: for Task 1 we achieved an overall accuracy of 78.9% compared to the baseline of 72.6%, and for Task 3 we achieved a segment-based error rate of 0.76 compared to the baseline of 0.91.

In this paper, we introduce the imprecise label learning (ILL) framework, a unified approach to handling various imprecise label configurations, which are commonplace challenges in machine learning tasks. ILL leverages an expectation-maximization (EM) algorithm for maximum likelihood estimation (MLE) of the imprecise label information, treating the precise labels as latent variables. Compared to previous versatile methods that attempt to infer correct labels from the imprecise label information, our ILL framework considers all possible labelings imposed by the imprecise label information, allowing a unified solution to deal with any imprecise labels. With comprehensive experimental results, we demonstrate that ILL can seamlessly adapt to various situations, including partial label learning, semi-supervised learning, noisy label learning, and a mixture of these settings. Notably, our simple method surpasses the existing techniques for handling imprecise labels, marking the first unified framework with robust and effective performance across various imprecise labels. We believe that our approach has the potential to significantly enhance the performance of machine learning models on tasks where obtaining precise labels is expensive and complicated. We hope our work will inspire further research on this topic with an open-source codebase release.
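The core EM idea, treating the precise label as a latent variable and spreading posterior mass over every labeling the imprecise information permits, can be sketched on a toy partial-label problem. All data and names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Toy partial-label setup: each sample carries a candidate label set
# (the "imprecise" information); the true label is latent.
candidates = [{0, 1}, {1}, {1, 2}, {0}, {1, 2}]
k = 3
pi = np.full(k, 1.0 / k)  # current class priors (the model parameters)

for _ in range(50):
    # E-step: posterior over each sample's candidate set under pi,
    # i.e. mass is spread over ALL labelings the candidates permit.
    post = np.zeros((len(candidates), k))
    for i, cs in enumerate(candidates):
        for c in cs:
            post[i, c] = pi[c]
        post[i] /= post[i].sum()
    # M-step: re-estimate the priors from the expected label counts.
    pi = post.mean(axis=0)
```

Here the "model" is just a class prior, but the same E-step/M-step structure applies when the posterior comes from a neural network's predictions restricted to each candidate set.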
Detection and Classification of Acoustic Scenes and Events 2017, 2017
DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation using multilayer perceptrons and log mel-energies, but differ in the structure of the output layer and the decision-making process, as well as the evaluation of system output using task-specific metrics.

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
The largest source of sound events is web videos. Most videos lack sound event labels at the segment level; however, a significant number of them do respond to text queries, from a match found using metadata by search engines. In this paper we explore the extent to which a search query can be used as the true label for detection of sound events in videos. We present a framework for large-scale sound event recognition on web videos. The framework crawls videos using search queries corresponding to 78 sound event labels drawn from three datasets. The datasets are used to train three classifiers, and we obtain predictions on 3.7 million web video segments. We evaluated performance using the search query as the true label and compared it with human labeling. Both types of ground truth exhibited close performance, to within 10%, and a similar performance trend with an increasing number of evaluated segments. Hence, our experiments show potential for using the search query as a preliminary true label for sound event recognition in web videos.

2019 International Conference on Multimodal Interaction, 2019
Suicide is one of the leading causes of death in the modern world. In this digital age, individuals increasingly use social media to express themselves, and often use these platforms to express suicidal intent. Various studies have inspected behavioral markers of suicidal intent in controlled environments, but it remains unexplored whether such markers generalize to suicidal intent expressed on social media. In this work, we set out to study multimodal behavioral markers related to suicidal intent when expressed in social media videos. We explore verbal, acoustic and visual behavioral markers in the context of identifying individuals at higher risk of a suicide attempt. Our analysis reveals that frequent silences, slouched shoulders, rapid hand movements and profanity are predominant multimodal behavioral markers indicative of suicidal intent.
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020
We present a comparative analysis of the performance of state-of-the-art sound event detection systems. In particular, we study the robustness of the systems to noise and signal degradation, which is known to impact model generalization. Our analysis is based on the results of Task 4 of the DCASE 2019 challenge, where submitted systems were evaluated on, in addition to real-world recordings, a series of synthetic soundscapes that allow us to carefully control for different soundscape characteristics. Our results show that while overall systems exhibit significant improvements compared to previous work, they still suffer from biases that could prevent them from generalizing to real-world scenarios.
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019
In the last couple of years, weakly labeled learning has turned out to be an exciting approach for audio event detection. In this work, we introduce webly labeled learning for sound events, which aims to remove human supervision altogether from the learning process. We first develop a method of obtaining labeled audio data from the web (albeit noisy), in which no manual labeling is involved. We then describe methods to efficiently learn from these webly labeled audio recordings. In our proposed system, WeblyNet, two deep neural networks co-teach each other to robustly learn from webly labeled data, leading to around 17% relative improvement over the baseline method. The method also involves transfer learning to obtain efficient representations.
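The co-teaching step at the heart of a system like WeblyNet can be sketched as a sample-selection exchange. The snippet below is a hypothetical illustration (synthetic losses, assumed names), not the paper's code: each model ranks a batch by its own loss and hands the small-loss subset to its peer, on the premise that noisy webly labels tend to incur large losses.

```python
import numpy as np

# Per-sample losses for one batch under two co-trained models
# (synthetic numbers standing in for real network losses).
rng = np.random.default_rng(2)
loss_a = rng.random(16)  # losses under model A
loss_b = rng.random(16)  # losses under model B
keep = 8                 # how many presumed-clean samples to exchange

# Each model selects its small-loss samples FOR THE PEER's update,
# so neither model reinforces its own label-noise mistakes.
idx_for_b = np.argsort(loss_a)[:keep]  # A's picks, used to update B
idx_for_a = np.argsort(loss_b)[:keep]  # B's picks, used to update A
```

In a full training loop these index sets would select the samples whose gradients each peer network applies in that step.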

2017 25th European Signal Processing Conference (EUSIPCO), 2017
Audio Event Detection (AED) aims to recognize sounds within audio and video recordings. AED employs machine learning algorithms commonly trained and tested on annotated datasets. However, available datasets are limited in the number of samples, and hence it is difficult to model acoustic diversity. Therefore, we propose combining labeled audio from a dataset with unlabeled audio from the web to improve the sound models. The audio event detectors are trained on the labeled audio and run on unlabeled audio downloaded from YouTube. Whenever the detectors recognized any of the known sounds with high confidence, the unlabeled audio was used to retrain the detectors. The performance of the retrained detectors is compared with that of the original detectors using the annotated test set. Results showed an improvement of the AED and uncovered challenges of using web audio from videos.
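The retraining loop described above can be sketched with a toy detector. All data and names below are illustrative assumptions (a one-dimensional nearest-centroid "detector" stands in for the audio event detectors, and a margin threshold stands in for the high-confidence test):

```python
import numpy as np

# Small labeled set and a pool of unlabeled "web" features (synthetic).
rng = np.random.default_rng(1)
labeled_x = np.array([-2.0, -1.5, 1.5, 2.0])
labeled_y = np.array([0, 0, 1, 1])
unlabeled_x = np.concatenate([rng.normal(-2, 0.3, 50), rng.normal(2, 0.3, 50)])

def centroids(x, y):
    """Train the toy detector: one centroid per class."""
    return np.array([x[y == c].mean() for c in (0, 1)])

c = centroids(labeled_x, labeled_y)
# Run the detector on the unlabeled pool; confidence is the margin
# between the distances to the two class centroids.
dist = np.abs(unlabeled_x[:, None] - c[None, :])
pred = dist.argmin(axis=1)
margin = np.abs(dist[:, 0] - dist[:, 1])
keep = margin > 1.0  # assumed "high confidence" threshold

# Retrain on the labeled data plus the confident pseudo-labels.
new_x = np.concatenate([labeled_x, unlabeled_x[keep]])
new_y = np.concatenate([labeled_y, pred[keep]])
c_retrained = centroids(new_x, new_y)
```

The threshold is the crux of the method: set too low, label noise from the web audio leaks into retraining; set too high, almost no new data is added.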
Thesis Chapters by Ankit Shah
CORDIC is an iterative algorithm that avoids the use of hardware multipliers when estimating functions such as sine and logarithm. We extend the results of Elguibaly et al. [1], developing a high-radix adaptive CORDIC algorithm that enhances traditional CORDIC by an average speed-up of 2s, where the factor s is the number of leading bits of the result mantissa to be approximated. We analyze the conditions for achieving the speed-up through simulations and an FPGA implementation on a Xilinx Virtex VI, providing a comprehensive understanding of the algorithm. The project proposes a hardware architecture implementing HCORDIC as a math coprocessor for an existing general-purpose computer or a DSP.