Skip to main content

Qiang HUO

Followers

11

Following

1

Public Views

Interests

Uploads

Papers by Qiang HUO

An HMM Compensatioon Approach for Dynamic Features Using Unscented Transformation and Its Application to Noisy Speech Recognition

In our previous work, a new HMM compensation approach for static MFCC features was proposed by us... more In our previous work, a new HMM compensation approach for static MFCC features was proposed by using a technique called Unscented Transformation (UT). Three implementations of the UT approach with different computational complexities were evaluated on Aurora2 connected digits database, and significant performance improvements were achieved compared to log-normalapproximation-based PMC (Parallel Model Combination) and firstorder-approximation-based VTS (Vector Taylor Series) approaches. In this paper, we extend our UT-based formulation to compensating for HMM parameters corresponding to both static and dynamic features. New experimental results on Aurora2 task are reported to demonstrate the effectiveness of the proposed UT approach.

A Study on Effects of Implicit and Explicit Language Model Information for DBLSTM-CTC Based Handwriting Recognition

arXiv (Cornell University), Jul 31, 2020

Deep Bidirectional Long Short-Term Memory (DBLSTM) with a Connectionist Temporal Classification (... more Deep Bidirectional Long Short-Term Memory (DBLSTM) with a Connectionist Temporal Classification (CTC) output layer has been established as one of the state-of-the-art solutions for handwriting recognition. It is well-known that the DBLSTM trained by using a CTC objective function will learn both local character image dependency for character modeling and long-range contextual dependency for implicit language modeling. In this paper, we study the effects of implicit and explicit language model information for DBLSTM-CTC based handwriting recognition by comparing the performance of using or without using an explicit language model in decoding. It is observed that even using one million lines of training sentences to train the DBLSTM, using an explicit language model is still helpful. To deal with such a large-scale training problem, a GPU-based training tool has been developed for CTC training of DBLSTM by using a mini-batch based epochwise Back Propagation Through Time (BPTT) algorithm.

Online adaptive learning of continuous-density hidden Markov models based on multiple-stream prior evolution and posterior pooling

IEEE Transactions on Speech and Audio Processing, May 1, 2001

APRNet: Attention-based Pixel-wise Rendering Network for Photo-Realistic Text Image Generation

arXiv (Cornell University), Mar 15, 2022

Style-guided text image generation tries to synthesize text image by imitating reference image's ... more Style-guided text image generation tries to synthesize text image by imitating reference image's appearance while keeping text content unaltered. The text image appearance includes many aspects. In this paper, we focus on transferring style image's background and foreground color patterns to the content image to generate photo-realistic text image. To achieve this goal, we propose 1) a content-style cross attention based pixel sampling approach to roughly mimicking the style text image's background; 2) a pixel-wise style modulation technique to transfer varying color patterns of the style image to the content image spatial-adaptively; 3) a cross attention based multi-scale style fusion approach to solving text foreground misalignment issue between style and content images; 4) an image patch shuffling strategy to create style, content and ground truth image tuples for training. Experimental results on Chinese handwriting text image synthesis with SCUT-HCCDoc and CASIA-OLHWDB datasets demonstrate that the proposed method can improve the quality of synthetic text images and make them more photo-realistic.

A Question-Answering Approach to Key Value Pair Extraction from Form-Like Document Images

Proceedings of the AAAI Conference on Artificial Intelligence

In this paper, we present a new question-answering (QA) based key-value pair extraction approach,... more In this paper, we present a new question-answering (QA) based key-value pair extraction approach, called KVPFormer, to robustly extracting key-value relationships between entities from form-like document images. Specifically, KVPFormer first identifies key entities from all entities in an image with a Transformer encoder, then takes these key entities as questions and feeds them into a Transformer decoder to predict their corresponding answers (i.e., value entities) in parallel. To achieve higher answer prediction accuracy, we propose a coarse-to-fine answer prediction approach further, which first extracts multiple answer candidates for each identified question in the coarse stage and then selects the most likely one among these candidates in the fine stage. In this way, the learning difficulty of answer prediction can be effectively reduced so that the prediction accuracy can be improved. Moreover, we introduce a spatial compatibility attention bias into the self-attention/cross-a...

TSRFormer: Table Structure Recognition with Transformers

arXiv (Cornell University), Aug 9, 2022

We present a new table structure recognition (TSR) approach, called TSRFormer, to robustly recogn... more We present a new table structure recognition (TSR) approach, called TSRFormer, to robustly recognizing the structures of complex tables with geometrical distortions from various table images. Unlike previous methods, we formulate table separation line prediction as a line regression problem instead of an image segmentation problem and propose a new two-stage DETR based separator prediction approach, dubbed Separator REgression TRansformer (SepRETR), to predict separation lines from table images directly. To make the two-stage DETR framework work efficiently and effectively for the separation line prediction task, we propose two improvements: 1) A prior-enhanced matching strategy to solve the slow convergence issue of DETR; 2) A new cross attention module to sample features from a high-resolution convolutional feature map directly so that high localization accuracy is achieved with low computational cost. After separation line prediction, a simple relation network based cell merging module is used to recover spanning cells. With these new techniques, our TSRFormer achieves state-of-the-art performance on several benchmark datasets, including SciTSR, PubTabNet and WTW. Furthermore, we have validated the robustness of our approach to tables with complex structures, borderless cells, large blank spaces, empty or spanning cells as well as distorted or even curved shapes on a more challenging real-world in-house dataset.

Model adaptation based on HMM decomposition for reverberant speech recognition

1997 IEEE International Conference on Acoustics, Speech, and Signal Processing

The performance of a speech recognizer is degraded drastically in reverberant environments. We pr... more The performance of a speech recognizer is degraded drastically in reverberant environments. We proposed a novel algorithm which can model an observation signal by composition of HMMs of clean speech, noise and an acoustic transfer function(l]. However, how to estimate HMM parameters of the acoustic transfer function is a remaining serious problem. In our previous paperll], we measured real impulse responses of training positions in an experiment room. It is inconvenient and unrealistic to measure impulse responses for every possible new experiment room. This paper presents a new method to estimate HMM parameters of the acoustic transfer function from some adaptation data by using an HMM decomposition algorithm which is an inverse process of the HMM composition. Its effectiveness is confirmed by a series of speaker dependent and independent word recognition experiments on simulated distant-talking speech data.

On-line adaptive learning of the correlated continuous density hidden Markov models for speech recognition

4th International Conference on Spoken Language Processing (ICSLP 1996), Oct 3, 1996

We extend our previously proposed quasi-Bayes adaptive learning framework to cope with the correl... more We extend our previously proposed quasi-Bayes adaptive learning framework to cope with the correlated continuous density hidden Markov models (HMM's) with Gaussian mixture state observation densities in which all mean vectors are assumed to be correlated and have a joint prior distribution. A successive approximation algorithm is proposed to implement the correlated mean vectors' updating. As an example, by applying the method to on-line speaker adaptation application, the algorithm is experimentally shown to be asymptotically convergent as well as being able to enhance the efficiency and the effectiveness of the Bayes learning by taking into account the correlation information between different model parameters. The technique can be used to cope with the time-varying nature of some acoustic and environmental variabilities, including mismatches caused by changing speakers, channels, transducers, environments, and so on.

An Anchor-Free Region Proposal Network for Faster R-CNN based Text Detection Approaches

arXiv (Cornell University), Apr 24, 2018

The anchor mechanism of Faster R-CNN and SSD framework is considered not effective enough to scen... more The anchor mechanism of Faster R-CNN and SSD framework is considered not effective enough to scene text detection, which can be attributed to its IoU based matching criterion between anchors and ground-truth boxes. In order to better enclose scene text instances of various shapes, it requires to design anchors of various scales, aspect ratios and even orientations manually, which makes anchorbased methods sophisticated and inefficient. In this paper, we propose a novel anchor-free region proposal network (AF-RPN) to replace the original anchor-based RPN in the Faster R-CNN framework to address the above problem. Compared with a vanilla RPN and FPN-RPN, AF-RPN can get rid of complicated anchor design and achieve higher recall rate on large-scale COCO-Text dataset. Owing to the high-quality text proposals, our Faster R-CNN based two-stage text detection approach achieves state-of-the-art results on ICDAR-2017 MLT, ICDAR-2015 and ICDAR-2013 text detection benchmarks when using single-scale and singlemodel (ResNet50) testing only.

Building Handwriting Recognizers by Leveraging Skeletons of Both Offline and Online Samples

2015 13th International Conference on Document Analysis and Recognition (ICDAR), 2015

We present an approach to leveraging both offline and online handwriting samples to build a singl... more We present an approach to leveraging both offline and online handwriting samples to build a single recognizer for recognizing both offline and online handwritings. Given a training set of offline handwriting samples and another set of online hand writing samples, a skeleton is derived first from each offline hand writing sample via vectorization. Then both the skeleton samples and online handwriting samples are normalized and rendered by using the same method to generate a combined training set of skeleton images. Finally a handwriting recognizer based on Deep Bidirectional Long Short-Term Memory (DBLSTM) and Hidden Markov Model (HMM) is built from the skeleton images. In recognition, a preprocessing step consistent with that in training is applied to an unknown offline or online handwriting sample to derive a skeleton image, which is recognized by the hybrid DBLSTM-HMM handwriting recognition system accordingly. We have built such a recognizer by using lAM benchmark databases of offline and online English handwritings plus an internal online handwriting corpus, which outperforms the recognizers built from either offline or online handwriting samples only.

Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016

We present a new approach to scalable training of deep learning machines by incremental block tra... more We present a new approach to scalable training of deep learning machines by incremental block training with intra-block parallel optimization to leverage data parallelism and blockwise model-update filtering to stabilize learning process. By using an implementation on a distributed GPU cluster with an MPI-based HPC machine learning framework to coordinate parallel job scheduling and collective communication, we have trained successfully deep bidirectional long short-term memory (LSTM) recurrent neural networks (RNNs) and fully-connected feed-forward deep neural networks (DNNs) for large vocabulary continuous speech recognition on two benchmark tasks, namely 309-hour Switchboard-I task and 1,860-hour "Switch-board+Fisher" task. We achieve almost linear speedup up to 16 GPU cards on LSTM task and 64 GPU cards on DNN task, with either no degradation or improved recognition accuracy in comparison with that of running a traditional mini-batch based stochastic gradient descent training on a single GPU.

A study on effects of implicit and explicit language model information for DBLSTM-CTC based handwriting recognition

2015 13th International Conference on Document Analysis and Recognition (ICDAR), 2015

Deep Bidirectional Long Short-Term Memory (DBLSTM) with a Connectionist Temporal Classification (... more Deep Bidirectional Long Short-Term Memory (DBLSTM) with a Connectionist Temporal Classification (CTC) output layer has been established as one of the state-of-the-art solutions for handwriting recognition. It is well-known that the DBLSTM trained by using a CTC objective function will learn both local character image dependency for character modeling and long-range contextual dependency for implicit language modeling. In this paper, we study the effects of implicit and explicit language model information for DBLSTM-CTC based handwriting recognition by comparing the performance of using or without using an explicit language model in decoding. It is observed that even using one million lines of training sentences to train the DBLSTM, using an explicit language model is still helpful. To deal with such a large-scale training problem, a GPU-based training tool has been developed for CTC training of DBLSTM by using a mini-batch based epochwise Back Propagation Through Time (BPTT) algorithm.

ViBERTgrid: A Jointly Trained Multi-modal 2D Document Representation for Key Information Extraction from Documents

Document Analysis and Recognition – ICDAR 2021, 2021

Recent grid-based document representations like BERTgrid allow the simultaneous encoding of the t... more Recent grid-based document representations like BERTgrid allow the simultaneous encoding of the textual and layout information of a document in a 2D feature map so that state-of-the-art image segmentation and/or object detection models can be straightforwardly leveraged to extract key information from documents. However, such methods have not achieved comparable performance to state-of-the-art sequence-and graph-based methods such as LayoutLM and PICK yet. In this paper, we propose a new multi-modal backbone network by concatenating a BERTgrid to an intermediate layer of a CNN model, where the input of CNN is a document image and the BERTgrid is a grid of word embeddings, to generate a more powerful grid-based document representation, named ViBERTgrid. Unlike BERTgrid, the parameters of BERT and CNN in our multimodal backbone network are trained jointly. Our experimental results demonstrate that this joint training strategy improves significantly the representation ability of ViBERTgrid. Consequently, our ViBERTgrid-based key information extraction approach has achieved state-of-the-art performance on real-world datasets.

Proceedings of the 30th ACM International Conference on Multimedia

We present a new table structure recognition (TSR) approach, called TSRFormer, to robustly recogn... more We present a new table structure recognition (TSR) approach, called TSRFormer, to robustly recognizing the structures of complex tables with geometrical distortions from various table images. Unlike previous methods, we formulate table separation line prediction as a line regression problem instead of an image segmentation problem and propose a new two-stage DETR based separator prediction approach, dubbed Separator REgression TRansformer (SepRETR), to predict separation lines from table images directly. To make the two-stage DETR framework work efficiently and effectively for the separation line prediction task, we propose two improvements: 1) A prior-enhanced matching strategy to solve the slow convergence issue of DETR; 2) A new cross attention module to sample features from a high-resolution convolutional feature map directly so that high localization accuracy is achieved with low computational cost. After separation line prediction, a simple relation network based cell merging module is used to recover spanning cells. With these new techniques, our TSRFormer achieves state-of-the-art performance on several benchmark datasets, including SciTSR, PubTabNet and WTW. Furthermore, we have validated the robustness of our approach to tables with complex structures, borderless cells, large blank spaces, empty or spanning cells as well as distorted or even curved shapes on a more challenging real-world in-house dataset. CCS CONCEPTS • Applied computing → Document analysis; • Computing methodologies → Computer vision.

Discriminative training of HMM based speech recognizer with gradient projection method

4th European Conference on Speech Communication and Technology (Eurospeech 1995)

Classification Approach to Robust Speech Recognition

Compact and Efficient WFST-Based Decoders for Handwriting Recognition

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017

We present two weighted finite-state transducer (WFST) based decoders for handwriting recognition... more We present two weighted finite-state transducer (WFST) based decoders for handwriting recognition. One decoder is a cloud-based solution that is both compact and efficient. The other is a device-based solution that has a small memory footprint. A compact WFST data structure is proposed for the cloud-based decoder. There are no output labels stored on transitions of the compact WFST. A decoder based on the compact WFST data structure produces the same result with significantly less footprint compared with a decoder based on the corresponding standard WFST. For the device-based decoder, on-the-fly language model rescoring is performed to reduce footprint. Careful engineering methods, such as WFST weight quantization, token and data type refinement, are also explored. When using a language model containing 600,000 n-grams, the cloud-based decoder achieves an average decoding time of 4.04 ms per text line with a peak footprint of 114.4 MB, while the device-based decoder achieves an average decoding time of 13.47 ms per text line with a peak footprint of 31.6 MB.

Improved localization accuracy by LocNet for Faster R-CNN based text detection in natural scene images

Pattern Recognition, 2019

Although Faster R-CNN based text detection approaches have achieved promising results, their loca... more Although Faster R-CNN based text detection approaches have achieved promising results, their localization accuracy is not satisfactory in certain cases due to their sub-optimal bounding box regression based localization modules. In this paper, we address this problem and propose replacing the bounding box regression module with a novel LocNet based localization module to improve the localization accuracy of a Faster R-CNN based text detector. Given a proposal generated by a region proposal network (RPN), instead of directly predicting the bounding box coordinates of the concerned text instance, the proposal is enlarged to create a search region so that an "In-Out" conditional probability to each row and column of this search region is assigned, which can then be used to accurately infer the concerned bounding box. Furthermore, we present a simple yet effective two-stage approach to convert the difficult multioriented text detection problem to a relatively easier horizontal text detection problem, which makes our approach able to robustly detect multi-oriented text instances with accurate bounding box localization. Experiments demonstrate that the proposed approach boosts the localization accuracy of Faster R-CNN based text detectors significantly. Consequently, our new text detector has achieved superior performance on both horizontal (ICDAR-2011, ICDAR-2013 and MULTILIGUL) and multi-oriented (MSRA-TD500, ICDAR-2015) text detection benchmark tasks.

A minimax search algorithm for robust continuous speech recognition

IEEE Transactions on Speech and Audio Processing, 2000

In this paper, we propose a novel implementation of a minimax decision rule for continuous densit... more In this paper, we propose a novel implementation of a minimax decision rule for continuous density hidden Markovmodel-based robust speech recognition. By combining the idea of the minimax decision rule with a normal Viterbi search, we derive a recursive minimax search algorithm, where the minimax decision rule is repetitively applied to determine the partial paths during the search procedure. Because of its intrinsic nature of a recursive search, the proposed method can be easily extended to perform continuous speech recognition. Experimental results on Japanese isolated digits and TIDIGITS, where the mismatch between training and testing conditions is caused by additive white Gaussian noise, show the viability and efficiency of the proposed minimax search algorithm.

Robust Table Detection and Structure Recognition from Heterogeneous Document Images

ArXiv, 2022

We introduce a new table detection and structure recognition approach named RobusTabNet to detect... more We introduce a new table detection and structure recognition approach named RobusTabNet to detect the boundaries of tables and reconstruct the cellular structure of the table from heterogeneous document images. For table detection, we propose to use CornerNet as a new region proposal network to generate higher quality table proposals for Faster RCNN, which has significantly improved the localization accuracy of Faster R-CNN for table detection. Consequently, our table detection approach achieves state-of-the-art performance on three public table detection benchmarks, namely cTDaR TrackA, PubLayNet and IIIT-AR-13K, by only using a lightweight ResNet-18 backbone network. Furthermore, we propose a new split-and-merge based table structure recognition approach, in which a novel spatial CNN based separation line prediction module is proposed to split each detected table into a grid of cells, and a Grid CNN based cell merging module is applied to recover the spanning cells. As the spatial...