Tibetan Studies and Digital Tibetan by Kurt Keutzer
Regarding this inconsistency in spelling (tshon/mtshon), Tibeto-Burman linguist Nathan Hill offers the comforting comment "m- comes and goes a lot" (personal communication). 7. Katha Upaniṣad 4.13: aṅguṣṭhamātraḥ puruṣo jyotir iva adhūmakaḥ / īśāno bhūtabhavyasya sa evādya sa u śvaḥ //.

ACM Transactions on Asian and Low-Resource Language Information Processing, 2021
Over the past decade, through a mixture of optical character recognition and manual input, a growing corpus of Tibetan literature has become available as e-texts in Unicode format. With the creation of such a corpus, the techniques of text analytics that have been applied in the analysis of English and other modern languages may now be applied to Tibetan. In this work, we narrow our focus to examine a modest portion of that literature: the Mind-section portion of the literature of the Tibetan tradition of the Great Perfection. Here, we use the lens of text analytics tools based on machine learning techniques to investigate a number of questions of interest to scholars of this and related traditions of the Great Perfection. It has been necessary for us to participate in all portions of this process: corpora identification and text edition selection, rendering the texts as e-texts in Unicode using both optical character recognition and manual entry, data cleaning and transformation, implementation of software for text analysis, and interpretation of results. For this reason, we hope this study can serve as a model for other low-resource languages that are just beginning to approach the problem of providing text analytics for their language.
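To make the kind of analysis described above concrete, here is a minimal sketch of one text-analytics step: TF-IDF features over tsheg-delimited syllables, followed by clustering of a few e-texts. The file names, the two-cluster setting, and the syllable-level tokenizer are illustrative assumptions, not the pipeline the paper actually used.

```python
# Minimal sketch: TF-IDF over tsheg-delimited syllables, then clustering.
# File names and cluster count are placeholders, not the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

def tibetan_syllables(text):
    # Split on the tsheg (U+0F0B), the Tibetan syllable delimiter,
    # dropping empty fragments left by shads and whitespace.
    return [s.strip() for s in text.split("\u0f0b") if s.strip()]

docs = [open(p, encoding="utf-8").read()
        for p in ["text1.txt", "text2.txt", "text3.txt"]]  # hypothetical e-texts

vec = TfidfVectorizer(analyzer=tibetan_syllables)
X = vec.fit_transform(docs)

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())
print(labels)  # cluster assignment per text
```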

Revue d’Etudes Tibétaines, 2012
In sGa ston's list of the Southern Treasures discovered by gShen chen Klu dga', a series of texts referred to as the Facets of Mind, Nine Minor Texts on Mind is mentioned. The Bon tradition has acknowledged from that time to the present day that these are seminal texts in the literature of Bon. Furthermore, these texts would eventually be classified as the exemplary works of the Mind Section of Bon Dzogchen. Nevertheless, the precise content of these texts has been unclear to modern scholars, both Tibetan and Western, working outside of Tibet. With the publication in 1999 of Mongyal Lhase's edition of the Bon Kangyur, as well as with other subsequent publications, we are now in a better position to identify and understand these works. The aim of this paper is to clearly identify the titles of these texts, to identify the various editions in which they are available, and to begin to understand how they work together with tantric elements to form a holistic system of training.

The use of advanced computational methods for the analysis of large corpora of electronic texts is becoming increasingly popular in humanities and social science research. Unfortunately, Tibetan Studies has lacked such a repository of electronic, searchable texts. The automated recognition of printed texts, known as Optical Character Recognition (OCR), offers a solution to this problem; however, until recently, robust OCR systems for the Tibetan language have not been available. In this paper, we introduce a new system, called Namsel, which uses OCR to support the production, review, and distribution of searchable Tibetan texts at a large scale. Namsel tackles a number of challenges unique to the recognition of complex scripts such as Tibetan uchen and has been able to achieve high accuracy rates on a wide range of machine-printed works. In this paper, we discuss the details of Tibetan OCR, how Namsel works, and the problems it is able to solve. We also discuss the collaborative work between Namsel and its partner libraries aimed at building a comprehensive database of historical and modern Tibetan works—a database that consists of more than one million pages of texts spanning over a thousand years of literary production.
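As a rough illustration of the first stage of such an OCR system, the sketch below binarizes a page image and extracts connected components as candidate glyphs. This is not Namsel's actual method (uchen letter stacks routinely merge several characters into one component, which is one of the hard problems the paper addresses); the file name and line-bucketing heuristic are placeholders.

```python
# Minimal sketch of an OCR front end: binarize a page image, label
# connected components, and order candidate glyph boxes for recognition.
import numpy as np
from scipy import ndimage
from PIL import Image

page = np.array(Image.open("page.png").convert("L"))  # placeholder file
ink = page < 128                       # dark pixels on a light page

labels, n = ndimage.label(ink)         # connected components
boxes = ndimage.find_objects(labels)   # one bounding-box slice pair each

# Crude reading order: bucket rows into lines, then left-to-right.
# The 50-pixel line height is an arbitrary illustrative assumption.
order = sorted(range(n), key=lambda i: (boxes[i][0].start // 50,
                                        boxes[i][1].start))
print(f"{n} candidate glyph boxes")
```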

Recognition of Tibetan wood block prints with generalized hidden Markov and kernelized modified quadratic distance function
Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data - MOCR_AND '11, 2011
Recognition of Tibetan wood block prints is a difficult problem with many challenging steps. We propose a two-stage framework involving image preprocessing, which consists of noise removal and baseline detection, and simultaneous character segmentation and recognition with the aid of a generalized hidden Markov model (gHMM). For the latter stage, we train a gHMM and run the generalized Viterbi algorithm on our image to decode observations. There are two major motivations for using a gHMM. First, it incorporates a language model into our recognition system, which in turn enforces grammar and disambiguates classification errors caused by printing errors and image noise. Second, the gHMM solves the segmentation challenge: simply put, a gHMM is an HMM whose emission model allows multiple consecutive observations to be mapped to the same state. For the features of our emission model, we apply line and circle Hough transforms for stroke detection and use class-specific scaling for feature weighting. With the gHMM, we find the kernelized modified quadratic distance function (KMQDF) to be the most effective distance metric for discriminating character classes. The accuracy of our system is 91.29%.
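The recursion at the heart of this approach can be sketched compactly. Below is a minimal segmental ("generalized") Viterbi decoder in which one state may emit a run of consecutive observations, which is what lets segmentation and recognition happen jointly. The seg_logp callable stands in for the paper's KMQDF-based emission model, and log_trans for its language model; both are assumptions for illustration.

```python
# Minimal segmental (generalized) Viterbi: states may emit runs of
# observations, so decoding yields a joint segmentation + labeling.
import numpy as np

def gviterbi(n_obs, n_states, log_trans, seg_logp, max_len=4):
    # best[t, s]: best log-score of any segmentation of obs[0:t] ending in s
    best = np.full((n_obs + 1, n_states), -np.inf)
    back = {}
    best[0, :] = 0.0
    for t in range(1, n_obs + 1):
        for s in range(n_states):
            for l in range(1, min(max_len, t) + 1):
                # segment obs[t-l:t] emitted by state s, arriving from any state
                prev = best[t - l] + log_trans[:, s]
                p = prev.max() + seg_logp(t - l, t, s)
                if p > best[t, s]:
                    best[t, s] = p
                    back[t, s] = (t - l, int(prev.argmax()))
    # trace back the best final state into (start, end, state) segments
    t, s = n_obs, int(best[n_obs].argmax())
    segs = []
    while t > 0:
        t0, s0 = back[t, s]
        segs.append((t0, t, s))
        t, s = t0, s0
    return segs[::-1]
```

Under a uniform log_trans this degenerates to independent segment classification; the language model's contribution is exactly the coupling through log_trans.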
Deep Learning/Deep Neural Nets by Kurt Keutzer
2011 IEEE International Parallel & Distributed Processing Symposium, 2011
We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs, by up to 13x for tall-skinny matrices.
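The communication-avoiding structure for tall-skinny matrices can be illustrated with a two-level TSQR reduction, the building block CAQR rests on: factor independent row blocks locally, then run one small QR on the stacked R factors. This numpy sketch shows only that skeleton; the paper's contribution is mapping each step to compute-bound GPU kernels.

```python
# Minimal TSQR sketch: local QRs on row blocks, one combining QR on the
# stacked R factors, avoiding repeated passes over the tall matrix.
import numpy as np

def tsqr(A, n_blocks=4):
    blocks = np.array_split(A, n_blocks, axis=0)
    qs, rs = zip(*(np.linalg.qr(b) for b in blocks))   # independent local QRs
    Q2, R = np.linalg.qr(np.vstack(rs))                # combine small R factors
    # Stitch the implicit Q back together block by block.
    Q2_parts = np.split(Q2, np.cumsum([r.shape[0] for r in rs])[:-1], axis=0)
    Q = np.vstack([q @ p for q, p in zip(qs, Q2_parts)])
    return Q, R

A = np.random.randn(10000, 32)     # tall-skinny
Q, R = tsqr(A)
print(np.allclose(Q @ R, A))       # True (up to round-off)
```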

2013 IEEE International Conference on Image Processing, 2013
2D image convolution is ubiquitous in image processing and computer vision problems such as feature extraction. Exploiting parallelism is a common strategy for accelerating convolution. Parallel processors keep getting faster, but algorithms such as image convolution remain memory bound on parallel processors such as GPUs. Therefore, reducing memory communication is fundamental to accelerating image convolution. To reduce memory communication, we reorganize the convolution algorithm to prefetch image regions into registers, and we do more work per thread with fewer threads. To enable portability to future architectures, we implement a convolution autotuner that sweeps the design space of memory layouts and loop unrolling configurations. We focus on convolution with small filters (2x2-7x7), but our techniques can be extended to larger filter sizes. Depending on filter size, our speedups on two NVIDIA architectures range from 1.2x to 4.5x over state-of-the-art GPU libraries.
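The autotuning idea itself is small enough to sketch: time each point of a small design space and keep the fastest variant. Here the kernel is a shifted-window 2D correlation and the tuned parameter is the number of output rows processed per step; the real autotuner sweeps GPU memory layouts and unrolling factors, so everything below is a CPU-side stand-in.

```python
# Minimal autotuner sketch: benchmark a small design space of row-block
# sizes for a 2D correlation kernel and keep the fastest configuration.
import time
import numpy as np

def conv2d_rowblock(img, k, rows_per_step):
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for r0 in range(0, oh, rows_per_step):
        r1 = min(r0 + rows_per_step, oh)
        # Accumulate all kh*kw shifted views for this block of output rows
        # (correlation orientation; flip k for true convolution).
        acc = np.zeros((r1 - r0, ow))
        for i in range(kh):
            for j in range(kw):
                acc += k[i, j] * img[r0 + i:r1 + i, j:j + ow]
        out[r0:r1] = acc
    return out

img, k = np.random.randn(1024, 1024), np.random.randn(5, 5)
best = None
for rows in (8, 32, 128, 512):          # the design space being swept
    t = time.perf_counter()
    conv2d_rowblock(img, k, rows)
    dt = time.perf_counter() - t
    best = min(best or (dt, rows), (dt, rows))
print("fastest config: %d rows/step (%.3f s)" % (best[1], best[0]))
```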
PyCASP: Pattern-Based, Productive, Efficient and Portable Application Development on Parallel Platforms

Transformer-based models, like BERT and RoBERTa, have achieved state-of-the-art results in many natural language processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for many edge processors, and it has been a challenge to deploy these models for edge applications and devices that have resource constraints. While quantization can be a viable solution to this, previous work on quantizing Transformer-based models uses floating-point arithmetic during inference, thus limiting model deployment on many edge processors. In this work, we propose a novel integer-only quantization scheme for Transformer-based models that quantizes the entire inference process. In particular, we demonstrate how to approximate nonlinear operations in Transformer architectures, e.g., GELU, Softmax, and Layer Normalization, with lightweight integer computations. We use those approximations in our method, I-BERT, with an end-to-end integer-only inference, and...
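A minimal sketch of what "integer-only inference" means in practice for one linear layer: int8 operands, int32 accumulation, and requantization via a fixed-point multiplier precomputed offline, so no floating-point operation occurs at inference time. The scales below are made-up stand-ins for real calibration, and this omits the nonlinear-op approximations that are the paper's main contribution.

```python
# Minimal integer-only linear layer: int8 weights/activations, int32
# accumulate, fixed-point requantization (no float ops at inference).
import numpy as np

def quantize(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

# Offline: scales from (hypothetical) calibration, folded into one
# fixed-point multiplier used at inference.
sx, sw, sy = 0.02, 0.01, 0.05          # illustrative scales
m_fixed = int(round(sx * sw / sy * 2**16))

W = quantize(np.random.randn(64, 64), sw)
x = quantize(np.random.randn(64), sx)

acc = x.astype(np.int32) @ W.astype(np.int32).T   # int32 accumulation
y_q = (acc * m_fixed) >> 16                       # integer-only requantize
y_q = np.clip(y_q, -127, 127).astype(np.int8)
print(y_q[:8])
```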
Efficient Machine Learning by Kurt Keutzer
Clinically feasible reconstruction time for L1-SPIRiT parallel imaging and compressed sensing MRI
... Theory: L1-SPIRiT reconstruction requires solving a non-linear constrained optimization problem: minimize ... Using our OpenMP calibration and our CUDA POCS solver results in 97-second ... Our solvers are scalable to larger image sizes, more channels, and to larger processing ...
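Although the excerpt above is fragmentary, the POCS solver it mentions follows a standard projection pattern that can be sketched generically: alternate a sparsity projection with k-space data consistency. The single-coil, image-domain soft-thresholding below is a deliberate simplification; L1-SPIRiT additionally enforces SPIRiT calibration consistency across coils and thresholds in a wavelet domain.

```python
# Generic POCS loop for compressed-sensing MRI (single coil, simplified):
# alternate soft-thresholding with restoring acquired k-space samples.
import numpy as np

def pocs_cs(kspace, mask, iters=50, lam=0.02):
    img = np.fft.ifft2(kspace)          # zero-filled starting image
    for _ in range(iters):
        # 1) Sparsity projection: soft-threshold complex magnitudes.
        mag = np.abs(img)
        img = np.where(mag > 0, img / np.maximum(mag, 1e-12), 0) \
              * np.maximum(mag - lam, 0)
        # 2) Data consistency: keep the actually acquired k-space samples.
        k = np.fft.fft2(img)
        k[mask] = kspace[mask]
        img = np.fft.ifft2(k)
    return img
```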

The H.264 decoder has a sequential, control-intensive front end that makes it difficult to leverage the potential performance of emerging manycore processors. Preparsing is a functional parallelization technique to resolve this front-end bottleneck. However, the resulting parallel macroblock (MB) rendering tasks have highly input-dependent execution times and precedence constraints, which make them difficult to schedule efficiently on manycore processors. To address these issues, we propose a two-step approach: (i) a custom preparsing technique to resolve control dependencies in the input stream and expose MB-level data parallelism, and (ii) an MB-level scheduling technique to allocate and load-balance MB rendering tasks. The run-time MB-level scheduling increases the efficiency of parallel execution in the rest of the H.264 decoder, providing 60% speedup over greedy dynamic scheduling and 9-15% speedup over static compile-time scheduling for more than four processors. The preparsing technique coupled with run-time MB-level scheduling enables a potential 7x speedup for H.264 decoding.
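The precedence structure that makes MB-level scheduling possible is easy to make concrete. In H.264, macroblock (r, c) depends on neighbors to its left and upper right, so all MBs with the same wave index w = c + 2r are mutually independent. The sketch below enumerates those parallel waves; the paper's run-time scheduler additionally load-balances the highly variable MB task times, which this does not model.

```python
# Wavefront enumeration of H.264 macroblock dependencies:
# MB (r, c) depends on (r, c-1) and (r-1, c+1), so wave w = c + 2r
# groups mutually independent MBs into parallel batches.
from collections import defaultdict

def wavefronts(rows, cols):
    waves = defaultdict(list)
    for r in range(rows):
        for c in range(cols):
            waves[c + 2 * r].append((r, c))
    return [waves[w] for w in sorted(waves)]

for w, mbs in enumerate(wavefronts(4, 8)):
    print(f"wave {w}: {mbs}")   # each wave is a parallel batch of MB tasks
```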

To realize high performance, embedded applications are deployed on multiprocessor platforms tailored for an application domain. However, when a suitable platform is not available, only a few application niches can justify the increasing costs of an IC product design. An alternative is to design the multiprocessor on an FPGA. This retains the programmability advantage while obviating the risks of producing silicon, and it also opens FPGAs to the world of software designers. In this paper, we demonstrate the feasibility of FPGA-based multiprocessors for high-performance applications. We deploy IPv4 packet forwarding on a multiprocessor on the Xilinx Virtex-II Pro FPGA. The design achieves a 1.8 Gbps throughput and loses only 2.6X in performance (normalized to area) compared to an implementation on the Intel IXP-2800 network processor. We also develop a design space exploration framework using Integer Linear Programming to explore multiprocessor configurations for an application. Using this framework, we achieve a more efficient multiprocessor design surpassing the performance of our hand-tuned solution for packet forwarding.
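A toy version of the ILP formulation gives the flavor of such a design space exploration framework: binary variables assign tasks to processors, and the objective minimizes the bottleneck processor's load. The task set, cycle costs, and use of the PuLP/CBC solver are illustrative assumptions; the paper's actual formulation also models FPGA memories and interconnect.

```python
# Toy ILP for task-to-processor mapping: minimize the maximum
# per-processor load (makespan). Costs and task names are made up.
import pulp

tasks = {"parse": 3, "lookup": 7, "update": 4, "queue": 2}   # cycle costs
procs = range(3)

prob = pulp.LpProblem("mapping", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (tasks, procs), cat="Binary")
makespan = pulp.LpVariable("makespan", lowBound=0)

prob += makespan                                  # objective: bottleneck load
for t in tasks:                                   # each task on exactly one proc
    prob += pulp.lpSum(x[t][p] for p in procs) == 1
for p in procs:                                   # every proc's load <= makespan
    prob += pulp.lpSum(tasks[t] * x[t][p] for t in tasks) <= makespan

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for t in tasks:
    print(t, "->", next(p for p in procs if pulp.value(x[t][p]) > 0.5))
```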
Data-parallel large vocabulary continuous speech recognition on graphics processors

Tremendous compute throughput is becoming available in personal desktop and laptop systems through the use of graphics processing units (GPUs). However, exploiting this resource requires re-architecting an application to fit a data-parallel programming model. The complex graph traversal routines in the inference process for large vocabulary continuous speech recognition (LVCSR) have been considered by many as unsuitable for extensive parallelization. We explore and demonstrate a fully data-parallel implementation of a speech inference engine on NVIDIA's GTX280 GPU. Our implementation consists of two phases: a compute-intensive observation probability computation phase and a communication-intensive graph traversal phase. We take advantage of dynamic elimination of redundant computation in the compute-intensive phase while maintaining close-to-peak execution efficiency. We also demonstrate the importance of exploring application-level trade-offs in the communication-intensive graph traversal phase to adapt the algorithm to data-parallel execution on GPUs. On a 3.1-hour speech data set, we achieve more than an 11× speedup compared to a highly optimized sequential implementation on an Intel Core i7 without sacrificing accuracy.
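The compute-intensive phase has a particularly clean data-parallel shape, sketched below: diagonal-covariance Gaussian log-likelihoods for every acoustic state over a batch of frames, computed as one dense broadcasted expression (numpy here, GPU kernels in the paper). One Gaussian per state and the dimensions used are simplifying assumptions.

```python
# Batched observation probabilities: log N(feat_f ; mu_s, diag(var_s))
# for all frames f and states s at once, with no per-state loops.
import numpy as np

F, S, D = 64, 2000, 39                 # frames, states, feature dim (illustrative)
feats = np.random.randn(F, D)
mu = np.random.randn(S, D)
var = np.random.rand(S, D) + 0.1

log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(var).sum(axis=1))   # (S,)
diff2 = feats[:, None, :] - mu[None, :, :]                             # (F, S, D)
logp = log_norm[None, :] - 0.5 * (diff2**2 / var[None, :, :]).sum(axis=-1)
print(logp.shape)   # (64, 2000): one log-likelihood per (frame, state)
```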

IEEE Transactions on Medical Imaging, 2000
A fast response technique is developed to investigate the short-term post-program and post-erase discharge in Flash memory devices. The procedure is based on fast V_TH-evaluation methods developed for bias temperature instability and provides the transient characteristics after 20 ms under the program or erase conditions. The following different structures are investigated: 1) SiO2/high-k stacks; 2) charge trap memories; and 3) floating gate memories. Dielectrics targeted for Flash memory applications are used as charge trap layers and interpoly dielectrics. In this paper, we show results on Al2O3, DyScO, GdScO, and hexagonal and perovskite LuAlO. The post-program and post-erase curves hold useful information about the dielectric properties and are used as a fast screening technique for alternative materials. Index Terms: charge trap memory, high-k dielectrics, TaN-AlO-SiN-oxide-Si (TANOS), trap characterization.

IEEE Signal Processing Magazine, 2009
Parallel scalability allows an application to efficiently utilize an increasing number of processing elements. In this article, we explore a design space for parallel scalability for an inference engine in large vocabulary continuous speech recognition (LVCSR). Our implementation of the inference engine involves a parallel graph traversal through an irregular graph-based knowledge network with millions of states and arcs. The challenge is not only to define a software architecture that exposes sufficient fine-grained application concurrency, but also to efficiently synchronize between an increasing number of concurrent tasks and to effectively utilize parallelism opportunities in today's highly parallel processors. We propose four application-level implementation alternatives called algorithm styles and construct highly optimized implementations on two parallel platforms: an Intel Core i7 multicore processor and an NVIDIA GTX280 manycore processor. The highest performing algorithm style varies with the implementation platform. On a 44-minute speech data set, we demonstrate substantial speedups of 3.4× on the Core i7 and 10.5× on the GTX280 compared to a highly optimized sequential implementation on the Core i7 without sacrificing accuracy. The parallel implementations contain less than 2.5% sequential overhead, promising scalability and significant potential for further speedup on future platforms.
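One synchronization-sensitive step of the traversal can be sketched in a data-parallel style: relax every arc leaving the active frontier at once and resolve conflicting writes with a scatter-min (np.minimum.at here, atomic operations on the GPU). The toy graph and min-sum costs below are illustrative; the paper's knowledge network has millions of states and arcs.

```python
# Data-parallel frontier relaxation with conflict resolution by scatter-min
# (a CPU stand-in for the GPU's atomic-min writes).
import numpy as np

n_states = 6
src = np.array([0, 0, 1, 2, 2, 3])     # arc sources
dst = np.array([1, 2, 3, 3, 4, 5])     # arc destinations
w   = np.array([1., 4., 2., 1., 7., 3.])

cost = np.full(n_states, np.inf); cost[0] = 0.0
frontier = np.array([True] + [False] * (n_states - 1))

while frontier.any():
    live = frontier[src]                # arcs leaving the active frontier
    new = np.full(n_states, np.inf)
    np.minimum.at(new, dst[live], cost[src[live]] + w[live])  # resolve conflicts
    frontier = new < cost               # states that improved become the new frontier
    cost = np.minimum(cost, new)
print(cost)   # min-sum (Viterbi-style) costs from state 0
```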
A parallel region based object recognition system
2011 IEEE Workshop on Applications of Computer Vision (WACV), 2011
Efficient, high-quality image contour detection
2009 IEEE 12th International Conference on Computer Vision, 2009
Image contour detection is fundamental to many image analysis applications, including image segmentation, object recognition and classification. However, highly accurate image contour detection algorithms are also very computationally intensive, ...