Generative models can synthesize photo-realistic images of a single object. For example, for human faces, algorithms learn to model the local shape and shading of the face components, i.e., changes in the brows, eyes, nose, mouth, jaw line, etc. This is possible because all faces have two brows, two eyes, a nose, and a mouth, approximately in the same locations. Modeling complex scenes is, however, much more challenging because the scene components and their locations vary from image to image. For example, living rooms contain a varying number of products belonging to many possible categories and locations; e.g., a lamp may or may not be present in an endless number of possible locations. In the present work, we propose to add a "broker" module to Generative Adversarial Networks (GANs) to solve this problem. The broker is tasked with mediating the use of multiple discriminators in the appropriate image locales. For example, if a lamp is detected or wanted in a specific area of the scene...
To edit a real photo using Generative Adversarial Networks (GANs), we need a GAN inversion algorithm to identify the latent vector that perfectly reproduces it. Unfortunately, whereas existing inversion algorithms can synthesize images similar to real photos, they cannot generate the identical clones needed in most applications. Here, we derive an algorithm that achieves near-perfect reconstructions of photos. Rather than relying on encoder- or optimization-based methods to find an inverse mapping on a fixed generator G(·), we derive an approach to locally adjust G(·) to better represent the photos we wish to synthesize. This is done by locally tweaking the learned mapping G(·) s.t. ‖x − G(z)‖ < ε, with x the photo we wish to reproduce, z the latent vector, ‖·‖ an appropriate metric, and ε > 0 a small scalar. We show that this approach can not only produce synthetic images that are indistinguishable from the real photos we wish to replicate, but that these images are readily...
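The local adjustment described above can be illustrated with a toy sketch. The abstract only specifies the constraint ‖x − G(z)‖ < ε, so the sketch below stands in a linear "generator" G(z) = W z and updates W by gradient descent on the squared residual; the function name, step-size rule, and dimensions are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def locally_adjust_generator(W, z, x, eps=1e-3, max_steps=1000):
    """Toy sketch: tweak generator weights W so that ||x - W @ z|| < eps.

    W : (d_img, d_latent) matrix standing in for a learned G(.)
    z : latent vector from an initial (imperfect) inversion
    x : target photo, flattened
    All names here are illustrative, not the paper's API.
    """
    W = W.copy()
    step = 0.5 / float(z @ z)                 # halves the residual each step
    for _ in range(max_steps):
        residual = x - W @ z                  # x - G(z)
        if np.linalg.norm(residual) < eps:
            break
        # Gradient of 0.5 * ||x - W z||^2 w.r.t. W is -(residual) z^T
        W += step * np.outer(residual, z)
    return W

rng = np.random.default_rng(0)
W0 = rng.standard_normal((8, 4))
z = rng.standard_normal(4)
x = rng.standard_normal(8)                    # "photo" the fixed G cannot hit
W1 = locally_adjust_generator(W0, z, x)
print(np.linalg.norm(x - W1 @ z) < 1e-3)      # → True
```

Because only W is perturbed while z stays fixed, this mirrors the idea of locally adapting the generator around one photo rather than searching harder for a better latent vector.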
Automatic American Sign Language Imitation Evaluator
The imitation and evaluation procedure is important for ASL learning and teaching. However, current online ASL learning resources do not provide an affordable and convenient imitation-evaluation function. To solve this problem, we propose an Automatic American Sign Language Imitation Evaluator (AASLIE) to evaluate the hand movement in the imitation. The proposed AASLIE system extracts the 3D trajectory of the centroid of the hand by first applying a two-stage algorithm for 2D hand detection and tracking that allows possible hand-face overlaps. The 3D trajectory is extracted using a Structure from Motion algorithm with the point correspondences calculated by minimizing an affine transformation. The evaluation contains two parts, recognition and quantitative evaluation, to give more sensitive feedback than current sign language recognition systems. The recognition is achieved by a classification algorithm. The quantitative evaluation score, which indicates the goodness of the imitation, is...
The performance of computer vision algorithms is near or superior to that of humans in visual problems including object recognition (especially of fine-grained categories), segmentation, and 3D object reconstruction from 2D views. Humans are, however, capable of higher-level image analyses. A clear example, involving theory of mind, is our ability to determine whether a perceived behavior or action was performed intentionally or not. In this paper, we derive an algorithm that can infer whether the behavior of an agent in a scene is intentional or unintentional based on its 3D kinematics, using knowledge of self-propelled motion, Newtonian motion, and their relationship. We show how the addition of this basic knowledge leads to a simple, unsupervised algorithm. To test the derived algorithm, we constructed three dedicated datasets, from abstract geometric animations to realistic videos of agents performing intentional and non-intentional actions. Experiments on these datasets show that...
In just a few years, the photo-realism of images synthesized by Generative Adversarial Networks (GANs) has gone from somewhat reasonable to almost perfect, largely by increasing the complexity of the networks, e.g., adding layers, intermediate latent spaces, style-transfer parameters, etc. This trajectory has led many of the state-of-the-art GANs to be inaccessibly large, disengaging many without large computational resources. Recognizing this, we explore a method for squeezing additional performance from existing, low-complexity GANs. Formally, we present an unsupervised method to find a direction in the latent space that aligns with improved photo-realism. Our approach leaves the network unchanged while enhancing the fidelity of the generated image. We use a simple generator inversion to find the direction in the latent space that results in the smallest change in the image space. Leveraging the learned structure of the latent space, we find that moving in this direction corrects many i...
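The notion of "the latent direction with the smallest change in image space" can be sketched in a toy setting. Assuming a linear stand-in generator G(z) = W z (the real generator is nonlinear, so this is only a local, illustrative analogy), that direction is the right singular vector of the generator's Jacobian, here simply W, with the smallest singular value. All names below are illustrative, not the paper's method.

```python
import numpy as np

def smallest_change_direction(W):
    """Toy sketch: for G(z) = W @ z, return the unit latent direction
    whose motion changes the image least, plus its image-space gain.
    That direction is the right singular vector of W with the
    smallest singular value. Illustrative only."""
    _, s, vt = np.linalg.svd(W)
    return vt[-1], s[-1]          # SVD orders singular values descending

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 6))  # 16-dim "image", 6-dim latent
d, gain = smallest_change_direction(W)
# A unit step along d moves the image by exactly `gain`,
# less than along any other unit latent direction.
print(np.isclose(np.linalg.norm(W @ d), gain))  # → True
```

For a real GAN, one would approximate the Jacobian locally (e.g., by finite differences around a given latent code) rather than having it in closed form.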
This paper details the methodology and results of the EmotioNet challenge. This challenge is the first to test the ability of computer vision algorithms in the automatic analysis of a large number of images of facial expressions of emotion in the wild. The challenge was divided into two tracks. The first track tested the ability of current computer vision algorithms in the automatic detection of action units (AUs). Specifically, we tested the detection of 11 AUs. The second track tested the algorithms' ability to recognize emotion categories in images of facial expressions. Specifically, we tested the recognition of 16 basic and compound emotion categories. The results of the challenge suggest that current computer vision and machine learning algorithms are unable to reliably solve these two tasks. The limitations of current algorithms are more apparent when trying to recognize emotion. We also show that current algorithms are not affected by mild resolution changes, small occlusions...
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Do GANs replicate training images? Previous studies have shown that GANs do not seem to replicate training data without significant changes in the training procedure. This has led to a series of studies on the exact conditions needed for GANs to overfit to the training data. Although a number of factors have been theoretically or empirically identified, the effect of dataset size and complexity on GAN replication is still unknown. With empirical evidence from BigGAN and StyleGAN2 on the CelebA, Flower, and LSUN-bedroom datasets, we show that dataset size and its complexity play an important role in GAN replication and the perceptual quality of the generated images. We further quantify this relationship, discovering that the replication percentage decays exponentially with respect to dataset size and complexity, with a shared decay factor across GAN-dataset combinations. Meanwhile, the perceptual image quality follows a U-shaped trend w.r.t. dataset size. This finding leads to a practical tool for one-shot estimation of the minimal dataset size needed to prevent GAN replication, which can be used to guide dataset construction and selection.
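The exponential decay reported above suggests how such a one-shot estimate could work. Assuming a model of the form replication(N) ≈ a·exp(−b·N), the minimal dataset size keeping replication under a threshold τ follows by inverting the exponential. The constants a, b, and τ below are hypothetical placeholders, not fitted values from the paper.

```python
import math

def min_dataset_size(a, b, tau):
    """Sketch of a one-shot minimal-size estimate, assuming the
    exponential-decay model replication(N) ~ a * exp(-b * N).

    a   : hypothetical replication percentage as N -> 0
    b   : hypothetical shared decay factor
    tau : tolerated replication percentage
    Solves a * exp(-b * N) = tau for N.
    """
    return math.log(a / tau) / b

# Illustrative numbers only (not values from the paper): with
# a = 80 (% replication for a tiny dataset) and decay factor b = 1e-4,
# keeping replication under 1% needs roughly 43.8k training images.
n = min_dataset_size(a=80.0, b=1e-4, tau=1.0)
print(round(n))  # → 43820
```

In practice a and b would be fitted from replication measurements at a few dataset sizes; the one-shot estimate then extrapolates along the shared decay curve.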
Papers by Qianli Feng