Depth map compression for real-time view-based rendering
Bing-Bing Chai *, Sriram Sethuraman, Harpreet S. Sawhney, Paul Hatrack
Sarnoff Corporation, 201 Washington Rd., Princeton, NJ 08543-5300, USA
Pattern Recognition Letters, 2004
* Corresponding author. E-mail address: bchai@sarnoff.com (B.-B. Chai).
Abstract
Realistic and interactive telepresence has been a hot research topic in recent years. Enabling telepresence using depth-based new view rendering requires the compression and transmission of video as well as dynamic depth maps from multiple cameras. The telepresence application places additional requirements on the compressed representation of depth maps, such as preservation of depth discontinuities, low-complexity decoding, and amenability to real-time rendering using graphics cards. We propose an adaptation of an existing triangular mesh generation method for depth representation that can be encoded efficiently. The mesh geometry is encoded using a binary tree structure in which single-bit enabled flags mark the split of triangles, and the depth values at the tree nodes are differentially coded. By matching the tree traversal to the mesh rendering order, both depth map decoding and triangle strip generation for efficient rendering are achieved simultaneously. The proposed scheme also naturally lends itself to coding segmented foreground layers and to providing error resilience. At a similar compression ratio, new view generation using the proposed method provided quality similar to depth compression using JPEG2000. However, the new mesh-based depth map representation and compression method showed a significant improvement in rendering speed when compared to using separate compression and rendering processes.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Depth map compression; View-based rendering; Triangular mesh; 3D video stream; Foreground/background separation
1. Introduction
In the past few years, there has been an increased interest in 3D scene rendering. It has been used in sports events such as the Super Bowl, in movies, and in TV commercials. Traditional rendering methods model the complete geometry and texture of a 3D scene or object. Polygonal mesh representation of the 3D geometry is the typical representation to enable fast rendering using graphics hardware. Simplification and compression of such mesh representations have received considerable attention in the graphics community over the past few years (Taubin and Rossignac, 1998; Khodakovsky et al., 2000). Such works have concentrated primarily on static models, where the models are created off-line and decompression and rendering have no time constraints.
Recently, image-based rendering (IBR) techniques have been proposed in the computer vision/ graphics communities. Unlike the traditional rendering methods, IBR methods synthesize arbitrary views of a scene from a collection of images observed from known viewpoints. Depending on the amount of 3D information being employed, a
continuum of IBR methods have been developed, ranging from pure image based methods such as light field rendering (Levoy and Hanrahan, 1996) and lumigraph (Gortler et al., 1996) to depth based image warping (Tao and Sawhney, 2000). The main advantage of the IBR approach is that it can render complicated scenes that are otherwise difficult to model geometrically. This becomes more practical for rendering dynamic scenes, where it is infeasible to create a global model of the scene at every time instant of the dynamic 3D scene. Though several representations already exist for coding depth or disparity (Grammalidis et al., 2000; Ohm, 1999), these representations have been developed without any real-time decompression or rendering constraints. This paper addresses the issues associated with a specific case of IBR known as local depth-based rendering where the local dynamic depth map at each camera is used in conjunction with the video (called a 3D video stream hereafter) and the camera pose to synthesize a virtual view under real-time constraints.
The application considered here is realistic and interactive telepresence that involves capture, transmission, and remote browsing of dynamic 3D scenes that are captured using a collection of synchronized video cameras. A sequence of depth maps is computed in real-time from the viewpoint of each camera using multiple color video streams. Each depth map stream is compressed along with the corresponding video stream to produce a 3D video stream that can be streamed over the Internet. Multiple such video streams are depacketized, decompressed, rendered from a virtual viewpoint and combined by a remote browser. This application scenario is illustrated in Fig. 1.
Fig. 1. Camera setup for multiple local depth map generation for rendering virtual viewpoints of a dynamic scene.

Section 2 states the need for a new depth representation by presenting the requirements of a real-time view-based rendering application. To leverage the power of graphics cards, we choose a depth representation that is amenable to rendering on standard graphics hardware using OpenGL®. Section 3 presents an overview of a mesh simplification scheme used for rendering terrain height fields in real-time under different viewpoints and extends it to triangulate the depth map from a particular camera viewpoint. Section 4 describes how the proposed representation is compressed for transmission or storage. By closely coupling the decompression and rendering orders, real-time rendering rates are achieved. Section 5 presents experimental results that highlight the level of compression, the decompression speed, and the rendering speed achieved using the proposed representation. Future extensions and conclusions are presented in Section 6.
2. Motivation for a new depth representation and compression
Rendering a virtual view from multiple 3D video streams requires warping the multiple camera views to a single virtual view using the depth maps and the camera poses. If all cameras are calibrated, for a pixel in the reference view with homogeneous image coordinates p = [x, y, 1], the depth-based image warping function is (Irani, 1996)

p′ = p″ + k(T − Tz p″),

where p′ = [x′, y′, 1] is the homogeneous coordinate of the pixel in the new view, p″ is the result of an intermediate warp that is determined by the camera rotation R but not by the camera translation or the depth, k = 1/Z is the depth information, and T = [Tx, Ty, Tz] is the camera translation.
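As a concrete illustration of this warp, the sketch below applies the equation to a single pixel. It assumes calibrated (normalized) image coordinates, takes p″ as the rotation-only warp Rp normalized to unit third coordinate, and re-normalizes the result; these conventions and the numerical values are illustrative assumptions rather than details given in the paper.

```python
import numpy as np

def warp_pixel(p_ref, k, R, T):
    """Depth-based warp of one reference-view pixel into the new view.

    p_ref : homogeneous pixel [x, y, 1] in the reference view (calibrated coordinates)
    k     : inverse depth 1/Z at that pixel
    R, T  : rotation (3x3) and translation (3,) relating the two cameras
    """
    p_rot = R @ p_ref
    p_dp = p_rot / p_rot[2]               # p'': rotation-only warp, independent of T and depth
    p_new = p_dp + k * (T - T[2] * p_dp)  # parallax term scaled by the inverse depth
    return p_new / p_new[2]               # back to homogeneous form [x', y', 1]

# Illustrative example: a 2-degree rotation about the y-axis and a small translation.
theta = np.deg2rad(2.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
T = np.array([0.1, 0.0, 0.02])
print(warp_pixel(np.array([0.25, -0.10, 1.0]), k=1.0 / 5.0, R=R, T=T))
```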
For rendering purposes, the depth map needs to be highly accurate at the depth discontinuities
to ensure good quality synthesized new views. A color segmentation based stereo algorithm for recovering accurate depth maps with sharp depth discontinuity boundaries and fine details of thin structures is presented in (Tao and Sawhney, 2000). These sharp depth discontinuities need to be preserved after compression. A typical approach would be to compress the depth map using any known compression techniques such as MPEG-4 auxiliary stream (Grammalidis et al., 2000) and JPEG-2000 (Krishnamurthy et al., 2001), decode this representation at the remote end and then use the depth map for warping the corresponding image. However, these generic compression techniques typically blur the depth discontinuities unless special efforts are taken to preserve them (Krishnamurthy et al., 2001).
To create realistic 3D scenes, high-resolution frames are needed from each video camera. Software-based rendering of multiple high-resolution, 3D video streams in real-time (30 frames/s) is currently not feasible even on high-end PC platforms. Hence, it is desirable to leverage the graphics processors on PCs to perform the warping. However, a straightforward pixels-to-triangle representation results in a very high number of triangles and the graphics card rendering speeds, while better than the software rendering speeds, are still very far from real-time rendering rates. Standard compression techniques that offer moderately high compression have a high decoding complexity as well, adding more overhead to the rendering process.
This motivates a mesh-based representation for the depth map, so that the 3D geometry for rendering is specified over polygons and the video frames provide the texture at each time instant. A mesh representation has been used for coding multiview static images in (Girod and Magnor, 2000) for easy rendering using graphics hardware by creating a 3D mesh model from multiple views and then creating a texture map from bitmaps using the mesh geometry. As mentioned before, such 3D modeling is not feasible for dynamic scenes. Also, generation of local depth maps with information only from nearby cameras enables a scalable design for real-time generation of 3D video streams. Mesh representations have been used
for 3D-model and terrain visualization. Simplified meshes and traversal/decomposition schemes have been used to achieve compression and real-time rendering rates (Taubin and Rossignac, 1998; Khodakovsky et al., 2000; Lindstrom et al., 1996; Duchaineau, 1997). The characteristics of the depth map in the 3D video streaming application distinguish it from smooth 3D static models that have no internal discontinuities. The terrain visualization meshes are more closely related to our application, as they use a 2D mesh of the terrain data to view the terrain from different viewpoints. However, terrain data remains static, and rendering different views of it over time is easier than rendering dynamic depth data. In addition to the real-time rendering requirement, the depth representation in 3D video streaming is also governed by the level of compression and by real-time decompression constraints when a remote browser accesses the 3D video streams over a network. For live streams, the compression also needs to be performed in real-time.
Thus we consider a depth map representation that offers the compression needed for transmission, is suitable for real-time decompression, and produces 3D data streams in a form such that standard PC graphics hardware can render multiple 3D streams in real-time. We choose the height field rendering mesh proposed by Lindstrom et al. (1996) and adapt this mesh representation for compression of depth maps and for fast view-based rendering.
3. Proposed triangular-mesh based depth representation
Lindstrom et al. (1996) proposed an algorithm for real-time, continuous level-of-detail rendering of digital terrain and other height fields (referred to hereafter as Lindstrom triangulation), where real-time mesh simplification and rendering were achieved. Although height field rendering is different from the real-time, view-based video rendering that is targeted in this paper, we found that the Lindstrom triangulation algorithm can be easily adapted to our application. Hence we first give an overview of this algorithm. Then we
present our modifications to the algorithm for representing view-based depth maps.
3.1. Overview of Lindstrom triangulation
The Lindstrom triangulation algorithm makes use of a compact, regular grid mesh representation, and dynamically generates the triangular mesh with the proper level of detail (LOD) for each rendered frame based on the location of the viewpoint and the geometry of the height field. A primitive mesh consists of 3×3 vertices and is also the finest LOD. Coarser levels are formed by grouping smaller meshes in a 2×2 array configuration and discarding every other row and column of the four higher-resolution blocks (Fig. 2). In summary, level l corresponds to a block of size (2^l + 1) × (2^l + 1) vertices. This hierarchical organization of blocks forms a quadtree structure. Lindstrom triangulation starts from a primitive mesh and simplifies the mesh based on a criterion measuring the rendering error at the base vertex B (Fig. 3),
δB = Bz − (Az + Cz)/2,
where Az,Bz,Cz are the Z components or height values at the vertices. A view-dependent criterion is formulated in (Lindstrom et al., 1996).
There are two steps in the simplification process:
Fig. 2. Meshes with different LOD and parent-child relationship: (a) primitive 3×3 mesh at the finest LOD; (b) 5×5 level 2 mesh. In both cases, the vertices that are pointed to by the arrows are the parents of the vertices from which the arrows originate.
Fig. 3. Illustration of a triangle/co-triangle pair, ABD and CBD.
- Coarse-grained (or block-based) simplification. This step determines which discrete LOD model is needed by computing the rendering error of using the previous frame's LOD at the current viewpoint and comparing it against an uncertainty interval (δl, δB].
- Fine-grained (or vertex-based) simplification. Individual vertices are considered for removal in this step. A triangle/co-triangle pair (Fig. 3) can be merged only when the triangles in the pair have no further subdivision. The pair is reduced to a single triangle if δB is smaller than a threshold τ.
In addition, special effort has to be made to avoid T-vertices, such as V1 and V2 shown in Fig. 4a. This can be easily achieved by following vertex dependencies, i.e., the parent-child relationship. Two types of parent-child relationships are defined as in Fig. 2. Vertex v is enabled if its projected delta segment δv exceeds τ, or any of its children are enabled. When a vertex is enabled, both of its triangle/co-triangle pairs have to exist. This also means that the dependencies will propagate across block boundaries (Fig. 4b).
Fig. 4. Illustration of T-vertices. (a) V1 and V2 are T-vertices; they cause cracks when the viewpoint changes. (b) The T-vertex problem is fixed by adding more triangles in the block and its neighboring blocks. Note that this triangle propagation stops at the neighboring blocks.
Fig. 5. Sample triangular mesh in a block. Green denotes enabled(v) = true, black denotes enabled(v) = false. (For interpretation of the references in colour in this figure legend, the reader is referred to the web version of this article.)
Besides simplification, an algorithm for traversing a block to create a single triangle strip for fast rendering on graphics hardware using OpenGL is also presented in (Lindstrom et al., 1996). Figs. 5 and 6 explain the rendering algorithm. Each block is rendered as four quadrants. The mesh in each quadrant can be considered as a binary tree (or a binary vertex tree) of the triangle pairs at each level of resolution. Fig. 5 shows a block with a sample quadrant (shaded) after mesh simplification. Fig. 6 shows the corresponding binary tree. The three vertices on each tree node represent the triangle, and the central vertex denotes the shared vertex between the co-pair. Each triangle vertex is denoted by a subscript (l, t, r), where vl vr is the hypotenuse. Terminal nodes are the vertices or triangles without further split, i.e., vt is not enabled. The rendering is done as follows.
Rendering order. Traverse the binary tree depth-first and render the nodes that have their children disabled. Thus [v5,1, v4,2, v5,3] [v5,3, v4,3, v4,2] [v4,2, v4,3, v3,3] [v3,3, v4,3, v4,4] [v4,4, v4,3, v5,3] [v5,3, v4,4, v5,5] are the triangles in rendering order.
Triangle strip generation. The rendering algorithm generates a triangle strip for each quadrant that can be quickly rendered using OpenGL. For each vertex v specified, the previous two vertices in the strip and v form the current triangle to be rendered.
Fig. 6. Binary tree representation of the triangular mesh for the shaded quadrant in Fig. 5.

Let mb[0] and mb[1] denote two vertex buffers used in stripping. At the beginning of a quadrant, push vl of level n into mb[0], where 2^n + 1 is the rendered block dimension. When ascending from a left child and descending to the right child, if vt of the parent node is not already in mb[0] or mb[1], add vt to the vertex list and render a triangle. Before adding vt to the vertex list, the entries of mb[0] and mb[1] are swapped if (current level + previous level) is even. After rendering the triangle, the last two vertices of that triangle will be in mb[0] and mb[1], and the previous level is updated with the current level. For the tree in Fig. 6, the vertices added become [v5,1, v4,2, v5,3, v4,3, v3,3, v4,4, v5,3]. By rendering the left vertex of the next quadrant (v5,5) first, the last triangle of the current quadrant is completed. Thus, by making the right vertex of a quadrant the left vertex of the next quadrant, the whole block can be rendered in one triangle strip.
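The depth-first rendering order can be sketched as follows. This is a minimal illustration only: it emits the leaf triangles in traversal order and leaves out the mb[0]/mb[1] buffer bookkeeping (and the parity-based swap) that fuses them into a single OpenGL triangle strip, and the small example tree uses hypothetical vertex labels rather than the Fig. 6 grid.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class TriNode:
    """One triangle of a quadrant's binary vertex tree: vl and vr span the
    hypotenuse, vt is the remaining vertex; children exist only when the
    triangle's split vertex is enabled."""
    vl: str
    vt: str
    vr: str
    left: Optional["TriNode"] = None
    right: Optional["TriNode"] = None

def rendering_order(node: TriNode, out: List[Tuple[str, str, str]]) -> None:
    """Depth-first traversal emitting the triangles whose children are disabled."""
    if node.left is None and node.right is None:
        out.append((node.vl, node.vt, node.vr))
        return
    for child in (node.left, node.right):
        if child is not None:
            rendering_order(child, out)

# Hypothetical two-level quadrant: the root triangle (a, c, b) is split at d,
# and only its left child is split once more at e.
root = TriNode("a", "c", "b",
               left=TriNode("a", "d", "c",
                            left=TriNode("a", "e", "d"),
                            right=TriNode("d", "e", "c")),
               right=TriNode("c", "d", "b"))
tris: List[Tuple[str, str, str]] = []
rendering_order(root, tris)
print(tris)  # [('a', 'e', 'd'), ('d', 'e', 'c'), ('c', 'd', 'b')] -- consecutive triangles share an edge
```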
3.2. Modified Lindstrom triangulation
Lindstrom triangulation has a few properties that make it applicable to our real-time view-based video rendering:
- It generates a regular grid mesh that is easy to index and access. In the following section, we will present how a regular grid mesh also makes our mesh compression more efficient.
- It generates triangle strips for fast hardware rendering.
- The mesh eliminates T-vertices that can cause holes when warped to render a virtual view.
- The mesh simplification will naturally preserve depth discontinuities by not allowing large triangles across discontinuities.
On the other hand, there are some differences between the terrain rendering and 3D video streaming applications. First, Lindstrom et al. focused on generating the best mesh for a given rendered view for each frame, while we are targeting a wide range of viewpoints with a single mesh. The mesh is generated from the camera viewpoint and warped to the desired virtual views. Second, there is no compression involved in the terrain rendering application, while real-time 3D video streaming is concerned with the compression of both video and depth information. Our mesh generation has two purposes: (a) to facilitate real-time rendering and (b) to achieve high depth map compression while maintaining an acceptable quality for the depth map. Third, the static terrain data allows frame-to-frame update of the mesh, which reduces the complexity of the mesh generation. The dynamic nature of the depth maps does not allow updates in a straightforward fashion. For the time being, we consider only meshes generated independently for each frame. A mesh generation scheme that considers motion in video will be a future research topic. Lastly, the depth map we are coding could have holes (undefined values) in it, and this special case has to be handled. We modify the Lindstrom triangulation to achieve the above objectives.
For achieving a good rendering quality from multiple viewpoints, we perform the mesh simplification using a simple, viewpoint independent criterion. The mesh simplification starts from the primitive mesh with a predefined coarsest LOD (or largest block). At a certain level of resolution, a triangle/co-triangle pair could be merged if the depth change δB = Bz − (Az + Cz)/2 is less than a predefined threshold τ. Again, this merge is also conditioned on the dependencies from its neighboring blocks.
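A minimal sketch of this viewpoint-independent merge test is given below. The indexing convention, the use of the absolute depth change, and the treatment of holes as infinite depth (described in the next paragraph) are illustrative assumptions, and the cross-block dependency checks are omitted.

```python
import numpy as np

def can_merge(depth, A, B, C, tau):
    """Viewpoint-independent merge test for a triangle/co-triangle pair.

    depth : 2-D array of per-vertex depth values on the regular grid
    A, C  : (row, col) indices of the endpoints of the shared hypotenuse
    B     : (row, col) index of the base vertex at the midpoint of the hypotenuse
    tau   : simplification threshold
    """
    a_z, b_z, c_z = float(depth[A]), float(depth[B]), float(depth[C])
    if not np.isfinite([a_z, b_z, c_z]).all():
        return False  # holes carry infinite depth, so a pair is never merged across them
    # depth change at B relative to the average of its hypotenuse neighbours
    return abs(b_z - 0.5 * (a_z + c_z)) < tau

# A flat patch merges; a patch with a depth step across the hypotenuse does not.
flat = np.full((3, 3), 10.0)
step = flat.copy()
step[:, 2] = 100.0
print(can_merge(flat, (0, 0), (0, 1), (0, 2), tau=2.0))   # True
print(can_merge(step, (0, 0), (0, 1), (0, 2), tau=2.0))   # False
```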
Two problems arise in this simplification scheme. First, since depth estimation from stereo views does not produce depth values for every pixel, undefined values need to be taken care of. Second, when rendering a new view using the mesh, the background is connected to the foreground, resulting in stretched triangles spanning both foreground and background when the viewpoint changes (a rubber-band effect). We handle undefined values by representing them with an infinite depth value. In this way, the undefined values can be processed the same way as other depth discontinuities. One simple solution to the second problem is to not render triangles that have disparate depths at their vertices. However, this will result in holes around a foreground object that may not be filled in from another view. Ideally, two depth values should be used at the location of a depth discontinuity, one for the foreground and one for the background. To capture this and solve the problem of the stretched mesh, we check the depth difference among the vertices of each triangle to be rendered. If the difference exceeds a depth discontinuity threshold, then the value of the vertex that differs from the other two is saved and replaced with the depth value of one of those two vertices. After rendering this triangle, the triangle strip is ended and a new triangle strip is started with the last two vertices in the buffer. The saved depth value is retrieved and the strip continues until we hit another discontinuity. Though this slightly increases the rendering complexity, the true discontinuity is represented at the boundary of objects.
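One plausible reading of this discontinuity handling is sketched below. How the outlier vertex is identified and which neighbouring depth it is clamped to are assumptions, since the text only states that the differing vertex is temporarily given the depth of one of the other two.

```python
def walk_strip_with_discontinuities(strip, depth, disc_thresh):
    """Walk a triangle strip and flag where it must be broken at depth discontinuities.

    strip       : ordered vertex list; every consecutive triple forms a triangle
    depth       : mapping from vertex to its decoded depth value
    disc_thresh : depth-discontinuity threshold

    Yields (triangle, depths_used, end_strip).  When end_strip is True, the caller
    renders the triangle with the clamped depths, ends the current OpenGL strip,
    and restarts a new strip from the last two vertices; the true depth of the
    outlier vertex is left untouched in `depth`, so the next strip uses it again.
    """
    for i in range(2, len(strip)):
        tri = strip[i - 2:i + 1]
        z = [float(depth[v]) for v in tri]
        if max(z) - min(z) <= disc_thresh:
            yield tri, z, False
            continue
        med = sorted(z)[1]                                # depth shared (roughly) by two vertices
        j = max(range(3), key=lambda m: abs(z[m] - med))  # the vertex that differs from the other two
        z[j] = med                                        # temporarily clamp it for this triangle
        yield tri, z, True

# Tiny illustration: vertices v0..v3, with v3 on the far background.
depths = {"v0": 10.0, "v1": 10.5, "v2": 10.2, "v3": 90.0}
for tri, z, end in walk_strip_with_discontinuities(["v0", "v1", "v2", "v3"], depths, disc_thresh=5.0):
    print(tri, z, "end strip" if end else "")
```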
4. Compression of the proposed depth representation
The compression of the depth triangulation mesh includes the coding of the mesh geometry and of the depth values at the mesh vertices. The objective of the proposed compression method is not only to achieve a high compression ratio, but also to reduce the computational complexity at the decoding/rendering end of the system and to provide a good-quality rendered view. The triangulated depth map is represented by two types of information: (a) vertex information (location and enabled flag) and (b) the depth value at the vertex. The first two subsections describe the general coding scheme of the triangulated depth map. The rest of the section presents solutions to special concerns of the 3D video streaming application: coarser mesh resolution for lower complexity, foreground/background separation, and error resilience.
4.1. Coding of vertex information
Two simple and obvious methods for encoding the location of vertices in an image are: (a) coding the (x, y) location of each vertex explicitly using a differential coder; (b) coding a binary map of the presence/absence of all possible vertices. Both methods can encode the mesh geometry with reasonable compression, but both require re-forming the triangle strips at the rendering end, which adds a burden to real-time rendering.
To speed up the decoding and rendering process, we choose to encode the vertex location in the vertex rendering order within each rendered block. This is feasible because the rendering algorithm is known at both the encoder and the decoder. By examining the rendering process, one comes to realize that the recursion down the binary triangle tree makes use of only the binary enabled flag at each vertex node. Therefore we propose to encode the enabled flag in the rendering order. When a false flag is encountered, vertices with finer resolution under the current triangle will not be rendered, and thus will not be encoded. That is, the binary enabled flags form a zero tree for coding the mesh geometry and depth values. Hence we encode the enabled flags in the depth-first tree traversal order to code the mesh geometry. Although arithmetic coding could further reduce the 1 bit/vertex limit, our experimental results show that the enabled and disabled flags are hard to predict from the context, indicating that further entropy coding will not save much in compression. Furthermore, arithmetic decoding would increase the decoding complexity. Hence the enabled flags are just packed as single bits. In the example given in Figs. 5 and 6, the shaded quadrant can be encoded by following the binary tree structure as follows. Traverse the tree depth-first up to level 1 and code enabled(vi). This gives e(v3,3), e(v5,3), e(v4,2), e(v5,2), e(v5,2), e(v4,2), e(v4,3), e(v4,3), e(v5,3), e(v4,4), e(v4,3), e(v4,3), e(v4,4), e(v5,4).
From the rendering algorithm, it is easy to see that some vertices are traversed multiple times. A map of the enabled flags (the encoded map) is kept at both the encoder and the decoder to record whether a vertex has already been encoded/decoded, so that there is no redundancy in the coding process. This is a trade-off between memory usage and compression efficiency. Since some vertices appear as many as four times in the rendering order, we feel this is a good trade of memory usage and a little added computational complexity for compression efficiency. In the example above, by storing an array of flags in memory to indicate whether a node has already been visited, we can eliminate the repetitions to get e(v3,3), e(v5,3), e(v4,2), e(v5,2), e(v4,3), e(v4,4), e(v5,4). Thus 1110110 codes this tree.
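The enabled-flag coding can be sketched as follows. The tree is given here as an explicit child table with hypothetical vertex labels (the real encoder derives the children from the quadtree/bintree geometry), and the decoder simply mirrors the same traversal, reading one bit at the first visit to each vertex.

```python
def encode_enabled_flags(node, enabled, children, encoded, bits):
    """Depth-first, rendering-order coding of the single-bit enabled flags.

    enabled  : dict vertex -> bool (result of mesh simplification)
    children : dict vertex -> list of split vertices one level finer
    encoded  : set acting as the 'encoded map'; revisited vertices cost no extra bits
    bits     : output bit list; a 0 prunes the whole subtree below that vertex
    """
    if node not in encoded:
        encoded.add(node)
        bits.append(1 if enabled[node] else 0)
    if enabled[node]:                        # descend only below enabled (split) vertices
        for child in children.get(node, ()):
            encode_enabled_flags(child, enabled, children, encoded, bits)

# Hypothetical example: vertex s is reachable from both a and b but is coded only once.
children = {"c": ["a", "b"], "a": ["s"], "b": ["s"]}
enabled = {"c": True, "a": True, "b": True, "s": False}
bits, encoded = [], set()
encode_enabled_flags("c", enabled, children, encoded, bits)
print(bits)  # [1, 1, 0, 1] -- four bits for four distinct vertices, despite five visits
```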
4.2. Coding of depth information
The original depth values at the vertices are first converted from floating point to 8-bit unsigned values. To achieve low complexity, the encoding is chosen to be a simple differential coding within a rendered block, followed by Huffman entropy coding. Only the depth values at enabled vertices and at the four corners of the rendered blocks are encoded. The infinite depth is assigned a special value in the Huffman table. The current differential coding is simply the difference between the current and the previous depth value in the coding order. In future work, we will evaluate more efficient DPCM schemes for depth values that would reduce the entropy in Huffman coding.
Two methods for depth value quantization were evaluated: (1) the depth value at each vertex is quantized with a uniform quantizer; (2) the delta values from the differential coding (DPCM) are quantized with a uniform quantizer. The former results in a step-like depth map that leads to a folding effect in the rendered view. The latter depends on the DPCM used and results in different reconstructed depth values for the same vertex location in different rendering blocks, an effect similar to the blocking artifact in DCT-based coding algorithms but far more annoying when rendering new views. For now, the DPCM values are coded losslessly. As with the coding of vertex information, the encoded map is also used to record the coding status of the depth value at each vertex, so that each vertex depth is encoded only once.
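A sketch of the lossless differential coding of the 8-bit depth values is shown below. The reserved hole symbol, the predictor initialization, and keeping the predictor unchanged across holes are illustrative assumptions, and the subsequent Huffman stage is only indicated in the comments.

```python
HOLE = 255  # reserved symbol standing in for 'infinite depth' (an undefined vertex)

def dpcm_encode(depths):
    """Differential coding of 8-bit depth values, taken in coding (rendering) order.

    depths : iterable of values in [0, 254], or HOLE for an undefined vertex.
    Returns the symbol stream that would then be Huffman coded; each delta is the
    difference from the previously coded (non-hole) depth value.
    """
    symbols, prev = [], 0
    for d in depths:
        if d == HOLE:
            symbols.append(HOLE)           # holes map to a dedicated Huffman symbol
        else:
            symbols.append(int(d) - prev)
            prev = int(d)
    return symbols

def dpcm_decode(symbols):
    """Inverse of dpcm_encode (the deltas are coded losslessly)."""
    out, prev = [], 0
    for s in symbols:
        if s == HOLE:
            out.append(HOLE)
        else:
            prev += s
            out.append(prev)
    return out

values = [100, 102, HOLE, 90]
assert dpcm_decode(dpcm_encode(values)) == values
print(dpcm_encode(values))  # [100, 2, 255, -12]
```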
4.3. Coding at different mesh resolutions
In addition to the threshold that is used to control the mesh simplification, the depth mesh can also be generated at coarser resolutions for reduced complexity and higher compression, which may be necessary when there are limits on the bandwidth and computational power. The highest resolution (level 0) assumes a primitive mesh with a dimension of 3×3 vertices. The next level mesh (level 1) has a primitive mesh with a dimension of 5×5 vertices, with vertices in odd rows or columns not present in the depth mesh, resulting in a coarser mesh. A coarser-level mesh can easily be coded by terminating the binary tree at the desired resolution; in other words, all vertices at the desired resolution will be coded as disabled. A more sophisticated method is to determine the mesh resolution for each block according to the error introduced by the change in resolution, similar to that in (Lindstrom et al., 1996). The downside of this method is the added complexity.
4.4. Compression with foreground/background separation
In applications where the background is static and a complete background image is available, it makes sense to encode the foreground and background separately. Foreground/background separation not only saves the bandwidth spent in repeatedly transmitting the depth map for the background, but also eliminates the holes in the new view that are caused by unavailable information. JPEG2000 has the option of ROI coding, which can be used for separating foreground from background (Krishnamurthy et al., 2001), but it causes blurred boundaries between the foreground and background. The foreground/background separation is easily achieved in the proposed algorithm. First, a separate, complete background is processed as described above. Then the real-time depth map is preprocessed such that foreground depth values remain unchanged and the background regions are assigned infinite depth values. This new depth map is then processed with the proposed algorithm. This technique of foreground/background separation works better than encoding the complete depth maps because large triangles will be created and coded in the regions with infinite depth.
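The foreground/background preprocessing amounts to a single masking step, sketched below; the source of the mask (e.g. the segmentation used by the stereo module) and the use of np.inf as the infinite-depth marker are assumptions for illustration.

```python
import numpy as np

INFINITE_DEPTH = np.inf  # marker later mapped to the special 'infinite depth' code

def prepare_foreground_depth(depth, fg_mask):
    """Preprocess one frame's depth map for foreground-only coding.

    depth   : 2-D float array of per-pixel depth values
    fg_mask : boolean array, True where the pixel belongs to the foreground
    Background pixels get infinite depth, so the triangulation forms large, cheap
    triangles there and the rest of the pipeline is reused unchanged.
    """
    out = depth.copy()
    out[~fg_mask] = INFINITE_DEPTH
    return out

# Example: keep a central square as foreground, push everything else to infinity.
d = np.random.uniform(2.0, 5.0, size=(8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
print(prepare_foreground_depth(d, mask)[0, 0])  # inf
```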
4.5. Coding unit for resynchronization
Another important issue in Internet applications is error resilience. The algorithm described so far introduces no redundancy, and a single error would cause the loss of an entire frame. For the purpose of error resilience, we introduce the concept of a coding unit, which is similar to that of slices in video coding standards such as MPEG2 (ISO/IEC, 1995). The purpose is to localize an error to only a small portion of the bitstream. By separating the bitstream into a few independent segments, errors that occur in one segment do not affect the rest of the bitstream. This allows the rendering of part of an image, making error concealment possible. As the basic unit in the proposed triangulation/coding algorithm is a rendered block, a coding unit is defined as a set of rendered blocks (Fig. 7). Each coding unit is encoded such that it can be decoded without any information from the rest of the image. This also means that vertices on the boundaries of a coding unit have to be coded multiple times, once in each coding unit. In other words, the encoded map is reset for each coding unit.
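A sketch of grouping rendered blocks into coding units is given below. The unit shape is a parameter (the experiments use the 2×2 default), and resetting the encoded map and the DPCM predictor at each unit boundary is what makes the units independently decodable; the function names are illustrative.

```python
def coding_units(block_rows, block_cols, unit_h=2, unit_w=2):
    """Group the grid of rendered blocks into independently decodable coding units.

    Yields lists of (block_row, block_col) indices.  The encoder resets its
    encoded map (and DPCM predictor state) at the start of each unit, so a
    transmission error corrupts at most one unit of the bitstream.
    """
    for r0 in range(0, block_rows, unit_h):
        for c0 in range(0, block_cols, unit_w):
            yield [(r, c)
                   for r in range(r0, min(r0 + unit_h, block_rows))
                   for c in range(c0, min(c0 + unit_w, block_cols))]

units = list(coding_units(block_rows=4, block_cols=6))
print(len(units), units[0])  # 6 units; the first covers blocks (0,0) (0,1) (1,0) (1,1)
```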
Fig. 7. Coding unit formation. Each coding unit (shown with a different shade) consists of four rendered blocks.
5. Experimental results
Experiments were conducted on 640×480 sequences. The default coding unit contains four rendered blocks arranged in a 2×2 square. The results presented here are for the multi-view sequence bike. Two original camera views and the estimated uncompressed depth map for the left view are shown in Fig. 8a-c. Table 1 lists the depth simplification and compression results for the left camera view. The case with simplification threshold τ=1 is equivalent to the uncompressed depth. It can be seen that 3:1 compression is achieved even in this lossless case. As the threshold increases, the depth mesh is gradually simplified. We reach below 30,000 triangles for τ=20 and obtain about 27:1 compression. Fig. 8d shows the reconstructed depth map. Real-time rendering can be achieved for τ=3 for a single view and τ=10 for two views (Chai et al., 2002). The virtual views rendered from two existing views with the original and the compressed depth maps are shown in Fig. 8e and f, respectively. The rendered view from the simplified mesh looks very close to the one with no simplification, indicating that the level of mesh simplification performed to achieve real-time rendering can provide high-quality rendered virtual views.
Table 2 shows the results for different modes of simplifications. Level 0 mesh corresponds to a 3×3 primitive mesh, and level 1 corresponds to a 5×5 primitive mesh. Reducing the mesh resolution reduces the mesh complexity to about half and compression ratio roughly doubles. When the
complete background is available, as in Fig. 8h, it needs to be compressed only once and then the foreground (Fig. 8g) is updated over time. It can be seen that even combining the separate foreground and background (Fig. 8i and j) yields fewer triangles to render than coding both foreground and background as one depth map (Fig. 8k). The explanation is that the background no longer contains the small triangles caused by the discontinuity at the foreground boundary.
Table 3 shows the comparison between depth compression using JPEG-2000 (Gormish et al., 2000) and the proposed mesh-based compression. The PSNR is obtained by converting the reconstructed depth map to 8-bit precision and performing the computation on the 8-bit version. JPEG-2000 seems to preserve the overall depth quality better than the proposed algorithm; however, the rendered views show little difference. The advantage of the proposed mesh-based compression is the rendering speed. At 0.29 bpp, the rendering speed for mesh-based compression is 70 frames/s. In the case of JPEG-2000 depth compression, since no triangulation was performed, rendering was done using the non-simplified triangular mesh. The rendering speed for JPEG-2000 compression is 6.35 frames/s, which is less than one-tenth of the speed of the proposed technique.
6. Conclusions and future work
We have developed a new algorithm that performs depth map triangulation and compression for 3D video streaming and new view generation. The depth map is encoded as a simplified triangular mesh. It achieves moderate compression by encoding vertices as enabled flags and depth values at enabled vertices in mesh rendering order. The decoding complexity is kept low by the use of a simple compression technique and coding in rendering order. Good quality new view rendering, as well as real-time rates for view-based rendering are accomplished with the proposed algorithm.
Our future research directions include improving depth DPCM coding efficiency, error-metric-driven quantization of DPCM values, using frame-to-frame correlation to aid mesh generation/compression, reshaping of the depth map for better intra-frame bit allocation, and scalable mesh coding and rate control.
Fig. 8. View-based rendering results: (a) left camera view; (b) right camera view; (c) depth map for the left camera view; (d) reconstructed depth for τ=20; (e) rendered view using non-simplified depth mesh; (f) rendered view using simplified depth mesh with τ=20; (g) foreground only rendering; (h) background only rendering; (i) corresponding mesh for (g); (j) corresponding mesh for (h); (k) mesh with both foreground and background.
Table 1
Depth mesh simplification and compression results for one frame of the bike sequence

| τ | No. rendered triangles | Rendering speed (frames/s) | Compression ratio |
|---|---|---|---|
| 1 | 497,600 | 6.6 | 2.96 |
| 2 | 123,222 | 24 | 8.84 |
| 3 | 81,416 | 34 | 12.32 |
| 5 | 51,181 | 51 | 17.52 |
| 10 | 34,273 | 65 | 23.47 |
| 20 | 27,813 | 70 | 27.42 |
Table 2
Compression results in different simplification modes for one frame of the bike sequence, τ=20

| Mode | No. rendered triangles | Compression ratio |
|---|---|---|
| Level 0 mesh | 27,813 | 27.42 |
| Level 1 mesh | 12,550 | 48.31 |
| Foreground only | 6630 | 61.06 |
| Full background | 14,872 | 46.54 |
Table 3
Comparison of depth map compression using mesh coding and JPEG-2000, coded at 0.29 bpp

| Encoding scheme | Reconstructed depth PSNR (dB) | (Y,U,V) PSNR (dB) of a synthesized new view | Rendering speed (frames/s) |
|---|---|---|---|
| Mesh coding | 47.1851 | 34.95, 44.19, 43.44 | 70 |
| JPEG-2000 | 51.3578 | 35.17, 44.46, 44.05 | 6.35 |
Acknowledgements
We would like to acknowledge Rakesh Kumar, Hai Tao, Manoj Aggarwal, and Aydin Arpa for
their valuable inputs to this work. This material is based upon work supported by the Air Force Research Laboratory and the Defense Advanced Research Projects Agency under contract F30602-00-0143. Any opinions, findings and conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency or the United States Air Force.
References
Chai, B.-B., et al., 2002. A depth map representation for realtime transmission and view-based rendering of a dynamic 3D-scene. In: First International Symposium on 3D Data Processing Visualization and Transmission, Padova, Italy.
Duchaineau, M., 1997. ROAMing terrain: Real-time optimally adapting meshes. Lawrence Livermore National Laboratory Technical Report, UCRL-JC-127870. Available from [http://www.llnl.gov/graphics/ROAM/](http://www.llnl.gov/graphics/ROAM/).
Girod, B., Magnor, M., 2000. Two approaches to incorporate approximate geometry into multiview image coding. In: Proc. of ICIP.
Gormish, M.J., Lee, D., Marcellin, M.W., 2000. JPEG-2000: Overview, architecture and applications. In: Proceedings of ICIP-2000.
Gortler, S.J., Grzeszczuk, R., Szeliski, R., Cohen, M.F., 1996. The lumigraph, Computer Graphics Proceedings, Annual Conference Series, Proc. SIGGRAPH '96, pp. 43-54.
Grammalidis, N. et al., 2000. Sprite generation and coding in multiview image sequences. IEEE Trans. Circuits Systems Video Technol.
Irani, A., 1996. Parallax Geometry of Pairs of Points for 3-D Analysis. In: Proceedings of ECCV.
ISO/IEC, 1995. MPEG2 Video, ISO/IEC International Standard, 13818-2.
Khodakovsky, A., Schroder, P., Sweldens, W., 2000. Progressive geometry compression. In: Proc. of SIGGRAPH 2000.
Krishnamurthy, R. et al., 2001. Compression and transmission of depth maps for image based rendering. ICIP 2001, pp. 828-831.
Levoy, M., Hanrahan, P., 1996. Light field rendering. In: Computer Graphics Proceedings, Annual Conference Series, Proc. SIGGRAPH '96, pp. 31-42.
Lindstrom, P. et al., 1996. Real-time, continuous level of detail rendering of height fields. In: Proc. SIGGRAPH '96.
Ohm, J.-R., 1999. Stereo/multiview encoding using the MPEG family of standards. Invited Paper, Electronic Imaging '99, San Diego.
Tao, H., Sawhney, H.S., 2000. Global matching criterion and color segmentation based stereo. In: Proc. Workshop on the Application of Computer Vision (WACV2000), pp. 246-253.
Taubin, G., Rossignac, J., 1998. Geometric compression through Topological surgery. ACM Trans. Graphics 17, 2.