Handbook of Multimedia for Digital Entertainment and Arts- P19

Handbook of Multimedia for Digital Entertainment and Arts- P19: The advances in computer entertainment, multi-player and online games, technology-enabled art, culture and performance have created a new form of entertainment and art, which attracts and absorbs their participants. The fantastic success of this new field has influenced the development of the new digital entertainment industry and related products and services, which has impacted every aspect of our lives.


23 Computer Graphics Using Raytracing
G. Sellers and R. Lukac

In vector form, the directions of the reflected and refracted rays can be determined as follows:

cos θ1 = −(N · L)
cos θ2 = √(1 − (n1/n2)² (1 − cos² θ1))
V_reflect = L + 2N cos θ1
V_refract = (n1/n2) L + ((n1/n2) cos θ1 − cos θ2) N        (14)

As an optimization that may be made in the implementation of a raytracer, the medium in the absence of geometry may be assumed to have an index of refraction of one. Therefore, as n2 is now one, the index of refraction for each piece of geometry becomes equal to the ratio of the indices of refraction of the medium to that of its surroundings, simplifying (14) to

V_refract = L + (cos θ1 − cos θ2) N        (15)

Controlling Scene Complexity

Because each intersection of a ray with scene geometry may generate additional rays, the number of rays to be traced through the scene grows geometrically with the number of ray-geometry intersections, and therefore with scene complexity. In extreme cases, the number of rays traced, and therefore the number of intersection tests performed, can grow without bound. Fortunately, very few surfaces are perfectly reflective or transparent; most absorb some of the light energy falling on them. For this reason, the total light energy represented by the secondary rays diminishes with each intersection, and the effect of the secondary rays on the output image becomes less significant with each bounce. Note that once a ray escapes the volume enclosing the scene geometry, it can no longer generate any ray-geometry intersections and it will not contribute further to the output image.

To reduce the computational complexity, and therefore the calculation time required to generate the output image using the raytracing algorithm, various methods are usually employed to limit the number of recursions which may be generated by a single primary ray. In its simplest form, this can be achieved by placing an upper limit on the number of bounces allowed [Shirley 05].
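As an illustration of Equations (14) and (15), the reflected and refracted directions can be computed directly from the incident direction L, the unit normal N, and the two refractive indices. This is a minimal sketch rather than the chapter's implementation; the function name and vector helpers are our own, and a negative value under the square root is treated as total internal reflection:

```python
import math

def reflect_refract(L, N, n1, n2):
    """Compute the reflected and refracted directions per Equation (14).
    L is the (normalized) incident direction pointing toward the surface,
    N the unit surface normal pointing away from it.
    Returns (V_reflect, V_refract); V_refract is None when total
    internal reflection occurs."""
    cos1 = -sum(l * n for l, n in zip(L, N))          # cos θ1 = -(N · L)
    ratio = n1 / n2
    v_reflect = tuple(l + 2.0 * n * cos1 for l, n in zip(L, N))
    k = 1.0 - ratio * ratio * (1.0 - cos1 * cos1)     # = cos² θ2
    if k < 0.0:
        return v_reflect, None                        # total internal reflection
    cos2 = math.sqrt(k)                               # cos θ2
    v_refract = tuple(ratio * l + (ratio * cos1 - cos2) * n
                      for l, n in zip(L, N))
    return v_reflect, v_refract
```

For a ray hitting a surface head-on (L = (0, 0, −1), N = (0, 0, 1)) with equal indices, the reflected direction is (0, 0, 1) and the refracted ray continues straight through.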
In a more sophisticated system, the projected effect of the ray on the output image may be estimated after each intersection [Hall 83]. If the expected contribution of either a reflected or transmitted ray falls below a particular threshold, the new ray is not traced. By altering this threshold, a rough approximation to the final image may be generated quickly for preview purposes, or a more refined, final output may be produced. Figure 3 shows examples of an output image generated for recursion depths of zero to five. In Figure 3(a), no reflection is seen because reflection rays are not generated at the primary ray intersection point. This causes the spheres to appear flat and dull. Figure 3(b) depicts reflections
Fig. 3 An image generated using primary and secondary rays for (a) zero, (b) one, (c) two, (d) three, (e) four and (f) five bounces
Fig. 4 Difference images between each successive bounce: (a) first, (b) second, (c) third and (d) fourth bounce

of the other spheres in the scene due to the first bounce. However, in the reflections of the spheres on the floor, the spheres still have a dull and flat appearance. By adding successively more bounces, enhanced detail in the reflections can be seen, as demonstrated in Figures 3(c) to 3(f), and the differences between each successive recursion of the raytracing algorithm become harder and harder to distinguish. As shown in Figure 4, very little additional information is added past the third bounce. Therefore, the raytracing process can in this case be stopped after three or four bounces.

Image Quality Issues

The primary image quality concern with raytraced images is aliasing artifacts [Cook 86]. As raytracing is a point-sampling technique, it is subject to spatial aliasing when it reaches or exceeds the sampling limit. This limit is described by the Nyquist theorem [Nyquist 28], which states that the maximum frequency of a signal that may be represented by a point-sampled data set is half of the sampling frequency of that set. Any content of a signal beyond this limit will manifest as
an aliased signal at a lower frequency. It follows from this theorem that the sampling rate of the two-dimensional output image is determined by the spacing of the pixels, in other words, by the image resolution. However, the input signal is essentially the scene geometry, which may contain very fine details.

Aliasing can appear in a number of ways. The most obvious effect is jagged edges of objects in the output image. Another serious artifact is the disappearance of fine details, including very small objects. Because objects in a ray-traced scene may be represented analytically, it is possible for a small object to fall between primary rays. In this case, no intersection will be detected; the small object will not have any effect on the output image and will therefore not be visible. In an animation where the small object moves between frames, the object will generate an intersection in some frames and not in others. It will therefore seem to appear and disappear as it moves. Figure 5 shows an image with a procedurally generated checkered texture; it has infinite detail, and the smallest squares seen in the distance cover an area significantly smaller than a pixel. As can be seen, the image rendered with no antialiasing (Figure 5a) has noticeably lower quality than the image rendered with a moderate level of antialiasing (Figure 5b).

To compensate for aliasing artifacts, various antialiasing techniques have been devised. Most methods for reducing aliasing artifacts are based on the concept of super-sampling, which refers to the computation of samples at a higher frequency than the output resolution [Dobkin 96]. A simple approach to super-sampling consists of taking multiple samples for each pixel in the output image, and using the averaged value of these samples as the output value.
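The averaging scheme just described can be sketched in a few lines. This is an illustrative fragment rather than the chapter's implementation; `trace(x, y)`, assumed to return the intensity carried by a primary ray through image-plane point (x, y), is a hypothetical callback:

```python
def supersample_pixel(trace, px, py, n=3):
    """Average an n-by-n grid of subpixel samples for the pixel whose
    lower-left corner is at (px, py). Each sample is taken at the
    center of its subpixel (regular-grid super-sampling)."""
    total = 0.0
    for i in range(n):
        for j in range(n):
            x = px + (i + 0.5) / n   # subpixel center, x
            y = py + (j + 0.5) / n   # subpixel center, y
            total += trace(x, y)
    return total / (n * n)
```

With a horizontal gradient such as `trace = lambda x, y: x`, the averaged value for the pixel at the origin is 0.5, the intensity at the pixel center, as expected.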
This is equivalent to generating a higher resolution output image and then down-sampling it to the desired output resolution. Namely, each pixel is subdivided into several subpixels; for example,

Fig. 5 An infinite checkered plane rendered (a) without antialiasing and (b) with modest antialiasing
Fig. 6 Grid sampling: (a) regular pattern and (b) pseudorandom pattern

into a three by three grid, producing nine subpixels per output pixel. A primary ray is then generated passing through each of the subpixels. If the center of each subpixel is used as the target for the primary ray, then aliasing artifacts can still be seen, albeit at higher frequencies. This is because the sampling pattern is still regular, and there may still be scene content that would generate image data beyond the new, higher sampling frequency. To compensate for this, more sample points can be added, dividing each output pixel into more and more subpixels. Alternatively, an irregular sampling pattern can be created by moving the sample positions within the subpixels. This is known as jittered grid super-sampling. Figure 6(a) shows the spacing of the three by three subpixels of the regular grid. The randomized positions used in a jittered grid are shown in Figure 6(b). The advantage of arranging the subsample positions on a pseudorandom grid is that not only are the effects of aliasing reduced, but straight lines at all angles are equally well represented, which is not the case when a regular grid pattern is used.

Figure 7 shows the same image rendered with a regular sampling grid in Figure 7(a) and with a jittered grid in Figure 7(b). Inset at the top of each image is a magnified section of the upper edge of each square. As can be seen, the appearance of jagged hard edges in the rendered image is reduced as the number of shades used to represent the edge increases.

There are a large number of other methods of implementing antialiasing. Many of these methods are adaptive [Painter 89], [Mitchell 90], allowing higher levels of antialiasing to be applied at edges or other areas with high-frequency content. However, an in-depth analysis is beyond the scope of this article and we will not discuss the topic further here.
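A jittered grid as described above can be generated by placing one pseudorandom sample inside each subpixel. The sketch below is our own illustration of the idea; the function name and the use of Python's `random` module are assumptions, not part of the chapter:

```python
import random

def jittered_grid(n, rng=None):
    """Return n*n sample positions inside a unit pixel: one uniformly
    random point within each cell of an n-by-n subpixel grid."""
    rng = rng or random.Random()
    samples = []
    for i in range(n):
        for j in range(n):
            # (i + u)/n stays inside subpixel column i for 0 <= u < 1,
            # so every cell receives exactly one jittered sample.
            samples.append(((i + rng.random()) / n,
                            (j + rng.random()) / n))
    return samples
```

Replacing the random offsets with a constant 0.5 would recover the regular grid of Figure 6(a); the random offsets give a pattern like Figure 6(b).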
Fig. 7 Image rendered (a) with a regular sampling grid and (b) with a jittered grid

Acceleration of Raytracing

As already discussed, the computations involved in raytracing can quickly become extremely complex. To generate a high-resolution output with acceptable levels of antialiasing and detailed scene geometry, billions of calculations may need to be performed. Possibly the largest bottleneck in any raytracer is the intersection calculations. Turner Whitted estimated that for reasonably complex scenes, almost 95% of the computation time is spent performing intersection tests [Whitted 80]. As primary rays enter the scene, they must be tested for intersection with all elements of the scene geometry. The performance of a raytracer is thus heavily influenced by its ability to coarsely discard large parts of the scene from the set of geometry that must be tested for intersection against a given primary ray [Clark 76]. Furthermore, any secondary rays generated as a result of these intersections must also be tested against the scene geometry, potentially generating millions of additional intersection tests.

There has been a large amount of work in the area of acceleration and optimization of raytracing algorithms [Weghorst 84], [Kay 86]. While optimizing the intersection test itself can produce reasonable results, it is also important to minimize the number of tests that are performed. This is usually achieved by employing hierarchical methods for culling, mostly those based on bounding volumes and space partitioning tree structures [Arvo 89].

Bounding Volumes

When using a bounding hierarchy approach to coarsely cull geometry, a tree is constructed with each node containing the bounding volumes of finer and finer
pieces of the scene geometry [Rusinkiewicz 00]. At each of the leaf nodes of the tree, individual geometry elements of the scene are contained in a form that is directly tested for intersection with the ray being traced. Two important considerations must be made when choosing a structure for the tree representing the bounding hierarchy.

First, the geometric primitive representing the bounding volume must be one that is simple to test for the presence of an intersection with a line segment in three-dimensional space. All that is needed is a quick determination of whether there is an intersection with the bounding volume or not. If it can be determined that no intersection is made with the bounding volume, all of the geometry contained within that volume may be discarded from the potential set of geometry that may be intersected by the ray. As all that is desired is to know whether a ray may intersect geometry within the volume, the exact location of the intersection of the ray and the geometric primitive is not important; it is sufficient to know only whether there is an intersection with the primitive.

Second, as the use of bounding volumes for culling geometry is an optimization strategy, the time taken to generate the bounding volumes should be less than the time saved by using them during the raytracing process. If it were not, the total time required to generate the bounding hierarchy and then use it to render the image would be greater than the time required to render the image without the acceleration structure.

For these reasons, spheres are often used as the geometric primitive representing bounding volumes [Thomas 01]. While finding a sphere that encloses a set of geometry reasonably tightly is straightforward, finding the smallest enclosing sphere for that geometry is a rather complex task.
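Because only the presence or absence of a hit matters for culling, the ray/bounding-sphere test reduces to a sign check on the discriminant of the ray-sphere equation, with no intersection point ever computed. The following is a sketch under the assumption of a normalized ray direction; the function name is ours, not the chapter's:

```python
def ray_hits_sphere(origin, direction, center, radius):
    """Boolean ray vs. bounding-sphere test. `direction` must be
    normalized; the ray is origin + t*direction for t >= 0.
    Reports only whether an intersection exists."""
    # Vector from the ray origin to the sphere center.
    oc = tuple(c - o for o, c in zip(origin, center))
    b = sum(v * d for v, d in zip(oc, direction))   # projection onto the ray
    c = sum(v * v for v in oc) - radius * radius    # |oc|^2 - r^2
    if c <= 0.0:
        return True          # ray starts inside the sphere
    if b < 0.0:
        return False         # sphere center lies behind the ray origin
    return b * b - c >= 0.0  # discriminant of the quadratic, sign only
```

If the test returns False, everything inside the sphere can be culled; if it returns True, the node's children must still be examined, since the ray may miss all of the contained geometry.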
Although the bounding sphere does not necessarily represent the smallest possible bounding volume for a specific set of geometry, it is relatively trivial to test for the presence or absence of an intersection. The lack of an intersection between the ray and the bounding volume allows the volume to be skipped in the tree, but the presence of an intersection with a bounding volume does not necessarily indicate the existence of an intersection with part of the geometry contained within it. Therefore, if an intersection with part of the bounding hierarchy is detected, the volumes and geometry within that bounding volume must also be tested. It is possible that a ray may intersect part of the bounding hierarchy yet not intersect any geometry within it.

Space Partitioning Tree Structures

Popular examples of space partitioning data structures are binary space partitioning trees [Fuchs 80], octrees, and kd-trees [Bentley 75]. These tree structures recursively subdivide volumes into smaller and smaller parts.

The binary space partitioning tree was originally described by Henry Fuchs in 1980 [Fuchs 80]. In this method, a binary tree is constructed by recursively subdividing the three-dimensional space of the scene by planes. At each node, a new plane is constructed, with geometry that falls on one side of the plane placed in one child branch of the tree and geometry falling on the other side of the plane placed in the
other. Eventually, each child node of the binary tree will contain a minimal number of geometric primitives, or it will no longer be possible to further subdivide the space without intersecting scene geometry. In the latter case, a decision must be made as to whether to include the geometry in both children of the subdivision plane or to cease the subdivision process.

The octree is a space partitioning tree structure that operates on volumes of three-dimensional space. The entirety of the scene geometry is enclosed in an axis-aligned bounding box. At each node, the space is subdivided in half along each axis, giving the node eight child nodes. Any geometry that straddles more than one of the children is placed within the current node. Geometry that is not cut by the subdivision is carried further down the recursion until it too is cut. The recursion stops either when no geometry is left to carry to the next smaller level, or after some predefined number of steps to limit the recursion depth. Each node maintains a list of the geometry that falls within it. As rays are cast through the scene, a list of octree nodes through which the ray will pass is generated, and only the geometry contained within those nodes need be tested for intersections. This has the potential to greatly accelerate the tracing process, especially in cases where the scene is made up of a large amount of very fine geometry.

Hardware Accelerated Raytracing

Recently, interest has been shown in using hardware to accelerate raytracing operations, particularly within the context of real-time raytracing. In some instances, standard PC hardware has been shown to have become fast enough to produce high quality raytraced images at interactive and real-time rates. For example, Intel has shown a demonstration of the popular video game Enemy Territory: Quake Wars rendered using raytracing [Phol 09].
In this project, the team was able to achieve frame rates of between 20 and 35 frames per second using four 2.66 GHz Intel Dunnington processors. While not technically hardware acceleration, this demonstrates what is currently possible using very high-end hardware. Furthermore, Intel's upcoming Larrabee product is expected to be well suited to raytracing [Seiler 08].

Moving further along the spectrum of hardware acceleration, the Graphics Processing Unit (GPU) has been used for raytracing. At first, it was necessary to map the raytracing problem to existing graphics application programming interfaces, such as the Open Graphics Library (OpenGL), as in [Purcell 02]. More recently, the GPU has been used as a more general purpose processor, and applications such as raytracing have become more efficient on such systems. For example, NVidia presented an implementation of a real-time raytracing algorithm using their CUDA platform [Luebke 08].

Dedicated hardware has also been investigated as a means to achieve high performance in raytracing applications. An example of such work can be found in [Schmittler 04], where a dedicated raytracing engine was prototyped using a field-programmable gate array (FPGA). The same team has continued to work on the
system [Woop 05] and has achieved impressive results using only modest hardware resources. The outcome has been a device designed to accelerate scene traversal and ray-geometry intersection tests. Using a 66 MHz FPGA, the project has shown raytraced scenes rendered at more than 20 frames per second. It can be expected that should such a device scale to the multi-gigahertz range seen in modern CPU implementations, significantly higher performance could be achieved.

Summary

This chapter presented the fundamentals of raytracing, an advanced computer graphics method used to render an image by tracing the path of light through pixels in an image plane. Raytracing is not a new field; however, a large amount of active research is still being conducted on the subject, as well as in the related field of photon mapping. Probably the main factor behind this interest is that direct rendering techniques such as rasterization start to break down as the geometry used by artists becomes finer and more detailed, and polygons begin to cover screen areas smaller than a single pixel. Therefore, more sophisticated computer graphics solutions are called for. Micropolygon-based techniques, such as the well-known Reyes algorithm, constitute one such modern solution. Although these techniques have been used for some time to render extremely fine geometry and implicit surfaces in offline systems such as those used for movies, raytracing is believed to be a computer graphics method for tomorrow. With the continued increase in computing power available to consumers, it is quite possible that interactive and real-time raytracing could become a commonly used technique in video games, digital content creation, computer-aided design, and other consumer applications.

References

[Appel 68] Appel, A. (1968), Some Techniques for Shading Machine Renderings of Solids,
in Proceedings of the Spring Joint Computer Conference, Volume 32, pp. 37–49.
[Whitted 80] Whitted, T. (1980), An Improved Illumination Model for Shaded Display, in Communications of the ACM, Volume 23 (6), pp. 343–349.
[Phong 73] Bui, T. P. (1973), Illumination of Computer Generated Images, Department of Computer Science, University of Utah, July 1973.
[Perlin 01] Perlin, K. (2001), Improving Noise, in Computer Graphics, Volume 35 (3).
[Jensen 96] Jensen, H. W. (1996), Global Illumination Using Photon Maps, in Rendering Techniques '96, Springer Wien, X. Pueyo and P. Schröder, Eds., pp. 21–30.
[Phol 09] Pohl, D. (2009), Light It Up! Quake Wars Gets Ray Traced, in Intel Visual Adrenalin Magazine, Issue 2, 2009.
[Purcell 02] Purcell, T., Buck, I., Mark, W. R., and Hanrahan, P. (2002), Ray Tracing on Programmable Graphics Hardware, in ACM Transactions on Graphics, 21 (3), pp. 703–712 (Proceedings of SIGGRAPH 2002).
[Luebke 08] Luebke, D. and Parker, S. (2008), Interactive Ray Tracing with CUDA, Presentation, NVidia Sponsored Session, SIGGRAPH 2008.
[Seiler 08] Seiler, L. et al. (2008), Larrabee: A Many-Core x86 Architecture for Visual Computing, in ACM Transactions on Graphics, 27 (3), Article 18.
[Schmittler 04] Schmittler, J., Woop, S., Wagner, D., Paul, W., and Slusallek, P. (2004), Realtime Ray Tracing of Dynamic Scenes on an FPGA Chip, in Proceedings of Graphics Hardware 2004, Grenoble, France, August 28–29, 2004.
[Woop 05] Woop, S., Schmittler, J., and Slusallek, P. (2005), RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing, in ACM Transactions on Graphics, 24 (3), pp. 434–444 (Proceedings of SIGGRAPH 2005).
[Jarosz 08] Jarosz, W., Jensen, H. W., and Donner, C. (2008), Advanced Global Illumination Using Photon Mapping, ACM SIGGRAPH 2008 Classes.
[Nyquist 28] Nyquist, H. (1928), Certain Topics in Telegraph Transmission Theory, in Transactions of the AIEE, Volume 47, pp. 617–644 (Reprinted in Proceedings of the IEEE, Volume 90 (2), 2002).
[Fuchs 80] Fuchs, H., Kedem, Z. M., and Naylor, B. F. (1980), On Visible Surface Generation by A Priori Tree Structures, in Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques, pp. 124–133.
[Cook 87] Cook, R. L., Carpenter, L., and Catmull, E. (1987), The Reyes Image Rendering Architecture, in Proceedings of SIGGRAPH '87, pp. 95–102.
[Foley 90] Foley, J. D., van Dam, A., Feiner, S. K., and Hughes, J. F. (1990), Computer Graphics: Principles and Practice, 2nd Ed.
[Hanrahan 93] Hanrahan, P. and Krueger, W. (1993), Reflection from Layered Surfaces due to Subsurface Scattering, in Proceedings of SIGGRAPH 1993, pp. 165–174.
[Weidlich 08] Weidlich, A. and Wilkie, A. (2008), Realistic Rendering of Birefringency in Uniaxial Crystals, in ACM Transactions on Graphics, 27 (1), pp. 1–12.
[Henning 04] Henning, C. and Stephenson, P.
(2004), Accelerating the Ray Tracing of Height Fields, in Proceedings of the 2nd International Conference on Computer Graphics and Interactive Techniques in Australasia and South East Asia, pp. 254–258.
[Houston 06] Houston, B., Nielson, M. B., Batty, C., and Museth, K. (2006), Hierarchical RLE Level Set: A Compact and Versatile Deformable Surface Representation, in ACM Transactions on Graphics, Volume 25 (1), pp. 151–175.
[Jensen 01] Jensen, H. W. (2001), Realistic Image Synthesis Using Photon Mapping, AK Peters, Ltd., ISBN 978-1568811475.
[Glaeser 99] Glaeser, G. and Gröller, E. (1999), Fast Generation of Curved Perspectives for Ultra-wide-angle Lenses in VR Applications, in The Visual Computer, Volume 15 (7–8), pp. 365–376.
[Glassner 89] Glassner, A. S. (1989), An Introduction to Ray Tracing, Morgan Kaufmann, ISBN 978-0122861604.
[Blinn 77] Blinn, J. F. (1977), Models of Light Reflection for Computer Synthesized Pictures, in Proceedings of the 4th Annual Conference on Computer Graphics and Interactive Techniques, pp. 192–198.
[Cook 82] Cook, R. L. and Torrance, K. E. (1982), A Reflectance Model for Computer Graphics, in ACM Transactions on Graphics, Volume 1 (1), pp. 7–24.
[Oren 95] Oren, M. and Nayar, S. K. (1995), Generalization of the Lambertian Model and Implications for Computer Vision, International Journal of Computer Vision, Volume 14 (3), pp. 227–251.
[Kajiya 86] Kajiya, J. T. (1986), The Rendering Equation, Computer Graphics, Volume 20 (4), pp. 143–150, Proceedings of SIGGRAPH '86.
[Reddy 97] Reddy, M. (1997), The Graphics File Formats Page, http://www.martinreddy.net/gfx/3d-hi.html, Updated June 1997, Retrieved February 8, 2009.
[Woo 02] Woo, M., Neider, J. L., Davis, T. R., and Shreiner, D. R. (2002), OpenGL Programming Guide, Third Edition, Addison Wesley, ISBN 0-201-60458-02, p. 667.
[Peachy 85] Peachy, D. (1985), Solid Texturing of Complex Surfaces, in Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, pp. 279–286.
[Marschner 05] Marschner, S. R., Westin, S. H., Arbree, A., and Moon, J. T. (2005), Measuring and Modeling the Appearance of Finished Wood, in ACM Transactions on Graphics, Volume 24 (3), pp. 727–734.
[Blinn 78] Blinn, J. F. (1978), Simulation of Wrinkled Surfaces, in Proceedings of the 5th Annual Conference on Computer Graphics and Interactive Techniques, pp. 286–292.
[Lambert 1760] Lambert, J. H. (1760), Photometria sive de mensura et gradibus luminis, colorum et umbrae.
[Bentley 75] Bentley, J. L. (1975), Multidimensional Binary Search Trees Used for Associative Searching, in Communications of the ACM, Volume 18 (9), pp. 509–517.
[Cook 86] Cook, R. L. (1986), Stochastic Sampling in Computer Graphics, in ACM Transactions on Graphics, Volume 5 (1), pp. 51–72.
[Dobkin 96] Dobkin, D. P., Eppstein, D., and Mitchell, D. P. (1996), Computing the Discrepancy with Applications to Supersampling Patterns, in ACM Transactions on Graphics, Volume 15 (4), pp. 345–376.
[Clark 76] Clark, J. H. (1976), Hierarchical Geometric Models for Visible Surface Algorithms, in Communications of the ACM, Volume 19 (10), pp. 547–554.
[Weghorst 84] Weghorst, H., Hooper, G., and Greenberg, D. P. (1984), Improved Computational Methods for Ray Tracing, in ACM Transactions on Graphics, Volume 3 (1), pp. 52–69.
[Kay 86] Kay, T. L. and Kajiya, J. T. (1986), Ray Tracing Complex Scenes, in Computer Graphics, Volume 20 (4), pp. 269–278.
[Arvo 89] Arvo, J. and Kirk, D. (1989), A Survey of Ray Tracing Acceleration Techniques, in An Introduction to Ray Tracing, Academic Press, ISBN 0-12-286160-4, pp. 201–262.
[Rusinkiewicz 00] Rusinkiewicz, S. and Levoy, M.
(2000), QSplat: A Multiresolution Point Rendering System for Large Meshes, in Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 343–352.
[Thomas 01] Thomas, F. and Torras, C. (2001), 3D Collision Detection: A Survey, in Computers and Graphics, Volume 25, pp. 269–285.
[Painter 89] Painter, J. and Sloan, K. (1989), Antialiased Ray Tracing by Adaptive Progressive Refinement, in Proceedings of the 1989 SIGGRAPH Conference, pp. 281–288.
[Mitchell 90] Mitchell, D. P. (1990), The Antialiasing Problem in Ray Tracing, SIGGRAPH 1990 Course Notes.
[Shirley 05] Shirley, P., Ashikhmin, M., Gleicher, M., Marschner, S., Reinhard, E., Sung, K., Thompson, W., and Willemsen, P. (2005), Fundamentals of Computer Graphics, 2nd Ed., pp. 201–237, A.K. Peters Ltd., ISBN 978-1568812694.
[Hall 83] Hall, R. A. and Greenberg, D. P. (1983), A Testbed for Realistic Image Synthesis, in IEEE Computer Graphics and Applications, Volume 3 (8), pp. 10–20.
Chapter 24
The 3D Human Motion Control Through Refined Video Gesture Annotation

Yohan Jin, Myunghoon Suk, and B. Prabhakaran

Introduction

In the early days of the computer and video game industry, simple game controllers consisting of buttons and joysticks were employed, but recently game consoles have been replacing joystick buttons with novel interfaces such as remote controllers with motion sensing technology on the Nintendo Wii [1]. In particular, video-based human computer interaction (HCI) techniques have been applied to games; a representative example is 'EyeToy' on the Sony PlayStation 2. Video-based HCI has the great benefit of freeing players from the cumbersome game controller. Moreover, as a way for humans and computers to communicate, video-based HCI is very attractive since it is intuitive, easy to pick up, and inexpensive.

On the other hand, extracting semantic low-level features from video human motion data is still a major challenge. The level of accuracy is highly dependent on each subject's characteristics and environmental noise. Of late, people have been using 3D motion-capture data for visualizing real human motions in 3D space (e.g., 'Tiger Woods' in EA Sports games, 'Angelina Jolie' in the movie Beowulf) and for analyzing motions for specific performance (e.g., 'golf swing' and 'walking'). A 3D motion-capture system ('VICON') generates a matrix for each motion clip. Here, a column corresponds to a human sub-body part and a row represents a time frame of the data capture. Thus, we can extract a sub-body part's motion simply by selecting specific columns. Unlike the low-level feature values of video human motion, the 3D human motion-capture data matrix does not contain pixel values, but is closer to the human level of semantics. The motivation of this paper starts from the following observations.
Video-based human motion data is essential for human computer interaction, but there is a semantic gap between human perceptions and the low-level feature values of video human motions. It might be partially covered by a good machine learning algorithm,

Y. Jin, M. Suk, and B. Prabhakaran
MySpace (Fox Interactive Media), 407 N. Maple Dr., Beverly Hills, CA 90210
Department of Computer Science, University of Texas at Dallas, TX, USA
e-mail: ychin@myspace.com; mhs071000@utdallas.edu; praba@utdallas.edu

B. Furht (ed.), Handbook of Multimedia for Digital Entertainment and Arts,
DOI 10.1007/978-0-387-89024-1_24, © Springer Science+Business Media, LLC 2009
but it is quite susceptible to variations in subjects and environments. Thus, we need to refine video human motion's low-level features by using more semantically well represented data, such as motion-capture data in this case. We show how we can use 3D motion-capture data as a knowledge base for understanding video human motion classes. In achieving this goal, there is a barrier. Two motion examples belonging to one class (e.g., 'backhand' motion) of video and 3D motion-capture data are visually similar to a human. But in terms of the underlying data representation, video and 3D motion-capture data are heterogeneous in nature. Video low-level features are extracted from pixel intensity values, while the 3D motion-capture matrix contains translational and rotational DOF values of human motions.

To refine video low-level feature data, we mix human video low-level features and semantic 3D motion-capture features and use them as the "hidden" states in the Hidden Markov Model (HMM). The HMM has already been used widely for speech recognition, and for human gesture recognition as well. In this paper, we show that an HMM can combine the two heterogeneous data sources, merging the "knowledge-based" semantic 3D motion-capture data as the hidden state. We show that this 3D motion-capture-assisted HMM model can significantly improve the video human motion recognition rate. The player's motions in the 3D game are captured as input through the camera, and the video-based human motions are estimated from the HMM. The 3D game can then be controlled with the estimated results. To synthesize the human motion in the game, we prepare several atomic motion clips. For a video motion sequence, the motion clip matching the motion estimated from the HMM is selected as output. The gaps between different motion clips can be smoothly interpolated.

Related Work

Video-based human gesture problems have attracted great interest in the computer vision, machine learning, and vision-based computer graphics areas.
Human action recognition approaches can be separated into several different categories. First, in gait-tracking-based methods, Yacoob et al. [22] tracked legs and obtained parameterized values for matching in the eigenspace. Second, in motion-feature-value-based approaches, Huang et al. [21] computed principal component analysis on feature values of the human silhouette. Using these reduced-dimensional motion feature values, they tackled the gait-analysis problem. Masoud et al. [4] also used motion extracted directly from images. They represent motion features as a feature image. Subsequently, feature images were mapped to a lower-dimensional manifold and matched with the most likely human action class. Third, some approaches require devices for capturing human actions at a refined level. Li et al. [23] used a data-glove for capturing American Sign Language and obtained feature values using SVD.

The Hidden Markov Model [20] is a powerful and useful model for recognizing speech and video human gestures. Yamato et al. [15] first applied the HMM model to recognizing video human tennis gesture categories, and pointed out that when training
24 The 3D Human Motion Control Through Refined Video Gesture Annotation 553

and testing data came from different subjects, the accuracy dropped. Yang et al. [16] showed how vector quantization can be applied to feature values within an HMM, for both isolated and continuous gesture recognition. Eickeler et al. [19] rejected unknown gestures with an HMM by learning undefined action classes. They also demonstrated that the HMM can deal with very subtle gestures, such as "hand waving", "spin", and "head moving", and that it reduces error significantly better than neural networks [18]. Some approaches try to merge "knowledge" into the HMM for better performance. Yoshio et al. [3] combined an HMM with an automaton process using one-to-one gesture recognition results; this is very local knowledge and delays the decision process. Recently, Neil et al. [1] used domain knowledge to improve the maximum-likelihood estimates: "rules" defined by a domain expert can smooth commentary on human tennis actions.

Vision-based human motion input is the most convenient and practically applicable way to capture human actions, so many graphics applications accept vision input and use its low-dimensional values as control signals [5][6][7][8]. Park et al. [5] synthesized 3D human motion that follows the trajectories of 2D human actions in soccer video clips. Jinxiang et al. generated new 3D character motions based on the poses of a video human actor [6] and synthesized facial expressions following visually captured expressions through parameterized PCA comparison [7]. 2D video sequences provide animation with high-level information extracted from sequences of images [8], and motion-capture data embeds the refined knowledge needed for high-quality expression [7]. Conversely, 3D motion-capture data, with its semantic characteristics, can help the video-based human motion recognition problem. Hodgins et al.
[10] segmented motion-capture data into distinct behaviors in an unsupervised way, very accurately and much more simply than video segmentation [2]. Li et al. [23] achieved a very high performance ratio for segmenting and classifying human motion-capture data in real time. Tian et al. [24] classified hand signals using 3D motion-capture data by minimizing the back-projection error between each video frame and the 3D motion sequence with the Dynamic Time Warping technique; the intuition is similar to this paper's. Here, we apply 3D motion-capture data examples to recognize whole-body human motion in video and demonstrate how the heterogeneous data (video and 3D motion capture) can be mixed and embedded into a Hidden Markov Model. We can thus quantify how much the 3D-motion-capture-assisted methodology helps compared with the traditional HMM approach.

Proposed Approach

In this work, we propose a way of using 3D motion-capture data streams to recognize video human motion data. We treat the 3D motion-capture data as the knowledge base, since human effort and knowledge are invested while capturing 3D human motion. A subject wears reflective markers [13]
that correspond to body joint segments (e.g., 'lower back', 'upper back', and 'thorax' belong to the 'Torso'). In the 3D motion-capture matrix, each joint's degrees of freedom are column vectors and the rows are time frames, so we can select any specific body segment of interest; the degree-of-freedom values are also close to a human semantic representation. First, to establish a correspondence between video and 3D motion-capture data, the motion-capture data must be down-sampled, since its frame rate (120 fps) is four times the video frame rate (30 fps); see Figure 1. After down-sampling, the 3D motion-capture frames (3df_1, 3df_5, 3df_9, …, 3df_m) correspond to the video frames (vf_1, vf_2, vf_3, …, vf_n), where the down-sampling relation is m = 4(n − 1) + 1. Second, we extract the representative features and combine them into a single matrix. The feature vector of a video frame (vf_i) has 10-dimensional values

Fig. 1 Knowledge-based HMM learning for video gesture recognition using a 3D motion-capture data stream
(vf_i(1), vf_i(2), …, vf_i(10)) based on a human motion shape descriptor; this is a column vector of the video motion feature matrix in Figure 1. We spatially reduce a motion-capture frame (3df_i) to 5-dimensional values. We explain the details of extracting low-level features from uncompressed video frames in Section "Video Human Motion Feature Extraction" and of obtaining the semantically reduced low-dimensional representation from the 3D motion-capture matrix in Section "3D Human Motion Capture Data". To mix these two heterogeneous data types, we build a 6-dimensional hybrid matrix hf_i and map it into a higher-dimensional kernel space (F ∈ R^11). We can then translate the hybrid data matrix into a 1-dimensional sequence, which is the "hidden" state transition sequence. Corresponding to this hidden state sequence, there is a 1-dimensional observation sequence of size n obtained by clustering the 10-dimensional video streams. Based on this one-to-one mapping between the observation sequence (O_1, O_2, …, O_n) and the hidden states (q_1, q_2, …, q_n), we can obtain the parameters (π_i (initial), a_ij (transition), and e_j(O_k) (emission)) of the current HMM λ_I using all the video human motion clips i ∈ I in the training set.

This proposed approach enables us to compute the HMM parameters directly from the 3D motion-capture data acting as the "hidden" states. Thus, we do not need to go through an HMM learning process (such as the Baum-Welch algorithm), which is computationally expensive and suffers from "bad" local maxima from time to time. In the experimental part, we show that our proposed method (knowledge-based parameter learning) outperforms the traditional Baum-Welch algorithm.

Human Motion Analysis & Comparison

Video Human Motion Feature Extraction

We use the Motion History Image (MHI) [11] to capture how the motion images are moving. The MHI sets a temporal window size τ = c as the 'motion history' (we use 4 frames as the size of τ). It recursively calculates a pixel's motion intensity value H_τ(x, y, t) based on the motion image M(x, y, t), a binary image that contains "1" for all active pixels and "0" for all others (see Figure 2). To compute the motion history intensity H_τ(x, y, t), every pixel (x, y) whose motion image value M(x, y, t) equals "1" is assigned H_τ(x, y, t) = τ, and every other pixel is assigned max(0, H_τ(x, y, t − 1) − 1). In the sub-image I^t_sub at time t (Figure 2), pixel 'A' takes the value τ in the motion history image, since its motion image value is '1'. The motion image value of pixel 'B', on the other hand, is '0', so its value H_τ(B_x, B_y, t) (= τ − 2) is obtained by choosing the maximum of 0 and H_τ(B_x, B_y, t − 1) − 1. The MHI thus shows recently moving image pixels brighter. In this work, we need representative "low-level feature" values for each motion image. To obtain such low-level feature values, we compute
Fig. 2 Motion History Image based video human motion low-level feature extraction

scale- and rotation-invariant shape moment variables, using the motion history scalar values as the pixel intensity values ψ(x, y):

ψ(x, y) ← H_τ(x, y, t)   (1)

To explain how low-level features are extracted from the motion history image matrix, we choose the sub-image I^t_sub. To compute the spatial moment values from this sub-image (its size is 3 × 3 pixels, with origin pixel (x_i, y_j)),
m11 = Σ_{x=x_i}^{x_i+2} Σ_{y=y_j}^{y_j+2} H_τ(x, y, t) · x · y   (2)

For the sub-image of Figure 2, where the active pixels have intensity τ and pixel B has intensity τ − 2, this expands to

m11 = τ [x_i y_j + (x_i + 1) y_j + (x_i + 2) y_j + (x_i + 1)(y_j + 1) + (x_i + 1)(y_j + 2) + (x_i + 2)(y_j + 2)] + (τ − 2) x_i (y_j + 2)   (3)

Once the window parameter has been chosen, the spatial moment value is easily computed by multiplying each pixel's intensity with its coordinates. This value is variant under geometric transformations, so the centralized moment value is also extracted, adjusting the x and y coordinates by the mass center:

μ11 = Σ_{x=x_i}^{x_i+2} Σ_{y=y_j}^{y_j+2} H_τ(x, y, t) (x − m10/m00)(y − m01/m00)   (4)

Additionally, the normalized central moment η11 = μ11 / m00² is another representative value. When we extract the low-level features of an (uncompressed) image, we compute the three moment values (m11, μ11, η11) over the whole image (x_i = 0 to width, y_j = 0 to height in Eqs. 2 and 4). Since video human motion data varies with each subject's shape and movement characteristics, we construct a 10-dimensional video human motion low-level feature matrix by adding the 7 Hu moment variables [9][14], which are invariant to rotation, scale, and reflection.

3D Human Motion Capture Data

We consider a human motion to be characterized by a combination of three main body parts: torso, arms, and legs, each comprising several segments. We divide the 29 segments of the human body into these three sets. The torso consists of 7 segments (degrees of freedom in parentheses): root (6), lower back (3), upper back (3), thorax (3), lower neck (3), upper neck (3), and head (3). The arms consist of 7 pairs of segments, left and right: clavicle (2), humerus (3), radius (1), wrist (1), hand (2), fingers (1), and thumb (2). Finally, the legs consist of 4 pairs of segments, left and right: femur (3), tibia (1), foot (2), and toes (1).
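The segment grouping above can be sketched in code. This is an illustrative Python sketch, not code from the paper: the per-segment DOF counts come from the text (24 torso + 24 arm + 14 leg = 62 total), while the column ordering of the capture matrix and the segment-name strings are assumptions.

```python
import numpy as np

# Per-segment DOF counts from the text; the layout order is an assumption.
SEGMENT_DOF = {
    "torso": [("root", 6), ("lowerback", 3), ("upperback", 3), ("thorax", 3),
              ("lowerneck", 3), ("upperneck", 3), ("head", 3)],
    "arms": [(side + seg, dof) for side in ("r", "l")
             for seg, dof in [("clavicle", 2), ("humerus", 3), ("radius", 1),
                              ("wrist", 1), ("hand", 2), ("fingers", 1),
                              ("thumb", 2)]],
    "legs": [(side + seg, dof) for side in ("r", "l")
             for seg, dof in [("femur", 3), ("tibia", 1), ("foot", 2),
                              ("toes", 1)]],
}

def split_parts(M):
    """Split an f x 62 motion matrix into torso/arms/legs sub-matrices,
    assuming columns follow the SEGMENT_DOF order."""
    parts, col = {}, 0
    for part, segs in SEGMENT_DOF.items():
        width = sum(dof for _, dof in segs)
        parts[part] = M[:, col:col + width]
        col += width
    assert col == M.shape[1] == 62  # 24 (torso) + 24 (arms) + 14 (legs)
    return parts
```

The three sub-matrices returned here are the M^α (torso), M^β (arms), and M^γ (legs) used in the reduction that follows.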
To extract the spatial relationships among the three body components and reduce the dimensionality (from 62 to 3 dimensions), we separate a motion data matrix M_{f×m} into three sub-matrices (M^α = M_{f×k}, M^β = M_{f×j}, M^γ = M_{f×r}, where m = k + j + r) belonging to the torso, arms, and legs respectively. From each sub-matrix, SVD extracts the "singular" values (see Figure 3):

M^i = U Σ V^T,  M^i v_1 = σ^i u_1,  i ∈ {α, β, γ}   (5)
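Equation (5) can be checked numerically. The sketch below is ours, not the paper's code, and reflects our reading of the reduction: a body part's per-frame coefficient is its sub-matrix projected onto the first right singular vector v_1, which by the SVD equals σ^i u_1.

```python
import numpy as np

def part_feature(M_i):
    """Per-frame 'singular' coefficient of one body-part sub-matrix M^i
    (frames x DOFs), as in Eq. (5): M^i = U S V^T, so M^i v1 = sigma^i u1."""
    U, S, Vt = np.linalg.svd(M_i, full_matrices=False)
    v1 = Vt[0]          # first right singular vector
    coeff = M_i @ v1    # one scalar per frame
    # Sanity check of Eq. (5): projecting onto v1 gives sigma^i * u1.
    assert np.allclose(coeff, S[0] * U[:, 0])
    return coeff
```

Applying this to the torso, arm, and leg sub-matrices yields the three columns of the reduced matrix M_{f×3} described next.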
Fig. 3 Spatial dimensional reduction into singular values (3D, 4D, 5D)

The three "singular" values (σ^α, σ^β, σ^γ) representing the torso, arms, and legs are the coefficients of each frame, used as its spatial feature; this gives a reduced matrix M_{f×3} for a single human motion clip. The singular values capture the periodic and continuous characteristics of human motion: an increasing singular value for one body part indicates that this part is used more intensively than the others in a particular motion (see Figure 4).

The three-way segmented human motion matrices (M^t, M^a, M^l) can be divided further to describe more detailed human motions (see Figure 3). For a 4D spatial motion representation, we can divide M^a into two column-based sub-matrices, M^a′ (right arm) and M^a″ (left arm). Likewise, we can break M^l (legs) into M^l′ (right leg) and M^l″ (left leg), which yields a 5-dimensional spatial representation (Torso, RA (right arm), LA (left arm), RL (right leg), LL (left leg)). Figure 5 shows the effect of separating the body into different numbers of segments and how well the spatially reduced compact representation keeps the semantic meaning of the motion.
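The full 3D/4D/5D reduction can be sketched as follows. Again this is an illustrative sketch: the 12-column arm halves and 7-column leg halves are hypothetical widths derived from the DOF counts listed earlier, and the right/left column ordering is an assumption.

```python
import numpy as np

def reduce_clip(sub_matrices):
    """Stack each sub-matrix's leading singular coefficient (one per frame)
    into an f x k feature matrix (k = 3, 4, or 5 depending on the split)."""
    cols = []
    for M_i in sub_matrices:
        U, S, Vt = np.linalg.svd(M_i, full_matrices=False)
        cols.append(S[0] * U[:, 0])  # = M_i @ v1, the per-frame coefficient
    return np.column_stack(cols)

def split_5d(torso, arms, legs):
    """5-D split: torso, right/left arm (12 DOFs each), right/left leg (7 each).
    Assumes right-side columns precede left-side columns."""
    return [torso, arms[:, :12], arms[:, 12:], legs[:, :7], legs[:, 7:]]
```

Passing the unsplit [torso, arms, legs] list gives the 3-D representation; splitting only the arms gives 4-D; `split_5d` gives the full 5-D one.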
Fig. 4 SVD spatial feature dimension reduction comparison with a "knee-elbow twisting" motion example (x-axis: time frame, y-axis: SVD values)

Fig. 5 Low-level extraction value comparison in time series