Handbook of Multimedia for Digital Entertainment and Arts- P11
Handbook of Multimedia for Digital Entertainment and Arts- P11: The advances in computer entertainment, multi-player and online games, technology-enabled art, culture and performance have created a new form of entertainment and art, which attracts and absorbs their participants. The fantastic success of this new field has influenced the development of the new digital entertainment industry and related products and services, which has impacted every aspect of our lives.
J. Zhou and L. Xiao

Music Visualization: Tension Visualization Approach

The purpose of music visualization is to take advantage of the strong pattern-recognition abilities of the human visual system to reveal similarities, patterns and correlations among a large collection of music. The typical procedure of music visualization is as follows:

1. Automatically extract content information from music stored in digital audio format.
2. Render the extracted information visually.

Currently, in the first step, low-level features are used to represent the content information of music. For example, the Mel-Frequency Cepstral Coefficients (MFCCs) are frequently employed to describe timbre-related information, and the Fluctuation Pattern is utilized to represent rhythm information [1]. In the second step, the dimensionality of the extracted features is reduced linearly (e.g. PCA-based techniques [2]) or non-linearly (e.g. SOM-based techniques [1]) so that they can be displayed in a 2D (or 3D) space.

One problem of such visualization techniques, which lowers the quality of user interaction, is that users do not know what the x-axis and the y-axis represent. When SOM-based techniques are applied, the mapping is non-linear, so it is not possible to label the axes. When PCA-based techniques are used, although the two dimensions correspond to the first two principal components of the extracted low-level features, users still cannot grasp their meaning, because the components cannot be explained semantically.

To solve this problem, high-level features should be mapped to the axes instead of low-level features. Tension visualization is such a high-level mapping technique: it derives content information that affects the tension of the listener, and then maps it to the x- and y-axes respectively. One of the most important characteristics of music is its powerful emotional effect; humans often react to music emotionally.
A study on the effect of different types of music on emotion [3] indicates that listeners show a significant increase in tension when listening to grunge rock music, which has a quite noisy background and a large tempo value, and a significant decrease in tension when listening to songs with a quiet background and a small tempo value. Therefore, the noisy level and the tempo value can be useful tension descriptors. More importantly, these two descriptors are easily understood by listeners and thus improve the interaction experience when used for visualization.

Noisy Level Calculation

Apparently the noisy level of a song with a heavy background (e.g. rock) should be higher than that of a song with a quiet background. Figure 1 compares the time-frequency distributions of two pieces of music, one of which is extracted from a
13 Content Based Digital Music Management and Retrieval

Fig. 1 The time-frequency distribution comparison of two kinds of songs. The left is from a lyric song and the right is from a rock song

lyric song and the other from a rock song. They differ greatly. The spectra extracted from the time-frequency distributions, shown in Figure 2, clearly illustrate the difference between the two kinds of music. The spectrum of the lyric song has several high and sharp resonance peaks, while the spectrum of the rock song is much flatter. Thus, the noisy level of a piece of music can be determined by the flatness of its spectrum, which can be computed using the Spectrum Flatness Measure (SFM) [4]. The definition of SFM is:
Fig. 2 The spectrum comparison of two different songs. The left is from a lyric song and the right is from a rock song
SFM = \frac{\exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln P(\omega)\, d\omega\right)}{\frac{1}{2\pi}\int_{-\pi}^{\pi} P(\omega)\, d\omega},    (1)

where P(\omega) is the spectrum. It can also be written in a discrete form:

SFM = \frac{\exp\left(\frac{1}{N}\sum_{i} \ln P(\omega_i)\right)}{\frac{1}{N}\sum_{i} P(\omega_i)} = \frac{\left(\prod_{i} P(\omega_i)\right)^{1/N}}{\frac{1}{N}\sum_{i} P(\omega_i)}.    (2)

In the tension visualization system, the noisy level is determined as follows:

1. Segment a song into frames of 100 ms duration.
2. Calculate the SFM of each frame.
3. Use the median of the SFMs of all frames as the noisy level of the song.

Other statistics such as the mean, max and min have also been tested in step 3, but the median outperforms all of them.

Tempo Estimation

The perceived tempo is an apparent and descriptive feature of songs, and it gives the listener a direct impression of the music's emotion; with it, one can quickly search for songs at the expected speed. Numerous approaches have been proposed for automatic tempo estimation. Most of them are optimization-model-based algorithms [5], which search for the best tempo estimate of a song under constraints derived from rhythm patterns, such as periodicity and the strength of onsets. For example, [6] uses dynamic programming to find the set of beat times that optimizes both the onset strength at each beat and the spacing between beats (the tempo). Formulating tempo estimation as a constraint optimization problem requires a clear understanding of the inherent mechanisms of tempo perception. However, little is known about which factors affect the perceived tempo, or how; that is why the most salient tempo estimate is sometimes not the perceived tempo but half or double of it. Instead of using an optimization-model-based algorithm, the tension visualization system employs a two-stage tempo estimation algorithm [7], which incorporates a statistical model to improve the estimation accuracy.
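Before turning to the tempo estimator, the discrete SFM of Eq. (2) and the three-step noisy-level procedure above can be sketched in Python. This is an illustration rather than the authors' code; the FFT-based power spectrum and the small epsilon guarding the logarithm are assumptions:

```python
import numpy as np

def spectral_flatness(frame, eps=1e-10):
    """SFM of one frame (Eq. 2): geometric mean over arithmetic mean
    of the power spectrum; eps guards the logarithm of near-zero bins."""
    power = np.abs(np.fft.rfft(frame)) ** 2 + eps
    return np.exp(np.mean(np.log(power))) / np.mean(power)

def noisy_level(signal, sr, frame_ms=100):
    """Steps 1-3: cut into 100 ms frames, compute each frame's SFM,
    and take the median as the song's noisy level."""
    n = int(sr * frame_ms / 1000)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    return float(np.median([spectral_flatness(f) for f in frames]))
```

By the AM-GM inequality the SFM lies in (0, 1]: white noise (a flat spectrum) scores high, while a pure tone (a single sharp peak) scores near zero, which matches the lyric-song versus rock-song contrast described above.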
The algorithm considers not only rhythm-pattern information but also timbre information, based on the observation that songs with a high perceived tempo are usually quite noisy while those with a low perceived tempo are usually quiet. The correlation between timbre
and perceived tempo is captured by a statistical model via a training procedure. To detect the tempo of a song, a state-of-the-art method is first employed to generate several tempo candidates. Then, the likelihood of each candidate is computed using the statistical model. Finally, the candidate with the highest likelihood is regarded as the tempo of the song. The detailed training and detection procedures are as follows.

To train the statistical model of the correlation between timbre and tempo:

1. For each song in the training set, divide it into several 10-second segments, with neighboring segments overlapping by half their length.
2. Window each 10-second segment into 0.1-second frames. After extracting a 12-dimensional MFCC vector from each frame, compute the mean MFCC vector of each 10-second segment. Then combine the annotated tempo with the mean MFCC vector to form a 13-dimensional tempo-timbre feature.
3. Finally, use the aggregate of the tempo-timbre features of all songs in the training set to train a GMM, which describes the probability distribution of the tempo-timbre feature.

And to detect the tempo:

1. Partition a test song into several 10-second segments with 5-second overlaps.
2. For each segment, use the algorithm described in [6] to get a possible tempo. Then generate additional tempi by multiplying the original tempo by factors such as 0.33, 0.5 and 2, giving five candidate tempi for a piece of music, denoted T_i, i = 1, ..., 5.
3. Compute the 12-dimensional mean MFCC vector of each segment, denoted M.
4. For each 10-second segment, compute the probability of the combination of T_i and M under the pre-trained statistical model. The candidate with the highest probability is taken as the tempo of the segment:

T = \arg\max_{T_i} P(T_i \mid M),

where P(x \mid y) is the likelihood of x given y.

5.
Select the median of the tempi of all segments as the estimated tempo of the test song.

The statistical model employed by this technique significantly improves tempo estimation accuracy compared with the state-of-the-art method [6]. Furthermore, this technique outperforms all of the optimization-based algorithms described in [5].

Music Summarization: Key Segment Extraction Approach

The aim of music summarization is to extract the most representative part of a song. While most existing approaches [8–11] regard the most repeated segment as the summary and extract it based on the self-similarity of a song, Zhang and Zhou [12]
formulate the music summarization task from a quite novel perspective: they treat the so-called key segment (the chorus, most of the time) as the most representative part of a song and model key-segment extraction as a binary classification problem.

Description of Key Segment

People enjoy music almost every day, and among all kinds of music, popular songs may be the most familiar. Generally, a popular song can be roughly divided into four parts: a beginning part, which is usually a pure accompaniment segment; a developing part, a long and slow singing segment; a climactic part, which is usually an ardent singing segment; and an ending part, fading out to end the whole song. Among these, most people like the climactic part best. We all have the experience that a popular song contains a special segment which attracts and moves us strongly; once the song is mentioned, this segment comes to mind first. Zhang et al. [12] call such a segment the key segment of the music. For example, in "My Heart Will Go On", sung by Celine Dion for the famous movie "Titanic", the sentence "you are here, there's nothing I fear, and I know that my heart will go on; we'll stay forever this way, you are safe in my heart, and my heart will go on and on" is usually regarded as the most impressive part. Taking "Right Here Waiting", sung by Richard Marx, as another example, its key segment can be regarded as the sentence "wherever you go, whatever you do, I will be right here waiting for you" or "whatever it takes, or how my heart breaks, I will be right here waiting for you". Generally speaking, the key segment should be a singing segment rather than a purely instrumental one. It always exists in the climactic part of the whole song, and it is highly likely that the title of the song is contained in the lyrics of the key segment.
Obviously, the length of the key segment varies from song to song, but it usually cannot be too long or too short. The length distribution of key segments manually extracted from the training database is shown in Figure 3. We can see that the length of most key segments is between 20 s and 40 s.

Key-Segment Extraction

The key-segment extraction problem is in fact a binary classification problem. Based on the observation that the key segment of a song differs from its non-key segments, [12] assumes that the distributions of all key and non-key segments in some particular feature space can be quite different. Therefore, classical binary classification techniques can be utilized directly for music summarization. The architecture of key-segment extraction is shown in Figure 4; it consists of an off-line training part and an online extraction part. In the training part, the key segments of songs are labeled as positive samples and the other segments are
Fig. 3 The length distribution of key segments

Fig. 4 Flowchart of the key segment extraction algorithm

regarded as negative samples. Then the system trains a binary classifier with carefully designed 17-dimensional features extracted from the samples. Ref. [12] tests several typical classifiers and shows that the back-propagation neural network classifier (BPNNC) outperforms others such as k-nearest neighbor (KNN), minimum Mahalanobis distance (MMD) and the support vector machine (SVM); therefore this system employs the trained BPNNC as its key-segment discriminator. In the online extraction part, things become a little more complicated, since the length and start point of the key segment of an unknown song are not known; the only thing we know is the length distribution shown in Figure 3. Ref. [12] solves this problem by producing a collection of key-segment candidates from the song. With a time resolution of 2 seconds, all segments whose length is between 20 s and 40 s are cut out to form the candidate collection, and the 17-dimensional feature is extracted from each candidate. Supposing the length of the input popular song is L seconds, the number of candidates will be:
S = \sum_{i=0}^{10} \frac{L - 20 - 2i}{2}.    (3)

Using the well-trained BPNNC, we can select from all the candidates the music segment most likely to be the key segment of the song. In this system the BPNNC has two output neurons, corresponding to the classes "key segment" and "non-key segment" respectively; the sum of the two output values is 1. For an input feature vector, if the value of the output neuron corresponding to the class "key segment" is larger than 0.5, the corresponding candidate is labeled "key segment". The larger the value, the higher the reliability, and the candidate with the largest value is finally chosen as the key segment. The last step of the system is to refine the boundaries, to guarantee that the key segment starts and ends in pauses between singing. Since the energy of the human voice is mainly concentrated in the band from 100 Hz to 3500 Hz [13], a band-pass filter is used to strengthen the voice, and the lowest-energy points around each boundary are taken as the refined starting and ending points.

The performance of this system is tested in [12] on a dataset containing 130 songs. Six volunteers were invited to evaluate both the automatically generated and the manually labeled key segment of each song. The average rating score over the whole test set shows that the quality of the automatically extracted key segments is comparable to that of the manually selected ones.

Music Similarity Measure: Chroma-Histogram Approach

Subjective similarity between musical pieces is an elusive concept, but one that must be pursued in support of applications providing automatic organization of large music collections. The aim of a music similarity measure is to automatically compute a similarity between songs that matches the subjective similarity as well as possible.
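The candidate count of Eq. (3) above can be checked numerically. This small sketch is an illustration, not code from [12]; the exact treatment of the last start position on the 2-second grid is an assumption:

```python
def num_candidates(L):
    """Eq. (3): for each candidate length 20 + 2i seconds (i = 0..10),
    (L - 20 - 2i) / 2 start positions are available on the 2-second grid."""
    return sum((L - 20 - 2 * i) // 2 for i in range(11))

def enumerate_candidates(L):
    """Explicitly list (start, length) pairs on the 2-second grid,
    consistent with the count above."""
    return [(s, n) for n in range(20, 41, 2) for s in range(0, L - n, 2)]
```

For a 60-second song this gives 165 candidates, each of which would then be scored by the trained BPNNC as described above.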
Work on music similarity mainly focuses on three areas: symbolic representations [14], subjective information [15] and acoustic properties [16]. Symbolic representations of music, such as MIDI data or scores, are hard to obtain without human effort, so such techniques are not well suited to automatic systems. Subjective information, i.e. human opinions about songs, is beyond the scope of "content based" methods. Therefore, in this chapter we focus only on the acoustic-property-based approach, which analyzes the music content directly. All acoustic-property-based techniques share these components:

1. Feature extraction: extract low-level features from the audio content.
2. Model construction: build a model of the song from the extracted features.
3. Distance measure: design a proper distance to measure the similarity between the models representing different songs.
Among the three components, the first determines the musical factor on which a particular technique measures similarity. For example, Ref. [17] measures timbre similarity by extracting Mel-Frequency Cepstral Coefficients (MFCCs), and Ref. [18] measures rhythm similarity by extracting features that emphasize percussion sounds.

The chroma-histogram approach [19] is a technique that attempts to measure the melody/chord similarity between songs. Generally speaking, the melody/chord pattern is the primary factor when people compare different pieces of music. One typical case is that different interpretations of the same piece of music, such as the two versions of "When I Fall in Love" sung by Celine Dion and Julio Iglesias respectively, are regarded as very similar. Melody and chord are also efficient features for distinguishing songs of different genres such as blues, jazz and folk. Furthermore, the melody/chord pattern contains rich information related to human feelings, so it can be used to measure high-level music similarity such as emotion. The goals of the chroma-histogram approach are to design a model able to capture the melody/chord pattern of music, and to find a matching method that simulates human perception. To implement such a system, one has to account for two fundamental issues. First, the melody/chord pattern is very robust to variations of parameters such as timbre and tempo. Second, human perception of a melody/chord pattern is invariant to key transposition; for example, people usually perceive little difference between a song in C and its counterpart in G. The three components of the chroma-histogram approach are described in the remainder of this section.

Feature Extraction

Since this approach aims to measure melody/chord similarity, the chroma feature [20] is chosen to be extracted from the audio music file.
Chroma is a short-term acoustic feature that records the energy distribution over the 12 pitch classes of a short-duration music signal. The first step of feature extraction is to segment a song into a series of short frames (usually around 100 ms). Then the 12-element chroma vector is calculated to capture the dominant note as well as the broad harmonic accompaniment. Thus, a song is represented by a sequence of chroma vectors. Figure 5(a) shows a sample chroma sequence extracted from a piece of Robbie Williams' song "Better Man". The chroma-histogram approach [19] uses the chroma calculation algorithm provided by Dan Ellis [21], whose major advantage is that it uses the instantaneous frequency and thus obtains a higher-resolution estimate of the frequency.
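A much simpler chroma computation than the instantaneous-frequency algorithm of [21] maps each FFT bin of a frame onto one of the 12 pitch classes. The sketch below is an illustration only; the 55-2000 Hz range and the A-440-based pitch-class indexing are assumptions:

```python
import numpy as np

def chroma_frame(frame, sr):
    """Fold the power spectrum of one frame onto 12 pitch classes
    (class 0 = A, assuming A4 = 440 Hz); normalized to sum to 1."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    chroma = np.zeros(12)
    for f, e in zip(freqs[1:], power[1:]):          # skip the DC bin
        if 55.0 <= f <= 2000.0:                     # assumed pitch range
            pitch_class = int(round(12 * np.log2(f / 440.0))) % 12
            chroma[pitch_class] += e
    return chroma / (chroma.sum() + 1e-12)
```

Applied frame by frame, this yields the chroma vector sequence that the model construction step below summarizes; Ellis's algorithm [21] refines the same idea with instantaneous-frequency estimates.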
Fig. 5 (a) A sample of a chroma sequence and (b) its corresponding chroma histogram

Model Construction

Denote the chroma vector sequence as CM = {c(1), c(2), ..., c(l)}, where l is the total number of frames and c(i) is a chroma feature. A CM uniquely characterizes a piece of music; however, it is too specific to individual songs and cannot represent the common properties of a group of music. Therefore, the chroma histogram is designed as a model that captures the common pattern of similar music. The model is built as follows:

1. Normalize the elements of CM to [0, 1] and quantize them with a loudness resolution of 50; for example, CM(i, j) is assigned the value 20 if 0.38 < CM(i, j) < 0.42. Denote the quantized chroma feature matrix CM_q.
2. Partition CM_q into 2^N overlapping sub-sequences of chroma features.
3. For each sub-sequence, use a chroma histogram to summarize the melody and chord pattern. A chroma histogram has 12 columns for the 12 chroma bands and 50 rows for the loudness resolution; it counts how many times a specific loudness in a specific chroma band was reached or exceeded. The sum of the histogram is normalized to 1. Figure 5(b) shows the chroma histogram derived from the chroma matrix of Figure 5(a).

In the end, a piece of music is represented by 2^N chroma histograms. Step 2 deserves further discussion. Compared with computing only one chroma histogram per song, using a series of chroma histograms is more reasonable. First of all, a series of chroma histograms preserves both the structure information and the sequence information of CM_q. Second, it is often the case that a song is performed in more than one key; Step 2 guarantees that every sub-sequence is in only one key (e.g. C major) and facilitates the transposition-invariant matching in the next section. Choosing the parameter N is crucial.
If N is too small, the aforementioned advantages are not apparent; if N is too large, the feature fails to describe the common characteristics shared by a group of songs. Ref. [19] reports that N = 3 is a good choice.
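The three model-construction steps above can be sketched as follows. This is an illustrative reading, not the code of [19]; the exact overlap scheme (50%) and the global min-max normalization are assumptions:

```python
import numpy as np

def chroma_histograms(CM, n_levels=50, n_sub=8):
    """Build 2^N (= n_sub, here N = 3) overlapping chroma histograms
    from a chroma matrix CM of shape (frames, 12)."""
    # Step 1: normalize to [0, 1] and quantize with loudness resolution 50.
    CM = (CM - CM.min()) / (CM.max() - CM.min() + 1e-12)
    CMq = np.floor(CM * n_levels).clip(0, n_levels - 1).astype(int)
    # Step 2: partition into n_sub sub-sequences with 50% overlap (assumed).
    hop = len(CMq) // n_sub
    hists = []
    for s in range(n_sub):
        sub = CMq[s * hop : s * hop + 2 * hop]
        # Step 3: count, per band, how often each loudness level is
        # reached or exceeded; normalize the histogram sum to 1.
        h = np.zeros((n_levels, 12))
        for band in range(12):
            for level in range(n_levels):
                h[level, band] = np.sum(sub[:, band] >= level)
        hists.append(h / h.sum())
    return hists
```

The result, a list of 2^N = 8 normalized 50x12 histograms per song, is exactly the representation compared in the distance-measure step that follows.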
Distance Measure

As described above, we obtain a series of chroma histograms for every song, denoted S = {CH^1, CH^2, ..., CH^8}. We then compare the chroma histograms of two different songs. Since similar songs usually have similar structure (e.g. ABAB), to take advantage of the structure information we sequentially compute the distance between corresponding chroma histograms:

Dist(S_1, S_2) = \sum_{i=1}^{8} D(CH_1^i, CH_2^i),    (4)

where S_1 = {CH_1^i}, S_2 = {CH_2^i}, and D(·) is the transposition-invariant distance introduced below. In general, human perception of a piece of music is invariant to its key: people usually perceive little difference between a song in C and its counterpart in G. If we cyclically shift the C version rightwards by 7 semitones, the chroma histograms of the two versions will be exactly the same. Thus, [19] derives a transposition-invariant distance measure using this property of the chroma histogram. For a chroma histogram CH = {v(1), v(2), ..., v(12)}, where v(i) corresponds to the ith column of CH, the transposition function is defined as:

f^1(v(1), v(2), ..., v(12)) = {v(12), v(1), ..., v(11)}.    (5)

Accordingly, the i-transposed version of CH is f^i(CH), and f^{12}(CH) = CH. Finally, the transposition-invariant distance between two chroma histograms is defined as:

D(CH_1, CH_2) = \min_{i \in [0,11]} d(CH_1, f^i(CH_2)),    (6)

where d(x, y) is the Euclidean distance between x and y.

The performance of the chroma-histogram approach is evaluated through two preliminary objective experiments [19]. It is first used to identify cover songs, i.e. alternate interpretations of the same underlying piece of music. Cover songs typically keep the essence of the melody and the chords, but may vary greatly in other aspects such as timbre, tempo and key. It is then tested on a genre classification task.
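The distance of Eqs. (4)-(6) can be sketched in a few lines; a cyclic column shift plays the role of the transposition function f, and this is an illustrative sketch rather than the implementation of [19]:

```python
import numpy as np

def transposition_invariant_dist(ch1, ch2):
    """Eq. (6): minimum Euclidean distance over the 12 cyclic shifts
    (Eq. (5)) of ch2's chroma columns; ch1, ch2 have shape (levels, 12)."""
    return min(np.linalg.norm(ch1 - np.roll(ch2, i, axis=1))
               for i in range(12))

def song_distance(S1, S2):
    """Eq. (4): sum the transposition-invariant distances over the
    corresponding histograms of two songs."""
    return sum(transposition_invariant_dist(a, b) for a, b in zip(S1, S2))
```

As the text notes, a histogram and its 7-semitone transposition (a song in C versus its counterpart in G) are at distance zero under this measure.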
In both experiments the chroma-histogram approach outperforms other timbre- and rhythm-related similarity measures.

A Realized Music Archive Management System

We built a music archive management system using the techniques described above. Figure 6 illustrates the User Interface (UI), which consists of two panels: the
Fig. 6 The UI of the music archive management system

upper one (the display panel) is used for music-collection visualization, and the lower one (the function panel) is used to play, retrieve and comment on individual songs. In this system, three music attributes are currently available for visualization: Coarseness (the noisy level), Tempo and Duration. Coarseness and Tempo are selected in Figure 6 and mapped to the x-axis and y-axis respectively. The green circles on the display panel represent songs. With such a visualization system, surfing a large-scale dataset is convenient: a user who wants quiet and slow songs, such as lyric songs, can choose songs in the top-left corner, while one who wants noisy and fast songs should select songs in the bottom-right corner. Figure 7 shows how to construct a playlist with little human effort. A zoom-in/out function is provided, so one can check the detailed information of the songs in a small region; a snapshot of the display panel after zooming in is shown in Figure 8. When a green circle is clicked, the structure of the corresponding song is shown in the function panel. As shown in Figure 9, the selected circle becomes red and the summary of the song shows up as the orange block; the user can then skim the song by listening only to its orange region. The blue and yellow colors distinguish the vocal and non-vocal parts of the song respectively. Several techniques could be used for vocal-part detection; here we use the algorithm of Ref. [22]. With this technique the user can locate the singing part, which is more informative than non-vocal parts such as the prelude and interlude.

The similarity measure approach described above is the core component of the search function. When the user encounters a wonderful song, he/she can immediately get all songs similar to it by typing the title into the "Search"
Fig. 7 An illustration of building a playlist of quiet and slow songs. The title and artist information, in Chinese, is listed in the right panel

Fig. 8 The snapshot of the display panel after zooming in. The title and artist information, in Chinese, shows up

blank. Without such a function, the only way to obtain those songs would be to listen to every song in the dataset.

Conclusion and Future Directions

This chapter describes content-based music management and retrieval techniques from the research areas of music visualization, music summarization and
Fig. 9 The structure of a selected song displayed on the function panel

music similarity measurement. The realized music archive management system demonstrates the effectiveness of these content-based techniques in helping users surf large-scale music datasets. There are several directions in which the performance of such a system could be improved; in our opinion, the following would make the system much smarter.

The first is similarity fusion. Many approaches have been proposed to measure similarities corresponding to different attributes of music, such as timbre, rhythm and melody. However, none of them precisely describes the subjective similarity perceived by humans, since music contains so many kinds of information. With a similarity fusion technique, different similarity measures could be combined to approximate the subjective similarity.

The second is relevance-feedback retrieval. Starting with a query (generally one or more songs labeled "prefer" or "not prefer"), such a technique iteratively asks the user to give feedback on automatically selected "most informative" songs. After a small number of rounds, it generates results much better than those obtained using the starting query alone. Mandel et al. [23] apply an active SVM classifier to relevance-feedback retrieval and obtain quite promising results.

The last is personalized music recommendation. Different users have different tastes. By recording how frequently songs are listened to, the system could learn each user's preferences and automatically recommend songs based on them.
References

1. E. Pampalk, A. Rauber, and D. Merkl, "Content-Based Organization and Visualization of Music Archives," Proceedings of ACM Multimedia, pp. 570-579, 2002
2. G. Tzanetakis and P. Cook, "3D Graphics Tools for Sound Collections," 3rd International Conference on Digital Audio Effects, pp. 1-4, 2000
3. R. McCraty, B. Barrios-Choplin, M. Atkinson, and D. Tomasino, "The Effects of Different Types of Music on Mood, Tension, and Mental Clarity," Alternative Therapies in Health and Medicine, 4(1): 75-84, 1998
4. N. Jayant and P. Noll, Digital Coding of Waveforms, Prentice-Hall, Englewood Cliffs, NJ, 1984
5. F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle and P. Cano, "An experimental comparison of audio tempo induction algorithms," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 5, pp. 1832-1844, 2006
6. D.P.W. Ellis, "Beat Tracking with Dynamic Programming," MIREX'06, 2006
7. L. Xiao, A. Tian, W. Li and J. Zhou, "Using A Statistic Model to Capture the Association Between Timbre and Perceived Tempo," Proceedings of the International Conference on Music Information Retrieval, pp. 659-662, 2008
8. B. Logan and S. Chu, "Music summarization using key phrases," Proceedings of the IEEE International Conference on Signal and Speech Processing, vol. 2, pp. 749-752, 2000
9. C. S. Xu, Y. W. Zhu and Q. Tian, "Automatic Music Summarization Based on Temporal, Spectral and Cepstral Features," Proceedings of the IEEE International Conference on Multimedia and Expo, vol. 1, pp. 117-120, 2002
10. M. Cooper and J. Foote, "Automatic Music Summarization via Similarity Analysis," Proceedings of the International Symposium on Music Information Retrieval (ISMIR), 2002
11. C. Wei and V. Barry, "Music Thumbnailing via Structural Analysis," Proceedings of the ACM International Conference on Multimedia, pp. 223-226, 2003
12. Y. Zhang, J. Zhou and Z.
Bian, "Sample-Based Automatic Key Segment Extraction for Popular Songs," Proceedings of the International Conference on Machine Learning and Cybernetics, pp. 4891-4897, 2005
13. F. G. Owens, Signal Processing of Speech, Macmillan: London, pp. 3-7, 1993
14. A. Ghias, J. Logan, D. Chamberlin and B. Smith, "Query by Humming: Music Information Retrieval in An Audio Database," Proceedings of ACM Multimedia, pp. 231-236, 1995
15. D.P. Ellis, B. Whitman, A. Berenzweig and S. Lawrence, "The Quest for Ground Truth in Musical Artist Similarity," Proceedings of the International Conference on Music Information Retrieval, pp. 170-177, 2002
16. E. Pampalk, S. Dixon, and G. Widmer, "On the evaluation of perceptual similarity measures for music," Proceedings of the 6th Conference on Digital Audio Effects, pp. 7-13, 2003
17. B. Logan and A. Salomon, "A music similarity function based on signal analysis," Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 745-748, 2001
18. E.D. Scheirer, "Tempo and beat analysis of acoustic musical signals," Journal of the Acoustical Society of America, vol. 103, no. 1, pp. 588-601, 1998
19. L. Xiao and J. Zhou, "Using Chroma Histogram to Measure the Perceived Similarity of Music," Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 1317-1320, 2008
20. M.A. Bartsch and G.H. Wakefield, "Audio thumbnailing of popular music using chroma-based representations," IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 96-104, 2005
21. D.P. Ellis, www.ee.columbia.edu/dpwe/resources/matlab/chromaansyn
22. L. Xiao, J. Zhou and T. Zhang, "Using DTW Based Unsupervised Segment to Improve the Vocal Part Detection in Pop Music," Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 1193-1196, 2008
23. M. Mandel, G.E. Poliner and D.P. Ellis, "Support Vector Machine Active Learning for Music Retrieval," Multimedia Systems, vol. 12, no. 1, pp. 3-13, 2006
Chapter 14
Incentive Mechanisms for Mobile Music Distribution

Marco Furini and Manuela Montangero

Introduction

Music anywhere and anytime is the desire of most people who legally download digital music from the Internet. The number of such people is growing at an exceptional speed, and today, according to the International Federation of the Phonographic Industry [17], it generates 15% of all music revenues. As an example, the number of legal downloads grew from 156 million in 2004, to 420 million in 2005, to 795 million in 2006. This growth is expected to accelerate further with the involvement of the mobile scenario. The mobile digital world is seen as an important business opportunity for two main reasons: the widespread use of cellphones (more than two billion [30], most of them with sound features) and the pervasiveness of mobile technologies. As a result, the music industry and telecoms are bringing the successful Internet-based music market strategy into the mobile scenario: record labels are setting up agreements with cellphone network providers (Sprint, Verizon, Vodafone and Orange, to name a few) to offer a music download service in the mobile scenario as well. The strategy is to use wireless channels to distribute music content in an attempt to replicate the success of the Internet-based download scenario. Although mobile music distribution is expected to play an important role in the future music business, the employed distribution strategy might be compromised by the differences between the current mobile and Internet-based scenarios. In fact, according to [11, 12, 23], the success of any distribution strategy depends on three main factors: the characteristics of the communication infrastructure, the pricing strategy applied to the distribution model, and the copyright protection method used to secure the distributed content.

M.
Furini ( ) Faculty of Communication and Economic Sciences, University of Modena and Reggio Emilia, Italy e-mail: marco.furini@unimore.it M. Montangero Dipartimento di Ingegneria dell’Informazione, University of Modena and Reggio Emilia, Italy e-mail: manuela.montangero@unimore.it B. Furht (ed.), Handbook of Multimedia for Digital Entertainment and Arts, 307 DOI 10.1007/978-0-387-89024-1 14, c Springer Science+Business Media, LLC 2009
The first contribution of this paper is an analysis of the current mobile music scenario. The analysis shows that the communication infrastructure is inadequate for downloading music files (the available data transfer rate is still far below that of the wired scenario, making download times too long); the pricing strategy can be questioned (the use of the expensive cellphone data network causes users to pay much more to download a song in the mobile environment than in the wired scenario); and the copyright protection is a burden (it usually prevents customers from listening to legally acquired music on different mobile devices). The second contribution of this paper is to propose and analyze a multi-channel distribution strategy that ameliorates the problems of the current mobile music scenario. The idea is to involve customers in music distribution, so that distribution can use both the traditional cellphone network and the free-of-charge communication technologies (e.g., Bluetooth and Wi-Fi) provided in recent cellphones. To be successful, distribution is based on a license security mechanism and on an effective incentive mechanism that financially compensates customers who participate in the music distribution. In essence, customers can acquire music from the music store over the cellphone network, but can also receive song files from other customers through the faster free-of-charge communication technology. Song files are locked and can be unlocked only through license files. To prevent the sharing of licenses, a license file is bound to the customer’s mobile device and is released only by the music store. Hence, customers can buy license files only through the cellphone data network. To stimulate cooperation among customers, we design an incentive mechanism and three different reward policies, with the goal of financially compensating the cooperating users.
To evaluate the multi-channel distribution strategy, we investigate its effects on the entities involved in mobile music distribution (i.e., customers, the mobile music store and the cellphone providers). Results show that the use of an incentive mechanism can ameliorate the problems of the current distribution strategy and produces benefits for all the entities involved in the mobile music market. In the following, we first analyze the current mobile music market and then outline a possible multi-channel distribution strategy along with an incentive mechanism that stimulates cooperation in a mobile environment. We proceed by presenting the evaluation of the multi-channel distribution strategy and by overviewing related work in the area of content distribution.

The Current Mobile Music Market

Thanks to the pervasiveness of mobile technologies, music can be sold anywhere and anytime. To this end, record labels and telecom operators are opening mobile music stores, with the goal of replicating the success of the Internet-based music distribution strategy in the mobile scenario as well.
In this section, we analyze the differences between the two scenarios, focusing on communication infrastructure, pricing strategy and copyright protection. These aspects are in fact, according to [11, 12, 23], the key factors of any digital distribution strategy.

Communication Infrastructure

The fastest available data rate in the mobile scenario is provided by 3G networks, which offer three different transfer data rates: vehicular (144 Kbps), pedestrian (384 Kbps) and fixed (2 Mbps). Unfortunately, the real situation is very different [22]: actual tests show that the download speed ranges from 264 Kbps to 360 Kbps (at 264 Kbps, a 4 MB song takes about 120 seconds to download). Moreover, 3G networks are available only in some areas (usually big cities); where they are not available, the transfer data rate is that of EDGE or GPRS. If the service scales down to EDGE (whose nominal 384 Kbps maximum usually yields an 80 Kbps actual connection), a 4 MB song takes about 390 seconds, more than 6 minutes, to download; if the service is GPRS (whose nominal 170 Kbps maximum usually yields a 50 Kbps actual connection), around 15 minutes are necessary to download a 4 MB song. Needless to say, this may be a shock for users accustomed to residential xDSL broadband connections, where several Mbps are commonly available.

Pricing Strategy

The pricing strategy plays a fundamental role in the success of a distribution channel, as the song price should be sufficiently attractive to entice customers to buy music from the new distribution channel. In the Internet scenario, the most successful business model is the a-la-carte model, which accounts for 86% of online sales. In this model, each song is priced around one dollar and users choose and pay song by song. Thanks to low prices, many users left the illegal P2P world in favor of legal music stores.
In the mobile scenario, although the same a-la-carte model is used, songs are priced much higher, ranging from 1.5 to 2.5 dollars. The reason for this high price is mainly the use of the expensive cellphone data network. In fact, although some flat-rate data plans are becoming available, data services to mobile phones are often metered, with users paying by the megabyte. Note that, for music downloads, mobile providers charge a flat fee (from 1.5 to 2.5 dollars) regardless of the number of bits that compose the song (otherwise the song price would be even higher). Needless to say, it is likely that most customers will simply wait until they are home to download the song.
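The infrastructure and pricing figures above follow from simple arithmetic. The sketch below is illustrative only: the rates and prices are the ones quoted in the text, and the chapter’s rounded download times differ slightly from the raw computation (real transfers also include protocol overhead).

```python
def download_time_s(size_mb, rate_kbps):
    """Seconds to transfer size_mb megabytes at rate_kbps kilobits per second."""
    return size_mb * 8 * 1024 / rate_kbps  # 1 MB = 8 * 1024 kilobits

# Actual (measured) rates quoted in the text, for a 4 MB song.
for network, kbps in [("3G", 264), ("EDGE", 80), ("GPRS", 50)]:
    t = download_time_s(4, kbps)
    print(f"{network:4s} {kbps:3d} Kbps -> {t:5.0f} s (~{t / 60:.1f} min)")

# Mobile price premium: a-la-carte song at ~$1 on the wired Internet
# versus $1.50 to $2.50 over the cellphone data network.
print("mobile premium:", 1.5 / 1.0, "to", 2.5 / 1.0, "times the wired price")
```

The computation yields roughly 2, 7 and 11 minutes for 3G, EDGE and GPRS respectively, of the same order as the chapter’s quoted 120 s, 390 s and ~15 min.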
Copyright Protection

A Digital Rights Management (DRM) scheme allows content providers (e.g., record labels or artists) to protect their data by wrapping a media file with a control mechanism that specifies user rights (e.g., a collection of permissions and constraints) and discloses the material only to authorized users. Although employed extensively to fight piracy, DRM schemes are criticized for intruding on user privacy, as they impose restrictions on what customers can do (e.g., they may prevent using the content over several devices and/or deny fair use, which is the possibility for the user to legally make a number of personal copies without violating copyright) [5]. In the Internet scenario, the proliferation of proprietary DRM systems is making things difficult for customers, since a legally acquired song cannot be played on all of the user’s devices. Furthermore, in addition to privacy concerns, DRM schemes are also questioned for security issues, since they are employed in untrusted environments (the playback device is under the user’s control). In the mobile scenario, the security level can be increased, as cellphones provide a more secure environment by offering reliable information about both the device and the user (e.g., the IMEI and the IMSI codes) [20]. Unfortunately, the increase in security affects user privacy: most DRM schemes currently in use allow a song to be played only on the device used to acquire and download it. To mitigate this burden, some mobile providers offer two versions of the same song: one designed to be played on the cellphone, and another to be used on a computer, allowing customers to burn their music to CDs. Needless to say, this lack of flexibility limits the attractiveness of the mobile music market.
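As a concrete illustration of device-bound licensing, the sketch below derives a license token from the song identifier and the device’s IMEI, so that copying the license file to another handset makes it useless. This is a hypothetical, MAC-style toy, not the chapter’s actual scheme: a deployed DRM system would use proper key management and tamper-resistant storage rather than a single shared secret, and all identifiers here are made up.

```python
import hashlib
import hmac

STORE_SECRET = b"hypothetical-store-key"  # held by the music store / trusted player

def issue_license(song_id: str, imei: str) -> str:
    """Store side: bind a license token to one song on one device."""
    msg = f"{song_id}:{imei}".encode()
    return hmac.new(STORE_SECRET, msg, hashlib.sha256).hexdigest()

def can_play(song_id: str, imei: str, token: str) -> bool:
    """Player side: the song unlocks only if the token matches this device."""
    return hmac.compare_digest(issue_license(song_id, imei), token)

alice_token = issue_license("song-42", "356938035643809")   # Alice's device
print(can_play("song-42", "356938035643809", alice_token))  # True on her phone
print(can_play("song-42", "490154203237518", alice_token))  # False on any other
```

Because the token depends on the IMEI, sharing the locked song file is harmless for the store, while sharing the license file achieves nothing.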
A Multi-Channel Distribution Approach

In this section, we present a multi-channel distribution strategy coupled with an effective incentive mechanism, in order to mitigate the problems of the current mobile music scenario. Our goal is to show that an additional opportunity is available for the mobile music market; hence, instead of focusing on implementation details, we outline a possible distribution strategy and focus on its effects on the music market. We start by presenting a possible mobile scenario, depicted in Figure 1. Customers access a wireless music store through the cellphone network and share music files using the cellphone’s free-of-charge communication technologies (e.g., Bluetooth and Wi-Fi). Although the price is high and the download time considerable, Alice acquires a song from the mobile music store and downloads both the song and the license through the cellphone network. She decides to share the song with Bob and Marc, so she sends it to them over Wi-Fi. Marc is curious about the song and thinks about buying the license. Since the license file is much smaller than the song file, its price is reasonable (more similar to the price that Marc would pay
at home to download the same song), and hence he acquires the song license through the cellphone network. After Marc acquires the license, Alice receives financial compensation for her cooperation in distributing the song. Bob, on the other hand, decides to ignore the message and does nothing.

Fig. 1 Multi-channel distribution strategy: customers are authorized to re-distribute music content

The scenario just described shows that music distribution among customers can decrease both the usage of the expensive cellphone data network and the time necessary to download a song in the wireless environment. In addition, customers get commissions back for their cooperation. To be effective, multi-channel distribution needs customer cooperation; hence, in the following we also propose an incentive mechanism to stimulate such cooperation.

Multi-Channel Mobile Distribution

The main goals of multi-channel distribution are to reduce the cost and the download time of a song. To meet these two goals without penalizing the music store and the cellphone provider, we propose the following multi-channel distribution. A customer can browse, acquire and download songs directly from the mobile music store using the cellphone network. Once the song has been downloaded, he/she can share it with other customers using the cellphone’s free-of-charge communication technologies. Since these technologies offer a higher data transfer rate than the cellphone network, the song download time is greatly reduced (e.g., a few seconds to download a 4 MB song with a Wi-Fi enabled cellphone such as the Nokia N91). Customers who receive the song can play it only upon acquiring a license, released by the music store through the cellphone network.
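A minimal sketch of the store-side bookkeeping for the scenario above: Alice forwards a locked song, Marc’s license purchase triggers her reward, and Bob costs nothing because he never buys a license. All names and the flat 10% commission are illustrative assumptions; the chapter’s three reward policies are not specified in this excerpt and are not reproduced here.

```python
class MobileMusicStore:
    """Toy model: license sales over the cellphone network, with a
    commission credited to the customer who forwarded the locked song."""

    def __init__(self, license_price=1.0, commission_rate=0.10):
        self.license_price = license_price
        self.commission_rate = commission_rate
        self.balances = {}  # customer -> accumulated rewards

    def sell_license(self, buyer, song_id, forwarder=None):
        """Buyer pays for the license; the forwarder (if any) earns a cut."""
        if forwarder is not None:
            reward = self.license_price * self.commission_rate
            self.balances[forwarder] = self.balances.get(forwarder, 0.0) + reward
        return (song_id, buyer)  # stands in for a device-bound license file

store = MobileMusicStore()
store.sell_license("Marc", "song-42", forwarder="Alice")  # Marc buys, Alice is rewarded
# Bob received the song too but never bought a license: no charge, no reward.
print(store.balances)
```

The design keeps all payments on the store side: peers move only locked song files over Bluetooth/Wi-Fi, so rewarding a forwarder never requires trusting the peer exchange itself.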