Plenary overview session I – Speech, Language, and Audio (SLA)
Session chair: Sadaoki Furui
Title: Environmental sound recognition: A survey
Speaker: C.-C. Jay Kuo
Affiliation: University of Southern California
Environmental sound recognition (ESR) has numerous applications, such as audio tagging, audio retrieval, hearing aids, and robotic audition. By environmental sounds, we refer to various quotidian sounds, both natural and artificial, i.e., the sounds one encounters in daily life other than speech and music. ESR-related publications, though sparse compared to those for speech and music, have risen significantly in the past decade. In its infancy, ESR algorithms were a mere reflection of the speech and music recognition paradigms. However, on account of the considerably non-stationary characteristics of environmental sounds, these algorithms proved ineffectual even for small-scale databases. Recent publications, hence, have focused on appraisal of the non-stationary aspects of environmental sounds. Consequently, several new features have been proposed that are designed to capture short-term variations in non-stationary signals. These features, in essence, strive to maximize their information content pertaining to a signal's temporal and spectral characteristics, as bounded by the uncertainty principle. For most real-life sounds, even these features exhibit non-stationarity when observed over a long period of time. In order to capture these long-term variations, several sequential learning methods have also been developed. Despite increased interest in the field, there is no single consolidated database for ESR, which often hinders benchmarking of this new breed of features and classifiers. Hence, in this paper, we will attempt to present a comparative and elucidatory overview of recent developments in ESR.
Speaker Photo and Bio
Dr. C.-C. Jay Kuo received the B.S. degree from the National Taiwan University, Taipei, in 1980 and the M.S. and Ph.D. degrees from the Massachusetts Institute of Technology, Cambridge, in 1985 and 1987, respectively, all in Electrical Engineering. From October 1987 to December 1988, he was Computational and Applied Mathematics Research Assistant Professor in the Department of Mathematics at the University of California, Los Angeles. Since January 1989, he has been with the University of Southern California (USC). He is presently Director of the Multimedia Communication Lab and Professor of Electrical Engineering and Computer Science at USC.
His research interests are in the areas of multimedia data compression, communication and networking, multimedia content analysis and modeling, and information forensics and security. Dr. Kuo has guided 115 students to their Ph.D. degrees and supervised 23 postdoctoral research fellows. Currently, his research group at USC has around 30 Ph.D. students (see http://viola.usc.edu), making it one of the largest academic research groups in multimedia technologies. He is co-author of about 200 journal papers, 850 conference papers, and 10 books, and has delivered over 550 invited lectures at conferences, research institutes, universities, and companies. Dr. Kuo is a Fellow of AAAS, IEEE, and SPIE. He is currently Editor-in-Chief of the IEEE Transactions on Information Forensics and Security.
Title: Voice Conversion and Spoofing Attack of Speaker Verification System
Speaker: Haizhou Li
Affiliation: Institute for Infocomm Research, Singapore
A speaker verification system is supposed to automatically accept or reject the claimed identity of a speaker. In the past few years, with improved variability compensation techniques such as joint factor analysis and i-vector Probabilistic Linear Discriminant Analysis (PLDA), we have advanced speaker verification technology considerably and deployed it in mass-market products, such as smartphones, and in online commerce for user authentication. One of the main concerns when deploying speaker verification technology is whether a system is robust against spoofing attacks. Speaker verification studies have given us better insight into speaker characterization, which in turn has contributed to the progress of voice conversion technology. Unfortunately, voice conversion has become one of the most easily accessible techniques for carrying out spoofing attacks and presents a threat to speaker verification systems. In this talk, we will briefly introduce studies of spoofing attacks simulated by different kinds of techniques, with a focus on voice conversion spoofing. We will also discuss anti-spoofing measures for speaker verification.
Speaker Photo and Bio
Dr. Haizhou Li is currently a Principal Scientist and Department Head of the Human Language Technology department at the Institute for Infocomm Research (I2R), Singapore. He is also a Conjoint Professor at the University of New South Wales, Australia. Dr. Li has worked on speech and language technology in academia and industry since 1988. Prior to joining I2R, he was a Professor at South China University of Technology, a Visiting Professor at CRIN/INRIA, France, Research Manager at the Apple-ISS Research Centre, Research Director of Lernout & Hauspie Asia Pacific, and Vice President of InfoTalk Corp. Ltd.
Dr. Li’s research interests include automatic speech recognition, natural language processing, and information retrieval. He has served as an Associate Editor of the IEEE Transactions on Audio, Speech and Language Processing, ACM Transactions on Speech and Language Processing, and Computer Speech and Language. He is an elected Board Member of the International Speech Communication Association (2009-2013). He was appointed General Chair of the 50th Annual Meeting of the ACL in 2012 and of INTERSPEECH 2014. He was the recipient of the National Infocomm Award of Singapore in 2002, and was named one of the two Nokia Professors of 2009 by the Nokia Foundation in recognition of his contribution to speaker and language recognition technologies.
Title: Multi-modal Conversation Analysis
Speaker: Tatsuya Kawahara
Affiliation: Kyoto University, Japan
Speech communication is vital in exchanging knowledge and ideas, and it is essentially multi-modal. Conversations around poster presentations, which are the norm at many academic events, pose interesting and challenging problems in multi-modal signal and information processing. This talk gives an overview of our project on multi-modal sensing, analysis, and “understanding” of poster conversations. We focus on the audience’s feedback behaviors, such as non-lexical reactive tokens and eye-gaze events. We investigate whether we can predict when and who will ask what kind of questions, as well as the interest level of the audience. Based on these analyses, we design a smart posterboard that can sense human behaviors via cameras and a microphone array and annotate key interaction events during poster conversations.
Speaker Photo and Bio
Dr. Tatsuya Kawahara received the B.E., M.E., and Ph.D. degrees in information science from Kyoto University, Kyoto, Japan, in 1987, 1989, and 1995, respectively. Currently, he is a Professor in the Academic Center for Computing and Media Studies and an Affiliated Professor in the School of Informatics, Kyoto University. He has also been an Invited Researcher at ATR and NICT. He has published more than 250 technical papers on speech recognition, spoken language processing, and spoken dialogue systems. He has been conducting several speech-related projects in Japan, including free large-vocabulary continuous speech recognition software (http://julius.sourceforge.jp/) and the automatic transcription system for the Japanese Parliament (Diet).
Dr. Kawahara received the Commendation for Science and Technology from the Minister of Education, Culture, Sports, Science and Technology (MEXT) in 2012. From 2003 to 2006, he was a member of the IEEE SPS Speech Technical Committee. He was General Chair of the IEEE Automatic Speech Recognition & Understanding Workshop (ASRU 2007), and also served as Tutorial Chair of INTERSPEECH 2010 and Local Arrangement Chair of IEEE ICASSP 2012. He is an editorial board member of the Elsevier journal Computer Speech and Language, ACM Transactions on Speech and Language Processing, and APSIPA Transactions on Signal and Information Processing. He is a senior member of the IEEE.
Title: Emotion Recognition from Multi-Modal Information
Speaker: Chung-Hsien Wu
Affiliation: National Cheng Kung University, Taiwan
Intact perception and experience of emotion are vital for communication in the social environment. Emotion recognition is the ability to identify what you are feeling from moment to moment and to understand the connection between your feelings and your verbal and non-verbal expressions. When you are aware of your emotions, you can think clearly and creatively; manage stress and challenges; communicate well with others; and display trust, empathy, and confidence. Technologies for processing daily activities, including facial expression, speech, and language, have expanded the interaction modalities between humans and computer-supported communication artifacts, such as robots, iPads, and mobile phones. In this talk, I will present theoretical and practical work offering new and broad views of the latest research in emotion recognition from multi-modal information, including facial expression, speech, and language. The talk will span a variety of theoretical backgrounds and applications, ranging from salient emotional features and emotional-cognitive models to emotional information processing across these modalities.
Speaker Photo and Bio
Dr. Chung-Hsien Wu received the B.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1981, and the M.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University (NCKU), Taiwan, in 1987 and 1991, respectively. Since 1991, he has been with the Department of Computer Science and Information Engineering, NCKU, Taiwan, where he became Professor in 1997 and Distinguished Professor in 2004. He also worked at the Computer Science and Artificial Intelligence Laboratory of the Massachusetts Institute of Technology (MIT), Cambridge, MA, as a visiting scientist in summer 2003. Currently, he is the Deputy Dean of the College of Electrical Engineering and Computer Science, NCKU. He received the Outstanding Research Award of the National Science Council in 2010 and the Distinguished Electrical Engineering Professor Award of the Chinese Institute of Electrical Engineering, Taiwan, in 2011. He is currently an Associate Editor of the IEEE Transactions on Audio, Speech and Language Processing, IEEE Transactions on Affective Computing, and ACM Transactions on Asian Language Information Processing, and the Subject Editor on Information Engineering of the Journal of the Chinese Institute of Engineers (JCIE). His research interests include affective computing, expressive speech synthesis, and spoken language processing. Dr. Wu is a senior member of the IEEE and a member of the International Speech Communication Association (ISCA). He was President of the Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Taiwan, from 2009 to 2011. He was Chair of the IEEE Tainan Signal Processing Chapter and has been Vice Chair of the IEEE Tainan Section since 2009.