Voiceprint Recognition Technology and Its Application Status

Abstract

Abstract: With the rapid development of information technology, how to accurately authenticate a person's identity, protect personal privacy, and safeguard information security has become a pressing problem. Compared with traditional identity authentication methods, biometric authentication relies on traits that cannot be lost, stolen, or forgotten; it is not only fast and convenient but also accurate and reliable. As one of the most popular biometric technologies today, voiceprint recognition has unique advantages in remote authentication and other application areas, and has attracted growing attention. Taking voiceprint recognition technology and its application status as the main thread, this paper introduces in turn the basic concepts, development history, application status, and industrial standardization status of voiceprint recognition; surveys the problems that voiceprint recognition faces and their solutions; and finally looks ahead to the prospects of the technology and its applications.

Key words: biometric recognition, identity authentication, voiceprint recognition, development history, technology applications

Zheng Fang. Voiceprint recognition technology and its application status[J]. Journal of Information Security Research (信息安全研究), 2016, 2(1): 44-57.

