Journal of Information Security Research (信息安全研究), 2016, Vol. 2, Issue 1: 44-57.
郑方
Received: 2015-12-18
Online: 2016-01-05
Published: 2016-01-18
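Corresponding author: 郑方
About the authors:
郑方: Professor and doctoral supervisor; main research interests are speaker recognition, speech recognition, and natural language processing. fzheng@tsinghua.edu.cn
李蓝天: PhD candidate; main research interest is speaker recognition. lilt@cslt.riit.tsinghua.edu.cn
张慧: Undergraduate student; main research interest is speaker recognition. hebe.hui.zhang@gmail.com
艾斯卡尔·肉孜: PhD candidate; main research interest is speaker recognition. askar@cslt.riit.tsinghua.edu.cn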
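Abstract: With the rapid development of information technology, how to accurately authenticate a person's identity, protect personal privacy, and safeguard information security has become an urgent problem. Compared with traditional identity authentication methods, biometric identity authentication relies on traits that cannot be lost, stolen, or forgotten in use; it is not only fast and convenient but also accurate and reliable. As one of the most popular biometric recognition technologies today, voiceprint recognition has unique advantages in applications such as remote authentication and has attracted growing attention. Taking voiceprint recognition technology and its application status as the main thread, this paper introduces in turn the basic concepts, development history, application status, and industry standardization status of voiceprint recognition; surveys the various problems voiceprint recognition faces and their solutions; and finally offers an outlook on the future development of the technology and its applications.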
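Cite this article: 郑方. Voiceprint recognition technology and its application status [J]. Journal of Information Security Research (信息安全研究), 2016, 2(1): 44-57.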