K-means++ Clustering Method Supporting Differential Privacy Protection in Spark Framework

Journal of Information Security Research (信息安全研究), 2024, Vol. 10, Issue (8): 712-.

Shi Jiangnan1, Peng Changgen1, and Tan Weijie2

1(State Key Laboratory of Public Big Data (Guizhou University), Guiyang 550025)
2(Key Laboratory of Advanced Manufacturing Technology (Guizhou University), Ministry of Education, Guiyang 550025)

Online: 2024-08-08 Published: 2024-08-08
Corresponding author: Peng Changgen, PhD, second-grade professor, CCF distinguished member. Main research interests: cryptography and information security. peng_stud@163.com
About the authors:
Shi Jiangnan, master's degree. Main research interests: big data security and privacy protection. sjn6529482@163.com
Peng Changgen, PhD, second-grade professor, CCF distinguished member. Main research interests: cryptography and information security. peng_stud@163.com
Tan Weijie, PhD, associate professor. Main research interests: communication network security. tanweijie829@126.com

Abstract

To address the tension between privacy and utility that differentially private clustering algorithms face when processing massive data, this paper proposes a K-means++ clustering algorithm that supports differential privacy in a distributed environment. The algorithm uses the in-memory computing engine Spark to create resilient distributed datasets (RDDs) and operates on the data with transformation and action operators. When selecting the initial centroids and when iteratively updating them, it combines the exponential mechanism and the Laplace mechanism to address both the sensitivity of the result to the initial cluster centers and the risk of privacy leakage, while reducing the perturbation applied to the data during computation. Based on the properties of differential privacy, the paper proves theoretically that the whole algorithm satisfies ε-differential privacy. Experimental results show that the method provides strong privacy protection and runs efficiently while preserving the usability of the clustering results.

Key words: data mining, clustering algorithm, differential privacy, Spark framework, exponential mechanism
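The abstract names two standard building blocks, the exponential mechanism for choosing initial centroids and the Laplace mechanism for perturbing centroid updates. The following is a minimal single-machine sketch of how those two mechanisms are commonly combined with K-means++, not the authors' algorithm. Several things here are assumptions that the abstract does not state: the function names (`private_seeding`, `noisy_update`, and the helpers) are hypothetical, the data is assumed to be normalized to [0, 1]^d, the utility score is the squared distance to the nearest chosen center, the score sensitivity is taken as d, and the even split of the privacy budget across seeding rounds is purely illustrative. In Spark, the per-point loop would presumably become a transformation that emits (cluster id, (point, 1)) pairs followed by a reduce-by-key aggregation, with noise added to the aggregated sums and counts.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Pick index i with probability proportional to exp(eps * score_i / (2 * sensitivity))."""
    top = max(scores)  # shift by the maximum for numerical stability
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


def private_seeding(points, k, epsilon, rng):
    """K-means++-style seeding where each new center is drawn by the exponential mechanism.

    Assumption: points lie in [0, 1]^d, so the score sensitivity is taken as d.
    Assumption: the budget is split evenly over the k - 1 private draws.
    """
    d = len(points[0])
    centers = [points[rng.randrange(len(points))]]
    eps_each = epsilon / max(1, k - 1)
    while len(centers) < k:
        scores = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(points[exponential_mechanism(scores, eps_each, float(d), rng)])
    return centers


def noisy_update(points, centers, epsilon, rng):
    """One centroid update with Laplace noise on per-cluster sums and counts."""
    d, k = len(points[0]), len(centers)
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for p in points:  # stand-in for a distributed map + reduce-by-key
        j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
        counts[j] += 1
        for t in range(d):
            sums[j][t] += p[t]
    # One record changes a sum by at most d (L1 norm) and a count by 1.
    scale = (d + 1) / epsilon
    updated = []
    for j in range(k):
        noisy_count = counts[j] + laplace_noise(scale, rng)
        if noisy_count < 1.0:
            updated.append(list(centers[j]))  # noisy count unusable: keep old center
            continue
        updated.append([
            min(1.0, max(0.0, (sums[j][t] + laplace_noise(scale, rng)) / noisy_count))
            for t in range(d)
        ])
    return updated
```

A caller would typically run `private_seeding` once and then call `noisy_update` for a fixed number of iterations, giving each iteration its own share of the total budget ε so that sequential composition bounds the overall privacy cost. Clamping the noisy centroids back into [0, 1] is post-processing and does not consume additional budget.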

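CLC number: TP301

Shi Jiangnan, Peng Changgen, Tan Weijie. K-means++ Clustering Method Supporting Differential Privacy Protection in Spark Framework[J]. Journal of Information Security Research, 2024, 10(8): 712-.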

