信息安全研究 ›› 2024, Vol. 10 ›› Issue (8): 712-.

• 学术论文 • 上一篇    下一篇

Spark框架下支持差分隐私保护的K-means++聚类方法

石江南1彭长根1谭伟杰2


  

  1. 1(公共大数据国家重点实验室(贵州大学)贵阳550025)
    2(现代制造技术教育部重点实验室(贵州大学)贵阳550025)

  • 出版日期:2024-08-08 发布日期:2024-08-08
  • 通讯作者: 彭长根 博士,二级教授,CCF杰出会员.主要研究方向为密码学与信息安全. peng_stud@163.com
  • 作者简介:石江南 硕士.主要研究方向为大数据安全与隐私保护. sjn6529482@163.com 彭长根 博士,二级教授,CCF杰出会员.主要研究方向为密码学与信息安全. peng_stud@163.com 谭伟杰 博士,副教授.主要研究方向为通信网络安全. tanweijie829@126.com

K-means++ Clustering Method Supporting Differential Privacy Protection in Spark Framework

Shi Jiangnan1, Peng Changgen1, and Tan Weijie2#br#

#br#
  

  1. 1(State Key Laboratory of Public Big Data(Guizhou University), Guiyang 550025)
    2(Key Laboratory of Advanced Manufacturing Technology(Guizhou University),Ministry of Education, Guiyang 550025)

  • Online:2024-08-08 Published:2024-08-08

摘要: 针对差分隐私聚类算法在处理海量数据时其隐私性和可用性之间的矛盾,提出了一种分布式环境下支持差分隐私的Kmeans++聚类算法.该算法通过内存计算引擎Spark,创建弹性分布式数据集,利用转换算子及行动算子操作数据进行运算,并在选取初始化中心点及迭代更新中心点的过程中,通过综合利用指数机制和拉普拉斯机制,以解决初始聚类中心敏感及隐私泄露问题,同时减少计算过程中对数据实施的扰动.根据差分隐私的特性,从理论角度对整个算法进行证明,以满足ε差分隐私保护.实验结果证明了该方法在确保聚类结果可用性的前提下,具备出色的隐私保护能力和高效的运行效率.

关键词: 数据挖掘, 聚类算法, 差分隐私, Spark框架, 指数机制

Abstract: To address the tradeoff between privacy and utility in differentially private clustering algorithms when handling with massive data, a distributed differentially private Kmeans++ clustering algorithm is proposed. This algorithm leverages the memorybased computing engine Spark to create resilient distributed datasets(RDD) and performs computations using transformation and action operators. During the selection of  initial centroids and iterative updates, a combination of the exponential mechanism and the Laplace mechanism is employed to mitigate the issues of sensitive initial centroids and privacy leakage, while reducing perturbation applied to the data during the computation. According to the characteristics of differential privacy, this paper provides a theoretical proof for the entire algorithm to satisfy εdifferential privacy protection. Experimental results demonstrate that this method possesses excellent privacy protection capabilities and efficient operational efficiency while ensuring the usability of clustering results.

Key words: data mining, clustering algorithm, differential privacy, Spark, exponential mechanism

中图分类号: