Chinese Dark Web Product Detection and Classification Based on Multimodal Data Augmentation

Journal of Information Security Research, 2026, Vol. 12, Issue (6): 575-.

Yang Kaijie1, Luo Wenhua1, and Li Jing2

1(School of Public Security Information Technology and Intelligence, Criminal Investigation Police University of China, Shenyang 110035)
2(Basic Teaching and Research Department, Criminal Investigation Police University of China, Shenyang 110035)

Online: 2026-06-07 Published: 2026-06-07

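Corresponding author: Luo Wenhua, master's degree, professor. Main research interest: cybersecurity law enforcement technology. luowenhua770404@126.com
About the authors:
Yang Kaijie, master's student. Main research interest: cybersecurity law enforcement technology. 2571761460@qq.com
Luo Wenhua, master's degree, professor. Main research interest: cybersecurity law enforcement technology. luowenhua770404@126.com
Li Jing, master's degree, lecturer. Main research interest: public security studies. 592579981@qq.com
Funding: National Key Research and Development Program of China (2021YFC3301801); Basic Scientific Research Project for Universities of the Education Department of Liaoning Province (LJ212410175002); Fundamental Research Funds for the Central Universities (C2024012); Graduate Innovation Ability Improvement Project of Criminal Investigation Police University of China (2025YCZD03)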

Abstract

Existing dark Web intelligence classification research is coarse-grained and relies predominantly on English-language datasets. To address these issues, this paper proposes a fine-grained analysis method for Chinese dark Web content. To overcome the scarcity of Chinese dark Web data and the misalignment of multimodal data, the study uses a large language model prompt rewriting strategy and a differentiated image enhancement strategy to augment text and image data. By mixing in product data from a certain platform on the Surface Web, a dataset of 14,052 product records was constructed. A feature selection optimization module was designed to establish an inter-task coupling mechanism, and a Chinese dark Web product detection and classification model based on multimodal data augmentation is proposed. Experimental results show that the proposed model achieves macro-F1 scores of 0.992 and 0.941 on the dark Web product detection and classification tasks, respectively. On the classification task this is an improvement of approximately 2% over the best baseline model, and the model significantly outperforms existing single-modal and multimodal methods. The approach improves the performance of fine-grained classification for Chinese dark Web intelligence and offers new ideas and methods for dark Web intelligence analysis.

Key words: dark Web product, multimodal, data augmentation, detection, classification
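
The macro-F1 scores quoted in the abstract are, under the standard definition, the unweighted mean of the per-class F1 scores, so rare product categories count as much as frequent ones. The following is a minimal sketch of that metric only; the function name `macro_f1` and the toy labels are hypothetical and are not taken from the paper's dataset or evaluation code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # predicted class p wrongly
            fn[t] += 1   # missed an instance of class t

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with hypothetical category labels (not the paper's classes).
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Because each class contributes equally, a model that performs poorly on a small category is penalized more under macro-F1 than under accuracy, which is one reason the metric is commonly reported for imbalanced multi-class tasks.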

CLC Number: TP393

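Yang Kaijie, Luo Wenhua, Li Jing. Chinese Dark Web Product Detection and Classification Based on Multimodal Data Augmentation[J]. Journal of Information Security Research, 2026, 12(6): 575-.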
