Journal of Information Security Research ›› 2023, Vol. 9 ›› Issue (7): 637-.

Data Scarcity and Large Language Model Data Value Asymmetry


  • Online: 2023-07-01  Published: 2023-07-01



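  1. National Customs Information Center, Beijing 100005
  • Corresponding author: Wang Xiang, Ph.D., senior engineer. Main research interest: digital government.
  • About the authors: Wang Xiang, Ph.D., senior engineer. Main research interest: digital government. Zhou Hui, Ph.D., associate research fellow. Main research interests: network governance, data privacy, and intelligent rule of law. Li Zhipeng, M.S., senior engineer. Main research interest: customs clearance informatization. Xing Yun, senior engineer. Main research interest: customs clearance informatization.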

Abstract: With the rapid development of the large language model (LLM) industry, market competition has driven a rapid expansion in model scale. On the supply side, however, the datasets available for training are relatively insufficient and increasingly scarce; high-value datasets in particular cannot keep pace with the exponential growth in LLM computation scale on the demand side. Under today's stringent institutional constraints on data as a factor of production, the operating mechanism of LLMs exhibits natural-monopoly characteristics. Differences among economies in data governance philosophy, differences in the technical environment of the international segment, and algorithmic discrimination all widen the value asymmetry between supply and demand, affect the distribution of LLM data value, and strengthen LLM owners' data monopoly. Although China's LLM industry faces a series of technical constraints in the international segment, it has an advantage in dataset endowment, with great potential in both quantity and quality, which could help it accumulate data value benefits. Going forward, it is necessary to strengthen the construction of self-supporting LLM platforms, input and output value indicators, and international rules, and to emphasize policy guidance for the LLM industry.

Key words: data scarcity, data value asymmetry, data monopoly, artificial intelligence generated content (AIGC), large language model (LLM), cross-border data chain

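Abstract (translated from Chinese): With the rapid development of the large model industry, the needs of market competition have caused model scale to expand rapidly. At the same time, the supply of data usable for training is relatively insufficient and will become increasingly scarce; high-quality data in particular cannot meet the demand created by the exponential growth in the computation scale of large models. Now that institutional constraints on data are increasingly tight, the operating mechanism of large models exhibits natural-monopoly characteristics, while differences among major economies in their approaches to data governance, differences in technical conditions on the international segment, algorithmic discrimination, and other factors continue to widen the value asymmetry between supply and demand, affect the distribution of data value in large models, and in turn strengthen the data monopoly of large model owners. Although China's development of the large model industry faces a series of technical constraints on the international segment, China has an advantage in data endowment, with great potential in both quantity and quality. To better accumulate data value benefits, future work should strengthen the construction of self-supporting platforms, evaluation indicators, and international rules, and should emphasize policy guidance for the large model industry.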

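Keywords (translated from Chinese): data scarcity, data value asymmetry, data monopoly, intelligent generation (AIGC), large model (LLM), cross-border data chain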