Journal of University of Chinese Academy of Sciences

Robust semi-supervised learning model based on model averaging and γ-divergence*

WU Huizhen, ZHANG Sanguo

  1. School of Mathematical Sciences, University of Chinese Academy of Sciences, Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2024-01-03 Revised: 2024-04-18 Published: 2024-05-22
  • Corresponding author: E-mail: sgzhang@ucas.ac.cn
  • Funding:
    * Supported by the National Natural Science Foundation of China (Grant Nos. 12171454, U19B2940) and the Fundamental Research Funds for the Central Universities

Abstract: Semi-supervised learning is a key research problem in pattern recognition and machine learning and has been widely applied in many fields in recent years. In practice, labeled samples are costly to obtain, whereas unlabeled samples, although they lack label information, are much easier to collect. Semi-supervised learning therefore uses a large amount of unlabeled data together with a small amount of labeled data for pattern recognition. In this paper, we propose a robust semi-supervised method based on model averaging and γ-divergence. On the one hand, model averaging is introduced to address the low quality of unlabeled data; on the other hand, logistic regression based on γ-divergence is introduced to address mislabeling in the labeled data. The advantage of the proposed model is that it exploits the differences among the predictions of different models to process the data, making effective use of the information in the unlabeled data while minimizing the harmful information it contains. Introducing γ-divergence reduces the influence of mislabeled samples in the labeled data on the fit, so that the resulting model is robust with respect to both unlabeled and labeled data. Simulation studies and an application to the Breast Cancer Wisconsin dataset show that, compared with existing semi-supervised learning methods, the proposed method clearly improves prediction performance when data quality is low.

Key words: semi-supervised learning, model averaging, γ-divergence, robustness
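
As a rough illustration of why a γ-divergence-type loss is less sensitive to mislabeled samples than the ordinary log-likelihood, the sketch below compares the two losses for binary logistic regression on a toy data set containing one mislabeled point. It is not the authors' implementation: the function names, the value γ = 0.5, the coefficient vector, and the toy data are all assumptions made for illustration, and the loss uses one commonly cited γ-divergence form in which each sample contributes (f(y|x) / ‖f(·|x)‖_{γ+1})^γ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gamma_contributions(beta, X, y, gamma=0.5):
    """Per-sample terms (f(y_i|x_i) / ||f(.|x_i)||_{gamma+1})**gamma.

    Each term lies in [0, 1], so one badly fitted (e.g. mislabeled)
    sample can change the objective only by a bounded amount.
    """
    p = sigmoid(X @ beta)                      # P(y = 1 | x)
    f_obs = np.where(y == 1, p, 1.0 - p)       # probability of the observed label
    norm = (p ** (gamma + 1) + (1.0 - p) ** (gamma + 1)) ** (1.0 / (gamma + 1))
    return (f_obs / norm) ** gamma

def gamma_logistic_loss(beta, X, y, gamma=0.5):
    """Illustrative gamma-type loss: negative mean of the bounded terms."""
    return -np.mean(gamma_contributions(beta, X, y, gamma))

def log_loss(beta, X, y):
    """Ordinary negative log-likelihood of logistic regression."""
    p = sigmoid(X @ beta)
    f_obs = np.where(y == 1, p, 1.0 - p)
    return -np.mean(np.log(f_obs))

# Toy data (assumed): intercept column plus one feature.
# The third sample carries the wrong label for its feature value.
X = np.array([[1.0, 5.0], [1.0, -5.0], [1.0, 5.0]])
y = np.array([1, 0, 0])
beta = np.array([0.0, 1.0])                    # hypothetical coefficients

print("log-loss:  ", log_loss(beta, X, y))
print("gamma-loss:", gamma_logistic_loss(beta, X, y))

# Scaling the coefficients makes the fit more confident: the log-loss on
# the mislabeled point keeps growing, while the gamma-type loss stays in [-1, 0].
print("log-loss (4x beta):  ", log_loss(4 * beta, X, y))
print("gamma-loss (4x beta):", gamma_logistic_loss(4 * beta, X, y))
```

In this sketch the mislabeled point dominates the log-loss as the model becomes more confident, whereas its contribution to the γ-type loss is at most 1 per sample, which is the robustness property the abstract appeals to.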

CLC number: