面向企业数据孤岛的联邦排序学习
作者:
作者简介:

史鼎元(1998-),男,本科,主要研究领域为联邦学习,时空大数据分析处理,众包计算,群体智能,隐私保护.
王晏晟(1994-),男,博士,CCF学生会员,主要研究领域为联邦学习,时空大数据分析处理,众包计算,群体智能,隐私保护.
郑鹏飞(1996-),男,硕士,CCF学生会员,主要研究领域为联邦学习,时空大数据分析处理,众包计算,群体智能,隐私保护.
童咏昕(1982-),男,博士,教授,博士生导师,CCF高级会员,主要研究领域为联邦学习,时空大数据分析处理,众包计算,群体智能,隐私保护.

通讯作者:

童咏昕,E-mail:yxtong@buaa.edu.cn

基金项目:

国家重点研发计划(2018AAA0101100);国家自然科学基金(61822201,U1811463);软件开发环境国家重点实验室(北京航空航天大学)开放课题(SKLSDE-2020ZX-15)


Cross-Silo Federated Learning-to-Rank
Author:
  • SHI Ding-Yuan

    SHI Ding-Yuan

    State Key Laboratory of Software Development Enviroment(Beihang University), Beijing 100191, China;Beijing Advanced Innovation Center for Big Data and Brain Computing(Beihang University), Beijing 100191, China;School of Computer Science and Engineering, Beihang University, Beijing 100191, China
    在期刊界中查找
    在百度中查找
    在本站中查找
  • WANG Yan-Sheng

    WANG Yan-Sheng

    State Key Laboratory of Software Development Enviroment(Beihang University), Beijing 100191, China;Beijing Advanced Innovation Center for Big Data and Brain Computing(Beihang University), Beijing 100191, China;School of Computer Science and Engineering, Beihang University, Beijing 100191, China
    在期刊界中查找
    在百度中查找
    在本站中查找
  • ZHENG Peng-Fei

    ZHENG Peng-Fei

    State Key Laboratory of Software Development Enviroment(Beihang University), Beijing 100191, China;Beijing Advanced Innovation Center for Big Data and Brain Computing(Beihang University), Beijing 100191, China;School of Computer Science and Engineering, Beihang University, Beijing 100191, China
    在期刊界中查找
    在百度中查找
    在本站中查找
  • TONG Yong-Xin

    TONG Yong-Xin

    State Key Laboratory of Software Development Enviroment(Beihang University), Beijing 100191, China;Beijing Advanced Innovation Center for Big Data and Brain Computing(Beihang University), Beijing 100191, China;School of Computer Science and Engineering, Beihang University, Beijing 100191, China
    在期刊界中查找
    在百度中查找
    在本站中查找
Fund Project:

National Key Research and Development Program of China (2018AAA0101100); National Natural Science Foundation of China (61822201, U1811463); State Key Laboratory of Software Development Environment (Beihang University) Open Program (SKLSDE-2020ZX-15)

  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [60]
  • |
  • 相似文献 [20]
  • | | |
  • 文章评论
    摘要:

    排序学习(learning-to-rank,简称LTR)模型在信息检索领域取得了显著成果,而该模型的传统训练方法需要收集大规模文本数据.然而,随着数据隐私保护日渐受到人们重视,从多个数据拥有者(如企业)手中收集数据训练排序学习模型的方式变得不可行.各企业之间数据被迫独立存储,形成了数据孤岛.由于排序模型训练需要使用查询记录、文档等诸多隐私信息,数据孤岛难以融合打通,这制约了排序学习模型的训练.联邦学习能够让多数据拥有方在隐私保护的前提下联合训练模型,是一种打通数据孤岛的新方法.在其启发下,提出了一种新的框架,即面向企业数据孤岛的联邦排序学习,它同时解决了联邦学习场景下排序学习所面临的两大挑战,即交叉特征生成与缺失标签处理.为了应对多方交叉特征的生成问题,使用了一种基于略图(sketch)数据结构与差分隐私的方法,其相比于传统加密方法具有更高的效率,同时还具有隐私性与结果精度的理论保证.为了应对缺失标签问题,提出了一种新的联邦半监督学习方法.最终,通过在公开数据集上的大量实验,验证了所提方法的有效性.

    Abstract:

    Learning-to-rank (LTR) model has made a remarkable achievement. However, traditional training scheme for LTR model requires large amount of text data. Considering the increasing concerns about privacy protection, it is becoming infeasible to collect text data from multiple data owners as before, and thus data is forced to save separately. The separation turns data owners into data silos, among which the data can hardly exchange, causing LTR training severely compromised. Inspired by the recent progress in federated learning, a novel framework is proposed named cross-silo federated learning-to-rank (CS-F-LTR), which addresses two unique challenges faced by LTR when applied it to federated scenario. In order to deal with the cross-party feature generation problem, CS-F-LTR utilizes a sketch and differential privacy based method, which is much more efficient than encryption-based protocols meanwhile the accuracy loss is still guaranteed. To tackle with the missing label problem, CS-F-LTR relies on a semi-supervised learning mechanism that facilitates fast labeling with mutual labelers. Extensive experiments conducted on public datasets verify the effectiveness of the proposed framework.

    参考文献
    [1] Burges CJC, Shaked T, Renshaw E, et al. Learning to rank using gradient descent. In:Proc. of the 22th Int'l Conf. on Machine Learning. 2005. 89-96.
    [2] Chapelle O, Chang Y. Yahoo! Learning to rank challenge overview. In:Proc. of the 28th Int'l Conf. on Machine Learning. 2011. 1-24.
    [3] Liu TY. Learning to Rank for Information Retrieval. Springer-Verlag, 2011.
    [4] Yang Q, Liu Y, Chen TJ, et al. Federated machine learning:Concept and applications. ACM Trans. on Intelligent Systems and Technology, 2019,10(2):12:1-12:19.
    [5] Chang EY, Zhu KH, Wang H, et al. Parallelizing support vector machines on distributed computers. In:Proc. of the 21st Neural Information Processing Systems. 2007. 257-264.
    [6] Yang Q, Liu Y, Chen TJ, et al. Federated learning. Communications of the CCF, 2018,14(11):49-55(in Chinese with English abstract).
    [7] Yin DW, Hu YN, Tang JL, et al. Ranking relevance in Yahoo search. In:Proc. of the 22nd ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. 2016. 323-332.
    [8] Li C, Shen DR, Kou Y, et al. Diversity-aware kNN query processing approaches for temporal spatial textual content. Pattern Recognition and Artificial Intelligence, 2017,30(1):64-72(in Chinese with English abstract).
    [9] Hou YX, Duan L, Li L, et al. Search of genes with similar phenotype based on disease information network. Ruan Jian Xue Bao/Journal of Software, 2018,29(3):721-733(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5445.htm[doi:10.13328/j.cnki.jos.005445]
    [10] Barbaro M, Zeller T. A face is exposed for AOLsearcherno. 4417749. New York Times, 2006.
    [11] Chor B, Goldreich O, Kushilevitz E, et al. Private information retrieval. In:Proc. of the 36th IEEE Symp. on Foundations of Computer Science. 1995. 41-50.
    [12] Liu YM, Zhou HF, Wang ZH, et al. Purpose fusion:The risk purpose basedprivacy-aware data access control. Chinese Journal of Computers, 2010,33(08):1339-1348(in Chinese with English abstract).
    [13] Pang HH, Ding XH, Xiao XK. Embellishing text search queries to protect user privacy. Proc. of the VLDB Endowmen, 2010,3(1):598-607.
    [14] Rebollo-Monedero D, Forné J. Optimized query forgery for private information retrieval. IEEE Tans. on Information Theory, 2010, 56(9):4631-4642.
    [15] Murugesan M, Clifton C. Providing privacy through plausibly deniable search. In:Proc. of the SIAM Int'l Conf. on Data Mining. 2009. 768-779.
    [16] Gaboardi M, Arias EJG, Hsu J, et al. Dual query:Practical private query release for high dimensional data. In:Proc. of the 31th Int'l Conf. on Machine Learning. 2011. 1170-1178.
    [17] Gertner Y, Ishai Y, Kushilevitz E, et al. Protecting data privacy in private information retrieval schemes. Journal of Computer and System Sciences, 2000,60(3):592-629.
    [18] Freedman MJ, Ishai Y, Pinkas B, et al. Keyword search and oblivious pseudorandom functions. In:Proc. of the Theory of Cryptography Conf. 2005. 303-324.
    [19] Weng L, Amsaleg L, Morton A, et al. A privacy-preserving framework for large-scale content-based information retrieval. IEEE Trans. on Information Forensics and Security, 2015,10(1):152-167.
    [20] Curtmola R, Garay JA, Kamara S, et al. Searchable symmetric encryption:Improved definitions and efficient constructions. Journal of Computer Security, 2011,19(5):895-934.
    [21] Agrawal R, Kiernan J, Srikant R, et al. Order-Preserving encryption for numeric data. In:Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. 2004. 563-574.
    [22] Zhang W, Lin YP, Xiao S, et al. Privacy preserving ranked multi-keyword search for multiple data owners in cloud computing. IEEE Trans. on Computers, 2016,65(5):1566-1577.
    [23] Ji SY, Shao JJ, Agun D, et al. Privacy-Aware ranking with tree ensembles on the cloud. In:Proc. of ACM SIGIR Conf. on Research and Development in Information Retrieval. 2018. 315-324.
    [24] Konecny J, McMahan HB, Yu FX, et al. Federated learning:Strategies for improving communication efficiency. CoRR, 2016, abs/1610.05492.
    [25] Bonawitz K, Ivanov V, Kreuter B, et al. Practical secure aggregation for privacy-preserving machine learning. In:Proc. of ACM Conf. on Computer and Communications Security. 2017. 1175-1191.
    [26] McMahan HB, Ramage D, Talwar K, et al. Learning differentially private recurrent language models. In:Proc. of Int'lConf. on Learning Representations. 2018.
    [27] Jiang D, Song YF, Tong YX, et al. Federated topic modeling. In:Proc. of the 28th ACM Int'l Conf. on Information and Knowledge Management. 2019. 1071-1080.
    [28] Wang YS, Tong YX, Shi DY. Federated latent dirichletallocation:A local differential privacy basedframework. In:Proc. of the 34th AAAI Conf. on Artificial Intelligence. 2020. 6283-6290.
    [29] McMahan B, Moore E, Ramage D, et al. Communication-Efficient learning of deep networks from decentralized data. In:Proc. of the 20thInt'l Conf. on Artificial Intelligence and Statistics. 2017. 1273-1282.
    [30] Song TS, Tong YX, Wei SY. Profit allocation for federated learning. In:Proc. of the IEEE Int'l Conf. on Big Data. 2019. 2577-2586.
    [31] Smith V, Chiang CK, Sanjabi M, et al. Federated multi-task learning. In:Proc. of the 31st Neural Information Processing Systems. 2017. 4424-4434.
    [32] Zhao Y, Li M, Lai L, et al. Federated learning with non-iiddata. arXiv preprint arXiv, 2018, 1806.00582.
    [33] Pan RS, Han DM, Pan JC, et al. Visualization for federated learning:Chanllenges and framework. Journal of Computer-Aided Design & Computer Graphics, 2020,32(4):513-519(in Chinese with English abstract).
    [34] Kairouz P, McMahan HB, Avent B, et al. Advances and open problems in federated learning. CoRR, 2019, abs/1912.04977.
    [35] Kharitonov E. Federated online learning to rank with evolutionstrategies. In:Proc of the Int'l Conf. on Web Search and Data Mining. 2019. 249-257.
    [36] Amini MR, Truong TV, Goutte C. A boosting algorithm for learning bipartite ranking functions with partially labeled data. In:Proc. of the ACM SIGIR Conf. on Research and Development in Information Retrieval. 2008. 99-106.
    [37] Duh K, Kirchhoff K. Learning to rank with partially-labeled data. In:Proc. of the ACM SIGIR Conf. on Research and Development in Information Retrieval. 2008. 251-258.
    [38] Zhang X, He B, Luo TJ, et al. Performance analysis of clustering-based transductivelearning. Ruan Jian Xue Bao/Journal of Software, 2014,25(12):2865-2876(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/4726.htm[doi:10.13328/j. cnki. jos.004726]
    [39] Laine S, Aila T. Temporal ensembling for semi-supervised learning. In:Proc. of the Int'l Conf. on Learning Representations. 2017.
    [40] Tarvainen A, Valpola H. Mean teachers are better role models:Weight-averaged consistency targets improve semi-supervised deep learning results. In:Proc. of the 31st Neural Information Processing Systems. 2017. 1195-1204.
    [41] Muthukrishnan S. Data streams:Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 2005.
    [42] Charikar M, Chen KC, Farach-Colton M. Findingfrequent items in data streams. Theoretical Computer Science, 2004,312(1):3-15.
    [43] Cormode G, Muthukrishnan S. An improved data stream summary:The count-min sketch and its applications. Journal of Algorithms, 2005,55(1):58-75.
    [44] Jiang JW, Fu FC, Yang T, et al. SketchML:Accelerating distributed machine learning with data sketches. In:Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. 2018. 1269-1284.
    [45] Zhu WN, Kairouz P, Sun HC, et al. Federated heavy-hitters discovery with differential privacy. CoRR, 2019, abs/1902.08534
    [46] Dwork C. Differential privacy. In:Proc. of the 33rd Int'l Colloquium on Automata, Languages and Programming. 2006. 1-12.
    [47] Liang WJ, Chen H, Wu YC, et al. Differential privacy under continual observation. Ruan Jian Xue Bao/Journal of Software, 2020,31(6):1761-1785(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/6042.htm[doi:10.13328/j.cnki.jos. 006042]
    [48] Li T, Liu ZX, Sekar V, et al. Privacy for free:Communication-efficient learning with differential privacy using sketches. CoRR, 2019, abs/1911.00972.
    [49] Robertson SE. Overview of the Okapi projects. Journal of Documentation, 1997,53(1):3-7.
    [50] Ponte JM, Croft WB. A language modeling approach to information retrieval. In:Proc. of the ACM SIGIR Conf. on Research and Development in Information Retrieval. 1998. 275-281.
    [51] Qin T, Liu TY. Introducing LETOR 4.0 datasets. CoRR, 2013, abs/1306.2597.
    [52] Cormode G, Muthukrishnan S. Summarizing and mining skewed data streams. In:Proc. of the SIAM Int'l Conf. on Data Mining. 2005. 44-55.
    附中文参考文献:
    [6] 杨强,刘洋,陈天健,等.联邦学习.中国计算机学会通讯,2018,14(11):49-55.
    [8] 李晨,申德荣,寇月,等.多样性感知的时空文本信息的KNN查询处理方法.模式识别与人工智能,2017,30(01):64-72.
    [9] 侯泳旭,段磊,李岭,等.基于疾病信息网络的表型相似基因搜索.软件学报,2018,29(3):721-733. http://www.jos.org.cn/1000-9825/5445.htm[doi:10.13328/j.cnki.jos.005445]
    [12] 刘逸敏,周浩峰,王智慧,等.Purpose融合:基于风险purpose的隐私查询访问控制.计算机学报,2010,33(8):1339-1348.
    [33] 潘如晟,韩东明,潘嘉铖,等.联邦学习可视化:挑战与框架.计算机辅助设计与图形学学报,2020,32(4):513-519.
    [38] 张新,何苯,罗铁坚,等.基于聚类的直推式学习的性能分析.软件学报,2014,25(12):2865-2876. http://www.jos.org.cn/1000-9825/4726.htm[doi:10.13328/j.cnki.jos.004726]
    [47] 梁文娟,陈红,吴云乘,等.持续监控下差分隐私保护.软件学报,2020,31(6):1761-1785. http://www.jos.org.cn/1000-9825/6042.htm[doi:10.13328/j.cnki.jos.006042]
    引证文献
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

史鼎元,王晏晟,郑鹏飞,童咏昕.面向企业数据孤岛的联邦排序学习.软件学报,2021,32(3):669-688

复制
分享
文章指标
  • 点击次数:2635
  • 下载次数: 6851
  • HTML阅读次数: 3601
  • 引用次数: 0
历史
  • 收稿日期:2020-07-19
  • 最后修改日期:2020-09-03
  • 在线发布日期: 2021-01-21
  • 出版日期: 2021-03-06
文章二维码
您是第19745195位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号