Similarity Measures for XML Documents Based on Kernel Matrix Learning

    Abstract:

    XML documents, as a new form of data, have become an active research topic. Computing the similarity between XML documents is the basis for XML document analysis, management, and text mining. The Structured Link Vector Model (SLVM) measures XML document similarity by taking both structural and content information into account. The kernel matrix, which captures the relations between the structural units of XML documents, plays an important role in SLVM. To capture these relations automatically, this paper proposes two algorithms for learning the kernel matrix: a regression learning algorithm based on the support vector machine (SVM) and a learning algorithm based on matrix iteration. For performance evaluation, the proposed similarity measure is applied to similarity search. The experimental results show that the similarity measure based on kernel matrix learning significantly outperforms traditional measures. Further experiments show that, compared with the SVM-based regression algorithm, the matrix-iteration algorithm not only achieves higher accuracy but also requires fewer training documents and less computation.
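The role of the kernel matrix can be illustrated with a small sketch. The abstract does not state the similarity formula, so the form used here is an assumption: each document is a list of term-weight vectors, one per structural unit, and the similarity is the sum of dot products between unit vectors, weighted by the kernel matrix entry for each pair of units. The function name `slvm_similarity`, the toy documents, and the kernel values are all hypothetical and are not taken from the paper.

```python
def slvm_similarity(doc_x, doc_y, kernel):
    """Kernel-weighted similarity between two documents (assumed form).

    doc_x, doc_y: one term-weight vector per structural unit.
    kernel[i][j]: weight relating unit i of doc_x to unit j of doc_y.
    The result is unnormalized.
    """
    total = 0.0
    for i, unit_x in enumerate(doc_x):
        for j, unit_y in enumerate(doc_y):
            dot = sum(a * b for a, b in zip(unit_x, unit_y))
            total += kernel[i][j] * dot
    return total


# Toy data: two structural units (say, title and body), three terms.
doc_a = [[1, 0, 0], [0, 1, 0]]  # term 0 appears in unit 0
doc_b = [[0, 0, 0], [1, 0, 0]]  # term 0 appears in unit 1

# Identity kernel: units are only compared with themselves.
identity = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical kernel that also relates unit 0 to unit 1.
cross_unit = [[1.0, 0.5], [0.5, 1.0]]

sim_identity = slvm_similarity(doc_a, doc_b, identity)
sim_cross = slvm_similarity(doc_a, doc_b, cross_unit)
```

With the identity kernel the shared term is missed because it occurs in different units of the two documents. The off-diagonal entries let it contribute, which is the kind of relation between structural units that the two learning algorithms are meant to capture automatically.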

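Cite this article

Yang Jianwu, Chen Xiaoou. Similarity measures for XML documents based on kernel matrix learning. Journal of Software (软件学报), 2006, 17(5): 991-1000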

History
  • Received: 2005-06-30
  • Last revised: 2005-10-20