无监督的中文商品属性结构化方法
作者:
作者单位:

作者简介:

通讯作者:

中图分类号:

基金项目:

国家重点基础研究发展计划(973)(2012CB316203);国家自然科学基金(61332006,61472321);西北工业大学基础研究基金(3102014JSJ0013,3102014JSJ0005)


Unsupervised Structralization Method of Merchandise Attributes in Chinese
Author:
Affiliation:

Fund Project:

National Program on Key Basic Research Project of China (973) (2012CB316203); National Natural Science Foundation of China (61332006, 61472321); Northwestern Polytechnical University Foundation for Fundamental Research (3102014JSJ0013, 3102014JSJ0005)

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    从非结构化商品描述文本中抽取结构化属性信息,对于电子商务实现商品的对比与推荐及用户需求预测等功能具有重要意义.现有结构化方法大多采用监督或半监督的分类方法抽取属性值与属性名,通过文法分析器分析属性值与属性名之间的文法依存关系,并根据关联规则实现属性值与属性名的匹配.这些方法存在以下不足:(1)需要人工标记部分属性值、属性名及它们之间的对应关系;(2)属性值-属性名匹配的准确度受到语言习惯、句意逻辑、语料库及属性名候选集质量的严重制约.提出了一种无监督的中文商品属性结构化方法.该方法借助搜索引擎,基于小概率事件原理分析文法关系来抽取属性值与属性名.同时,提出相对不选取条件概率场,并使用Page Rank算法来计算属性值与属性名的配对概率.该方法无需人工标记的开销,且无论商品描述中是否显式地包含相应的属性名,该方法都能自动抽取到属性值并匹配相应的属性名.使用百度搜索引擎上的真实语料,针对4类商品的中文描述进行了实验.实验结果验证了对于候选属性名的自动生成,所提出的基于搜索引擎搜索属性值,并在包含属性值的搜索结果中抽取一般名词的候选属性名生成方法与只在描述句中抽取一般名词的候选属性名生成方法相比,查全率提高了20%以上;对于非量化类属性,所提出的基于相对不选取条件概率场的属性值-属性名匹配方法与基于依存关联的方法相比,Rank-1的准确率提高了30%以上,平均MRR提高了0.3以上.

    Abstract:

    Extracting attribute names and values from textual product descriptions is important for many e-business applications such as user demand forecasting and product comparison and recommendation. The existing approaches first use supervised or semi-supervised classification techniques to extract attribute names and values, and then match them by analyzing their grammatical dependency. However, those methods have following limitations:(1) They require human intervention to label some attributes, values and the matching relationship between them; (2) The matching accuracy may be greatly affected by language habits, semantic logic, and the quality of corpus and candidates sets. To address these issues, this paper proposes an unsupervised approach for attribute name and value extraction and matching in Chinese textual merchandise descriptions. Taking advantage of search engine, it extracts the candidate set of attribute names with respect to a value by analyzing grammatical relation based on the principle of small probability event. A new algorithm for computing the matching probability between attribute names and values is also designed based on relative conditional deselect probability and Page Rank. The proposed approach can effectively extract attribute names and values from Chinese textual merchandise descriptions and match them without any human intervention, no matter whether the attribute name appears in the textual description or not. Finally, the performance of the proposed approach is evaluated on the textual descriptions of 4 types of merchandise using the search engine of Baidu. The experimental results show that the new approach for attribute name extraction can improve recall by 20%, compared with the approach of directly extracting attribute names from textual descriptions. Moreover, the new approach achieves considerably higher matching accuracy (above 30% if measured by the percentage of rank-1, above 0.3 if measured by MRR) than the existing techniques based on grammatical dependency analysis for non-quantization attributes.

    参考文献
    相似文献
    引证文献
引用本文

侯博议,陈群,杨婧颖,李战怀.无监督的中文商品属性结构化方法.软件学报,2017,28(2):262-277

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2015-08-15
  • 最后修改日期:2015-12-02
  • 录用日期:
  • 在线发布日期: 2017-01-24
  • 出版日期:
文章二维码
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号