    • 1. Invention application
    • DATA EXTRACTION METHOD, COMPUTER PROGRAM PRODUCT AND SYSTEM
    • WO2011063561A1
    • 2011-06-03
    • PCT/CN2009/075117
    • 2009-11-25
    • HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P.; JIAO, Li-Mei; XIONG, Yuhong
    • JIAO, Li-Mei; XIONG, Yuhong
    • G06F17/30
    • G06F17/30896
    • Disclosed is a method of automatically extracting data from a target web page, comprising selecting (302) data in a source web page; determining (304) the respective DOM (document object model) trees of the source and target web page, and identifying the one or more nodes comprising the selected data in the source web page DOM tree; determining (306) matching paths in the respective DOM trees; for selected data in a node of an unmatched branch of the source web page DOM tree, identifying (308) the nearest matched path in the source web page; identifying (310) the unmatched branch nearest to the corresponding matched path in the target web page; determining (312) if said identified unmatched branch in the target web page DOM tree comprises a target node matching the selected data node; and if so: extracting (322) data from the target node if the mismatch between the respective unmatched branches does not exceed a predefined threshold. A computer program product and system implementing this method are also disclosed.
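The abstract above describes matching root-to-node paths between the DOM trees of a source and a target page, then extracting from the nearest branch only when the mismatch stays under a threshold. The following is a minimal sketch of that idea, not the patented method itself: it collapses the matched-path and unmatched-branch steps into a single nearest-path search. The names `PathCollector`, `text_paths`, `mismatch`, and `extract`, the cost function, and the default threshold of 2 are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record (path, text) for every non-empty text node.

    A path is a tuple of (tag, sibling index) steps from the root.
    """

    def __init__(self):
        super().__init__()
        self.stack = []          # current path of (tag, sibling index)
        self.child_counts = [0]  # element children seen so far at each depth
        self.texts = []          # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        index = self.child_counts[-1]
        self.child_counts[-1] += 1
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, index))
        self.child_counts.append(0)

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise pop back to the matching open tag.
        if any(open_tag == tag for open_tag, _ in self.stack):
            while True:
                open_tag, _ = self.stack.pop()
                self.child_counts.pop()
                if open_tag == tag:
                    break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append((tuple(self.stack), text))


def text_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def mismatch(path_a, path_b):
    """Assumed cost: 1 per differing tag or index, 2 per missing step."""
    cost = 2 * abs(len(path_a) - len(path_b))
    for (tag_a, idx_a), (tag_b, idx_b) in zip(path_a, path_b):
        cost += (tag_a != tag_b) + (idx_a != idx_b)
    return cost


def extract(source_html, target_html, selected_text, threshold=2):
    """Return the target text whose path is nearest to the selected source node."""
    source_path = next(
        (path for path, text in text_paths(source_html) if text == selected_text),
        None,
    )
    candidates = text_paths(target_html)
    if source_path is None or not candidates:
        return None
    best_path, best_text = min(candidates, key=lambda c: mismatch(source_path, c[0]))
    # Refuse to extract when the nearest branch is still too different.
    return best_text if mismatch(source_path, best_path) <= threshold else None


# Hypothetical pages: the target has an extra banner <div> before the product block.
SOURCE = "<html><body><div><h1>Widget</h1><span>$10</span></div></body></html>"
TARGET = "<html><body><div>Ad</div><div><h1>Gadget</h1><span>$25</span></div></body></html>"
print(extract(SOURCE, TARGET, "$10"))  # prints $25
```

The threshold check is what keeps the sketch from returning unrelated text when the target page has a different layout: in that case `extract` returns `None` rather than the closest, but still poor, candidate.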
    • 8. Invention application
    • SEED SET EXPANSION
    • WO2012061983A1
    • 2012-05-18
    • PCT/CN2010/078595
    • 2010-11-10
    • HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P.; YAO, Cong-Lei; XIONG, Yuhong; ZHENG, Li-Wei
    • YAO, Cong-Lei; XIONG, Yuhong; ZHENG, Li-Wei
    • G06F17/27
    • G06F17/30442; G06F17/30896
    • Systems and methods for seed set expansion are provided. A context-based extractor (22) generates a set of context-based candidate members of a seed set from a set of web pages associated with an organization as words connected with a seed set member by a contextual pattern and a context confidence value for each candidate member. A list-based extractor (24) generates a set of list-based candidate members from elements within a plurality of lists in the set of web pages and a list confidence value associated with each candidate member. A confidence arbitrator (26) determines an intersection set of candidate members present in both sets of candidate members and determines a final confidence value for each of the intersection set of candidate members based on their respective context confidence value and list confidence value. A candidate selector (28) selects a candidate member for inclusion in a seed set (21).
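The confidence arbitration step in the abstract above can be sketched as follows. The abstract does not say how the context confidence and the list confidence are combined, so the plain average used here is an assumption, as are the function names `arbitrate` and `select_candidate`, the rule of picking the highest-scoring candidate, and the sample candidate names and scores.

```python
def arbitrate(context_candidates, list_candidates, combine=None):
    """Keep candidates found by both extractors and merge their confidences.

    Both arguments map a candidate name to a confidence value.
    The default combination (a plain average) is an assumption.
    """
    if combine is None:
        combine = lambda context_conf, list_conf: (context_conf + list_conf) / 2
    shared = context_candidates.keys() & list_candidates.keys()
    return {
        name: combine(context_candidates[name], list_candidates[name])
        for name in shared
    }


def select_candidate(final_confidences, seed_set):
    """Pick the highest-confidence candidate not already in the seed set."""
    remaining = {
        name: conf for name, conf in final_confidences.items() if name not in seed_set
    }
    if not remaining:
        return None
    return max(remaining, key=remaining.get)


# Hypothetical extractor outputs for one organization's web pages.
context_based = {"printer": 0.75, "scanner": 0.5, "careers": 0.25}
list_based = {"printer": 0.5, "scanner": 0.25, "support": 0.75}
seed_set = {"laptop"}

final = arbitrate(context_based, list_based)
chosen = select_candidate(final, seed_set)
if chosen is not None:
    seed_set.add(chosen)
print(chosen)  # prints printer
```

Restricting the result to the intersection means that "careers" and "support", each seen by only one extractor, never reach the selection step, whatever their individual scores.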