    • 3. Invention application
    • ADAPTIVE DOCUMENT SAMPLING FOR INFORMATION EXTRACTION
    • US20100228738A1
    • 2010-09-09
    • US12398162
    • 2009-03-04
    • Rupesh R. Mehta; Srinivasan H. Sengamedu
    • Rupesh R. Mehta; Srinivasan H. Sengamedu
    • G06F17/30; G06F17/21
    • G06F17/241; G06F16/951; G06F16/9558
    • A method and apparatus are provided for improved sampling of documents for training sets input to information extraction systems, which improves the recall and robustness of wrapper extraction. A passive sampling technique provides a list of documents to present for human annotation, ordered by the representativeness of each document based on structural and content statistics. Thus, the document that has the most interesting attributes and is most representative of the cluster of structurally similar documents to which it belongs is presented for annotation first. The problem is mapped to the classical ‘Set-Cover’ problem and solved using a greedy approach. An active sampling technique refines and reorders the sample list produced by the passive sampling technique after the initial annotations, based on the human annotation, spatial boundaries of the documents, and structural and content statistics. The proposed techniques work at a site level and perform page-level structural analysis using XPath-term frequency, XPath-document frequency, and XPath-importance.
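The abstract above says the passive sampling problem is mapped to Set-Cover and solved greedily, but it does not give the details. The following is a minimal sketch of a generic greedy set-cover ordering, not the patent's actual method. It assumes each page is represented by the set of XPaths it contains; the function name `greedy_sample_order`, the page ids, and the unweighted coverage score are all hypothetical (the patent also uses XPath-importance and content statistics, which are omitted here).

```python
from typing import Dict, List, Set


def greedy_sample_order(doc_features: Dict[str, Set[str]]) -> List[str]:
    """Order documents so each pick covers the most not-yet-covered features.

    doc_features maps a (hypothetical) page id to the set of XPaths seen on it.
    """
    uncovered: Set[str] = set().union(*doc_features.values())
    remaining = dict(doc_features)
    order: List[str] = []
    while uncovered and remaining:
        # Iterating over sorted ids makes ties deterministic:
        # max() keeps the first maximal element, i.e. the smallest id.
        best = max(sorted(remaining), key=lambda d: len(remaining[d] & uncovered))
        gain = remaining[best] & uncovered
        if not gain:
            break  # no remaining page adds new coverage
        order.append(best)
        uncovered -= gain
        del remaining[best]
    return order


if __name__ == "__main__":
    pages = {
        "p1": {"/html/body/h1", "/html/body/div/span"},
        "p2": {"/html/body/h1", "/html/body/div/span", "/html/body/table/tr/td"},
        "p3": {"/html/body/ul/li"},
    }
    print(greedy_sample_order(pages))
```

Pages that add no new XPaths (here `p1`, once `p2` is chosen) are left out of the list, so an annotator would see the structurally most informative pages first.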
    • 6. Invention application
    • ADAPTIVE SAMPLING OF WEB PAGES FOR EXTRACTION
    • US20090204889A1
    • 2009-08-13
    • US12030301
    • 2008-02-13
    • Rupesh R. Mehta; V.G. Vinod Vydiswaran
    • Rupesh R. Mehta; V.G. Vinod Vydiswaran
    • G06F17/00
    • G06F16/00
    • Techniques are provided for improving the recall rate of an information extraction system by automatically selecting pages to surface to a user for annotation based on variation data. Techniques are provided for generating the variation data during the construction of the template that is to be used for extraction. During template construction, data is stored to indicate which template-construction pages saw or made changes to nodes in the template. After interesting nodes have been identified in the template, the data stored during template construction is used to determine which pages made changes to interesting-variation nodes. Techniques are also provided for generating the variation data during the extraction phase, when the template is being used to extract information from pages. During the extraction phase, variation data is generated in response to detecting that extraction for a given page resulted in one or more empty attributes.
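The extraction-phase idea in the abstract above (generate variation data when extraction for a page leaves one or more attributes empty) can be illustrated with a small sketch. It assumes extraction results are available as a mapping from page URL to attribute values; that representation, the function name `pages_with_empty_attributes`, and the example URLs are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Optional


def pages_with_empty_attributes(
    extractions: Dict[str, Dict[str, Optional[str]]]
) -> Dict[str, List[str]]:
    """Return, per page, the attribute names whose extracted value is empty.

    A value counts as empty when it is None or contains only whitespace.
    Pages with no empty attributes are omitted from the result.
    """
    flagged: Dict[str, List[str]] = {}
    for url, record in extractions.items():
        missing = sorted(
            name
            for name, value in record.items()
            if value is None or not value.strip()
        )
        if missing:
            flagged[url] = missing
    return flagged


if __name__ == "__main__":
    results = {
        "http://example.com/a": {"title": "Widget", "price": "9.99"},
        "http://example.com/b": {"title": "Gadget", "price": None},
        "http://example.com/c": {"title": "  ", "price": ""},
    }
    print(pages_with_empty_attributes(results))
```

Under these assumptions, the flagged pages are candidates to surface to a user for annotation, since the current template failed to extract something from them.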