Quick Patent Search: Fast Search of Global Patents, Free Commercial Patent Database - IPRDB

1. Patent application

US20090216708A1 STRUCTURAL CLUSTERING AND TEMPLATE IDENTIFICATION FOR ELECTRONIC DOCUMENTS (status: in force)
Publication number: US20090216708A1
Publication date: 2009-08-27
Application number: US12035948
Filing date: 2008-02-22
Applicants: Amit Madaan, V. G. Vydiswaran, Rupesh R. Mehta
Inventors: Amit Madaan, V. G. Vydiswaran, Rupesh R. Mehta
IPC classification: G06F17/30
CPC classification: G06F17/3089, G06F17/3071
Abstract: Subject matter disclosed herein may relate to clustering electronic documents, such as, for example, web pages, and may also relate to template identification for electronic documents.

2. Patent application

US20090248707A1 SITE-SPECIFIC INFORMATION-TYPE DETECTION METHODS AND SYSTEMS (status: under examination, published)
Publication number: US20090248707A1
Publication date: 2009-10-01
Application number: US12055222
Filing date: 2008-03-25
Applicants: Rupesh R. Mehta, Amit Madaan
Inventors: Rupesh R. Mehta, Amit Madaan
IPC classification: G06F17/30
CPC classification: G06F17/248, G06F16/951, G06F16/986, G06F17/212
Abstract: Methods and systems are provided herein that may allow for pertinent information-type(s) of data to be located or otherwise identified within one or more documents, such as, for example, web page documents associated with one or more websites. For example, exemplary methods and systems are provided that may be used to determine if information may be more likely to be of an “informative” type of information or possibly more likely to be of a “noise” type of information.

3. Granted patent

US08239387B2 Structural clustering and template identification for electronic documents (status: in force)
Publication number: US08239387B2
Publication date: 2012-08-07
Application number: US12035948
Filing date: 2008-02-22
Applicants: Amit Madaan, V. G. Vinod Vydiswaran, Rupesh R. Mehta
Inventors: Amit Madaan, V. G. Vinod Vydiswaran, Rupesh R. Mehta
IPC classification: G06F17/30
CPC classification: G06F17/3089, G06F17/3071
Abstract: Subject matter disclosed herein may relate to clustering electronic documents, such as, for example, web pages, and may also relate to template identification for electronic documents.

4. Granted patent

US08046681B2 Techniques for inducing high quality structural templates for electronic documents (status: in force)
Publication number: US08046681B2
Publication date: 2011-10-25
Application number: US11945749
Filing date: 2007-11-27
Applicants: V. G. Vinod Vydiswaran, Rupesh R. Mehta, Amit Madaan
Inventors: V. G. Vinod Vydiswaran, Rupesh R. Mehta, Amit Madaan
IPC classification: G06F17/00
CPC classification: G06F17/3071, G06F17/2211, G06F17/2264, G06F17/248, G06F17/30896, G06K9/6219, Y10S707/99935
Abstract: Techniques are disclosed herein to automatically learn a template that describes a common structure present in documents in a training set. The structure of the template is compared to the structure of the documents (or at least a part of each document) in the training set, one-by-one, and generalized in response to differences between the template and the document to which the template is currently being compared. If the structure of any particular document is considered too dissimilar from the structure of the template, then the template is not modified. Various generalization operators are added to the template to generalize the template. One such generalization operator is an “OR”, which indicates that only one of “n” sub-trees below the “OR” operator in the template is allowed at the corresponding position in a document.
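
The one-by-one generalization described in this abstract can be illustrated with a small sketch. It is not the patented method: the `Node` tree type, the tag-set Jaccard `similarity` measure, the `min_similarity` threshold, and the rule for where an "OR" node is introduced are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set

OR = "OR"  # hypothetical marker tag for the "OR" generalization operator


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def matches(template: Node, doc: Node) -> bool:
    """True if the document sub-tree is accepted by the template sub-tree."""
    if template.tag == OR:
        # Exactly one alternative is used at this position; any one may match.
        return any(matches(alt, doc) for alt in template.children)
    return (
        template.tag == doc.tag
        and len(template.children) == len(doc.children)
        and all(matches(t, d) for t, d in zip(template.children, doc.children))
    )


def generalize(template: Node, doc: Node) -> Node:
    """Return a template widened just enough to also accept `doc`."""
    if matches(template, doc):
        return template
    if template.tag == OR:
        return Node(OR, template.children + [doc])
    if template.tag == doc.tag and len(template.children) == len(doc.children):
        # Same shape at this level: push the difference down into the children.
        return Node(
            template.tag,
            [generalize(t, d) for t, d in zip(template.children, doc.children)],
        )
    return Node(OR, [template, doc])


def tags(node: Node) -> Set[str]:
    """All real tags in a tree (the OR marker itself is not counted)."""
    found = set() if node.tag == OR else {node.tag}
    for child in node.children:
        found |= tags(child)
    return found


def similarity(template: Node, doc: Node) -> float:
    """Assumed similarity measure: Jaccard overlap of the two tag sets."""
    a, b = tags(template), tags(doc)
    return len(a & b) / len(a | b)


def induce(docs: List[Node], min_similarity: float = 0.5) -> Node:
    """Fold documents into a template, skipping those that are too dissimilar."""
    template = docs[0]
    for doc in docs[1:]:
        if similarity(template, doc) < min_similarity:
            continue  # too dissimilar: the template is left unmodified
        template = generalize(template, doc)
    return template
```

For example, inducing a template from `div[h1, p]` and `div[h1, ul]` yields `div[h1, OR[p, ul]]`, while a `table[tr]` document shares no tags with the template and is skipped.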

5. Granted patent

US08832102B2 Methods and apparatuses for clustering electronic documents based on structural features and static content features (status: in force)
Publication number: US08832102B2
Publication date: 2014-09-09
Application number: US12685945
Filing date: 2010-01-12
Applicants: Rupesh R. Mehta, Srinivasan H. Sengamedu, Rajeev R. Rastogi
Inventors: Rupesh R. Mehta, Srinivasan H. Sengamedu, Rajeev R. Rastogi
IPC classification: G06F17/30, G06F7/00
CPC classification: G06F17/30864, G06F17/30705
Abstract: Exemplary methods and apparatuses are provided which may be implemented using one or more computing devices to allow for super clustering of clusters of electronic documents based, at least in part, on structural and static content features.

6. Patent application

US20110173197A1 METHODS AND APPARATUSES FOR CLUSTERING ELECTRONIC DOCUMENTS BASED ON STRUCTURAL FEATURES AND STATIC CONTENT FEATURES (status: in force)
Publication number: US20110173197A1
Publication date: 2011-07-14
Application number: US12685945
Filing date: 2010-01-12
Applicants: Rupesh R. Mehta, Srinivasan H. Sengamedu, Rajeev R. Rastogi
Inventors: Rupesh R. Mehta, Srinivasan H. Sengamedu, Rajeev R. Rastogi
IPC classification: G06F17/30
CPC classification: G06F17/30864, G06F17/30705
Abstract: Exemplary methods and apparatuses are provided which may be implemented using one or more computing devices to allow for super clustering of clusters of electronic documents based, at least in part, on structural and static content features.

7. Patent application

US20090204889A1 ADAPTIVE SAMPLING OF WEB PAGES FOR EXTRACTION (status: under examination, published)
Publication number: US20090204889A1
Publication date: 2009-08-13
Application number: US12030301
Filing date: 2008-02-13
Applicants: Rupesh R. Mehta, V. G. Vinod Vydiswaran
Inventors: Rupesh R. Mehta, V. G. Vinod Vydiswaran
IPC classification: G06F17/00
CPC classification: G06F16/00
Abstract: Techniques are provided for improving the recall rate of an information extraction system by automatically selecting pages to surface to a user for annotation based on variation data. Techniques are provided for generating the variation data during the construction of the template that is to be used for extraction. During template construction, data is stored to indicate which template-construction pages saw or made changes to nodes in the template. After interesting nodes have been identified in the template, the data stored during template construction is used to determine which pages made changes to interesting-variation nodes. Techniques are also provided for generating the variation data during the extraction phase, when the template is being used to extract information from pages. During the extraction phase, variation data is generated in response to detecting that extraction for a given page resulted in one or more empty attributes.
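
The two sources of variation data named in this abstract can be sketched as simple lookups. This is only an illustration: the `change_log` mapping from template node ids to page ids, the `extracted` mapping from page ids to attribute values, and both function names are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Optional, Set


def pages_changing_nodes(
    change_log: Dict[str, Set[str]], interesting: Set[str]
) -> Set[str]:
    """Pages recorded during template construction as having changed
    any of the nodes later marked as interesting."""
    return set().union(*(change_log.get(node, set()) for node in interesting))


def pages_with_empty_attributes(
    extracted: Dict[str, Dict[str, Optional[str]]]
) -> List[str]:
    """Pages whose extraction left one or more attributes empty
    (missing, empty, or whitespace-only values)."""
    return sorted(
        page
        for page, attrs in extracted.items()
        if any(not (value or "").strip() for value in attrs.values())
    )
```

Pages returned by either function would be candidates to surface to a user for annotation; how the two lists are combined or ranked is left open here.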

8. Patent application

US20090265611A1 WEB PAGE LAYOUT OPTIMIZATION USING SECTION IMPORTANCE (status: under examination, published)
Publication number: US20090265611A1
Publication date: 2009-10-22
Application number: US12116825
Filing date: 2008-05-07
Applicants: Srinivasan H. Sengamedu, Rupesh R. Mehta
Inventors: Srinivasan H. Sengamedu, Rupesh R. Mehta
IPC classification: G06F17/00, G06F17/20, G06F17/21
CPC classification: G06F17/211, G06F16/9577
Abstract: Methods and apparatus are described which enable the efficient adaptation of web pages to mobile displays. The more important or relevant sections of a web page are identified and configured into a more compact form. Both layout preserving and high compaction techniques are described.

9. Patent application

US20100228738A1 ADAPTIVE DOCUMENT SAMPLING FOR INFORMATION EXTRACTION (status: under examination, published)
Publication number: US20100228738A1
Publication date: 2010-09-09
Application number: US12398162
Filing date: 2009-03-04
Applicants: Rupesh R. Mehta, Srinivasan H. Sengamedu
Inventors: Rupesh R. Mehta, Srinivasan H. Sengamedu
IPC classification: G06F17/30, G06F17/21
CPC classification: G06F17/241, G06F16/951, G06F16/9558
Abstract: A method and apparatus for improved sampling documents for training sets input to information extraction systems is provided, which improves the recall and robustness of wrapper extraction. A passive sampling technique provides a list of documents to present for human annotation ordered by representativeness of the document based on structural and content statistics. Thus, the document with the most interesting attributes and which is most representative of the cluster of structurally similar documents to which the document pertains is presented for annotation first. The problem is mapped to classical ‘Set-Cover’ problem and solved using greedy approach. An active sampling technique refines and reorders the sample list produced by the passive sampling technique after initial annotations, based on the human annotation, spatial boundaries of the documents, and structural and content statistics. The proposed techniques work at a site level and perform page-level structural analysis using XPath-term frequency, XPath-document frequency, and XPath-importance.
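
The greedy Set-Cover step mentioned in this abstract can be sketched as follows. The sketch assumes each document is represented by the set of XPaths it contains and that "coverage" means covering every distinct XPath in the cluster; the function name, the tie-breaking rule, and the unweighted coverage are illustrative assumptions, and the patent's XPath-importance weighting is not modeled.

```python
from typing import Dict, List, Set


def greedy_sample_order(doc_xpaths: Dict[str, Set[str]]) -> List[str]:
    """Order documents for annotation with the classic greedy set-cover rule:
    repeatedly pick the document covering the most still-uncovered XPaths."""
    uncovered: Set[str] = set().union(*doc_xpaths.values())
    remaining = dict(doc_xpaths)
    order: List[str] = []
    while uncovered and remaining:
        # Iterating in sorted order makes ties resolve to the smallest doc id.
        best = max(sorted(remaining), key=lambda d: len(remaining[d] & uncovered))
        gain = remaining[best] & uncovered
        if not gain:
            break  # no remaining document adds new coverage
        order.append(best)
        uncovered -= gain
        del remaining[best]
    return order
```

Documents that add no new XPaths are left out of the returned list, so the result is a small annotation sample rather than a full ranking.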
