Quick Patent Search - Search global patents quickly, free commercial-use patent database - IPRDB

1. Invention application

US20090248707A1 SITE-SPECIFIC INFORMATION-TYPE DETECTION METHODS AND SYSTEMS (Status: under examination, published)
Publication (announcement) number: US20090248707A1
Publication (announcement) date: 2009-10-01
Application number: US12055222
Filing date: 2008-03-25
Applicant: Rupesh R. Mehta, Amit Madaan
Inventor: Rupesh R. Mehta, Amit Madaan
IPC classification: G06F17/30
CPC classification: G06F17/248, G06F16/951, G06F16/986, G06F17/212
Abstract: Methods and systems are provided herein that may allow for pertinent information-type(s) of data to be located or otherwise identified within one or more documents, such as, for example, web page documents associated with one or more websites. For example, exemplary methods and systems are provided that may be used to determine if information may be more likely to be of an “informative” type of information or possibly more likely to be of a “noise” type of information.

2. Invention application

US20120084636A1 METHOD AND SYSTEM FOR WEB INFORMATION EXTRACTION (Status: in force)
Publication (announcement) number: US20120084636A1
Publication (announcement) date: 2012-04-05
Application number: US12896942
Filing date: 2010-10-04
Applicant: Srinivasan Hanumantha Rao SENGAMEDU, Charu Tiwari, Amit Madaan, Rupesh Rasiklal Mehta, S. R. Jeyashankher, Rajeev Rastogi
Inventor: Srinivasan Hanumantha Rao SENGAMEDU, Charu Tiwari, Amit Madaan, Rupesh Rasiklal Mehta, S. R. Jeyashankher, Rajeev Rastogi
IPC classification: G06F17/00
CPC classification: G06F17/2282, G06F17/2247, G06F17/30864, G06F17/30911
Abstract: An example of a method includes determining features of a first type for a web page of a plurality of web pages. The method also includes electronically determining a plurality of rules for an attribute of the first web page, wherein the plurality of rules are determined based on features of the first type. The method also includes electronically identifying a first rule, from the plurality of rules, which satisfies a first predefined criterion. The first predefined criteria include at least one of a first threshold for a precision parameter, a second threshold for a support parameter, a third threshold for a distance parameter and a fourth threshold for a recall parameter. The method further includes storing the first rule to enable extraction of value of the attribute from a second web page.
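
The rule-selection step described in this abstract can be illustrated with a minimal Python sketch. It is not the patented method: the `Rule` fields, the threshold values, the choice to apply all four thresholds together (the abstract only requires "at least one of" them), the treatment of distance as an upper bound, and the tie-breaking order are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical candidate extraction rule with its evaluation scores."""
    expression: str   # e.g. an XPath-like expression (illustrative only)
    precision: float  # fraction of extracted values that are correct
    support: int      # number of training pages on which the rule fires
    distance: float   # assumed: smaller means closer to the annotated node
    recall: float     # fraction of annotated values that are extracted


def satisfies(rule: Rule,
              min_precision: float = 0.9,
              min_support: int = 3,
              max_distance: float = 5.0,
              min_recall: float = 0.5) -> bool:
    """Check a rule against all four thresholds (assumed values)."""
    return (rule.precision >= min_precision
            and rule.support >= min_support
            and rule.distance <= max_distance
            and rule.recall >= min_recall)


def select_rule(rules: Iterable[Rule]) -> Optional[Rule]:
    """Return the best rule that passes the thresholds, or None."""
    candidates = [r for r in rules if satisfies(r)]
    if not candidates:
        return None
    # Assumed tie-breaking: prefer precision, then recall, then support.
    return max(candidates, key=lambda r: (r.precision, r.recall, r.support))
```

The selected rule would then be stored and applied to later pages; how rules are generated and scored is outside the scope of this sketch.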

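3. Invention application

US20100185684A1 HIGH PRECISION MULTI ENTITY EXTRACTION (Status: under examination, published)
Publication (announcement) number: US20100185684A1
Publication (announcement) date: 2010-07-22
Application number: US12351676
Filing date: 2009-01-09
Applicant: Amit Madaan, Charu Tiwari
Inventor: Amit Madaan, Charu Tiwari
IPC classification: G06F17/30
CPC classification: G06F16/986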
Abstract: Techniques for high precision multi entity extraction are provided. A wrapper that represents a generalized structure of a set of training web pages is accessed. The wrapper includes one or more annotations that indicate a set of attributes that are included in each of a plurality of records. Record boundaries are determined based on nodes included in the wrapper, where the record boundaries delimit the plurality of records within any training page of the set of training web pages. The wrapper is modified to include one or more boundary nodes, where the one or more boundary nodes indicate the record boundaries of the plurality of records within the set of training web pages. Multiple records are extracted from a web page, where extracting the multiple records comprises detecting record completions based at least on the wrapper and on a document object model (DOM) representation of the web page.

4. Granted invention patent

US09280528B2 Method and system for processing and learning rules for extracting information from incoming web pages (Status: in force)
Publication (announcement) number: US09280528B2
Publication (announcement) date: 2016-03-08
Application number: US12896942
Filing date: 2010-10-04
Applicant: Srinivasan Hanumantha Rao Sengamedu, Charu Tiwari, Amit Madaan, Rupesh Rasiklal Mehta, S R Jeyashankher, Rajeev Rastogi
Inventor: Srinivasan Hanumantha Rao Sengamedu, Charu Tiwari, Amit Madaan, Rupesh Rasiklal Mehta, S R Jeyashankher, Rajeev Rastogi
IPC classification: G06F17/00, G06F17/22, G06F17/30
CPC classification: G06F17/2282, G06F17/2247, G06F17/30864, G06F17/30911
Abstract: An example of a method includes determining features of a first type for a web page of a plurality of web pages. The method also includes electronically determining a plurality of rules for an attribute of the first web page, wherein the plurality of rules are determined based on features of the first type. The method also includes electronically identifying a first rule, from the plurality of rules, which satisfies a first predefined criterion. The first predefined criteria include at least one of a first threshold for a precision parameter, a second threshold for a support parameter, a third threshold for a distance parameter and a fourth threshold for a recall parameter. The method further includes storing the first rule to enable extraction of value of the attribute from a second web page.

5. Granted invention patent

US08239387B2 Structural clustering and template identification for electronic documents (Status: in force)
Publication (announcement) number: US08239387B2
Publication (announcement) date: 2012-08-07
Application number: US12035948
Filing date: 2008-02-22
Applicant: Amit Madaan, V. G. Vinod Vydiswaran, Rupesh R. Mehta
Inventor: Amit Madaan, V. G. Vinod Vydiswaran, Rupesh R. Mehta
IPC classification: G06F17/30
CPC classification: G06F17/3089, G06F17/3071
Abstract: Subject matter disclosed herein may relate to clustering electronic documents, such as, for example, web pages, and may also relate to template identification for electronic documents.

6. Invention application

US20110040770A1 ROBUST XPATHS FOR WEB INFORMATION EXTRACTION (Status: under examination, published)
Publication (announcement) number: US20110040770A1
Publication (announcement) date: 2011-02-17
Application number: US12540384
Filing date: 2009-08-13
Applicant: Amit MADAAN, Charu TIWARI, Rupesh R. MEHTA
Inventor: Amit MADAAN, Charu TIWARI, Rupesh R. MEHTA
IPC classification: G06F17/30
CPC classification: G06F16/95
Abstract: An example of a method includes generating an attributed extensible markup language path (XPath) for an annotated entity in a web page. The method further includes determining a first node that satisfy the attributed XPath in the web page and is annotated. The method also includes identifying an attribute property that satisfies predefined criteria in the web page while traversing from the first node to a root node, the attribute property comprising an attribute value and an attribute name. Moreover, the method includes populating the attributed XPath with the attribute property that satisfies predefined criteria. The method also includes filtering the attributed XPath to generate a robust XPath, and extracting content from multiple web pages based on the robust XPath.
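
To make the idea of an attributed XPath concrete, here is a minimal Python sketch using only the standard library's `xml.etree.ElementTree`. It walks from an annotated node up to the root and adds attribute predicates along the way. The function name `attributed_xpath`, the `STABLE_ATTRS` allow-list (standing in for the abstract's unspecified "predefined criteria"), and the sample markup are assumptions; the filtering step that produces the final robust XPath is not modeled.

```python
import xml.etree.ElementTree as ET

# Assumption: only these attribute names are considered stable enough
# to include in the path. The abstract does not say which criteria apply.
STABLE_ATTRS = ("id", "class")


def attributed_xpath(root: ET.Element, target: ET.Element) -> str:
    """Build a path from root to target, adding [@name='value'] predicates."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        step = node.tag
        for name in STABLE_ATTRS:
            value = node.get(name)
            if value is not None:
                # Limitation: values containing a single quote are not escaped.
                step += f"[@{name}='{value}']"
        steps.append(step)
        node = parent_of[node]
    return "./" + "/".join(reversed(steps)) if steps else "."


# Annotated training page: the price node is the annotated entity.
page = ET.fromstring(
    "<html><body><div id='main'>"
    "<span class='price'>42</span><span class='name'>Widget</span>"
    "</div></body></html>"
)
xpath = attributed_xpath(page, page.find(".//span[@class='price']"))
```

Because the path relies on attribute predicates rather than child positions, it can still locate the price on a page where sibling order differs or extra elements appear, as long as the tag chain from the root stays the same.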

7. Granted invention patent

US08046681B2 Techniques for inducing high quality structural templates for electronic documents (Status: in force)
Publication (announcement) number: US08046681B2
Publication (announcement) date: 2011-10-25
Application number: US11945749
Filing date: 2007-11-27
Applicant: V. G. Vinod Vydiswaran, Rupesh R. Mehta, Amit Madaan
Inventor: V. G. Vinod Vydiswaran, Rupesh R. Mehta, Amit Madaan
IPC classification: G06F17/00
CPC classification: G06F17/3071, G06F17/2211, G06F17/2264, G06F17/248, G06F17/30896, G06K9/6219, Y10S707/99935
Abstract: Techniques are disclosed herein to automatically learn a template that describes a common structure present in documents in a training set. The structure of the template is compared to the structure of the documents (or at least a part of each document) in the training set, one-by-one, and generalized in response to differences between the template and the document to which the template is currently being compared. If the structure of any particular document is considered too dissimilar from the structure of the template, then the template is not modified. Various generalization operators are added to the template to generalize the template. One such generalization operator is an “OR”, which indicates that only one of “n” sub-trees below the “OR” operator in the template is allowed at the corresponding position in a document.
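
The "OR" generalization operator from this abstract can be sketched in Python as follows. This is a toy model under stated assumptions, not the patented technique: the `Node` and `Or` classes, the rule that children must align one-to-one by position, and the `generalize` strategy are all hypothetical. The abstract's dissimilarity check (leaving the template unmodified for documents that differ too much) is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """A document or template element identified only by its tag."""
    tag: str
    children: List[Union["Node", "Or"]] = field(default_factory=list)


@dataclass
class Or:
    """Template operator: one of the alternative sub-trees is allowed here."""
    alternatives: List[Node]


def matches(template: Union[Node, Or], doc: Node) -> bool:
    """Return True if the document sub-tree conforms to the template."""
    if isinstance(template, Or):
        return any(matches(alt, doc) for alt in template.alternatives)
    if template.tag != doc.tag or len(template.children) != len(doc.children):
        return False
    return all(matches(t, d) for t, d in zip(template.children, doc.children))


def generalize(template: Union[Node, Or], doc: Node) -> Union[Node, Or]:
    """Widen the template just enough to accept doc (assumed strategy)."""
    if matches(template, doc):
        return template
    if isinstance(template, Or):
        return Or(template.alternatives + [doc])
    if template.tag == doc.tag and len(template.children) == len(doc.children):
        # Same shape at this level: push the difference down to the children.
        return Node(template.tag,
                    [generalize(t, d)
                     for t, d in zip(template.children, doc.children)])
    # Different shape: accept either the old sub-tree or the new one.
    return Or([template, doc])
```

For example, generalizing a template `div(h1, p)` against a document `div(h1, ul)` yields `div(h1, OR(p, ul))`, which accepts both training documents but still rejects `div(h1, table)`.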

8. Invention application

US20090216708A1 STRUCTURAL CLUSTERING AND TEMPLATE IDENTIFICATION FOR ELECTRONIC DOCUMENTS (Status: in force)
Publication (announcement) number: US20090216708A1
Publication (announcement) date: 2009-08-27
Application number: US12035948
Filing date: 2008-02-22
Applicant: Amit Madaan, V. G. Vydiswaran, Rupesh R. Mehta
Inventor: Amit Madaan, V. G. Vydiswaran, Rupesh R. Mehta
IPC classification: G06F17/30
CPC classification: G06F17/3089, G06F17/3071
Abstract: Subject matter disclosed herein may relate to clustering electronic documents, such as, for example, web pages, and may also relate to template identification for electronic documents.

9. Invention application

US20080072140A1 TECHNIQUES FOR INDUCING HIGH QUALITY STRUCTURAL TEMPLATES FOR ELECTRONIC DOCUMENTS (Status: in force)
Publication (announcement) number: US20080072140A1
Publication (announcement) date: 2008-03-20
Application number: US11945749
Filing date: 2007-11-27
Applicant: V.G. Vydiswaran, Rupesh Mehta, Amit Madaan
Inventor: V.G. Vydiswaran, Rupesh Mehta, Amit Madaan
IPC classification: G06F15/00
CPC classification: G06F17/3071, G06F17/2211, G06F17/2264, G06F17/248, G06F17/30896, G06K9/6219, Y10S707/99935
Abstract: Techniques are disclosed herein to automatically learn a template that describes a common structure present in documents in a training set. The structure of the template is compared to the structure of the documents (or at least a part of each document) in the training set, one-by-one, and generalized in response to differences between the template and the document to which the template is currently being compared. If the structure of any particular document is considered too dissimilar from the structure of the template, then the template is not modified. Various generalization operators are added to the template to generalize the template. One such generalization operator is an “OR”, which indicates that only one of “n” sub-trees below the “OR” operator in the template is allowed at the corresponding position in a document.
