Patent Quick Search - Quickly search global patents, free commercial patent database - IPRDB

1. Granted invention patent

US07991709B2 Method and apparatus for structuring documents utilizing recognition of an ordered sequence of identifiers (status: in force)
Publication (announcement) number: US07991709B2
Publication (announcement) date: 2011-08-02
Application number: US12020743
Filing date: 2008-01-28
Applicants: Herve Dejean, Jean-Luc Meunier
Inventors: Herve Dejean, Jean-Luc Meunier
IPC classification: G06F15/18, G06F17/21, G06F17/22, G06F17/27
CPC classification: G06F17/211
Abstract: A method is provided for operating a computing device to create a document structure model of a computer parsable text document utilizing recognition of at least one ordered sequence of identifiers in the document. The method includes converting a computer parsable text document of any format to an alternative structured language format to form a converted document. The text of the converted document is fragmented into an ordered sequence of text fragments within a text format. The text fragments are enumerated to obtain a sequence of terms. At least one optimal sub-sequence of terms is identified from among the sequence of terms, with an optimal sub-sequence being one or more longest increasing sub-sequence(s). The computer parsable text document is annotated with tags, with the tags including information derived from identification of the optimal sub-sequence(s). The annotated document is displayed on the graphical user interface.
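
The abstract above names the longest increasing sub-sequence as the "optimal sub-sequence" of enumerated terms. The following is a minimal sketch of that general technique only, not the patented method: it assumes the terms have already been reduced to comparable integers, and the function name `longest_increasing_subsequence` is hypothetical.

```python
from bisect import bisect_left


def longest_increasing_subsequence(terms):
    """Return one longest strictly increasing subsequence of `terms`.

    Assumption: `terms` are plain integers already extracted from text fragments.
    """
    tail_index = []   # tail_index[k]: index of the smallest tail of a run of length k + 1
    tail_value = []   # the values at those indices, kept sorted for bisect
    previous = [None] * len(terms)  # back-pointers used to rebuild the subsequence

    for i, value in enumerate(terms):
        k = bisect_left(tail_value, value)
        if k == len(tail_value):
            tail_index.append(i)
            tail_value.append(value)
        else:
            tail_index[k] = i
            tail_value[k] = value
        previous[i] = tail_index[k - 1] if k > 0 else None

    # Walk the back-pointers from the tail of the longest run.
    result = []
    i = tail_index[-1] if tail_index else None
    while i is not None:
        result.append(terms[i])
        i = previous[i]
    return result[::-1]


# Numbers found in reading order; 7 and the repeated 1 and 2 are noise.
print(longest_increasing_subsequence([1, 2, 7, 3, 1, 4, 2, 5]))
```

In this toy input the stray values are skipped and the run 1, 2, 3, 4, 5 is recovered, which is the kind of ordered identifier sequence (for example, chapter or list numbers) the abstract refers to.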

2. Invention patent application

US20060156226A1 Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents (status: in force)
Publication (announcement) number: US20060156226A1
Publication (announcement) date: 2006-07-13
Application number: US11032817
Filing date: 2005-01-10
Applicants: Herve Dejean, Jean-Luc Meunier
Inventors: Herve Dejean, Jean-Luc Meunier
IPC classification: G06F17/00, G06F17/21
CPC classification: G06F17/217, G06F17/2745
Abstract: A method for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks as derived from the document. The textual variability of lines comprised of text blocks, including the different kinds of text blocks within the line, is analyzed for assessment of textual variability. Header/footer zones are defined by textual content having a low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text-boxes for similarity and proximity and clustering the text boxes satisfying a predetermined similarity value, wherein the clustered text boxes are deemed to comprise pagination constructs.
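
The idea of "low textual variability" can be illustrated with a small sketch. This is not the patented procedure: the variability measure (distinct normalized texts divided by page count), the digit normalization, the `max_depth` and `threshold` parameters, and both function names are assumptions made for illustration.

```python
import re


def variability(texts):
    """Distinct normalized texts divided by the number of texts (1.0 = all differ)."""
    # Assumption: digits are collapsed so "Page 1" and "Page 2" count as the same text.
    normalized = [re.sub(r"\d+", "#", t.strip().lower()) for t in texts]
    return len(set(normalized)) / len(normalized)


def header_footer_positions(pages, max_depth=2, threshold=0.5):
    """Return line offsets (0, 1, ... from the top; -1, -2, ... from the bottom)
    whose text varies little across pages.

    `pages` is a list of pages, each a list of text lines in reading order.
    """
    found = []
    for offset in list(range(max_depth)) + list(range(-max_depth, 0)):
        # Ignore pages too short for top and bottom offsets to be distinct lines.
        texts = [page[offset] for page in pages if len(page) >= 2 * max_depth]
        if len(texts) >= 2 and variability(texts) <= threshold:
            found.append(offset)
    return found


pages = [
    ["Annual Report", "Intro text a", "more a", "Page 1"],
    ["Annual Report", "Methods b", "more b", "Page 2"],
    ["Annual Report", "Results c", "more c", "Page 3"],
]
print(header_footer_positions(pages))
```

Here the first line (offset 0) and the last line (offset -1) repeat across pages once digits are normalized, so they are reported as header and footer candidates, while the body lines are not.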

3. Invention patent application

US20080114757A1 Versatile page number detector (status: in force)
Publication (announcement) number: US20080114757A1
Publication (announcement) date: 2008-05-15
Application number: US11599947
Filing date: 2006-11-15
Applicants: Herve Dejean, Jean-Luc Meunier
Inventors: Herve Dejean, Jean-Luc Meunier
IPC classification: G06F7/20, G06F17/30
CPC classification: G06F17/30569, G06K9/00469
Abstract: A method for detection of page numbers in a document includes identifying a plurality of text fragments associated with a plurality of pages of a document. From the identified text fragments, at least one sequence is identified. Each identified sequence includes a plurality of terms. Each term of the sequence is derived from a text fragment selected from the plurality of text fragments. The terms of an identified sequence comply with at least one predefined numbering scheme which defines a form and an incremental state of the terms in a sequence. A subset of the identified sequences which cover at least some of the pages of the document is computed. Terms of at least some of the subset of the identified sequences are construed as page numbers of pages of the document. Additional page numbers may be identified by considering one or more features of the terms in the subset of identified sequences.
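
A much-simplified sketch of sequence-based page number detection follows. It assumes a single numbering scheme (Arabic numerals that increase by one per page) and keeps only the one sequence that covers the most pages, whereas the abstract describes multiple schemes and a subset of sequences. The function name `detect_page_numbers` is hypothetical.

```python
import re
from collections import defaultdict


def detect_page_numbers(pages):
    """Map page index -> text fragment judged to be that page's number.

    `pages` is a list of pages, each a list of text fragments.
    Assumption: page numbers are Arabic numerals incrementing by one per page.
    """
    # Fragments whose value minus the page index is constant form one sequence.
    sequences = defaultdict(dict)
    for index, fragments in enumerate(pages):
        for fragment in fragments:
            text = fragment.strip()
            if re.fullmatch(r"\d+", text):
                offset = int(text) - index
                sequences[offset].setdefault(index, text)
    if not sequences:
        return {}
    # Keep the sequence that covers the largest number of pages.
    return max(sequences.values(), key=len)


pages = [
    ["Title", "2005"],
    ["Chapter 1", "12", "3"],
    ["13", "text"],
    ["more", "14", "7"],
]
print(detect_page_numbers(pages))
```

In this example the year "2005" and the stray digits "3" and "7" each form sequences of length one, so the run 12, 13, 14 on pages 1 to 3 wins; page 0 has no detected number, which matches the abstract's note that sequences need cover only some of the pages.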

4. Granted invention patent

US09218326B2 Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents (status: in force)
Publication (announcement) number: US09218326B2
Publication (announcement) date: 2015-12-22
Application number: US13032996
Filing date: 2011-02-23
Applicants: Herve Dejean, Jean-Luc Meunier
Inventors: Herve Dejean, Jean-Luc Meunier
IPC classification: G06F17/00, G06F17/21, G06F17/27
CPC classification: G06F17/217, G06F17/2745
Abstract: A method for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks as derived from the document. The textual variability of lines comprised of text blocks, including the different kinds of text blocks within the line, is analyzed for assessment of textual variability. Header/footer zones are defined by textual content having a low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text-boxes for similarity and proximity and clustering the text boxes satisfying a predetermined similarity value, wherein the clustered text boxes are deemed to comprise pagination constructs.

5. Invention patent application

US20060155703A1 Method and apparatus for detecting a table of contents and reference determination (status: in force)
Publication (announcement) number: US20060155703A1
Publication (announcement) date: 2006-07-13
Application number: US11032814
Filing date: 2005-01-10
Applicants: Herve Dejean, Jean-Luc Meunier, Olivier Fambon
Inventors: Herve Dejean, Jean-Luc Meunier, Olivier Fambon
IPC classification: G06F17/30
CPC classification: G06F17/2745, G06F17/2247
Abstract: In a method for identifying a table of contents in a document, an ordered sequence of text fragments is derived from the document. A table of contents is selected as a contiguous sub-sequence of the ordered sequence of text fragments satisfying the criteria: (i) entries defined by text fragments of the table of contents each have a link to a target text fragment having textual similarity with the entry; (ii) no target text fragment lies within the table of contents; and (iii) the target text fragments have an ascending ordering corresponding to an ascending ordering of the entries defining the target text fragments.
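
The three criteria in the abstract above can be checked for one candidate span with a short sketch. This is an illustration under stated assumptions, not the patented method: similarity is measured with `difflib.SequenceMatcher` and a hypothetical threshold of 0.8, links are chosen greedily (earliest match after the previous target), and the names `similar` and `link_toc` are invented here.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Textual similarity test (assumption: character-level ratio with a fixed threshold)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def link_toc(fragments, start, end, threshold=0.8):
    """Check whether fragments[start:end] can be a table of contents.

    Returns the list of target indices, one per entry, or None if a criterion fails.
    """
    targets = []
    previous = -1
    for entry in range(start, end):
        target = None
        for j in range(previous + 1, len(fragments)):  # (iii) targets must ascend
            if start <= j < end:
                continue  # (ii) no target may lie inside the table of contents
            if similar(fragments[entry], fragments[j], threshold):
                target = j
                break
        if target is None:
            return None  # (i) every entry needs a textually similar target
        targets.append(target)
        previous = target
    return targets


fragments = [
    "Contents", "Introduction", "Methods", "Results",
    "Introduction", "We begin ...", "Methods", "We measured ...",
    "Results", "We found ...",
]
print(link_toc(fragments, 1, 4))  # candidate span covering the three entries
print(link_toc(fragments, 0, 3))  # span that wrongly includes the "Contents" heading
```

A full detector would try many candidate spans and keep the best one; the greedy linking here is the simplest choice that still enforces all three criteria.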

6. Granted invention patent

US09224041B2 Table of contents extraction based on textual similarity and formal aspects (status: in force)
Publication (announcement) number: US09224041B2
Publication (announcement) date: 2015-12-29
Application number: US11923904
Filing date: 2007-10-25
Applicants: Herve Dejean, Jean-Luc Meunier
Inventors: Herve Dejean, Jean-Luc Meunier
IPC classification: G06F17/27, G06K9/00
CPC classification: G06K9/00469, G06F17/27
Abstract: An initial organizational table for a document is determined based on textual similarity between entries of the organizational table and target text fragments and not taking into account text formatting. A classifier is trained to identify text fragment pairs consisting of entries of the organizational table and corresponding target text fragments based at least in part on text formatting features. The training employs a training set of examples annotated based on the initial organizational table. The initial organizational table is updated using the trained classifier.

7. Granted invention patent

US08706475B2 Method and apparatus for detecting a table of contents and reference determination (status: in force)
Publication (announcement) number: US08706475B2
Publication (announcement) date: 2014-04-22
Application number: US11032814
Filing date: 2005-01-10
Applicants: Herve Dejean, Jean-Luc Meunier, Olivier Fambon
Inventors: Herve Dejean, Jean-Luc Meunier, Olivier Fambon
IPC classification: G06F17/20, G06F17/27, G06F3/00, G06F7/00
CPC classification: G06F17/2745, G06F17/2247
Abstract: In a method for identifying a table of contents in a document, an ordered sequence of text fragments is derived from the document. A table of contents is selected as a contiguous sub-sequence of the ordered sequence of text fragments satisfying the criteria: (i) entries defined by text fragments of the table of contents each have a link to a target text fragment having textual similarity with the entry; (ii) no target text fragment lies within the table of contents; and (iii) the target text fragments have an ascending ordering corresponding to an ascending ordering of the entries defining the target text fragments.

8. Granted invention patent

US08340425B2 Optical character recognition with two-pass zoning (status: in force)
Publication (announcement) number: US08340425B2
Publication (announcement) date: 2012-12-25
Application number: US12853461
Filing date: 2010-08-10
Applicants: Herve Dejean, Jean-Luc Meunier
Inventors: Herve Dejean, Jean-Luc Meunier
IPC classification: G06K9/34
CPC classification: G06K9/03, G06K9/2054, G06K2209/01
Abstract: An image of a paginated document is zoned to identify text zones. First-pass character recognition is performed on the text zones to generate textual content corresponding to the paginated document. The image of the paginated document is re-zoned based on the textual content to identify one or more new text zones. Second-pass character recognition is performed on at least the new text zones to generate updated textual content corresponding to the paginated document.

9. Granted invention patent

US08302002B2 Structuring document based on table of contents (status: in force)
Publication (announcement) number: US08302002B2
Publication (announcement) date: 2012-10-30
Application number: US11116100
Filing date: 2005-04-27
Applicants: Herve Dejean, Jean-Luc Meunier
Inventors: Herve Dejean, Jean-Luc Meunier
IPC classification: G06F17/00
CPC classification: G06F17/30616, G06F17/2241
Abstract: A document is organized as a plurality of nodes associated with a table of contents. The nodes are clustered into a plurality of clusters based on a similarity criterion. One of the clusters is identified as corresponding to a highest or lowest level of the table of contents based on a selection criterion. The highest or lowest level is assigned to the nodes belonging to the identified cluster. The identifying and assigning are repeated to assign levels to the nodes belonging to each next highest or lowest level of the table of contents. The repeated identifying is based on the selection criteria applied disregarding nodes that have already been assigned a level. The document is structured based at least in part on the levels assigned to the table of contents nodes.

10. Invention patent application

US20070196015A1 Table of contents extraction with improved robustness (status: lapsed)
Publication (announcement) number: US20070196015A1
Publication (announcement) date: 2007-08-23
Application number: US11360963
Filing date: 2006-02-23
Applicants: Jean-Luc Meunier, Herve Dejean
Inventors: Jean-Luc Meunier, Herve Dejean
IPC classification: G06K9/46, G06F7/00, G06K9/34, G06F17/00
CPC classification: G06F17/2745
Abstract: In a method for identifying a table of contents in a document (10), text fragments are extracted (12) from the document. There are identified (20, 30, 34, 38): (i) a substantially contiguous group of text fragments as table of content entries and (ii) a different group of text fragments as linked text fragments linked with corresponding table of content entries. During the identifying, a number of text fragments that are candidates for identification as linked text fragments is reduced based on at least one reduction criterion (130). The identified table of contents entries and linked text fragments (110) are validated based on at least one validation criterion (162) related to distribution of the linked text fragments.
