Quick Patent Search: fast search of global patents, free commercial patent database - IPRDB

11. Invention application

US20060248070A1 Structuring document based on table of contents [In force]
Publication No.: US20060248070A1
Publication date: 2006-11-02
Application No.: US11116100
Filing date: 2005-04-27
Applicant: Herve Dejean, Jean-Luc Meunier
Inventor: Herve Dejean, Jean-Luc Meunier
IPC classification: G06F17/00
CPC classification: G06F17/30616, G06F17/2241
Abstract: A document is organized as a plurality of nodes associated with a table of contents. The nodes are clustered into a plurality of clusters based on a similarity criterion. One of the clusters is identified as corresponding to a highest or lowest level of the table of contents based on a selection criterion. The highest or lowest level is assigned to the nodes belonging to the identified cluster. The identifying and assigning are repeated to assign levels to the nodes belonging to each next highest or lowest level of the table of contents. The repeated identifying is based on the selection criteria applied disregarding nodes that have already been assigned a level. The document is structured based at least in part on the levels assigned to the table of contents nodes.

12. Invention application

US20090192956A1 METHOD AND APPARATUS FOR STRUCTURING DOCUMENTS UTILIZING RECOGNITION OF AN ORDERED SEQUENCE OF IDENTIFIERS [In force]
Publication No.: US20090192956A1
Publication date: 2009-07-30
Application No.: US12020743
Filing date: 2008-01-28
Applicant: Herve Dejean, Jean-Luc Meunier
Inventor: Herve Dejean, Jean-Luc Meunier
IPC classification: G06F17/27, G06F15/18
CPC classification: G06F17/211
Abstract: A method is provided for operating a computing device to create a document structure model of a computer parsable text document utilizing recognition of at least one ordered sequence of identifiers in the document. The method includes converting a computer parsable text document of any format to an alternative structured language format to form a converted document. The text of the converted document is fragmented into an ordered sequence of text fragments within a text format. The text fragments are enumerated to obtain a sequence of terms. At least one optimal sub-sequence of terms is identified from among the sequence of terms, with an optimal sub-sequence being one or more longest increasing sub-sequence(s). The computer parsable text document is annotated with tags, with the tags including information derived from identification of the optimal sub-sequence(s). The annotated document is displayed on the graphical user interface.
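
Illustrative note (not part of the patent record): the abstract above names a longest increasing sub-sequence as the "optimal sub-sequence" of terms. The sketch below shows one common way to compute such a sub-sequence, assuming the terms are simply the integers that open each text fragment. The helper names `leading_numbers` and `longest_increasing_subsequence` and the sample fragments are hypothetical, and the patent's actual enumeration and selection steps may differ.

```python
import re
from bisect import bisect_left


def leading_numbers(fragments):
    """Return (fragment_index, number) pairs for fragments that start with an integer."""
    pairs = []
    for idx, text in enumerate(fragments):
        match = re.match(r"\s*(\d+)\b", text)
        if match:
            pairs.append((idx, int(match.group(1))))
    return pairs


def longest_increasing_subsequence(terms):
    """Return the positions of one longest strictly increasing sub-sequence of terms."""
    tails = []        # tails[k]: position of the smallest tail of a run of length k + 1
    tail_values = []  # the values at those positions, kept sorted for bisect
    prev = [None] * len(terms)
    for i, value in enumerate(terms):
        k = bisect_left(tail_values, value)
        if k == len(tails):
            tails.append(i)
            tail_values.append(value)
        else:
            tails[k] = i
            tail_values[k] = value
        prev[i] = tails[k - 1] if k > 0 else None
    # Walk the predecessor links back from the tail of the longest run.
    result = []
    i = tails[-1] if tails else None
    while i is not None:
        result.append(i)
        i = prev[i]
    return result[::-1]


# Hypothetical fragments: "7" and "2005" are numbers that do not belong to the heading sequence.
fragments = [
    "1 Introduction", "2 Background", "7 pages were lost", "3 Method",
    "Some prose", "4 Results", "2005 was the year", "5 Conclusion",
]
numbered = leading_numbers(fragments)
keep = longest_increasing_subsequence([number for _, number in numbered])
headings = [fragments[numbered[k][0]] for k in keep]
print(headings)
```

In this toy input the stray numbers 7 and 2005 fall outside the selected sub-sequence, so only the five numbered headings remain as candidates for tagging.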

13. Invention application

US20090110268A1 TABLE OF CONTENTS EXTRACTION BASED ON TEXTUAL SIMILARITY AND FORMAL ASPECTS [In force]
Publication No.: US20090110268A1
Publication date: 2009-04-30
Application No.: US11923904
Filing date: 2007-10-25
Applicant: Herve Dejean, Jean-Luc Meunier
Inventor: Herve Dejean, Jean-Luc Meunier
IPC classification: G06K9/62
CPC classification: G06K9/00469, G06F17/27
Abstract: An initial organizational table for a document is determined based on textual similarity between entries of the organizational table and target text fragments and not taking into account text formatting. A classifier is trained to identify text fragment pairs consisting of entries of the organizational table and corresponding target text fragments based at least in part on text formatting features. The training employs a training set of examples annotated based on the initial organizational table. The initial organizational table is updated using the trained classifier.

14. Invention application

US20080065671A1 Methods and apparatuses for detecting and labeling organizational tables in a document [Under examination, published]
Publication No.: US20080065671A1
Publication date: 2008-03-13
Application No.: US11517092
Filing date: 2006-09-07
Applicant: Herve Dejean, Jean-Luc Meunier
Inventor: Herve Dejean, Jean-Luc Meunier
IPC classification: G06F17/00
CPC classification: G06F17/2229, G06F17/2241
Abstract: A document (10) includes one or more organizational tables (40). Each organizational table includes a substantially contiguous sub-set of text fragments of the document identified as entries of the organizational table, and each entry has an associated linked text fragment. An organizational tables scorer (42) assigns a score to each of the one or more organizational tables respective to at least one object type based on a scoring criterion for that object type. An organizational tables labeler (44) assigns a table type label to each of the one or more organizational tables based on the scores.

15. Granted invention patent

US08352857B2 Methods and apparatuses for intra-document reference identification and resolution [In force]
Publication No.: US08352857B2
Publication date: 2013-01-08
Application No.: US12258627
Filing date: 2008-10-27
Applicant: Katja Filippova, Herve Dejean
Inventor: Katja Filippova, Herve Dejean
IPC classification: G06F17/00
CPC classification: G06F17/2235
Abstract: Reference identification and resolution identifies reference text fragments in a document and associates referenced object text fragments in the document with the identified reference text fragments. Reference profiles are abstracted from the document. Each reference profile specifies at least a reference number and an object type identifier. A reference profile is paired with an object text fragment of the document containing the reference number of the reference profile. The pairing is repeated to associate reference profiles with object text fragments. A reference text fragment of the document satisfying one of the reference profiles is associated with the object text fragment paired with the satisfied reference profile. The associating is repeated to associate reference text fragments of the document with object text fragments.
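
Illustrative note (not part of the patent record): the abstract above pairs a reference profile (an object type identifier plus a reference number, such as "Figure 2") with the object fragment that carries that number, then links every other fragment that mentions the same profile. The sketch below is a much simplified reading of that idea. It assumes the object types are known in advance and that an object fragment is a caption-like line beginning with "Figure N:" or "Table N:". The patent instead abstracts the profiles from the document itself, and the names `OBJECT`, `MENTION` and `resolve_references` are hypothetical.

```python
import re

# Assumption: an object fragment starts with "<type> <number>" followed by ":" or ".".
OBJECT = re.compile(r"^(Figure|Table)\s+(\d+)[:.]")
# Assumption: a reference is any in-text mention of "<type> <number>".
MENTION = re.compile(r"\b(Figure|Table)\s+(\d+)\b")


def resolve_references(fragments):
    """Return (reference_fragment_index, object_fragment_index) pairs."""
    # Step 1: pair each profile (type, number) with the first fragment that looks like its object.
    objects = {}
    for idx, text in enumerate(fragments):
        match = OBJECT.match(text)
        if match:
            objects.setdefault((match.group(1), match.group(2)), idx)
    # Step 2: link every other fragment that mentions a paired profile to that object.
    links = []
    for idx, text in enumerate(fragments):
        for match in MENTION.finditer(text):
            target = objects.get((match.group(1), match.group(2)))
            if target is not None and target != idx:
                links.append((idx, target))
    return links


fragments = [
    "As shown in Figure 2, the error drops.",
    "Figure 1: Pipeline overview",
    "Figure 2: Error rate",
    "Table 1 lists the settings; see also Figure 1.",
    "Table 1: Settings",
]
print(resolve_references(fragments))
```

A mention whose profile has no matching object fragment is simply left unlinked in this sketch.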

16. Granted invention patent

US07165216B2 Systems and methods for converting legacy and proprietary documents into extended mark-up language format [Lapsed]
Publication No.: US07165216B2
Publication date: 2007-01-16
Application No.: US10756313
Filing date: 2004-01-14
Applicant: Boris Chidlovskii, Herve Dejean
Inventor: Boris Chidlovskii, Herve Dejean
IPC classification: G06F15/00
CPC classification: G06F17/30914, G06F17/227
Abstract: A system and method that converts legacy and proprietary documents into extended mark-up language format which treats the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the converting task into three components which include path re-labeling, structural composition and input tree traversal, each of which involves learning approaches. The transformation of an input tree into an output tree may involve labeling components in the input tree with valid labels or paths from a particular output schema, composing the labeled elements into the output tree with a valid structure, and finding such a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.

17. Invention application

US20130321867A1 TYPOGRAPHICAL BLOCK GENERATION [Under examination, published]
Publication No.: US20130321867A1
Publication date: 2013-12-05
Application No.: US13484708
Filing date: 2012-05-31
Applicant: Herve Dejean
Inventor: Herve Dejean
IPC classification: G06K15/02
CPC classification: G06F17/211
Abstract: Embodiments of a computer-implemented method for grouping one or more token elements comprising one or more characters in an input file. The method comprises computing a first leading distance between a first baseline of a first token element, and a second baseline of a second token element. The method further comprises defining a block with the first token element and the second token element, and characterizing the first leading distance as a leading distance of the block. The method further comprises computing a second leading distance between the second baseline and a third baseline of a third token element. The method furthermore comprises, grouping the third token element in to the block based on a first difference between the second leading distance and the leading distance of the block lying within a first predefined threshold value.
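
Illustrative note (not part of the patent record): the abstract above groups token elements by comparing each new baseline-to-baseline distance (the leading) with the leading of the block under construction. The sketch below is one plain reading of that rule, assuming each token element is reduced to a single baseline coordinate listed in reading order. The function name `group_by_leading`, the default threshold and the sample coordinates are hypothetical.

```python
def group_by_leading(baselines, threshold=1.0):
    """Group element indices into blocks of roughly constant leading.

    baselines: baseline coordinates of successive token elements, in reading order.
    threshold: largest allowed difference between a new gap and the block's leading.
    """
    blocks = []
    current = []    # indices of the elements in the block being built
    leading = None  # leading of the current block, set by its first two elements
    for i, y in enumerate(baselines):
        if not current:
            current = [i]
            continue
        gap = y - baselines[current[-1]]
        if leading is None:
            # The first two elements define the block and its leading.
            current.append(i)
            leading = gap
        elif abs(gap - leading) <= threshold:
            # The new gap matches the block's leading: extend the block.
            current.append(i)
        else:
            # The gap deviates too much: close the block and start a new one.
            blocks.append(current)
            current = [i]
            leading = None
    if current:
        blocks.append(current)
    return blocks


# Two hypothetical paragraphs with 12-unit leading, separated by a 36-unit gap.
print(group_by_leading([100, 112, 124, 160, 172, 184]))  # [[0, 1, 2], [3, 4, 5]]
```

As in the abstract, the first two elements of a block are paired unconditionally, so a real implementation would likely need extra checks for isolated lines.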

18. Invention application

US20100107045A1 METHODS AND APPARATUSES FOR INTRA-DOCUMENT REFERENCE IDENTIFICATION AND RESOLUTION [In force]
Publication No.: US20100107045A1
Publication date: 2010-04-29
Application No.: US12258627
Filing date: 2008-10-27
Applicant: Katja Filippova, Herve Dejean
Inventor: Katja Filippova, Herve Dejean
IPC classification: G06F17/00, G06F17/21, G06F17/30
CPC classification: G06F17/2235
Abstract: Reference identification and resolution identifies reference text fragments in a document and associates referenced object text fragments in the document with the identified reference text fragments. Reference profiles are abstracted from the document. Each reference profile specifies at least a reference number and an object type identifier. A reference profile is paired with an object text fragment of the document containing the reference number of the reference profile. The pairing is repeated to associate reference profiles with object text fragments. A reference text fragment of the document satisfying one of the reference profiles is associated with the object text fragment paired with the satisfied reference profile. The associating is repeated to associate reference text fragments of the document with object text fragments.

19. Invention application

US20080077847A1 Captions detector [In force]
Publication No.: US20080077847A1
Publication date: 2008-03-27
Application No.: US11528261
Filing date: 2006-09-27
Applicant: Herve Dejean
Inventor: Herve Dejean
IPC classification: G06F17/00, G06F17/20, G06F17/21, G06F17/22, G06F17/24, G06F17/25, G06F17/26, G06F17/27, G06F17/28
CPC classification: G06F17/2745
Abstract: To detect captions in a document that includes text fragments and objects of interest, a signature is assigned to each text fragment. The signature is the value for that text fragment of a text fragment representation comprising at least one text fragment attribute. A caption signature is identified as a signature assigned to a substantial number of text fragments that are near at least one object of interest in the document. One or more captions are detected as one or more text fragments each assigned a caption signature.
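
Illustrative note (not part of the patent record): the abstract above assigns each text fragment a signature built from its attributes and treats a signature as a caption signature when a substantial number of fragments carrying it lie near an object of interest. The sketch below assumes the signature is the pair (font, size), that nearness has already been computed into a `near_object` flag, and that "substantial" means at least `min_near` near fragments making up at least `min_ratio` of all fragments with that signature. Those thresholds, the field names and the function `detect_captions` are hypothetical choices, not the patent's definitions.

```python
from collections import Counter


def signature(fragment):
    """Signature of a fragment: here simply its (font, size) pair."""
    return (fragment["font"], fragment["size"])


def detect_captions(fragments, min_near=2, min_ratio=0.5):
    """Return the texts of fragments whose signature qualifies as a caption signature."""
    total = Counter(signature(f) for f in fragments)
    near = Counter(signature(f) for f in fragments if f["near_object"])
    caption_signatures = {
        sig for sig, count in near.items()
        if count >= min_near and count / total[sig] >= min_ratio
    }
    return [f["text"] for f in fragments if signature(f) in caption_signatures]


fragments = [
    {"text": "Figure 1: Setup", "font": "Arial-Italic", "size": 9, "near_object": True},
    {"text": "Figure 2: Results", "font": "Arial-Italic", "size": 9, "near_object": True},
    {"text": "Body text A", "font": "Times", "size": 11, "near_object": True},
    {"text": "Body text B", "font": "Times", "size": 11, "near_object": False},
    {"text": "Body text C", "font": "Times", "size": 11, "near_object": False},
    {"text": "Table 1: Data", "font": "Arial-Italic", "size": 9, "near_object": False},
]
print(detect_captions(fragments))
```

Note that "Table 1: Data" is reported as a caption even though it is not flagged as near an object, because detection in this sketch depends only on sharing a caption signature, which matches the last sentence of the abstract.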

20. Invention application

US20060155700A1 Method and apparatus for structuring documents based on layout, content and collection [Lapsed]
Publication No.: US20060155700A1
Publication date: 2006-07-13
Application No.: US11033016
Filing date: 2005-01-10
Applicant: Herve Dejean, Veronika Lux, Sandrine Ribeau
Inventor: Herve Dejean, Veronika Lux, Sandrine Ribeau
IPC classification: G06F17/24
CPC classification: G06F17/30914
Abstract: A method and apparatus is provided for converting a document in a first format essentially comprising a flat layout structure into a structured document in a hierarchical form in accordance with predetermined attributes identified from the input format. The process comprises fragmenting the input document into a plurality of document content elements in accordance with a predetermined set of document attributes identifiable from the input document format. The content elements are clustered into selective sets having similar document attributes. The clustered sets are validated with reference to common textual properties organizational content common in documents in the collection. The clustered sets are then categorized into predetermined categories comprising structured elements of the structured document format and the document content elements are organized by hierarchical dependency from the predetermined categories wherein the organized document elements comprise the desired structured document format.
