    • 1. Granted invention patent
    • Duplicate data elimination system
    • US07287019B2
    • 2007-10-23
    • US10453992
    • 2003-06-04
    • Rahul Kapoor; Venkatesh Ganti; Surajit Chaudhuri
    • G06F17/30
    • G06F17/30303; G06F2216/03; Y10S707/99932; Y10S707/99933; Y10S707/99934; Y10S707/99935; Y10S707/99936
    • A process for finding similar data records in a set of data records. One or more database tables provide the data records from which one or more canonical data records are identified. Tokens are identified within the data records and classified according to attribute field. Each data record is assigned a similarity score relative to other data records, based on the similarity between their tokens. Data records whose similarity scores with respect to each other exceed a threshold form one or more groups. The records, or tuples, form the nodes of a graph in which an edge between two nodes represents the similarity score between two records of a group. Within each group, a canonical record is identified based on the similarity of the group's data records to each other.
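The grouping described in this abstract can be illustrated with a minimal Python sketch. It is not the patented method: Jaccard similarity over field-tagged tokens, the union-find grouping, the "highest total edge weight" rule for picking the canonical record, and all function names are assumptions made for illustration.

```python
from itertools import combinations

def tokens(record):
    """Tokenise each attribute value and tag every token with its field name."""
    return {(field, tok) for field, value in record.items()
            for tok in value.lower().split()}

def jaccard(a, b):
    """Jaccard similarity of two token sets (an assumed choice of score)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_and_pick_canonical(records, threshold=0.5):
    """Return a sorted list of (member indices, canonical index) per group."""
    toks = [tokens(r) for r in records]
    n = len(records)
    # Graph: nodes are record indices; an edge stores a score above the threshold.
    edges = {}
    for i, j in combinations(range(n), 2):
        score = jaccard(toks[i], toks[j])
        if score > threshold:
            edges[(i, j)] = score
    # Union-find collects the connected components (the groups).
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    def total_similarity(i):
        # Sum of edge scores touching record i (assumed canonical criterion).
        return sum(s for (a, b), s in edges.items() if i in (a, b))

    return sorted((sorted(members), max(members, key=total_similarity))
                  for members in groups.values())
```

For example, given three spelling variants of one company and one unrelated record, the variants end up in a single group and the record most similar to the others is reported as canonical.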
    • 5. Granted invention patent
    • Identifying synonyms of entities using a document collection
    • US08533203B2
    • 2013-09-10
    • US12478120
    • 2009-06-04
    • Surajit Chaudhuri; Venkatesh Ganti; Dong Xin
    • G06F17/30; G06F7/00
    • G06F17/2795; G06F17/278
    • Identifying synonyms of entities using a collection of documents is disclosed herein. In some aspects, a document from a collection of documents may be analyzed to identify hit sequences that include one or more tokens (e.g., words, numbers, etc.). The hit sequences may then be used to generate discriminating token sets (DTSs) that are subsets of both the hit sequences and the entity names. The DTSs are matched with corresponding entity names, and then used to create DTS phrases by selecting adjacent text in the document that is proximate to the DTS. The DTS phrases may be analyzed to determine whether the corresponding DTS is a synonym of the entity name. In various aspects, the tokens of an associated entity name that are present in the DTS phrases are used to generate a score for the DTS. When the score reaches or exceeds a threshold, the DTS may be designated as a synonym. A list of synonyms may be generated for each entity name.
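The scoring step in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: whitespace tokenisation, a fixed context window around each contiguous DTS occurrence, and a score defined as the average fraction of entity-name tokens seen in the DTS phrases are all choices made here, not details taken from the patent. The function names are hypothetical.

```python
def dts_phrases(dts, document, window=3):
    """Return the text around each contiguous occurrence of the DTS."""
    words = document.lower().split()
    d = dts.lower().split()
    phrases = []
    for i in range(len(words) - len(d) + 1):
        if words[i:i + len(d)] == d:
            # Keep `window` tokens on each side of the match.
            phrases.append(words[max(0, i - window): i + len(d) + window])
    return phrases

def dts_score(entity, dts, documents, window=3):
    """Average fraction of entity-name tokens present in the DTS phrases."""
    entity_toks = set(entity.lower().split())
    phrases = [p for doc in documents for p in dts_phrases(dts, doc, window)]
    if not phrases:
        return 0.0
    return sum(len(entity_toks & set(p)) / len(entity_toks)
               for p in phrases) / len(phrases)

def synonyms(entity, candidates, documents, threshold=0.7, window=3):
    """Candidates whose score reaches the threshold are kept as synonyms."""
    return [c for c in candidates
            if dts_score(entity, c, documents, window) >= threshold]
```

In this toy setting, a candidate such as "eos 350d" that keeps appearing next to the remaining tokens of "canon eos 350d digital camera" scores highly, while a generic candidate such as "digital camera" that also appears in unrelated contexts scores lower.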
    • 6. Granted invention patent
    • Finding related entity results for search queries
    • US08195655B2
    • 2012-06-05
    • US11758024
    • 2007-06-05
    • Sanjay Agrawal; Kaushik Chakrabarti; Surajit Chaudhuri; Venkatesh Ganti
    • G06F17/30
    • G06F17/278; G06F17/30864
    • An architecture for finding related entities for web search queries. An extraction component takes a document as input and outputs all the mentions (or occurrences) of named entities in the document, such as names of people, organizations, locations, and products, as well as entity metadata. An indexing component takes a document identifier (docID) and the set of mentions of named entities, and stores and indexes the information for retrieval. A document-based search component takes a keyword query and returns the docIDs of the top documents matching the query. A retrieval component takes a docID as input, accesses the information stored by the indexing component, and returns the set of mentions of named entities in the document. This information is then passed to an entity scoring and thresholding component that computes an aggregate score for each entity and selects the entities to return to the user.
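The five components named in this abstract (extraction, indexing, document search, retrieval, scoring and thresholding) can be wired together in a toy Python sketch. Everything concrete here is an assumption for illustration: the `KNOWN_ENTITIES` gazetteer stands in for a real named-entity extractor, keyword-overlap counting stands in for a real search engine, and the aggregate score is simply the mention count across the top documents.

```python
from collections import Counter

# Hypothetical gazetteer standing in for a real named-entity extractor.
KNOWN_ENTITIES = {"seattle", "microsoft", "redmond"}

def extract(text):
    """Extraction component: return every mention of a known entity."""
    return [w for w in text.lower().split() if w in KNOWN_ENTITIES]

class EntityIndex:
    """Indexing, document search and retrieval components in one toy class."""

    def __init__(self):
        self._texts = {}
        self._mentions = {}

    def add(self, doc_id, text):
        # Store the document words and its entity mentions under the docID.
        self._texts[doc_id] = text.lower().split()
        self._mentions[doc_id] = extract(text)

    def search(self, query, k=2):
        # Rank documents by how many of their words are query keywords.
        q = set(query.lower().split())
        scored = [(sum(w in q for w in words), d)
                  for d, words in self._texts.items()]
        ranked = sorted((s, d) for s, d in scored if s > 0)
        ranked.sort(key=lambda x: (-x[0], x[1]))
        return [d for _, d in ranked[:k]]

    def retrieve(self, doc_id):
        # Return the stored mentions for one document.
        return self._mentions[doc_id]

def related_entities(index, query, k=2, min_score=2):
    """Scoring and thresholding: aggregate mention counts over the top docs."""
    totals = Counter()
    for doc_id in index.search(query, k):
        totals.update(index.retrieve(doc_id))
    return sorted(e for e, s in totals.items() if s >= min_score)
```

With three short documents indexed, the query "cloud office" returns only the entity mentioned in both of the top-ranked documents.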
    • 7. Invention patent application
    • Pushing Search Query Constraints Into Information Retrieval Processing
    • US20110320446A1
    • 2011-12-29
    • US12823124
    • 2010-06-25
    • Kaushik Chakrabarti; Surajit Chaudhuri; Venkatesh Ganti
    • G06F17/30
    • G06F16/90335
    • This patent application relates to interval-based information retrieval (IR) search techniques for efficiently and correctly answering keyword search queries. In some embodiments, a range of information-containing blocks for a search query can be identified. Each of these blocks, and thus the range, can include document identifiers that identify individual corresponding documents that contain a term found in the search query. From the range, a subrange(s) having a smaller number of blocks than the range can be selected. This can be accomplished without decompressing the blocks, by partitioning the range into intervals and evaluating the intervals. The smaller number of blocks in the subrange(s) can then be decompressed and processed to identify a doc ID(s) and thus a document(s) that satisfies the query.
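The idea of narrowing a range of compressed blocks before decompressing any of them can be sketched as below. This is a simplified illustration, not the technique claimed in the application: it assumes each block carries uncompressed minimum and maximum document IDs as its interval, uses `zlib` over JSON as a stand-in compression scheme, and handles only a two-term AND query.

```python
import json
import zlib

def make_blocks(doc_ids, block_size=3):
    """Split a sorted posting list into (lo, hi, compressed_ids) blocks."""
    blocks = []
    for i in range(0, len(doc_ids), block_size):
        chunk = doc_ids[i:i + block_size]
        payload = zlib.compress(json.dumps(chunk).encode())
        blocks.append((chunk[0], chunk[-1], payload))
    return blocks

def overlapping(blocks_a, blocks_b):
    """Blocks of a whose [lo, hi] interval overlaps some block of b.

    Only the interval metadata is inspected; nothing is decompressed.
    """
    return [a for a in blocks_a
            if any(a[0] <= b[1] and b[0] <= a[1] for b in blocks_b)]

def decompress(block):
    """Recover the document IDs stored in one block."""
    return json.loads(zlib.decompress(block[2]).decode())

def and_query(blocks_a, blocks_b):
    """Return matching doc IDs and the number of blocks decompressed."""
    cand_a = overlapping(blocks_a, blocks_b)
    cand_b = overlapping(blocks_b, blocks_a)
    ids_a = {d for blk in cand_a for d in decompress(blk)}
    ids_b = {d for blk in cand_b for d in decompress(blk)}
    return sorted(ids_a & ids_b), len(cand_a) + len(cand_b)
```

In the small test case used here, only two of the five blocks are decompressed to answer the query, because the other three have intervals that cannot contain a common document ID.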
    • 8. Invention patent application
    • Finding Related Entities For Search Queries
    • US20080306908A1
    • 2008-12-11
    • US11758024
    • 2007-06-05
    • Sanjay Agrawal; Kaushik Chakrabarti; Surajit Chaudhuri; Venkatesh Ganti
    • G06F17/30
    • G06F17/278; G06F17/30864
    • An architecture for finding related entities for web search queries. An extraction component takes a document as input and outputs all the mentions (or occurrences) of named entities in the document, such as names of people, organizations, locations, and products, as well as entity metadata. An indexing component takes a document identifier (docID) and the set of mentions of named entities, and stores and indexes the information for retrieval. A document-based search component takes a keyword query and returns the docIDs of the top documents matching the query. A retrieval component takes a docID as input, accesses the information stored by the indexing component, and returns the set of mentions of named entities in the document. This information is then passed to an entity scoring and thresholding component that computes an aggregate score for each entity and selects the entities to return to the user.
    • 10. Granted invention patent
    • Detecting duplicate records in database
    • US06961721B2
    • 2005-11-01
    • US10186031
    • 2002-06-28
    • Surajit Chaudhuri; Venkatesh Ganti; Rohit Ananthakrishna
    • G06F17/30; G06F7/00
    • G06F17/30303; Y10S707/99931; Y10S707/99942
    • The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicate tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key/foreign-key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high-quality, scalable duplicate detection process.
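One way to picture how a table hierarchy adds knowledge beyond string similarity is the sketch below. It is an illustration under assumptions, not the patented process: the textual score uses `difflib.SequenceMatcher`, the hierarchy score is the Jaccard overlap of the child values linked to each parent through foreign keys, and the "either score reaches its threshold" rule and both threshold values are choices made here.

```python
from difflib import SequenceMatcher

def textual_sim(a, b):
    """Standard string similarity between two attribute values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cooccurrence_sim(children_a, children_b):
    """Overlap of the child tuples that reference each parent tuple."""
    a, b = set(children_a), set(children_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def is_duplicate(name_a, name_b, children, text_t=0.8, cooc_t=0.5):
    """Flag a pair when either the text or the hierarchy evidence is strong."""
    return (textual_sim(name_a, name_b) >= text_t
            or cooccurrence_sim(children[name_a], children[name_b]) >= cooc_t)
```

In a State/City hierarchy, "WA" and "Washington" are far apart as strings, yet they share most of their cities, so the hierarchy evidence flags them. "Washington" and "Wisconsin" share no cities and are not similar enough as strings, so they are kept apart.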