Patent Quick Search - Fast search of global patents, free patent database for commercial use - IPRDB

11. Granted patent

US07406479B2 Primitive operator for similarity joins in data cleaning (In force)
Publication No.: US07406479B2
Publication date: 2008-07-29
Application No.: US11352141
Filing date: 2006-02-10
Applicant: Kaushik Shriraghav, Surajit Chaudhuri, Venkatesh Ganti
Inventor: Kaushik Shriraghav, Surajit Chaudhuri, Venkatesh Ganti
IPC classification: G06F17/00
CPC classification: G06F17/30442, Y10S707/99942, Y10S707/99943
Abstract: A set similarity join system and method are provided. The system can be employed to facilitate data cleaning based on similarities through the identification of “close” tuples (e.g., records and/or rows). “Closeness” can be evaluated using a similarity function(s) chosen to suit the domain and/or application. Thus, the system facilitates generic domain-independent data cleansing. The system can be employed with a foundational primitive, the set similarity join (SSJoin) operator, which can be used as a building block to implement a broad variety of notions of similarity (e.g., edit similarity, Jaccard similarity, generalized edit similarity, Hamming distance, soundex, etc.) as well as similarity based on co-occurrences. The SSJoin operator can exploit the observation that set overlap can be used effectively to support a variety of similarity functions. The SSJoin operator compares values based on “sets” associated with (or explicitly constructed for) each one of them.
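The idea that set overlap can support a similarity function such as Jaccard similarity can be illustrated with a minimal sketch. It is not the patented SSJoin operator: the q-gram construction, the inverted index, and the names `qgrams` and `jaccard_join` are assumptions made for illustration only.

```python
from collections import defaultdict


def qgrams(value: str, q: int = 3) -> set[str]:
    """Build the set of q-grams of a lowercased string padded with '#'."""
    padded = "#" * (q - 1) + value.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard_join(left, right, threshold=0.5, q=3):
    """Return (l, r, similarity) for pairs whose q-gram sets have
    Jaccard similarity >= threshold, computed from set overlap."""
    right_sets = {r: qgrams(r, q) for r in right}

    # Inverted index: q-gram -> right-hand strings that contain it.
    index = defaultdict(set)
    for r, grams in right_sets.items():
        for gram in grams:
            index[gram].add(r)

    results = []
    for l in left:
        l_set = qgrams(l, q)
        # Count the overlap only for strings sharing at least one q-gram.
        overlap = defaultdict(int)
        for gram in l_set:
            for r in index.get(gram, ()):
                overlap[r] += 1
        for r, common in overlap.items():
            union = len(l_set) + len(right_sets[r]) - common
            similarity = common / union
            if similarity >= threshold:
                results.append((l, r, similarity))
    return results


if __name__ == "__main__":
    for pair in jaccard_join(["Microsoft Corp"], ["Microsoft Corp.", "Oracle"]):
        print(pair)
```

Other similarity notions mentioned in the abstract would need different set constructions and thresholds; only the overlap-counting step is shared.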

12. Granted patent

US07287019B2 Duplicate data elimination system (In force)
Publication No.: US07287019B2
Publication date: 2007-10-23
Application No.: US10453992
Filing date: 2003-06-04
Applicant: Rahul Kapoor, Venkatesh Ganti, Surajit Chaudhuri
Inventor: Rahul Kapoor, Venkatesh Ganti, Surajit Chaudhuri
IPC classification: G06F17/30
CPC classification: G06F17/30303, G06F2216/03, Y10S707/99932, Y10S707/99933, Y10S707/99934, Y10S707/99935, Y10S707/99936
Abstract: A process for finding similar data records in a set of data records. A database table or tables provide a number of data records from which one or more canonical data records are identified. Tokens are identified within the data records and classified according to attribute field. A similarity score is assigned to data records in relation to other data records based on a similarity between tokens of the data records. Data records whose similarity score with respect to each other is greater than a threshold form one or more groups of data records. The records or tuples form nodes of a graph wherein edges between nodes represent a similarity score between records of a group. Within each group, a canonical record is identified based on the similarity of data records to each other within the group.
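The steps in this abstract (tokens tagged by attribute field, pairwise similarity scores, thresholded groups, a canonical record per group) can be sketched roughly as follows. The Jaccard score, the use of connected components as groups, and the "highest total similarity" rule for the canonical record are all assumptions for illustration; the abstract does not specify them.

```python
def tokens(record: dict) -> set:
    """Split each attribute value into tokens tagged with their field name."""
    return {(field, tok) for field, value in record.items()
            for tok in value.lower().split()}


def similarity(a: dict, b: dict) -> float:
    """Jaccard similarity of the field-tagged token sets (an assumed score)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def group_duplicates(records, threshold=0.45):
    """Build a graph whose edges join records scoring above the threshold,
    then return its connected components and the edge scores."""
    n = len(records)
    scores = {}
    neighbours = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            s = similarity(records[i], records[j])
            if s > threshold:
                scores[(i, j)] = scores[(j, i)] = s
                neighbours[i].add(j)
                neighbours[j].add(i)

    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            component.append(node)
            for nb in neighbours[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        groups.append(sorted(component))
    return groups, scores


def canonical(group, scores):
    """Pick the member with the highest total similarity to its group."""
    return max(group, key=lambda i: sum(scores.get((i, j), 0.0)
                                        for j in group if j != i))


if __name__ == "__main__":
    rows = [
        {"name": "John Smith", "city": "Seattle"},
        {"name": "John Smith", "city": "Seattle WA"},
        {"name": "Jon Smith", "city": "Seattle"},
        {"name": "Mary Jones", "city": "Boston"},
    ]
    groups, scores = group_duplicates(rows)
    for g in groups:
        print(g, "canonical:", rows[canonical(g, scores)])
```

The all-pairs comparison is quadratic and is only suitable for small inputs; it is used here to keep the sketch short.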

13. Patent application

US20060282436A1 Systems and methods for estimating functional relationships in a database (In force)
Publication No.: US20060282436A1
Publication date: 2006-12-14
Application No.: US11123901
Filing date: 2005-05-06
Applicant: Surajit Chaudhuri, Venkatesh Ganti, Kaushik Shriraghav
Inventor: Surajit Chaudhuri, Venkatesh Ganti, Kaushik Shriraghav
IPC classification: G06F7/00
CPC classification: G06F17/30536, Y10S707/99932
Abstract: A system that facilitates estimating functional relationships associated with one or more columns in a database comprises a sampling component that receives a random sample of records within the database. An estimate generator component calculates an estimate of strength of functional relationships based at least in part upon the received samples. For example, the estimate generator component can calculate an estimate of strength of a column as a key column based at least in part upon the received samples.
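As a rough illustration of estimating how strongly a column behaves as a key from a random sample, the sketch below uses the fraction of distinct values in the sample. This naive ratio is an assumption, not the estimator claimed in the application, and it can overestimate key strength because duplicates are less likely to land in a small sample together. The function names are hypothetical.

```python
import random


def key_strength(rows, columns):
    """Fraction of rows with a distinct value combination in `columns`
    (1.0 means the columns behave as a key on these rows)."""
    if not rows:
        return 0.0
    distinct = {tuple(row[c] for c in columns) for row in rows}
    return len(distinct) / len(rows)


def estimate_key_strength(table, columns, sample_size, seed=0):
    """Estimate key strength from a uniform random sample of the table."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(table, min(sample_size, len(table)))
    return key_strength(sample, columns)


if __name__ == "__main__":
    table = [{"id": i, "dept": i % 3} for i in range(100)]
    print(estimate_key_strength(table, ["id"], 20))
    print(estimate_key_strength(table, ["dept"], 20))
```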

14. Granted patent

US07149735B2 String predicate selectivity estimation (Lapsed)
Publication No.: US07149735B2
Publication date: 2006-12-12
Application No.: US10603035
Filing date: 2003-06-24
Applicant: Surajit Chaudhuri, Venkatesh Ganti, Luis Gravano
Inventor: Surajit Chaudhuri, Venkatesh Ganti, Luis Gravano
IPC classification: G06F17/30
CPC classification: G06F17/30985, Y10S707/99936
Abstract: A method of estimating the selectivity of a given string predicate in a database query. In the method, selectivities of substrings of various lengths are estimated. For example, the selectivities of substrings ranging from length 1 (or some constant q) to the length of the given string predicate may be estimated. The method then selects a candidate substring for each substring length based on the estimated selectivities of the substrings. The estimated selectivities of the candidate substrings are combined, and the combined estimate is returned as the estimated selectivity of the given string predicate.
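The control flow described in this abstract (estimate per-substring selectivities, pick one candidate per length, combine) might be sketched as below. The per-substring estimator is supplied by the caller, and both the "most selective substring" candidate rule and the `min` combination are assumptions; the abstract does not say how candidates are chosen or combined.

```python
def candidate_substrings(predicate, estimate, q=1):
    """For each length from q to len(predicate), pick the substring with
    the lowest estimated selectivity (an assumed selection rule)."""
    candidates = []
    for length in range(q, len(predicate) + 1):
        subs = sorted({predicate[i:i + length]
                       for i in range(len(predicate) - length + 1)})
        best = min(subs, key=estimate)
        candidates.append((best, estimate(best)))
    return candidates


def estimate_selectivity(predicate, estimate, q=1, combine=min):
    """Combine the per-length candidate selectivities into one estimate.
    `combine` defaults to min, which is only a placeholder choice."""
    candidates = candidate_substrings(predicate, estimate, q)
    if not candidates:
        return 1.0
    return combine(sel for _, sel in candidates)


if __name__ == "__main__":
    # Hypothetical substring selectivities, e.g. read from a summary structure.
    table = {"a": 0.5, "b": 0.2, "c": 0.4, "ab": 0.1, "bc": 0.15, "abc": 0.05}
    lookup = lambda s: table.get(s, 1.0)
    print(candidate_substrings("abc", lookup))
    print(estimate_selectivity("abc", lookup))
```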

15. Patent application

US20100313258A1 IDENTIFYING SYNONYMS OF ENTITIES USING A DOCUMENT COLLECTION (In force)
Publication No.: US20100313258A1
Publication date: 2010-12-09
Application No.: US12478120
Filing date: 2009-06-04
Applicant: Surajit Chaudhuri, Venkatesh Ganti, Dong Xin
Inventor: Surajit Chaudhuri, Venkatesh Ganti, Dong Xin
IPC classification: H04L9/32
CPC classification: G06F17/2795, G06F17/278
Abstract: Identifying synonyms of entities using a collection of documents is disclosed herein. In some aspects, a document from a collection of documents may be analyzed to identify hit sequences that include one or more tokens (e.g., words, numbers, etc.). The hit sequences may then be used to generate discriminating token sets (DTS's) that are subsets of both the hit sequences and the entity names. The DTS's are matched with corresponding entity names, and then used to create DTS phrases by selecting adjacent text in the document that is proximate to the DTS. The DTS phrases may be analyzed to determine whether the corresponding DTS is a synonym of the entity name. In various aspects, the tokens of an associated entity name that are present in the DTS phrases are used to generate a score for the DTS. When the score at least reaches a threshold, the DTS may be designated as a synonym. A list of synonyms may be generated for each entity name.

16. Patent application

US20100293179A1 IDENTIFYING SYNONYMS OF ENTITIES USING WEB SEARCH (Pending, published)
Publication No.: US20100293179A1
Publication date: 2010-11-18
Application No.: US12465832
Filing date: 2009-05-14
Applicant: Surajit Chaudhuri, Venkatesh Ganti, Dong Xin
Inventor: Surajit Chaudhuri, Venkatesh Ganti, Dong Xin
IPC classification: G06F17/30
CPC classification: G06F16/951
Abstract: Identifying synonyms of entities using web search results is disclosed herein. In some aspects, a candidate string of tokens of an entity name is selected as a search term. The search term is transmitted by a server to a search engine, which, in turn, transmits search results back to the server after performing a search. The server analyzes the search results, generates a score based on the search results, and then determines a status (synonym or not a synonym) of the candidate string based on the score. In further aspects, additional candidate strings are designated as synonyms or not synonyms based on the status of the searched candidate string by using relationships of a lattice formed from all possible candidate strings of the entity name.

17. Granted patent

US07562067B2 Systems and methods for estimating functional relationships in a database (In force)
Publication No.: US07562067B2
Publication date: 2009-07-14
Application No.: US11123901
Filing date: 2005-05-06
Applicant: Surajit Chaudhuri, Venkatesh Ganti, Kaushik Shriraghav
Inventor: Surajit Chaudhuri, Venkatesh Ganti, Kaushik Shriraghav
IPC classification: G06F17/30, G06F7/00, G06F17/00
CPC classification: G06F17/30536, Y10S707/99932
Abstract: A system that facilitates estimating functional relationships associated with one or more columns in a database comprises a sampling component that receives a random sample of records within the database. An estimate generator component calculates an estimate of strength of functional relationships based at least in part upon the received samples. For example, the estimate generator component can calculate an estimate of strength of a column as a key column based at least in part upon the received samples.

18. Patent application

US20050262044A1 Detecting duplicate records in databases (In force)
Publication No.: US20050262044A1
Publication date: 2005-11-24
Application No.: US11182590
Filing date: 2005-07-14
Applicant: Surajit Chaudhuri, Venkatesh Ganti, Rohit Ananthakrishna
Inventor: Surajit Chaudhuri, Venkatesh Ganti, Rohit Ananthakrishna
IPC classification: G06F17/30, G06F7/00
CPC classification: G06F17/30303, Y10S707/99931, Y10S707/99942
Abstract: The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicated tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high-quality, scalable duplicate detection process.

19. Granted patent

US08533203B2 Identifying synonyms of entities using a document collection (In force)
Publication No.: US08533203B2
Publication date: 2013-09-10
Application No.: US12478120
Filing date: 2009-06-04
Applicant: Surajit Chaudhuri, Venkatesh Ganti, Dong Xin
Inventor: Surajit Chaudhuri, Venkatesh Ganti, Dong Xin
IPC classification: G06F17/30, G06F7/00
CPC classification: G06F17/2795, G06F17/278
Abstract: Identifying synonyms of entities using a collection of documents is disclosed herein. In some aspects, a document from a collection of documents may be analyzed to identify hit sequences that include one or more tokens (e.g., words, numbers, etc.). The hit sequences may then be used to generate discriminating token sets (DTS's) that are subsets of both the hit sequences and the entity names. The DTS's are matched with corresponding entity names, and then used to create DTS phrases by selecting adjacent text in the document that is proximate to the DTS. The DTS phrases may be analyzed to determine whether the corresponding DTS is a synonym of the entity name. In various aspects, the tokens of an associated entity name that are present in the DTS phrases are used to generate a score for the DTS. When the score at least reaches a threshold, the DTS may be designated as a synonym. A list of synonyms may be generated for each entity name.

20. Patent application

US20110320446A1 Pushing Search Query Constraints Into Information Retrieval Processing (Pending, published)
Publication No.: US20110320446A1
Publication date: 2011-12-29
Application No.: US12823124
Filing date: 2010-06-25
Applicant: Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti
Inventor: Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti
IPC classification: G06F17/30
CPC classification: G06F16/90335
Abstract: This patent application relates to interval-based information retrieval (IR) search techniques for efficiently and correctly answering keyword search queries. In some embodiments, a range of information-containing blocks for a search query can be identified. Each of these blocks, and thus the range, can include document identifiers that identify individual corresponding documents that contain a term found in the search query. From the range, a subrange(s) having a smaller number of blocks than the range can be selected. This can be accomplished without decompressing the blocks by partitioning the range into intervals and evaluating the intervals. The smaller number of blocks in the subrange(s) can then be decompressed and processed to identify a doc ID(s) and thus document(s) that satisfies the query.
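The idea of narrowing the set of blocks before decompressing any of them can be illustrated with a simplified sketch for a two-term AND query. Each compressed block carries the lowest and highest doc ID it holds, and only blocks whose ID intervals overlap a block of the other term are decompressed. The block layout, the `zlib`/`json` encoding, and all names here are assumptions for illustration, not the technique as claimed.

```python
import json
import zlib
from dataclasses import dataclass


@dataclass
class Block:
    lo: int          # smallest doc ID in the block
    hi: int          # largest doc ID in the block
    payload: bytes   # compressed, sorted doc IDs


def make_blocks(doc_ids, block_size=4):
    """Split a posting list into compressed blocks with interval metadata."""
    ids = sorted(doc_ids)
    blocks = []
    for i in range(0, len(ids), block_size):
        chunk = ids[i:i + block_size]
        payload = zlib.compress(json.dumps(chunk).encode())
        blocks.append(Block(chunk[0], chunk[-1], payload))
    return blocks


def overlapping(blocks, others):
    """Keep blocks whose [lo, hi] interval overlaps some interval in
    `others`. Only metadata is inspected; nothing is decompressed."""
    return [b for b in blocks
            if any(b.lo <= o.hi and o.lo <= b.hi for o in others)]


def decode(block):
    """Decompress one block into a set of doc IDs."""
    return set(json.loads(zlib.decompress(block.payload)))


def intersect(term_a, term_b):
    """Answer an AND query, returning matching doc IDs and the number of
    blocks that actually had to be decompressed."""
    keep_a = overlapping(term_a, term_b)
    keep_b = overlapping(term_b, term_a)
    ids_a = set().union(*(decode(b) for b in keep_a))
    ids_b = set().union(*(decode(b) for b in keep_b))
    return sorted(ids_a & ids_b), len(keep_a) + len(keep_b)


if __name__ == "__main__":
    common_term = make_blocks(range(1, 41), block_size=4)       # 10 blocks
    rare_term = make_blocks([6, 7, 38, 100], block_size=2)      # 2 blocks
    print(intersect(common_term, rare_term))
```

In this example only 4 of the 12 blocks are decompressed; the saving depends on how the doc IDs of the two terms are distributed.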
