Quick Patent Search: Search Global Patents Quickly, Free Commercial Patent Database - IPRDB

1. Invention application

US20090006392A1 DATA PROFILE COMPUTATION (in force)
Publication No.: US20090006392A1
Publication date: 2009-01-01
Application No.: US11769050
Filing date: 2007-06-27
Applicant: Zhimin Chen, Venkatesh Ganti, Gunjan Jha, Shriraghav Kaushik, Vivek Narasayya
Inventor: Zhimin Chen, Venkatesh Ganti, Gunjan Jha, Shriraghav Kaushik, Vivek Narasayya
IPC classification: G06F7/06, G06F17/30
CPC classification: G06F17/30536
Abstract: Architecture that provides a data profile computation technique which employs key profile computation and data pattern profile computation. Key profile computation in a data table includes both exact keys and approximate keys, and is based on key strengths. A key strength of 100% is an exact key, and any other percentage is an approximate key. The key strength is estimated based on the number of table rows that have duplicated attribute values. Only column sets that exceed a threshold value are returned. Pattern profiling identifies a small set of regular expression patterns which best describe the patterns within a given set of attribute values. Pattern profiling includes three phases: a first phase for determining token regular expressions, a second phase for determining candidate regular expressions, and a third phase for identifying the best regular expressions of the candidates that match the attribute values.
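
The key-strength idea in this abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so the definition used here (the fraction of rows whose values on a column set are not shared with any other row) is an assumption, as are the names `key_strength` and `candidate_keys`, the `max_width` limit, and the choice to treat a strength equal to the threshold as passing. The sketch also counts duplicates over the whole table, whereas the abstract speaks of estimating the strength.

```python
from collections import Counter
from itertools import combinations


def key_strength(rows, columns):
    """Fraction of rows whose values on `columns` are unique in the table.

    Assumed definition for illustration: 1.0 corresponds to an exact key,
    anything lower to an approximate key.
    """
    if not rows:
        return 1.0
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    # Rows that share their projected value with at least one other row.
    duplicated_rows = sum(n for n in counts.values() if n > 1)
    return 1.0 - duplicated_rows / len(rows)


def candidate_keys(rows, column_names, threshold, max_width=2):
    """Return (column set, strength) pairs whose strength reaches the threshold."""
    results = []
    for width in range(1, max_width + 1):
        for cols in combinations(column_names, width):
            strength = key_strength(rows, cols)
            if strength >= threshold:
                results.append((cols, strength))
    return results


rows = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bob", "city": "Oslo"},
    {"id": 3, "name": "Ann", "city": "Rome"},
    {"id": 4, "name": "Cy", "city": "Rome"},
]
print(key_strength(rows, ("id",)))    # 1.0, an exact key
print(key_strength(rows, ("name",)))  # 0.5, two of four rows share "Ann"
print(candidate_keys(rows, ["id", "name", "city"], threshold=0.9))
```

Enumerating every column combination grows quickly with table width, which is why the sketch caps the width of the column sets it tries.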

2. Invention grant

US07720883B2 Key profile computation and data pattern profile computation (in force)
Publication No.: US07720883B2
Publication date: 2010-05-18
Application No.: US11769050
Filing date: 2007-06-27
Applicant: Zhimin Chen, Venkatesh Ganti, Gunjan Jha, Shriraghav Kaushik, Vivek Narasayya
Inventor: Zhimin Chen, Venkatesh Ganti, Gunjan Jha, Shriraghav Kaushik, Vivek Narasayya
IPC classification: G06F7/00, G06F17/30
CPC classification: G06F17/30536
Abstract: Architecture that provides a data profile computation technique which employs key profile computation and data pattern profile computation. Key profile computation in a data table includes both exact keys and approximate keys, and is based on key strengths. A key strength of 100% is an exact key, and any other percentage is an approximate key. The key strength is estimated based on the number of table rows that have duplicated attribute values. Only column sets that exceed a threshold value are returned. Pattern profiling identifies a small set of regular expression patterns which best describe the patterns within a given set of attribute values. Pattern profiling includes three phases: a first phase for determining token regular expressions, a second phase for determining candidate regular expressions, and a third phase for identifying the best regular expressions of the candidates that match the attribute values.

3. Invention application

US20080313128A1 Disk-Based Probabilistic Set-Similarity Indexes (in force)
Publication No.: US20080313128A1
Publication date: 2008-12-18
Application No.: US11761425
Filing date: 2007-06-12
Applicant: Arvind Arasu, Venkatesh Ganti, Shriraghav Kaushik
Inventor: Arvind Arasu, Venkatesh Ganti, Shriraghav Kaushik
IPC classification: G06F7/06, G06F17/30
CPC classification: G06F17/30312, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99935, Y10S707/99937
Abstract: Input set indexing for set-similarity lookups. The architecture provides input to an indexing process that enables more efficient lookups for large data sets (e.g., disk-based) without requiring a full scan of the input. A new index structure is provided, the output of which is exact, rather than approximate. The similarity of two sets is specified using a similarity function that maps two sets to a numeric value that represents similarity of the two sets. Threshold-based lookups are addressed where two sets are considered similar if the numeric similarity score is above a threshold. The structure efficiently identifies all input sets within a distance k (e.g., a Hamming distance) of the query set. Additional information in the form of frequency of elements (the number of input sets in which an element occurs) is used to improve index performance.
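
A threshold-based set-similarity lookup of the kind this abstract describes can be sketched in Python as follows. This is not the index structure claimed in the patent: it is a toy in-memory inverted index, and the choice of Jaccard similarity, the class name `InvertedSetIndex`, and the helper names are all assumptions made for illustration. It assumes non-empty sets and a threshold of at least 0.

```python
from collections import defaultdict


def jaccard(a, b):
    """Similarity function: maps two sets to a value in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def hamming(a, b):
    """Hamming distance between two sets: size of their symmetric difference."""
    return len(a ^ b)


class InvertedSetIndex:
    """Toy index mapping each element to the ids of the input sets containing it."""

    def __init__(self, input_sets):
        self.sets = [frozenset(s) for s in input_sets]
        self.postings = defaultdict(set)
        for set_id, s in enumerate(self.sets):
            for element in s:
                self.postings[element].add(set_id)

    def lookup(self, query, threshold):
        """Ids of input sets whose Jaccard similarity to `query` is above `threshold`."""
        query = frozenset(query)
        # Only sets sharing at least one element with the query can score
        # above a non-negative threshold, so candidates come from the
        # postings lists instead of a full scan of the input.
        candidates = set()
        for element in query:
            candidates |= self.postings.get(element, set())
        return sorted(i for i in candidates if jaccard(self.sets[i], query) > threshold)


index = InvertedSetIndex([{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}])
print(index.lookup({"a", "b", "c", "d"}, threshold=0.5))  # [0, 1]
print(hamming({"a", "b", "c"}, {"a", "b", "d"}))          # 2
```

The length of each postings list is the element frequency the abstract mentions (the number of input sets in which an element occurs); a fuller design could use it to skip very common elements when gathering candidates.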

4. Invention application

US20080306945A1 EXAMPLE-DRIVEN DESIGN OF EFFICIENT RECORD MATCHING QUERIES (in force)
Publication No.: US20080306945A1
Publication date: 2008-12-11
Application No.: US11758202
Filing date: 2007-06-05
Applicant: Surajit Chaudhuri, Bee-Chung Chen, Venkatesh Ganti, Shriraghav Kaushik
Inventor: Surajit Chaudhuri, Bee-Chung Chen, Venkatesh Ganti, Shriraghav Kaushik
IPC classification: G06F17/30
CPC classification: G06F17/30533, G06F17/30495
Abstract: Example-driven creation of record matching queries. The disclosed architecture employs techniques that exploit the availability of positive (or matching) and negative (non-matching) examples to search through this space and suggest an initial record matching query. The record matching task is modeled as that of designing an operator tree obtained by composing a few primitive operators. This ensures that record matching programs are executable efficiently and scalably over large input relations. The architecture joins records across multiple (e.g., two) relations (e.g., R and S). The architecture exploits the monotonicity property of similarity functions for record matching in the relations, in that any pair of matching records has a higher similarity value than non-matching record pairs on at least one similarity function.
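
The monotonicity property stated in this abstract can be made concrete with a small Python sketch. It is not the operator-tree design procedure of the patent. As a simplifying assumption, it learns one threshold per similarity function from the negative examples only (the highest score any non-matching pair reaches) and declares a pair a match when it beats that threshold on at least one function. The similarity function `token_jaccard`, the field names, and the helper names are hypothetical.

```python
def token_jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# One similarity function per attribute (hypothetical record layout).
SIMILARITY_FUNCTIONS = {
    "name": lambda r, s: token_jaccard(r["name"], s["name"]),
    "address": lambda r, s: token_jaccard(r["address"], s["address"]),
}


def learn_thresholds(negative_pairs):
    """Per function, the highest similarity seen on a non-matching example."""
    return {
        name: max(f(r, s) for r, s in negative_pairs)
        for name, f in SIMILARITY_FUNCTIONS.items()
    }


def is_match(r, s, thresholds):
    """Match if the pair beats every negative example on at least one function."""
    return any(f(r, s) > thresholds[name] for name, f in SIMILARITY_FUNCTIONS.items())


negatives = [
    ({"name": "John Smith", "address": "1 Main St"},
     {"name": "John Doe", "address": "9 Oak Ave"}),
    ({"name": "Ann Lee", "address": "5 Elm St"},
     {"name": "Bob Ray", "address": "7 Elm St"}),
]
thresholds = learn_thresholds(negatives)

r = {"name": "J. Smith", "address": "1 Main St"}
s = {"name": "John Smith", "address": "1 Main St"}
print(is_match(r, s, thresholds))  # True, the address similarity is 1.0
```

Positive examples are not used for learning here; in practice they would serve to check that each known matching pair is accepted by the learned thresholds.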

5. Invention grant

US08204866B2 Leveraging constraints for deduplication (in force)
Publication No.: US08204866B2
Publication date: 2012-06-19
Application No.: US11804400
Filing date: 2007-05-18
Applicant: Surajit Chaudhuri, Venkatesh Ganti, Shriraghav Kaushik, Anish Das Sarma
Inventor: Surajit Chaudhuri, Venkatesh Ganti, Shriraghav Kaushik, Anish Das Sarma
IPC classification: G06F17/30
CPC classification: G06F17/30489
Abstract: A deduplication algorithm that provides improved accuracy in data deduplication by using aggregate and/or groupwise constraints. Deduplication satisfies as many of these constraints as possible, rather than imposing them inflexibly as hard constraints. Additionally, textual similarity between tuples is leveraged to restrict the search space. The algorithm begins with a coarse initial partition of data records and continues by raising the similarity threshold until the threshold splits a given partition. This sequence of splits defines a rich space of alternatives. Over this space, an algorithm finds a partition of the input that maximizes constraint satisfaction. In the context of groupwise aggregation constraints for deduplication, all SQL (structured query language) aggregates are allowed, including summation.

6. Invention grant

US08046339B2 Example-driven design of efficient record matching queries (in force)
Publication No.: US08046339B2
Publication date: 2011-10-25
Application No.: US11758202
Filing date: 2007-06-05
Applicant: Surajit Chaudhuri, Bee Chung Chen, Venkatesh Ganti, Shriraghav Kaushik
Inventor: Surajit Chaudhuri, Bee Chung Chen, Venkatesh Ganti, Shriraghav Kaushik
IPC classification: G06F17/30
CPC classification: G06F17/30533, G06F17/30495
Abstract: Example-driven creation of record matching queries. The disclosed architecture employs techniques that exploit the availability of positive (or matching) and negative (non-matching) examples to search through this space and suggest an initial record matching query. The record matching task is modeled as that of designing an operator tree obtained by composing a few primitive operators. This ensures that record matching programs are executable efficiently and scalably over large input relations. The architecture joins records across multiple (e.g., two) relations (e.g., R and S). The architecture exploits the monotonicity property of similarity functions for record matching in the relations, in that any pair of matching records has a higher similarity value than non-matching record pairs on at least one similarity function.

7. Invention grant

US07610283B2 Disk-based probabilistic set-similarity indexes (in force)
Publication No.: US07610283B2
Publication date: 2009-10-27
Application No.: US11761425
Filing date: 2007-06-12
Applicant: Arvind Arasu, Venkatesh Ganti, Shriraghav Kaushik
Inventor: Arvind Arasu, Venkatesh Ganti, Shriraghav Kaushik
IPC classification: G06F17/30, G06F7/06, G06F7/08, G06F7/10
CPC classification: G06F17/30312, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99935, Y10S707/99937
Abstract: Input set indexing for set-similarity lookups. The architecture provides input to an indexing process that enables more efficient lookups for large data sets (e.g., disk-based) without requiring a full scan of the input. A new index structure is provided, the output of which is exact, rather than approximate. The similarity of two sets is specified using a similarity function that maps two sets to a numeric value that represents similarity of the two sets. Threshold-based lookups are addressed where two sets are considered similar if the numeric similarity score is above a threshold. The structure efficiently identifies all input sets within a distance k (e.g., a Hamming distance) of the query set. Additional information in the form of frequency of elements (the number of input sets in which an element occurs) is used to improve index performance.

8. Invention application

US20080288482A1 Leveraging constraints for deduplication (in force)
Publication No.: US20080288482A1
Publication date: 2008-11-20
Application No.: US11804400
Filing date: 2007-05-18
Applicant: Surajit Chaudhuri, Venkatesh Ganti, Shriraghav Kaushik
Inventor: Surajit Chaudhuri, Venkatesh Ganti, Shriraghav Kaushik
IPC classification: G06F17/30
CPC classification: G06F17/30489
Abstract: A deduplication algorithm that provides improved accuracy in data deduplication by using aggregate and/or groupwise constraints. Deduplication satisfies as many of these constraints as possible, rather than imposing them inflexibly as hard constraints. Additionally, textual similarity between tuples is leveraged to restrict the search space. The algorithm begins with a coarse initial partition of data records and continues by raising the similarity threshold until the threshold splits a given partition. This sequence of splits defines a rich space of alternatives. Over this space, an algorithm finds a partition of the input that maximizes constraint satisfaction. In the context of groupwise aggregation constraints for deduplication, all SQL (structured query language) aggregates are allowed, including summation.

9. Invention grant

US08249336B2 Learning string transformations from examples (in force)
Publication No.: US08249336B2
Publication date: 2012-08-21
Application No.: US12492311
Filing date: 2009-08-14
Applicant: Arvind Arasu, Surajit Chaudhuri, Shriraghav Kaushik
Inventor: Arvind Arasu, Surajit Chaudhuri, Shriraghav Kaushik
IPC classification: G06K9/00
CPC classification: G06F17/2765
Abstract: Techniques are described to leverage a set of sample or example matched pairs of strings to learn string transformation rules, which may be used to match data records that are semantically equivalent. In one embodiment, matched pairs of input strings are accessed. For a set of matched pairs, a set of one or more string transformation rules is learned. A transformation rule may include two strings determined to be semantically equivalent. The transformation rules are used to determine whether a first and second string match each other.
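
As a rough illustration of learning rules from matched pairs and then using them to compare strings, the Python sketch below takes the differing token spans of each example pair as a candidate rule. This is a naive stand-in and not the learning method of the patent; the function names `learn_rules`, `normalize`, and `strings_match` are hypothetical, and the alignment relies on the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def learn_rules(matched_pairs):
    """Collect (lhs, rhs) rewrites from the token spans that differ in each pair."""
    rules = set()
    for left, right in matched_pairs:
        a, b = left.lower().split(), right.lower().split()
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                rules.add((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return rules


def normalize(text, rules):
    """Rewrite every rule's left side to its right side, on whole tokens only."""
    padded = " " + " ".join(text.lower().split()) + " "
    for lhs, rhs in sorted(rules):
        padded = padded.replace(" " + lhs + " ", " " + rhs + " ")
    return padded.strip()


def strings_match(first, second, rules):
    """Two strings match if they are identical after normalization."""
    return normalize(first, rules) == normalize(second, rules)


pairs = [("Robert Smith", "Bob Smith"), ("12 Main Street", "12 Main St")]
rules = learn_rules(pairs)
print(sorted(rules))  # [('robert', 'bob'), ('street', 'st')]
print(strings_match("Robert Jones", "Bob Jones", rules))  # True
print(strings_match("Robert Jones", "Rob Jones", rules))  # False
```

Each rule is applied in one direction only, which keeps the comparison simple but means that conflicting or cyclic rules would need extra handling.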

10. Invention application

US20100325136A1 ERROR TOLERANT AUTOCOMPLETION (under examination, published)
Publication No.: US20100325136A1
Publication date: 2010-12-23
Application No.: US12490288
Filing date: 2009-06-23
Applicant: Surajit Chaudhuri, Shriraghav Kaushik
Inventor: Surajit Chaudhuri, Shriraghav Kaushik
IPC classification: G06F17/30, G06F3/048
CPC classification: G06F17/276
Abstract: Techniques for error-tolerant autocompletion are described. While displaying characters of an input string as they are inputted by a user, when a character is added to the input string by the user, matching strings may be selected from among a set of candidate strings by determining which of the candidate strings have a prefix whose characters match the characters of the input string within a given edit distance of the input string.
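
The matching criterion in this abstract (a candidate qualifies if some prefix of it is within a given edit distance of the typed string) can be sketched in Python with a standard dynamic-programming edit distance. The sketch checks every candidate in turn, which is an assumption made for brevity and says nothing about how the patent organizes its candidates; the names `prefix_edit_distance`, `autocomplete`, and `max_edits` are hypothetical.

```python
def prefix_edit_distance(typed, candidate):
    """Smallest edit distance between `typed` and any prefix of `candidate`."""
    # previous[j] holds the edit distance between typed[:i] and candidate[:j].
    previous = list(range(len(candidate) + 1))
    for i, typed_char in enumerate(typed, start=1):
        current = [i]
        for j, cand_char in enumerate(candidate, start=1):
            cost = 0 if typed_char == cand_char else 1
            current.append(min(
                previous[j] + 1,         # drop a typed character
                current[j - 1] + 1,      # add a missing candidate character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    # The final row covers every prefix of the candidate; keep the best one.
    return min(previous)


def autocomplete(typed, candidates, max_edits=1):
    """Candidates having some prefix within `max_edits` edits of `typed`."""
    return [c for c in candidates if prefix_edit_distance(typed, c) <= max_edits]


words = ["schwarzenegger", "shower", "table"]
print(autocomplete("shw", words, max_edits=1))  # ['schwarzenegger', 'shower']
```

Recomputing the whole table on every keystroke is wasteful; since each new character only adds one row, an interactive version could keep the previous row per candidate and extend it.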
