    • 1. Granted invention patent
    • Active learning of record matching packages
    • US09081817B2
    • 2015-07-14
    • US13084527
    • 2011-04-11
    • Arvind Arasu; Michaela Götz; Shriraghav Kaushik
    • Arvind Arasu; Michaela Götz; Shriraghav Kaushik
    • G06F17/30; G06N99/00
    • G06F17/30507; G06N99/005
    • An active learning record matching system and method for producing a record matching package that is used to identify pairs of duplicate records. Embodiments of the system and method allow a precision threshold to be specified and then generate a learned record matching package having precision greater than this threshold and a recall close to the best possible recall. Embodiments of the system and method use a blocking technique to restrict the space of record matching packages considered and to scale to large inputs. The learning method considers several record matching packages, estimates the precision and recall of the packages, and identifies the package with maximum recall having precision greater than or equal to the given precision threshold. A human domain expert labels a sample of record pairs in the output of the package as matches or non-matches, and this labeling is used to estimate the precision of the package.
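The selection step described in this abstract (estimate precision from an expert-labeled sample, then keep the highest-recall package that meets the precision threshold) can be sketched as follows. This is a minimal illustration, not the patented method: the `Candidate` class, its fields, and `pick_package` are hypothetical names, and recall is ranked by the estimated number of true matches (precision times output size), which orders candidates the same way as recall because the total number of true matches is the same for every candidate.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for one record matching package."""
    name: str
    output_size: int     # number of record pairs the package returns
    sample_labels: list  # expert labels for a sample of that output (True = match)


def estimated_precision(candidate):
    """Fraction of the labeled sample that the expert marked as a match."""
    return sum(candidate.sample_labels) / len(candidate.sample_labels)


def pick_package(candidates, precision_threshold):
    """Return the eligible candidate with the most estimated true matches.

    A candidate is eligible when its estimated precision is greater than or
    equal to the threshold. Returns None when no candidate qualifies.
    """
    eligible = [c for c in candidates
                if estimated_precision(c) >= precision_threshold]
    if not eligible:
        return None
    # Estimated true matches = precision * output size (a proxy for recall).
    return max(eligible, key=lambda c: estimated_precision(c) * c.output_size)
```

With a threshold of 0.9, a package returning 200 pairs at an estimated precision of 0.9 would be preferred over one returning 100 pairs at precision 1.0, since it is estimated to find more true duplicates.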
    • 2. Invention patent application
    • ACTIVE LEARNING OF RECORD MATCHING PACKAGES
    • US20120259802A1
    • 2012-10-11
    • US13084527
    • 2011-04-11
    • Arvind Arasu; Michaela Götz; Shriraghav Kaushik
    • Arvind Arasu; Michaela Götz; Shriraghav Kaushik
    • G06F15/18
    • G06F17/30507; G06N99/005
    • An active learning record matching system and method for producing a record matching package that is used to identify pairs of duplicate records. Embodiments of the system and method allow a precision threshold to be specified and then generate a learned record matching package having precision greater than this threshold and a recall close to the best possible recall. Embodiments of the system and method use a blocking technique to restrict the space of record matching packages considered and to scale to large inputs. The learning method considers several record matching packages, estimates the precision and recall of the packages, and identifies the package with maximum recall having precision greater than or equal to the given precision threshold. A human domain expert labels a sample of record pairs in the output of the package as matches or non-matches, and this labeling is used to estimate the precision of the package.
    • 3. Granted invention patent
    • Disk-based probabilistic set-similarity indexes
    • US07610283B2
    • 2009-10-27
    • US11761425
    • 2007-06-12
    • Arvind Arasu; Venkatesh Ganti; Shriraghav Kaushik
    • Arvind Arasu; Venkatesh Ganti; Shriraghav Kaushik
    • G06F17/30; G06F7/06; G06F7/08; G06F7/10
    • G06F17/30312; Y10S707/99931; Y10S707/99932; Y10S707/99933; Y10S707/99935; Y10S707/99937
    • Input set indexing for set-similarity lookups. The architecture provides input to an indexing process that enables more efficient lookups for large data sets (e.g., disk-based) without requiring a full scan of the input. A new index structure is provided, the output of which is exact, rather than approximate. The similarity of two sets is specified using a similarity function that maps two sets to a numeric value that represents similarity of the two sets. Threshold-based lookups are addressed where two sets are considered similar if the numeric similarity score is above a threshold. The structure efficiently identifies all input sets within a distance k (e.g., a Hamming distance) of the query set. Additional information in the form of frequency of elements (the number of input sets in which an element occurs) is used to improve index performance.
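To make the lookup semantics in this abstract concrete, the sketch below defines a threshold-based set-similarity lookup by brute force. It is only a reference definition of what such a lookup returns: it performs exactly the full scan that the described index is meant to avoid, and it does not reproduce the index structure. Jaccard similarity is an assumed choice of similarity function (the abstract does not name one), and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)


def hamming(a, b):
    """Hamming distance between two sets: size of their symmetric difference."""
    return len(set(a) ^ set(b))


def lookup(query, input_sets, threshold):
    """Return every input set whose similarity to the query is above the threshold.

    This scans all input sets; an index would return the same answer
    without touching every set.
    """
    return [s for s in input_sets if jaccard(query, s) > threshold]
```

For example, the sets {a, b, c} and {a, b, d} share two of four distinct elements, so their Jaccard similarity is 0.5 and their Hamming distance is 2.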
    • 5. Invention patent application
    • TRANSFORMATION-BASED FRAMEWORK FOR RECORD MATCHING
    • US20090210418A1
    • 2009-08-20
    • US12031715
    • 2008-02-15
    • Arvind Arasu; Surajit Chaudhuri; Shriraghav Kaushik
    • Arvind Arasu; Surajit Chaudhuri; Shriraghav Kaushik
    • G06F17/30
    • G06F17/30569; G06F17/30675; G06F17/30985
    • A transformation-based record matching technique. The technique provides a flexible way to account for synonyms and more general forms of string equivalences when performing record matching by taking as explicit input user-defined transformation rules (such as, for example, the fact that “Robert” and “Bob” are synonymous). The input string and user-defined transformation rules are used to generate a larger set of strings, which are used when performing record matching. Both the input string and data elements in a database can be transformed using the user-defined transformation rules in order to generate a larger set of potential record matches. These potential record matches can then be subjected to a threshold test in order to determine one or more best matches. Additionally, signature-based similarity functions are used to improve the computational efficiency of the technique.
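The expansion-then-threshold idea in this abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `RULES` table, the token-level rewriting, the use of token Jaccard similarity, and the function names. The signature-based optimization mentioned in the abstract is not shown; the sketch simply compares every generated variant pair.

```python
from itertools import product

# Hypothetical user-defined transformation rules: token -> equivalent tokens.
RULES = {"bob": ["robert"], "robert": ["bob"]}


def variants(text, rules):
    """Generate every string obtainable by optionally rewriting each token."""
    tokens = text.lower().split()
    options = [[t] + rules.get(t, []) for t in tokens]
    return {" ".join(choice) for choice in product(*options)}


def token_jaccard(x, y):
    """Jaccard similarity over whitespace-separated tokens."""
    x, y = set(x.split()), set(y.split())
    return len(x & y) / len(x | y) if x | y else 1.0


def best_similarity(a, b, rules):
    """Highest similarity over all pairs of variants of the two strings."""
    return max(token_jaccard(x, y)
               for x in variants(a, rules)
               for y in variants(b, rules))


def is_match(a, b, rules, threshold):
    """Threshold test on the best similarity between the expanded strings."""
    return best_similarity(a, b, rules) >= threshold
```

Without rules, "Bob Smith" and "Robert Smith" share only one of three distinct tokens; with the synonym rule applied, one variant of each string is identical, so the pair passes even a strict threshold.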
    • 7. Invention patent application
    • Disk-Based Probabilistic Set-Similarity Indexes
    • US20080313128A1
    • 2008-12-18
    • US11761425
    • 2007-06-12
    • Arvind Arasu; Venkatesh Ganti; Shriraghav Kaushik
    • Arvind Arasu; Venkatesh Ganti; Shriraghav Kaushik
    • G06F7/06; G06F17/30
    • G06F17/30312; Y10S707/99931; Y10S707/99932; Y10S707/99933; Y10S707/99935; Y10S707/99937
    • Input set indexing for set-similarity lookups. The architecture provides input to an indexing process that enables more efficient lookups for large data sets (e.g., disk-based) without requiring a full scan of the input. A new index structure is provided, the output of which is exact, rather than approximate. The similarity of two sets is specified using a similarity function that maps two sets to a numeric value that represents similarity of the two sets. Threshold-based lookups are addressed where two sets are considered similar if the numeric similarity score is above a threshold. The structure efficiently identifies all input sets within a distance k (e.g., a Hamming distance) of the query set. Additional information in the form of frequency of elements (the number of input sets in which an element occurs) is used to improve index performance.
    • 9. Invention patent application
    • STOP-AND-RESTART STYLE EXECUTION FOR LONG RUNNING DECISION SUPPORT QUERIES
    • US20090083238A1
    • 2009-03-26
    • US11859046
    • 2007-09-21
    • Surajit Chaudhuri; Shriraghav Kaushik; Abhijit Pol; Ravishankar Ramamurthy
    • Surajit Chaudhuri; Shriraghav Kaushik; Abhijit Pol; Ravishankar Ramamurthy
    • G06F17/30
    • G06F16/24561
    • Stop-and-restart query execution that partially leverages the work already performed during the initial execution of the query to reduce the execution time during a restart. The technique selectively saves information from a previous execution of the query so that the overhead associated with restarting the query execution can be bounded. Despite saving only limited information, the disclosed technique substantially reduces the running time of the restarted query. The stop-and-restart query execution technique is constrained to save and reuse only a bounded number of records (intermediate records or output records), thereby releasing all other resources, rather than some of the resources. The technique chooses a subset of the records found during normal execution to save and then skips the corresponding records when performing a scan during restart to prevent the duplication of execution. A skip-scan operator is employed to facilitate the disclosed restart technique.
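As a loose illustration of the save-and-skip idea in this abstract, the sketch below runs a scan that can be stopped early, keeps a bounded number of computed records, and on restart skips the rows whose results were saved. It is a toy model under stated assumptions, not the disclosed technique: the `execute` function, its parameters, and the rule for choosing which records to save (simply the first `budget` rows) are all hypothetical.

```python
def execute(rows, expensive_fn, saved=None, stop_at=None, budget=0):
    """Scan rows, applying expensive_fn to each one.

    saved   -- results kept from an earlier run, keyed by row position
    stop_at -- position at which the query is stopped (None = run to the end)
    budget  -- maximum number of results this run may save for a restart

    Returns (output, newly_saved, calls), where calls counts how many times
    expensive_fn actually ran.
    """
    saved = saved or {}
    output, newly_saved, calls = [], {}, 0
    for pos, row in enumerate(rows):
        if stop_at is not None and pos >= stop_at:
            break  # the query is stopped here; all other state is discarded
        if pos in saved:
            output.append(saved[pos])  # skip the row and reuse the saved record
            continue
        result = expensive_fn(row)
        calls += 1
        output.append(result)
        if len(newly_saved) < budget:
            newly_saved[pos] = result  # bounded amount of saved state
    return output, newly_saved, calls
```

A first run stopped after six of ten rows with a budget of four saves four records; the restarted run then produces the full output while calling the expensive function on only the six rows that were not saved.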
    • 10. Invention patent application
    • DATA PROFILE COMPUTATION
    • US20090006392A1
    • 2009-01-01
    • US11769050
    • 2007-06-27
    • Zhimin Chen; Venkatesh Ganti; Gunjan Jha; Shriraghav Kaushik; Vivek Narasayya
    • Zhimin Chen; Venkatesh Ganti; Gunjan Jha; Shriraghav Kaushik; Vivek Narasayya
    • G06F7/06; G06F17/30
    • G06F17/30536
    • Architecture that provides a data profile computation technique which employs key profile computation and data pattern profile computation. Key profile computation in a data table includes both exact keys and approximate keys, and is based on key strengths. A key strength of 100% is an exact key, and any other percentage is an approximate key. The key strength is estimated based on the number of table rows that have duplicated attribute values. Only column sets that exceed a threshold value are returned. Pattern profiling identifies a small set of regular expression patterns which best describe the patterns within a given set of attribute values. Pattern profiling includes three phases: a first phase for determining token regular expressions, a second phase for determining candidate regular expressions, and a third phase for identifying the best regular expressions of the candidates that match the attribute values.
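The key-profile part of this abstract can be sketched as below. The abstract does not give the exact strength formula, so this sketch assumes one: the number of distinct value combinations divided by the number of rows, which is 1.0 (100%) exactly when the column set has no duplicated values. The function names and the `max_width` cap on column-set size are hypothetical, and the computation is exact over all rows rather than an estimate.

```python
from itertools import combinations


def key_strength(rows, columns):
    """Assumed definition: distinct value combinations / number of rows."""
    if not rows:
        return 1.0
    distinct = {tuple(row[c] for c in columns) for row in rows}
    return len(distinct) / len(rows)


def candidate_keys(rows, threshold, max_width=2):
    """Return the column sets whose key strength exceeds the threshold.

    rows is a list of dicts sharing the same keys; column sets of up to
    max_width columns are examined.
    """
    columns = sorted(rows[0]) if rows else []
    found = {}
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            strength = key_strength(rows, combo)
            if strength > threshold:
                found[combo] = strength
    return found
```

In a four-row table where `city` takes two values and `name` takes two values but every (`city`, `name`) pair is unique, neither column alone is a key (strength 0.5), while the pair is an exact key (strength 1.0).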