    • 31. Granted invention patent
    • Leveraging constraints for deduplication
    • US08204866B2
    • 2012-06-19
    • US11804400
    • 2007-05-18
    • Surajit Chaudhuri, Venkatesh Ganti, Shriraghav Kaushik, Anish Das Sarma
    • G06F17/30
    • G06F17/30489
    • A deduplication algorithm that provides improved accuracy in data deduplication by using aggregate and/or groupwise constraints. Deduplication satisfies as many of these constraints as possible rather than imposing them inflexibly as hard constraints. Additionally, textual similarity between tuples is leveraged to restrict the search space. The algorithm begins with a coarse initial partition of data records and continues by raising the similarity threshold until the threshold splits a given partition. This sequence of splits defines a rich space of alternatives. Over this space, an algorithm finds a partition of the input that maximizes constraint satisfaction. In the context of groupwise aggregation constraints for deduplication, all SQL (structured query language) aggregates are allowed, including summation.
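The threshold-raising idea in this abstract can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the patented algorithm: it tries a few global similarity thresholds rather than searching the full space of per-partition splits, it uses character-bigram Jaccard similarity as a stand-in for the textual similarity measure, and the names `partition`, `best_partition`, and the sample constraint are hypothetical.

```python
def bigrams(text):
    """Set of overlapping 2-character substrings of the lowercased text."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def similarity(a, b):
    """Jaccard similarity over character bigrams (an assumed stand-in measure)."""
    ga, gb = bigrams(a), bigrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0


def partition(records, threshold):
    """Single-linkage grouping: records whose names are similar at or above
    the threshold end up in the same group. Records are (name, amount) pairs."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i][0], records[j][0]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    return list(groups.values())


def best_partition(records, thresholds, constraint):
    """Raise the threshold step by step (coarse to fine) and keep the
    partition in which the most groups satisfy the soft constraint."""
    best, best_score = None, -1
    for t in sorted(thresholds):
        groups = partition(records, t)
        score = sum(1 for g in groups if constraint(g))
        if score > best_score:
            best, best_score = groups, score
    return best, best_score


if __name__ == "__main__":
    records = [("Acme Corp", 100), ("Acme Corp.", 50), ("Acme Inc", 70), ("Zenith", 30)]
    expected_totals = {150, 70, 30}  # hypothetical per-customer totals from another source
    groups, score = best_partition(
        records, [0.2, 0.5, 0.95],
        lambda g: sum(amount for _, amount in g) in expected_totals,
    )
    print(score, groups)
```

The constraint here is a groupwise SUM aggregate: a group counts as satisfied when its total matches a known customer total. Because the constraint is only scored, never enforced, a partition that violates it for some groups can still be chosen.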
    • 32. Granted invention patent
    • Example-driven design of efficient record matching queries
    • US08046339B2
    • 2011-10-25
    • US11758202
    • 2007-06-05
    • Surajit Chaudhuri, Bee Chung Chen, Venkatesh Ganti, Shriraghav Kaushik
    • G06F17/30
    • G06F17/30533, G06F17/30495
    • Example-driven creation of record matching queries. The disclosed architecture employs techniques that exploit the availability of positive (matching) and negative (non-matching) examples to search through the space of candidate queries and suggest an initial record matching query. The record matching task is modeled as that of designing an operator tree obtained by composing a few primitive operators. This ensures that record matching programs are executable efficiently and scalably over large input relations. The architecture joins records across multiple (e.g., two) relations (e.g., R and S). The architecture exploits the monotonicity property of similarity functions for record matching in the relations, in that any pair of matching records has a higher similarity value than non-matching record pairs on at least one similarity function.
    • 33. Granted invention patent
    • Detecting duplicate records in databases
    • US07685090B2
    • 2010-03-23
    • US11182590
    • 2005-07-14
    • Surajit Chaudhuri, Venkatesh Ganti, Rohit Ananthakrishna
    • G06F17/30
    • G06F17/30303, Y10S707/99931, Y10S707/99942
    • The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicate tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high-quality, scalable duplicate detection process.
    • 35. Invention patent application
    • Leveraging constraints for deduplication
    • US20080288482A1
    • 2008-11-20
    • US11804400
    • 2007-05-18
    • Surajit Chaudhuri, Venkatesh Ganti, Shriraghav Kaushik
    • G06F17/30
    • G06F17/30489
    • A deduplication algorithm that provides improved accuracy in data deduplication by using aggregate and/or groupwise constraints. Deduplication satisfies as many of these constraints as possible rather than imposing them inflexibly as hard constraints. Additionally, textual similarity between tuples is leveraged to restrict the search space. The algorithm begins with a coarse initial partition of data records and continues by raising the similarity threshold until the threshold splits a given partition. This sequence of splits defines a rich space of alternatives. Over this space, an algorithm finds a partition of the input that maximizes constraint satisfaction. In the context of groupwise aggregation constraints for deduplication, all SQL (structured query language) aggregates are allowed, including summation.
    • 36. Granted invention patent
    • Efficient fuzzy match for evaluating data records
    • US07296011B2
    • 2007-11-13
    • US10600083
    • 2003-06-20
    • Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani
    • G06F7/00, G06F17/30
    • G06F17/30542, G06F17/30303, Y10S707/99933
    • To help ensure high data quality, data warehouses validate and, if needed, clean incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.
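As a rough illustration of q-gram based fuzzy matching against a reference table, the following Python sketch scores an incoming value against each reference value. It uses plain Jaccard similarity over q-gram sets, which is a simpler, swapped-in measure and not the similarity function disclosed in the patent; it also scans the reference table linearly instead of using an index. The function names and the `min_similarity` parameter are hypothetical.

```python
def qgrams(text, q=3):
    """Set of overlapping q-character substrings of the lowercased text.
    Strings shorter than q are treated as a single gram."""
    s = text.lower()
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuzzy_match(value, reference, q=3, min_similarity=0.3):
    """Return (best_reference_value, score), or (None, score) when no
    reference value reaches min_similarity."""
    grams = qgrams(value, q)
    best, best_score = None, 0.0
    for candidate in reference:
        score = jaccard(grams, qgrams(candidate, q))
        if score > best_score:
            best, best_score = candidate, score
    if best_score < min_similarity:
        return None, best_score
    return best, best_score


if __name__ == "__main__":
    reference = ["Boeing Company", "Bon Corporation"]
    print(fuzzy_match("Boeing Compny", reference))
```

A misspelled input such as "Boeing Compny" still shares most of its 3-grams with "Boeing Company", so it can be mapped to the clean reference value even though an exact match fails.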
    • 38. Invention patent application
    • LEVERAGING CROSS-DOCUMENT CONTEXT TO LABEL ENTITY
    • US20090282012A1
    • 2009-11-12
    • US12114824
    • 2008-05-05
    • Arnd Christian Konig, Venkatesh Ganti
    • G06F7/06, G06F17/30
    • G06F17/278, G06F17/2785, Y10S707/962
    • Entities, such as people, places and things, are labeled based on information collected across a possibly large number of documents. One or more documents are scanned to recognize the entities, and features are extracted from the context in which those entities occur in the documents. Observed entity-feature pairs are stored either in an in-memory store or an external store. A store manager optimizes use of the limited amount of space for an in-memory store by determining which store to put an entity-feature pair in, and when to evict features from the in-memory store to make room for new pairs. Features that may be observed in an entity's context may take forms such as specific word sequences or membership in a particular list.
    • 39. Invention patent application
    • DATA PROFILE COMPUTATION
    • US20090006392A1
    • 2009-01-01
    • US11769050
    • 2007-06-27
    • Zhimin Chen, Venkatesh Ganti, Gunjan Jha, Shriraghav Kaushik, Vivek Narasayya
    • G06F7/06, G06F17/30
    • G06F17/30536
    • Architecture that provides a data profile computation technique which employs key profile computation and data pattern profile computation. Key profile computation in a data table includes both exact keys and approximate keys, and is based on key strengths. A key strength of 100% is an exact key, and any other percentage is an approximate key. The key strength is estimated based on the number of table rows that have duplicated attribute values. Only column sets that exceed a threshold value are returned. Pattern profiling identifies a small set of regular expression patterns which best describe the patterns within a given set of attribute values. Pattern profiling includes three phases: a first phase for determining token regular expressions, a second phase for determining candidate regular expressions, and a third phase for identifying the best regular expressions of the candidates that match the attribute values.
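The key-profile part of this abstract can be sketched in Python as follows. The abstract does not give the exact formula, so this sketch assumes key strength is the percentage of distinct value combinations among all rows (100% means no duplicates, i.e. an exact key) and computes it exactly rather than estimating it. The names `key_strength`, `profile_keys`, `max_columns`, and `threshold` are hypothetical.

```python
from itertools import combinations


def key_strength(rows, columns):
    """Assumed definition: percentage of distinct value combinations over
    the given columns. 100.0 means the column set is an exact key."""
    if not rows:
        return 0.0
    distinct = {tuple(row[c] for c in columns) for row in rows}
    return 100.0 * len(distinct) / len(rows)


def profile_keys(rows, max_columns=2, threshold=75.0):
    """Return (column_set, strength) pairs whose strength is at or above
    the threshold, strongest and smallest column sets first."""
    names = list(rows[0]) if rows else []
    results = []
    for size in range(1, max_columns + 1):
        for cols in combinations(names, size):
            strength = key_strength(rows, cols)
            if strength >= threshold:
                results.append((cols, strength))
    return sorted(results, key=lambda item: (-item[1], len(item[0])))


if __name__ == "__main__":
    table = [
        {"id": 1, "city": "A", "zip": "1"},
        {"id": 2, "city": "A", "zip": "1"},
        {"id": 3, "city": "B", "zip": "2"},
        {"id": 4, "city": "B", "zip": "3"},
    ]
    for cols, strength in profile_keys(table):
        print(cols, strength)
```

In the sample table, `id` is an exact key (100%), `zip` is an approximate key (75%), and `city` (50%) falls below the threshold and is not returned.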
    • 40. Invention patent application
    • Segmentation of strings into structured records
    • US20050234906A1
    • 2005-10-20
    • US10825488
    • 2004-04-14
    • Venkatesh Ganti, Theodore Vassilakis, Yevgeny Agichtein
    • G06F7/00, G06F17/30
    • G06F17/30569, Y10S707/99933, Y10S707/99935
    • A system for segmenting strings into component parts for use with a database management system. A reference table of string records is segmented into multiple substrings corresponding to database attributes. The substrings within an attribute are analyzed to provide a state model that assumes a beginning, a middle and an ending token topology for that attribute. A null token takes into account an empty attribute component, and copying of states allows for erroneous token insertions and misordering. Once the model is created from the clean data, the process breaks or parses an input record into a sequence of tokens. The process then determines a most probable segmentation of the input record by comparing the tokens of the input record with state models derived for attributes from the reference table.