    • 1. Granted invention patent
    • Multi-dimensional database record compression utilizing optimized cluster models
    • US06633882B1
    • 2003-10-14
    • US09606964
    • 2000-06-29
    • Usama Fayyad; Jayavel Shanmugasundaram
    • Usama Fayyad; Jayavel Shanmugasundaram
    • G06F17/30
    • G06F17/30592; Y10S707/99936; Y10S707/99942
    • Apparatus and method for use in querying a database containing data records. The database is characterized by a compression scheme to provide data clustering information. In accordance with an exemplary embodiment of the invention, the functional representation of data clustering is a Gaussian, and queries are performed by integrating the Gaussian corresponding to each of the data clusters over the selected ranges to determine the sum or the count of data records from the database that fall within those ranges. The process chooses a value for the cluster number K. The cluster model is next broken up into areas (tiles) based on user-defined parameters. Data from the database is then classified based on the tiling information. A sorted version of the classified data, ordered by cluster number and then by tile number within the cluster, is generated. This data is then evaluated to test the sufficiency of the model created during the clustering.
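The range-count query described in this abstract can be illustrated with a minimal sketch. It assumes each cluster is an axis-aligned Gaussian (diagonal covariance), so the integral over a rectangular range factors into a product of one-dimensional normal CDF differences. The names `normal_cdf` and `estimate_count` and the dictionary layout of a cluster are hypothetical and do not come from the patent.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a 1-D normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def estimate_count(clusters, ranges):
    """Estimate how many records fall inside a rectangular range.

    clusters: list of dicts with "count", "means", "stds" (one entry per dimension).
    ranges:   list of (low, high) pairs, one per dimension.
    Assumption: diagonal covariance, so each dimension integrates independently.
    """
    total = 0.0
    for cluster in clusters:
        mass = 1.0
        for (low, high), mean, std in zip(ranges, cluster["means"], cluster["stds"]):
            mass *= normal_cdf(high, mean, std) - normal_cdf(low, mean, std)
        # Each cluster contributes its record count times the probability mass in range.
        total += cluster["count"] * mass
    return total
```

For a single cluster of 1000 records with mean 0 and standard deviation 1, the range [-1, 1] yields an estimate of about 683 records, matching the familiar one-sigma mass of a normal distribution.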
    • 2. Granted invention patent
    • Multi-dimensional database and data cube compression for aggregate query support on numeric dimensions
    • US06549907B1
    • 2003-04-15
    • US09296831
    • 1999-04-22
    • Usama Fayyad; Jayavel Shanmugasundaram
    • Usama Fayyad; Jayavel Shanmugasundaram
    • G06F17/30
    • G06F17/30489; G06F17/30324; G06F17/30592; G06F17/30598; Y10S707/99933; Y10S707/99942; Y10S707/99953
    • An apparatus and method for efficiently compressing contents of a database system to support ad hoc querying and OLAP type aggregation queries. This invention consists of a new compressed representation of the data cube that (a) drastically reduces storage requirements, (b) does not require the discretization hierarchy along each query dimension to be fixed beforehand and (c) treats each dimension as a potential target measure and supports multiple aggregation functions without additional storage costs. The tradeoff is approximate, yet relatively accurate, answers to queries. We outline mechanisms to reduce the error in the approximation. Our performance evaluation indicates that our compression technique effectively addresses the limitation of existing approaches. The basic method relies on representing the contents of the database by a probability distribution consisting of a mixture of Gaussians. Aggregation queries, be they multi-dimensional, conjunctive, or disjunctive, can be answered by performing integration over the probability distribution.
    • 3. Granted invention patent
    • Varying cluster number in a scalable clustering system for use with large databases
    • US06449612B1
    • 2002-09-10
    • US09607365
    • 2000-06-30
    • Paul S. Bradley; Usama Fayyad
    • Paul S. Bradley; Usama Fayyad
    • G06F704
    • G06K9/6221; G06F17/3061; G06F2216/03; G06K9/6262; Y10S707/99936; Y10S707/99945; Y10S707/99952
    • In one exemplary embodiment, the invention provides a data mining system for use in finding clusters of data items in a database or any other data storage medium. A portion of the data in the database is read from a storage medium and brought into a rapid access memory buffer whose size is determined by the user or operating system depending on available memory resources. Data contained in the data buffer is used to update the original model data distributions in each of the K clusters in a clustering model. Some of the data belonging to a cluster is summarized or compressed and stored as a reduced form of the data representing sufficient statistics of the data. More data is accessed from the database and the models are updated. An updated set of parameters for the clusters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine if further data should be accessed from the database. Each time data is read from the database, a holdout set of data is used to evaluate the then-current model as well as other possible cluster models chosen from a candidate set of cluster models. The evaluation of the holdout data set allows a cluster model with a different cluster number K′ to be chosen if that model more accurately models the data based upon the evaluation of the holdout set.
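The holdout evaluation in this abstract can be sketched as a comparison of candidate models by their log-likelihood on held-out data. This is only an illustration under stated assumptions: the candidates are one-dimensional Gaussian mixtures given as lists of (weight, mean, standard deviation) triples, and log-likelihood is the scoring rule. The abstract does not state which evaluation measure is used, and the names `log_likelihood` and `choose_model` are hypothetical.

```python
import math


def log_likelihood(model, holdout):
    """Sum of log densities of the holdout points under a 1-D Gaussian mixture.

    model: list of (weight, mean, std) triples, one per cluster.
    Note: a point with zero density under the model would make math.log fail;
    this sketch assumes every holdout point has positive density.
    """
    total = 0.0
    for x in holdout:
        density = sum(
            weight * math.exp(-0.5 * ((x - mean) / std) ** 2)
            / (std * math.sqrt(2.0 * math.pi))
            for weight, mean, std in model
        )
        total += math.log(density)
    return total


def choose_model(candidates, holdout):
    """Return the key (for example the cluster number K) of the best-scoring model."""
    return max(candidates, key=lambda k: log_likelihood(candidates[k], holdout))
```

With holdout points grouped near 0 and near 10, a two-cluster candidate centered on those values scores higher than a single wide cluster, so `choose_model` would return the key of the two-cluster model.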
    • 7. Granted invention patent
    • Density-based indexing method for efficient execution of high dimensional nearest-neighbor queries on large databases
    • US06263334B1
    • 2001-07-17
    • US09189229
    • 1998-11-11
    • Usama Fayyad; Kristin P. Bennett; Dan Geiger
    • Usama Fayyad; Kristin P. Bennett; Dan Geiger
    • G06F17/30
    • G06F17/30333; G06F17/30707; G06K9/6226; G06K9/6273; Y10S707/99935; Y10S707/99936; Y10S707/99943
    • Method and apparatus for efficiently performing nearest neighbor queries on a database of records, wherein each record has a large number of attributes, by automatically extracting a multidimensional index from the data. The method is based on first obtaining a statistical model of the content of the data in the form of a probability density function. This density is then used to decide how data should be reorganized on disk for efficient nearest neighbor queries. At query time, the model decides the order in which data should be scanned. It also provides the means for evaluating the probability of correctness of the answer found so far in the partial scan of data determined by the model. In this invention, a clustering process is performed on the database to produce multiple data clusters. Each cluster is characterized by a cluster model. The set of clusters represents a probability density function in the form of a mixture model. A new database of records is built having an augmented record format that contains the original record attributes and an additional record attribute containing a cluster number for each record based on the clustering step. The cluster model uses a probability density function for each cluster, so the process of augmenting the attributes of each record is accomplished by evaluating each record's probability with respect to each cluster. Once the augmented records are used to build a database, the augmented attribute is used as an index into the database so that nearest neighbor query analysis can be conducted very efficiently using an indexed lookup process. As the database is queried, the probability density function is used to determine the order in which clusters or database pages are scanned. The probability density function is also used to determine when scanning can stop because the nearest neighbor has been found with high probability.
    • 8. Granted invention patent
    • Identifying and reporting on frequent sequences of events in usage data
    • US07051029B1
    • 2006-05-23
    • US09755971
    • 2001-01-05
    • Usama Fayyad; Neal Rothleder; Cheng Yang
    • Usama Fayyad; Neal Rothleder; Cheng Yang
    • G06F17/30
    • G06F17/30876; Y10S707/99943
    • A method, system, and computer-readable medium for identifying frequently occurring sequences of interaction events of interest are described. In particular, techniques are described for receiving multiple groups, each having related interaction events in serial or sequential order, and for determining sequences of interaction events that frequently occur in the multiple groups. Reports that include information about the determined frequent sequences can also be generated and provided. The techniques can at times be used to provide a service to customers in which logs containing data about interaction events related to that customer (e.g., usage events for a provided service or a provided Website) are received or obtained, in which frequent sequences in the log data are identified, and in which reports about the frequent sequences are provided to representatives of the customer (e.g., remotely over the Web based on interactive specifications).
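One simple way to picture the frequent-sequence determination is to count, for every contiguous run of events of a fixed length, the number of groups in which it appears. The sketch below makes several assumptions that the abstract does not state: sequences are contiguous, they have a fixed length, and "frequent" means appearing in at least `min_support` groups. The function name `frequent_sequences` is hypothetical.

```python
from collections import Counter


def frequent_sequences(groups, length, min_support):
    """Return contiguous event sequences that occur in at least min_support groups.

    groups: list of event lists, each already in sequential order (e.g. one session).
    A sequence is counted at most once per group, however often it repeats there.
    """
    support = Counter()
    for events in groups:
        seen_in_group = {
            tuple(events[i:i + length])
            for i in range(len(events) - length + 1)
        }
        support.update(seen_in_group)
    return {seq: count for seq, count in support.items() if count >= min_support}
```

Given three sessions `["home", "search", "cart"]`, `["home", "search", "exit"]`, and `["search", "cart", "pay"]`, a call with `length=2` and `min_support=2` reports `("home", "search")` and `("search", "cart")`, each with a support of 2.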
    • 9. Granted invention patent
    • Method of reducing dimensionality of a set of attributes used to characterize a sparse data set
    • US06735589B2
    • 2004-05-11
    • US09876321
    • 2001-06-07
    • Paul S. Bradley; Demetrios Achlioptas; Christos Faloutsos; Usama Fayyad
    • Paul S. Bradley; Demetrios Achlioptas; Christos Faloutsos; Usama Fayyad
    • G06F17/30
    • G06K9/6228; Y10S707/99936; Y10S707/99942; Y10S707/99943
    • A dimensionality reduction method of generating a reduced-dimension matrix data set Dnew of dimension m×k from an original matrix data set D of dimension m×n, wherein n>k. The method selects a subset of k columns from the set of n columns in the original data set D, where the m rows correspond to observations Ri (i = 1, ..., m), the n columns correspond to attributes Aj (j = 1, ..., n), and dij is the data value associated with observation Ri and attribute Aj. The data values in the reduced data set Dnew for each of the selected k attributes are identical to the data values of the corresponding attributes in the original data set. The steps of the method include: for each of the attributes Aj in the original data set D, calculating the variance of the data values associated with attribute Aj, where the variance value Var(Aj) is calculated as $\mathrm{Var}(A_j) = \frac{1}{m} \sum_{i=1}^{m} \left(d_{ij} - \mathrm{Mean}(A_j)\right)^2$, where Mean(Aj) is the mean value of the data values corresponding to attribute Aj; selecting the k attributes having the greatest variance values; and generating the reduced data set Dnew by selecting data values in the original data set D corresponding to the selected k attributes.
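The three steps listed in this abstract (compute the variance of each attribute, keep the k attributes with the greatest variance, copy their values unchanged) can be sketched as follows. The function name `select_top_variance_columns` is hypothetical, and keeping the selected columns in their original left-to-right order is an assumption. The abstract does not say how ties or column order are handled.

```python
def select_top_variance_columns(data, k):
    """Keep the k columns of `data` (a list of equal-length rows) with the largest variance.

    Variance uses the population form from the abstract:
    Var(Aj) = (1/m) * sum over i of (dij - Mean(Aj))^2.
    Returns (selected column indices, reduced data set).
    """
    m = len(data)
    n = len(data[0])
    variances = []
    for j in range(n):
        column = [row[j] for row in data]
        mean = sum(column) / m
        variances.append(sum((value - mean) ** 2 for value in column) / m)
    # Rank columns by variance, take the top k, then restore original column order
    # (the ordering choice is an assumption of this sketch).
    ranked = sorted(range(n), key=lambda j: variances[j], reverse=True)
    selected = sorted(ranked[:k])
    reduced = [[row[j] for j in selected] for row in data]
    return selected, reduced
```

For the 3×3 data set `[[1, 0, 5], [1, 10, 6], [1, 20, 7]]` and `k=2`, the constant first column has zero variance and is dropped, leaving columns 1 and 2 with their original values.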
    • 10. Granted invention patent
    • Scalable system for K-means clustering of large databases
    • US6012058A
    • 2000-01-04
    • US42540
    • 1998-03-17
    • Usama Fayyad; Paul S. Bradley; Cory Reina
    • Usama Fayyad; Paul S. Bradley; Cory Reina
    • G06F17/30; G06F17/00
    • G06F17/30705; G06K9/6223; Y10S707/99932; Y10S707/99933; Y10S707/99934; Y10S707/99935; Y10S707/99936
    • In one exemplary embodiment, the invention provides a data mining system for use in evaluating data in a database. Before the data evaluation begins, a cluster number K is chosen for use in categorizing the data in the database into K different clusters, and initial guesses at the means, or centroids, of each cluster are provided. Then a portion of the data in the database is read from a storage medium and brought into a rapid access memory. Data contained in the data portion is used to update the original guesses at the centroids of each of the K clusters. Some of the data belonging to a cluster is summarized or compressed and stored as a summarization of the data. More data is accessed from the database and assigned to a cluster. An updated mean for the clusters is determined from the summarized data and the newly acquired data. A stopping criterion is evaluated to determine if further data should be accessed from the database. If further data is needed to characterize the clusters, more data is gathered from the database and used in combination with already compressed data until the stopping criterion has been met.
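The idea of updating centroids from summarized data plus newly read data can be sketched with per-cluster running counts and sums. This is a deliberately simplified illustration, not the patented procedure: it compresses every point into the summary immediately, makes a single pass over the chunks, and has no stopping criterion, whereas the abstract says only some of the data is summarized and that reading continues until a stopping criterion is met. The names `nearest` and `chunked_kmeans` are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` by squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def chunked_kmeans(chunks, initial_centroids):
    """Update K centroids from data that arrives one chunk (buffer load) at a time.

    Only a count and a coordinate-wise sum are kept per cluster, so each chunk
    can be discarded after it has been folded into the summary.
    """
    centroids = [list(c) for c in initial_centroids]
    dim = len(centroids[0])
    counts = [0] * len(centroids)
    sums = [[0.0] * dim for _ in centroids]
    for chunk in chunks:
        for point in chunk:
            k = nearest(point, centroids)
            counts[k] += 1
            for d in range(dim):
                sums[k][d] += point[d]
        # Recompute each mean from the summary (old data) plus the new chunk.
        for k in range(len(centroids)):
            if counts[k] > 0:
                centroids[k] = [s / counts[k] for s in sums[k]]
    return centroids, counts
```

With initial centroids at 0 and 10 and two chunks `[(0.0,), (10.0,)]` and `[(1.0,), (11.0,)]`, the centroids end at 0.5 and 10.5, each summarizing two points.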