Quick Patent Search - Fast search of global patents, free commercial patent database - IPRDB

1. Granted invention patent

US06633882B1 Multi-dimensional database record compression utilizing optimized cluster models (in force)
Publication No.: US06633882B1
Publication date: 2003-10-14
Application No.: US09606964
Filing date: 2000-06-29
Applicant: Usama Fayyad, Jayavel Shanmugasundaram
Inventor: Usama Fayyad, Jayavel Shanmugasundaram
IPC classification: G06F17/30
CPC classification: G06F17/30592, Y10S707/99936, Y10S707/99942
Abstract: Apparatus and method for use in querying a database containing data records. The database is characterized by a compression scheme to provide data clustering information. In accordance with an exemplary embodiment of the invention, a functional representation of data clustering is a Gaussian, and the queries are performed by integrating the Gaussian corresponding to each of the data clusters over the ranges to determine the sum or the count of data records from the database that fall within the selected ranges. The process chooses a value for the cluster number K. The cluster model is next broken up into areas (tiles) based on user-defined parameters. Data from the database is then classified based on the tiling information. A sorted version of the classified data, ordered by cluster number and then by the tile number within the cluster, is generated. This data is then evaluated to test the sufficiency of the model created during the clustering.
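
The range-count step in the abstract above can be illustrated with a minimal sketch. It assumes one numeric dimension and that each cluster is summarized by a record count, a mean, and a standard deviation; the function names and the sample model are hypothetical and are not taken from the patent.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a 1-D Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def estimate_count(clusters, low, high):
    """Approximate how many records fall in [low, high].

    Each cluster is a (record_count, mean, std) triple. The estimate is the
    sum over clusters of record_count times the Gaussian mass in the range.
    """
    total = 0.0
    for record_count, mean, std in clusters:
        mass = normal_cdf(high, mean, std) - normal_cdf(low, mean, std)
        total += record_count * mass
    return total

# Hypothetical model: two clusters summarizing 1,500 records.
clusters = [(1000, 0.0, 1.0), (500, 10.0, 2.0)]
print(estimate_count(clusters, -1.0, 1.0))
```

The answer is approximate by construction: it is only as good as the fit of the Gaussians to the underlying records, which is why the abstract describes a later step that tests the sufficiency of the model.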

2. Granted invention patent

US06549907B1 Multi-dimensional database and data cube compression for aggregate query support on numeric dimensions (in force)
Publication No.: US06549907B1
Publication date: 2003-04-15
Application No.: US09296831
Filing date: 1999-04-22
Applicant: Usama Fayyad, Jayavel Shanmugasundaram
Inventor: Usama Fayyad, Jayavel Shanmugasundaram
IPC classification: G06F17/30
CPC classification: G06F17/30489, G06F17/30324, G06F17/30592, G06F17/30598, Y10S707/99933, Y10S707/99942, Y10S707/99953
Abstract: An apparatus and method for efficiently compressing contents of a database system to support ad hoc querying and OLAP type aggregation queries. This invention consists of a new compressed representation of the data cube that (a) drastically reduces storage requirements, (b) does not require the discretization hierarchy along each query dimension to be fixed beforehand and (c) treats each dimension as a potential target measure and supports multiple aggregation functions without additional storage costs. The tradeoff is approximate, yet relatively accurate, answers to queries. We outline mechanisms to reduce the error in the approximation. Our performance evaluation indicates that our compression technique effectively addresses the limitations of existing approaches. The basic method relies on representing the contents of the database by a probability distribution consisting of a mixture of Gaussians. Aggregation queries, be they multi-dimensional, conjunctive, or disjunctive, can be answered by performing integration over the probability distribution.
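
As a rough illustration of answering conjunctive and disjunctive count queries by integrating a Gaussian mixture, the sketch below assumes axis-aligned (diagonal-covariance) components, so the mass inside a box factorizes per dimension. The disjunction of two boxes is handled by inclusion-exclusion. All names and the sample mixture are hypothetical; the abstract does not prescribe this particular form.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a 1-D Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def box_mass(component, box):
    """Weighted probability mass of one axis-aligned Gaussian inside a box.

    component: (weight, means, stds); box: one (low, high) pair per dimension.
    """
    weight, means, stds = component
    mass = weight
    for (low, high), mean, std in zip(box, means, stds):
        mass *= normal_cdf(high, mean, std) - normal_cdf(low, mean, std)
    return mass

def conjunctive_count(mixture, n_records, box):
    """Estimated number of records satisfying every range in the box."""
    return n_records * sum(box_mass(c, box) for c in mixture)

def intersect(box_a, box_b):
    """Overlap of two boxes; an empty overlap collapses to zero width."""
    result = []
    for (a_lo, a_hi), (b_lo, b_hi) in zip(box_a, box_b):
        lo = max(a_lo, b_lo)
        result.append((lo, max(lo, min(a_hi, b_hi))))
    return result

def disjunctive_count(mixture, n_records, box_a, box_b):
    """Estimated number of records in box_a OR box_b (inclusion-exclusion)."""
    return (conjunctive_count(mixture, n_records, box_a)
            + conjunctive_count(mixture, n_records, box_b)
            - conjunctive_count(mixture, n_records, intersect(box_a, box_b)))

# Hypothetical single-component mixture over two numeric dimensions.
mixture = [(1.0, [0.0, 0.0], [1.0, 1.0])]
whole = [(-1.0, 1.0), (-1.0, 1.0)]
print(conjunctive_count(mixture, 1000, whole))
```

Because only the mixture parameters are stored, the same model answers queries over any choice of ranges, which matches the abstract's point that the discretization hierarchy need not be fixed beforehand.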

3. Granted invention patent

US06449612B1 Varying cluster number in a scalable clustering system for use with large databases (in force)
Publication No.: US06449612B1
Publication date: 2002-09-10
Application No.: US09607365
Filing date: 2000-06-30
Applicant: Paul S. Bradley, Usama Fayyad
Inventor: Paul S. Bradley, Usama Fayyad
IPC classification: G06F7/04
CPC classification: G06K9/6221, G06F17/3061, G06F2216/03, G06K9/6262, Y10S707/99936, Y10S707/99945, Y10S707/99952
Abstract: In one exemplary embodiment the invention provides a data mining system for use in finding clusters of data items in a database or any other data storage medium. A portion of the data in the database is read from a storage medium and brought into a rapid access memory buffer whose size is determined by the user or operating system depending on available memory resources. Data contained in the data buffer is used to update the original model data distributions in each of the K clusters in a clustering model. Some of the data belonging to a cluster is summarized or compressed and stored as a reduced form of the data representing sufficient statistics of the data. More data is accessed from the database and the models are updated. An updated set of parameters for the clusters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine if further data should be accessed from the database. Each time data is read from the database, a holdout set of data is used to evaluate the then-current model as well as other possible cluster models chosen from a candidate set of cluster models. The evaluation of the holdout data set allows a cluster model with a different cluster number K′ to be chosen if that model more accurately models the data based upon the evaluation of the holdout set.
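
The holdout-based choice of cluster number can be sketched as follows. This is a simplified illustration, assuming one-dimensional Gaussian mixture models that have already been fitted and using total holdout log-likelihood as the evaluation score; the abstract does not say which score is used, and every name here is hypothetical.

```python
import math

def mixture_density(x, model):
    """Density at x of a 1-D Gaussian mixture given as (weight, mean, std) triples."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in model
    )

def holdout_log_likelihood(holdout, model):
    """Total log-likelihood of the holdout points under a model."""
    return sum(math.log(mixture_density(x, model)) for x in holdout)

def choose_cluster_number(holdout, candidates):
    """Return the cluster number whose model scores best on the holdout set.

    candidates maps a cluster number to its fitted model.
    """
    return max(candidates, key=lambda k: holdout_log_likelihood(holdout, candidates[k]))

# Hypothetical holdout sample with two obvious groups, and two candidate models.
holdout = [-0.2, 0.1, 0.3, 9.8, 10.1, 10.4]
candidates = {
    1: [(1.0, 5.0, 5.0)],
    2: [(0.5, 0.0, 1.0), (0.5, 10.0, 1.0)],
}
print(choose_cluster_number(holdout, candidates))
```

In the system described by the abstract this comparison would be repeated each time new data is read, so the preferred cluster number can change as the models are updated.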

4. Granted invention patent

US07194477B1 Optimized a priori techniques (in force)
Publication No.: US07194477B1
Publication date: 2007-03-20
Application No.: US10187392
Filing date: 2002-06-28
Applicant: Paul Bradley, Stella Chan, Usama Fayyad, Neal Rothleder, Radha Krishna Uppala
Inventor: Paul Bradley, Stella Chan, Usama Fayyad, Neal Rothleder, Radha Krishna Uppala
IPC classification: G06F7/00, G06Q30/00
CPC classification: G06Q30/02, Y10S707/99942, Y10S707/99943, Y10S707/99945
Abstract: A facility for identifying groups of items that co-occur in more than a threshold number of instances is described. Each such group of items has a size reflecting the number of items in the group. The facility uses a data structure comprising, for each of a plurality of group sizes, a single map identifying groups of that group size that co-occur in more than a threshold number of instances.
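
The data structure in the abstract, one map per group size, can be sketched with a plain dictionary keyed by size. The pruning step below is the standard a priori rule (a group can only be frequent if all of its smaller subgroups are); the function name, parameters, and sample baskets are hypothetical, and the patent's specific optimizations are not reproduced.

```python
from collections import Counter
from itertools import combinations

def frequent_groups(baskets, threshold, max_size):
    """Return {size: {group: count}} for groups seen in more than `threshold` baskets."""
    maps = {}
    for size in range(1, max_size + 1):
        counts = Counter()
        for basket in baskets:
            for group in combinations(sorted(set(basket)), size):
                # A priori pruning: every (size - 1)-subset must already be frequent.
                if size > 1 and any(
                    sub not in maps[size - 1]
                    for sub in combinations(group, size - 1)
                ):
                    continue
                counts[group] += 1
        # One map per group size, holding only groups above the threshold.
        maps[size] = {g: c for g, c in counts.items() if c > threshold}
        if not maps[size]:
            break  # no larger group can be frequent
    return maps

# Hypothetical instances (for example, shopping baskets).
baskets = [
    ["milk", "bread"],
    ["milk", "bread", "eggs"],
    ["milk", "eggs"],
    ["bread"],
]
maps = frequent_groups(baskets, threshold=1, max_size=3)
print(maps[2])
```

Keeping a single map per size makes the pruning check a constant-time lookup in the map for the next smaller size.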

5. Granted invention patent

US06871196B1 Visualizing automatically generated segments (in force)
Publication No.: US06871196B1
Publication date: 2005-03-22
Application No.: US09751366
Filing date: 2000-12-29
Applicant: Stella Chan, Usama Fayyad, Neal Rothleder
Inventor: Stella Chan, Usama Fayyad, Neal Rothleder
IPC classification: G06N5/00, G06Q10/00
CPC classification: G06Q10/10, G06Q10/06
Abstract: A software facility for analyzing each of a plurality of groups of items is described. The facility retrieves information identifying, for each of a plurality of groups, items that are members of the group. For each group, the facility analyzes attributes of the items of the group to identify attributes that distinguish items that are members of the group from items that are not members of the group.
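
One simple way to find attributes that distinguish a group's members from non-members is to compare how often each attribute value occurs inside and outside the group. The sketch below does exactly that; the scoring rule (difference in share), the `min_gap` cutoff, and all names are assumptions made for illustration, since the abstract does not specify the analysis.

```python
def distinguishing_attributes(items, group_ids, min_gap=0.5):
    """List (attribute, value, gap) triples that separate a group from the rest.

    items maps an item id to a dict of attribute values. The gap is the share
    of group members with that value minus the share of non-members with it.
    """
    members = [attrs for item_id, attrs in items.items() if item_id in group_ids]
    others = [attrs for item_id, attrs in items.items() if item_id not in group_ids]
    if not members:
        return []
    pairs = {(a, v) for attrs in members for a, v in attrs.items()}
    result = []
    for attr, value in sorted(pairs):
        inside = sum(attrs.get(attr) == value for attrs in members) / len(members)
        outside = (
            sum(attrs.get(attr) == value for attrs in others) / len(others)
            if others else 0.0
        )
        if inside - outside >= min_gap:
            result.append((attr, value, inside - outside))
    return result

# Hypothetical items with two attributes each; the group is items 1 and 2.
items = {
    1: {"browser": "A", "country": "US"},
    2: {"browser": "A", "country": "DE"},
    3: {"browser": "B", "country": "US"},
    4: {"browser": "B", "country": "DE"},
}
print(distinguishing_attributes(items, {1, 2}))
```

Here only the browser separates the group from the rest; the country values are spread evenly and are filtered out.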

6. Invention patent application

US20050021499A1 Cluster- and descriptor-based recommendations (under examination, published)
Publication No.: US20050021499A1
Publication date: 2005-01-27
Application No.: US10926691
Filing date: 2004-08-26
Applicant: Paul Bradley, Usama Fayyad, Bassel Ojjeh
Inventor: Paul Bradley, Usama Fayyad, Bassel Ojjeh
IPC classification: G06F17/30, G06F7/00
CPC classification: G06F16/35, G06F16/9535
Abstract: Cluster- and descriptor-based recommender systems are disclosed which can, for example, scale to voluminous data. The data is generally organized into records and items. In one embodiment, a method first consolidates the data into groups, such as clusters or descriptors. The method determines a predicted vote for a particular record and a particular item, using a similarity scoring approach, such as a likelihood similarity approach, or a correlation similarity approach, based on the groups. The predicted vote can then be output.
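
A correlation-similarity prediction based on groups might look like the sketch below. It assumes each cluster is summarized by its mean vote per item, scores a record against each cluster with the Pearson correlation over co-voted items, and predicts the vote as a similarity-weighted average of the cluster means. The weighting rule and all names are illustrative assumptions, not the method claimed in the application.

```python
import math

def correlation(record, centroid):
    """Pearson correlation between a record's votes and a cluster's mean votes."""
    common = [item for item in record if item in centroid]
    if len(common) < 2:
        return 0.0
    r = [record[item] for item in common]
    c = [centroid[item] for item in common]
    r_mean, c_mean = sum(r) / len(r), sum(c) / len(c)
    cov = sum((a - r_mean) * (b - c_mean) for a, b in zip(r, c))
    norm = math.sqrt(
        sum((a - r_mean) ** 2 for a in r) * sum((b - c_mean) ** 2 for b in c)
    )
    return cov / norm if norm else 0.0

def predicted_vote(record, centroids, item):
    """Similarity-weighted average of cluster mean votes for `item`.

    Clusters with non-positive correlation are ignored; returns None when no
    cluster contributes.
    """
    weighted = total = 0.0
    for centroid in centroids:
        if item not in centroid:
            continue
        weight = max(correlation(record, centroid), 0.0)
        weighted += weight * centroid[item]
        total += weight
    return weighted / total if total else None

# Hypothetical cluster summaries (mean vote per item) and one record.
centroids = [{"a": 5, "b": 1, "c": 4}, {"a": 1, "b": 5, "c": 2}]
record = {"a": 4, "b": 2}
print(predicted_vote(record, centroids, "c"))
```

Scoring against a handful of cluster summaries instead of every other record is what lets this kind of approach scale to large data sets.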

7. Granted invention patent

US06263334B1 Density-based indexing method for efficient execution of high dimensional nearest-neighbor queries on large databases (lapsed)
Publication No.: US06263334B1
Publication date: 2001-07-17
Application No.: US09189229
Filing date: 1998-11-11
Applicant: Usama Fayyad, Kristin P. Bennett, Dan Geiger
Inventor: Usama Fayyad, Kristin P. Bennett, Dan Geiger
IPC classification: G06F17/30
CPC classification: G06F17/30333, G06F17/30707, G06K9/6226, G06K9/6273, Y10S707/99935, Y10S707/99936, Y10S707/99943
Abstract: Method and apparatus for efficiently performing nearest neighbor queries on a database of records wherein each record has a large number of attributes by automatically extracting a multidimensional index from the data. The method is based on first obtaining a statistical model of the content of the data in the form of a probability density function. This density is then used to decide how data should be reorganized on disk for efficient nearest neighbor queries. At query time, the model decides the order in which data should be scanned. It also provides the means for evaluating the probability of correctness of the answer found so far in the partial scan of data determined by the model. In this invention a clustering process is performed on the database to produce multiple data clusters. Each cluster is characterized by a cluster model. The set of clusters represents a probability density function in the form of a mixture model. A new database of records is built having an augmented record format that contains the original record attributes and an additional record attribute containing a cluster number for each record based on the clustering step. The cluster model uses a probability density function for each cluster so that the process of augmenting the attributes of each record is accomplished by evaluating each record's probability with respect to each cluster. Once the augmented records are used to build a database, the augmented attribute is used as an index into the database so that nearest neighbor query analysis can be very efficiently conducted using an indexed lookup process. As the database is queried, the probability density function is used to determine the order in which clusters or database pages are scanned. The probability density function is also used to determine when scanning can stop because the nearest neighbor has been found with high probability.
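
The scan-ordering idea can be sketched as follows, assuming spherical Gaussian clusters and records already grouped by cluster number. Clusters are visited in descending order of their posterior probability for the query point. The stopping rule used here (stop once the visited clusters cover a chosen share of that posterior mass) is a simplified stand-in chosen for illustration and is not the probability-of-correctness test described in the patent; all names are hypothetical.

```python
import math

def log_density(point, weight, mean, std):
    """Log of weight times a spherical Gaussian density at `point`."""
    dim = len(point)
    sq_dist = sum((p - m) ** 2 for p, m in zip(point, mean))
    return (math.log(weight) - dim * math.log(std)
            - 0.5 * dim * math.log(2.0 * math.pi) - sq_dist / (2.0 * std ** 2))

def nearest_neighbor(query, clusters, confidence=0.95):
    """Scan clusters in order of posterior probability for the query.

    clusters: list of dicts with 'weight', 'mean', 'std', and 'records'.
    Returns (best_record, number_of_clusters_scanned).
    """
    logs = [log_density(query, c["weight"], c["mean"], c["std"]) for c in clusters]
    peak = max(logs)
    scores = [math.exp(v - peak) for v in logs]  # rescaled to avoid underflow
    total = sum(scores)
    order = sorted(range(len(clusters)), key=lambda i: scores[i], reverse=True)
    best, best_dist, covered, scanned = None, math.inf, 0.0, 0
    for i in order:
        for record in clusters[i]["records"]:
            dist = math.dist(query, record)
            if dist < best_dist:
                best, best_dist = record, dist
        covered += scores[i] / total
        scanned += 1
        if covered >= confidence:  # assumed stopping rule, see lead-in
            break
    return best, scanned

# Hypothetical two-cluster index over 2-D records.
clusters = [
    {"weight": 0.5, "mean": (0.0, 0.0), "std": 1.0,
     "records": [(0.1, 0.2), (-0.5, 0.3)]},
    {"weight": 0.5, "mean": (10.0, 10.0), "std": 1.0,
     "records": [(9.8, 10.1), (10.4, 9.7)]},
]
print(nearest_neighbor((0.0, 0.1), clusters))
```

The benefit comes from the early exit: when the query sits well inside one cluster, the records of the remaining clusters are never read.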

8. Granted invention patent

US07051029B1 Identifying and reporting on frequent sequences of events in usage data (in force)
Publication No.: US07051029B1
Publication date: 2006-05-23
Application No.: US09755971
Filing date: 2001-01-05
Applicant: Usama Fayyad, Neal Rothleder, Cheng Yang
Inventor: Usama Fayyad, Neal Rothleder, Cheng Yang
IPC classification: G06F17/30
CPC classification: G06F17/30876, Y10S707/99943
Abstract: A method, system, and computer-readable medium for identifying frequently occurring sequences of interaction events of interest are described. In particular, techniques are described for receiving multiple groups each having related interaction events in serial or sequential order, and for determining sequences of interaction events that frequently occur in the multiple groups. Reports can also be generated and provided that include information about the determined frequent sequences. The techniques can at times be used to provide a service to customers in which logs containing data about interaction events related to that customer (e.g., usage events for a provided service or of a provided Website) are received or obtained, in which frequent sequences in the log data are identified, and in which reports are provided to representatives of the customer about the frequent sequences (e.g., remotely over the Web based on interactive specifications).
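
A minimal sketch of finding frequent event sequences across groups is shown below. It assumes a sequence means a run of consecutive events of fixed length and counts each sequence at most once per group; both choices, along with the function name and sample sessions, are illustrative assumptions rather than details from the patent.

```python
from collections import Counter

def frequent_sequences(groups, length, min_groups):
    """Return {sequence: group_count} for sequences found in at least `min_groups` groups.

    groups: list of event lists in order of occurrence (for example, sessions).
    A sequence is a tuple of `length` consecutive events.
    """
    counts = Counter()
    for events in groups:
        # Collect each distinct sequence once per group.
        seen = {
            tuple(events[i:i + length])
            for i in range(len(events) - length + 1)
        }
        counts.update(seen)
    return {seq: n for seq, n in counts.items() if n >= min_groups}

# Hypothetical website sessions.
sessions = [
    ["home", "search", "product", "cart"],
    ["home", "search", "product"],
    ["search", "product", "cart"],
    ["home", "about"],
]
report = frequent_sequences(sessions, length=2, min_groups=2)
for seq, n in sorted(report.items(), key=lambda kv: -kv[1]):
    print(" > ".join(seq), n)
```

The resulting dictionary is the kind of summary a report could be built from, for example by listing the most common paths first.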

9. Granted invention patent

US06735589B2 Method of reducing dimensionality of a set of attributes used to characterize a sparse data set (in force)
Publication No.: US06735589B2
Publication date: 2004-05-11
Application No.: US09876321
Filing date: 2001-06-07
Applicant: Paul S. Bradley, Demetrios Achlioptas, Christos Faloutsos, Usama Fayyad
Inventor: Paul S. Bradley, Demetrios Achlioptas, Christos Faloutsos, Usama Fayyad
IPC classification: G06F17/30
CPC classification: G06K9/6228, Y10S707/99936, Y10S707/99942, Y10S707/99943
Abstract: A dimensionality reduction method of generating a reduced dimension matrix data set $D_{new}$ of dimension $m \times k$ from an original matrix data set $D$ of dimension $m \times n$ wherein $n > k$. The method selects a subset of $k$ columns from a set of $n$ columns in the original data set $D$ where the $m$ rows correspond to observations $R_i$ where $i = 1, \ldots, m$ and the $n$ columns correspond to attributes $A_j$ where $j = 1, \ldots, n$ and $d_{ij}$ is the data value associated with observation $R_i$ and attribute $A_j$. The data values in the reduced data set $D_{new}$ for each of the selected $k$ attributes are identical to the data values of the corresponding attributes in the original data set. The steps of the method include: for each of the attributes $A_j$ in the original data set $D$, calculating a value of variance of the data values associated with attribute $A_j$, where the variance value, $\mathrm{Var}(A_j)$, of the attribute $A_j$ is calculated as follows: $\mathrm{Var}(A_j) = \frac{1}{m} \sum_{i=1}^{m} \left(d_{ij} - \mathrm{Mean}(A_j)\right)^2$, where $\mathrm{Mean}(A_j)$ is the mean value of the data values corresponding to attribute $A_j$; selecting the $k$ attributes having the greatest variance values; and generating the reduced data set $D_{new}$ by selecting data values in the original data set $D$ corresponding to the selected $k$ attributes.
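
The selection steps in the abstract translate almost directly into code. The sketch below uses plain Python lists for the m-by-n data set and the variance formula given in the abstract (division by m); the function name, the tie-breaking behavior, and the sample data are assumptions for illustration.

```python
def reduce_by_variance(data, k):
    """Keep the k columns of `data` with the greatest variance.

    data: list of m rows, each a list of n values.
    Returns (kept_column_indices, reduced_rows); kept values are unchanged.
    """
    m, n = len(data), len(data[0])
    variances = []
    for j in range(n):
        column = [row[j] for row in data]
        mean = sum(column) / m
        # Var(A_j) = (1/m) * sum_i (d_ij - Mean(A_j))^2
        variances.append(sum((v - mean) ** 2 for v in column) / m)
    top = sorted(range(n), key=lambda j: variances[j], reverse=True)[:k]
    keep = sorted(top)  # preserve the original column order
    return keep, [[row[j] for j in keep] for row in data]

# Hypothetical 4 x 3 data set: column 0 is constant, column 1 is sparse.
data = [
    [1, 0, 10],
    [1, 0, 20],
    [1, 1, 30],
    [1, 0, 40],
]
keep, reduced = reduce_by_variance(data, k=2)
print(keep, reduced)
```

Because the method only selects existing columns rather than projecting onto new axes, the reduced data set stays as sparse and as interpretable as the original.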

10. Granted invention patent

US6012058A Scalable system for K-means clustering of large databases (lapsed)
Publication No.: US6012058A
Publication date: 2000-01-04
Application No.: US42540
Filing date: 1998-03-17
Applicant: Usama Fayyad, Paul S. Bradley, Cory Reina
Inventor: Usama Fayyad, Paul S. Bradley, Cory Reina
IPC classification: G06F17/30, G06F17/00
CPC classification: G06F17/30705, G06K9/6223, Y10S707/99932, Y10S707/99933, Y10S707/99934, Y10S707/99935, Y10S707/99936
Abstract: In one exemplary embodiment the invention provides a data mining system for use in evaluating data in a database. Before the data evaluation begins, a choice is made of a cluster number K for use in categorizing the data in the database into K different clusters, and initial guesses at the means, or centroids, of each cluster are provided. Then a portion of the data in the database is read from a storage medium and brought into a rapid access memory. Data contained in the data portion is used to update the original guesses at the centroids of each of the K clusters. Some of the data belonging to a cluster is summarized or compressed and stored as a summarization of the data. More data is accessed from the database and assigned to a cluster. An updated mean for the clusters is determined from the summarized data and the newly acquired data. A stopping criterion is evaluated to determine if further data should be accessed from the database. If further data is needed to characterize the clusters, more data is gathered from the database and used in combination with already compressed data until the stopping criterion has been met.
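
The batch-wise update can be sketched as below. As a simplification, every point is compressed into per-cluster running sums and counts as soon as it is assigned, whereas the abstract says only some of the data is summarized. The stopping criterion (largest centroid movement below `tol`) and all names are assumptions for illustration.

```python
import math

def scalable_kmeans(batches, initial_centroids, tol=1e-3):
    """Update K centroids from successive data portions.

    batches: iterable of lists of points, each read from storage in turn.
    Only per-cluster sums and counts are kept between batches.
    """
    centroids = [list(c) for c in initial_centroids]
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for batch in batches:
        for point in batch:
            # Assign the point to its nearest current centroid.
            nearest = min(range(k), key=lambda j: math.dist(point, centroids[j]))
            counts[nearest] += 1
            for d in range(dim):
                sums[nearest][d] += point[d]
        # Recompute means from the summaries; the raw points are not retained.
        shift = 0.0
        for j in range(k):
            if counts[j]:
                updated = [s / counts[j] for s in sums[j]]
                shift = max(shift, math.dist(updated, centroids[j]))
                centroids[j] = updated
        if shift < tol:  # assumed stopping criterion
            break
    return centroids

# Hypothetical data portions and initial guesses for K = 2.
batches = [
    [(0, 0), (10, 10)],
    [(1, 1), (9, 9)],
    [(0, 2), (10, 8)],
]
final = scalable_kmeans(batches, [(1, 1), (8, 8)])
print(final)
```

Memory use depends on K and the number of dimensions, not on the number of records, which is the property that lets this style of clustering handle databases larger than main memory.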
