会员体验
专利管家(专利管理)
工作空间(专利管理)
风险监控(情报监控)
数据分析(专利分析)
侵权分析(诉讼无效)
联系我们
交流群
官方交流:
QQ群: 891211   
微信请扫码    >>>
现在联系顾问~
热词
    • 31. 发明授权
    • Microprocessor including an efficient implementation of extreme value
instructions
    • US06029244A
    • 2000-02-22
    • US948679
    • 1997-10-10
    • Stuart ObermanNorbert Juffa
    • Stuart ObermanNorbert Juffa
    • G06F9/30G06F9/302G06F9/318G06F9/38G06F17/16G06F9/305
    • G06F17/16G06F9/30014G06F9/30021G06F9/30036G06F9/30167G06F9/3017G06F9/3804G06F9/3885
    • An execution unit is provided for executing a first instruction which includes an opcode field, a first operand field, and a second operand field. The execution unit includes a first input register for receiving a first operand specified by a value of the first operand field, and a second input register for receiving a second operand specified by a value of the second operand field. The execution unit further includes a comparator unit which is coupled to receive a value of the opcode field for the first instruction. The comparator unit is also coupled to receive the first and second operand values from the first and second input registers, respectively. The execution further includes a multiplexer which receives a plurality of inputs. These inputs include a first constant value, a second constant value, and the values of the first and second operand. If the decoded opcode value received by the comparator indicates that the first instruction is either a compare or extreme value function, the comparator conveys one or more control signals to the multiplexer for the purpose of selecting an output of the multiplexer as the result of the first instruction. If the first instruction is one of a plurality of extreme value instructions, the one or more control signals conveyed by the comparator unit select between the first operand and second operand to determine the result of the first instruction. If the first instruction is one of a plurality of compare instructions, the one or more control signals conveyed by the comparator unit select between the first and second constant value to determine the result of the first instruction. In another embodiment, a similar execution unit is provided which handles vector operands.
    • 32. 发明授权
    • Efficient matrix multiplication on a parallel processing device
    • 在并行处理设备上有效的矩阵乘法
    • US08589468B2
    • 2013-11-19
    • US12875961
    • 2010-09-03
    • Norbert JuffaRadoslav Danilak
    • Norbert JuffaRadoslav Danilak
    • G06F7/52
    • G06F17/16
    • The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number accesses from the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
    • 本发明使得能够对并行处理装置进行有效的矩阵乘法运算。 一个实施例是用于将CTA映射到用于矩阵乘法运算的矩阵瓦片的方法。 另一个实施例是用于将CTA映射到结果瓦片的第二种方法。 其他实施例是用于将CTA的各个线程映射到块的元素以用于结果瓦片计算,源瓦片复制操作以及源瓦片复制和转置操作的方法。 本发明有利地使结果矩阵元素可以使用在不同的流式多处理器上同时执行的多个CTA来逐个瓦片地计算,使得能够将源瓦片复制到本地存储器,以减少当计算一个 结果图块,并且启用来自全局存储器的合并的读取操作以及对本地存储器的写入操作,而没有存储体冲突。
    • 33. 发明授权
    • Maximized memory throughput on parallel processing devices
    • 最大化并行处理设备的内存吞吐量
    • US08327123B2
    • 2012-12-04
    • US13069384
    • 2011-03-23
    • Norbert JuffaBrett W. Coon
    • Norbert JuffaBrett W. Coon
    • G06F9/30
    • G06F9/3887G06F9/3455G06F9/3851G06F9/3889
    • In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory accesses times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.
    • 在用于流计算的并行处理装置中,流的每个数据元素的处理可能不是计算密集的,因此与读取流并写入结果所需的存储器访问时间相比,处理可能需要相对较少的时间来计算。 因此,内存吞吐量通常会限制流计算的性能。 一般来说,提供了用于在这种存储器吞吐量限制的流计算中实现改进的,优化的或最终最大化的存储器吞吐量的方法。 通过提高跨多个处理元件和线程的聚合内存吞吐量,最大化流计算性能。 通过平衡线程和线程组之间的处理负载以及耦合到并行处理设备的硬件存储器接口来实现高聚合内存吞吐量。
    • 34. 发明申请
    • MAXIMIZED MEMORY THROUGHPUT ON PARALLEL PROCESSING DEVICES
    • 最大化的并行处理器件的存储器
    • US20110173414A1
    • 2011-07-14
    • US13069384
    • 2011-03-23
    • Norbert JuffaBrett W. Coon
    • Norbert JuffaBrett W. Coon
    • G06F9/38
    • G06F9/3887G06F9/3455G06F9/3851G06F9/3889
    • In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to memory accesses times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices.
    • 在用于流计算的并行处理装置中,流的每个数据元素的处理可能不是计算密集的,因此与读取流并写入结果所需的存储器访问时间相比,处理可能需要相对较少的时间来计算。 因此,内存吞吐量通常会限制流计算的性能。 一般来说,提供了用于在这种存储器吞吐量限制的流计算中实现改进的,优化的或最终最大化的存储器吞吐量的方法。 通过提高跨多个处理元件和线程的聚合内存吞吐量,最大化流计算性能。 通过平衡线程和线程组之间的处理负载以及耦合到并行处理设备的硬件存储器接口来实现高聚合内存吞吐量。
    • 35. 发明授权
    • Apparatus and method for superforwarding load operands in a microprocessor
    • 用于在微处理器中超载负载操作数的装置和方法
    • US06442677B1
    • 2002-08-27
    • US09329497
    • 1999-06-10
    • Derrick R. MeyerStephan G. MeierNorbert Juffa
    • Derrick R. MeyerStephan G. MeierNorbert Juffa
    • G06F9312
    • G06F9/30043G06F9/3826
    • An apparatus and method for superforwarding load operands in a microprocessor are provided. An execution unit in a microprocessor is configured to receive a load instruction and a subsequent instruction. If the load instruction corresponds to a simple load instruction, a destination operand of the load instruction can be superforwarded to a subsequent instruction if the subsequent instruction specifies a source operand that depends on the destination operand of the load instruction. The subsequent instruction is not required to wait until a load instruction executes or completes and can be scheduled and/or executed prior to or at the same time as the load instruction. Consequently, latencies associated with operand dependencies may be reduced.
    • 提供了一种用于在微处理器中超载负载操作数的装置和方法。 微处理器中的执行单元被配置为接收加载指令和后续指令。 如果加载指令对应于简单的加载指令,则如果后续指令指定依赖于加载指令的目的地操作数的源操作数,则加载指令的目标操作数可以被超前给后续指令。 后续指令不需要等待加载指令执行或完成,并且可以在加载指令之前或同时进行调度和/或执行。 因此,可以减少与操作数相关性相关联的延迟。
    • 36. 发明授权
    • Apparatus and method for executing floating-point store instructions in a microprocessor
    • 在微处理器中执行浮点存储指令的装置和方法
    • US06408379B1
    • 2002-06-18
    • US09329718
    • 1999-06-10
    • Norbert JuffaStephan MeierStuart ObermanScott White
    • Norbert JuffaStephan MeierStuart ObermanScott White
    • G06F738
    • G06F9/3861G06F7/483G06F7/49905G06F7/4991G06F9/30014
    • An apparatus and method for executing floating-point store instructions in a microprocessor is provided. If store data of a floating-point store instruction corresponds to a tiny number and an underflow exception is masked, then a trap routine can be executed to generate corrected store data and complete the store operation. In response to detecting that store data corresponds to a tiny number and the underflow exception is masked, the store data, store address information, and opcode information can be stored prior to initiating the trap routine. The trap routine can be configured to access the store data, store address information, and opcode information. The trap routine can be configured to generate corrected store data and complete the store operation using the store data, store address information, and opcode information.
    • 提供了一种用于在微处理器中执行浮点存储指令的装置和方法。 如果浮点存储指令的存储数据对应于微数,并且下溢异常被屏蔽,则可以执行陷阱例程以生成校正的存储数据并完成存储操作。 响应于检测到存储数据对应于微小数字并且下溢异常被屏蔽,可以在启动陷阱例程之前存储存储数据,存储地址信息和操作码信息。 陷阱程序可以配置为访问存储数据,存储地址信息和操作码信息。 陷阱程序可以配置为生成更正的存储数据,并使用存储数据,存储地址信息和操作码信息完成存储操作。