Quick Patent Search - search global patents quickly, free patent database for commercial use - IPRDB

21. Granted invention patent

US07725696B1 Method and apparatus for modulo scheduled loop execution in a processor architecture (Status: in force)
Publication No.: US07725696B1
Publication date: 2010-05-25
Application No.: US11867127
Filing date: 2007-10-04
Applicant: Wen-mei W. Hwu , Matthew C. Merten
Inventor: Wen-mei W. Hwu , Matthew C. Merten
IPC classification: G06F9/00
CPC classification: G06F8/4452 , G06F9/325 , G06F9/381 , G06F9/3836 , G06F9/3857 , G06F9/3861
Abstract: A processor method and apparatus that allows for the overlapped execution of multiple iterations of a loop while allowing the compiler to include only a single copy of the loop body in the code while automatically managing which iterations are active. Since the prologue and epilogue are implicitly created and maintained within the hardware in the invention, a significant reduction in code size can be achieved compared to software-only modulo scheduling. Furthermore, loops with iteration counts less than the number of concurrent iterations present in the kernel are also automatically handled. This hardware enhanced scheme achieves the same performance as the fully-specified standard method. Furthermore, the hardware reduces the power requirement as the entire fetch unit can be deactivated for a portion of the loop's execution. The basic design of the invention involves including a plurality of buffers for storing loop instructions, each of which is associated with an instruction decoder and its respective functional unit, in the dispatch stage of a processor. Control logic is used to receive loop setup parameters and to control the selective issue of instructions from the buffers to the functional units.
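
The implicit prologue and epilogue described in this abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the patented hardware: the function name `active_stages` and its parameters are hypothetical, and the loop body is assumed to be split into `num_stages` stages with a new iteration started on every pass over the single kernel copy. A stage is issued on a given pass only if it belongs to a valid iteration, which is how the ramp-up, steady state, and ramp-down fall out of one copy of the loop body.

```python
def active_stages(num_iterations, num_stages):
    """Yield, for each pass over the single kernel copy, the list of
    (stage, iteration) pairs that would be issued on that pass.

    Hypothetical model: stage s of iteration i runs on pass i + s.
    """
    if num_iterations <= 0:
        return
    total_passes = num_iterations + num_stages - 1
    for kernel_pass in range(total_passes):
        active = []
        for stage in range(num_stages):
            iteration = kernel_pass - stage
            # Issue the stage only when it maps to a real iteration;
            # this masks out the not-yet-started and already-finished ones.
            if 0 <= iteration < num_iterations:
                active.append((stage, iteration))
        yield active


# Two iterations through a three-stage kernel: fewer iterations than
# concurrent stages, the short-loop case the abstract mentions.
schedule = list(active_stages(num_iterations=2, num_stages=3))
for kernel_pass, active in enumerate(schedule):
    print(kernel_pass, active)
```

In this model the kernel is never fully populated when the iteration count is smaller than the stage count, yet every stage of every iteration is still issued exactly once.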

22. Granted invention patent

US07302557B1 Method and apparatus for modulo scheduled loop execution in a processor architecture (Status: in force)
Publication No.: US07302557B1
Publication date: 2007-11-27
Application No.: US09728441
Filing date: 2000-12-01
Applicant: Wen-mei W. Hwu , Matthew C. Merten
Inventor: Wen-mei W. Hwu , Matthew C. Merten
IPC classification: G06F9/30 , G06F9/45
CPC classification: G06F8/4452 , G06F9/325 , G06F9/381 , G06F9/3836 , G06F9/3857 , G06F9/3861
Abstract: A processor method and apparatus that allows for the overlapped execution of multiple iterations of a loop while allowing the compiler to include only a single copy of the loop body in the code while automatically managing which iterations are active. Since the prologue and epilogue are implicitly created and maintained within the hardware in the invention, a significant reduction in code size can be achieved compared to software-only modulo scheduling. Furthermore, loops with iteration counts less than the number of concurrent iterations present in the kernel are also automatically handled. This hardware enhanced scheme achieves the same performance as the fully-specified standard method. Furthermore, the hardware reduces the power requirement as the entire fetch unit can be deactivated for a portion of the loop's execution. The basic design of the invention involves including a plurality of buffers for storing loop instructions, each of which is associated with an instruction decoder and its respective functional unit, in the dispatch stage of a processor. Control logic is used to receive loop setup parameters and to control the selective issue of instructions from the buffers to the functional units.

23. Granted invention patent

US11080045B2 Addition instructions with independent carry chains (Status: in force)
Publication No.: US11080045B2
Publication date: 2021-08-03
Application No.: US13993483
Filing date: 2011-12-22
Applicant: Vinodh Gopal , James D. Guilford , Gilbert M. Wolrich , Wajdi K. Feghali , Erdinc Ozturk , Martin G. Dixon , Sean P. Mirkes , Matthew C. Merten , Tong Li , Bret T. Toll, I
Inventor: Vinodh Gopal , James D. Guilford , Gilbert M. Wolrich , Wajdi K. Feghali , Erdinc Ozturk , Martin G. Dixon , Sean P. Mirkes , Matthew C. Merten , Tong Li , Bret T. Toll, I
IPC classification: G06F9/30 , G06F9/38
Abstract: A number of addition instructions are provided that have no data dependency between each other. A first addition instruction stores its carry output in a first flag of a flags register without modifying a second flag in the flags register. A second addition instruction stores its carry output in the second flag of the flags register without modifying the first flag in the flags register.
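
The idea of two additions that each keep their carry in a different flag can be sketched in software. The following is an illustrative model only: the flag names `CF` and `OF`, the 64-bit limb width, and the helper names are assumptions of this sketch, not details taken from the abstract. It interleaves two multi-limb additions so that neither carry chain disturbs the other.

```python
MASK64 = (1 << 64) - 1


def add_with_flag(flags, flag_name, a, b):
    """64-bit add that reads and writes only `flag_name` in `flags`.

    Every other flag in the dictionary is left untouched, which is
    what keeps the two carry chains independent.
    """
    total = a + b + flags[flag_name]
    flags[flag_name] = total >> 64
    return total & MASK64


def interleaved_add(x_limbs, y_limbs, u_limbs, v_limbs):
    """Compute x + y and u + v limb by limb (least significant first),
    alternating between the two chains on every step."""
    flags = {"CF": 0, "OF": 0}  # hypothetical flag names
    xy, uv = [], []
    for x, y, u, v in zip(x_limbs, y_limbs, u_limbs, v_limbs):
        xy.append(add_with_flag(flags, "CF", x, y))  # chain 1 uses CF only
        uv.append(add_with_flag(flags, "OF", u, v))  # chain 2 uses OF only
    return xy, uv, flags


xy, uv, flags = interleaved_add([MASK64, 0], [1, 0], [1, 2], [3, 4])
print(xy, uv, flags)
```

With a single shared carry flag, the second chain's first addition would overwrite the carry produced by the first chain, so the two sums could not be interleaved without saving and restoring the flag.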

24. Granted invention patent

US10235180B2 Scheduler implementing dependency matrix having restricted entries (Status: in force)
Publication No.: US10235180B2
Publication date: 2019-03-19
Application No.: US13723684
Filing date: 2012-12-21
Applicant: Srikanth T. Srinivasan , Matthew C. Merten , Bambang Sutanto , Rahul R. Kulkarni , Justin M. Deinlein , James D. Hadley
Inventor: Srikanth T. Srinivasan , Matthew C. Merten , Bambang Sutanto , Rahul R. Kulkarni , Justin M. Deinlein , James D. Hadley
IPC classification: G06F9/30 , G06F9/38
Abstract: A scheduler implementing a dependency matrix having restricted entries is disclosed. A processing device of the disclosure includes a decode unit to decode an instruction and a scheduler communicably coupled to the decode unit. In one embodiment, the scheduler is configured to receive the decoded instruction, determine that the decoded instruction qualifies for allocation as a restricted reservation station (RS) entry type in a dependency matrix maintained by the scheduler, identify RS entries in the dependency matrix that are free for allocation, allocate one of the identified free RS entries with information of the decoded instruction in the dependency matrix, and update a row of the dependency matrix corresponding to the claimed RS entry with source dependency information of the decoded instruction.
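
The allocate-then-update-row sequence in this abstract can be sketched as a toy data structure. Everything beyond that sequence is an assumption made for illustration: the class and method names are hypothetical, and "restricted" is modeled simply as a reserved subset of RS entries, which may not match what the patent means by a restricted entry type.

```python
class DependencyMatrix:
    """Toy dependency matrix: bit j of rows[i] set means that
    RS entry i is still waiting on the result of RS entry j."""

    def __init__(self, num_entries, restricted_entries):
        self.num_entries = num_entries
        # Assumption: restricted-type instructions may only use these rows.
        self.restricted = set(restricted_entries)
        self.busy = [False] * num_entries
        self.rows = [0] * num_entries

    def allocate(self, producer_entries, is_restricted):
        """Claim a free entry of the right type and record its sources."""
        all_entries = set(range(self.num_entries))
        pool = self.restricted if is_restricted else all_entries - self.restricted
        free = sorted(e for e in pool if not self.busy[e])
        if not free:
            return None  # no free entry of this type: allocation stalls
        entry = free[0]
        row = 0
        for producer in producer_entries:
            if self.busy[producer]:  # only in-flight producers block us
                row |= 1 << producer
        self.busy[entry] = True
        self.rows[entry] = row
        return entry

    def ready(self, entry):
        return self.busy[entry] and self.rows[entry] == 0

    def complete(self, entry):
        """Free the entry and clear its column so that dependents wake up."""
        self.busy[entry] = False
        self.rows[entry] = 0
        for i in range(self.num_entries):
            self.rows[i] &= ~(1 << entry)


matrix = DependencyMatrix(num_entries=4, restricted_entries=[2, 3])
producer = matrix.allocate([], is_restricted=False)
consumer = matrix.allocate([producer], is_restricted=True)
```

Here `consumer` is not ready until `matrix.complete(producer)` clears the producer's column.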

25. Granted invention patent

US09542191B2 Hardware profiling mechanism to enable page level automatic binary translation (Status: in force)
Publication No.: US09542191B2
Publication date: 2017-01-10
Application No.: US13993792
Filing date: 2012-03-30
Applicant: Paul Caprioli , Matthew C. Merten , Muawya M. Al-Otoom , Omar M. Shaikh , Abhay S. Kanhere , Suresh Srinivas , Koichi Yamada , Vivek Thakkar , Pawel Osciak
Inventor: Paul Caprioli , Matthew C. Merten , Muawya M. Al-Otoom , Omar M. Shaikh , Abhay S. Kanhere , Suresh Srinivas , Koichi Yamada , Vivek Thakkar , Pawel Osciak
IPC classification: G06F9/38 , G06F9/30 , G06F9/45 , G06F9/455
CPC classification: G06F11/3466 , G06F8/40 , G06F8/52 , G06F9/3017 , G06F9/3842 , G06F9/4552 , G06F11/073 , G06F11/3616 , G06F11/3652
Abstract: A hardware profiling mechanism implemented by performance monitoring hardware enables page level automatic binary translation. The hardware during runtime identifies a code page in memory containing potentially optimizable instructions. The hardware requests allocation of a new page in memory associated with the code page, where the new page contains a collection of counters and each of the counters corresponds to one of the instructions in the code page. When the hardware detects a branch instruction having a branch target within the code page, it increments one of the counters that has the same position in the new page as the branch target in the code page. The execution of the code page is repeated and the counters are incremented when branch targets fall within the code page. The hardware then provides the counter values in the new page to a binary translator for binary translation.
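
The position-matched counter page described in this abstract can be sketched as follows. This is a software illustration under assumptions, not the hardware mechanism itself: the class name `CodePageProfile`, the 4096-byte page size, one counter per byte offset, and the `hot_targets` helper are all hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class CodePageProfile:
    """Counter page shadowing one code page: the counter at offset k
    counts branches whose target is at offset k in the code page."""

    def __init__(self, page_base):
        self.page_base = page_base
        self.counters = [0] * PAGE_SIZE

    def record_branch(self, target_address):
        """Count the branch if its target lies inside the profiled page."""
        offset = target_address - self.page_base
        if 0 <= offset < PAGE_SIZE:
            self.counters[offset] += 1
            return True
        return False  # target is outside the page: not counted

    def hot_targets(self, threshold):
        """Addresses a binary translator might treat as hot entry points."""
        return [
            self.page_base + offset
            for offset, count in enumerate(self.counters)
            if count >= threshold
        ]


profile = CodePageProfile(page_base=0x4000)
for target in (0x4010, 0x4010, 0x4020, 0x4010, 0x9000):
    profile.record_branch(target)
print([hex(a) for a in profile.hot_targets(threshold=2)])
```

Keeping the counter at the same position as the branch target means no lookup table is needed to map a counter back to the instruction it describes.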

26. Granted invention patent

US09354875B2 Enhanced loop streaming detector to drive logic optimization (Status: in force)
Publication No.: US09354875B2
Publication date: 2016-05-31
Application No.: US13728273
Filing date: 2012-12-27
Applicant: Matthew C. Merten , Justin M. Deinlein , Yury N. Ilin , Alexandre J. Farcy , Tong Li , Srikanth T. Srinivasan
Inventor: Matthew C. Merten , Justin M. Deinlein , Yury N. Ilin , Alexandre J. Farcy , Tong Li , Srikanth T. Srinivasan
IPC classification: G06F9/30
CPC classification: G06F9/30065 , G06F1/3243 , G06F1/3287 , G06F9/3836 , G06F9/3885
Abstract: An enhanced loop streaming detection mechanism is provided in a processor to reduce power consumption. The processor includes a decoder to decode instructions in a loop into micro-operations, and a loop streaming detector to detect the presence of the loop in the micro-operations. The processor also includes a loop characteristic tracker unit to identify hardware components downstream from the decoder that are not to be used by the micro-operations in the loop, and to disable the identified hardware components. The processor also includes execution circuitry to execute the micro-operations in the loop with the identified hardware components disabled.

27. Invention application

US20150032998A1 METHOD, APPARATUS, AND SYSTEM FOR TRANSACTIONAL SPECULATION CONTROL INSTRUCTIONS (Status: under examination, published)
Publication No.: US20150032998A1
Publication date: 2015-01-29
Application No.: US13997243
Filing date: 2012-02-02
Applicant: Ravi Rajwar , Martin G. Dixon , Konrad K. Lai , Alexandre J. Farcy , Bret L. Toll , Robert S. Chappell , Matthew C. Merten , Rajesh S. Parthasarathy , Per Hammarlund
Inventor: Ravi Rajwar , Martin G. Dixon , Konrad K. Lai , Alexandre J. Farcy , Bret L. Toll , Robert S. Chappell , Matthew C. Merten , Rajesh S. Parthasarathy , Per Hammarlund
IPC classification: G06F9/30
CPC classification: G06F9/30181 , G06F9/3004 , G06F9/30087 , G06F9/30145 , G06F9/3834 , G06F9/3851 , G06F9/3859
Abstract: An apparatus and method are described herein for providing speculation control instructions. An xAcquire and an xRelease instruction are provided to define a critical section. In one embodiment, the xAcquire instruction includes a lock instruction with an elision prefix and the xRelease instruction includes a lock release instruction with an elision prefix. As a result, a processor is able to elide locks and transactionally execute a critical section defined in software by xAcquire and xRelease. But by adding only prefix hints, legacy processors are able to execute the same code by just ignoring the hints and executing the critical section traditionally with locks to guarantee mutual exclusion. Moreover, xBegin and xEnd are similarly provided for in an Instruction Set Architecture (ISA) to define a transactional code region. In addition, other control speculation instructions, such as xAbort to enable explicit abort of a critical or transactional code section and xTest to test a state of speculative execution, are also provided in the ISA.

28. Invention application

US20150006868A1 MINIMIZING BANDWITH TO COMPRESS OUTPUT STREAM IN INSTRUCTION TRACING SYSTEMS (Status: in force)
Publication No.: US20150006868A1
Publication date: 2015-01-01
Application No.: US13930501
Filing date: 2013-06-28
Applicant: Ilya Wagner , Matthew C. Merten , Frank Binns , Christine E. Wang , Mayank Bomb , Tong Li , Thilo Schmitt , MD A. Rahman
Inventor: Ilya Wagner , Matthew C. Merten , Frank Binns , Christine E. Wang , Mayank Bomb , Tong Li , Thilo Schmitt , MD A. Rahman
IPC classification: G06F9/38
CPC classification: G06F11/3466 , G06F11/348 , G06F2201/81
Abstract: In accordance with embodiments disclosed herein, there are provided systems and methods for minimizing bandwidth to compress an output stream of an instruction tracing system. For example, the method may include identifying a current instruction in a trace of the IT module as a conditional branch (CB) instruction. The method includes executing one of generating a CB packet including a byte pattern with an indication of outcome of the CB instruction, or adding an indication of the outcome of the CB instruction to the byte pattern of an existing CB packet. The method includes generating a packet when a subsequent instruction in the trace is not the CB instruction. The packet is different from the CB packet. The method also includes adding the packet into a deferred queue when the packet is deferrable. The method further includes outputting the CB packet followed by the deferred packet into a packet log.
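
The packet-combining steps in this abstract can be sketched as a small encoder. This is a minimal illustration under assumptions rather than the actual packet format: the function name `encode_trace`, the event tuples, the packet labels, and the capacity of six outcomes per CB packet are all hypothetical.

```python
def encode_trace(events, cb_capacity=6):
    """Turn trace events into a packet log.

    Each event is (kind, payload, deferrable). Conditional-branch
    outcomes are packed as bits into one open CB packet; deferrable
    packets wait in a queue so they do not force the CB packet to be
    closed early.
    """
    log = []
    cb_bits = []   # outcomes accumulated in the open CB packet
    deferred = []  # deferrable packets waiting behind the CB packet

    def flush():
        # Output the CB packet first, then the deferred packets.
        if cb_bits:
            log.append(("CB", tuple(cb_bits)))
            cb_bits.clear()
        log.extend(deferred)
        deferred.clear()

    for kind, payload, deferrable in events:
        if kind == "cond_branch":
            cb_bits.append(1 if payload else 0)
            if len(cb_bits) == cb_capacity:
                flush()
        elif deferrable and cb_bits:
            deferred.append((kind, payload))
        else:
            flush()
            log.append((kind, payload))
    flush()
    return log


events = [
    ("cond_branch", True, False),
    ("timing", 5, True),
    ("cond_branch", False, False),
    ("indirect", 0x40, False),
    ("cond_branch", True, False),
]
packet_log = encode_trace(events)
print(packet_log)
```

Without the deferred queue, the `timing` packet in this example would split the first two branch outcomes into two separate CB packets.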

29. Invention application

US20140337604A1 MINIMIZING BANDWIDTH TO TRACK RETURN TARGETS BY AN INSTRUCTION TRACING SYSTEM (Status: in force)
Publication No.: US20140337604A1
Publication date: 2014-11-13
Application No.: US13890654
Filing date: 2013-05-09
Applicant: Beeman C. Strong , Matthew C. Merten , Tong Li
Inventor: Beeman C. Strong , Matthew C. Merten , Tong Li
IPC classification: G06F9/30
CPC classification: G06F9/30145 , G06F9/3806 , G06F9/3857 , G06F11/3476 , G06F11/3636
Abstract: A processing device implementing minimizing bandwidth to track return targets by an instruction tracing system is disclosed. A processing device of the disclosure includes an instruction fetch unit comprising a return stack buffer (RSB) to predict a target address of a return (RET) instruction corresponding to a call (CALL) instruction. The processing device further includes a retirement unit comprising an instruction tracing module to initiate instruction tracing for instructions executed by the processing device, determine whether the target address of the RET instruction was mispredicted, determine a value of a call depth counter (CDC) maintained by the instruction tracing module, and when the target address of the RET instruction was not mispredicted and when the value of the CDC is greater than zero, generate an indication that the RET instruction branches to a next linear instruction after the corresponding CALL instruction.
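
The decision rule in this abstract (emit a short indication when the RET was predicted correctly and the call depth counter is above zero) can be sketched as below. The sketch makes several assumptions that the abstract does not state: the function name, the event and packet shapes, and in particular how the CDC is updated (incremented on every traced CALL, decremented on every RET, never below zero).

```python
def trace_returns(events):
    """Produce trace packets for RET instructions.

    Events are ("call",) or ("ret", target_address, mispredicted).
    """
    cdc = 0  # call depth counter: traced CALLs not yet matched by a RET
    packets = []
    for event in events:
        if event[0] == "call":
            cdc += 1
        elif event[0] == "ret":
            _, target, mispredicted = event
            if not mispredicted and cdc > 0:
                # A decoder can recover the target itself: it is the
                # instruction following the matching traced CALL.
                packets.append(("ret_compressed",))
            else:
                # Fall back to logging the full target address.
                packets.append(("ret_target", target))
            cdc = max(cdc - 1, 0)  # assumed update rule
    return packets


packets = trace_returns([
    ("call",),
    ("ret", 0x1004, False),  # matched and predicted: compressed
    ("ret", 0x2000, False),  # CDC is zero: the CALL was never traced
    ("call",),
    ("ret", 0x3000, True),   # mispredicted: full target needed
])
print(packets)
```

The CDC check matters because a RET whose CALL happened before tracing started has no call site in the log for a decoder to refer back to.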

30. Invention application

US20140201505A1 PREDICTION-BASED THREAD SELECTION IN A MULTITHREADING PROCESSOR (Status: under examination, published)
Publication No.: US20140201505A1
Publication date: 2014-07-17
Application No.: US13997837
Filing date: 2012-03-30
Applicant: Matthew C. Merten , Tong Li , Vijaykumar B. Kadgi , Srikanth T. Srinivasan , Christine E. Wang
Inventor: Matthew C. Merten , Tong Li , Vijaykumar B. Kadgi , Srikanth T. Srinivasan , Christine E. Wang
IPC classification: G06F9/30
CPC classification: G06F9/30145 , G06F9/3851 , G06F9/4843 , Y02D10/24
Abstract: A processor includes one or more execution units to execute instructions of a plurality of threads and thread control logic coupled to the execution units to predict whether a first of the plurality of threads is ready for selection in a current cycle based on readiness of instructions of the first thread in one or more previous cycles, to predict whether a second of the plurality of threads is ready for selection in the current cycle based on readiness of instructions of the second thread in the one or more previous cycles, and to select one of the first and second threads in the current cycle based on the predictions.
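
Selecting a thread from predicted rather than observed readiness can be sketched with a simple history-based predictor. The abstract does not say how the prediction is formed, so the 2-bit saturating counters, the round-robin tie-break, and all names below are assumptions chosen only to make the idea concrete.

```python
class ThreadSelector:
    """Pick a thread each cycle from readiness predicted out of
    what was observed in previous cycles."""

    def __init__(self, num_threads=2):
        # One 2-bit saturating counter per thread; values 2 and 3
        # predict "ready". Starting at 2 is an arbitrary choice.
        self.counters = [2] * num_threads
        self.last_selected = num_threads - 1

    def predict(self, thread):
        return self.counters[thread] >= 2

    def select(self):
        """Choose the first thread predicted ready, starting after the
        last selected thread so that ties alternate between threads."""
        n = len(self.counters)
        order = [(self.last_selected + 1 + i) % n for i in range(n)]
        for thread in order:
            if self.predict(thread):
                self.last_selected = thread
                return thread
        # No thread predicted ready: fall back to plain round-robin.
        self.last_selected = order[0]
        return order[0]

    def update(self, thread, was_ready):
        """Train the predictor with the readiness actually observed."""
        count = self.counters[thread]
        self.counters[thread] = min(count + 1, 3) if was_ready else max(count - 1, 0)


selector = ThreadSelector()
first = selector.select()
selector.update(1, was_ready=False)  # thread 1 had nothing ready last cycle
second = selector.select()
selector.update(1, was_ready=True)   # thread 1 has work again
third = selector.select()
```

In this run thread 0 is picked twice in a row while thread 1 is predicted not ready, and thread 1 is picked once its counter recovers.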
