Dynamic random-access memory (DRAM) is the building block of modern main memory systems. DRAM cells must be periodically refreshed to prevent loss of data. These refresh operations waste energy and degrade system performance by interfering with memory accesses. The negative effects of DRAM refresh increase as DRAM device capacity increases. Existing DRAM devices refresh all cells at a rate determined by the leakiest cell in the device. However, most DRAM cells can retain data for significantly longer. Therefore, many of these refreshes are unnecessary. In this paper, we propose RAIDR (Retention-Aware Intelligent DRAM Refresh), a low-cost mechanism that can identify and skip unnecessary refreshes using knowledge of cell retention times. Our key idea is to group DRAM rows into retention time bins and apply a different refresh rate to each bin. As a result, rows containing leaky cells are refreshed as frequently as normal, while most rows are refreshed less frequently. RAIDR uses Bloom filters to efficiently implement retention time bins. RAIDR requires no modification to DRAM and minimal modification to the memory controller. In an 8-core system with 32 GB DRAM, RAIDR achieves a 74.6% refresh reduction, an average DRAM power reduction of 16.1%, and an average system performance improvement of 8.6% over existing systems, at a modest storage overhead of 1.25 KB in the memory controller. RAIDR's benefits are robust to variation in DRAM system configuration, and increase as memory capacity increases.
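To make the binning mechanism concrete, here is a minimal sketch in Python of how a refresh scheduler could consult Bloom-filter bins; the filter sizes, hash construction, bin count, and refresh intervals are illustrative assumptions, not RAIDR's exact parameters.

    import hashlib

    class BloomFilter:
        """A simple Bloom filter over DRAM row addresses."""
        def __init__(self, size_bits, num_hashes):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = 0  # bit vector packed into an int

        def _hashes(self, row):
            for i in range(self.num_hashes):
                h = hashlib.sha256(f"{i}:{row}".encode()).digest()
                yield int.from_bytes(h[:8], "little") % self.size

        def add(self, row):
            for pos in self._hashes(row):
                self.bits |= 1 << pos

        def __contains__(self, row):
            return all(self.bits >> pos & 1 for pos in self._hashes(row))

    # Hypothetical bins: rows needing 64 ms or 128 ms refresh; all other
    # rows fall through to a slow default interval (256 ms here).
    bin_64ms = BloomFilter(size_bits=2048, num_hashes=4)
    bin_128ms = BloomFilter(size_bits=8192, num_hashes=4)

    def refresh_interval_ms(row):
        if row in bin_64ms:
            return 64
        if row in bin_128ms:
            return 128
        return 256

    bin_64ms.add(0x1A2B)  # row profiled as containing a leaky cell
    print(refresh_interval_ms(0x1A2B), refresh_interval_ms(0x9999))  # 64 256

Because a Bloom filter can return false positives but never false negatives, a misclassified row is only refreshed more often than necessary, so the optimization cannot cause data loss.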
******
Modern memory controllers employ sophisticated address mapping, command scheduling, and power management optimizations to alleviate the adverse effects of DRAM timing and resource constraints on system performance. A promising way of improving the versatility and efficiency of these controllers is to make them programmable---a proven technique that has seen wide use in other control tasks ranging from DMA scheduling to NAND Flash and directory control. Unfortunately, the stringent latency and throughput requirements of modern DDRx devices have rendered such programmability largely impractical, confining DDRx controllers to fixed-function hardware. This paper presents the instruction set architecture (ISA) and hardware implementation of PARDIS, a programmable memory controller that can meet the performance requirements of a high-speed DDRx interface. The proposed controller is evaluated by mapping previously proposed DRAM scheduling, address mapping, refresh scheduling, and power management algorithms onto PARDIS. Simulation results show that the average performance of PARDIS comes within 8% of fixed-function hardware for each of these techniques; moreover, by enabling application-specific optimizations, PARDIS improves system performance by 6--17% and reduces DRAM energy by 9--22% over four existing memory controllers.
******
To address the real-time processing needs of large and growing amounts of data, modern software increasingly uses main memory as the primary data store for critical information. This trend creates a new emphasis on high-capacity, high-bandwidth, and high-reliability main memory systems. Conventional and recently-proposed server memory techniques can satisfy these requirements, but at the cost of significantly increased memory power, a key constraint for future memory systems. In this paper, we exploit the low-power nature of another high volume memory component---mobile DRAM---while improving its bandwidth and reliability shortcomings with a new DIMM architecture. We propose Buffered Output On Module (BOOM) that buffers the data outputs from multiple ranks of low-frequency mobile DRAM devices, which in aggregation provide high bandwidth and achieve chipkill-correct or even stronger reliability. Our evaluation shows that BOOM can reduce main memory power by more than 73% relative to the baseline chipkill system, while improving average performance by 5% and providing strong reliability. For memory-intensive applications, BOOM can improve performance by 30--40%.
******
To increase datacenter energy efficiency, we need memory systems that keep pace with processor efficiency gains. Currently, servers use DDR3 memory, which is designed for high bandwidth but not for energy proportionality. A system using 20% of the peak DDR3 bandwidth consumes 2.3x the energy per bit compared to a system with fully utilized memory bandwidth. Nevertheless, many datacenter applications stress memory capacity and latency but not memory bandwidth. In response, we architect server memory systems using mobile DRAM devices, trading peak bandwidth for lower energy consumption per bit and more efficient idle modes. We demonstrate 3-5x lower memory power, better proportionality, and negligible performance penalties for datacenter workloads.
******
Single-Instruction Multiple-Thread (SIMT) micro-architectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into units, referred to as warps, to amortize the cost of instruction fetch, decode and control logic over multiple execution units. As individual threads take divergent execution paths, their processing takes place sequentially, defeating part of the efficiency advantage of SIMD execution. We present two complementary techniques that mitigate the impact of thread divergence on SIMT micro-architectures. Both techniques relax the SIMD execution model by allowing two distinct instructions to be scheduled to disjoint subsets of the same row of execution units, instead of a single instruction. They increase flexibility by providing more thread grouping opportunities than SIMD, while preserving the affinity between threads to avoid introducing extra memory divergence. We consider (1) co-issuing instructions from different divergent paths of the same warp and (2) co-issuing instructions from different warps. To support (1), we introduce a novel thread reconvergence technique that ensures threads are run back in lockstep at control-flow reconvergence points without hindering their ability to run branches in parallel. We propose a lane shuffling technique to allow solution (2) to benefit from inter-warp correlations in divergence patterns. The combination of all these techniques improves performance by 23% on a set of regular GPGPU applications and by 40% on irregular applications, while maintaining the same instruction-fetch and processing-unit resource requirements as the contemporary Fermi GPU architecture.
******
Wide SIMD-based GPUs have evolved into a promising platform for running general-purpose workloads. Current programmable GPUs allow even code with irregular control to execute well on their SIMD pipelines. To do this, each SIMD lane is considered to execute a logical thread, where hardware ensures that control flow is accurate by automatically applying masked execution. The masked execution, however, often degrades performance because the issue slots of masked lanes are wasted. This degradation can be mitigated by dynamically compacting multiple unmasked threads into a single SIMD unit. This paper proposes a fundamentally new approach to branch compaction that avoids the unnecessary synchronization required by previous techniques and that only stalls threads that are likely to benefit from compaction. Our technique is based on the compaction-adequacy predictor (CAPRI). CAPRI dynamically identifies the compaction-effectiveness of a branch and only stalls threads that are predicted to benefit from compaction. We utilize a simple single-level branch-predictor-inspired structure and show that this simple configuration attains a prediction accuracy of 99.8% and 86.6% for non-divergent and divergent workloads, respectively. Our performance evaluation demonstrates that CAPRI consistently outperforms both the baseline design that never attempts compaction and prior work that stalls upon all divergent branches.
******
Since the introduction of fully programmable vertex shader hardware, GPU computing has made tremendous advances. Exception support and speculative execution are the next steps to expand the scope and improve the usability of GPUs. However, traditional mechanisms to support exceptions and speculative execution are highly intrusive to GPU hardware design. This paper builds on two related insights to provide a unified lightweight mechanism for supporting exceptions and speculation on GPUs. First, we observe that GPU programs can be broken into code regions that contain little or no live register state at their entry point. We then also recognize that it is simple to generate these regions in such a way that they are idempotent, allowing their entry points to function as program recovery points and enabling support for exception handling, fast context switches, and speculation, all with very low overhead. We call the architecture of GPUs executing these idempotent regions the iGPU architecture. The hardware extensions required are minimal and the construction of idempotent code regions is fully transparent under the typical dynamic compilation framework of GPUs. We demonstrate how iGPU exception support enables virtual memory paging with very low overhead (1% to 4%), and how speculation support enables circuit-speculation techniques that can provide over 25% reduction in energy.
******
Smartphones represent one of the fastest growing markets, providing significant hardware/software improvements every few months. However, supporting these capabilities reduces the operating time per battery charge. The CPU/GPU component is only left with a shrinking fraction of the power budget, since most of the energy is consumed by the screen and the antenna. In this paper, we focus on improving the energy efficiency of the GPU, since graphical applications constitute an important part of the existing market. Moreover, the trend towards better screens will inevitably lead to a higher demand for improved graphics rendering. We show that the main bottleneck for these applications is the texture cache and that traditional techniques for hiding memory latency (prefetching, multithreading) do not work well or come at a high energy cost. We thus propose the migration of GPU designs towards the decoupled access-execute concept. Furthermore, we significantly reduce bandwidth usage in the decoupled architecture by exploiting inter-core data sharing. Using commercial Android applications, we show that the end design can achieve 93% of the performance of a heavily multithreaded GPU while providing energy savings of 34%.
******
Code reuse attacks (CRAs) are recent security exploits that allow attackers to execute arbitrary code on a compromised machine. CRAs, exemplified by return-oriented and jump-oriented programming approaches, reuse fragments of the library code, thus avoiding the need for explicit injection of attack code on the stack. Since the executed code consists of reused existing code, CRAs bypass current hardware and software security measures that prevent execution from data or stack regions of memory. While software-based full control flow integrity (CFI) checking can protect against CRAs, it incurs significant overhead, involves non-trivial effort of constructing a control flow graph, relies on proprietary tools, and has potential vulnerabilities due to the presence of unintended branch instructions in architectures such as x86---those branches are not checked by the software CFI. We propose branch regulation (BR), a lightweight hardware-supported protection mechanism against CRAs that addresses all limitations of software CFI. BR enforces simple control flow rules in hardware at the function granularity to disallow arbitrary control flow transfers from one function into the middle of another function. This prevents common classes of CRAs without the complexity and run-time overhead of full CFI enforcement. BR incurs a slowdown of about 2% and increases the code footprint by less than 1% on average for the SPEC 2006 benchmarks.
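As a rough illustration of the rule BR enforces, the following Python sketch flags control transfers into the middle of another function; the function bounds and addresses are hypothetical, the real mechanism is a hardware check, and return-address handling is omitted for brevity.

    FUNCTIONS = {                     # name -> (start, end) code addresses
        "main":   (0x1000, 0x10FF),
        "memcpy": (0x2000, 0x20AF),
    }
    ENTRY_POINTS = {start for start, _ in FUNCTIONS.values()}

    def branch_allowed(current_fn, target):
        start, end = FUNCTIONS[current_fn]
        if start <= target <= end:     # intra-function branch: always fine
            return True
        return target in ENTRY_POINTS  # cross-function: entry points only

    print(branch_allowed("main", 0x1040))  # True: stays within main
    print(branch_allowed("main", 0x2000))  # True: calls memcpy's entry
    print(branch_allowed("main", 0x2050))  # False: mid-function, a likely CRA gadget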
******
There have been many attacks that exploit side-effects of program execution to expose secret information and many proposed countermeasures to protect against these attacks. However, there is currently no systematic, holistic methodology for understanding information leakage. As a result, it is not well known how design decisions affect information leakage or the vulnerability of systems to side-channel attacks. In this paper, we propose a metric for measuring information leakage called the Side-channel Vulnerability Factor (SVF). SVF is based on our observation that all side-channel attacks, ranging from physical to microarchitectural to software, rely on recognizing leaked execution patterns. SVF quantifies patterns in attackers' observations and measures their correlation to the victim's actual execution patterns, and in doing so captures systems' vulnerability to side-channel attacks. In a detailed case study of on-chip memory systems, SVF measurements help expose unexpected vulnerabilities in whole-system designs and show how designers can make performance-security trade-offs. Thus, SVF provides a quantitative approach to secure computer architecture.
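A minimal sketch of the idea follows, under simplifying assumptions: numeric traces and an absolute-difference similarity measure stand in for the paper's actual trace processing, and the correlation of the two similarity structures serves as the vulnerability score.

    from math import sqrt

    def similarity_matrix(trace):
        # Pairwise |difference| as a stand-in similarity structure.
        return [abs(a - b) for i, a in enumerate(trace)
                           for b in trace[i + 1:]]

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    victim   = [3, 7, 3, 9, 3, 7]  # e.g., cache sets touched per interval
    observed = [3, 6, 4, 9, 3, 7]  # attacker's noisy view of the same run

    # A value near 1.0 means the attacker's view preserves the victim's
    # execution patterns, i.e., high side-channel vulnerability.
    svf = pearson(similarity_matrix(victim), similarity_matrix(observed))
    print(f"SVF ~ {svf:.2f}")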
******
Over the past two decades, several microarchitectural side channels have been exploited to create sophisticated security attacks. Solutions to this problem have mainly focused on fixing the source of leaks, either by limiting the flow of information through the side channel by modifying hardware, or by refactoring vulnerable software to protect sensitive data from leaking. These solutions are reactive and not preventative: while the modifications may protect against a single attack, they do nothing to prevent future side channel attacks that exploit other microarchitectural side channels or exploit the same side channel in a novel way. In this paper we present a general mitigation strategy that focuses on the infrastructure used to measure side channel leaks rather than the source of leaks, and thus applies to all known and unknown microarchitectural side channel leaks. Our approach is to limit the fidelity of fine grain timekeeping and performance counters, making it difficult for an attacker to distinguish between different microarchitectural events, thus thwarting attacks. We demonstrate the strength of our proposed security modifications, and validate that our changes do not break existing software. Our proposed changes require minor -- or in some cases, no -- hardware modifications and do not result in any substantial performance degradation, yet offer the most comprehensive protection against microarchitectural side channels to date.
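The core of the fidelity-limiting approach can be sketched in a few lines; the granularity and jitter below are assumed values for illustration, not the paper's prescribed hardware policy.

    import random
    import time

    GRANULARITY_NS = 8192  # hypothetical: round timestamps to ~8 us buckets

    def fuzzed_time_ns():
        t = time.monotonic_ns()
        t -= t % GRANULARITY_NS                # coarsen the timestamp
        t += random.randrange(GRANULARITY_NS)  # add bounded jitter
        return t

    # An attacker timing a cache hit vs. a miss (tens of ns apart) now sees
    # the same fuzzed value distribution for both events.
    print(fuzzed_time_ns())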
******
The ability to safely keep a secret in memory is central to the vast majority of security schemes, but storing and erasing these secrets is a difficult problem in the face of an attacker who can obtain unrestricted physical access to the underlying hardware. Depending on the memory technology, the very act of storing a 1 instead of a 0 can have physical side effects measurable even after the power has been cut. These effects cannot be hidden easily, and if the secret stored on chip is of sufficient value, an attacker may go to extraordinary lengths to learn even a few bits of that information. Solving this problem requires a new class of architectures that measurably increase the difficulty of physical analysis. In this paper we take a first step towards this goal by focusing on one of the backbones of any hardware system: on-chip memory. We examine the relationship between security, area, and efficiency in these architectures, and quantitatively examine the resulting systems through cryptographic analysis and microarchitectural impact. In the end, we are able to find an efficient scheme in which, even if an adversary is able to inspect the value of a stored bit with a probabilistic error of only 5%, our system will be able to prevent that adversary from learning any information about the original uncoded bits with 99.9999999999% probability.
******
Nanophotonic networks, a potential candidate for future networks on-chip, have been challenged for their reliability due to several device-level limitations. One of the main issues is that fabrication errors (a.k.a. process variations) can cause devices to malfunction, rendering communication unreliable. For example, a microring resonator, a preferred optical modulator device, may not resonate at the designated wavelength under process variations (PV), leading to communication errors and bandwidth loss. This paper proposes a series of solutions to the wavelength drifting problem of microrings and the subsequent bandwidth loss problem of an optical network due to PV. The objective is to maximize network bandwidth through proper arrangement among microrings and wavelengths with minimum power requirement. Our arrangement, called "MinTrim", solves this problem using simple integer linear programming, adding supplementary microrings and allowing flexible assignment of wavelengths to network nodes as long as the resulting network presents maximal bandwidth. Each step is shown to improve bandwidth provisioning with lower power requirement. Evaluations on a sample network show that a baseline network could lose more than 40% bandwidth due to PV. Such loss can be recovered by MinTrim to produce a network with 98.4% working bandwidth. In addition, the power required in arranging microrings is 39% lower than the baseline. Therefore, MinTrim provides an efficient PV-tolerant solution to improving the reliability of on-chip photonics.
******
Silicon photonics is a promising technology to scale off-chip bandwidth in a power-efficient manner. Given equivalent bandwidth, the flexibility of switched networks often leads to the assumption that they deliver greater performance than point-to-point networks on message passing applications with low-radix traffic patterns. However, when optical losses are considered and total optical power is constrained, this assumption no longer holds. In this paper we present a power-constrained method for designing photonic interconnects that uses the power characteristics and limits of optical switches, waveguide crossings, inter-layer couplers and waveguides. We apply this method to design three switched network topologies for a multi-chip system. Using synthetic and HPC benchmark-derived message patterns, we simulated the three switched networks and a WDM point-to-point network. We show that switched networks outperform point-to-point networks only when the optical losses of switches and inter-layer couplers are each 0.75 dB or lower; achieving this would require a major breakthrough in device development. We then show that this result extends to any switched network with similarly complex topology, through simulations of an idealized "perfect" network that supports 90% of the peak bandwidth under all traffic patterns. We conclude that given a fixed amount of input optical power, under realistic device assumptions, a point-to-point network has the best performance and energy characteristics.
******
Mainstream general-purpose microprocessors require a collection of high-performance interconnects to supply the necessary data movement. The trend of continued increase in core count has prompted designs of packet-switched networks as a scalable solution for future-generation chips. However, the cost of scalability can be significant and especially hard to justify for smaller-scale chips. In contrast, a circuit-switched bus using transmission lines and corresponding circuits offers lower latencies and much lower energy costs for smaller-scale chips, making it a better choice than a full-blown network-on-chip (NoC) architecture. However, shared-medium designs are perceived as only a niche solution for small- to medium-scale chips. In this paper, we show that there are many low-cost mechanisms to enhance the effective throughput of a bus architecture. When a handful of highly cost-effective techniques are applied, the performance advantage of even the most ideally configured NoCs becomes vanishingly small. We find transmission line-based buses to be a more compelling interconnect even for large-scale chip-multiprocessors, and thus bring into doubt the centrality of packet switching in future on-chip interconnect.
******
As the scales of parallel applications and platforms increase, the negative impact of communication latencies on performance grows. Fortunately, modern High Performance Computing (HPC) systems can exploit low-latency topologies of high-radix switches. In this context, we propose the use of random shortcut topologies, which are generated by augmenting classical topologies with random links. Using graph analysis, we find that these topologies, when compared to non-random topologies of the same degree, lead to drastically reduced diameter and average shortest path length. The best results are obtained when adding random links to a ring topology, meaning that good random shortcut topologies can easily be generated for arbitrary numbers of switches. Using flit-level discrete event simulation, we find that random shortcut topologies achieve throughput comparable to and latency lower than that of existing non-random topologies such as hypercubes and tori. Finally, we discuss and quantify practical challenges for random shortcut topologies, including routing scalability and larger physical cable lengths.
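The path-length-reduction claim is easy to reproduce with a toy experiment; the sketch below (stdlib-only Python, not the authors' simulator) compares the average shortest path of a 64-switch ring before and after adding one random shortcut per switch.

    import random
    from collections import deque

    def avg_shortest_path(adj):
        # BFS from every source; average hop count over all ordered pairs.
        n, total = len(adj), 0
        for src in range(n):
            dist = {src: 0}
            q = deque([src])
            while q:
                u = q.popleft()
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        q.append(v)
            total += sum(dist.values())
        return total / (n * (n - 1))

    random.seed(0)
    n = 64
    ring = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
    print("ring:", avg_shortest_path(ring))  # roughly n/4 hops

    # Add one random bidirectional shortcut per switch.
    for i in range(n):
        j = random.randrange(n)
        if j != i:
            ring[i].add(j)
            ring[j].add(i)
    print("ring + shortcuts:", avg_shortest_path(ring))  # drops sharply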
******
Languages such as C and C++ use unsafe manual memory management, allowing simple bugs (e.g., accesses to an object after deallocation) to become the root cause of exploitable security vulnerabilities. This paper proposes Watchdog, a hardware-based approach for ensuring safe and secure manual memory management. Inspired by prior software-only proposals, Watchdog generates a unique identifier for each memory allocation, associates these identifiers with pointers, and checks to ensure that the identifier is still valid on every memory access. This use of identifiers and checks enables Watchdog to detect errors even in the presence of reallocations. Watchdog stores these pointer identifiers in a disjoint shadow space to provide comprehensive protection and ensure compatibility with existing code. To streamline the implementation and reduce runtime overhead, Watchdog (1) uses micro-ops to access metadata and perform checks, (2) eliminates metadata copies among registers via modified register renaming, and (3) uses a dedicated metadata cache to reduce checking overhead. Furthermore, this paper extends Watchdog's mechanisms to detect bounds errors, thereby providing full hardware-enforced memory safety at low overheads.
******
Data-race freedom is a valuable safety property for multithreaded programs that helps with catching bugs, simplifying memory consistency model semantics, and verifying and enforcing both atomicity and determinism. Unfortunately, existing software-only dynamic race detectors are precise but slow; proposals with hardware support offer higher performance but are imprecise. Both precision and performance are necessary to achieve the many advantages always-on dynamic race detection could provide. To resolve this trade-off, we propose Radish, a hybrid hardware-software dynamic race detector that is always-on and fully precise. In Radish, hardware caches a principled subset of the metadata necessary for race detection; this subset allows the vast majority of race checks to occur completely in hardware. A flexible software layer handles persistence of race detection metadata on cache evictions and occasional queries to this expanded set of metadata. We show that Radish is correct by proving equivalence to a conventional happens-before race detector. Our design has modest hardware complexity: caches are completely unmodified and we piggy-back on existing coherence messages but do not otherwise modify the protocol. Furthermore, Radish can leverage type-safe languages to reduce overheads substantially. Our evaluation of a simulated 8-core Radish processor using PARSEC benchmarks shows runtime overheads from negligible to 2x, outperforming the leading software-only race detector by 2x-37x.
******
Single-ISA heterogeneous multi-core processors are typically composed of small (e.g., in-order) power-efficient cores and big (e.g., out-of-order) high-performance cores. The effectiveness of heterogeneous multi-cores depends on how well a scheduler can map workloads onto the most appropriate core type. In general, small cores can achieve good performance if the workload inherently has high levels of ILP. On the other hand, big cores provide good performance if the workload exhibits high levels of MLP or requires the ILP to be extracted dynamically. This paper proposes Performance Impact Estimation (PIE) as a mechanism to predict which workload-to-core mapping is likely to provide the best performance. PIE collects CPI stack, MLP and ILP profile information, and estimates performance if the workload were to run on a different core type. Dynamic PIE adjusts the scheduling at runtime and thereby exploits fine-grained time-varying execution behavior. We show that PIE requires limited hardware support and can improve system performance by an average of 5.5% over recent state-of-the-art scheduling proposals and by 8.7% over a sampling-based scheduling policy.
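As a hedged illustration of PIE-style estimation (the decomposition and constants below are assumptions for the sketch, not the paper's exact model), performance on the other core type can be projected from profiled CPI components.

    def estimate_cpi_on_big(cpi_base, cpi_mem, mlp_small, mlp_big,
                            width_ratio=2.0):
        # Assumption: the wider big core scales the compute component by its
        # issue-width ratio, and shrinks the memory-stall component by the
        # ratio of achievable memory-level parallelism.
        return cpi_base / width_ratio + cpi_mem * (mlp_small / mlp_big)

    # Workload profiled on the small core: base CPI 1.2, memory CPI 2.4,
    # sustaining MLP of 1.5 there vs. an estimated 4.0 on the big core.
    print(estimate_cpi_on_big(1.2, 2.4, 1.5, 4.0))  # 1.5: big core wins here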
******
On the hardware side, asymmetric multicore processors present software with the challenge and opportunity of optimizing in two dimensions: performance and power. Asymmetric multicore processors (AMP) combine general-purpose big (fast, high power) cores and small (slow, low power) cores to meet power constraints. Realizing their energy efficiency opportunity requires workloads with differentiated performance and power characteristics. On the software side, managed workloads written in languages such as C#, Java, JavaScript, and PHP are ubiquitous. Managed languages abstract over hardware using Virtual Machine (VM) services (garbage collection, interpretation, and/or just-in-time compilation) that together impose substantial energy and performance costs, ranging from 10% to over 80%. We show that these services manifest a differentiated performance and power workload. To differing degrees, they are parallel, asynchronous, communicate infrequently, and are not on the application's critical path. We identify a synergy between AMP and VM services that we exploit to attack the 40% average energy overhead due to VM services. Using measurements and very conservative models, we show that adding small cores tailored for VM services should deliver, at least, improvements in performance of 13%, energy of 7%, and performance per energy of 22%. The yin of VM services is overhead, but it meets the yang of small cores on an AMP. The yin of AMP is exposed hardware complexity, but it meets the yang of abstraction in managed languages. VM services fulfill the AMP requirement for an asynchronous, non-critical, differentiated, parallel, and ubiquitous workload to deliver energy efficiency. Generalizing this approach beyond system software to applications will require substantially more software and hardware investment, but these results show the potential energy efficiency gains are significant.
******
A significant portion of the energy dissipated in modern integrated circuits is consumed by the overhead associated with timing guardbands that ensure reliable execution. Timing speculation, where the pipeline operates at an unsafe voltage with any rare errors detected and resolved by the architecture, has been demonstrated to significantly improve the energy-efficiency of scalar processor designs. Unfortunately, applying the same timing-speculative approach to wide-SIMD architectures, such as those used in highly-efficient GPUs, may not provide similar gains. In this work, we make two important contributions. The first is a set of models describing a parametrized general error probability function that is based on measurements of a fabricated chip and the expected efficiency benefits of timing speculation in a SIMD context. The second contribution is a decoupled SIMD pipeline that more effectively utilizes timing speculation and recovery, when compared with a standard SIMD design that uses only conventional timing speculation. The proposed lane decoupling enables each SIMD lane to tolerate timing errors independent of other adjacent lanes, resulting in higher throughput and improved scalability. We validate our models and evaluate our design using a cycle-based GPU simulator, describe the conditions where efficiency improvements can be obtained, and explore the benefits of decoupling across a wide range of parameters. Our results show that timing speculation can achieve up to 10.3% improvement in efficiency.
******
Power consumption is a primary concern for microprocessor designers. Lowering the supply voltage of processors is one of the most effective techniques for improving their energy efficiency. Unfortunately, low-voltage operation faces multiple challenges going forward. One such challenge is increased sensitivity to voltage fluctuations, which can trigger so-called "voltage emergencies" that can lead to errors. These fluctuations are caused by abrupt changes in power demand, triggered by processor activity variation as a function of workload. This paper examines the effects of voltage fluctuations on future many-core processors. With the increase in the number of cores in a chip, the effects of chip-wide activity fluctuation -- such as that caused by global synchronization in multithreaded applications -- overshadow the effects of core-level workload variability. Starting from this observation, we developed VRSync, a novel synchronization methodology that uses emergency-aware scheduling policies that reduce the slope of load fluctuations, eliminating emergencies. We show that VRSync is very effective at eliminating emergencies, allowing voltage guardbands to be significantly lowered, which reduces energy consumption by an average of 33%.
******
Bidirectional debugging and error recovery have different goals (programmer productivity and system reliability, respectively), yet they both require the ability to roll back the program or the system to a past state. This rollback functionality is typically implemented using checkpoints that can restore the system/application to a specific point in time. There are several types of checkpoints, and bidirectional debugging and error-recovery use them in different ways. This paper presents Euripus, a flexible hardware accelerator for memory checkpointing that can create different combinations of checkpoints needed for bidirectional debugging, error recovery, or both. In particular, Euripus is the first hardware technique to provide consolidation-friendly undo-logs (for bidirectional debugging), to allow simultaneous construction of both undo and redo logs, and to support multi-level checkpointing for the needs of error-recovery. Euripus incurs low performance overheads (<5% on average), improves roll-back latency for bidirectional debugging by >30%, and supports rapid multi-level error recovery that allows >95% system efficiency even with very high error rates.
******
Soft error reliability has become a first-order design criterion for modern microprocessors. Architectural Vulnerability Factor (AVF) modeling is often used to capture the probability that a radiation-induced fault in a hardware structure will manifest as an error at the program output. AVF estimation requires detailed microarchitectural simulations which are time-consuming and typically present aggregate metrics. Moreover, it requires a large number of simulations to derive insight into the impact of microarchitectural events on AVF. In this work, we present a first-order mechanistic analytical model for computing AVF by estimating the occupancy of correct-path state in important microarchitecture structures through inexpensive profiling. We show that the model estimates the AVF for the reorder buffer, issue queue, load and store queue, and functional units in a 4-wide issue machine with a mean absolute error of less than 0.07. The model is constructed from the first principles of out-of-order processor execution in order to provide novel insight into the interaction of the workload with the microarchitecture to determine AVF. We demonstrate that the model can be used to perform design space explorations to understand trade-offs between soft error rate and performance, to study the impact of scaling of microarchitectural structures on AVF and performance, and to characterize workloads for AVF.
******
Memory system reliability is a serious and growing concern in modern servers. Existing chipkill-level memory protection mechanisms suffer from several drawbacks. They activate a large number of chips on every memory access -- this increases energy consumption, and reduces performance due to the reduction in rank-level parallelism. Additionally, they increase access granularity, resulting in wasted bandwidth in the absence of sufficient access locality. They also restrict systems to use narrow-I/O x4 devices, which are known to be less energy-efficient than the wider x8 DRAM devices. In this paper, we present LOT-ECC, a localized and multi-tiered protection scheme that attempts to solve these problems. We separate error detection and error correction functionality, and employ simple checksum and parity codes effectively to provide strong fault-tolerance, while simultaneously simplifying implementation. Data and codes are localized to the same DRAM row to improve access efficiency. We use system firmware to store correction codes in DRAM data memory and modify the memory controller to handle data mapping. We thus build an effective fault-tolerance mechanism that provides strong reliability guarantees, activates as few chips as possible (reducing power consumption by up to 44.8% and reducing latency by up to 46.9%), and reduces circuit complexity, all while working with commodity DRAMs and operating systems. Finally, we propose the novel concept of a heterogeneous DIMM that enables the extension of LOT-ECC to x16 and wider DRAM parts.
******
Most modern cores perform a highly-associative translation lookaside buffer (TLB) lookup on every memory access. These designs often hide the TLB lookup latency by overlapping it with L1 cache access, but this overlap does not hide the power dissipated by TLB lookups. It can even exacerbate the power dissipation by requiring a higher-associativity L1 cache. With today's concern for power dissipation, designs could instead adopt a virtual L1 cache, wherein TLB access power is dissipated only after L1 cache misses. Unfortunately, virtual caches have compatibility issues, such as supporting writeable synonyms and x86's physical page table walker. This work proposes an Opportunistic Virtual Cache (OVC) that exposes virtual caching as a dynamic optimization by allowing some memory blocks to be cached with virtual addresses and others with physical addresses. OVC relies on small OS changes to signal which pages can use virtual caching (e.g., no writeable synonyms), but defaults to physical caching for compatibility. We show OVC's promise with analysis that finds virtual cache problems exist, but are dynamically rare. We change 240 lines in Linux 2.6.28 to enable OVC. On experiments with Parsec and commercial workloads, the resulting system saves 94-99% of TLB lookup energy and nearly 23% of L1 cache dynamic lookup energy.
******
In modern DDRx memory systems, memory write requests compete with read requests for available memory resources, significantly increasing the average read request service time. Caches are used to mitigate the long memory read latency that limits system performance. Dirty blocks in the last-level cache (LLC) that will not be written again before they are evicted will eventually be written back to memory. We refer to these blocks as last-write blocks. In this paper, we propose an LLC writeback technique that improves DRAM efficiency by scheduling predicted last-write blocks early. We propose a low-overhead last-write predictor for the LLC. The predicted last-write blocks are made available to the memory controller for scheduling. This technique effectively redistributes the memory requests and expands write-scheduling opportunities, allowing writes to be serviced efficiently by DRAM. The technique is flexible enough to be applied to any LLC replacement policy. Our evaluation with multi-programmed workloads shows that the technique significantly improves performance by 6.5%-11.4% on average over the traditional writeback technique in an eight-core processor with various DRAM configurations running memory intensive benchmarks.
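Predictors of this kind are commonly built from PC-indexed saturating counters; the sketch below makes that structure concrete (the table size, indexing, and thresholds are assumptions, not the paper's exact design).

    TABLE_SIZE = 1024
    counters = [0] * TABLE_SIZE  # 2-bit saturating counters

    def index(pc):
        return (pc >> 2) % TABLE_SIZE

    def predict_last_write(pc):
        # Predict "last write" only when the counter is confident.
        return counters[index(pc)] >= 2

    def train(pc, was_last_write):
        i = index(pc)
        if was_last_write:
            counters[i] = min(3, counters[i] + 1)
        else:
            counters[i] = max(0, counters[i] - 1)

    # On eviction of a dirty block we learn whether the PC of its final
    # writer really produced a last write, and train the table accordingly.
    train(0x400D20, True)
    train(0x400D20, True)
    print(predict_last_write(0x400D20))  # True: eagerly schedule its writeback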
******
Exclusive last-level caches (LLCs) reduce memory accesses by effectively utilizing cache capacity. However, they require excessive on-chip bandwidth to support frequent insertions of cache lines on eviction from upper-level caches. Non-inclusive caches, on the other hand, have the advantage of using the on-chip bandwidth more effectively but suffer from a higher miss rate. Traditionally, the decision to use the cache as exclusive or non-inclusive is made at design time. However, the best option for a cache organization depends on application characteristics, such as working set size and the amount of traffic consumed by LLC insertions. This paper proposes FLEXclusion, a design that dynamically selects between exclusion and non-inclusion depending on workload behavior. With FLEXclusion, the cache behaves like an exclusive cache when the application benefits from extra cache capacity, and it acts as a non-inclusive cache when additional cache capacity is not useful, so that it can reduce on-chip bandwidth. FLEXclusion leverages the observation that both non-inclusion and exclusion rely on similar hardware support, so our proposal can be implemented with negligible hardware changes. Our evaluations show that a FLEXclusive cache reduces the on-chip LLC insertion traffic by 72.6% compared to an exclusive design and improves performance by 5.9% compared to a non-inclusive design.
******
The continuing decrease in dimensions and operating voltage of transistors has increased their sensitivity against radiation phenomena, making soft errors an important challenge in future chip multiprocessors (CMPs). Hence, new techniques for detecting errors in the logic and memories that allow meeting the desired failures-in-time (FIT) budget in CMPs are required. This paper proposes a low-cost dynamic particle strike detection mechanism through acoustic wave detectors. Our results show that our mechanism can protect both the logic and the memory arrays. As a case study, we also show how this technique can be combined with error codes to protect the last-level cache at low cost.
******
The reliability of future processors is threatened by decreasing transistor robustness. Current architectures focus on delivering high performance at low cost; lifetime device reliability is a secondary concern. As the rate of permanent hardware faults increases, robustness will become a first class constraint for even low-cost systems. Current research into reliable architectures has focused on ad-hoc solutions to improve designs without altering their centralized control logic. Unfortunately, this centralized control presents a single point of failure, which limits long-term robustness. To address this issue, we introduce Viper, an architecture built from a redundant collection of fine-grained hardware components. Instructions are perceived as customers that require a sequence of services in order to properly execute. The hardware components vie to perform what services they can, dynamically forming virtual pipelines that avoid defective hardware. This is done using distributed control logic, which avoids a single point of failure by construction. Viper can tolerate a high number of permanent faults due to its inherent redundancy. As fault counts increase, its performance degrades more gracefully than traditional centralized-logic architectures. We estimate that fault rates higher than one permanent fault per 12 million transistors, on average, cause the throughput of a classic CMP design to fall below that of a Viper design of similar size.
******
Due to the evolution of technology constraints, especially energy constraints that may lead to heterogeneous multi-cores, and the increasing number of defects, the design of defect-tolerant accelerators for heterogeneous multi-cores may become a major micro-architecture research issue. Most custom circuits are highly defect sensitive; a single faulty transistor can wreck such circuits. On the contrary, artificial neural networks (ANNs) are inherently error-tolerant algorithms. Moreover, the emergence of high-performance applications implementing recognition and mining tasks, for which competitive ANN-based algorithms exist, drastically expands the potential application scope of a hardware ANN accelerator. However, while the error tolerance of ANN algorithms is well documented, there are few in-depth attempts at demonstrating that an actual hardware ANN would be tolerant to faulty transistors. Most fault models are abstract and cannot demonstrate that the error tolerance of ANN algorithms can be translated into the defect tolerance of hardware ANN accelerators. In this article, we introduce a hardware ANN geared towards defect tolerance and energy efficiency, by spatially expanding the ANN. In order to precisely assess the defect tolerance capability of this hardware ANN, we introduce defects at the level of transistors, and then assess the impact of such defects on the hardware ANN's functional behavior. We empirically show that the conceptual error tolerance of neural networks does translate into the defect tolerance of hardware neural networks, paving the way for their introduction in heterogeneous multi-cores as intrinsically defect-tolerant and energy-efficient accelerators.
******
Modern DRAMs have multiple banks to serve multiple memory requests in parallel. However, when two requests go to the same bank, they have to be served serially, exacerbating the high latency of off-chip memory. Adding more banks to the system to mitigate this problem incurs high system cost. Our goal in this work is to achieve the benefits of increasing the number of banks with a low cost approach. To this end, we propose three new mechanisms that overlap the latencies of different requests that go to the same bank. The key observation exploited by our mechanisms is that a modern DRAM bank is implemented as a collection of subarrays that operate largely independently while sharing only a few global peripheral structures. Our proposed mechanisms (SALP-1, SALP-2, and MASA) mitigate the negative impact of bank serialization by overlapping different components of the bank access latencies of multiple requests that go to different subarrays within the same bank. SALP-1 requires no changes to the existing DRAM structure and only needs reinterpretation of some DRAM timing parameters. SALP-2 and MASA require only modest changes (< 0.15% area overhead) to the DRAM peripheral structures, which are much less design constrained than the DRAM core. Evaluations show that all our schemes significantly improve performance for both single-core systems and multi-core systems. Our schemes also interact positively with application-aware memory request scheduling in multi-core systems.
******
Phase Change Memory (PCM) is a promising technology for building future main memory systems. A prominent characteristic of PCM is that it has write latency much higher than read latency. Servicing such slow writes causes significant contention for read requests. For our baseline PCM system, the slow writes increase the effective read latency by almost 2X, causing significant performance degradation. This paper alleviates the problem of slow writes by exploiting the fundamental property of PCM devices that writes are slow only in one direction (SET operation) and are almost as fast as reads in the other direction (RESET operation). Therefore, a write operation to a line in which all memory cells have been SET prior to the write will incur much lower latency. We propose PreSET, an architectural technique that leverages this property to pro-actively SET all the bits in a given memory line well in advance of the anticipated write to that memory line. Our proposed design initiates a PreSET request for a memory line as soon as that line becomes dirty in the cache, thereby allowing a large window of time for the PreSET operation to complete. Our evaluations show that PreSET is more effective and incurs lower storage overhead than previously proposed write cancellation techniques. We also describe static and dynamic throttling schemes to limit the rate of PreSET operations. Our proposal reduces effective read latency from 982 cycles to 594 cycles and increases system performance by 34%, while improving the energy-delay-product by 25%.
******
The design and implementation of the commodity memory architecture have resulted in significant performance and capacity limitations. To circumvent these limitations, designers and vendors have begun to place intermediate logic between the CPU and DRAM. This additional logic has two functions: to control the DRAM and to communicate with the CPU over a fast and narrow bus. The benefit provided by this logic is a reduction in pin-out to the memory system and increased signal integrity to the DRAM, allowing faster clock rates while maintaining capacity. While the few vendors utilizing this design have used the same general approach, their implementations vary greatly in their nontrivial details. A hardware-verified simulation suite is developed to accurately model and evaluate the behavior of this buffer-on-board memory system. A study of this design space is used to determine optimal use of the resources involved. This includes DRAM and bus organization, queue storage, and mapping schemes. Various constraints based on implementation costs are placed on simulated configurations to confirm that these optimizations apply to viable systems. Finally, full system simulations are performed to better understand how this memory system interacts with an operating system executing an application, with the goal of uncovering behaviors not present in simple limit-case simulations. When applying insights gleaned from these simulations, optimal performance can be achieved while still considering outside constraints (i.e., pin-out, power, and fabrication costs).
******
NAND flash storage has proven to be a competitive alternative to traditional disk for its high random-access speed, low power, and presumed efficacy for random reads. Ironically, we demonstrate that when flash is packaged in SSD format, many barriers prevent reads from reaching full parallelism, resulting in random writes out-performing them. Motivated by this, we propose Physically Addressed Queuing (PAQ), a request scheduler that avoids resource contention resultant from shared SSD resources. PAQ makes the following major contributions: First, it exposes the physical addresses of requests to the scheduler. Second, I/O clumping is utilized to select groups of operations that can be simultaneously executed without major resource conflict. Third, inter-request NAND transaction packing enables multi-plane-mode operations. We implement PAQ in a cycle-accurate simulator and demonstrate bandwidth and IOPS improvements greater than 62% and latency decreases of as much as 41.6% for random reads, without degrading performance of other access types.
******
When multiple processor (CPU) cores and a GPU integrated together on the same chip share the off-chip main memory, requests from the GPU can heavily interfere with requests from the CPU cores, leading to low system performance and starvation of CPU cores. Unfortunately, state-of-the-art application-aware memory scheduling algorithms are ineffective at solving this problem at low complexity due to the large amount of GPU traffic. A large and costly request buffer is needed to provide these algorithms with enough visibility across the global request stream, requiring relatively complex hardware implementations. This paper proposes a fundamentally new approach that decouples the memory controller's three primary tasks into three significantly simpler structures that together improve system performance and fairness, especially in integrated CPU-GPU systems. Our three-stage memory controller first groups requests based on row-buffer locality. This grouping allows the second stage to focus only on inter-application request scheduling. These two stages enforce high-level policies regarding performance and fairness, and therefore the last stage consists of simple per-bank FIFO queues (no further command reordering within each bank) and straightforward logic that deals only with low-level DRAM commands and timing. We evaluate the design trade-offs involved in our Staged Memory Scheduler (SMS) and compare it against three state-of-the-art memory controller designs. Our evaluations show that SMS improves CPU performance without degrading GPU frame rate beyond a generally acceptable level, while being significantly less complex to implement than previous application-aware schedulers. Furthermore, SMS can be configured by the system software to prioritize the CPU or the GPU at varying levels to address different performance needs.
******
Effective sharing of the last level cache has a significant influence on the overall performance of a multicore system. We observe that existing solutions control cache occupancy at a coarse granularity, do not scale well to large core counts, and in some cases lack the flexibility to support a variety of performance goals. In this paper, we propose Probabilistic Shared Cache Management (PriSM), a framework to manage the cache occupancy of different cores at cache block granularity by controlling their eviction probabilities. The proposed framework requires only simple hardware changes to implement, can scale to larger core counts, and is flexible enough to support a variety of performance goals. We demonstrate the flexibility of PriSM by computing the eviction probabilities needed to achieve goals like hit maximization, fairness, and QOS. PriSM-HitMax improves performance by 18.7% over LRU and 11.8% over previously proposed schemes in a sixteen core machine. PriSM-Fairness improves fairness over existing solutions by 23.3% along with a performance improvement of 19.0%. PriSM-QOS successfully achieves the desired QOS targets.
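To make the probabilistic control loop concrete, here is a hedged sketch (not PriSM's exact formulas) of deriving per-core eviction probabilities from current and target occupancies.

    def eviction_probabilities(occupancy, target):
        # Evict preferentially from cores holding more than their target
        # share; normalize the positive excesses into a distribution.
        excess = [max(0.0, o - t) for o, t in zip(occupancy, target)]
        total = sum(excess)
        if total == 0:  # already at or below target everywhere
            return [o / sum(occupancy) for o in occupancy]
        return [e / total for e in excess]

    occupancy = [0.50, 0.30, 0.15, 0.05]  # fraction of cache blocks per core
    target    = [0.25, 0.25, 0.25, 0.25]  # e.g., a fairness goal
    print(eviction_probabilities(occupancy, target))  # approx [0.83, 0.17, 0, 0]

On each miss, the controller would pick the victim's owner core by sampling this distribution, nudging occupancy toward the target over the next window of evictions.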
******
Current processor trends of integrating more cores with wider SIMD units, along with a deeper and more complex memory hierarchy, have made it increasingly more challenging to extract performance from applications. It is believed by some that traditional approaches to programming do not apply to these modern processors and hence radical new languages must be discovered. In this paper, we question this thinking and offer evidence in support of traditional programming methods and the performance-versus-programming-effort effectiveness of common multi-core processors and upcoming many-core architectures in delivering significant speedup, and close-to-optimal performance for commonly used parallel computing workloads. We first quantify the extent of the "Ninja gap", which is the performance gap between naively written C/C++ code that is parallelism unaware (often serial) and best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show that there is an average Ninja gap of 24X (up to 53X) for a recent 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap, if left unaddressed, will inevitably increase. We show how a set of well-known algorithmic changes coupled with advancements in modern compiler technology can bring down the Ninja gap to an average of just 1.3X. These changes typically require low programming effort, as compared to the very high effort in producing Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and even further increase programmer productivity. We show equally encouraging results for the upcoming Intel® Many Integrated Core architecture (Intel® MIC), which has more cores and wider SIMD. We thus demonstrate that we can contain the otherwise uncontrolled growth of the Ninja gap and offer a more stable and predictable performance growth over future architectures, offering strong evidence that radical language changes are not required.
******
Efficient execution of well-parallelized applications is central to performance in the multicore era. Program analysis tools support the hardware and software sides of this effort by exposing relevant features of multithreaded applications. This paper describes parallel block vectors, which uncover previously unseen characteristics of parallel programs. Parallel block vectors provide block execution profiles per concurrency phase (e.g., the block execution profile of all serial regions of a program). This information provides a direct and fine-grained mapping between an application's runtime parallel phases and the static code that makes up those phases. This paper also demonstrates how to collect parallel block vectors with minimal application perturbation using Harmony. Harmony is an instrumentation pass for the LLVM compiler that introduces just 16-21% overhead on average across eight Parsec benchmarks.
We apply parallel block vectors to uncover several novel insights about parallel applications with direct consequences for architectural design. First, the serial and parallel phases of execution used in Amdahl's Law are often composed of many of the same basic blocks. Second, program features, such as instruction mix, vary with the degree of parallelism, with serial phases in particular displaying different instruction mixes from the program as a whole. Third, a block's dynamic execution frequency does not necessarily correlate with its parallelism.
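A minimal sketch of what collecting a parallel block vector amounts to, assuming a trace of (basic block, running thread count) samples; the trace, block names, and two-phase bucketing are illustrative simplifications of Harmony's per-concurrency profiles.

    from collections import defaultdict

    # Hypothetical samples the instrumentation would record at block entry.
    trace = [("bb3", 1), ("bb3", 1), ("bb7", 4), ("bb7", 8), ("bb3", 4)]

    # A parallel block vector: per-block execution counts, bucketed by the
    # concurrency phase in which the block ran.
    pbv = defaultdict(lambda: defaultdict(int))
    for block, threads in trace:
        phase = "serial" if threads == 1 else "parallel"
        pbv[block][phase] += 1

    for block, phases in pbv.items():
        print(block, dict(phases))
    # bb3 executes in both serial and parallel phases: the same static
    # code contributes to both terms of Amdahl's Law.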
******
Multicore architectures, with their abundant on-chip resources, are effectively collections of systems-on-a-chip. The protection system for these architectures must support multiple concurrently executing operating systems (OSes) with different needs, and must manage and protect the hardware's novel communication mechanisms and features. Traditional protection systems are insufficient: they protect supervisor from user code, but typically do not protect one system from another, and only support fixed assignment of resources to protection levels. In this paper, we propose an alternative to traditional protection systems which we call configurable fine-grain protection (CFP). CFP enables the dynamic assignment of in-core resources to protection levels. We investigate how CFP enables different system software stacks to utilize the same configurable protection hardware, and how differing OSes can execute at the same time on a multicore processor with CFP. As illustration, we describe an implementation of CFP in a commercial multicore, the TILE64 processor.
******
Recent improvements in architectural support for virtualization have extended traditional hardware page walkers to traverse nested page tables. However, current two-dimensional (2D) page walkers have been designed under the assumption that the usage patterns of guest and nested page tables are similar. In this paper, we revisit the architectural support for nested page table walks to incorporate the unique characteristics of memory management by hypervisors. Unlike page tables in native systems, nested page table sizes do not impose significant overheads on overall memory usage. Based on this observation, we propose flat nested page tables to reduce unnecessary memory references during nested walks.
A competing mechanism to HW 2D page walkers is shadow paging, which duplicates guest page tables but provides direct translations from guest virtual to system physical addresses. However, shadow paging suffers from the overhead of synchronizing guest and shadow page tables. The second mechanism we propose is a speculative shadow paging mechanism, called speculative inverted shadow paging, which is backed by non-speculative flat nested page tables. The speculative mechanism provides a direct translation with a single memory reference in the common case, and eliminates the page table synchronization overheads. We evaluate the proposed schemes with the real Xen hypervisor running on a full-system simulator. The flat page tables improve a state-of-the-art 2D page walker with a page walk cache and nested TLB by 7%. The speculative shadow paging improves the same 2D page walker by 14%.
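A back-of-the-envelope model of why flattening helps: in a 2D walk, every guest page table pointer (and the final guest physical address) must itself be translated through the nested table, giving the standard (n+1)*m + n reference count for an n-level guest table and m-level nested table. The function name below is ours; the arithmetic follows directly from the walk structure.

    def twod_walk_refs(guest_levels, nested_levels):
        """Memory references for one 2D nested page walk: guest_levels
        guest table reads, plus a nested walk of nested_levels reads for
        each of the guest_levels + 1 guest-physical addresses."""
        return (guest_levels + 1) * nested_levels + guest_levels

    print(twod_walk_refs(4, 4))  # 24 refs: 4-level guest x 4-level nested
    print(twod_walk_refs(4, 1))  # 9 refs when the nested table is flat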
******
Power over-subscription can reduce costs for modern data centers. However, designing the power infrastructure for a lower operating power point than the aggregated peak power of all servers requires dynamic techniques to avoid high peak power costs and, even worse, tripping circuit breakers. This work presents an architecture for distributed per-server UPSs that stores energy during low activity periods and uses this energy during power spikes. This work leverages the distributed nature of the UPS batteries and develops policies that prolong the duration of their usage. The specific approach shaves 19.4% of the peak power for modern servers, at no cost in performance, allowing the installation of 24% more servers within the same power budget. More servers amortize infrastructure costs better and, hence, reduce total cost of ownership per server by 6.3%.
******
Scale-out datacenters mandate high per-server throughput to get the maximum benefit from the large TCO investment. Emerging applications (e.g., data serving and web search) that run in these datacenters operate on vast datasets that are not accommodated by on-die caches of existing server chips. Large caches reduce the die area available for cores and lower performance through long access latency when instructions are fetched. Performance on scale-out workloads is maximized through a modestly-sized last-level cache that captures the instruction footprint at the lowest possible access latency.
In this work, we introduce a methodology for designing scalable and efficient scale-out server processors. Based on a metric of performance-density, we facilitate the design of optimal multi-core configurations, called pods. Each pod is a complete server that tightly couples a number of cores to a small last-level cache using a fast interconnect. Replicating the pod to fill the die area yields processors which have optimal performance density, leading to maximum per-chip throughput. Moreover, as each pod is a stand-alone server, scale-out processors avoid the expense of global (i.e., inter-pod) interconnect and coherence. These features synergistically maximize throughput, lower design complexity, and improve technology scalability. In 20 nm technology, scale-out chips improve throughput by 5x-6.5x over conventional and by 1.6x-1.9x over emerging tiled organizations.
******
Large-scale computing systems such as data centers face increasing pressure to cap their carbon footprint. Integrating emerging clean energy solutions into computer system design therefore gains great significance in the green computing era. While some pioneering work on tracking a variable power budget shows promising energy efficiency, these schemes are not suitable for data centers because they lack performance guarantees when renewable generation is low and fluctuating. In addition, our characterization of wind power behavior reveals that data centers designed to track the intermittent renewable power incur up to 4X performance loss due to inefficient and redundant load-matching activities. As a result, mitigating operational overhead while still maintaining the desired energy utilization becomes the most significant challenge in managing server clusters on intermittent renewable energy generation. In this paper we take a first step in examining the operational overhead of renewable-energy-powered data centers. We propose iSwitch, a lightweight server power management scheme that follows renewable power variation characteristics, leverages existing system infrastructure, and applies a supply/load cooperative scheme to mitigate the performance overhead. Compared with state-of-the-art renewable-energy-driven system designs, iSwitch reduces average network traffic by 75% and peak network traffic by 95%, and cuts job waiting time by 80%, while still maintaining 96% renewable energy utilization. We expect that our work can help computer architects make informed decisions on sustainable and high-performance system design.
******
Sequential consistency (SC) is arguably the most intuitive behavior for a shared-memory multithreaded program. It is widely accepted that language-level SC could significantly improve the programmability of a multiprocessor system. However, efficiently supporting end-to-end SC remains a challenge, as it requires that both compiler and hardware optimizations preserve SC semantics. While a recent study has shown that a compiler can preserve SC semantics for a small performance cost, efficient and complexity-effective SC hardware remains elusive. Past hardware solutions relied on aggressive speculation techniques, which have not yet been realized in a practical implementation.
This paper exploits the observation that hardware need not enforce any memory model constraints on accesses to thread-local and shared read-only locations. A processor can easily determine a large fraction of these safe accesses with assistance from static compiler analysis and the hardware memory management unit. We discuss a low-complexity hardware design that exploits this information to reduce the overhead of ensuring SC. Our design employs an additional unordered store buffer for fast-tracking thread-local stores, allowing later memory accesses to proceed without a memory-ordering-related stall.
Our experimental study shows that the cost of guaranteeing end-to-end SC is only 6.2% on average when compared to a system with TSO hardware executing a stock compiler's output.
******
Hybrid processors are HW/SW co-designed processors that leverage blocked execution, the execution of regions of instructions as atomic blocks, to facilitate aggressive speculative optimization. As we move to a multicore hybrid design, fine-grained conflicts for shared data can violate the atomicity requirement of these blocks and lead to expensive squashes and rollbacks. However, as these atomic regions differ from those used in checkpointing and transactional memory systems, the extent of this potentially prohibitive problem remains unclear, and mechanisms to mitigate these squashes dynamically may be critical to enable a highly performant multicore hybrid design.
In this work, we investigate how multithreaded applications, both benchmark and commercial workloads, are affected by squashes, and present dynamic mechanisms for mitigating these squashes in hybrid processors. While the current wisdom is that smaller atomic regions do not suffer a significant number of squashes, we observe this is not the case for many multithreaded workloads. With region sizes of just 200--500 instructions, we observe a performance degradation ranging from 10% to more than 50% for workloads with a mixture of shared reads and writes. By harnessing the unique flexibility provided by the software subsystem of hybrid processor design, we present BlockChop, a framework for dynamically mitigating squashes on multicore hybrid processors. We present a range of squash handling mechanisms leveraging retrials, interpretation, and retranslation, and find that BlockChop is quite effective. Over the current response to exceptions and squashes in a hybrid design, we are able to improve the performance of benchmark and commercial workloads by 1.4x and 1.2x on average for large and small region sizes, respectively.
******
Chip multiprocessors enable continued performance scaling with increasingly many cores per chip. As the throughput of computation outpaces available memory bandwidth, however, the system bottleneck will shift to main memory. We present a memory system, the dynamic granularity memory system (DGMS), which avoids unnecessary data transfers, saves power, and improves system performance by dynamically changing between fine and coarse-grained memory accesses. DGMS predicts memory access granularities dynamically in hardware, and does not require software or OS support. The dynamic operation of DGMS gives it superior ease of implementation and power efficiency relative to prior multi-granularity memory systems, while maintaining comparable levels of system performance.
******
Complex data queries, because of their need for random accesses, have proven to be slow unless all the data can be accommodated in DRAM. There are many domains, such as genomics, geological data, and daily Twitter feeds, where the datasets of interest are 5 TB to 20 TB. For such a dataset, one would need a cluster of 100 servers, each with 128 GB to 256 GB of DRAM, to accommodate all the data in DRAM. On the other hand, such datasets could easily be stored in the flash memory of a rack-sized cluster. Flash storage has much better random access performance than hard disks, which makes it desirable for analytics workloads. In this paper we present BlueDBM, a new system architecture which has flash-based storage with in-store processing capability and a low-latency, high-throughput inter-controller network. We show that BlueDBM outperforms a flash-based system without these features by a factor of 10 for some important applications. While the performance of a RAM-cloud system falls sharply if even 5%~10% of the references go to secondary storage, this sharp performance degradation is not an issue in BlueDBM. BlueDBM presents an attractive point in the cost-performance trade-off for Big Data analytics.
******
As applications such as Apple Siri, Google Now, Microsoft Cortana, and Amazon Echo continue to gain traction, web-service companies are adopting large deep neural networks (DNN) for machine learning challenges such as image processing, speech recognition, natural language processing, among others. A number of open questions arise as to the design of a server platform specialized for DNN and how modern warehouse scale computers (WSCs) should be outfitted to provide DNN as a service for these applications.
In this paper, we present DjiNN, an open infrastructure for DNN as a service in WSCs, and Tonic Suite, a suite of 7 end-to-end applications that span image, speech, and language processing. We use DjiNN to design a high-throughput DNN system based on massive GPU server designs and provide insights into the varying characteristics across applications. After studying the throughput, bandwidth, and power properties of DjiNN and Tonic Suite, we investigate several design points for future WSC architectures, including the total cost of ownership implications of a WSC with a disaggregated GPU pool versus a WSC composed of homogeneous integrated GPU servers. We improve DNN throughput by over 120x for all but one application (40x for Facial Recognition) on an NVIDIA K40 GPU. On a GPU server composed of 8 NVIDIA K40s, we achieve near-linear scaling (around 1000x throughput improvement) for 3 of the 7 applications. Through our analysis, we also find that GPU-enabled WSCs improve total cost of ownership over CPU-only designs by 4-20x, depending on the composition of the workload.
******
Recent years have seen an explosion of data volumes from a myriad of distributed sources such as ubiquitous cameras and various sensors. The challenges of analyzing these geographically dispersed datasets are increasing due to the significant data movement overhead, time-consuming data aggregation, and escalating energy needs. Rather than constantly move a tremendous amount of raw data to remote warehouse-scale computing systems for processing, it would be beneficial to leverage in-situ server systems (InS) to pre-process data, i.e., bringing computation to where the data is located.
This paper takes the first step towards designing server clusters for data processing in the field. We investigate two representative in-situ computing applications, where data is normally generated in environmentally sensitive areas or remote places that lack established utility infrastructure. These very special operating environments of in-situ servers urge us to explore standalone (i.e., off-grid) systems that offer the opportunity to benefit from local, self-generated energy sources. In this work we implement a heavily instrumented proof-of-concept prototype called InSURE: in-situ server systems using renewable energy. We develop a novel energy buffering mechanism and a unique joint spatio-temporal power management strategy to coordinate standalone power supplies and in-situ servers. We present detailed deployment experiences to quantify how our design fits in-situ processing in the real world. Overall, InSURE yields 20%~60% improvements over a state-of-the-art baseline. It maintains impressive control effectiveness in under-provisioned environments and can scale economically along with data processing needs. The proposed design complements today's grid-connected cloud data centers well and provides competitive cost-effectiveness.
******
Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.
This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency.
CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps.
We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.
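As a concrete illustration of the kind of compression an assist warp might run, below is a minimal base+delta-style compressor in Python; the scheme, parameters, and names are simplifications for this sketch, not CABA's exact implementation.

    def compress_base_delta(words, delta_bits=8):
        """Store the first word as a base and each remaining word as a
        narrow signed delta, if all deltas fit in delta_bits."""
        base = words[0]
        lo, hi = -(1 << (delta_bits - 1)), (1 << (delta_bits - 1)) - 1
        deltas = [w - base for w in words]
        if all(lo <= d <= hi for d in deltas):
            return ("compressed", base, deltas)  # 1 word + n small deltas
        return ("raw", words, None)

    def decompress(blob):
        kind, a, b = blob
        return [a + d for d in b] if kind == "compressed" else list(a)

    # A cache line of pointers into the same region compresses well:
    line = [0x7f3a1000 + i * 16 for i in range(8)]
    blob = compress_base_delta(line, delta_bits=8)
    assert decompress(blob) == line
    print(blob[0])  # "compressed": less data to move over the memory bus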
******
In this paper, we address the problem of efficiently managing the relative power demands of a high-performance GPU and its memory subsystem. We develop a management approach that dynamically tunes the hardware operating configurations to maintain balance between the power dissipated in compute versus memory access across GPGPU application phases. Our goal is to reduce power with minimal performance degradation.
Accordingly, we construct predictors that assess the online sensitivity of applications to three hardware tunables---compute frequency, number of active compute units, and memory bandwidth. Using these sensitivity predictors, we propose a two-level coordinated power management scheme, Harmonia, which coordinates the hardware power states of the GPU and the memory system. Through hardware measurements on a commodity GPU, we evaluate Harmonia against a state-of-the-practice commodity GPU power management scheme, as well as an oracle scheme. Results show that Harmonia improves measured energy-delay squared (ED^2) by up to 36% (12% on average) with negligible performance loss across representative GPGPU workloads, and on average is within 3% of the oracle scheme.
******
Page-based virtual memory improves programmer productivity, security, and memory utilization, but incurs performance overheads due to costly page table walks after TLB misses. This overhead can reach 50% for modern workloads that access increasingly vast memory with stagnating TLB sizes.
To reduce the overhead of virtual memory, this paper proposes Redundant Memory Mappings (RMM), which leverages ranges of pages to provide an efficient, alternative representation of many virtual-to-physical mappings. We define a range to be a subset of a process's pages that are virtually and physically contiguous. RMM translates each range with a single range table entry, enabling a modest number of entries to translate most of the process's address space. RMM operates in parallel with standard paging and uses a software range table and hardware range TLB with arbitrarily large reach. We modify the operating system to automatically detect ranges and to increase their likelihood with eager page allocation. RMM is thus transparent to applications.
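A minimal software model of the lookup path, assuming illustrative range-table contents and addresses; a real range TLB is probed in parallel with the regular TLB, not sequentially as written here.

    # Hypothetical range entries: (virt_base, virt_limit, phys_base),
    # each covering a virtually and physically contiguous range of pages.
    RANGES = [(0x10000, 0x50000, 0x900000), (0x80000, 0x81000, 0xB00000)]
    PAGE_TABLE = {0x7F: 0xC3}   # fallback 4 KB mapping (VPN -> PFN)
    PAGE = 0x1000

    def translate(vaddr):
        """One range entry translates an arbitrarily large region."""
        for base, limit, pbase in RANGES:       # range TLB hit?
            if base <= vaddr < limit:
                return pbase + (vaddr - base)   # offset within the range
        vpn, off = vaddr // PAGE, vaddr % PAGE  # fall back to paging
        return PAGE_TABLE[vpn] * PAGE + off

    print(hex(translate(0x10010)))  # served by the first range entry
    print(hex(translate(0x7F123)))  # served by the regular page table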
We prototype RMM software in Linux and emulate the hardware. RMM performs substantially better than paging alone and huge pages, and improves a wider variety of workloads than direct segments (one range per program), reducing the overhead of virtual memory to less than 1% on average.
******
Many recent works propose mechanisms demonstrating the potential advantages of managing memory at a fine (e.g., cache line) granularity---e.g., fine-grained deduplication and fine-grained memory protection. Unfortunately, existing virtual memory systems track memory at a larger granularity (e.g., 4 KB pages), inhibiting efficient implementation of such techniques. Simply reducing the page size results in an unacceptable increase in page table overhead and TLB pressure.
We propose a new virtual memory framework that enables efficient implementation of a variety of fine-grained memory management techniques. In our framework, each virtual page can be mapped to a structure called a page overlay, in addition to a regular physical page. An overlay contains a subset of cache lines from the virtual page. Cache lines that are present in the overlay are accessed from there and all other cache lines are accessed from the regular physical page. Our page-overlay framework enables cache-line-granularity memory management without significantly altering the existing virtual memory framework or introducing high overheads.
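A small Python model of the overlay idea under simplifying assumptions (a software dict stands in for the hardware overlay mapping, and only reads and writes are modeled):

    class OverlayPage:
        """Sketch of a page overlay: the overlay holds only some cache
        lines of a virtual page; the rest come from the backing page."""
        def __init__(self, physical):
            self.physical = physical      # full page: list of 64 lines
            self.overlay = {}             # line index -> private copy

        def read(self, line):
            return self.overlay.get(line, self.physical[line])

        def write(self, line, value):     # e.g., overlay-on-write fork
            self.overlay[line] = value    # only the written line is copied

    shared = ["line%d" % i for i in range(64)]
    child = OverlayPage(shared)
    child.write(3, "child's data")
    print(child.read(3), child.read(4))
    # Unlike copy-on-write, writing one line did not copy the other 63.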
We show that our framework can enable simple and efficient implementations of seven memory management techniques, each of which has a wide variety of applications. We quantitatively evaluate the potential benefits of two of these techniques: overlay-on-write and sparse-data-structure computation. Our evaluations show that overlay-on-write, when applied to fork, can improve performance by 15% and reduce memory capacity requirements by 53% on average compared to traditional copy-on-write. For sparse data computation, our framework can outperform a state-of-the-art software-based sparse representation on a number of real-world sparse matrices. Our framework is general, powerful, and effective in enabling fine-grained memory management at low cost.
******
In recent years, neural network accelerators have been shown to achieve both high energy efficiency and high performance for a broad application scope within the important category of recognition and mining applications.
Still, both the energy efficiency and performance of such accelerators remain limited by memory accesses. In this paper, we focus on image applications, arguably the most important category among recognition and mining applications. The state-of-the-art neural networks for these applications are Convolutional Neural Networks (CNN), and they have an important property: weights are shared among many neurons, considerably reducing the neural network's memory footprint. This property allows a CNN to be mapped entirely within an SRAM, eliminating all DRAM accesses for weights. By further hoisting this accelerator next to the image sensor, it is possible to eliminate all remaining DRAM accesses, i.e., for inputs and outputs.
In this paper, we propose such a CNN accelerator, placed next to a CMOS or CCD sensor. The absence of DRAM accesses combined with a careful exploitation of the specific data access patterns within CNNs allows us to design an accelerator which is 60× more energy efficient than the previous state-of-the-art neural network accelerator. We present a full design down to the layout at 65 nm, with a modest footprint of 4.86 mm² and consuming only 320 mW, but still about 30× faster than high-end GPUs.
******
The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations.
In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.
******
This paper identifies a new opportunity for improving the efficiency of a processor core: memory access phases of programs. These are dynamic regions of programs where most of the instructions are devoted to memory access or address computation. Such phases occur naturally in programs because of workload properties; they also arise when employing an in-core accelerator, which induces phases where the code executing on the core is access code. We observe that such code requires an OOO core's dataflow and dynamism to run fast and does not execute well on an in-order processor. However, an OOO core consumes much power, effectively increasing energy consumption and reducing the energy efficiency of in-core accelerators.
We develop an execution model called memory access dataflow (MAD) that encodes dataflow computation, event-condition-action rules, and explicit actions. Using it we build a specialized engine that provides an OOO core's performance but at a fraction of the power. Such an engine can serve as a general way for any accelerator to execute its respective induced phase, thus providing a common interface and implementation for current and future accelerators. We have designed and implemented MAD in RTL, and we demonstrate its generality and flexibility by integration with four diverse accelerators (SSE, DySER, NPU, and C-Cores). Our quantitative results show that, relative to in-order, 2-wide OOO, and 4-wide OOO cores, MAD provides 2.4×, 1.4×, and equivalent performance, respectively, while consuming 0.8×, 0.6×, and 0.4× the energy.
******
In this paper we focus on common data reorganization operations such as shuffle, pack/unpack, swap, transpose, and layout transformations. Although these operations simply relocate data in memory, they are costly on conventional systems, mainly due to inefficient access patterns, limited data reuse, and round-trip data traversal throughout the memory hierarchy. This paper presents a two-pronged approach for efficient data reorganization, which combines (i) a proposed DRAM-aware reshape accelerator integrated within 3D-stacked DRAM, and (ii) a mathematical framework that is used to represent and optimize the reorganization operations.
We evaluate our proposed system through two major use cases. First, we demonstrate the reshape accelerator in performing a physical address remapping via data layout transform to utilize the internal parallelism/locality of the 3D-stacked DRAM structure more efficiently for general purpose workloads. Then, we focus on offloading and accelerating commonly used data reorganization routines selected from the Intel Math Kernel Library package. We evaluate the energy and performance benefits of our approach by comparing it against existing optimized implementations on state-of-the-art GPUs and CPUs. For the various test cases, in-memory data reorganization provides orders of magnitude performance and energy efficiency improvements via low overhead hardware.
******
Transactional Memory (TM) is a new programming paradigm for both simple concurrent programming and high concurrent performance. Hardware Transactional Memory (HTM) is hardware support for TM-based programming. It has lower overhead than software transactional memory (STM), which is a software-based implementation of TM. There are now four commercial systems, IBM Blue Gene/Q, IBM zEnterprise EC12, Intel Core, and IBM POWER8, offering HTM. Our work is the first to compare the performance of these four HTM systems. We measured the STAMP benchmarks, the most widely used TM benchmarks. We also evaluated the specific features of each HTM system. Our experimental results show that: (1) there is no single HTM system that is more scalable than the others in all of the benchmarks, (2) there are measurable performance differences among the HTM systems in some benchmarks, and (3) each HTM system has its own implementation characteristics that limit its scalability.
******
With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three year period, and comprising thousands of different applications.
We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This "datacenter tax" can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.
******
Developers and architects spend a lot of time trying to understand and eliminate performance problems. Unfortunately, the root causes of many problems occur at a fine granularity that existing continuous profiling and direct measurement approaches cannot observe. This paper presents the design and implementation of Shim, a continuous profiler that samples at resolutions as fine as 15 cycles; three to five orders of magnitude finer than current continuous profilers. Shim's fine-grain measurements reveal new behaviors, such as variations in instructions per cycle (IPC) within the execution of a single function. A Shim observer thread executes and samples autonomously on unutilized hardware. To sample, it reads hardware performance counters and memory locations that store software state. Shim improves its accuracy by automatically detecting and discarding samples affected by measurement skew. We measure Shim's observer effects and show how to analyze them. When on a separate core, Shim can continuously observe one software signal with a 2% overhead at a ~1200 cycle resolution. At an overhead of 61%, Shim samples one software signal on the same core with SMT at a ~15 cycle resolution. Modest hardware changes could significantly reduce overheads and add greater analytical capability to Shim. We vary prefetching and DVFS policies in case studies that show the diagnostic power of fine-grain IPC and memory bandwidth results. By repurposing existing hardware, we deliver a practical tool for fine-grain performance microscopy for developers and architects.
******
To aid application characterization and architecture design space exploration, researchers and engineers have developed a wide range of tools for CPUs, including simulators, profilers, and binary instrumentation tools. With the advent of GPU computing, GPU manufacturers have developed similar tools leveraging hardware profiling and debugging hooks. To date, these tools are largely limited by the fixed menu of options provided by the tool developer and do not offer the user the flexibility to observe or act on events not in the menu. This paper presents SASSI (NVIDIA assembly code "SASS" Instrumentor), a low-level assembly-language instrumentation tool for GPUs. Like CPU binary instrumentation tools, SASSI allows a user to specify instructions at which to inject user-provided instrumentation code. These facilities allow strategic placement of counters and code into GPU assembly code to collect user-directed, fine-grained statistics at hardware speeds. SASSI instrumentation is inherently parallel, leveraging the concurrency of the underlying hardware. In addition to the details of SASSI, this paper provides four case studies that show how SASSI can be used to characterize applications and explore the architecture design space along the dimensions of instruction control flow, memory systems, value similarity, and resilience.
******
Die stacking memory technology can enable gigascale DRAM caches that can operate at 4x-8x higher bandwidth than commodity DRAM. Such caches can improve system performance by servicing data at a faster rate when the requested data is found in the cache, potentially increasing the memory bandwidth of the system by 4x-8x. Unfortunately, a DRAM cache uses the available memory bandwidth not only for data transfer on cache hits, but also for other secondary operations such as cache miss detection, fill on cache miss, and writeback lookup and content update on dirty evictions from the last-level on-chip cache. Ideally, we want the bandwidth consumed for such secondary operations to be negligible, and have almost all the bandwidth be available for transfer of useful data from the DRAM cache to the processor.
We evaluate a 1GB DRAM cache, architected as an Alloy Cache, and show that even the most bandwidth-efficient DRAM cache proposal consumes 3.8x the bandwidth of an idealized DRAM cache that spends no bandwidth on secondary operations. We also show that redesigning the DRAM cache to minimize the bandwidth consumed by secondary operations can potentially improve system performance by 22%. To that end, this paper proposes the Bandwidth Efficient ARchitecture (BEAR) for DRAM caches. BEAR integrates three components, one each for reducing the bandwidth consumed by miss detection, miss fill, and writeback probes. BEAR reduces the bandwidth consumption of the DRAM cache by 32%, which reduces cache hit latency by 24% and increases overall system performance by 10%. BEAR, with negligible overhead, outperforms an idealized SRAM Tag-Store design that incurs an unacceptable overhead of 64 megabytes, as well as Sector Cache designs that incur an SRAM storage overhead of 6 megabytes.
******
This paper introduces a tagless cache architecture for large in-package DRAM caches. A conventional die-stacked DRAM cache has both a TLB and a cache tag array, responsible for virtual-to-physical and physical-to-cache address translation, respectively. We propose to align the granularity of caching with the OS page size and take a unified approach to address translation and cache tag management. To this end, we introduce the cache-map TLB (cTLB), which stores virtual-to-cache, instead of virtual-to-physical, address mappings. At a TLB miss, the TLB miss handler allocates the requested block into the cache if it is not cached yet, and updates both the page table and the cTLB with the virtual-to-cache address mapping. Assuming the availability of large in-package DRAM caches, this ensures that an access to the memory region within the TLB reach always hits in the cache with low hit latency, since a TLB access immediately returns the exact location of the requested block in the cache, saving a tag-checking operation. The remaining cache space is used as a victim cache for memory pages that are recently evicted from the cTLB. By completely eliminating the data structures for cache tag management, from either on-die SRAM or in-package DRAM, the proposed DRAM cache achieves the best scalability and hit latency, while maintaining the high hit rate of a fully associative cache. Our evaluation with 3D Through-Silicon Via (TSV)-based in-package DRAM demonstrates that the proposed cache improves IPC and energy efficiency by 30.9% and 39.5%, respectively, compared to a baseline with no DRAM cache. These numbers translate to 4.3% and 23.8% improvements over an impractical SRAM-tag cache requiring megabytes of on-die SRAM storage, thanks to the low hit latency and zero energy waste on cache tags.
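A toy software model of the cTLB flow, with eviction and the victim-cache path omitted; all names and sizes are illustrative.

    ctlb = {}                    # VPN -> cache slot
    cache = {}                   # cache slot -> cached page
    free_slots = [0, 1, 2, 3]    # unrealistically tiny DRAM cache

    def load_from_memory(vpn):
        return "page-%d" % vpn   # stand-in for an off-package DRAM fill

    def access(vpn, page_table):
        if vpn in ctlb:                      # TLB hit == cache hit,
            return cache[ctlb[vpn]]          # no tag check needed
        slot = free_slots.pop(0)             # TLB miss handler: allocate,
        cache[slot] = load_from_memory(vpn)  # fill the block, and record
        ctlb[vpn] = slot                     # the virtual-to-cache mapping
        page_table[vpn] = slot               # in both the cTLB and the
        return cache[slot]                   # page table

    pt = {}
    print(access(7, pt))   # cold miss: allocated and mapped
    print(access(7, pt))   # within TLB reach, accesses now always hit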
******
Several previous works have changed the DRAM bank structure to reduce memory access latency and have shown performance improvements. However, changes to the area-optimized DRAM bank can incur large area overhead. To solve this problem, we propose Multiple Clone Row DRAM (MCR-DRAM), which uses the existing DRAM bank structure without any modification.
Our key idea is the Multiple Clone Row (MCR), in which multiple rows are simultaneously turned on or off to form a single logical row. MCR provides two advantages that enable our low-latency mechanisms (Early-Access, Early-Precharge, and Fast-Refresh). First, MCR speeds up the sensing process by increasing the number of sensed cells, enabling a READ/WRITE command to an MCR to be issued earlier than is possible for a normal row (Early-Access). Second, DRAM cells in an MCR are effectively refreshed more frequently without additional REFRESH commands, reducing the amount of charge leaked over the refresh interval for any given cell. The reduced charge leakage enables a PRECHARGE command to be served before the activated cells are fully restored (Early-Precharge) and a REFRESH operation to be completed before the refreshed cells are fully restored (Fast-Refresh).
Even though MCR-DRAM sacrifices memory capacity for low latency, it can be dynamically reconfigured from low-latency to full-capacity DRAM. MCR-DRAM improves both performance and energy efficiency for both single-core and multi-core systems.
******
DRAM cells require periodic refreshing to preserve data. In JEDEC DDRx devices, a refresh operation is performed via an auto-refresh command, which refreshes multiple rows in multiple banks simultaneously. The internal implementation of auto-refresh is completely opaque outside the DRAM --- all the memory controller can do is to instruct the DRAM to refresh itself --- the DRAM handles all else, in particular determining which rows in which banks are to be refreshed. This is in conflict with a large body of research on reducing the refresh overhead, in which the memory controller needs fine-grained control over which regions of the memory are refreshed. For example, prior works exploit the fact that a subset of DRAM rows can be refreshed at a slower rate than other rows due to access rate or retention period variations. However, such row-granularity approaches cannot use the standard auto-refresh command, which refreshes an entire batch of rows at once and does not permit skipping of rows. Consequently, prior schemes are forced to use explicit sequences of activate (ACT) and precharge (PRE) operations to mimic row-level refreshing. The drawback is that, compared to using JEDEC's auto-refresh mechanism, using explicit ACT and PRE commands is inefficient, both in terms of performance and power.
In this paper, we show that even when skipping a high percentage of refresh operations, existing row-granularity refresh techniques are mostly ineffective due to the inherent efficiency disparity between ACT/PRE sequences and the JEDEC auto-refresh mechanism. We propose a modification to the DRAM that extends its existing control-register access protocol to expose the DRAM's internal refresh counter. We also introduce a new "dummy refresh" command that skips refresh operations and simply increments the internal counter. We show that these modifications allow a memory controller to skip as many refreshes as prior work, while achieving significant energy and performance advantages by using auto-refresh most of the time.
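A sketch of how a controller might interleave the two commands, assuming a hypothetical set of retention-strong row batches and a half-rate skipping policy; the encoding is illustrative, not the JEDEC command format.

    # Batches whose rows can tolerate a slower refresh rate (hypothetical).
    RETENTION_OK = {2, 5}

    def refresh_round(num_batches, round_no):
        commands = []
        for batch in range(num_batches):
            if batch in RETENTION_OK and round_no % 2 == 1:
                commands.append(("DUMMY_REFRESH", batch))  # counter++ only
            else:
                commands.append(("AUTO_REFRESH", batch))   # real refresh
        return commands

    for cmd in refresh_round(num_batches=8, round_no=1):
        print(cmd)
    # Batches 2 and 5 skip every other refresh while the controller's
    # and the DRAM's refresh counters stay in sync.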
******
To maximize performance, out-of-order execution processors sometimes issue instructions without having the guarantee that operands will be available in time; e.g. loads are typically assumed to hit in the L1 cache and dependent instructions are issued accordingly. This form of speculation -- that we refer to as speculative scheduling -- has been used for two decades in real processors, but has received little attention from the research community.
In particular, as pipeline depth grows, and the distance between the Issue and the Execute stages increases, it becomes critical to issue instructions dependent on variable-latency instructions as soon as possible rather than wait for the actual cycle at which the result becomes available. Unfortunately, due to the uncertain nature of speculative scheduling, the scheduler may wrongly issue an instruction that will not have its source(s) available on the bypass network when it reaches the Execute stage. In that event, the instruction is canceled and replayed, potentially impairing performance and increasing energy consumption.
In this work, we do not present a new replay mechanism. Rather, we focus on ways to reduce the number of replays that are agnostic of the replay scheme. First, we propose an easily implementable, low-cost solution to reduce the number of replays caused by L1 bank conflicts: schedule shifting always assumes that, given a dual-load issue capacity, the second load issued in a given cycle will be delayed by a bank conflict, so its dependents are always issued with the corresponding delay. Second, we improve on existing L1 hit/miss prediction schemes by taking instruction criticality into account. That is, for some criterion of criticality and for loads whose hit/miss behavior is hard to predict, we show that it is more cost-effective to stall dependents if the load is not predicted critical.
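A toy model of the schedule-shifting rule for the first mechanism, with illustrative latencies and names:

    L1_LATENCY = 4  # assumed load-to-use latency in cycles

    def wakeup_cycle(issue_cycle, load_slot_in_cycle, shifted=True):
        """Cycle at which a load's dependents are scheduled to issue:
        the second load of a cycle is pessimistically assumed to lose
        one cycle to a bank conflict."""
        penalty = 1 if (shifted and load_slot_in_cycle == 1) else 0
        return issue_cycle + L1_LATENCY + penalty

    # Two loads issued in cycle 10:
    print(wakeup_cycle(10, 0))  # 14: dependents of the first load
    print(wakeup_cycle(10, 1))  # 15: second load's dependents shifted,
    # avoiding a replay whenever the bank conflict actually occurs.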
In total, in our experiments assuming a 4-cycle issue-to-execute delay, we found that the majority of instruction replays caused by L1 data cache bank conflicts -- 78.0% -- and L1 hit mispredictions -- 96.5% -- can be avoided, leading to a 3.4% performance gain and a 13.4% decrease in the number of issued instructions over a baseline pipeline with speculative scheduling.
******
LaZy Superscalar is a processor architecture which delays the execution of fetched instructions until their results are needed by other instructions. This approach eliminates dead instructions and provides the necessary means to fuse dependent instructions across multiple control dependencies by explicitly tracking control and data dependencies through a matrix-based scheduler. We present this novel redesign of the scheduling, recovery, and commit mechanisms and evaluate the performance of the proposed architecture. Our simulations using the SPEC 2006 benchmark suite indicate that LaZy Superscalar can achieve significant speedups while providing respectable power savings compared to a conventional superscalar processor.
******
Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side-effect. Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. This increases the importance of extracting memory hierarchy parallelism (MHP), while reducing the net impact of more general, yet complex and power-hungry ILP-extraction techniques. In addition, for multi-core processors operating in power- and energy-constrained environments, energy-efficiency has largely replaced single-thread performance as the primary concern.
Based on this observation, we propose a core microarchitecture that is aimed squarely at generating parallel accesses to the memory hierarchy while maximizing energy efficiency. The Load Slice Core extends the efficient in-order, stall-on-use core with a second in-order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in the main pipeline. Backward program slices containing address-generating instructions leading up to loads and stores are extracted automatically by the hardware, using a novel iterative algorithm that requires no software support or recompilation. On average, the Load Slice Core improves performance over a baseline in-order processor by 53% with overheads of only 15% in area and 22% in power, leading to an increase in energy efficiency (MIPS/Watt) over in-order and out-of-order designs of 43% and over 4.7×, respectively. In addition, for a power- and area-constrained many-core design, the Load Slice Core outperforms both in-order and out-of-order designs, achieving 53% and 95% higher performance, respectively, thus providing an alternative direction for future many-core processors.
******
Most modern memory prefetchers rely on spatio-temporal locality to predict the memory addresses likely to be accessed by a program in the near future. Emerging workloads, however, make increasing use of irregular data structures, and thus exhibit a lower degree of spatial locality. This makes them less amenable to spatio-temporal prefetchers.
In this paper, we introduce the concept of Semantic Locality, which uses inherent program semantics to characterize access relations. We show how, in principle, semantic locality can capture the relationship between data elements in a manner agnostic to the actual data layout, and we argue that semantic locality transcends spatio-temporal concerns.
We further introduce the context-based memory prefetcher, which approximates semantic locality using reinforcement learning. The prefetcher identifies access patterns by applying reinforcement learning methods over machine and code attributes that provide hints about memory access semantics.
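As a rough illustration of the learning loop (a simple contextual bandit with made-up contexts, deltas, and rewards, not the paper's exact formulation):

    import random
    from collections import defaultdict

    # For each context (e.g., PC and last stride), learn which address
    # delta to prefetch from the reward of past prefetches.
    q = defaultdict(lambda: defaultdict(float))
    EPS, ALPHA, DELTAS = 0.1, 0.5, (1, 2, 4, 8)

    def choose_delta(context):
        if random.random() < EPS:                        # explore
            return random.choice(DELTAS)
        return max(DELTAS, key=lambda d: q[context][d])  # exploit

    def update(context, delta, was_useful):
        reward = 1.0 if was_useful else -0.2
        q[context][delta] += ALPHA * (reward - q[context][delta])

    ctx = ("pc=0x40123", "stride=2")
    for _ in range(200):             # pretend delta 2 is always correct
        d = choose_delta(ctx)
        update(ctx, d, was_useful=(d == 2))
    print(max(DELTAS, key=lambda d: q[ctx][d]))  # learned: 2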
We test our prefetcher on a variety of benchmarks that employ both regular and irregular patterns. For the SPEC 2006 suite, it delivers speedups as high as 2.8X (20% on average) over a baseline with no prefetching, and outperforms leading spatio-temporal prefetchers. Finally, we show that the context-based prefetcher makes it possible for naive, pointer-based implementations of irregular algorithms to achieve performance comparable to that of spatially optimized code.
******
General purpose processors (GPPs), from small in-order designs to many-issue out-of-order ones, incur large power overheads which must be addressed for future technology generations. Major sources of overhead include structures which dynamically extract the data-dependence graph or maintain precise state. For irregular workloads, current specialization approaches either heavily curtail performance or provide simply too little benefit. Interestingly, well-known explicit-dataflow architectures eliminate these overheads by directly executing the data-dependence graph and eschewing instruction-precise recoverability. However, even after decades of research, dataflow architectures have yet to come into prominence as a solution. We attribute this to a lack of effective control speculation and to the latency overhead of explicit communication, which is crippling for certain codes.
This paper makes the observation that if both out-of-order and explicit-dataflow execution were available in one processor, many types of GPP cores could benefit from dynamically switching between them during certain phases of an application's lifetime. Analysis reveals that an ideal explicit-dataflow engine could be profitable for more than half of instructions, providing significant performance and energy improvements. The challenge is to achieve these benefits without introducing excess hardware complexity. To this end, we propose the Specialization Engine for Explicit-Dataflow (SEED). Integrated with an in-order core, we see 1.67× performance and 1.65× energy benefits; with an out-of-order (OOO) dual-issue core, 1.33× and 1.70×; and with a quad-issue OOO core, 1.14× and 1.54×.
******
Microprocessor manufacturers typically keep old instruction sets in modern processors to ensure backward compatibility with legacy software. The introduction of newer extensions to the ISA increases the design complexity of microprocessor front-ends, exacerbates the consumption of precious on-chip resources (e.g., silicon area and energy), and demands more effort for hardware verification and debugging. We analyzed several x86 applications and operating systems deployed between 1995 and 2012 and observed that many instructions stop being used over time, and that more than 500 instructions were never used in these applications. We also investigate the impact of including these unused instructions in the design of the x86 decoders and propose SHRINK, a mechanism to remove old instructions without breaking backward compatibility with legacy code. SHRINK allows us to remove 40% of the instructions from the x86 ISA and improve the critical path, area, and power consumption of the instruction decoder by 23%, 48%, and 49%, respectively, on average.
******
While control speculation is highly effective for generating good schedules in out-of-order processors, it is less effective for in-order processors because compilers have trouble scheduling in the presence of unbiased branches, even when those branches are highly predictable. In this paper, we demonstrate a novel architectural branch decomposition that separates the prediction and deconvergence point of a branch from its resolution, which enables the compiler to profitably schedule across predictable, but unbiased branches. We show that the hardware support for this branch architecture is a trivial extension of existing systems and describe a simple code transformation for exploiting this architectural support. As architectural changes are required, this technique is most compelling for a dynamic binary translation-based system like Project Denver.
We evaluate the performance improvements enabled by this transformation for several in-order configurations across the SPEC 2006 benchmark suites. We show that our technique produces a Geomean speedup of 11% for SPEC 2006 Integer, with speedups as large as 35%. As floating point benchmarks contain fewer unbiased, but predictable branches, our Geomean speedup on SPEC 2006 FP is 7%, with a maximum speedup of 26%.
******
Processing-in-memory (PIM) is rapidly rising as a viable solution to the memory wall crisis, rebounding from its unsuccessful attempts in the 1990s, whose practicality concerns are alleviated by recent advances in 3D stacking technology. However, it is still challenging to integrate PIM architectures with existing systems in a seamless manner due to two common characteristics: unconventional programming models for in-memory computation units and the lack of ability to utilize large on-chip caches.
In this paper, we propose a new PIM architecture that (1) does not change the existing sequential programming models and (2) automatically decides whether to execute PIM operations in memory or processors depending on the locality of data. The key idea is to implement simple in-memory computation using compute-capable memory commands and use specialized instructions, which we call PIM-enabled instructions, to invoke in-memory computation. This allows PIM operations to be interoperable with existing programming models, cache coherence protocols, and virtual memory mechanisms with no modification. In addition, we introduce a simple hardware structure that monitors the locality of data accessed by a PIM-enabled instruction at runtime to adaptively execute the instruction at the host processor (instead of in memory) when the instruction can benefit from large on-chip caches. Consequently, our architecture provides the illusion that PIM operations are executed as if they were host processor instructions.
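A sketch of the locality-based dispatch decision, with an invented hotness heuristic standing in for the paper's locality monitor; all names are illustrative.

    from collections import Counter

    access_counts = Counter()
    HOT_THRESHOLD = 4  # hypothetical: "hot" after this many touches

    def pim_add(addr, value, memory):
        """One PIM-enabled add: same semantics wherever it executes."""
        access_counts[addr] += 1
        if access_counts[addr] >= HOT_THRESHOLD:
            where = "host"     # data likely in on-chip cache: run locally
        else:
            where = "memory"   # cold data: ship the op to the memory stack
        memory[addr] = memory.get(addr, 0) + value
        return where

    mem = {}
    for i in range(6):
        print(pim_add(0x100, 1, mem))  # flips from "memory" to "host"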
We provide a case study of how ten emerging data-intensive workloads can benefit from our new PIM abstraction and its hardware implementation. Evaluations show that our architecture significantly improves system performance and, more importantly, combines the best parts of conventional and PIM architectures by adapting to data locality of applications.
******
Wire energy has become the major contributor to energy in large lower-level caches. While wire energy is related to wire latency, its costs are exposed differently in the memory hierarchy. We propose the Sub-Level Insertion Policy (SLIP), a cache management policy which improves cache energy consumption by increasing the number of accesses served from energy-efficient locations while simultaneously decreasing intra-level data movement. In SLIP, each cache level is partitioned into several cache sublevels of differing sizes. The recent reuse-distance distribution of a line is then used to choose an energy-optimized insertion and movement policy for the line. The policy choice is made by a hardware unit that predicts the number of accesses and inter-level movements.
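A minimal sketch of a reuse-distance-driven insertion choice, with made-up sublevel sizes, energies, and decision rule; it picks the smallest (cheapest-to-access) sublevel that would still capture most of a line's reuses.

    SUBLEVELS = [(64, 1.0), (256, 1.6), (1024, 2.8)]  # (capacity, energy)

    def pick_sublevel(reuse_distances, coverage=0.9):
        hits_needed = coverage * len(reuse_distances)
        for capacity, energy in SUBLEVELS:
            hits = sum(1 for d in reuse_distances if d <= capacity)
            if hits >= hits_needed:
                return capacity, energy
        return SUBLEVELS[-1]       # streaming line: largest sublevel

    print(pick_sublevel([3, 10, 40, 50]))      # tight reuse -> small sublevel
    print(pick_sublevel([500, 700, 900, 80]))  # wide reuse -> large sublevel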
Using a full-system simulation including OS interactions and hardware overheads, we show that SLIP saves 35% energy at the L2 and 22% energy at the L3 level, and performs 0.75% better than a regular cache hierarchy in a single-core system. When configured to include a bypassing policy, SLIP reduces traffic to DRAM by 2.2%. This is achieved at the cost of storing 12b of metadata per cache line (2.3% overhead), a 6b policy in the PTE, and 32b of distribution metadata for each page in DRAM (an overhead of 0.1%). Using SLIP in a multiprogrammed system saves 47% LLC energy and reduces traffic to DRAM by 5.5%.
******
Cloud customers need guarantees regarding the security of their virtual machines (VMs) operating within an Infrastructure as a Service (IaaS) cloud system. This is complicated by the customer not knowing where their VM is executing, and by the semantic gap between what the customer wants to know and what can be measured in the cloud. We present an architecture for monitoring a VM's security health, with the ability to attest this to the customer in an unforgeable manner. We show a concrete implementation of property-based attestation and a full prototype based on the OpenStack open source cloud software.
******
Modern computers are built with increasingly complex software stacks crossing multiple layers (i.e., worlds), where cross-world calls have become a necessity for various important purposes such as security, reliability, and reduced complexity. Unfortunately, current cross-world call support is limited (e.g., syscall, vmcall), so other calls must be emulated by detouring multiple times through the privileged software layer (i.e., the OS kernel and hypervisor). This causes not only significant performance degradation but also unnecessary implementation complexity.
This paper argues that it is time to rethink the design of traditional cross-world call mechanisms by reviewing existing systems built upon hypervisors. Following the design philosophy of separating authentication from authorization, this paper advocates decoupling the authorization of whether a world call is permitted (by software) from the unforgeable identification of calling peers (by hardware). This results in a flexible cross-world call scheme (namely CrossOver) that allows secure, efficient and flexible cross-world calls across multiple layers, not only within the same address space but also across multiple address spaces. We demonstrate that CrossOver can be approximated using an existing hardware mechanism (namely VMFUNC), and that a trivial modification of the VMFUNC mechanism can provide full support for CrossOver. To show its usefulness, we have conducted case studies using several recent systems such as Proxos, Hyper-Shell, Tahoma and ShadowContext. Performance measurements using full-system emulation and a real processor with VMFUNC show that CrossOver significantly boosts the performance of the mentioned systems.
******
Architectural heterogeneity is increasing: numerous products and studies have proven the benefits of combining cores and accelerators with varying ISAs into a single system. However, an underappreciated barrier to unlocking the full potential of heterogeneity is the need to specify and to reconcile differences in memory consistency models across layers of the hardware-software stack and among on-chip components.
This paper presents ArMOR, a framework for specifying, comparing, and translating between memory consistency models. ArMOR defines MOSTs, an architecture-independent and precise format for specifying the semantics of memory ordering requirements such as preserved program order or explicit fences. MOSTs allow any two consistency models to be directly and algorithmically compared, and they help avoid many of the pitfalls of traditional consistency model analysis. As a case study, we use ArMOR to automatically generate translation modules called shims that dynamically translate code compiled for one memory model to execute on hardware implementing a different model.
******
Data races make parallel programs hard to understand. Precise race detection that stops an execution on the first occurrence of a race addresses this problem, but it comes with significant overhead. In this work, we exploit the insight that precisely detecting only write-after-write (WAW) and read-after-write (RAW) races suffices to provide cleaner semantics for racy programs. We demonstrate that stopping an execution only when these races occur ensures that synchronization-free regions appear to be executed in isolation and that their writes appear atomic. Additionally, the undetected racy executions can be given certain deterministic guarantees with efficient mechanisms.
We present Clean, a system that precisely detects WAW and RAW races and deterministically orders synchronization. We demonstrate that the combination of these two relatively inexpensive mechanisms provides cleaner semantics for racy programs. We evaluate both software-only and hardware-supported Clean. The software-only Clean runs all Pthread benchmarks from the SPLASH-2 and PARSEC suites with an average 7.8x slowdown. The overhead of precise WAW and RAW detection (5.8x) constitutes the majority of this slowdown. Simple hardware extensions reduce the slowdown of Clean's race detection to on average 10.4% and never more than 46.7%.
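A toy shadow-memory model of the detection rule is sketched below: only WAW and RAW conflicts are checked, and execution stops at the first one, matching Clean's stop-on-first-race semantics. The per-word metadata layout and the epoch comparison (a crude stand-in for a real happens-before check) are illustrative assumptions.

    #include <cstdio>
    #include <cstdlib>

    // Per-word shadow metadata: who last wrote, and at which epoch. A thread's
    // epoch advances at each synchronization operation in this toy model.
    struct ShadowWord {
        int  last_writer = -1;
        long write_epoch = -1;
    };

    struct Thread { int id; long epoch; };

    void on_write(ShadowWord& s, const Thread& t) {
        // WAW race: another thread wrote, and no synchronization ordered us
        // after that write (crude epoch test standing in for happens-before).
        if (s.last_writer != -1 && s.last_writer != t.id && s.write_epoch >= t.epoch) {
            std::fprintf(stderr, "WAW race detected by thread %d\n", t.id);
            std::abort();  // stop on first race, preserving clean semantics
        }
        s.last_writer = t.id;
        s.write_epoch = t.epoch;
    }

    void on_read(const ShadowWord& s, const Thread& t) {
        // RAW race: reading a value written concurrently by another thread.
        if (s.last_writer != -1 && s.last_writer != t.id && s.write_epoch >= t.epoch) {
            std::fprintf(stderr, "RAW race detected by thread %d\n", t.id);
            std::abort();
        }
    }

Note that read-after-read and write-after-read conflicts are deliberately not checked, which is exactly where the scheme saves work relative to full race detection.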
******
While numerous hardware synchronization mechanisms have been proposed, they either stop functioning or suffer severe performance loss when their hardware resources are exceeded, or they add significant complexity and cost to handle such resource overflows. Additionally, prior hardware synchronization proposals focus on one type of synchronization (barrier or lock), so several mechanisms are likely to be needed to support real applications, many of which use locks, barriers, and/or condition variables.
This paper proposes MiSAR, a minimalistic synchronization accelerator (MSA) that supports all three commonly used types of synchronization (locks, barriers, and condition variables), and a novel overflow management unit (OMU) that dynamically manages its (very) limited hardware synchronization resources. The OMU allows safe and efficient dynamic transitions between using hardware (MSA) and software synchronization implementations. This allows the MSA's resources to be used only for currently-active synchronization operations, providing significant performance benefits even when the number of synchronization variables used in the program is much larger than the MSA's resources. Because it allows a safe transition between hardware and software synchronization, the OMU also facilitates thread suspend/resume, migration, and other thread-management activities. Finally, the MSA/OMU combination decouples the instruction set support (how the program invokes hardware-supported synchronization) from the actual implementation of the accelerator, allowing different accelerators (or even wholesale removal of the accelerator) in the future without changes to OMU-compatible application or system code. We show that, even with only 2 MSA entries in each tile, the MSA/OMU combination on average performs within 3% of ideal (zero-latency) synchronization, and achieves a speedup of 1.43X over the software (pthreads) implementation.
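The following sketch illustrates only the hardware-to-software handoff idea in C++; the class names, the entry-allocation model, and the always-overflowing fast path are assumptions for illustration, not the MSA/OMU interface.

    #include <mutex>
    #include <unordered_map>

    // Toy model of the overflow-management idea: try a bounded hardware
    // resource first, and fall back to a software lock when it is exhausted.
    class HybridSync {
    public:
        void acquire(void* var) {
            if (try_hw_acquire(var)) return;   // fast path: MSA entry available
            software_mutex_for(var).lock();    // overflow: software fallback
        }
        void release(void* var) {
            if (try_hw_release(var)) return;
            software_mutex_for(var).unlock();
        }
    private:
        static constexpr int kMsaEntries = 2;  // the paper's tiny per-tile budget
        // Model of limited hardware entries; always "full" here so the sketch
        // exercises the software fallback path.
        bool try_hw_acquire(void*) { return false; }
        bool try_hw_release(void*) { return false; }

        std::mutex& software_mutex_for(void* var) {
            std::lock_guard<std::mutex> g(table_mutex_);  // protect the map itself
            return sw_locks_[var];
        }
        std::mutex table_mutex_;
        std::unordered_map<void*, std::mutex> sw_locks_;
    };

The key property the OMU provides, and that this sketch only gestures at, is that the transition between the two paths is safe at any time, which is what enables thread suspend/resume and migration.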
******
Cache coherence protocols based on self-invalidation allow a simpler design compared to traditional invalidation-based protocols, by relying on data-race-free (DRF) semantics and applying self-invalidation on racy synchronization points exposed to the hardware. Their simplicity lies in the absence of invalidation traffic, which eliminates the need to track readers in a directory, and reduces the number of transient protocol states. With the addition of self-downgrade these protocols can become effectively directory-free. While this works well for race-free data, unfortunately, lack of explicit invalidations compromises the effectiveness of any synchronization that relies on races. This includes any form of spin waiting, which is employed for signaling, locking, and barrier primitives.
In this work we propose a new solution for spin-waiting in these protocols, the callback mechanism, that is simpler and more efficient than explicit invalidation. Callbacks are set by reads involved in spin waiting, and are satisfied by writes (that can even precede these reads). To implement callbacks we use a small (just a few entries) directory-cache structure that is intended to service only these "spin-waiting" races. This directory structure is self-contained and is not backed up in any way. Entries are created on demand and can be evicted without the need to preserve their information. Our evaluation shows a significant improvement both over explicit invalidation and over exponential back-off, the state-of-the-art mechanism for self-invalidation protocols to avoid spinning in the shared cache.
******
Datacenters, or warehouse-scale computers, are rapidly increasing in size and power consumption. However, this growth comes at the cost of an increasing thermal load that must be removed to prevent overheating and server failure. In this paper, we propose to use phase changing materials (PCM) to shape the thermal load of a datacenter, absorbing and releasing heat when it is advantageous to do so. We present and validate a methodology to study the impact of PCM on a datacenter, and evaluate two important opportunities for cost savings. We find that in a datacenter with full cooling system subscription, PCM can reduce the necessary cooling system size by up to 12% without impacting peak throughput, or increase the number of servers by up to 14.6% without increasing the cooling load. In a thermally constrained setting, PCM can increase peak throughput up to 69% while delaying the onset of thermal limits by over 3 hours.
******
User-facing, latency-sensitive services, such as websearch, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy-efficiency of large-scale datacenters. With technology scaling slowing down, it becomes important to address this opportunity.
We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
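The core policy can be pictured as a simple feedback loop, sketched below with hypothetical monitoring and actuation hooks (measure_tail_latency_ms, set_best_effort_cores) and invented thresholds; the real controller coordinates several isolation mechanisms (cache, memory, network, power), not just core allocation.

    #include <chrono>
    #include <thread>

    // Stub hooks: replace with real telemetry and cgroup/cpuset actuation.
    double measure_tail_latency_ms() { return 50.0; }
    void   set_best_effort_cores(int) {}

    // Grow the best-effort allocation while SLO slack exists; retreat fast
    // when tail latency approaches the target. Thresholds are illustrative.
    void control_loop(double slo_ms, int max_cores) {
        int be_cores = 0;  // cores currently granted to best-effort tasks
        for (int step = 0; step < 1000; ++step) {
            double lat = measure_tail_latency_ms();
            if (lat > 0.95 * slo_ms) {
                be_cores = 0;                    // near the SLO: retreat fast
            } else if (lat < 0.80 * slo_ms && be_cores < max_cores) {
                ++be_cores;                      // slack available: grow slowly
            }
            set_best_effort_cores(be_cores);
            std::this_thread::sleep_for(std::chrono::seconds(2));
        }
    }

The asymmetry (retreat fast, grow slowly) is the essential design choice: latency violations are expensive, while forgone best-effort throughput is cheap.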
******
Today, an increasing number of applications and services are hosted by large-scale data centers. Massive and irregular load surges challenge data center power infrastructures. As a result, power mismatch between supply and demand has emerged as a crucial issue in modern data centers, which are either under-provisioned or powered by intermittent power sources. Recent proposals have employed energy storage devices such as uninterruptible power supply (UPS) systems to address this issue. However, current approaches lack the ability to efficiently handle irregular and unpredictable power mismatches.
In this paper, we propose Hybrid Energy Buffering (HEB), the first heterogeneous and adaptive strategy that incorporates super-capacitors (SCs) into existing data centers to dynamically deal with power mismatches. Our techniques exploit diverse energy-absorbing characteristics and intelligent load assignment policies to provide efficiency- and scenario-aware power mismatch management. More attractively, our management schemes make costly energy storage devices more affordable and economical for datacenter-scale usage. We evaluate the HEB design with a real system prototype. Compared with a homogeneous battery energy buffering system, HEB improves energy efficiency by 39.7%, extends UPS lifetime by 4.7X, reduces system downtime by 41% and improves renewable energy utilization by 81.2%. Our TCO analysis shows that HEB yields a high ROI and is able to gain more than 1.9X peak-shaving benefit over an 8-year period. It allows datacenters to adapt to various power supply anomalies, thereby improving operational efficiency, resiliency and economy.
******
Distributed in-memory key-value stores (KVSs), such as memcached, have become a critical data serving layer in modern Internet-oriented datacenter infrastructure. Their performance and efficiency directly affect the QoS of web services and the efficiency of datacenters. Traditionally, these systems have had significant overheads from inefficient network processing, OS kernel involvement, and concurrency control. Two recent research thrusts have focused upon improving key-value performance. Hardware-centric research has started to explore specialized platforms including FPGAs for KVSs; results demonstrated an order of magnitude increase in throughput and energy efficiency over stock memcached. Software-centric research revisited the KVS application to address fundamental software bottlenecks and to exploit the full potential of modern commodity hardware; these efforts too showed orders of magnitude improvement over stock memcached.
We aim to architect high-performance and efficient KVS platforms, and start with a rigorous architectural characterization across system stacks over a collection of representative KVS implementations. Our detailed full-system characterization not only identifies the critical hardware/software ingredients for high-performance KVS systems, but also leads to guided optimizations atop a recent design to achieve a record-setting throughput of 120 million requests per second (MRPS) on a single commodity server. Our implementation delivers 9.2X the performance (RPS) and 2.8X the system energy efficiency (RPS/watt) of the best-published FPGA-based claims. We craft a set of design principles for future platform architectures, and via detailed simulations demonstrate the capability of achieving a billion RPS with a single server constructed following our principles.
******
This paper studies the effect of warp sizing and scheduling on performance and efficiency in GPUs. We propose Variable Warp Sizing (VWS) which improves the performance of divergent applications by using a small base warp size in the presence of control flow and memory divergence. When appropriate, our proposed technique groups sets of these smaller warps together by ganging their execution in the warp scheduler, improving performance and energy efficiency for regular applications. Warp ganging is necessary to prevent performance degradation on regular workloads due to memory convergence slip, which results from the inability of smaller warps to exploit the same intra-warp memory locality as larger warps. This paper explores the effect of warp sizing on control flow divergence, memory divergence, and locality. For an estimated 5% area cost, our ganged scheduling microarchitecture results in a simulated 35% performance improvement on divergent workloads by allowing smaller groups of threads to proceed independently, and eliminates the performance degradation due to memory convergence slip that is observed when convergent applications are executed with smaller warp sizes.
******
This paper presents Warped-Compression, a warp-level register compression scheme for reducing GPU power consumption. This work is motivated by the observation that the register values of threads within the same warp are similar; that is, the arithmetic difference between two successive thread registers is small. Removing this data redundancy through register compression reduces the effective register width, thereby enabling power reduction opportunities. GPU register files are huge, as they are necessary to keep concurrent execution contexts and to enable fast context switching. As a result, the register file consumes a large fraction of the total GPU chip power. GPU design trends show that the register file size will continue to increase to enable even more thread-level parallelism. To reduce register file data redundancy, Warped-Compression uses a low-cost, implementation-efficient base-delta-immediate (BDI) compression scheme that takes advantage of the banked register file organization used in GPUs. Since threads within a warp write values with strong similarity, BDI can quickly compress and decompress by selecting either a single register, or one of the register banks, as the primary base and then computing delta values of all the other registers or banks. Warped-Compression can be used to reduce both dynamic and leakage power. By compressing register values, each warp-level register access activates fewer register banks, which reduces dynamic power. When fewer banks are used to store the register content, leakage power can be reduced by power-gating the unused banks. Evaluation results show that register compression saves 25% of the total register file power consumption.
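The arithmetic at the heart of the scheme looks roughly like the following sketch, which compresses one warp-level register (32 lanes) as a 32-bit base plus small per-lane deltas; the 8-bit delta width is an illustrative choice, not the paper's exact encoding.

    #include <array>
    #include <cstdint>
    #include <optional>

    // One warp register compressed as base + per-lane deltas (BDI-style).
    struct CompressedReg {
        uint32_t base;
        std::array<int8_t, 32> delta;  // 32 lanes x 8b instead of 32 x 32b
    };

    // Returns the compressed form, or nullopt if any lane's delta does not
    // fit, in which case the register stays at full width.
    std::optional<CompressedReg> bdi_compress(const std::array<uint32_t, 32>& lanes) {
        CompressedReg out{lanes[0], {}};
        for (int i = 0; i < 32; ++i) {
            int64_t d = int64_t(lanes[i]) - int64_t(out.base);
            if (d < INT8_MIN || d > INT8_MAX) return std::nullopt;
            out.delta[i] = static_cast<int8_t>(d);
        }
        return out;  // 36 bytes instead of 128: fewer banks need to be active
    }

    // Decompression is a single add per lane.
    uint32_t bdi_read_lane(const CompressedReg& r, int lane) {
        return r.base + static_cast<uint32_t>(int32_t(r.delta[lane]));
    }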
******
The ubiquity of graphics processing unit (GPU) architectures has made them efficient alternatives to chip-multiprocessors for parallel workloads. GPUs achieve superior performance by making use of massive multi-threading and fast context-switching to hide pipeline stalls and memory access latency. However, recent characterization results have shown that general purpose GPU (GPGPU) applications commonly encounter long stall latencies that cannot be easily hidden with the large number of concurrent threads/warps. This results in varying execution time disparity between different parallel warps, hurting the overall performance of GPUs -- the warp criticality problem.
To tackle the warp criticality problem, we propose a coordinated solution, criticality-aware warp acceleration (CAWA), that efficiently manages compute and memory resources to accelerate the critical warp execution. Specifically, we design (1) an instruction-based and stall-based criticality predictor to identify the critical warp in a thread-block, (2) a criticality-aware warp scheduler that preferentially allocates more time resources to the critical warp, and (3) a criticality-aware cache reuse predictor that assists critical warp acceleration by retaining latency-critical and useful cache blocks in the L1 data cache. CAWA aims to remove the significant execution time disparity in order to improve resource utilization for GPGPU workloads. Our evaluation results show that, under the proposed coordinated scheduler and cache prioritization management scheme, the performance of GPGPU workloads can be improved by 23%, while the state-of-the-art GTO and 2-level schedulers improve performance by 16% and -2%, respectively.
******
GPUs have been proven effective for structured applications that map well to the rigid 1D-3D grid of threads in modern bulk synchronous parallel (BSP) programming languages. However, less success has been encountered in mapping data-intensive irregular applications such as graph analytics, relational databases, and machine learning. The recently introduced nested device-side kernel launching functionality in the GPU is a step in the right direction, but it still falls short of effectively harnessing the GPU's performance potential.
We propose a new mechanism called Dynamic Thread Block Launch (DTBL) to extend the bulk synchronous parallel model underlying the current GPU execution model by supporting dynamic spawning of lightweight thread blocks. This mechanism supports the nested launching of thread blocks rather than kernels to execute dynamically occurring parallel work elements. This paper describes the execution model of DTBL, device-runtime support, and microarchitecture extensions to track and execute dynamically spawned thread blocks. Experiments with a set of irregular data-intensive CUDA applications executing on a cycle-level simulator show that DTBL achieves an average 1.21x speedup over the original flat implementation and an average 1.40x speedup over the implementation with device-side kernel launches using CUDA Dynamic Parallelism.
******
Spatial architectures are more efficient than traditional Out-of-Order (OOO) processors for computationally intensive programs. However, spatial architectures require mapping a program, either statically or dynamically, onto the spatial fabric. Static methods can generate efficient mappings, but they cannot adapt to changing workloads and are not compatible across hardware generations. Current dynamic methods are adaptive and compatible, but do not optimize as well due to their limited use of speculation and small mapping scopes. To overcome the limitations of existing dynamic mapping methods for spatial architectures, while minimizing the inefficiencies inherent in OOO superscalar processors, this paper presents DynaSpAM (Dynamic Spatial Architecture Mapping), a framework that tightly couples a spatial fabric with an OOO pipeline. DynaSpAM coaxes the OOO processor into producing an optimized mapping with a simple modification to the processor's scheduler. The insight behind DynaSpAM is that today's powerful OOO processors do for themselves most of the work necessary to produce a highly optimized mapping for a spatial architecture, including aggressively speculating control and memory dependences, and scheduling instructions using a large window. Evaluation of DynaSpAM shows a geomean speedup of 1.42x for 11 benchmarks from the Rodinia benchmark suite with a geomean 23.9% reduction in energy consumption compared to an 8-issue OOO pipeline.
******
Approximate computing can be employed for an emerging class of applications from various domains such as multimedia, machine learning and computer vision. The approximated output of such applications, even though not 100% numerically correct, is often either useful or the difference is unnoticeable to the end user. This opens up a new design dimension to trade off application performance and energy consumption with output correctness. However, a largely unaddressed challenge is quality control: how to ensure the user experience meets a prescribed level of quality. Current approaches either do not monitor output quality or use sampling approaches to check a small subset of the output assuming that it is representative. While these approaches have been shown to produce average errors that are acceptable, they often miss large errors without any means to take corrective actions. To overcome this challenge, we propose Rumba for online detection and correction of large approximation errors in an approximate accelerator-based computing environment. Rumba employs continuous lightweight checks in the accelerator to detect large approximation errors and then fixes these errors by exact re-computation on the host processor. Rumba employs computationally inexpensive output error prediction models for efficient detection. Computing patterns amenable for approximation (e.g., map and stencil) are usually data parallel in nature and Rumba exploits this property for selective correction. Overall, Rumba is able to achieve 2.1x reduction in output error for an unchecked approximation accelerator while maintaining the accelerator performance gains at the cost of reducing the energy savings from 3.2x to 2.2x for a set of applications from different approximate computing domains.
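The detect-then-correct flow can be sketched as below for a data-parallel map; the "predicted = input" check is only a placeholder for Rumba's inexpensive error-prediction models, and the function-pointer interface is purely illustrative.

    #include <cmath>
    #include <vector>

    // Run an approximate kernel over the data, use a cheap predictor to flag
    // suspiciously large errors, and recompute only the flagged elements
    // exactly on the host. Selective correction works because map-style
    // patterns are data parallel: each element can be redone independently.
    std::vector<float> checked_map(const std::vector<float>& in,
                                   float (*approx_fn)(float),
                                   float (*exact_fn)(float),
                                   float threshold) {
        std::vector<float> out(in.size());
        for (size_t i = 0; i < in.size(); ++i) {
            out[i] = approx_fn(in[i]);
            // Lightweight check: a cheap model of the expected output
            // (identity model here, as a stand-in for Rumba's predictors).
            float predicted = in[i];
            if (std::fabs(out[i] - predicted) > threshold)
                out[i] = exact_fn(in[i]);  // selective exact re-computation
        }
        return out;
    }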
******
Datacenter operators rely on low-cost, high-density technologies to maximize throughput for data-intensive services with tight tail latencies. In-memory rack-scale computing is emerging as a promising paradigm in scale-out datacenters, capitalizing on commodity SoCs, low-latency and high-bandwidth communication fabrics, and a remote memory access model to enable aggregation of a rack's memory for critical data-intensive applications such as graph processing or key-value stores. Low latency and high bandwidth dictate not only eliminating communication bottlenecks in the software protocols and off-chip fabrics, but also careful on-chip integration of network interfaces. The latter is a key challenge especially in architectures with RDMA-inspired one-sided operations that aim to achieve low latency and high bandwidth through on-chip Network Interface (NI) support. This paper proposes and evaluates network interface architectures for tiled manycore SoCs for in-memory rack-scale computing. Our results indicate that a careful splitting of NI functionality per chip tile and at the chip's edge along a NOC dimension enables a rack-scale architecture to optimize for both latency and bandwidth. Our best manycore NI architecture achieves latencies within 3% of an idealized hardware NUMA and efficiently uses the full bisection bandwidth of the NOC, without changing the on-chip coherence protocol or the core's microarchitecture.
******
Applications can map data on SSDs into virtual memory to transparently scale beyond DRAM capacity, permitting them to leverage high SSD capacities with few code changes. Obtaining good performance for memory-mapped SSD content, however, is hard because the virtual memory layer, the file system and the flash translation layer (FTL) perform address translations, sanity and permission checks independently from each other. We introduce FlashMap, an SSD interface that is optimized for memory-mapped SSD-files. FlashMap combines all the address translations into page tables that are used to index files and also to store the FTL-level mappings without altering the guarantees of the file system or the FTL. It uses the state in the OS memory manager and the page tables to perform sanity and permission checks respectively. By combining these layers, FlashMap reduces critical-path latency and improves DRAM caching efficiency. We find that this increases performance for applications by up to 3.32x compared to state-of-the-art SSD file-mapping mechanisms. Additionally, latency of SSD accesses reduces by up to 53.2%.
******
While all computation generates electromagnetic (EM) side-channel signals, some of the strongest and farthest-propagating signals are created when an existing strong periodic signal (e.g., a clock signal) becomes stronger or weaker (amplitude-modulated) depending on processor or memory activity. However, modern systems create emanations at thousands of different frequencies, so it is a difficult, error-prone, and time-consuming task to find those few emanations that are AM-modulated by processor/memory activity.
This paper presents a methodology for rapidly finding such activity-modulated signals. This method creates recognizable spectral patterns generated by specially designed microbenchmarks and then processes the recorded spectra to identify signals that exhibit amplitude-modulation behavior. We apply this method to several computer systems and find several such modulated signals. To illustrate how our methodology can benefit side-channel security research and practice, we also identify the physical mechanisms behind those signals, and find that the strongest signals are created by voltage regulators, memory refreshes, and DRAM clocks. Our results indicate that each signal may carry unique information about system activity, potentially enhancing an attacker's capability to extract sensitive information. We also confirm that our methodology correctly separates emanated signals that are affected by specific processor or memory activities from those that are not.
******
Approximate computing research seeks to trade-off the accuracy of computation for increases in performance or reductions in power consumption. The observation driving approximate computing is that many applications tolerate small amounts of error which allows for an opportunistic relaxation of guard bands (e.g., clock rate and voltage). Besides affecting performance and power, reducing guard bands exposes analog properties of traditionally digital components. For DRAM, one analog property exposed by approximation is the variability of memory cell decay times.
In this paper, we show how the differing cell decay times of approximate DRAM create an error pattern that serves as a system-identifying fingerprint. To validate this observation, we build an approximate memory platform and perform experiments that show that the fingerprint due to approximation is device dependent and resilient to changes in environment and level of approximation. To identify a DRAM chip given an approximate output, we develop a distance metric that yields a two-orders-of-magnitude difference in the distance between approximate results produced by the same DRAM chip and those produced by other DRAM chips. We use these results to create a mathematical model of approximate DRAM that we leverage to explore the end-to-end deanonymizing effects of approximate memory using a commodity system running an image manipulation program. The results from our experiment show that given less than 100 approximate outputs, the fingerprint for an approximate DRAM begins to converge to a single, machine-identifying fingerprint.
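The shape of such a comparison can be illustrated with a Jaccard-style distance over error bitmaps, sketched below; the paper develops its own metric, so this code is only an assumed analogue showing why outputs from the same chip cluster together while outputs from different chips stay far apart.

    #include <bitset>

    // Each fingerprint is a bitmap of which cells decayed (flipped) in an
    // approximate output. Same-chip outputs share many flipped positions,
    // so their distance is small; different chips share almost none.
    constexpr size_t kCells = 1 << 20;  // illustrative region size

    double error_pattern_distance(const std::bitset<kCells>& a,
                                  const std::bitset<kCells>& b) {
        size_t inter = (a & b).count();  // errors common to both outputs
        size_t uni   = (a | b).count();  // errors in either output
        if (uni == 0) return 0.0;        // no errors observed in either
        return 1.0 - double(inter) / double(uni);  // Jaccard distance in [0,1]
    }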
******
Oblivious RAM (ORAM) is an established technique to hide the access pattern to an untrusted storage system. With ORAM, a curious adversary cannot tell what address the user is accessing when observing the bits moving between the user and the storage system. All existing ORAM schemes achieve obliviousness by adding redundancy to the storage system, i.e., each access is turned into multiple random accesses. Such redundancy incurs a large performance overhead.
Although traditional data prefetching techniques successfully hide memory latency in DRAM-based systems, it turns out that they do not work well for ORAM because ORAM does not have enough memory bandwidth available for issuing prefetch requests. In this paper, we exploit ORAM locality by taking advantage of the ORAM internal structures. While it might seem that obliviousness and locality are two contradictory concepts, we challenge this intuition by exploiting data locality in ORAM without sacrificing security. In particular, we propose a dynamic ORAM prefetching technique called PrORAM (Dynamic Prefetcher for ORAM) and comprehensively explore its design space. PrORAM detects data locality in programs at runtime, and exploits the locality without leaking any information on the access pattern.
Our simulation results show that with PrORAM, the performance of ORAM can be significantly improved. PrORAM achieves an average performance gain of 20% over the baseline ORAM for memory-intensive benchmarks among Splash2 and 5.5% for SPEC06 workloads. The performance gain for YCSB and TPCC in DBMS benchmarks is 23.6% and 5% respectively. On average, PrORAM offers twice the performance gain of a static super block scheme.
******
As we show in this paper, I/O has become the limiting factor in scaling down size and power toward the goal of invisible computing. Achieving this goal will require composing optimized and specialized---yet reusable---components with an interconnect that permits tiny, ultra-low power systems. In contrast to today's interconnects which are limited by power-hungry pull-ups or high-overhead chip-select lines, our approach provides a superset of common bus features but at lower power, with fixed area and pin count, using fully synthesizable logic, and with surprisingly low protocol overhead.
We present MBus, a new 4-pin, 22.6 pJ/bit/chip chip-to-chip interconnect made of two "shoot-through" rings. MBus facilitates ultra-low power system operation by implementing automatic power-gating of each chip in the system, easing the integration of active, inactive, and activating circuits on a single die. In addition, we introduce a new bus primitive: power oblivious communication, which guarantees message reception regardless of the recipient's power state when a message is sent. This disentangles power management from communication, greatly simplifying the creation of viable, modular, and heterogeneous systems that operate on the order of nanowatts.
To evaluate the viability, power, performance, overhead, and scalability of our design, we build both hardware and software implementations of MBus and show its seamless operation across two FPGAs and twelve custom chips from three different semiconductor processes. A three-chip, 2.2mm3 MBus system draws 8nW of total system standby power and uses only 22.6 pJ/bit/chip for communication. This is the lowest power for any system bus with MBus's feature set.
******
Asynchronous or event-driven programming is now being used to develop a wide range of systems, including mobile and Web 2.0 applications, Internet-of-Things, and even distributed servers. We observe that these programs perform poorly on conventional processor architectures that are heavily optimized for the characteristics of synchronous programs. Execution characteristics of asynchronous programs significantly differ from synchronous programs as they interleave short events from varied tasks in a fine-grained manner.
This paper proposes the Event Sneak Peek (ESP) architecture to mitigate microarchitectural bottlenecks in asynchronous programs. ESP exploits the fact that events are posted to an event queue before they get executed. By exposing this event queue to the processor, ESP gains knowledge of the future events. Instead of stalling on long latency cache misses, ESP jumps ahead to pre-execute future events and gathers useful information that later help initiate accurate instruction and data prefetches and correct branch mispredictions. We demonstrate that ESP improves the performance of popular asynchronous Web 2.0 applications including Amazon, Google maps, and Facebook, by an average of 16%.
******
Energy-efficient user-interactive and display-oriented applications on handhelds rely heavily on multiple accelerators (termed IP cores) to meet their periodic frame processing needs. Further, these platforms are starting to host multiple applications concurrently on the multiple CPU cores. Unfortunately, today's hardware exposes an interface that forces the host software (Android drivers) to treat each IP core as an isolated device. Consequently, the host CPU has to get involved in (i) processing each frame, (ii) scheduling frames to ensure timely progress through the IP cores to meet their QoS needs, and (iii) explicitly moving data from one IP core to the next, with main memory serving as the common staging area.
We show in this paper, through measurements on a Nexus 7 platform, that the frequent invocation of the CPU for processing these frames and the involvement of main memory as a data flow conduit are serious limitations. Instead, we propose a novel IP virtualization framework (VIP), involving three key ideas that allow several IPs to be chained together and made to appear to the software as a single device. First, chaining of IPs avoids data transfer through the memory system, enhancing the throughput of flows through the IPs. Second, by using a burst mode, the CPU can initiate the processing of several frames through the virtual IP chain without getting involved (and interrupted) for each frame, thereby allowing better energy saving and utilization opportunities. Removing the CPU from this loop requires alternate orchestration of frame flows to ensure QoS guarantees for each frame of each application. Our third enhancement in VIP creates several virtual paths, one for each flow, through these IP chains, with the hardware scheduling the frames to enforce QoS guarantees despite any contention for resources along the way. Our experimental evaluations demonstrate the effectiveness of VIP on energy consumption and QoS for multiple applications.
******
Soft error susceptibility is a growing concern with continued CMOS scaling. Previous work explores full- and partial-redundancy schemes in hardware and software for soft-fault tolerance. However, full-redundancy schemes incur high performance and energy overheads whereas partial-redundancy schemes achieve low coverage. An initial study, called Perturbation Based Fault Screening (PBFS), explores exploiting value locality to provide hints of soft faults whenever a value falls outside its neighborhood. PBFS employs bit-mask filters to capture value neighborhoods. However, PBFS achieves low coverage; straightforwardly improving the coverage results in high false-positive rates and high performance and energy overheads. We propose FaultHound, a value-locality-based soft-fault tolerance scheme, which employs five mechanisms to address PBFS's limitations: (1) a scheme to cluster the filters via an inverted organization of the filter tables to reinforce learning and reduce the false-positive rates; (2) a learning scheme for ignoring the delinquent bit positions that raise repeated false alarms, to reduce further the false-positive rate; (3) a lightweight predecessor replay scheme instead of a full rollback to reduce the performance and energy penalty of the remaining false positives; (4) a simple scheme to distinguish rename faults, which require rollback instead of replay for recovery, from false positives to avoid unnecessary rollback penalty; and (5) a detection scheme, which avoids rollback, for the load-store queue which is not covered by our replay. Using simulations, we show that while PBFS achieves either low coverage (30%), or high false-positive rates (8%) with high performance overheads (97%), FaultHound achieves higher coverage (75%) and lower false-positive rates (3%) with lower performance and energy overheads (10% and 25%).
******
Protecting main memories from soft errors typically requires special dual-inline memory modules (DIMMs) which incorporate at least one extra chip per rank to store error-correcting codes (ECC). This increases the cost of the DIMM as well as its power consumption. To avoid these costs, some proposals have suggested protecting non-ECC DIMMs by allocating a portion of memory space to store ECC metadata. However, such proposals can significantly shrink the available memory space while degrading performance due to extra memory accesses. In this work, we propose a technique called COP which uses block-level compression to make room for ECC check bits in DRAM. Because a compressed block with check bits is the same size as an uncompressed block, no extra memory accesses are required and the memory space is not reduced. Unlike other approaches that require explicit compression-tracking metadata, COP employs a novel mechanism that relies on ECC to detect compressed data. Our results show that COP can reduce the DRAM soft error rate by 93% with no storage overhead and negligible impact on performance. We also propose a technique using COP to protect both compressible and incompressible data with minimal storage and performance overheads.
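The space test at the core of the idea can be sketched as follows; the base-plus-16-bit-delta compressor and the 8-byte check-bit budget are illustrative parameters for the sketch, not COP's actual encoding.

    #include <cstdint>
    #include <cstring>

    // A 64B block keeps its check bits in line only if it compresses enough
    // to leave room for them; otherwise it is stored uncompressed without
    // in-line ECC. Either way the block occupies exactly one 64B slot, so no
    // extra memory accesses or capacity loss are incurred.
    constexpr size_t kBlockBytes = 64;
    constexpr size_t kEccBytes   = 8;  // assumed budget for check bits

    bool fits_with_ecc(const uint8_t (&block)[kBlockBytes]) {
        uint64_t words[8];
        std::memcpy(words, block, sizeof words);
        for (int i = 1; i < 8; ++i) {
            // Base-delta test: every word must fit as base + 16-bit delta,
            // giving 8B base + 7x2B deltas, well under 64B - kEccBytes.
            int64_t d = int64_t(words[i]) - int64_t(words[0]);
            if (d < INT16_MIN || d > INT16_MAX)
                return false;  // incompressible: no room for in-line ECC
        }
        return true;           // compressed data plus check bits fit in 64B
    }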
******
Racetrack memory is an emerging non-volatile memory based on spintronic domain wall technology. It can achieve ultra-high storage density, and its read/write speed is comparable to that of SRAM. Due to the tape-like structure of its storage cell, a "shift" operation is introduced to access racetrack memory. Thus, prior research mainly focused on minimizing shift latency/energy of racetrack memory while leveraging its ultra-high storage density. The reliability of the shift operation, however, is not well addressed. In fact, racetrack memory suffers from unsuccessful shifts due to domain misalignment. Such a problem is called "position error" in this work. It can significantly reduce the mean-time-to-failure (MTTF) of racetrack memory to an intolerable level. Even worse, conventional error correction codes (ECCs), which are designed for "bit errors", cannot protect racetrack memory from position errors.
In this work, we investigate the position error model of a shift operation and categorize position errors into two types: "stop-in-middle" errors and "out-of-step" errors. To eliminate the stop-in-middle error, we propose a technique called sub-threshold shift (STS) to perform a more reliable shift in two stages. To detect and recover from the out-of-step error, a protection mechanism called position error correction code (p-ECC) is proposed. We first describe how to design a p-ECC for different protection strengths and analyze the corresponding design overhead. Then, we further propose how to reduce the area cost of p-ECC by leveraging the "overhead region" in a racetrack memory stripe. With these protection mechanisms, we introduce a position-error-aware shift architecture. Experimental results demonstrate that, after using our techniques, the overall MTTF of racetrack memory is improved from 1.33μs to more than 69 years, with only 0.2% performance degradation. Trade-offs among reliability, area, performance, and energy are also explored with comprehensive discussion.
******
Heterogeneous systems employ specialization for energy efficiency. Since data movement is expected to be a dominant consumer of energy, these systems employ specialized memories (e.g., scratchpads and FIFOs) for better efficiency for targeted data. These memory structures, however, tend to exist in local address spaces, incurring significant performance and energy penalties due to inefficient data movement between the global and private spaces. We propose an efficient heterogeneous memory system where specialized memory components are tightly coupled in a unified and coherent address space. This paper applies these ideas to a system with CPUs and GPUs with scratchpads and caches.
We introduce a new memory organization, stash, that combines the benefits of caches and scratchpads without incurring their downsides. Like a scratchpad, the stash is directly addressed (without tags and TLB accesses) and provides compact storage. Like a cache, the stash is globally addressable and visible, providing implicit data movement and increased data reuse. We show that the stash provides better performance and energy than a cache and a scratchpad, while enabling new use cases for heterogeneous systems. For 4 microbenchmarks, which exploit new use cases (e.g., reuse across GPU compute kernels), compared to scratchpads and caches, the stash reduces execution cycles by an average of 27% and 13% respectively and energy by an average of 53% and 35%. For 7 current GPU applications, which are not designed to exploit the new features of the stash, compared to scratchpads and caches, the stash reduces cycles by 10% and 12% on average (max 22% and 31%) respectively, and energy by 16% and 32% on average (max 30% and 51%).
******
The increasing number of cores in manycore architectures causes important power and scalability problems in the memory subsystem. One solution is to introduce scratchpad memories alongside the cache hierarchy, forming a hybrid memory system. Scratchpad memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. A good way to hide the programmability difficulties from the programmer is to give the compiler the responsibility of generating code to manage the scratchpad memories. Unfortunately, compilers do not succeed in generating this code in the presence of random memory accesses with unknown aliasing hazards.
This paper proposes a coherence protocol for the hybrid memory system that allows the compiler to always generate code to manage the scratchpad memories. In coordination with the compiler, memory accesses that may access stale copies of data are identified and diverted to the valid copy of the data. The proposal allows the architecture to be exposed to the programmer as a shared memory manycore, maintaining the programming simplicity of shared memory models and preserving backwards compatibility. In a 64-core manycore, the coherence protocol adds overheads of 4% in performance, 8% in network traffic and 9% in energy consumption to enable the usage of the hybrid memory system that, compared to a cache-based system, achieves a speedup of 1.14x and reduces on-chip network traffic and energy consumption by 29% and 17%, respectively.
******
Chip designers have shown increasing interest in integrating specialized fixed-function coprocessors into multicore designs to improve energy efficiency. Recent work in academia [11, 37] and industry [16] has sought to enable more fine-grain offloading at the granularity of functions and loops. The sequential program now needs to migrate across the chip utilizing the appropriate accelerator for each program region. As the execution migrates, it has become increasingly challenging to retain the temporal and spatial locality of the original program as well as manage the data sharing.
We show that with the increasing energy cost of wires and caches relative to compute operations, it is imperative to optimize data movement to retain the energy benefits of accelerators. We develop FUSION, a lightweight coherent cache hierarchy for accelerators and study the tradeoffs compared to a scratchpad based architecture. We find that coherency, both between the accelerators and with the CPU, can help minimize data movement and save energy. FUSION leverages temporal coherence [32] to optimize data movement within the accelerator tile. The accelerator tile includes small per-accelerator L0 caches to minimize hit energy and a per-tile shared cache to improve localized-sharing between accelerators and minimize data exchanges with the host LLC. We find that overall FUSION improves performance by 4.3× compared to an oracle DMA that pushes data into the scratchpad. In workloads with inter-accelerator sharing we save up to 10x the dynamic energy of the cache hierarchy by minimizing the host-accelerator data ping-ponging.
******
Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models---aka "realtime AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.
******
As the power density and power consumption of large-scale datacenters continue to grow, the challenge of removing heat from these datacenters and keeping them cool is increasingly urgent and costly. With the largest datacenters now exceeding 200 MW of power, the cooling systems that prevent overheating cost on the order of tens of millions of dollars. Prior work proposed to deploy phase change materials (PCM) and use Thermal Time Shifting (TTS) to reshape the thermal load of a datacenter by storing heat during peak hours of high utilization and releasing it during off hours when utilization is low, enabling a smaller cooling system to handle the same peak load. The peak cooling load reduction enabled by TTS is greatly beneficial; however, TTS is a passive system that cannot handle many workload mixtures or adapt to changing load or environmental characteristics. In this work we propose VMT, a thermal aware job placement technique that adds an active, tunable component to enable greater control over datacenter thermal output. We propose two different job placement algorithms for VMT and perform a scale-out study of VMT in a simulated server cluster. We provide analysis of the use cases and trade-offs of each algorithm, and show that VMT reduces peak cooling load by up to 12.8% to provide over two million dollars in cost savings when a smaller cooling system is installed, or allows for over 7,000 additional servers to be added in scenarios where TTS is ineffective.
******
We present FireSim, an open-source simulation platform that enables cycle-exact microarchitectural simulation of large scale-out clusters by combining FPGA-accelerated simulation of silicon-proven RTL designs with a scalable, distributed network simulation. Unlike prior FPGA-accelerated simulation tools, FireSim runs on Amazon EC2 F1, a public cloud FPGA platform, which greatly improves usability, provides elasticity, and lowers the cost of large-scale FPGA-based experiments. We describe the design and implementation of FireSim and show how it can provide sufficient performance to run modern applications at scale, to enable true hardware-software co-design. As an example, we demonstrate automatically generating and deploying a target cluster of 1,024 3.2 GHz quad-core server nodes, each with 16 GB of DRAM, interconnected by a 200 Gbit/s network with 2 microsecond latency, which simulates at a 3.4 MHz processor clock rate (less than 1,000x slowdown over real-time). In aggregate, this FireSim instantiation simulates 4,096 cores and 16 TB of memory, runs ∼14 billion instructions per second, and harnesses 12.8 million dollars worth of FPGAs---at a total cost of only ∼$100 per simulation hour to the user. We present several examples to show how FireSim can be used to explore various research directions in warehouse-scale machine design, including modeling networks with high-bandwidth and low-latency, integrating arbitrary RTL designs for a variety of commodity and specialized datacenter nodes, and modeling a variety of datacenter organizations, as well as reusing the scale-out FireSim infrastructure to enable fast, massively parallel cycle-exact single-node microarchitectural experimentation.
******
Analog/mixed-signal machine learning (ML) accelerators exploit the unique computing capability of analog/mixed-signal circuits and inherent error tolerance of ML algorithms to obtain higher energy efficiencies than digital ML accelerators. Unfortunately, these analog/mixed-signal ML accelerators lack programmability, and even instruction set interfaces, to support diverse ML algorithms or to enable essential software control over the energy-vs-accuracy tradeoffs. We propose PROMISE, the first end-to-end design of a PROgrammable MIxed-Signal accElerator from Instruction Set Architecture (ISA) to high-level language compiler for acceleration of diverse ML algorithms. We first identify prevalent operations in widely-used ML algorithms and key constraints in supporting these operations for a programmable mixed-signal accelerator. Second, based on that analysis, we propose an ISA with a PROMISE architecture built with silicon-validated components for mixed-signal operations. Third, we develop a compiler that can take a ML algorithm described in a high-level programming language (Julia) and generate PROMISE code, with an IR design that is both language-neutral and abstracts away unnecessary hardware details. Fourth, we show how the compiler can map an application-level error tolerance specification for neural network applications down to low-level hardware parameters (swing voltages for each application Task) to minimize energy consumption. Our experiments show that PROMISE can accelerate diverse ML algorithms with energy efficiency competitive even with fixed-function digital ASICs for specific ML algorithms, and the compiler optimization achieves significant additional energy savings even for only 1% extra errors.
******
In recent years, Deep Neural Networks (DNNs) have achieved tremendous success for diverse problems such as classification and decision making. Efficient support for DNNs on CPUs, GPUs and accelerators has become a prolific area of research, resulting in a plethora of techniques for energy-efficient DNN inference. However, previous proposals focus on a single execution of a DNN. Popular applications, such as speech recognition or video classification, require multiple back-to-back executions of a DNN to process a sequence of inputs (e.g., audio frames, images). In this paper, we show that consecutive inputs exhibit a high degree of similarity, causing the inputs/outputs of the different layers to be extremely similar for successive frames of speech or images of a video. Based on this observation, we propose a technique to reuse some results of the previous execution, instead of computing the entire DNN. Computations related to inputs with negligible changes can be avoided with minor impact on accuracy, saving a large percentage of computations and memory accesses. We propose an implementation of our reuse-based inference scheme on top of a state-of-the-art DNN accelerator. Results show that, on average, more than 60% of the inputs of any neural network layer tested exhibit negligible changes with respect to the previous execution. Avoiding the memory accesses and computations for these inputs results in 63% energy savings on average.
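The reuse rule can be sketched per element as below, treating each output as depending on one input (as in an activation layer); real layers mix many inputs per output, and the change threshold is an assumed knob, so this is only the shape of the mechanism.

    #include <cmath>
    #include <vector>

    // Recompute a layer's outputs only for inputs that changed noticeably
    // since the previous frame; reuse cached results for the rest.
    struct ReusableLayer {
        std::vector<float> prev_in, prev_out;  // cached from the last execution
        float threshold = 0.01f;               // assumed similarity knob

        float compute_element(float x) {
            return x > 0.0f ? x : 0.0f;  // placeholder per-element work (ReLU)
        }

        std::vector<float> forward(const std::vector<float>& in) {
            std::vector<float> out(in.size());
            bool first = prev_in.size() != in.size();  // no cache yet
            for (size_t i = 0; i < in.size(); ++i) {
                if (first || std::fabs(in[i] - prev_in[i]) > threshold)
                    out[i] = compute_element(in[i]);   // changed: recompute
                else
                    out[i] = prev_out[i];              // similar: reuse
            }
            prev_in = in;
            prev_out = out;
            return out;
        }
    };

The savings come from the reuse branch skipping both the computation and the memory accesses that feed it, which is where the reported 63% energy reduction originates.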
******
Genomics can transform health-care through precision medicine. Plummeting sequencing costs will soon make genome testing affordable to the masses. Compute efficiency, however, has to improve by orders of magnitude to sequence and analyze the raw genome data. Sequencing software used today can take several hundreds to thousands of CPU hours to align reads to a reference sequence. This paper presents GenAx, an accelerator for read alignment, a time-consuming step in genome sequencing. It consists of a seeding accelerator and a seed-extension accelerator. The latter is based on an innovative automata design built from the ground up to enable hardware acceleration. Unlike conventional Levenshtein automata, it is string-independent and scales quadratically with edit distance instead of string length. It supports critical features commonly used in sequencing, such as affine gap scoring and traceback. GenAx provides a throughput of 4,058K reads/s for Illumina 101 bp reads. GenAx achieves a 31.7x speedup over the standard BWA-MEM sequence aligner running on a 56-thread, dual-socket, 14-core Xeon E5 server processor, while reducing power consumption by 12x and area by 5.6x.
******
Prefetching is a central component in most microarchitectures. Many different algorithms have been proposed with varying degrees of complexity and effectiveness. There are inherent tradeoffs among the various metrics, especially when a design tries to exploit both simpler and more complex access patterns simultaneously. In principle, therefore, a collaboration of sub-components, each specialized in exploiting a different access pattern, should be better than a monolithic design attempting a similar prefetching scope. In this paper, we present empirical evidence for this hypothesis. We use a few components dedicated to simple patterns such as canonical strided accesses, and show that a composite prefetcher built from these components can significantly outperform state-of-the-art prefetchers. More importantly, the composite prefetcher achieves better performance through a more limited prefetching scope while attaining much higher accuracy. This suggests that the design can be readily expanded with additional components targeting other patterns.
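A canonical stride component of such a composite design might look like the following sketch; the confidence threshold and prefetch degree are illustrative choices, and real components would share training and arbitration logic.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Per-PC stride prefetcher: track the last address and stride for each
    // load instruction, and issue a prefetch only after the stride repeats,
    // keeping the component's scope narrow and its accuracy high.
    struct StrideComponent {
        struct Entry { uint64_t last_addr = 0; int64_t stride = 0; int conf = 0; };
        std::unordered_map<uint64_t, Entry> table;  // indexed by load PC

        // Returns candidate prefetch addresses for this access (possibly none).
        // Assumes real access addresses are nonzero (zero marks an empty entry).
        std::vector<uint64_t> on_access(uint64_t pc, uint64_t addr) {
            Entry& e = table[pc];
            int64_t s = int64_t(addr) - int64_t(e.last_addr);
            if (e.last_addr != 0 && s == e.stride) {
                if (e.conf < 3) ++e.conf;           // stride confirmed again
            } else {
                e.stride = s;                        // new stride: retrain
                e.conf = 0;
            }
            e.last_addr = addr;
            std::vector<uint64_t> out;
            if (e.conf >= 2)                         // confident: prefetch
                out.push_back(addr + uint64_t(e.stride));  // degree-1
            return out;
        }
    };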
******
On-die caches are a popular method to help hide the main memory latency. However, it is difficult to build large caches without substantially increasing their access latency, which in turn hurts performance. To overcome this difficulty, on-die caches are typically built as a multi-level cache hierarchy. One such popular hierarchy that has been adopted by modern microprocessors is the three-level cache hierarchy. Building a three-level cache hierarchy enables a low average hit latency, since most requests are serviced from faster inner-level caches. This has motivated recent microprocessors to deploy large level-2 (L2) caches that can help further reduce the average hit latency. In this paper, we present a fundamental analysis of the popular three-level cache hierarchy and examine its performance delivery using program criticality. Through our detailed analysis we show that the current trend of increasing L2 cache sizes to reduce average hit latency is, in fact, an inefficient design choice. We instead propose the Criticality Aware Tiered Cache Hierarchy (CATCH), which utilizes accurate detection of program criticality in hardware and a novel set of inter-cache prefetchers to ensure that on-die data accesses on the critical path of execution are served at the latency of the fastest level-1 (L1) cache. The last level cache (LLC) serves the purpose of reducing slow memory accesses, thereby making the large L2 cache redundant for most applications. The area saved by eliminating the L2 cache can then be used to create more efficient processor configurations. Our simulation results show that CATCH outperforms the three-level cache hierarchy with a large 1 MB L2 and exclusive LLC by an average of 8.4%, and a baseline with 256 KB L2 and inclusive LLC by 10.3%. We also show that CATCH enables a powerful framework to explore broad chip-level area, performance and power tradeoffs in cache hierarchy design. Supported by CATCH, we evaluate radical architecture directions such as eliminating the L2 altogether and show that such architectures can yield a 4.5% performance gain over the baseline at nearly 30% less area, or improve performance by 7.3% at the same area while reducing energy consumption by 11%.
******
This paper shows that in the presence of data prefetchers, cache replacement policies are faced with a large unexplored design space. In particular, we observe that while Belady's MIN algorithm minimizes the total number of cache misses---including those for prefetched lines---it does not minimize the number of demand misses. To address this shortcoming, we introduce Demand-MIN, a variant of Belady's algorithm that minimizes the number of demand misses at the cost of increased prefetcher traffic. Together, MIN and Demand-MIN define the boundaries of an important design space, with many intermediate points lying between them. To reason about this design space, we introduce a simple conceptual framework, which we use to define a new cache replacement policy called Harmony. Our empirical evaluation shows that for a mix of SPEC 2006 benchmarks running on a 4-core system with a stride prefetcher, Harmony improves IPC by 7.7% over an LRU baseline, compared to 3.7% for the previous state-of-the-art. On an 8-core system, Harmony improves IPC by 9.4% compared to 4.4% for the previous state-of-the-art.
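The two offline policies can be contrasted in a few lines over a future trace whose accesses are tagged as demand or prefetch. This toy (with an invented trace encoding) illustrates only the victim-selection rules that bound the design space, not the Harmony policy that approximates them online.

    # trace: list of (line, is_demand) in future order; position t = "now".
    def next_use(trace, t, line):
        for i in range(t + 1, len(trace)):
            if trace[i][0] == line:
                return i, trace[i][1]
        return len(trace), False          # dead line: treat as "prefetchable"

    def min_victim(cache, trace, t):
        # Belady's MIN: evict the line reused furthest in the future.
        return max(cache, key=lambda line: next_use(trace, t, line)[0])

    def demand_min_victim(cache, trace, t):
        # Demand-MIN: prefer lines whose next use is a prefetch (the
        # prefetcher can bring them back without a demand miss), furthest
        # first; fall back to MIN among demand-reused lines.
        def rank(line):
            i, is_demand = next_use(trace, t, line)
            return (is_demand, -i)
        return min(cache, key=rank)

    future = [("Y", False), ("X", True)]  # Y prefetched soon, X demanded later
    print(min_victim({"X", "Y"}, future, -1))         # -> X (furthest reuse)
    print(demand_min_victim({"X", "Y"}, future, -1))  # -> Y (refetchable)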
******
Weak memory models are a consequence of the desire on the part of architects to preserve all the uniprocessor optimizations while building a shared memory multiprocessor. The efforts to formalize the weak memory models of ARM and POWER over the last decades are mostly empirical - they try to capture empirically observed behaviors - and end up providing no insight into the inherent nature of weak memory models. This paper takes a constructive approach to find a common base for weak memory models: we explore what a weak memory model would look like if we constructed it with the explicit goal of preserving all the uniprocessor optimizations. We disallow some optimizations which break a programmer's intuition in highly unexpected ways. The constructed model, which we call the General Atomic Memory Model (GAM), allows all four load/store reorderings. We give the construction procedure of GAM and provide insights which are used to define its operational and axiomatic semantics. Though no attempt is made to match GAM to any existing weak memory model, we show by simulation that GAM has performance comparable to other models. No deep knowledge of memory models is needed to read this paper.
******
A large number of workloads are written in garbage-collected languages. These applications spend 10-35% of their CPU cycles on GC, and these numbers increase further for pause-free concurrent collectors. As this amounts to a significant fraction of resources in scenarios ranging from data centers to mobile devices, reducing the cost of GC would improve the efficiency of a wide range of workloads. We propose to decrease these overheads by moving GC into a small hardware accelerator that is located close to the memory controller and performs GC more efficiently than a CPU. We first show a general design of such a GC accelerator and describe how it can be integrated into both stop-the-world and pause-free garbage collectors. We then demonstrate an end-to-end RTL prototype of this design, integrated into a RocketChip RISC-V System-on-Chip (SoC) executing full Java benchmarks within JikesRVM running under Linux on FPGAs. Our prototype performs the mark phase of a tracing GC at 4.2x the performance of an in-order CPU, at just 18.5% the area (an amount equivalent to 64KB of SRAM). By prototyping our design in a real system, we show that our accelerator can be adopted without invasive changes to the SoC, and we estimate its performance, area and energy.
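At its core, the mark phase the accelerator implements is a graph traversal; the accelerator's advantage comes from streaming this loop near the memory controller rather than from a different algorithm. A minimal Python rendering, with an invented object layout:

    def mark(roots, heap):
        # heap: object id -> list of referenced object ids (invented layout)
        marked = set()
        worklist = list(roots)
        while worklist:
            obj = worklist.pop()
            if obj in marked:
                continue
            marked.add(obj)                       # set the mark bit
            worklist.extend(heap.get(obj, []))    # trace outgoing references
        return marked

    heap = {0: [1, 2], 1: [2], 2: [], 3: [4], 4: []}   # 3 and 4 are garbage
    print(mark(roots=[0], heap=heap))                  # -> {0, 1, 2}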
******
As computer architecture continues to expand beyond software-agnostic microarchitecture to data center organization, reconfigurable logic, heterogeneous systems, application-specific logic, and even radically different technologies such as quantum computing, detailed cycle-level simulation is no longer presupposed. Exploring designs under such complex interacting relationships (e.g., performance, energy, thermal, cost, voltage, frequency, cooling energy, leakage, etc.) calls for a more integrative but higher-level approach. We propose Charm, a domain-specific language supporting Closed-form High-level ARchitecture Modeling. Charm enables mathematical representations of mutually dependent architectural relationships to be specified, composed, checked, evaluated and reused. The language is interpreted through a combination of symbolic evaluation (e.g., restructuring) and compiler techniques (e.g., memoization and invariant hoisting), generating executable evaluation functions and optimized analysis procedures. Further supporting reuse, a type system constrains architectural quantities and ensures models operate only in a validated domain. Through two case studies, we demonstrate that Charm allows one to define high-level architecture models concisely, maximize reusability, catch unreasonable assumptions and inputs, and significantly speed up design space exploration.
******
GPU memory systems adopt a multi-dimensional hardware structure to provide the bandwidth necessary to support 100s to 1000s of concurrent threads. On the software side, GPU-compute workloads also use multi-dimensional structures to organize the threads. We observe that these structures can combine unfavorably and create significant resource imbalance in the memory subsystem --- causing low performance and poor power-efficiency. The key issue is that it is highly application-dependent which memory address bits exhibit high variability. To solve this problem, we first provide an entropy analysis approach tailored for the highly concurrent memory request behavior in GPU-compute workloads. Our window-based entropy metric captures the information content of each address bit of the memory requests that are likely to co-exist in the memory system at runtime. Using this metric, we find that GPU-compute workloads exhibit entropy valleys distributed throughout the lower order address bits. This indicates that efficient GPU-address mapping schemes need to harvest entropy from broad address-bit ranges and concentrate the entropy into the bits used for channel and bank selection in the memory subsystem. This insight leads us to propose the Page Address Entropy (PAE) mapping scheme which concentrates the entropy of the row, channel and bank bits of the input address into the bank and channel bits of the output address. PAE maps straightforwardly to hardware and can be implemented with a tree of XOR-gates. PAE improves performance by 1.31X and power-efficiency by 1.25X compared to state-of-the-art permutation-based address mapping.
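Both halves of the approach are easy to sketch: a windowed per-bit entropy measurement over recent addresses, and an XOR fold that gathers entropy from a broad bit range into a few channel/bank selection bits (realizable as a tree of XOR gates). The window size, bit ranges, and strided access pattern below are illustrative, not the paper's parameters.

    from collections import deque
    from math import log2

    def bit_entropy(addresses, bit):
        # Binary entropy of one address bit across a window of requests.
        ones = sum((a >> bit) & 1 for a in addresses)
        p = ones / len(addresses)
        if p in (0.0, 1.0):
            return 0.0
        return -(p * log2(p) + (1 - p) * log2(1 - p))

    def xor_fold(addr, src_bits, dst_width):
        # Fold a broad range of source bits into dst_width selection bits;
        # in hardware this is a small tree of XOR gates.
        out = 0
        for i, b in enumerate(src_bits):
            out ^= ((addr >> b) & 1) << (i % dst_width)
        return out

    window = deque(maxlen=256)
    for a in range(0, 1 << 16, 4096):    # strided accesses: low bits are idle
        window.append(a)
    print([round(bit_entropy(window, b), 2) for b in range(16)])  # entropy valley
    print(xor_fold(0x12345678, src_bits=range(6, 22), dst_width=2))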
******
Recent studies on commercial hardware demonstrated that irregular GPU applications can bottleneck on virtual-to-physical address translations. In this work, we explore ways to reduce address translation overheads for such applications. We discover that the order of servicing a GPU's address translation requests (specifically, page table walks) plays a key role in determining the amount of translation overhead experienced by an application. We find that different SIMD instructions executed by an application require vastly different amounts of work to service their address translation needs, primarily depending upon the number of distinct pages they access. We show that better forward progress is achieved by prioritizing translation requests from the instructions that require less work to service their address translation needs. Further, in the GPU's Single-Instruction-Multiple-Thread (SIMT) execution paradigm, all threads that execute in lockstep (wavefront) need to finish operating on their respective data elements (and thus, finish their address translations) before the execution moves ahead. Thus, batching walk requests originating from the same SIMD instruction could reduce unnecessary stalls. We demonstrate that the reordering of translation requests based on the above principles improves the performance of several irregular GPU applications by 30% on average.
******
Hardware caches balance fast lookup, high hit rates, energy efficiency, and simplicity of implementation. For L1 caches, however, achieving this balance is difficult because of constraints imposed by virtual memory. L1 caches are usually virtually-indexed, physically-tagged (VIPT), but this means that they must be highly associative to achieve good capacity. Unfortunately, excessive associativity compromises performance by degrading access times without significantly boosting hit rates, and increases access energy. We propose Seesaw to overcome this problem. Seesaw leverages the increasing ubiquity of superpages - since superpages have more page offset bits, they can accommodate VIPT caches with more sets than what is traditionally possible with only base page sizes. Seesaw dynamically reduces the number of ways that are looked up based on the page size, improving performance and energy. Seesaw requires modest hardware and no OS or application changes.
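The arithmetic behind the VIPT constraint fits in a few lines: for the set index to be translation-invariant, index plus block-offset bits must lie within the page offset, which forces a minimum associativity for a given capacity. An illustrative (not Seesaw-specific) sketch:

    def min_ways(cache_bytes, block_bytes, page_bytes):
        # Max sets addressable by the untranslated page-offset bits,
        # and the associativity that capacity then forces.
        sets_limit = page_bytes // block_bytes
        blocks = cache_bytes // block_bytes
        return max(1, blocks // sets_limit)

    for page in (4 << 10, 2 << 20):       # 4KB base page vs. 2MB superpage
        print(f"page={page:>8}B -> min associativity "
              f"{min_ways(32 << 10, 64, page)}")
    # 4KB pages force a 32KB/64B cache to be 8-way; a 2MB superpage
    # would permit a direct-mapped organization.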
******
This paper makes a case for a new cross-layer interface, Expressive Memory (XMem), to communicate higher-level program semantics from the application to the system software and hardware architecture. XMem provides (i) a flexible and extensible abstraction, called an Atom, enabling the application to express key program semantics in terms of how the program accesses data and the attributes of the data itself, and (ii) new cross-layer interfaces to make the expressed higher-level information available to the underlying OS and architecture. By providing key information that is otherwise unavailable, XMem exposes a new, rich view of the program data to the OS and the different architectural components that optimize memory system performance (e.g., caches, memory controllers). By bridging the semantic gap between the application and the underlying memory resources, XMem provides two key benefits. First, it enables architectural/system-level techniques to leverage key program semantics that are challenging to predict or infer. Second, it improves the efficacy and portability of software optimizations by alleviating the need to tune code for specific hardware resources (e.g., cache space). While XMem is designed to enhance and enable a wide range of memory optimizations, we demonstrate the benefits of XMem using two use cases: (i) improving the performance portability of software-based cache optimization by expressing the semantics of data locality in the optimization and (ii) improving the performance of OS-based page placement in DRAM by leveraging the semantics of data structures and their access properties.
******
We present a non-speculative solution for a coalescing store buffer in total store order (TSO) consistency. Coalescing violates TSO with respect to both conflicting loads and conflicting stores, if partial state is exposed to the memory system. Proposed solutions for coalescing in TSO resort to speculation-and-rollback or centralized arbitration to guarantee atomicity for the set of stores whose order is affected by coalescing. These solutions can suffer from scalability, complexity, resource-conflict deadlock, and livelock problems. A non-speculative solution that writes out coalesced cachelines, one at a time, over a typical directory-based MESI coherence layer, has the potential to transcend these problems if it can guarantee absence of deadlock in a practical way. There are two major problems for a non-speculative coalescing store buffer: i) how to present to the memory system a group of coalesced writes as atomic, and ii) how to not deadlock while attempting to do so. For this, we introduce a new lexicographical order. Relying on this order, conflicting atomic groups of coalesced writes can be individually performed per cache block, without speculation, rollback, or replay, and without deadlock or livelock, yet appear atomic to conflicting parties and preserve TSO. One of our major contributions is to show that lexicographical orders based on a small part of the physical address (sub-address order) are deadlock-free throughout the system when taking into account resource-conflict deadlocks. Our approach exceeds the performance and energy benefits of two baseline TSO store buffers and matches the coalescing (and energy savings) of a release-consistency store buffer, at comparable cost.
******
Store-queue-free architectures remove the store queue and instead use memory cloaking to communicate in-flight stores. In these architectures, frequent mispredictions may occur when store-to-load dependencies are inconsistent. We present DMDP (Dynamic Memory Dependence Predication), which modifies the microarchitecture behavior for such loads to mitigate memory dependence mispredictions. When a given dependence is hard to predict, i.e., a given load occasionally depends on a particular store but is independent at other times, DMDP predicates the load so that the address of the load is compared with the address of the predicted store to compute a predicate. This predicate guides the load to obtain the value from either the cache or the colliding store. The predication provided by DMDP i) enables the loads and their dependent instructions to execute much earlier, ii) reduces the hardware complexity of store-queue-free mechanisms, and iii) reduces the number of mispredictions. DMDP outperforms a state-of-the-art store-queue-free architecture by 7.17% on integer benchmarks and 4.48% on floating-point benchmarks in our SPEC 2006 evaluation. We further show that, despite executing extra predication instructions, DMDP is power efficient, saving about 6.7% on EDP.
******
Designing directory cache coherence protocols is complicated because coherence transactions are not atomic in modern multicore processors. A coherence transaction comprises multiple messages, and these messages can interleave with other conflicting coherence transactions initiated by other cores. To overcome this architectural challenge, we present ProtoGen, an automated tool for taking the description of a directory protocol with atomic transactions (i.e., no concurrency) and generating the corresponding protocol for a multicore with non-atomic transactions. ProtoGen outputs the finite state machines for the cache and directory controllers, including all of the transient states that are possible with concurrent transactions. We have used ProtoGen to generate complete MSI, MESI, and MOSI protocols given their stable state protocol specifications. We have verified the generated protocols for safety and deadlock freedom using the Murϕ model checker. Our generated protocols are identical to or better than manually generated protocols, at times even discovering opportunities to reduce stalling.
******
Recent heterogeneous architectures have trended toward tighter integration and shared memory largely due to the efficient communication and programmability enabled by this shift. However, such integration is complex, because accelerators have widely disparate methods for accessing and keeping data coherent. Some processors use caches backed by hardware coherence protocols like MESI, while others prefer lightweight software coherence protocols or use specialized memories like scratchpads with differing state and communication granularities. Modern solutions tend to build interfaces that extend existing MESI-style CPU coherence protocols, often by adding hierarchical indirection through intermediate shared caches. Although functionally correct, these strategies lack flexibility and generally suffer from performance limitations that make them sub-optimal for some emerging accelerators and workloads. Instead, we need a flexible interface that can efficiently integrate existing and future devices - without requiring intrusive changes to their memory structure. We introduce Spandex, an improved coherence interface based on the simple and scalable DeNovo coherence protocol. Spandex (which takes its name from the flexible material commonly used in one-size-fits-all textiles) directly interfaces devices with diverse coherence properties and memory demands, enabling each device to communicate in a manner appropriate for its specific access properties. We demonstrate the importance of this flexibility by comparing this strategy against a more conventional MESI-based hierarchical solution for a diverse range of heterogeneous applications. On average for the applications studied, Spandex reduces execution time by 16% (max 29%) and network traffic by 27% (max 58%) relative to the MESI-based hierarchical solution.
******
Spiking Neural Networks (SNNs) play an important role in neuroscience as they help neuroscientists understand how the nervous system works. To model the nervous system, SNNs incorporate the concept of time into neurons and inter-neuron interactions called spikes; a neuron's internal state changes with respect to time and input spikes, and a neuron fires an output spike when its internal state satisfies certain conditions. As the neurons forming the nervous system behave differently, SNN simulation frameworks must be able to simulate the diverse behaviors of the neurons. To support any neuron model, some frameworks rely on general-purpose processors at the cost of inefficiency in simulation speed and energy consumption. Other frameworks employ specialized accelerators to overcome this inefficiency; however, the accelerators support only a limited set of neuron models due to their model-driven designs, making accelerator-based frameworks unable to simulate target SNNs. In this paper, we present Flexon, a flexible digital neuron which exploits the biologically common features shared by diverse neuron models, to enable efficient SNN simulations. To design Flexon, we first collect SNNs from prior work in neuroscience research and analyze the neuron models the SNNs employ. From the analysis, we observe that the neuron models share a set of biologically common features, and that the features can be combined to simulate a significantly larger set of neuron behaviors than the existing model-driven designs. Furthermore, we find that the features share a small set of computational primitives which can be exploited to further reduce the chip area. The resulting digital neurons, Flexon and spatially folded Flexon, are flexible, highly efficient, and can be easily integrated with existing hardware. Our prototyping results using a TSMC 45 nm standard cell library show that a 12-neuron Flexon array improves energy efficiency by 6,186x and 422x over CPU and GPU, respectively, in a small footprint of 9.26 mm2. The results also show that a 72-neuron spatially folded Flexon array incurs a smaller footprint of 7.62 mm2 and achieves geomean speedups of 122.45x and 9.83x over CPU and GPU, respectively.
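For readers outside the field, a leaky integrate-and-fire update is one example of the per-step dynamics a flexible digital neuron must cover; the generic forward-Euler sketch below uses invented constants and is not Flexon's actual feature decomposition.

    def lif_step(v, i_syn, dt=0.1, tau=10.0, v_rest=-65.0,
                 v_thresh=-50.0, v_reset=-70.0, r=1.0):
        # Membrane potential leaks toward rest and integrates input current.
        v += dt / tau * ((v_rest - v) + r * i_syn)
        if v >= v_thresh:                  # threshold crossed: fire and reset
            return v_reset, True
        return v, False

    v = -65.0
    for step in range(300):                # 30 ms of simulated time
        v, spiked = lif_step(v, i_syn=20.0)
        if spiked:
            print(f"spike at t={step * 0.1:.1f} ms")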
******
A proposed first step in replicating the computational methods used in the brain's neocortex is the development of a feedforward computing paradigm based on temporal relationships among inter-neuron voltage spikes. A "space-time" algebra captures the essential features of such a paradigm. The space-time algebra supports biologically plausible neural networks, as envisioned by theoretical neuroscientists. It also supports a generalization of previously proposed "race logic". A key feature of race logic is that it can be directly implemented with off-the-shelf CMOS digital circuits. This opens the possibility of designing brain-like neural networks in the neuroscience domain and implementing them directly in the CMOS digital circuit domain.
******
The increasing difficulty in leveraging CMOS scaling for improved performance requires exploring alternative technologies. A promising technique is to exploit the physical properties of devices to specialize certain computations. A recently proposed approach uses molecular-scale optical devices to construct a Resonance Energy based Sampling Unit (RSU) to accelerate sampling from parameterized probability distributions. Sampling is an important component of many algorithms, including statistical machine learning. This paper explores the relationship between application result quality and RSU design. The previously proposed RSU-G focuses on Gibbs sampling using Markov Chain Monte Carlo (MCMC) solvers for Markov Random Field (MRF) Bayesian Inference. By quantitatively analyzing the result quality across three computer vision applications, we find that the previously proposed RSU-G lacks both sufficient precision and dynamic range in key design parameters, which limits the overall result quality compared to software-only MCMC implementations. Naively scaling the problematic parameters to increase precision and dynamic range consumes too much area and power. Therefore, we introduce a new RSU-G microarchitecture that exploits an alternative approach to increase precision that incurs 1.27X power and equivalent area, while maintaining the significant speedups of the previous design and supporting a wider set of applications.
******
Increasing the capacity of the Last Level Cache (LLC) can help scale the memory wall. Due to prohibitive area and leakage power, however, growing the conventional SRAM LLC already incurs diminishing returns. Emerging Non-Volatile Memory (NVM) technologies like Spin Torque Transfer RAM (STTRAM) promise high density and low leakage, thereby offering an attractive alternative for building large capacity LLCs. However, these technologies have significantly longer write latency than SRAM, which interferes with reads and severely limits their performance potential. Despite recent work showing write latency reduction at the NVM technology level, practical considerations like high yield and low bit error rates will result in a significant loss of NVM density when these techniques are implemented. Therefore, improving the write latency while compromising on density results in sub-optimal usage of the NVM technology. In this paper we present a novel STTRAM LLC design that mitigates the long write latency, thereby delivering SRAM-like performance while preserving the benefits of high density. Based on a light-weight learning mechanism, our solution relieves LLC congestion through two schemes. First, we propose write-congestion-aware bypass, which eliminates a large fraction of writes. Despite lowering LLC hit rates, which could severely degrade performance in a conventional LLC, our policy smartly modulates the bypass, overcomes the hit-rate loss and delivers significant performance gain. Furthermore, our solution establishes a virtual hybrid cache that absorbs and eliminates the redundant writes which otherwise might be repeatedly and slowly written to the NVM LLC. Detailed simulation of the traditional SPEC CPU 2006 suite as well as important industry workloads running on a 4-core system shows that our proposal delivers an average 26% performance improvement over a baseline LLC design using 8MB STTRAM, while reducing the memory system energy by 10%. Our design outperforms a similar-area SRAM LLC by nearly 18%, thereby making NVM technology an attractive alternative for future high performance computing.
******
Stacked-DRAM technology has enabled high bandwidth gigascale DRAM caches. Since DRAM caches require a tag store of several tens of megabytes, commercial DRAM cache designs typically co-locate tag and data within the DRAM array. DRAM caches are organized as a direct-mapped structure so that the tag and data can be streamed out in a single access. While direct-mapped DRAM caches provide low hit latency, they suffer from low hit rate due to conflict misses. Ideally, we want the hit rate of a set-associative DRAM cache without incurring the additional latency and bandwidth costs of increasing associativity. To address this problem, way prediction can be applied to a set-associative DRAM cache to achieve the latency and bandwidth of a direct-mapped DRAM cache. Unfortunately, conventional way prediction policies typically require per-set storage, causing multi-megabyte storage overheads for gigascale DRAM caches. If we can obtain accurate way prediction without incurring significant storage overheads, we can efficiently enable set-associativity for DRAM caches. This paper proposes Associativity via Coordinated Way-Install and Way-Prediction (ACCORD), a design that steers an incoming line to a "preferred way" based on the line address and uses the preferred way as the default way prediction. We propose two way-steering policies that are effective for 2-way caches. First, Probabilistic Way-Steering (PWS) steers lines to a preferred way with high probability, while still allowing lines to be installed in an alternate way in case of conflicts. Second, Ganged Way-Steering (GWS) steers lines of a spatially contiguous region to the way where an earlier line from that region was installed. On a 2-way cache, ACCORD (PWS+GWS) obtains a way prediction accuracy of 90% and retains a hit rate similar to a baseline 2-way cache while incurring 320 bytes of storage overhead. We extend ACCORD to support highly-associative caches using a Skewed Way-Steering (SWS) design that steers a line to at most two ways in the highly-associative cache. This design retains the low latency of the 2-way ACCORD while obtaining most of the hit-rate benefits of a highly associative design. Our studies with a 4GB DRAM cache backed by non-volatile memory show that ACCORD provides an average of 11% speedup (up to 54%) across a wide range of workloads.
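A sketch of the steering idea follows; the hash function, steering probability, and region size are invented for illustration. The key point is that the predicted way costs no per-set storage because it is recomputed from the line address.

    import random

    def preferred_way(line_addr):
        return (line_addr ^ (line_addr >> 7)) & 1       # cheap 1-bit hash

    def install_way_pws(line_addr, p_steer=0.9):
        # Probabilistic Way-Steering: usually install in the preferred way,
        # occasionally in the alternate way so conflicts can be absorbed.
        w = preferred_way(line_addr)
        return w if random.random() < p_steer else 1 - w

    region_way = {}                                     # small region table
    def install_way_gws(line_addr, region_bits=6):
        # Ganged Way-Steering: lines of a contiguous region follow the way
        # chosen for the first line installed from that region.
        region = line_addr >> region_bits
        return region_way.setdefault(region, preferred_way(line_addr))

    def predict_way(line_addr):
        return preferred_way(line_addr)                 # default prediction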
******
The growing size of convolutional neural networks (CNNs) requires large amounts of on-chip storage. In many CNN accelerators, the limited on-chip memory capacity causes massive off-chip memory access and leads to very high system energy consumption. Embedded DRAM (eDRAM), with higher density than SRAM, can be used to improve on-chip buffer capacity and reduce off-chip access. However, eDRAM requires periodic refresh to maintain data retention, which consumes significant energy. Refresh is unnecessary if the data's lifetime in eDRAM is shorter than the eDRAM's retention time. Based on this principle, we propose a Retention-Aware Neural Acceleration (RANA) framework for CNN accelerators to save total system energy consumption with refresh-optimized eDRAM. The RANA framework includes three levels of techniques: a retention-aware training method, a hybrid computation pattern and a refresh-optimized eDRAM controller. At the training level, CNN's error resilience is exploited in training to improve eDRAM's tolerable retention time. At the scheduling level, RANA assigns each CNN layer the computation pattern that consumes the lowest energy. At the architecture level, a refresh-optimized eDRAM controller is proposed to reduce unnecessary refresh operations. We implement an evaluation platform to verify RANA. Owing to the RANA framework, 99.7% of eDRAM refresh operations can be removed with negligible performance and accuracy loss. Compared with a conventional SRAM-based CNN accelerator, an eDRAM-based CNN accelerator strengthened by RANA can save 41.7% of off-chip memory access and 66.2% of system energy consumption, with the same area cost.
******
Hardware specialization is commonly used in data centers to compensate for the nearing end of CMOS technology scaling. While offering superior performance and energy-efficiency returns compared to general-purpose processors, specialized accelerators are bound to the same device technology constraints, and are thus prone to similar limitations in the future. Once technology scaling plateaus, accelerator and application tuning will reach a point of near-optimum, with no clear direction for further improvements. Emerging non-volatile memory (NVM) technologies follow different scaling trends due to different physical properties and manufacturing techniques. NVMs have inspired recent efforts of innovation in computer systems, as they possess appealing qualities such as high capacity and low energy. We present the COmpute-REuse Accelerators (COREx) architecture that shifts computations from the scalability-hindered transistor-based logic towards the continuing-to-scale storage domain. COREx leverages datacenter redundancy by integrating a storage layer together with the accelerator processing layer. The added layer stores the outcomes of previous accelerated computations. The previously computed results are reused in the case of recurring computations, thus eliminating the need to re-compute them. We designed COREx as a combination of an accelerator and a specialized storage layer built with emerging memory technologies, and evaluated it on a set of datacenter workloads. Our results show that, when integrated with a well-tuned accelerator, COREx achieves an average speedup of 6.4X and average savings of 50% in energy and 68% in energy-delay product. We expect gains to increase further in the future, as memory technologies continue to improve steadily.
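Functionally, the reuse layer acts as a memo table in front of the accelerator: hash the input, look it up in the (NVM-backed) store, and compute only on a miss. A minimal sketch, with invented names and an in-memory dictionary standing in for the NVM layer:

    import hashlib

    memo = {}                                   # stands in for the NVM layer

    def accelerate(payload: bytes, compute):
        key = hashlib.sha256(payload).digest()  # identify recurring inputs
        if key in memo:
            return memo[key]                    # reuse a stored outcome
        result = compute(payload)               # fall back to the accelerator
        memo[key] = result
        return result

    print(accelerate(b"packet-123", compute=lambda p: len(p)))  # computed
    print(accelerate(b"packet-123", compute=lambda p: len(p)))  # reused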
******
Linear algebra is ubiquitous across virtually every field of science and engineering, from climate modeling to macroeconomics. This ubiquity makes linear algebra a prime candidate for hardware acceleration, which can improve both the run time and the energy efficiency of a wide range of scientific applications. Recent work on memristive hardware accelerators shows significant potential to speed up matrix-vector multiplication (MVM), a critical linear algebra kernel at the heart of neural network inference tasks. Regrettably, the proposed hardware is constrained to a narrow range of workloads: although the eight- to 16-bit computations afforded by memristive MVM accelerators are acceptable for machine learning, they are insufficient for scientific computing where high-precision floating point is the norm. This paper presents the first proposal to enable scientific computing on memristive crossbars. Three techniques are explored---reducing overheads by exploiting exponent range locality, early termination of fixed-point computation, and static operation scheduling---that together enable a fixed-point memristive accelerator to perform high-precision floating point without the exorbitant cost of naive floating-point emulation on fixed-point hardware. A heterogeneous collection of crossbars with varying sizes is proposed to efficiently handle sparse matrices, and an algorithm for mapping the dense subblocks of a sparse matrix to an appropriate set of crossbars is investigated. The accelerator can be combined with existing GPU-based systems to handle datasets that cannot be efficiently handled by the memristive accelerator alone. The proposed optimizations permit the memristive MVM concept to be applied to a wide range of problem domains, respectively improving the execution time and energy dissipation of sparse linear solvers by 10.3x and 10.9x over a purely GPU-based system.
******
This paper presents the Neural Cache architecture, which re-purposes cache structures to transform them into massively parallel compute units capable of running inferences for Deep Neural Networks. Techniques to perform in-situ arithmetic in SRAM arrays, create efficient data mappings, and reduce data movement are proposed. The Neural Cache architecture is capable of fully executing convolutional, fully connected, and pooling layers in-cache. The proposed architecture also supports quantization in-cache. Our experimental results show that the proposed architecture can improve inference latency by 18.3X over a state-of-the-art multi-core CPU (Xeon E5) and by 7.7X over a server-class GPU (Titan Xp) for the Inception v3 model. Neural Cache improves inference throughput by 12.4X over the CPU (2.2X over the GPU), while reducing power consumption by 50% over the CPU (53% over the GPU).
******
Modern solid-state drives (SSDs) use new host-interface protocols, such as NVMe, to provide applications with fast access to storage. These new protocols make use of a concept known as the multi-queue SSD (MQ-SSD), where the SSD has direct access to the application-level I/O request queues. This removes most of the OS software stack that was used in older protocols to control how and when the I/O requests were dispatched to storage devices. Unfortunately, while the elimination of the OS software stack leads to a significant performance improvement, we show in this paper that it introduces a new problem: unfairness. This is because the elimination of the OS software stack eliminates the mechanisms that were used to provide fairness among applications in older SSDs. To study application-level unfairness, we perform experiments using four real state-of-the-art MQ-SSDs. We demonstrate that the lack of fair scheduling mechanisms leads to high unfairness among concurrently-executing applications due to the interference among them. For instance, when one of these applications issues many more I/O requests than others, the other applications are slowed down significantly. We perform a comprehensive analysis of interference in real MQ-SSDs, and find four major interference sources: (1) the intensity of requests sent by each application, (2) differences in request access patterns, (3) the ratio of reads to writes, and (4) garbage collection. To alleviate unfairness in MQ-SSDs, we propose the Flash-Level INterference-aware scheduler (FLIN). FLIN is a lightweight I/O request scheduling mechanism that provides fairness among requests from different applications. FLIN uses a three-stage scheduling algorithm that protects against all four major sources of interference, while respecting the application-level priorities assigned by the host. FLIN is implemented fully within the SSD controller firmware, requiring no new hardware, and has negligible (<0.06%) storage cost. Compared to a state-of-the-art I/O scheduler, FLIN improves the fairness and performance of a wide range of enterprise and datacenter storage workloads, with an average improvement of 70% and 47%, respectively.
******
We describe GraFBoost, a flash-based architecture with hardware acceleration for external analytics of multi-terabyte graphs. We compare the performance of GraFBoost with 1 GB of DRAM against various state-of-the-art graph analytics software including FlashGraph, running on a 32-thread Xeon server with 128 GB of DRAM. We demonstrate that despite the relatively small amount of DRAM, GraFBoost achieves high performance with very large graphs that no other system can handle, and rivals the performance of the fastest software platforms on graph sizes that existing platforms can handle. Unlike in-memory and semi-external systems, GraFBoost uses a constant amount of memory for all problems, and its performance decreases very slowly as graph sizes increase, allowing GraFBoost to scale to much larger problems than possible with existing systems while using far fewer resources on a single-node system. The key component of GraFBoost is the sort-reduce accelerator, which implements a novel method to sequentialize fine-grained random accesses to flash storage. The sort-reduce accelerator logs random update requests and then uses hardware-accelerated external sorting with interleaved reduction functions. GraFBoost also lazily stores newly updated vertex values generated in each superstep of the algorithm with the old vertex values to further reduce I/O traffic. We evaluate the performance of GraFBoost for PageRank, breadth-first search and betweenness centrality on our FPGA-based prototype (Xilinx VC707 with 1 GB DRAM and 1 TB flash) and compare it to other graph processing systems, including a pure software implementation of GraFBoost.
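The sort-reduce dataflow is easy to state in software: log fine-grained (vertex, update) pairs, sort by vertex, and apply the reduction while merging, so updates reach storage as one sequential stream. This in-memory sketch shows only the dataflow, not the hardware-accelerated external sort:

    def sort_reduce(updates, reduce_fn):
        out = []
        for key, val in sorted(updates, key=lambda kv: kv[0]):
            if out and out[-1][0] == key:
                out[-1] = (key, reduce_fn(out[-1][1], val))  # fold duplicates
            else:
                out.append((key, val))
        return out

    updates = [(7, 1.0), (2, 0.5), (7, 2.0), (2, 0.25)]  # random-order updates
    print(sort_reduce(updates, reduce_fn=lambda a, b: a + b))
    # -> [(2, 0.75), (7, 3.0)]: one sequential pass per sorted run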
******
Performance-critical transaction and storage systems require fast persistence of write data. Typically, a non-volatile RAM (NVRAM) is employed on the datapath to the permanent storage, to temporarily and quickly store write data before the system acknowledges the write request. NVRAM is commonly implemented with battery-backed DRAM. Unfortunately, battery-backed DRAM is small and costly, and occupies a precious DIMM slot. In this paper, we make a case for the dual, byte- and block-addressable solid-state drive (2B-SSD), a novel NAND flash SSD architecture designed to offer a dual view of byte addressability and traditional block addressability at the same time. Unlike a conventional storage device, 2B-SSD allows accessing the same file with two independent byte- and block-I/O paths. It controls the data transfer between its internal DRAM and NAND flash memory through an intuitive software interface, and manages the mapping of the two address spaces. 2B-SSD realizes a wholly different way and speed of accessing files on a storage device: applications can access them directly using memory-mapped I/O and, moreover, write with a DRAM-like latency. To quantify the benefits of 2B-SSD, we modified the logging subsystems of major database engines to store log records directly on it without buffering them in the host memory. When running popular workloads, we measured throughput gains ranging from 1.2X to 2.8X with no risk of data loss.
******
Emerging Non-Volatile Memories (NVMs) are expected to be included in future main memory, providing the opportunity to host important data persistently in main memory. However, achieving persistency requires that programs be written with failure-safety in mind. Many persistency models and techniques have been proposed to help the programmer reason about failure-safety. They require that the programmer eagerly flush data out of caches to make it persistent. Eager persistency comes with a large overhead because it adds many instructions to the program for flushing cache lines and incurs costly stalls at barriers to wait for data to become durable. To reduce these overheads, we propose Lazy Persistency (LP), a software persistency technique that allows caches to slowly send dirty blocks to the non-volatile main memory (NVMM) through natural evictions. With LP, there are no additional writes to NVMM, no decrease in write endurance, and no performance degradation from cache line flushes and barriers. Persistency failures are discovered using software error detection (a checksum), and the system recovers from them by recomputing inconsistent results. We describe the properties and design of LP and demonstrate how it can be applied to loop-based kernels popularly used in scientific computing. We evaluate LP and compare it to the state-of-the-art Eager Persistency technique from prior work. Compared to it, LP reduces the execution time and write amplification overheads from 9% and 21% to only 1% and 3%, respectively.
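A minimal model of the failure-safety mechanism, with illustrative checksum choice and region granularity: each result region is persisted together with a checksum, and recovery recomputes any region whose checksum no longer matches.

    import json, zlib

    def persist_region(store, region_id, values):
        blob = json.dumps(values).encode()
        store[region_id] = (blob, zlib.crc32(blob))   # results + checksum,
                                                      # drifting to NVMM via
                                                      # natural evictions
    def recover(store, region_id, recompute):
        entry = store.get(region_id)
        if entry is not None:
            blob, crc = entry
            if zlib.crc32(blob) == crc:
                return json.loads(blob)               # consistent: keep it
        return recompute()                            # inconsistent: redo work

    store = {}
    persist_region(store, 0, [1.0, 2.0, 3.0])
    blob, crc = store[0]
    store[0] = (blob[:-1] + b"!", crc)                # simulate a torn write
    print(recover(store, 0, recompute=lambda: [1.0, 2.0, 3.0]))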
******
Non-Volatile Memory technologies are advancing rapidly and may augment or replace DRAM in future systems. However, a key question is how programmers will use them to construct and manipulate persistent data. One possible approach gives programmers direct access to persistent memory using relocatable persistent pools that hold persistent objects which can be accessed using persistent pointers, called ObjectIDs. Prior work has shown that hardware-supported address translation for ObjectIDs provides significant performance improvement and simplifies programming; however, these works did not consider the large overheads incurred to check permissions before accessing persistent objects. In this paper, we identify permission checking in hardware as a critical mechanism that must be included when translating ObjectIDs to addresses in order to simplify programming and fully benefit from hardware translation. To support it, we add a System Persistent Object Table (SPOT) that supports translation and permission checks on ObjectIDs. The SPOT holds all known pools, their physical addresses, and their permissions information in memory. When a program attempts to access a persistent object, the SPOT is consulted and permissions are verified without trapping to the operating system. We have implemented our new design in a cycle-accurate simulator and compared it with software-only approaches and prior work. We find that our design offers a compelling 2.9x speedup on average for microbenchmarks that access pools with the RANDOM pattern, and 1.4x and 1.8x speedups on TPC-C and vacation, respectively, for the SEPARATE pattern.
******
Novel algorithmic advances have paved the way for robotics to transform the dynamics of many social and enterprise applications. To achieve true autonomy, robots need to continuously process and interact with their environment through computationally-intensive motion planning and control algorithms under a low power budget. Specialized architectures offer a potent choice for providing low-power, high-performance accelerators for these algorithms. Instead of taking the traditional route of profiling and mapping hot code regions to accelerators, this paper delves into the algorithmic characteristics of the application domain. We observe that many motion planning and control algorithms are formulated as constrained optimization problems solved online through Model Predictive Control (MPC). While models and objective functions differ between robotic systems and tasks, the structure of the optimization problem and solver remains fixed. Using this theoretical insight, we create RoboX, an end-to-end solution which exposes a high-level domain-specific language to roboticists. This interface allows roboticists to express the physics of the robot and its task in a form close to their concise mathematical expressions. The RoboX backend then automatically maps this high-level specification to a novel programmable architecture, which harbors a programmable memory access engine and compute-enabled interconnects. Hops in the interconnect are augmented with simple functional units that either operate on in-flight data or are bypassed according to a micro-program. Evaluations with six different robotic systems and tasks show that RoboX provides a 29.4x (7.3x) speedup and a 22.1x (79.4x) performance-per-watt improvement over an ARM Cortex A57 (Intel Xeon E3). Compared to GPUs, RoboX attains 7.8x, 65.5x, and 71.8x higher performance-per-watt than the Tegra X2, GTX 650 Ti, and Tesla K40, respectively, within a power envelope of only 3.4 Watts at 45 nm.
******
Modern high-performance servers leverage a large number of emerging peripheral devices (e.g., data processing accelerators, non-volatile memory storage, high-bandwidth network cards) to meet the ever-increasing performance demands of server applications. However, as such servers experience severe kernel overhead due to frequently invoked device operations (e.g., buffer management and data copy), server architects have proposed various hardware and software approaches to enable direct communications among the devices. Unfortunately, existing direct device-to-device (D2D) communication schemes still suffer from low performance and a lack of flexibility. First, software-based schemes depend on complicated kernel routines and necessitate multiple hardware-software and user-kernel boundary crossings, which significantly limit the performance improvement opportunities from direct D2D communications. On the other hand, hardware-based schemes require tight integration and custom-built devices, preventing architects from flexibly adding off-the-shelf devices. In this paper, we propose DCS-ctrl, a novel Hardware-based Device-Control (HDC) mechanism for Device-Centric Server (DCS) architecture to provide fast and CPU-efficient direct D2D communications among a large number of off-the-shelf peripheral devices. The key idea of DCS-ctrl is to implement a low-cost and flexible device-control mechanism on an independent FPGA device called the HDC Engine. As the HDC Engine manages all data and control transfers among devices at the hardware level, the server achieves high performance, scalability, and flexibility. First, optimizing both data and control paths at the hardware level minimizes the latency of inter-device communications. Second, implementing FPGA-based reconfigurable device controllers enables direct D2D communications among commodity devices and thus improves per-device flexibility. Third, merging heterogeneous device operations with intermediate data-processing support creates more opportunities for direct inter-device communications in server applications. Our DCS-ctrl prototype reduces the latency of software-based direct D2D communications by 42% and the CPU utilization by 52%.
******
Since computers increasingly execute in constrained environments, they are being equipped with controllers for resource management. However, the operation of modern computer systems is structured in multiple layers, such as the hardware, OS, and networking layers - each with its own resources. Managing such a system scalably and portably requires that we have a controller in each layer, and that the different controllers coordinate their operation. In addition, such controllers should not rely on heuristics, but be based on formal control theory. This paper presents a new approach to building coordinated multilayer formal controllers for computers. The approach uses Structured Singular Value (SSV) controllers from Robust Control Theory. Such controllers are especially suited for multilayer computer system control. Indeed, SSV controllers can read signals from other controllers to coordinate multilayer operation. In addition, they allow designers to specify the discrete values allowed in each input, and the desired bounds on output value deviations. Finally, they accept uncertainty guardbands, which incorporate the effects of interference between the controllers. We call this approach Yukta. To assess its effectiveness, we prototype it on an 8-core big.LITTLE board. We build a two-layer SSV controller and show that it is very effective. Yukta reduces the energy-delay product (ExD) and the execution time of a set of applications by an average of 50% and 38%, respectively, over advanced heuristic-based coordinated controllers.
******
Modern processors support instruction fetch with the instruction cache (I-cache) and branch target buffer (BTB). Due to timing and area constraints, the I-cache and BTB must make efficient use of their limited capacities. Blocks in the I-cache or entries in the BTB that have low potential for reuse should be replaced by more useful blocks/entries. This work explores predictive replacement policies based on reuse prediction that can be applied to both the I-cache and BTB. Using a large suite of recently released industrial traces, we show that predictive replacement policies can reduce misses in the I-cache and BTB. We introduce Global History Reuse Prediction (GHRP), a replacement technique that uses the history of past instruction addresses and their reuse behaviors to predict dead blocks in the I-cache and dead entries in the BTB. This paper describes the effectiveness of GHRP as a dead block replacement and bypass optimization for both the I-cache and BTB. For a 64KB set-associative I-cache with a 64B block size, GHRP lowers I-cache misses per 1000 instructions (MPKI) by an average of 18% over the least-recently-used (LRU) policy on a set of 662 industrial workloads, performing significantly better than Static Re-reference Interval Prediction (SRRIP) [1] and Sampling Dead Block Prediction (SDBP) [2]. For a 4K-entry BTB, GHRP lowers MPKI by an average of 30% over LRU, 23% over SRRIP, and 29% over SDBP.
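A hedged sketch of the general shape of such a predictor (sizes, hash, and thresholds are invented, not the tuned GHRP configuration): recent instruction addresses are folded into a signature that indexes saturating counters, and a high counter predicts a dead block or BTB entry.

    class ReusePredictor:
        def __init__(self, bits=12, threshold=8):
            self.table = [0] * (1 << bits)    # saturating counters
            self.mask = (1 << bits) - 1
            self.threshold = threshold
            self.history = 0                  # folded instruction-address path

        def signature(self, addr):
            return (self.history ^ addr) & self.mask

        def update_history(self, addr):
            self.history = ((self.history << 5) ^ addr) & self.mask

        def train(self, addr, was_dead):
            # On eviction/refill feedback, move the counter toward "dead"
            # or "live" for this path signature.
            sig = self.signature(addr)
            delta = 1 if was_dead else -1
            self.table[sig] = max(0, min(15, self.table[sig] + delta))

        def predict_dead(self, addr):
            return self.table[self.signature(addr)] >= self.threshold

    p = ReusePredictor()
    p.update_history(0x400)
    print(p.predict_dead(0x404))   # False until trained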
******
Hardware support for deep convolutional neural networks (CNNs) is critical to advanced computer vision in mobile and embedded devices. Current designs, however, accelerate generic CNNs; they do not exploit the unique characteristics of real-time vision. We propose to use the temporal redundancy in natural video to avoid unnecessary computation on most frames. A new algorithm, activation motion compensation, detects changes in the visual input and incrementally updates a previously-computed activation. The technique takes inspiration from video compression and applies well-known motion estimation techniques to adapt to visual changes. We use an adaptive key frame rate to control the trade-off between efficiency and vision quality as the input changes. We implement the technique in hardware as an extension to state-of-the-art CNN accelerator designs. The new unit reduces the average energy per frame by 54%, 62%, and 87% for three CNNs with less than 1% loss in vision accuracy.
******
Continuous computer vision (CV) tasks increasingly rely on convolutional neural networks (CNN). However, CNNs have massive compute demands that far exceed the performance and energy constraints of mobile devices. In this paper, we propose and develop an algorithm-architecture co-designed system, Euphrates, that simultaneously improves the energy-efficiency and performance of continuous vision tasks. Our key observation is that changes in pixel data between consecutive frames represent visual motion. We first propose an algorithm that leverages this motion information to relax the number of expensive CNN inferences required by continuous vision applications. We co-design a mobile System-on-a-Chip (SoC) architecture to maximize the efficiency of the new algorithm. The key to our architectural augmentation is to co-optimize different SoC IP blocks in the vision pipeline collectively. Specifically, we propose to expose the motion data that is naturally generated by the Image Signal Processor (ISP) early in the vision pipeline to the CNN engine. Measurement and synthesis results show that Euphrates achieves up to 66% SoC-level energy savings (4x for the vision computations), with only 1% accuracy loss.
******
Sensors in mobile devices and IoT systems increasingly generate data that may contain private information of individuals. Generally, users of such systems are willing to share their data for public and personal benefit as long as their private information is not revealed. A fundamental challenge lies in designing systems and data processing techniques for obtaining meaningful information from sensor data, while maintaining the privacy of the data and individuals. In this work, we explore the feasibility of providing local differential privacy on the ultra-low-power systems that power many sensor and IoT applications. We show that the low-resolution and fixed-point nature of ultra-low-power implementations prevents privacy guarantees from being provided, due to low-quality noising. We present two techniques, resampling and thresholding, to overcome this limitation. The techniques, along with a privacy budget control algorithm, are implemented in hardware to provide privacy guarantees with high integrity. We show that our hardware implementation, DP-Box, has low overhead and provides high utility, while guaranteeing local differential privacy, for a range of sensor/IoT benchmarks.
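For context, the baseline mechanism is Laplace noising scaled by sensitivity/epsilon; the sketch below adds properly scaled noise and then applies a crude fixed-point rounding to illustrate the distortion that, per the paper, undermines the guarantee on ultra-low-power hardware. The paper's resampling and thresholding remedies are not reproduced here, and all parameters are illustrative.

    import random

    def ldp_release(value, sensitivity, epsilon, frac_bits=None):
        # Laplace(0, b) with b = sensitivity/epsilon, built as the
        # difference of two unit-rate exponentials.
        b = sensitivity / epsilon
        noise = b * (random.expovariate(1.0) - random.expovariate(1.0))
        out = value + noise
        if frac_bits is not None:            # ultra-low-power fixed point
            step = 2.0 ** -frac_bits
            out = round(out / step) * step   # coarse rounding distorts the
        return out                           # noise distribution

    print(ldp_release(37.0, sensitivity=1.0, epsilon=0.5))
    print(ldp_release(37.0, sensitivity=1.0, epsilon=0.5, frac_bits=2))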
******
Wearable devices now leverage multi-core processors to cater to the increasing computational demands of their applications via multi-threading. However, the power and performance constraints of many wearable applications can only be satisfied when thread-level parallelism is coupled with hardware acceleration of common computational kernels. ASIC accelerators with high performance/watt suffer from high non-recurring engineering costs. Configurable accelerators that can be reused across applications present a promising alternative. Autonomous configurable accelerators loosely coupled to the processor occupy additional silicon area for local data and control and incur data communication overhead. In contrast, configurable instruction set extension (ISE) accelerators tightly integrated into the processor pipeline eliminate such overheads by sharing the existing processor resources. Yet, naively adding full-blown ISE accelerators to each core in a many-core architecture would lead to huge area and power overheads, which is clearly infeasible in resource-constrained wearables. In this paper, we propose Stitch, a many-core architecture where tiny, heterogeneous, configurable and fusible ISE accelerators, called polymorphic patches, are effectively enmeshed with the cores. The novelty of our architecture lies in the ability to stitch together multiple polymorphic patches, each of which can handle only very simple ISEs, across the chip to create large, virtual accelerators that can execute complex ISEs. The virtual connections are realized efficiently with a very lightweight compiler-scheduled network-on-chip (NoC) with no buffers or control logic. Our evaluations across representative wearable applications show an average 2.3X improvement in runtime for Stitch compared to a baseline many-core processor without ISEs, at modest area and power overheads.
******
Since its inception half a century ago, DRAM has required dynamic/active refresh operations that block read requests and decrease performance. We propose refreshing DRAM in the background without stalling read accesses to refreshing memory blocks, similar to the static/background refresh in SRAM. Our proposed Nonblocking Refresh works by refreshing only a portion of the data in a memory block at a time and uses redundant data, such as Reed-Solomon codes, in the block to compute the block's refreshing/unreadable data and satisfy read requests. For proof of concept, we apply Nonblocking Refresh to server memory systems, where every memory block already contains redundant data to provide hardware failure protection. In this context, Nonblocking Refresh can utilize the server memory system's existing per-block redundant data in the common case when there are no hardware faults to correct, without requiring any dedicated redundant data of its own. Our evaluations show that, on average across five server memory systems with different redundancy and failure protection strengths, Nonblocking Refresh improves performance by 16.2% and 30.3% for 16Gb and 32Gb DRAM chips, respectively.
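The read-around-refresh principle can be shown with simple XOR parity standing in for the stronger codes (e.g., Reed-Solomon) the paper assumes: while one chunk of a block is mid-refresh and unreadable, a read is served by reconstructing that chunk from the remaining chunks plus parity.

    def xor_bytes(chunks):
        out = bytearray(len(chunks[0]))
        for c in chunks:
            for i, b in enumerate(c):
                out[i] ^= b
        return bytes(out)

    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_bytes(data)             # per-block redundant data

    refreshing = 1                       # chunk 1 is mid-refresh, unreadable
    readable = [c for i, c in enumerate(data) if i != refreshing]
    recovered = xor_bytes(readable + [parity])
    assert recovered == data[refreshing]
    print(recovered)                     # b'BBBB' served without stalling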
******
In this paper, we propose Random Embedded Secret Tokens (REST), a simple hardware primitive to provide content-based checks, and show how it can be used to mitigate common types of spatial and temporal memory errors at very low cost. REST is simply a very large random value that is embedded into programs. To provide memory safety, REST is used to bookend data structures during allocation. If the hardware accesses a REST value during execution, due to programming errors or adversarial actions, it reports a privileged memory safety exception. Implementing REST requires 1 bit of metadata per L1 data cache line and a comparator to check for REST tokens during a cache fill. The software infrastructure to provide memory safety with REST reuses a production-quality memory error detection tool, AddressSanitizer, by changing less than 1.5K lines of code. REST-based memory safety offers several advantages compared to extant methods: (1) it does not require significant redesign of hardware or software, (2) the overhead of heap and stack safety is 2%, compared to 40% for AddressSanitizer, (3) the security of the memory safety implementation is improved compared to AddressSanitizer, and (4) REST-based memory safety can mitigate heap safety errors in legacy binaries without recompilation or access to source code. These advantages mark a significant step towards continuous runtime memory safety monitoring and mitigation for legacy and new binaries.
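A toy software model of the bookending idea (the flat word array, token width, and bump allocator are invented): allocations are framed by copies of a large random token, and any access that lands on a token word is flagged - the software analogue of the hardware's privileged memory safety exception.

    import secrets

    TOKEN = secrets.randbits(64)         # REST uses a much wider random value

    class GuardedHeap:
        def __init__(self, words=64):
            self.mem = [0] * words
            self.free_ptr = 0
        def alloc(self, n):
            base = self.free_ptr
            self.mem[base] = TOKEN                   # front bookend
            self.mem[base + 1 + n] = TOKEN           # rear bookend
            self.free_ptr = base + n + 2
            return base + 1
        def store(self, addr, value):
            if self.mem[addr] == TOKEN:              # hardware would raise a
                raise MemoryError("REST token hit")  # memory safety exception
            self.mem[addr] = value

    h = GuardedHeap()
    p = h.alloc(4)
    h.store(p + 3, 42)                   # in bounds: fine
    try:
        h.store(p + 4, 99)               # one-past-the-end overflow
    except MemoryError as e:
        print("caught:", e)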
******
DRAM technology scaling has the undesirable side effect of degrading cell reliability. One such concern in deeply scaled DRAMs is the increased coupling between adjacent cells, commonly referred to as crosstalk. High access frequency of certain rows in the DRAM may cause data loss in cells of physically adjacent rows due to crosstalk. The malicious exploitation of this crosstalk by repeatedly accessing a row to induce this effect is known as row hammering. Additionally, inadvertent row hammering may also occur due to the naturally unbalanced access patterns of applications. In this paper, we analyze the efficiency of existing approaches for mitigating wordline crosstalk and demonstrate that they have been conservatively designed. Given the unbalanced nature of DRAM accesses, a small group of dynamically allocated counters in banks can deterministically detect "hot" rows and mitigate crosstalk. Based on our findings, we propose a Counter-based Adaptive Tree (CAT) approach to mitigate wordline crosstalk using adaptive trees of counters to guide appropriate refreshing of vulnerable rows. The key idea is to tune the distribution of the counters to the rows in a bank based on the memory reference patterns. In contrast to deterministic solutions, CAT utilizes fewer counters, making it practically feasible to implement on-chip. Compared to existing probabilistic approaches, CAT more precisely refreshes rows vulnerable to crosstalk based on their access frequency. Experimental results on workloads from four benchmark suites show that CAT reduces the crosstalk-mitigation refresh power overhead in quad-core systems to 7%, an improvement over the 21% and 18% incurred by the leading deterministic and probabilistic approaches, respectively. Moreover, CAT incurs very low performance overhead (~0.5%). Hardware synthesis evaluation shows that CAT can be implemented on-chip with only a nominal area overhead.
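A sketch of an adaptive counter tree follows; the split/refresh thresholds and row range are invented, and the real CAT tunes counter allocation differently. A counter initially covers a whole range of rows and splits as it heats up, so counting converges on the genuinely hot row, whose neighbors are then refreshed.

    class CatNode:
        SPLIT, REFRESH = 16, 128         # illustrative thresholds
        def __init__(self, lo, hi):
            self.lo, self.hi, self.count, self.kids = lo, hi, 0, None
        def access(self, row, victims):
            if self.kids:                # already split: route to a child
                self.kids[row > (self.lo + self.hi) // 2].access(row, victims)
                return
            self.count += 1
            if self.lo == self.hi:       # a single tracked row
                if self.count >= self.REFRESH:
                    victims.append((self.lo - 1, self.lo + 1))  # refresh the
                    self.count = 0                              # neighbors
            elif self.count >= self.SPLIT:   # range too hot: split the counter
                mid = (self.lo + self.hi) // 2
                self.kids = (CatNode(self.lo, mid), CatNode(mid + 1, self.hi))

    tree, victims = CatNode(0, 1023), []
    for _ in range(300):
        tree.access(300, victims)        # hammer row 300
    print(victims)                       # -> [(299, 301)]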
******
Modern instruction set decoders feature translation of native instructions into internal micro-ops to simplify CPU design and improve instruction-level parallelism. However, this translation is static in most known instances. This work proposes context-sensitive decoding, a technique that enables customization of the micro-op translation at the microsecond or faster granularity, based on the current execution context and/or preset hardware events. While there are many potential applications, this work demonstrates its effectiveness with two use cases: 1) as a novel security defense to thwart instruction/data cache-based side-channel attacks, as demonstrated on commercial implementations of RSA and AES, and 2) as a power management technique that performs selective devectorization to enable efficient unit-level power gating. By allowing execution to transition rapidly between different translation modes, this architecture defends against a variety of attacks, completely obfuscating code-dependent cache accesses while sacrificing only 5% in steady-state performance, orders of magnitude less than prior art. By selectively disabling the vector units without disabling vector arithmetic, context-sensitive decoding reduces energy by 12.9% with minimal loss in performance. Both optimizations work with no significant changes to the pipeline or the external ISA.
******
As demonstrated by numerous practical attacks, the physical act of computation emits unintended and damaging information through infinitesimal variations in timing, power, and resource contention. While there are many techniques for preventing the leakage of information through power channels for specific cryptographic units, they are typically either built directly into the hardware logic or exploit intricate mathematical properties of the algorithm itself. However, such leaks are not uniform in time but, as we show, rather occur in specific bursts. Exploiting this observation, we propose a set of software-controlled techniques allowing for the seamless disconnection and reconnection of general-purpose programmable components in a system-on-chip. Such a system is capable of providing brief moments of electrical isolation during which the most critical computations can be performed free from both timing and power measurement. Of course, disconnection comes at a cost. To balance the resulting trade-off between overhead and security effectively, we describe a new analysis technique to uncover the "leakiest" intervals of time, we provide an algorithm to co-optimize the covering of these intervals and the performance/energy costs under a set of architecture-imposed constraints, and we explore the architectural and software ramifications of such intermittent disconnection. In the end, we find that by hiding only between 15% and 30% of the trace, at a performance cost of between 15% and 50%, we are able to reduce the mutual information between the leakage model and key bits by 75% on average, and to nearly zero in specific cases.
******
Generative Adversarial Networks (GANs) are one of the most recent deep learning models that generate synthetic data from limited genuine datasets. GANs are on the frontier as further extension of deep learning into many domains (e.g., medicine, robotics, content synthesis) requires massive sets of labeled data that are generally either unavailable or prohibitively costly to collect. Although GANs are gaining prominence in various fields, there are no accelerators for these new models. In fact, GANs leverage a new operator, called transposed convolution, that exposes unique challenges for hardware acceleration. This operator first inserts zeros within the multidimensional input, then convolves a kernel over this expanded array to add information to the embedded zeros. Even though there is a convolution stage in this operator, the inserted zeros lead to underutilization of the compute resources when a conventional convolution accelerator is employed. We propose the GANAX architecture to alleviate the sources of inefficiency associated with the acceleration of GANs using conventional convolution accelerators, making the first GAN accelerator design possible. We propose a reorganization of the output computations to allocate compute rows with similar patterns of zeros to adjacent processing engines, which also avoids inconsequential multiply-adds on the zeros. This compulsory adjacency reclaims data reuse across these neighboring processing engines, which would otherwise diminish due to the inserted zeros. The reordering breaks the full SIMD execution model, which is prominent in convolution accelerators. Therefore, we propose a unified MIMD-SIMD design for GANAX that leverages repeated patterns in the computation to create distinct microprograms that execute concurrently in SIMD mode. The interleaving of MIMD and SIMD modes is performed at the granularity of a single microprogrammed operation. To amortize the cost of MIMD execution, we propose a decoupling of data access from data processing in GANAX. This decoupling leads to a new design that breaks each processing engine into an access micro-engine and an execute micro-engine. The proposed architecture extends the concept of access-execute architectures to the finest granularity of computation for each individual operand. Evaluations with six GAN models show, on average, 3.6x speedup and 3.1x energy savings over Eyeriss without compromising the efficiency of conventional convolution accelerators. These benefits come with a mere ≈7.8% area increase. These results suggest that GANAX is an effective initial step that paves the way for accelerating the next generation of deep neural models.
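
The zero-insertion behavior that causes the underutilization can be reproduced in a few lines of NumPy (a didactic 1-D version; GANAX itself targets multidimensional operators):

    import numpy as np

    def transposed_conv1d(x, k, stride=2):
        # Insert stride-1 zeros between input elements...
        up = np.zeros((len(x) - 1) * stride + 1)
        up[::stride] = x
        # ...then run an ordinary convolution over the expanded array. On a
        # conventional accelerator, every multiply against an inserted zero
        # is wasted work; these are the operations GANAX skips.
        return np.convolve(up, k, mode="full")

    x = np.array([1.0, 2.0, 3.0])
    k = np.array([1.0, 0.5])
    print(transposed_conv1d(x, k))    # about half the MACs hit inserted zeros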
******
Deep Convolutional Neural Networks (CNNs) perform billions of operations for classifying a single input. To reduce these computations, this paper offers a solution that leverages a combination of runtime information and the algorithmic structure of CNNs. Specifically, in numerous modern CNNs, the outputs of compute-heavy convolution operations are fed to activation units that output zero if their input is negative. By exploiting this unique algorithmic property, we propose a predictive early activation technique, dubbed SnaPEA. This technique cuts the computation of convolution operations short if it determines that the output will be negative. SnaPEA can operate in two distinct modes, exact and predictive. In the exact mode, with no loss in classification accuracy, SnaPEA statically re-orders the weights based on their signs and periodically performs a single-bit sign check on the partial sum. Once the partial sum drops below zero, the rest of the computations can simply be ignored, since the output value will be zero in any case. In the predictive mode, which trades classification accuracy for larger savings, SnaPEA speculatively cuts the computation short even earlier than the exact mode. To control the accuracy, we develop a multi-variable optimization algorithm that thresholds the degree of speculation. As such, the proposed algorithm exposes a knob to gracefully navigate the trade-offs between classification accuracy and computation reduction. Compared to a state-of-the-art CNN accelerator, SnaPEA in the exact mode yields, on average, 28% speedup and 16% energy reduction in various modern CNNs without affecting their classification accuracy. With 3% loss in classification accuracy, on average, 67.8% of the convolutional layers can operate in the predictive mode. The average speedup and energy saving of these layers are 2.02X and 1.89X, respectively. The benefits grow to a maximum of 3.59X speedup and 3.14X energy reduction. Compared to static pruning approaches, which are complementary to the dynamic approach of SnaPEA, our proposed technique offers up to 63% speedup and 49% energy reduction across the convolutional layers with no loss in classification accuracy.
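
A minimal Python sketch of the exact mode (function names are ours), under the setting the abstract assumes, where convolution inputs are non-negative post-ReLU activations: weights are reordered so positive ones are consumed first, and computation stops as soon as the partial sum falls below zero while only negative weights remain, since the final ReLU output is then provably zero:

    def snapea_exact_dot(acts, weights):
        # Positive weights first (stable sort keeps original order otherwise).
        order = sorted(range(len(weights)), key=lambda i: weights[i] < 0)
        seen_negative, s, ops = False, 0.0, 0
        for i in order:
            s += acts[i] * weights[i]
            ops += 1
            if weights[i] < 0:
                seen_negative = True
            if seen_negative and s < 0:
                return 0.0, ops              # ReLU output is provably zero
        return max(s, 0.0), ops              # full ReLU(dot product)

    acts = [0.2, 1.0, 0.3, 0.9]              # non-negative activations
    weights = [0.1, -2.0, 0.4, -0.5]
    out, ops = snapea_exact_dot(acts, weights)
    print(out, "computed with", ops, "of", len(weights), "MACs")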
******
Convolutional Neural Networks (CNNs) have begun to permeate all corners of electronic society (from voice recognition to scene generation) due to their high accuracy and machine efficiency per operation. At their core, CNN computations are made up of multi-dimensional dot products between weight and input vectors. This paper studies how weight repetition---when the same weight occurs multiple times in or across weight vectors---can be exploited to save energy and improve performance during CNN inference. This generalizes a popular line of work to improve efficiency from CNN weight sparsity, as reducing computation due to repeated zero weights is a special case of reducing computation due to repeated weights. To exploit weight repetition, this paper proposes a new CNN accelerator called the Unique Weight CNN Accelerator (UCNN). UCNN uses weight repetition to reuse CNN sub-computations (e.g., dot products) and to reduce CNN model size when stored in off-chip DRAM---both of which save energy. UCNN further improves performance by exploiting sparsity in weights. We evaluate UCNN with an accelerator-level cycle and energy model and with an RTL implementation of the UCNN processing element. On three contemporary CNNs, UCNN improves throughput-normalized energy consumption by 1.2X ~ 4X, relative to a similarly provisioned baseline accelerator that uses Eyeriss-style sparsity optimizations. At the same time, the UCNN processing element adds only 17-24% area overhead relative to the same baseline.
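
The essence of reusing sub-computations across repeated weights can be shown as a factorized dot product in Python (a sketch of the arithmetic idea, not UCNN's actual datapath): inputs sharing a weight are summed once, and a single multiply is spent per unique weight, with the zero weight skipped as a special case:

    from collections import defaultdict

    def factorized_dot(inputs, weights):
        groups = defaultdict(list)
        for x, w in zip(inputs, weights):
            groups[w].append(x)               # group inputs by shared weight
        total, mults = 0.0, 0
        for w, xs in groups.items():
            if w == 0:
                continue                      # sparsity: skip the zero group
            total += w * sum(xs)              # one multiply per unique weight
            mults += 1
        return total, mults

    inputs = [3, 1, 4, 1, 5, 9, 2, 6]
    weights = [0.5, 0.0, 0.5, -1.0, 0.5, 0.0, -1.0, 0.5]
    total, mults = factorized_dot(inputs, weights)
    print(total, "using", mults, "multiplies instead of", len(weights))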
******
Owing to the presence of large values, which we call outliers, conventional methods of quantization fail to achieve significantly low precision, e.g., four bits, for very deep neural networks, such as ResNet-101. In this study, we propose a hardware accelerator, called the outlier-aware accelerator (OLAccel). It performs dense and low-precision computations for the majority of data (weights and activations) while efficiently handling a small number of sparse and high-precision outliers (e.g., amounting to 3% of total data). The OLAccel is based on 4-bit multiply-accumulate (MAC) units and handles outlier weights and activations in a different manner. For outlier weights, it equips SIMD lanes of MAC units with an additional MAC unit, which helps avoid cycle overhead for the majority of outlier occurrences, i.e., a single occurrence in the SIMD lanes. The OLAccel performs computations using outlier activations on dedicated, high-precision MAC units. In order to avoid the coherence problem due to updates from low- and high-precision computation units, both units update partial sums in a pipelined manner. Our experiments show that the OLAccel can reduce energy consumption by 43.5% (27.0%), 56.7% (36.3%), and 62.2% (49.5%) for AlexNet, VGG-16, and ResNet-18, respectively, compared with a 16-bit (8-bit) state-of-the-art zero-aware accelerator. The energy gain mostly comes from the memory components, i.e., the DRAM and on-chip memory, due to reduced precision.
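
A small NumPy illustration of the dense/outlier split (the threshold percentile, bit width, and array contents are invented for the example): values inside the dense range get low-precision codes, while the few large-magnitude outliers are kept sparse at full precision:

    import numpy as np

    def split_quantize(w, bits=4, outlier_pct=75):
        t = np.percentile(np.abs(w), outlier_pct)       # dense/outlier boundary
        levels = 2 ** (bits - 1) - 1                    # 7 levels for 4 bits
        scale = t / levels
        dense = np.where(np.abs(w) <= t, w, 0.0)
        dense_q = np.round(dense / scale).astype(np.int8)           # 4-bit codes
        outliers = [(i, v) for i, v in enumerate(w) if abs(v) > t]  # sparse, exact
        return dense_q, scale, outliers

    w = np.array([0.2, -0.5, 0.1, 2.4, 0.3, -0.4, -1.9, 0.2])
    dense_q, scale, outliers = split_quantize(w)
    recon = dense_q * scale
    for i, v in outliers:
        recon[i] = v                                    # outliers stay exact
    print(recon)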
******
One of the most fundamental design challenges in any interconnection network is that of routing deadlocks. A deadlock is a cyclic dependence between buffers that renders forward progress impossible. Deadlocks are a necessary evil, and almost every on-chip/HPC network today avoids them either via routing restrictions across physical channels (Dally's Theory) or with at least one escape virtual channel (Duato's Theory). This ensures that a cyclic dependence between buffers is never created in the first place. Moreover, each solution is tied to a specific topology, requiring an updated policy if the topology were to change. Alternately, solutions have also been proposed to reserve certain resources (buffers) and allocate them only upon detection of a deadlock, thereby breaking the dependence chain and recovering from the deadlock. Unfortunately, all these approaches fundamentally lead to a loss in available bandwidth due to routing restrictions or buffer resource usage restrictions. In this work, we challenge the theoretical notion of viewing deadlocks as a lack-of-routing-resources (buffers) problem, which every solution to date is based on. We argue that a deadlock can in fact be considered a lack of coordination between distributed entities. We prove that orchestrating a forward movement of every flit in the deadlocked ring at exactly the same time, which we call a spin, can guarantee forward progress and eventually lead to deadlock resolution with a bounded number of spins. We name this novel theory SPIN (Synchronized Progress in Interconnection Networks). SPIN eliminates the need for virtual channels to achieve deadlock freedom, thereby enabling fully adaptive routing with only one buffer per message class. We illustrate this capability by designing FAvORS, a novel, truly one-VC fully-adaptive routing algorithm. We also present a low-cost distributed implementation of SPIN and compare it against state-of-the-art deadlock avoidance/recovery schemes. SPIN provides up to 80% higher throughput, 52% lower area and 50% lower power for an on-chip 64-core mesh, and up to 83% higher throughput, 53% lower area and 55% lower power for an off-chip 1024-node dragonfly.
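
The core move can be pictured with a toy list model (ours, not the protocol's implementation): in a deadlocked ring every buffer is full, so no flit can advance alone, but all of them can advance at once:

    def spin(ring):
        # ring[i] is the flit buffered at router i; the cyclic dependence
        # means every buffer is full, so no individual move is possible.
        return ring[-1:] + ring[:-1]      # every flit moves one hop in lockstep

    ring = ["f0", "f1", "f2", "f3"]
    print(spin(ring))                     # ['f3', 'f0', 'f1', 'f2']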
******
High-radix topologies in large-scale networks provide low network diameter and high path diversity, but the idle power from high-speed links results in energy inefficiency, especially at low traffic load. In this work, we exploit the high path diversity and non-minimal adaptive routing in high-radix topologies to consolidate traffic onto a smaller number of links so that more network channels can be power-gated. In particular, we propose TCEP (Traffic Consolidation for Energy-Proportional high-radix networks), a distributed, proactive power management mechanism for large-scale networks that achieves energy proportionality by proactively power-gating network channels through traffic consolidation. Instead of naively power-gating the least utilized link, TCEP differentiates links by the type of traffic they carry (i.e., minimally vs. non-minimally routed traffic), since the performance impact of power-gating on minimal traffic is greater than on non-minimal traffic. The performance degradation from the reduced number of channels is minimized by concentrating available links at a small number of routers, instead of distributing them across the network, to maximize path diversity. TCEP introduces a shadow link to quickly reactivate an inactive link and a Power-Aware progressive Load-balanced (PAL) routing algorithm that incorporates the link power states in load-balancing the network. Our evaluations show that TCEP achieves significantly higher throughput across various traffic patterns while providing comparable energy savings for real workloads, compared to a prior approach proposed for the flattened butterfly topology.
******
System-on-Chip (SoC) complexity and the increasing costs of silicon motivate the breaking of an SoC into smaller "chiplets." A chiplet-based SoC design process has the promise to enable fast SoC construction by using advanced packaging technologies to tightly integrate multiple disparate chips (e.g., CPU, GPU, memory, FPGA). However, when assembling chiplets into a single SoC, correctness validation becomes a significant challenge. In particular, the network-on-chip (NoC) used within the individual chiplets and across chiplets to tie them together can easily have deadlocks, especially if each chip is designed in isolation. We introduce a simple, modular, yet elegant methodology for ensuring deadlock-free routing in multi-chiplet systems. As an example, we focus on future systems combining chiplets on an active silicon interposer. To maximize modularity, each individual chiplet is free to implement its own NoC topology and local routing algorithm, and the interposer can implement its own independent topology and routing. Our methodology imposes a few simple turn restrictions applied only to traffic as it flows into or out of the chiplets from the interposer, and we provide a way to determine these restrictions. The end result is an overall approach that enables highly-modular, chiplet-based SoC construction while eliminating deadlocks with high performance.
******
Networks-on-Chip (NoCs) implemented on FPGAs have to be designed differently from ASICs to fully exploit the unique architectural features and properties of the FPGA fabric. The FPGA-friendly bufferless, deflection-routed Hoplite NoC is almost an order of magnitude smaller and runs at a faster operating frequency than competing classic buffered FPGA NoCs. It is able to achieve this by sacrificing NoC link utilization, which suffers due to the cost of packet deflections and the associated high-latency traversals. In this paper, we address these shortcomings by developing FastTrack, an FPGA-optimized, high-radix NoC that exploits the segmented interconnect structure of modern FPGAs. We adapt the NoC organization to use express bypass links in the NoC to skip multiple router stages in a single cycle. Our FastTrack design can be tuned to use different express link lengths for performance, and supports depopulation strategies for controlling the balance between FPGA LUT and wiring cost. For the Xilinx Virtex-7 485T FPGA, an 8X8 FastTrack NoC is 1.7--2.5X larger than a base Hoplite NoC, but operates at almost the same clock frequency. FastTrack delivers throughput and latency improvements across a range of statistical workloads (2.5X), and traces extracted from FPGA accelerator case studies such as Sparse Matrix-Vector Multiplication (2.5X), Graph Analytics (2.8X), Token LU Factorization Dataflow (1.4X) and Multi-processor overlay applications (2X). FastTrack also shows energy efficiency improvements by factors of up to 2.2X over baseline Hoplite due to higher sustained rates and high-speed operation of express links made possible by fast FPGA interconnect.
******
Recently, deep neural network-based approaches have emerged as indispensable tools in many fields, ranging from image and video recognition to natural language processing. However, the large size of such newly developed networks poses both throughput and energy challenges to the underlying processing hardware. This could be the major stumbling block to many promising applications such as self-driving cars and smart cities. Existing work proposes to weed out zeros from input neurons to avoid unnecessary DNN computation (zero-valued operand multiplications). However, we observe that many output neurons are still ineffectual even if the zero-removal technique has been applied. These ineffectual output neurons cannot pass their values to the subsequent layer, which means all the computations (including zero-valued and non-zero-valued operand multiplications) related to these output neurons are futile and wasteful. Therefore, there is an opportunity to significantly improve the performance and efficiency of DNN execution by predicting the ineffectual output neurons and completely skipping over the futile computations associated with them. To do so, we propose a two-stage, prediction-based DNN execution model without accuracy loss. We also propose a uniform serial processing element (USPE) for both the prediction and execution stages to improve the flexibility and minimize the area overhead. To improve the processing throughput, we further present a scale-out design for USPE. Evaluation results over a set of state-of-the-art DNNs show that our proposed design achieves, on average, 2.5X speedup and 1.9X higher energy efficiency than the traditional accelerator. Moreover, by stacking our design on top of them, we can improve Cnvlutin and Stripes by 1.9X and 2.0X on average, respectively.
******
Hardware acceleration of Deep Neural Networks (DNNs) aims to tame their enormous compute intensity. Fully realizing the potential of acceleration in this domain requires understanding and leveraging algorithmic properties of DNNs. This paper builds upon the algorithmic insight that the bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, to prevent loss of accuracy, the bitwidth varies significantly across DNNs and it may even be adjusted for each layer individually. Thus, a fixed-bitwidth accelerator would either offer limited benefits to accommodate the worst-case bitwidth requirements, or inevitably lead to a degradation in final accuracy. To alleviate these deficiencies, this work introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing Bit Fusion, a bit-flexible accelerator that constitutes an array of bit-level processing elements that dynamically fuse to match the bitwidth of individual DNN layers. This flexibility in the architecture enables minimizing the computation and the communication at the finest granularity possible with no loss in accuracy. We evaluate the benefits of Bit Fusion using eight real-world feed-forward and recurrent DNNs. The proposed microarchitecture is implemented in Verilog and synthesized in 45 nm technology. Using the synthesis results and cycle-accurate simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN accelerators, Eyeriss [1] and Stripes [2]. In the same area, frequency, and process technology, Bit Fusion offers 3.9X speedup and 5.1X energy savings over Eyeriss. Compared to Stripes, Bit Fusion provides 2.6X speedup and 3.9X energy reduction at the 45 nm node when Bit Fusion's area and frequency are set to those of Stripes. Scaling to the 16 nm GPU technology node, Bit Fusion almost matches the performance of a 250-Watt Titan Xp, which uses 8-bit vector instructions, while merely consuming 895 milliwatts of power.
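
A worked example of the underlying arithmetic (unsigned operands only, for brevity; this illustrates bit-level decomposition in general rather than Bit Fusion's exact microarchitecture): an 8-bit multiply expressed as a sum of shifted 2-bit by 2-bit products, the kind of fine-grained work a bit-flexible array can regroup to match a layer's bitwidth:

    def bit_fusion_mul(a, b, chunk=2, width=8):
        total = 0
        for i in range(0, width, chunk):
            ai = (a >> i) & ((1 << chunk) - 1)          # 2-bit slice of a
            for j in range(0, width, chunk):
                bj = (b >> j) & ((1 << chunk) - 1)      # 2-bit slice of b
                total += (ai * bj) << (i + j)           # tiny PE plus shift-add
        return total

    assert bit_fusion_mul(173, 94) == 173 * 94
    # A 4-bit layer needs only the low slices, so the same processing elements
    # can regroup ("fuse") to perform several 4-bit multiplies in parallel.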
******
Training modern deep neural networks (DNNs) typically relies on GPUs to handle complex hundred-layer networks. A significant problem facing both researchers and industry practitioners is that, as the networks get deeper, the available GPU main memory becomes a primary bottleneck, limiting the size of networks they can train. In this paper, we investigate widely used DNNs and find that the major contributors to memory footprint are intermediate layer outputs (feature maps). We then introduce a framework for DNN-layer-specific optimizations (e.g., convolution, ReLU, pool) that significantly reduce this source of main memory pressure on GPUs. We find that a feature map typically has two uses that are spread far apart temporally. Our key approach is to store an encoded representation of feature maps for this temporal gap and decode this data for use in the backward pass; the full-fidelity feature maps are used in the forward pass and relinquished immediately. Based on this approach, we present Gist, our system that employs two classes of layer-specific encoding schemes -- lossless and lossy -- to exploit existing value redundancy in DNN training to significantly reduce the memory consumption of targeted feature maps. For example, one insight is that, by taking advantage of the computational nature of back propagation from the pool to the ReLU layer, we can store the intermediate feature map using just 1 bit instead of 32 bits per value. We deploy these mechanisms in a state-of-the-art DNN framework (CNTK) and observe that Gist reduces the memory footprint by up to 2X across 5 state-of-the-art image classification DNNs, with an average of 1.8X and only 4% performance overhead. We also show that further software (e.g., CuDNN) and hardware (e.g., dynamic allocation) optimizations can result in even larger footprint reductions (up to 4.1X).
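
The 1-bit ReLU insight can be checked in a few lines of NumPy (a simplification of one of Gist's lossless encodings; shapes and values are arbitrary): the backward pass of ReLU only needs each forward value's sign, so a 1-bit mask can stand in for the 32-bit feature map during the temporal gap:

    import numpy as np

    x = np.random.randn(4, 4).astype(np.float32)
    y = np.maximum(x, 0)                     # forward ReLU

    mask = np.packbits(y > 0)                # stored: 1 bit per value, not 32
    # ... forward pass continues; the full-precision copy of y is released ...

    grad_y = np.random.randn(4, 4).astype(np.float32)   # arrives at backward pass
    unpacked = np.unpackbits(mask, count=x.size).reshape(x.shape)
    grad_x = grad_y * unpacked               # exactly what dReLU needs
    assert np.allclose(grad_x, grad_y * (x > 0))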
******
DNN pruning has recently been proposed as an effective technique to improve the energy efficiency of DNN-based solutions. It is claimed that by removing unimportant or redundant connections, the pruned DNN delivers higher performance and energy efficiency with negligible impact on accuracy. However, DNN pruning has an important side effect: it may reduce the confidence of DNN predictions. We show that, although top-1 accuracy may be maintained with DNN pruning, the likelihood of the class in the top-1 is significantly reduced when using the pruned models. For applications such as Automatic Speech Recognition (ASR), where the DNN scores are consumed by a successive stage, the workload of this stage can be dramatically increased due to the loss of confidence in the DNN. An ASR system consists of a DNN for computing acoustic scores, followed by a Viterbi beam search to find the most likely sequence of words. We show that, when pruning the DNN model used for acoustic scoring, the Word Error Rate (WER) is maintained but the execution time of the ASR system is increased by 33%. Although pruning improves the efficiency of the DNN, it results in a huge increase in activity in the Viterbi search since the output scores of the pruned model are less reliable. Based on this observation, we propose a novel hardware-based ASR system that effectively integrates a DNN accelerator for pruned models with a Viterbi accelerator. In order to avoid the aforementioned increase in Viterbi search workload, our system loosely selects the N-best hypotheses at every time step, exploring only the N most likely paths. To avoid an expensive sort of the hypotheses based on their likelihoods, our accelerator employs a set-associative hash table to keep track of the best paths mapped to each set. In practice, this solution approaches the selection of the N-best, but it requires much simpler hardware. Our approach manages to efficiently combine both DNN pruning and Viterbi search, and achieves 9x energy savings and 4.2x speedup with respect to state-of-the-art ASR solutions.
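
The set-associative selection can be sketched as a plain data structure (the set count, associativity, and hash are invented for the example): each hypothesis hashes to a set that keeps only its most likely few entries, so the expensive global sort over all hypotheses is never performed:

    NUM_SETS, WAYS = 4, 2                    # roughly N-best, N = NUM_SETS x WAYS

    def set_index(state):
        return sum(map(ord, state)) % NUM_SETS        # stand-in hash function

    def insert(table, state, likelihood):
        ways = table[set_index(state)]
        ways.append((likelihood, state))
        ways.sort(reverse=True)              # tiny per-set sort, never global
        del ways[WAYS:]                      # evict the least likely in the set

    table = [[] for _ in range(NUM_SETS)]
    hyps = [("s0", 0.9), ("s1", 0.1), ("s2", 0.85), ("s3", 0.2),
            ("s4", 0.7), ("s5", 0.05), ("s6", 0.6), ("s7", 0.3), ("s8", 0.8)]
    for state, likelihood in hyps:
        insert(table, state, likelihood)
    print(sorted(s for ways in table for _, s in ways))
    # "s4" (0.7) is evicted by a set conflict despite being in the global
    # top-8: the approximation of exact N-best the abstract describes.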
******
Tunneling Field-Effect Transistors (TFETs) attain much higher energy efficiency than CMOS at low voltages. However, their performance saturates at high voltages and, therefore, they cannot replace CMOS when high performance is needed. Ideally, we desire a core that is as energy-efficient as a TFET core and provides as much performance as a CMOS core. To approach this goal, this paper judiciously integrates both TFET units and CMOS units in a single core, effectively creating a hetero-device core. We call it HetCore, and present CPU and GPU versions. In HetCore, TFETs are used in units that consume high power under CMOS, are amenable to pipelining or are not very latency sensitive, and use a sizable area. HetCore powers CMOS and TFET units at different voltage levels, so they operate optimally. However, all units are clocked at the same frequency. Our results based on simulations running standard applications show the potential of this approach, even with conservative assumptions. A HetCore CPU consumes on average 39% less energy than a CMOS CPU, while delivering an average performance that is within 10% of the CMOS CPU. In addition, under a fixed power budget, a multicore with HetCore CPUs can employ twice as many cores as a multicore with CMOS CPUs, resulting in average performance gains of 32% while, at the same time, improving the energy efficiency (ED2) by an average of 68%. Similar results are obtained with HetCore GPUs.
******
Registers are the fastest and simultaneously the most expensive kind of memory available to GPU threads. Due to the existence of a large number of concurrently executing threads and the high cost of context switching mechanisms, contemporary GPUs are equipped with large register files. However, to avoid over-complicating the hardware, registers are statically assigned and exclusively dedicated to threads for the entire duration of the thread's lifetime. This assignment is sized for the maximum number of live registers at any given point in the GPU binary, although the points at which all the requested registers are in use may constitute only a small fraction of the whole program. Therefore, a considerable portion of the register file remains under-utilized. In this paper, we propose a software-hardware co-mechanism named RegMutex (Register Mutual Exclusion) to share a subset of physical registers between warps during GPU kernel execution. With RegMutex, the compiler divides the architected register set into a base register set and an extended register set. While physical registers corresponding to the base register set are statically and exclusively assigned to the warp, the hardware time-shares the remaining physical registers across warps to provision their extended register sets. Therefore, GPU programs can sustain approximately the same performance with a lower number of registers, hence yielding higher performance per dollar. For programs that require a large number of registers for execution, RegMutex enables a higher number of concurrent warps to be resident in the hardware by sharing their register allocations with each other, leading to higher device occupancy. Since some aspects of register sharing orchestration are offloaded to the compiler, RegMutex introduces lower hardware complexity compared to existing approaches. Our experiments show that RegMutex improves register utilization and reduces the number of execution cycles by up to 23% for kernels demanding a high number of registers.
******
Exploiting data locality in GPUs is critical to making more efficient use of the existing caches and the NUMA-based memory hierarchy expected in future GPUs. While modern GPU programming models are designed to explicitly express parallelism, there is no clear explicit way to express data locality---i.e., reuse-based locality to make efficient use of the caches, or NUMA locality to efficiently utilize a NUMA system. On the one hand, this lack of expressiveness makes it a very challenging task for the programmer to write code to get the best performance out of the memory hierarchy. On the other hand, hardware-only architectural techniques are often suboptimal as they miss key higher-level program semantics that are essential to effectively exploit data locality. In this work, we propose the Locality Descriptor, a cross-layer abstraction to explicitly express and exploit data locality in GPUs. The Locality Descriptor (i) provides the software a flexible and portable interface to optimize for data locality, requiring no knowledge of the underlying memory techniques and resources, and (ii) enables the architecture to leverage key program semantics and effectively coordinate a range of techniques (e.g., CTA scheduling, cache management, memory placement) to exploit locality in a programmer-transparent manner. We demonstrate that the Locality Descriptor improves performance by 26.6% on average (up to 46.6%) when exploiting reuse-based locality in the cache hierarchy, and by 53.7% (up to 2.8X) when exploiting NUMA locality in a NUMA memory system.
******
GPUs are becoming first-class compute citizens and increasingly support programmability-enhancing features such as shared virtual memory and hardware cache coherence. This enables them to run a wider variety of programs. However, a key aspect of general-purpose programming where GPUs still have room for improvement is the ability to invoke system calls. We explore how to directly invoke system calls from GPUs. We examine how system calls can be integrated with GPGPU programming models, where thousands of threads are organized in a hierarchy of execution groups. To answer questions on GPU system call usage and efficiency, we implement Genesys, a generic GPU system call interface for Linux. Numerous architectural and OS issues are considered and subtle changes to Linux are necessary, as the existing kernel assumes that only CPUs invoke system calls. We assess the performance of Genesys using micro-benchmarks and applications that exercise system calls for signals, memory management, filesystems, and networking.
******
This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvlutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently, enabling them to skip over the ineffectual computations. A co-designed data storage format encodes the computation elimination decisions, taking them off the critical path while avoiding control divergence in the data-parallel units. Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and that keeps its data lanes busy. By loosening the ineffectual computation identification criterion, CNV enables further performance and energy efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that CNV improves performance over a state-of-the-art accelerator from 1.24× to 1.55× and by 1.37× on average without any loss in accuracy by removing zero-valued operand multiplications alone. While CNV incurs an area overhead of 4.49%, it improves overall EDP (Energy Delay Product) and ED2P (Energy Delay Squared Product) on average by 1.47× and 2.01×, respectively. The average performance improvements increase to 1.52× without any loss in accuracy with a broader ineffectual identification policy. Further improvements are demonstrated with a loss in accuracy.
******
A number of recent efforts have attempted to design accelerators for popular machine learning algorithms, such as those involving convolutional and deep neural networks (CNNs and DNNs). These algorithms typically involve a large number of multiply-accumulate (dot-product) operations. A recent project, DaDianNao, adopts a near data processing approach, where a specialized neural functional unit performs all the digital arithmetic operations and receives input weights from adjacent eDRAM banks. This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner. While the use of crossbar memory as an analog dot-product engine is well known, no prior work has designed or characterized a full-fledged accelerator based on crossbars. In particular, our work makes the following contributions: (i) We design a pipelined architecture, with some crossbars dedicated for each neural network layer, and eDRAM buffers that aggregate data between pipeline stages. (ii) We define new data encoding techniques that are amenable to analog computations and that can reduce the high overheads of analog-to-digital conversion (ADC). (iii) We define the many supporting digital components required in an analog CNN accelerator and carry out a design space exploration to identify the best balance of memristor storage/compute, ADCs, and eDRAM storage on a chip. On a suite of CNN and DNN workloads, the proposed ISAAC architecture yields improvements of 14.8×, 5.5×, and 7.5× in throughput, energy, and computational density (respectively), relative to the state-of-the-art DaDianNao architecture.
******
Processing-in-memory (PIM) is a promising solution to address the “memory wall” challenges for future computer systems. Previously proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has shown its potential to be used as main memory. Moreover, with its crossbar array structure, ReRAM can perform matrix-vector multiplication efficiently, and has been widely studied to accelerate neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory. In PRIME, a portion of the ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable the morphable functions with an insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves performance by ~2360x and reduces energy consumption by ~895x, across the evaluated machine learning benchmarks.
******
Amdahl's law provides architects a compelling reason to introduce system asymmetry to optimize for both serial and parallel regions of execution. Asymmetry in a multicore processor can arise statically (e.g., from core microarchitecture) or dynamically (e.g., applying dynamic voltage/frequency scaling). Work stealing is an increasingly popular approach to task distribution that elegantly balances task-based parallelism across multiple worker threads. In this paper, we propose asymmetry-aware work-stealing (AAWS) runtimes, which are carefully designed to exploit both the static and dynamic asymmetry in modern systems. AAWS runtimes use three key hardware/software techniques: work-pacing, work-sprinting, and work-mugging. Work-pacing and work-sprinting are novel techniques that combine a marginal-utility-based approach with integrated voltage regulators to improve performance and energy efficiency in high-and low-parallel regions. Work-mugging is a previously proposed technique that enables a waiting big core to preemptively migrate work from a busy little core. We propose a simple implementation of work-mugging based on lightweight user-level interrupts. We use a vertically integrated research methodology spanning software, architecture, and VLSI to make the case that holistically combining static asymmetry, dynamic asymmetry, and work-stealing runtimes can improve both performance and energy efficiency in future multicore systems.
******
In high performance computing systems, object deserialization can become a surprisingly important bottleneck: in our test, a set of general-purpose, highly parallelized applications spends 64% of its total execution time deserializing data into objects. This paper presents the Morpheus model, which allows applications to move such computations to a storage device. We use this model to deserialize data into application objects inside storage devices, rather than in the host CPU. Using the Morpheus model for object deserialization avoids unnecessary system overheads, frees up scarce CPU and main memory resources for compute-intensive workloads, saves I/O bandwidth, and reduces power consumption. In heterogeneous, co-processor-equipped systems, Morpheus allows application objects to be sent directly from a storage device to a co-processor (e.g., a GPU) by peer-to-peer transfer, further improving application performance as well as reducing CPU and main memory utilization. This paper implements Morpheus-SSD, an SSD supporting the Morpheus model. Morpheus-SSD improves the performance of object deserialization by 1.66×, reduces power consumption by 7%, uses 42% less energy, and speeds up the total execution time by 1.32×. By using NVMe-P2P, which realizes peer-to-peer communication between Morpheus-SSD and a GPU, Morpheus-SSD can speed up the total execution time by 1.39× in a heterogeneous computing platform.
******
Conventionally, an approximate accelerator replaces every invocation of a frequently executed region of code without considering the final quality degradation. However, there is a vast decision space in which each invocation can either be delegated to the accelerator, improving performance and efficiency, or run on the precise core, maintaining quality. In this paper we introduce MITHRA, a co-designed hardware-software solution that navigates these tradeoffs to deliver high performance and efficiency while lowering the final quality loss. MITHRA seeks to identify whether each individual accelerator invocation will lead to an undesirable quality loss and, if so, directs the processor to run the original precise code. This identification is cast as a binary classification task that requires a cohesive co-design of hardware and software. The hardware component performs the classification at runtime and exposes a knob to the software mechanism to control quality tradeoffs. The software tunes this knob by solving a statistical optimization problem that maximizes benefits from approximation while providing statistical guarantees that the final quality level will be met with high confidence. The software uses this knob to tune and train the hardware classifiers. We devise two distinct hardware classifiers, one table-based and one neural network based. To understand the efficacy of these mechanisms, we compare them with an ideal, but infeasible design, the oracle. Results show that, with 95% confidence, the table-based design can restrict the final output quality loss to 5% for 90% of unseen input sets while providing 2.5× speedup and 2.6× energy efficiency. The neural design shows similar speedup; however, it improves the efficiency by 13%. Compared to the table-based design, the oracle improves speedup by 26% and efficiency by 36%. These results show that MITHRA performs within a close range of the oracle and can effectively navigate the quality tradeoffs in approximate acceleration.
******
Belady's algorithm is optimal but infeasible because it requires knowledge of the future. This paper explains how a cache replacement algorithm can nonetheless learn from Belady's algorithm by applying it to past cache accesses to inform future cache replacement decisions. We show that the implementation is surprisingly efficient, as we introduce a new method of efficiently simulating Belady's behavior, and we use known sampling techniques to compactly represent the long history information that is needed for high accuracy. For a 2MB LLC, our solution uses a 16KB hardware budget (excluding replacement state in the tag array). When applied to a memory-intensive subset of the SPEC 2006 CPU benchmarks, our solution improves performance over LRU by 8.4%, as opposed to 6.2% for the previous state-of-the-art. For a 4-core system with a shared 8MB LLC, our solution improves performance by 15.0%, compared to 12.0% for the previous state-of-the-art.
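
Generating labels from Belady's algorithm over a past access trace is easy to state in Python (a quadratic-time teaching sketch; doing this efficiently with sampled history is part of the paper's contribution): on a miss with a full cache, Belady evicts the line whose next use lies furthest in the recorded future, and the resulting hit/miss labels can train the predictor that guides future replacements:

    def belady_labels(trace, capacity):
        labels, cache = [], set()
        for t, line in enumerate(trace):
            labels.append((line, line in cache))          # hit/miss under OPT
            if line in cache:
                continue
            if len(cache) >= capacity:
                def next_use(l):
                    future = trace[t + 1:]
                    return future.index(l) if l in future else float("inf")
                cache.discard(max(cache, key=next_use))   # furthest next use
            cache.add(line)
        return labels

    trace = ["A", "B", "C", "A", "B", "D", "A", "B"]
    for line, hit in belady_labels(trace, capacity=2):
        print(line, "hit" if hit else "miss")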
******
Conventional translation look-aside buffers (TLBs) are required to complete address translation with short latencies, as the address translation is on the critical path of all memory accesses, even for L1 cache hits. Such strict TLB latency restrictions limit the TLB capacity, as the latency increase with large TLBs may lower the overall performance even with potential TLB miss reductions. Furthermore, TLBs consume a significant amount of energy as they are accessed for every instruction fetch and data access. To avoid the latency restriction and reduce the energy consumption, virtual caching techniques have been proposed to defer translation until after L1 cache misses. However, an efficient solution for the synonym problem has been a critical issue hindering the wide adoption of virtual caching. Based on the virtual caching concept, this study proposes a hybrid virtual memory architecture extending virtual caching to the entire cache hierarchy, aiming to improve performance and reduce energy consumption. The hybrid virtual caching uses virtual addresses augmented with address space identifiers (ASIDs) in the cache hierarchy for common non-synonym addresses. For such non-synonyms, the address translation occurs only after last-level cache (LLC) misses. For uncommon synonym addresses, the addresses are translated to physical addresses with conventional TLBs before L1 cache accesses. To support such hybrid translation, we propose an efficient synonym detection mechanism based on Bloom filters which can identify synonym candidates with few false positives. For large memory applications, delayed translation alone cannot solve the address translation problem, as fixed-granularity delayed TLBs may not scale with the increasing memory requirements. To mitigate the translation scalability problem, this study proposes a delayed many-segment translation designed for the hybrid virtual caching. The experimental results show that our approach effectively lowers accesses to the TLBs, leading to significant power savings. In addition, the approach provides performance improvement with scalable delayed translation with variable-length segments.
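
The synonym filter can be modeled in a few lines (hash choices and sizes are ours, not the paper's): pages known to have synonyms are inserted into a Bloom filter; a filter hit routes the access through conventional TLB translation, while a miss, which is definitive, lets the access proceed with the virtual address, and only rare false positives take the slow path unnecessarily:

    class BloomFilter:
        def __init__(self, bits=1024, hashes=3):
            self.bits, self.k, self.array = bits, hashes, bytearray(bits)

        def _positions(self, page):
            return [hash((page, i)) % self.bits for i in range(self.k)]

        def add(self, page):
            for p in self._positions(page):
                self.array[p] = 1

        def maybe_contains(self, page):
            return all(self.array[p] for p in self._positions(page))

    synonyms = BloomFilter()
    synonyms.add(0x42000)                          # a page mapped at two VAs

    def route(vpage):
        if synonyms.maybe_contains(vpage):
            return "translate via TLB before L1"   # uncommon synonym path
        return "virtual address + ASID"            # common path: translate after LLC miss

    print(route(0x42000), "/", route(0x7f000))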
******
Emerging non-volatile memory (NVM) technologies, such as spin-transfer torque RAM (STT-RAM), are attractive options for replacing or augmenting SRAM in implementing last-level caches (LLCs). However, the asymmetric read/write energy and latency associated with NVM introduces new challenges in designing caches where, in contrast to SRAM, dynamic energy from write operations can be responsible for a larger fraction of total cache energy than leakage. These properties mean that no single traditional inclusion policy is dominant in terms of LLC energy consumption for asymmetric LLCs. We propose a novel selective inclusion policy, Loop-block-Aware Policy (LAP), to reduce energy consumption in LLCs with asymmetric read/write properties. In order to eliminate redundant writes to the LLC, LAP incorporates advantages from both non-inclusive and exclusive designs to selectively cache only part of upper-level data in the LLC. Results show that LAP outperforms other variants of selective inclusion policies and consumes 20% and 12% less energy than non-inclusive and exclusive STT-RAM-based LLCs, respectively. We extend LAP to a system with SRAM/STT-RAM hybrid LLCs to achieve energy-efficient data placement, reducing the energy consumption by 22% and 15% over non-inclusion and exclusion on average, with average-case performance improvements, small worst-case performance loss, and minimal hardware overheads.
******
Acceleration in the form of customized datapaths offers large performance and energy improvements over general-purpose processors. Reconfigurable fabrics such as FPGAs are gaining popularity for use in implementing application-specific accelerators, thereby increasing the importance of having good high-level FPGA design tools. However, current tools for targeting FPGAs offer inadequate support for high-level programming, resource estimation, and rapid and automatic design space exploration. We describe a design framework that addresses these challenges. We introduce a new representation of hardware using parameterized templates that captures locality and parallelism information at multiple levels of nesting. This representation is designed to be automatically generated from high-level languages based on parallel patterns. We describe a hybrid area estimation technique which uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing. Our runtime estimation accounts for off-chip memory accesses. We use our estimation capabilities to rapidly explore a large space of designs across tile sizes, parallelization factors, and optional coarse-grained pipelining, all at multiple loop levels. We show that estimates average 4.8% error for logic resources, 6.1% error for runtimes, and are 279 to 6533 times faster than a commercial high-level synthesis tool. We compare the best-performing designs to optimized CPU code running on a server-grade 6-core processor and show speedups of up to 16.7×.
******
This paper presents a sample-based energy simulation methodology that enables fast and accurate estimations of performance and average power for arbitrary RTL designs. Our approach uses an FPGA to simultaneously simulate the performance of an RTL design and to collect samples containing exact RTL state snapshots. Each snapshot is then replayed in gate-level simulation, resulting in a workload-specific average power estimate with confidence intervals. For arbitrary RTL and workloads, our methodology guarantees a minimum of four-orders-of-magnitude speedup over commercial CAD gate-level simulation tools and gives average energy estimates guaranteed to be within 5% of the true average energy with 99% confidence. We believe our open-source sample-based energy simulation tool Strober can not only rapidly provide ground truth for more abstract power models, but can enable productive design-space exploration early in the RTL design process.
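
The statistics behind the guarantee are the standard sample-mean confidence interval, shown here with made-up sample values rather than actual Strober output:

    import math, random

    random.seed(1)
    samples = [random.gauss(250.0, 30.0) for _ in range(100)]  # mW per snapshot

    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    sem = math.sqrt(var / n)                 # standard error of the mean
    z99 = 2.576                              # two-sided 99% normal quantile
    lo, hi = mean - z99 * sem, mean + z99 * sem
    print("avg power = %.1f mW, 99%% CI = [%.1f, %.1f]" % (mean, lo, hi))
    # Average energy then follows from average power times simulated runtime.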
******
On-core microarchitectural structures consume significant portions of a processor's power budget. However, depending on application characteristics, those structures do not always provide (much) performance benefit. While timeout-based power gating techniques have been leveraged for underutilized cores and inactive functional units, these techniques have not directly translated to high-activity units such as vector processing units, complex branch predictors, and caches. The performance benefit provided by these units does not necessarily correspond with unit activity, but instead is a function of application characteristics. This work introduces PowerChop, a novel technique that leverages the unique capabilities of HW/SW co-designed hybrid processors to enact unit-level power management at the application phase level. PowerChop adds two small additional hardware units to facilitate phase identification and the triggering of different power states, enabling the software layer to cheaply track, predict and take advantage of varying unit criticality across application phases by power-gating units that are not needed for performant execution. Through detailed experimentation, we find that PowerChop significantly decreases power consumption, reducing the leakage power of a hybrid server processor by 9% on average (up to 33%) and a hybrid mobile processor by 19% (up to 40%) while introducing just 2% slowdown.
******
Data-intensive queries are common in business intelligence, data warehousing and analytics applications. Typically, processing a query involves full inspection of large in-storage data sets by CPUs. An intuitive way to speed up such queries is to reduce the volume of data transferred over the storage network to a host system. This can be achieved by filtering out extraneous data within the storage, motivating a form of near-data processing. This work presents Biscuit, a novel near-data processing framework designed for modern solid-state drives. It allows programmers to write a data-intensive application to run on the host system and the storage system in a distributed, yet seamless manner. In order to offer a high-level programming model, Biscuit builds on the concept of data flow. Data processing tasks communicate through typed and data-ordered ports. Biscuit does not distinguish tasks that run on the host system from tasks that run on the storage system. As a result, Biscuit has desirable traits like generality and expressiveness, while promoting code reuse and naturally exposing concurrency. We implement Biscuit on a host system that runs the Linux OS and a high-performance solid-state drive. We demonstrate the effectiveness of our approach and implementation with experimental results. When data filtering is done by hardware in the solid-state drive, the average speed-up obtained for the top five queries of TPC-H is over 15x.
******
Specialized hardware accelerators can significantly improve the performance and power efficiency of compute systems. In this paper, we focus on hardware accelerators for graph analytics applications and propose a configurable architecture template that is specifically optimized for iterative vertex-centric graph applications with irregular access patterns and asymmetric convergence. The proposed architecture addresses the limitations of existing multi-core CPU and GPU architectures for these types of applications. The SystemC-based template we provide can be customized easily for different vertex-centric applications by inserting application-level data structures and functions. After that, a cycle-accurate simulator and RTL can be generated to model the target hardware accelerators. In our experiments, we study several graph-parallel applications, and show that the hardware accelerators generated by our template can outperform a 24-core high-end server CPU system by up to 3x in terms of performance. We also estimate the area requirement and power consumption of these hardware accelerators through physical-aware logic synthesis, and show up to 65x lower power consumption with significantly smaller area.
******
GPU- and FPGA-based clouds have already demonstrated the promise of accelerating computing-intensive workloads with greatly improved power and performance. In this paper, we examine the design of ASIC Clouds, which are purpose-built datacenters comprised of large arrays of ASIC accelerators, whose purpose is to optimize the total cost of ownership (TCO) of large, high-volume chronic computations, which are becoming increasingly common as more and more services are built around the Cloud model. On the surface, the creation of ASIC clouds may seem highly improbable due to high NREs and the inflexibility of ASICs. Surprisingly, however, large-scale ASIC Clouds have already been deployed by a large number of commercial entities, to implement the distributed Bitcoin cryptocurrency system. We begin with a case study of Bitcoin mining ASIC Clouds, which are perhaps the largest ASIC Clouds to date. From there, we design three more ASIC Clouds, including a YouTube-style video transcoding ASIC Cloud, a Litecoin ASIC Cloud, and a Convolutional Neural Network ASIC Cloud, and show 2-3 orders of magnitude better TCO versus CPU and GPU. Among our contributions, we present a methodology that, given an accelerator design, derives Pareto-optimal ASIC Cloud Servers, by extracting data from place-and-routed circuits and computational fluid dynamics simulations, and then employing clever but brute-force search to find the best jointly-optimized ASIC, DRAM subsystem, motherboard, power delivery system, cooling system, operating voltage, and case design. Moreover, we show how datacenter parameters determine which of the many Pareto-optimal points is TCO-optimal. Finally, we examine when it makes sense to build an ASIC Cloud and consider the impact of ASIC NRE.
******
Long memory latency and limited throughput become performance bottlenecks of GPGPU applications. The latency takes hundreds of cycles, which is difficult to hide by simply interleaving the execution of tens of warps. While the cache hierarchy helps to reduce memory system pressure, massive Thread-Level Parallelism (TLP) often causes excessive cache contention. This paper proposes Adaptive PREfetching and Scheduling (APRES) to improve GPU cache efficiency. APRES relies on the following observations. First, certain static load instructions tend to generate memory addresses with very high locality. Second, even when loads have no locality, their access addresses can still show a highly strided access pattern. Third, the locality behavior tends to be consistent regardless of warp ID. APRES schedules warps so that as many cache hits as possible are generated before any cache misses. This minimizes cache thrashing when many warps are contending for a cache line. However, realizing this operation requires predicting which warps will hit the cache in the near future. Instead of directly predicting future cache hits/misses for each warp, APRES creates a group of warps that will execute the same load instruction in the near future. Based on the third observation, we expect the locality behavior to be consistent over all warps in the group. If the first executed warp in the group hits the cache, the load is considered a high-locality type, and APRES prioritizes all warps in the group. Group prioritization leads to consecutive cache hits, because the grouped warps are likely to access the same cache line. If the first warp misses the cache, the load is considered a strided type, and APRES generates prefetch requests for the other warps in the group. After that, APRES prioritizes prefetch-targeted warps so that their demand requests are merged into the Miss Status Holding Registers (MSHRs) or hit on the prefetched lines. On memory-intensive applications, APRES achieves 31.7% performance improvement compared to the baseline GPU and 7.2% additional speedup compared to the best combination of existing warp scheduling and prefetching methods.
******
Main memory bandwidth is a critical bottleneck for modern GPU systems due to limited off-chip pin bandwidth. 3D-stacked memory architectures provide a promising opportunity to significantly alleviate this bottleneck by directly connecting a logic layer to the DRAM layers with high bandwidth connections. Recent work has shown promising potential performance benefits from an architecture that connects multiple such 3D-stacked memories and offloads bandwidth-intensive computations to a GPU in each of the logic layers. An unsolved key challenge in such a system is how to enable computation offloading and data mapping to multiple 3D-stacked memories without burdening the programmer such that any application can transparently benefit from near-data processing capabilities in the logic layer. Our paper develops two new mechanisms to address this key challenge. First, a compiler-based technique that automatically identifies code to offload to a logic-layer GPU based on a simple cost-benefit analysis. Second, a software/hardware cooperative mechanism that predicts which memory pages will be accessed by offloaded code, and places those pages in the memory stack closest to the offloaded code, to minimize off-chip bandwidth consumption. We call the combination of these two programmer-transparent mechanisms TOM: Transparent Offloading and Mapping. Our extensive evaluations across a variety of modern memory-intensive GPU workloads show that, without requiring any program modification, TOM significantly improves performance (by 30% on average, and up to 76%) compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.
******
As technology scales, GPUs are forecasted to incorporate an ever-increasing amount of computing resources to support thread-level parallelism. But even with the best effort, exposing massive thread-level parallelism from a single GPU kernel, particularly from general purpose applications, is going to be a difficult challenge. In some cases, even if there is sufficient thread-level parallelism in a kernel, there may not be enough available memory bandwidth to support such massive concurrent thread execution. Hence, GPU resources may be underutilized as more general purpose applications are ported to execute on GPUs. In this paper, we explore multiprogramming GPUs as a way to resolve the resource underutilization issue. There is growing hardware support for multiprogramming on GPUs. Hyper-Q was introduced in the Kepler architecture to enable multiple kernels to be invoked via tens of hardware queue streams. Spatial multitasking has been proposed to partition GPU resources across multiple kernels. But the partitioning is done at the coarse granularity of streaming multiprocessors (SMs), where each kernel is assigned to a subset of SMs. In this paper, we advocate for partitioning a single SM across multiple kernels, which we term intra-SM slicing. We explore various intra-SM slicing strategies that slice resources within each SM to concurrently run multiple kernels on the SM. Our results show that no single intra-SM slicing strategy derives the best performance for all application pairs. We propose Warped-Slicer, a dynamic intra-SM slicing strategy that uses an analytical method for calculating the SM resource partitioning across different kernels that maximizes performance. The model relies on a set of short online profile runs to determine how each kernel's performance varies as more thread blocks from each kernel are assigned to an SM. The model takes into account the interference effect of shared resource usage across multiple kernels. The model is also computationally efficient and can determine the resource partitioning quickly to enable dynamic decision making as new kernels enter the system. We demonstrate that the proposed Warped-Slicer approach improves performance by 23% over the baseline multiprogramming approach with minimal hardware overhead.
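The flavor of the analytical partitioning step can be sketched in a few lines (our reading of the description, not the paper's exact model; the profile numbers are invented): given profiled performance-vs-block-count curves for two kernels, pick the split of an SM's block slots that maximizes aggregate performance.

    # Minimal sketch: choose the intra-SM split maximizing total profiled IPC.
    def best_partition(perf_a, perf_b, max_blocks):
        """perf_x[n] = profiled performance of kernel x with n blocks on the SM."""
        best = max(
            ((perf_a[n] + perf_b[max_blocks - n], n)
             for n in range(max_blocks + 1)),
            key=lambda t: t[0],
        )
        return best[1], max_blocks - best[1], best[0]

    # Hypothetical profile data: kernel A saturates early, B keeps scaling.
    perf_a = [0.0, 1.0, 1.25, 1.375, 1.375, 1.375, 1.375, 1.375, 1.375]
    perf_b = [0.0, 0.5, 1.0, 1.5, 2.0, 2.25, 2.5, 2.625, 2.75]
    print(best_partition(perf_a, perf_b, max_blocks=8))  # -> (2, 6, 3.75)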
******
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE a 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; and skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10^4 frames/sec with a power dissipation of only 600mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy efficiency and area efficiency.
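The core computation EIE accelerates can be written down compactly. The sketch below is a plain software analogue under our own assumptions (per-column nonzero lists, a made-up codebook), showing the two savings named above: zero activations are skipped entirely, and weights are decoded from a small shared codebook:

    # Minimal sketch of sparse matrix-vector multiply with weight sharing
    # and dynamic sparsity (a software analogue, not the EIE hardware).
    def sparse_mv_shared(columns, codebook, activations):
        """columns[j] = list of (row, codebook_index) nonzeros in column j."""
        out = [0.0] * NUM_ROWS
        for j, a in enumerate(activations):
            if a == 0.0:          # dynamic sparsity from ReLU: skip the column
                continue
            for row, k in columns[j]:
                out[row] += a * codebook[k]   # decode shared weight on the fly
        return out

    NUM_ROWS = 4
    codebook = [-0.5, 0.25, 1.0]              # shared weight values
    columns = [
        [(0, 2), (3, 0)],                     # column 0 nonzeros
        [],                                   # column 1 is entirely zero
        [(1, 1), (2, 1)],                     # column 2 nonzeros
    ]
    print(sparse_mv_shared(columns, codebook, [2.0, 5.0, 0.0]))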
******
Continuous mobile vision is limited by the inability to efficiently capture image frames and process vision features. This is largely due to the energy burden of analog readout circuitry, data traffic, and intensive computation. To promote efficiency, we shift early vision processing into the analog domain. This results in RedEye, an analog convolutional image sensor that performs layers of a convolutional neural network in the analog domain before quantization. We design RedEye to mitigate analog design complexity, using a modular column-parallel design to promote physical design reuse and algorithmic cyclic reuse. RedEye uses programmable mechanisms to admit noise for tunable energy reduction. Compared to conventional systems, RedEye reports an 85% reduction in sensor energy, a 73% reduction in cloudlet-based system energy, and a 45% reduction in computation-based system energy.
******
The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduces power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation reduces power by an additional 2.7× by lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices.
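As a toy illustration of the predication idea (our own sketch: the threshold and values are invented, and Minerva applies this in hardware alongside its co-designed datatypes), multiply-accumulates whose input activation falls below a small threshold are simply skipped:

    # Minimal sketch of inline predication/pruning of small activity values.
    def predicated_neuron(weights, activations, threshold=0.05):
        acc = 0.0
        skipped = 0
        for w, a in zip(weights, activations):
            if abs(a) < threshold:    # predicate away near-zero activity
                skipped += 1
                continue
            acc += w * a              # only "useful" MACs are performed
        return acc, skipped

    weights = [0.8, -1.2, 0.3, 0.9]
    activations = [0.02, 0.7, 0.01, 0.4]
    print(predicated_neuron(weights, activations))  # half the MACs skipped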
******
With the degree of parallelism increasing, the performance of multi-threaded shared-variable applications is limited not only by serialized critical section execution, but also by the serialized competition overhead for threads to gain access to the critical section. As the number of concurrent threads grows, such competition overhead may exceed the time spent in the critical section itself and become the dominant factor limiting the performance of parallel applications. In modern operating systems, the queue spinlock, which comprises a low-overhead spinning phase and a high-overhead sleeping phase, is often used to lock critical sections. In this paper, we show that this advanced locking solution may create very high competition overhead for multithreaded applications executing in NoC-based CMPs. We then propose a software-hardware cooperative mechanism that can opportunistically maximize the chance that a thread wins critical section access in the low-overhead spinning phase, thereby reducing the competition overhead. At the OS primitives level, we monitor the remaining times of retry (RTR) in a thread's spinning phase, which reflects how soon the thread must enter the high-overhead sleep mode. At the hardware level, we integrate the RTR information into the packets of locking requests and let the NoC prioritize locking-request packets according to the RTR information. The principle is that the smaller the RTR a locking-request packet carries, the higher priority it gets and thus the quicker its delivery. We evaluate our opportunistic competition overhead reduction technique with cycle-accurate full-system simulations in GEM5 using the PARSEC (11 programs) and SPEC OMP2012 (14 programs) benchmarks. Compared to the original queue spinlock implementation, experimental results show that our method can effectively increase the chance of threads entering the critical section in the low-overhead spinning phase, reducing the competition overhead by 39.9% on average (61.8% at most) and accelerating the execution of the Region-of-Interest by 14.4% on average (24.5% at most) across all 25 benchmark programs.
******
Interpreters are widely used to implement high-level language virtual machines (VMs), especially on resource-constrained embedded platforms. Many scripting languages employ interpreter-based VMs for their advantages over native code compilers, such as portability, smaller resource footprint, and compact codes. For efficient interpretation, a script (program) is first compiled into an intermediate representation, or bytecodes. The canonical interpreter then runs an infinite loop that fetches, decodes, and executes one bytecode at a time. This bytecode dispatch loop is a well-known source of inefficiency, typically featuring a large jump table with a hard-to-predict indirect jump. Most existing techniques to optimize this loop focus on reducing the misprediction rate of this indirect jump in both hardware and software. However, these techniques are much less effective on embedded processors with shallow pipelines and low IPCs. Instead, we tackle another source of inefficiency more prominent on embedded platforms - redundant computation in the dispatch loop. To this end, we propose Short-Circuit Dispatch (SCD), a low cost architectural extension that enables fast, hardware-based bytecode dispatch with fewer instructions. The key idea of SCD is to overlay the software-created bytecode jump table on a branch target buffer (BTB). Once a bytecode is fetched, the BTB is looked up using the bytecode, instead of PC, as key. If it hits, the interpreter directly jumps to the target address retrieved from the BTB; otherwise, it goes through the original dispatch path. This effectively eliminates redundant computation in the dispatcher code for decode, bound check, and target address calculation, thus significantly reducing total instruction count. Our simulation results demonstrate that SCD achieves geomean speedups of 19.9% and 14.1% for two production-grade script interpreters for Lua and JavaScript, respectively. Moreover, our fully synthesizable RTL design based on a RISC-V embedded processor shows that SCD improves the EDP of the Lua interpreter by 24.2%, while increasing the chip area by only 0.72% at a 40nm technology node.
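The dispatch loop and the SCD fast path can be contrasted in a small sketch (a software analogue only: the BTB is modeled as a dictionary, and in the actual proposal the lookup is a hardware structure keyed by the fetched bytecode rather than the PC):

    # Minimal sketch of a bytecode dispatch loop with an SCD-like fast path.
    def run(bytecodes, handlers):
        btb = {}                        # bytecode -> handler: the overlaid table
        pc = 0
        while pc < len(bytecodes):
            op = bytecodes[pc]
            handler = btb.get(op)       # fast path: look up by bytecode, not PC
            if handler is None:         # miss: take the original dispatch path
                handler = handlers[op]  # decode, bounds check, jump-table load
                btb[op] = handler       # install for future dispatches
            pc = handler(pc)

    OP_NOP, OP_HALT = 0, 1
    handlers = {
        OP_NOP:  lambda pc: pc + 1,
        OP_HALT: lambda pc: 1 << 30,    # jump past the end to stop the loop
    }
    run([OP_NOP, OP_NOP, OP_HALT], handlers)
    print("done")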
******
ARM servers are becoming increasingly common, making server technologies such as virtualization for ARM of growing importance. We present the first study of ARM virtualization performance on server hardware, including multi-core measurements of two popular ARM and x86 hypervisors, KVM and Xen. We show how ARM hardware support for virtualization can enable much faster transitions between VMs and the hypervisor, a key hypervisor operation. However, current hypervisor designs, including both Type 1 hypervisors such as Xen and Type 2 hypervisors such as KVM, are not able to leverage this performance benefit for real application workloads. We discuss the reasons why and show that other factors related to hypervisor software design and implementation have a larger role in overall performance. Based on our measurements, we discuss changes to ARM's hardware virtualization support that can potentially bridge the gap to bring its faster VM-to-hypervisor transition mechanism to modern Type 2 hypervisors running real applications. These changes have been incorporated into the latest ARM architecture.
******
The memory wall has motivated many enhancements to cache management policies aimed at reducing misses. Cache compression has been proposed to increase effective cache capacity, which potentially reduces capacity and conflict misses. However, complexity in cache compression implementations could increase cache power and access latency. On the other hand, advanced cache replacement mechanisms use heuristics to reduce misses, leading to significant performance gains. Both cache compression and replacement policies should collaborate to improve performance. In this paper, we demonstrate that cache compression and replacement policies can interact negatively. In many workloads, performance gains from replacement policies are lost due to the need to alter the replacement policy to accommodate compression. This leads to sub-optimal replacement policies that could lose performance compared to an uncompressed cache. We introduce a novel, opportunistic cache compression mechanism, Base-Victim, based on an efficient cache design. Our compression architecture improves performance on top of advanced cache replacement policies, and guarantees a hit rate at least as high as that of an uncompressed cache. For cache-sensitive applications, Base-Victim achieves an average 7.3% performance gain for single-threaded workloads, and 8.7% gain for four-thread multi-program workload mixes.
******
As key applications become more data-intensive and the computational throughput of processors increases, the amount of data to be transferred in modern memory subsystems grows. Increasing physical bandwidth to keep up with the demand growth is challenging, however, due to strict area and energy limitations. This paper presents a novel and lightweight compression algorithm, Bit-Plane Compression (BPC), to increase the effective memory bandwidth. BPC aims at homogeneously-typed memory blocks, which are prevalent in many-core architectures, and applies a smart data transformation to both improve the inherent data compressibility and to reduce the complexity of compression hardware. We demonstrate that BPC provides a superior compression ratio of 4.1:1 for integer benchmarks and significantly reduces memory bandwidth requirements.
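A sketch of the kind of transformation BPC builds on, as we read the description (the encoding stages that follow the transform are not shown, and the delta-then-transpose details are our assumption): delta-transform neighboring words, then transpose the block into bit-planes, so homogeneously-typed data yields many all-zero planes that compress trivially:

    # Minimal sketch of a delta + bit-plane transform on one memory block.
    def bit_planes(words, width=32):
        deltas = [words[0]] + [b - a for a, b in zip(words, words[1:])]
        planes = []
        for bit in range(width):
            plane = 0
            for i, d in enumerate(deltas):
                plane |= ((d >> bit) & 1) << i   # gather bit `bit` of each delta
            planes.append(plane)
        return planes

    # Hypothetical homogeneous block (e.g., an array of small counters):
    words = [100, 101, 102, 103, 104, 105, 106, 107]
    planes = bit_planes(words)
    print(sum(1 for p in planes if p == 0), "of", len(planes), "planes are zero")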
******
Large-granularity memory failures continue to be a critical impediment to system reliability. To make matters worse, as DRAM scales to smaller nodes, the frequency of unreliable bits in DRAM chips continues to increase. To mitigate such scaling-related failures, memory vendors are planning to equip existing DRAM chips with On-Die ECC. For maintaining compatibility with memory standards, On-Die ECC is kept invisible from the memory controller. This paper explores how to design high-reliability memory systems in the presence of On-Die ECC. We show that if On-Die ECC is not exposed to the memory system, having a 9-chip ECC-DIMM (implementing SECDED) provides almost no reliability benefits compared to an 8-chip non-ECC DIMM. We also show that if the error detection of On-Die ECC can be exposed to the memory controller, then Chipkill-level reliability can be achieved even with a 9-chip ECC-DIMM. To this end, we propose eXposed On-Die Error Detection (XED), which exposes the On-Die error detection information without requiring changes to the memory standards or consuming bandwidth overheads. When the On-Die ECC detects an error, XED transmits a pre-defined “catch-word” instead of the corrected data value. On receiving the catch-word, the memory controller uses the parity stored in the ninth chip of the ECC-DIMM to correct the faulty chip (similar to RAID-3). Our studies show that XED provides Chipkill-level reliability (172× higher than SECDED), while incurring negligible overheads, with a 21% lower execution time than Chipkill. We also show that XED can enable Chipkill systems to provide Double-Chipkill level reliability while avoiding the associated storage, performance, and power overheads.
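The catch-word protocol lends itself to a small sketch (chip data modeled as integers; the catch-word constant and the 4-data-chip geometry are invented for illustration): a chip whose On-Die ECC detects an error returns the pre-defined pattern, and the controller rebuilds that chip's data from the XOR parity, RAID-3 style:

    # Minimal sketch of the XED read path as described above.
    CATCH_WORD = 0xDEADBEEF  # hypothetical pre-defined pattern

    def read_with_xed(data_chips, parity):
        flagged = [i for i, d in enumerate(data_chips) if d == CATCH_WORD]
        if not flagged:
            return data_chips                 # common case: no detected error
        bad = flagged[0]                      # single faulty chip (RAID-3 case)
        rebuilt = parity
        for i, d in enumerate(data_chips):
            if i != bad:
                rebuilt ^= d                  # XOR of good chips with parity
        return data_chips[:bad] + [rebuilt] + data_chips[bad + 1:]

    chips = [0x11, 0x22, 0x33, 0x44]
    parity = chips[0] ^ chips[1] ^ chips[2] ^ chips[3]
    chips[2] = CATCH_WORD                     # chip 2 detected an internal error
    print([hex(x) for x in read_with_xed(chips, parity)])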
******
Software failure diagnosis techniques work either by sampling some events at production-run time or by using bug detection algorithms. Some of these techniques require the failure to be reproduced multiple times; those that do not are not adaptive enough when the execution platform, environment, or code changes. We propose ACT, a diagnosis technique for production-run failures that uses the machine intelligence of neural hardware. ACT learns invariants (e.g., data communication invariants) on-the-fly using the neural hardware and records any potential violation of them. Since ACT can learn invariants on-the-fly, it can adapt to any change in execution setting or code. Since it records only the potentially violated invariants, the postprocessing phase can pinpoint the root cause fairly accurately without needing to observe the failure again. ACT works seamlessly for many sequential and concurrency bugs. The paper provides a detailed design and implementation of ACT in a typical multiprocessor system. It uses a three-stage pipeline for partially configurable one-hidden-layer neural networks. We have evaluated ACT on a variety of programs from popular benchmarks as well as open source programs. ACT diagnoses failures caused by 16 bugs from these programs with accurate ranking. Compared to existing learning and sampling based approaches, ACT has better diagnostic ability. For the default configuration, ACT has an average execution overhead of 8.2%.
******
Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly-parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption still remains high as data movement can be more expensive than computation. Accordingly, finding a dataflow that supports parallel processing with minimal data movement cost is crucial to achieving energy-efficient CNN processing without compromising accuracy. In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine (PE) local storage, direct inter-PE communication and spatial parallelism. To evaluate the energy efficiency of the different dataflows, we propose an analysis framework that compares energy cost under the same hardware area and processing parallelism constraints. Experiments using the CNN configurations of AlexNet show that the proposed RS dataflow is more energy efficient than existing dataflows in both convolutional (1.4× to 2.5×) and fully-connected layers (at least 1.3× for batch size larger than 16). The RS dataflow has also been demonstrated on a fabricated chip, which verifies our energy analysis.
******
This paper presents a programmable and scalable digital neuromorphic architecture based on 3D high-density memory integrated with a logic tier for efficient neural computing. The proposed architecture consists of clusters of processing engines, connected by a 2D mesh network, as a processing tier, which is integrated in 3D with multiple tiers of DRAM. The PE clusters access multiple memory channels (vaults) in parallel. The operating principle, referred to as memory-centric computing, embeds specialized state machines within the vault controllers of the HMC to drive data into the PE clusters. The paper presents the basic architecture of the Neurocube and an analysis of the logic tier synthesized in 28nm and 15nm process technologies. The performance of the Neurocube is evaluated and illustrated through the mapping of a Convolutional Neural Network and estimation of the subsequent power and performance for both training and inference.
******
Neural Networks (NN) are a family of models for a broad range of emerging machine learning and pattern recognition applications. NN techniques are conventionally executed on general-purpose processors (such as CPU and GPGPU), which are usually not energy-efficient, since they invest excessive hardware resources to flexibly support various workloads. Consequently, application-specific hardware accelerators for neural networks have been proposed recently to improve energy-efficiency. However, such accelerators were designed for a small set of NN techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) directly corresponding to high-level functional blocks of an NN (such as layers), or even an NN as a whole. Although straightforward and easy-to-implement for a limited set of similar NN techniques, the lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different NN techniques with sufficient flexibility and efficiency. In this paper, we propose a novel domain-specific Instruction Set Architecture (ISA) for NN accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. Our evaluation over a total of ten representative yet distinct NN techniques has demonstrated that Cambricon exhibits strong descriptive capacity over a broad range of NN techniques, and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU. Compared to the latest state-of-the-art NN accelerator design DaDianNao [5] (which can only accommodate 3 types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65nm technology incurs only negligible latency/power/area overheads, with a versatile coverage of 10 different NN benchmarks.
******
We propose an ISA extension that decouples the data access and register write operations in a load instruction. We describe system and hardware support for decoupled loads. Furthermore, we show how compilers can generate better static instruction schedules by hoisting a decoupled load's data access above may-alias stores and branches. We find that decoupled loads improve performance, with a geometric mean speedup of 8.4%.
******
As the rate of annual data generation grows exponentially, there is a demand to aggregate and summarise vast amounts of information quickly. In the past, frequency scaling was relied upon to push application throughput. Today, Dennard scaling has ceased and further performance must come from exploiting parallelism. Single instruction-multiple data (SIMD) instruction sets offer a highly efficient and scalable way of exploiting data-level parallelism (DLP). While microprocessors originally offered very simple SIMD support targeted at multimedia applications, these extensions have been growing both in width and functionality. Observing this trend, we use a simulation framework to model future SIMD support and then propose and evaluate five different ways of vectorising data aggregation. We find that although data aggregation is abundant in DLP, it is often too irregular to be expressed efficiently using typical SIMD instructions. Based on this observation, we propose a set of novel algorithms and SIMD instructions to better capture this irregular DLP. Furthermore, we discover that the best algorithm is highly dependent on the characteristics of the input. Our proposed solution can dynamically choose the optimal algorithm in the majority of cases and achieves speedups between 2.7x and 7.6x over a scalar baseline.
******
Simultaneous multithreading (SMT) out-of-order cores waste a significant portion of structural out-of-order core resources on instructions that do not need them. These resources eliminate false ordering dependences. However, because thread interleaving spreads dependent instructions, nearly half of instructions dynamically issue in program order after all false dependences have resolved. These in-sequence instructions interleave with other reordered instructions at a fine granularity within the instruction window. We develop a technique to efficiently scale in-flight instructions through a hybrid out-of-order/in-order microarchitecture, which can dispatch instructions to efficient in-order scheduling mechanisms -- using a FIFO issue queue called the shelf -- on an instruction-by-instruction basis. Instructions dispatched to the shelf do not allocate out-of-order core resources in the reorder buffer, issue queue, physical registers, or load-store queues. We measure opportunity for such hybrid microarchitectures and design and evaluate a practical dispatch mechanism targeted at 4-threaded cores. Adding a shelf to a baseline 4-thread system with a 64-entry ROB improves normalized system throughput by 11.5% (up to 19.2% at best) and energy-delay product by 10.9% (up to 17.5% at best).
******
On-chip contention increases memory access latency for multi-core processors. We identify that this additional latency has a substantial effect on performance for an important class of latency-critical memory operations: those that result in a cache miss and are dependent on data from a prior cache miss. We observe that the number of instructions between the first cache miss and its dependent cache miss is usually small. To minimize dependent cache miss latency, we propose adding just enough functionality to dynamically identify these instructions at the core and migrate them to the memory controller for execution as soon as source data arrives from DRAM. This migration allows memory requests issued by our new Enhanced Memory Controller (EMC) to experience a 20% lower latency than if issued by the core. On a set of memory intensive quad-core workloads, the EMC results in a 13% improvement in system performance and a 5% reduction in energy consumption over a system with a Global History Buffer prefetcher, the highest performing prefetcher in our evaluation.
******
Managing tail latency of requests has become one of the primary challenges for large-scale Internet services. Data centers are quickly evolving and service operators frequently desire to make changes to the deployed software and production hardware configurations. Such changes demand a confident understanding of the impact on one's service, in particular its effect on tail latency (e.g., 95th- or 99th-percentile response latency of the service). Evaluating the impact on the tail is challenging because of its inherent variability. Existing tools and methodologies for measuring these effects suffer from a number of deficiencies including poor load tester design, statistically inaccurate aggregation, and improper attribution of effects. As shown in the paper, these pitfalls can often result in misleading conclusions. In this paper, we develop a methodology for statistically rigorous performance evaluation and performance factor attribution for server workloads. First, we find that careful design of the server load tester can ensure high quality performance evaluation, and empirically demonstrate the inaccuracy of load testers in previous work. Learning from the design flaws in prior work, we design and develop a modular load tester platform, Treadmill, that overcomes pitfalls of existing tools. Next, utilizing Treadmill, we construct measurement and analysis procedures that can properly attribute performance factors. We rely on statistically-sound performance evaluation and quantile regression, extending it to accommodate the idiosyncrasies of server systems. Finally, we use our augmented methodology to evaluate the impact of common server hardware features with Facebook production workloads on production hardware. We decompose the effects of these features on request tail latency and demonstrate that our evaluation methodology provides superior results, particularly in capturing complicated and counter-intuitive performance behaviors. By tuning the hardware features as suggested by the attribution, we reduce the 99th-percentile latency by 43% and its variance by 93%.
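One aggregation pitfall of the kind the paper calls out can be shown with toy numbers (our own example, not Treadmill's implementation): averaging per-machine 99th percentiles is not the 99th percentile of the pooled samples, and can badly understate the tail:

    # Minimal sketch contrasting pooled vs. averaged percentile aggregation.
    def percentile(samples, p):
        s = sorted(samples)
        idx = min(len(s) - 1, p * len(s) // 100)   # simple nearest-rank estimate
        return s[idx]

    node_a = [1.0] * 99 + [10.0]        # one slow request
    node_b = [1.0] * 90 + [50.0] * 10   # a much heavier tail

    pooled = percentile(node_a + node_b, 99)
    averaged = (percentile(node_a, 99) + percentile(node_b, 99)) / 2
    print("pooled p99 =", pooled, " averaged-per-node p99 =", averaged)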
******
Data center power is a scarce resource that often goes underutilized due to conservative planning. This is because the penalty for overloading the data center power delivery hierarchy and tripping a circuit breaker is very high, potentially causing long service outages. Recently, dynamic server power capping, which limits the amount of power consumed by a server, has been proposed and studied as a way to reduce this penalty, enabling more aggressive utilization of provisioned data center power. However, no real at-scale solution for data center-wide power monitoring and control has been presented in the literature. In this paper, we describe Dynamo -- a data center-wide power management system that monitors the entire power hierarchy and makes coordinated control decisions to safely and efficiently use provisioned data center power. Dynamo has been developed and deployed across all of Facebook's data centers for the past three years. Our key insight is that in real-world data centers, different power and performance constraints at different levels in the power hierarchy necessitate coordinated data center-wide power management. We make three main contributions. First, to understand the design space of Dynamo, we provide a characterization of power variation in data centers running a diverse set of modern workloads. This characterization uses fine-grained power samples from tens of thousands of servers and spanning a period of over six months. Second, we present the detailed design of Dynamo. Our design addresses several key issues not addressed by previous simulation-based studies. Third, the proposed techniques and design have been deployed and evaluated in large scale data centers serving billions of users. We present production results showing that Dynamo has prevented 18 potential power outages in the past 6 months due to unexpected power surges, that Dynamo enables optimizations leading to a 13% performance boost for a production Hadoop cluster and a nearly 40% performance increase for a search cluster, and that Dynamo has already enabled an 8% increase in the power capacity utilization of one of our data centers with more aggressive power subscription measures underway.
******
The energy proportionality of data center servers has improved drastically over the past decade, to the point where near-ideal energy-proportional servers are now common. These highly energy-proportional servers exhibit the unique property that peak efficiency no longer coincides with peak utilization. In this paper, we explore the implications of this property for data center scheduling. We identify that current state-of-the-art data center schedulers do not efficiently leverage these properties, leading to inefficient scheduling decisions. We propose Peak Efficiency Aware Scheduling (PEAS), which can achieve better-than-ideal energy proportionality at the data center level. We demonstrate that PEAS can reduce average power by 25.5% with a 3.0% improvement to TCO compared to state-of-the-art scheduling policies.
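A minimal sketch of peak-efficiency-aware placement, assuming an invented efficiency curve and task model (the paper's policy and server models are more refined): prefer the server whose post-placement utilization lands closest to its peak-efficiency point, which for highly proportional servers lies below full load:

    # Toy efficiency curve peaking around 70% utilization, not at full load.
    def efficiency(util):
        return util * (1.4 - util) if 0 < util <= 1.0 else 0.0

    def peas_place(loads, task, capacity=1.0):
        """loads: current utilization per server; place task on the best one."""
        def score(u):
            return efficiency(u + task) if u + task <= capacity else -1.0
        best = max(range(len(loads)), key=lambda i: score(loads[i]))
        loads[best] += task
        return best

    loads = [0.25, 0.5, 0.875]
    print(peas_place(loads, task=0.125), loads)   # places on server 1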
******
Battery systems are crucial components for mission-critical data centers. Without secure energy backup, existing under-provisioned data centers are largely unguarded targets for cyber criminals. Particularly for today's scale-out servers, power oversubscription unavoidably taxes a data center's backup energy resources, leaving very little room for dealing with emergencies. Besides, the emerging trend towards deploying distributed energy storage architectures causes the associated energy backup of each rack to shrink, making servers vulnerable to power anomalies. As a result, an attacker can generate power peaks to easily crash or disrupt a power-constrained system. This study aims at securing data centers from malicious loads that seek to drain their precious energy storage and overload server racks without prior detection. We term such a load a Power Virus (PV), demonstrate its basic two-phase attacking model, and characterize its behavior on real systems. The PV can learn the victim rack's battery characteristics by disguising itself as benign load. Once it gains enough information, the PV can be mutated to generate hidden power spikes that have a high chance of overloading the system. To defend against the PV, we propose power attack defense (PAD), a novel energy management patch built on lightweight software and hardware mechanisms. PAD not only increases the attacking cost considerably by hiding vulnerable racks from visible spikes, it also strengthens the last line of defense against hidden spikes. Using Google cluster traces, we show that PAD can effectively raise the bar for a successful power attack: compared to prior art, it increases the data center survival time by 1.6x to 11x and provides a better performance guarantee. It enables modern data centers to safely exploit the benefits that power oversubscription may provide, at only a slight cost overhead.
******
FPGAs are a popular target for application-specific accelerators because they lead to a good balance between flexibility and energy efficiency. However, FPGA lookup tables introduce significant area and power overheads, making it difficult to use FPGA devices in environments with tight cost and power constraints. This is the case for datacenter servers, where a modestly-sized FPGA cannot accommodate the large number of diverse accelerators that datacenter applications need. This paper introduces DRAF, an architecture for bit-level reconfigurable logic that uses DRAM subarrays to implement dense lookup tables. DRAF overlaps DRAM operations like bitline precharge and charge restoration with routing within the reconfigurable routing fabric to minimize the impact of DRAM latency. It also supports multiple configuration contexts that can be used to quickly switch between different accelerators with minimal latency. Overall, DRAF trades off some of the performance of FPGAs for significant gains in area and power. DRAF improves area density by 10x over FPGAs and power consumption by more than 3x, enabling DRAF to satisfy demanding applications within strict power and cost constraints. While accelerators mapped to DRAF are 2-3x slower than those in FPGAs, they still deliver a 13x speedup and an 11x reduction in power consumption over a Xeon core for a wide range of datacenter tasks, including analytics and interactive services like speech recognition.
******
Emerging resistive memory technologies, such as PCRAM and ReRAM, have been proposed as promising replacements for DRAM-based main memory, due to their better scalability, low standby power, and non-volatility. However, limited write endurance is a major drawback for such resistive memory technologies. Wear leveling (balancing the distribution of writes) and wear limiting (reducing the number of writes) have been proposed to mitigate this disadvantage, but both techniques only manage a fixed budget of writes to a memory system rather than increase the number available. In this paper, we propose a new type of wear limiting technique, Mellow Writes, which reduces the wearout of individual writes rather than reducing the number of writes. Mellow Writes is based on the fact that slow writes performed with lower dissipated power can lead to longer endurance (and therefore longer lifetimes). For non-volatile memories, an N^1 to N^3 times endurance improvement can be achieved if the write operation is slowed down by N times. We present three microarchitectural mechanisms (Bank-Aware Mellow Writes, Eager Mellow Writes, and Wear Quota) that selectively perform slow writes to increase memory lifetime while minimizing performance impact. Assuming a factor N^2 advantage in cell endurance for a factor N slower write, our best Mellow Writes mechanism can achieve 2.58× the lifetime and 1.06× the performance of the baseline system. In addition, its performance is almost the same as a system aggressively optimized for performance (at the expense of endurance). Finally, Wear Quota guarantees a minimal lifetime (e.g., 8 years) by forcing more slow writes in the presence of heavy workloads. We also perform sensitivity analysis on the endurance advantage factor for slow writes, from N^1 to N^3, and find that our technique is still useful for factors as low as N^1.
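The lifetime arithmetic is easy to make concrete (a sketch with invented base numbers; the N^2 exponent is one of the assumed endurance models above): if a fraction of writes is slowed by N and each such write consumes only 1/N^2 of a fast write's wearout, expected lifetime stretches accordingly:

    # Minimal sketch of the slow-write endurance tradeoff.
    def relative_lifetime(slow_fraction, n, exponent=2):
        # Each slow write consumes 1/n**exponent of a fast write's wearout.
        wear_per_write = (1 - slow_fraction) + slow_fraction / (n ** exponent)
        return 1.0 / wear_per_write

    for frac in (0.0, 0.5, 0.9):
        print(f"{frac:.0%} slow writes (N=4): "
              f"{relative_lifetime(frac, n=4):.2f}x lifetime")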
******
Memory bandwidth severely limits the scalability and performance of multicore and manycore systems. Application performance can be very sensitive to both the delivered memory bandwidth and latency. In multicore systems, a memory channel is usually shared by multiple cores. Having the ability to precisely provision, schedule, and isolate memory bandwidth and latency on a per-core basis is particularly important when different memory guarantees are needed on a per-customer, per-application, or per-core basis. Infrastructure as a Service (IaaS) Cloud systems, and even general purpose multicores optimized for application throughput or fairness, all benefit from the ability to control and schedule memory access on a fine-grain basis. In this paper, we propose MITTS (Memory Inter-arrival Time Traffic Shaping), a simple, distributed hardware mechanism which limits memory traffic at the source (Core or LLC). MITTS shapes memory traffic based on memory request inter-arrival time, enabling fine-grain bandwidth allocation. In an IaaS system, MITTS enables Cloud customers to express their memory distribution needs and pay commensurately. For instance, MITTS enables charging customers that have bursty memory traffic more than customers with uniform memory traffic for the same aggregate bandwidth. Beyond IaaS systems, MITTS can also be used to optimize for throughput or fairness in a general purpose multi-program workload. MITTS uses an online genetic algorithm to configure hardware bins, which can adapt for program phases and variable input sets. We have implemented MITTS in Verilog and have taped out the design in a 25-core 32nm processor, and find that MITTS requires less than 0.9% of core area. We evaluate across SPECint, PARSEC, Apache, and bhm Mail Server workloads, and find that MITTS achieves an average 1.18× performance gain compared to the best static bandwidth allocation, a 2.69× average performance/cost advantage in an IaaS setting, and up to 1.17× better throughput and 1.52× better fairness when compared to conventional memory bandwidth provisioning techniques.
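The inter-arrival-time shaping idea can be sketched as a handful of credit bins (the bin edges and credit counts here are invented; the real mechanism configures its bins with the online genetic algorithm described above): a bursty source exhausts its short-gap credits before a uniform source consuming the same aggregate bandwidth does:

    # Minimal sketch of inter-arrival-time credit bins.
    BIN_EDGES = [4, 16, 64]            # cycles; gaps bucketed into 4 bins

    def bin_of(gap):
        for i, edge in enumerate(BIN_EDGES):
            if gap < edge:
                return i
        return len(BIN_EDGES)

    def admit(credits, gap):
        """Allow the request only if its inter-arrival bin has credit left."""
        b = bin_of(gap)
        if credits[b] > 0:
            credits[b] -= 1
            return True
        return False                    # request is stalled at the source

    credits = [2, 4, 8, 16]             # few credits for very bursty traffic
    gaps = [1, 1, 1, 1, 30, 100]        # a burst followed by sparse requests
    print([admit(credits, g) for g in gaps])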
******
Approximate computing is an emerging paradigm enabling tradeoffs between accuracy and efficiency. However, a fundamental challenge persists: state-of-the-art techniques lack the ability to enforce runtime guarantees on accuracy. The convention is to 1) employ offline or online accuracy models, or 2) present experimental results that demonstrate empirically low error. Unfortunately, these approaches are still unable to guarantee acceptability of all application outputs at runtime. We offer a solution that revisits concepts from anytime algorithms. Originally explored for real-time decision problems, anytime algorithms have the property of producing results with increasing accuracy over time. We propose the Anytime Automaton, a new computation model that executes applications as a parallel pipeline of anytime approximations. An automaton produces approximate versions of the application output with increasing accuracy, guaranteeing that the final precise version is eventually reached. The automaton can be stopped whenever the output is deemed acceptable, otherwise, it is a simple matter of letting it run longer. We present an in-depth analysis of the model and demonstrate attractive runtime-accuracy profiles on various applications. Our anytime automaton is the first step towards systems where the acceptability of an application's output directly governs the amount of time and energy expended.
******
The increasing use of probabilistic algorithms from statistics and machine learning for data analytics presents new challenges and opportunities for the design of computing systems. One important class of probabilistic machine learning algorithms is Markov Chain Monte Carlo (MCMC) sampling, which can be used on a wide variety of applications in Bayesian Inference. However, this probabilistic iterative algorithm can be inefficient in practice on today's processors, especially for problems with high dimensionality and complex structure. The source of inefficiency is generating samples from parameterized probability distributions. This paper seeks to address this sampling inefficiency and presents a new approach to support probabilistic computing that leverages the native randomness of Resonance Energy Transfer (RET) networks to construct RET-based sampling units (RSU). Although RSUs can be designed for a variety of applications, we focus on the specific class of probabilistic problems described as Markov Random Field Inference. Our proposed RSU uses a RET network to implement a molecular-scale optical Gibbs sampling unit (RSU-G) that can be integrated into a processor / GPU as specialized functional units or organized as a discrete accelerator. We experimentally demonstrate the fundamental operation of an RSU using a macro-scale hardware prototype. Emulation-based evaluation of two computer vision applications for HD images reveals that an RSU-augmented GPU provides speedups of 3× and 16× over a GPU alone. Analytic evaluation shows that a discrete accelerator limited by 336 GB/s DRAM bandwidth produces speedups of 21× and 54× versus the GPU implementations.
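For concreteness, the Gibbs-sampling step that such a unit would implement looks as follows in plain software (a standard Ising-model MRF sweep under our own toy parameters; in the proposed hardware, the RET network's native randomness effectively takes the place of the random.random() call):

    # Minimal sketch of Gibbs sampling over a binary (+1/-1) MRF.
    import math
    import random

    def gibbs_sweep(grid, coupling=0.8):
        """One Gibbs sweep: resample each site from its conditional."""
        h, w = len(grid), len(grid[0])
        for y in range(h):
            for x in range(w):
                neigh = sum(grid[y2][x2]
                            for y2, x2 in ((y-1, x), (y+1, x), (y, x-1), (y, x+1))
                            if 0 <= y2 < h and 0 <= x2 < w)
                p_plus = 1.0 / (1.0 + math.exp(-2.0 * coupling * neigh))
                grid[y][x] = 1 if random.random() < p_plus else -1
        return grid

    random.seed(0)
    grid = [[random.choice((-1, 1)) for _ in range(8)] for _ in range(8)]
    for _ in range(50):
        gibbs_sweep(grid)
    print(sum(v for row in grid for v in row))  # magnetization after sampling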
******
Due to the end of supply voltage scaling and the increasing percentage of dark silicon in modern integrated circuits, researchers are looking for new scalable ways to get useful computation from existing silicon technology. In this paper we present a reconfigurable analog accelerator for solving systems of linear equations. Commonly perceived downsides of analog computing, such as low precision and accuracy, limited problem sizes, and difficulty in programming are all compensated for using methods we discuss. Based on a prototyped analog accelerator chip we compare the performance and energy consumption of the analog solver against an efficient digital algorithm running on a CPU, and find that the analog accelerator approach may be an order of magnitude faster and provide one-third energy savings, depending on the accelerator design. Due to the speed and efficiency of linear algebra algorithms running on digital computers, an analog accelerator that matches digital performance needs a large silicon footprint. Finally, we conclude that problem classes outside of systems of linear equations may hold more promise for analog acceleration.
******
Recent developments in GPU execution models and architectures have introduced dynamic parallelism to facilitate the execution of irregular applications where control flow and memory behavior can be unstructured, time-varying, and hierarchical. The changes brought about by this extension to the traditional bulk synchronous parallel (BSP) model also create new challenges in exploiting the current GPU memory hierarchy. One of the major challenges is that the reference locality that exists between the parent and child thread blocks (TBs) created during dynamic nested kernel and thread block launches cannot be fully leveraged using the current TB scheduling strategies. These strategies were designed for the current implementations of the BSP model but fall short when dynamic parallelism is introduced since they are oblivious to the hierarchical reference locality. We propose LaPerm, a new locality-aware TB scheduler that exploits such parent-child locality, both spatial and temporal. LaPerm adopts three different scheduling decisions to i) prioritize the execution of the child TBs, ii) bind them to the stream multiprocessors (SMXs) occupied by their parent TBs, and iii) maintain workload balance across compute units. Experiments with a set of irregular CUDA applications executed on a cycle-level simulator employing dynamic parallelism demonstrate that LaPerm is able to achieve an average of 27% performance improvement over the baseline round-robin TB scheduler commonly used in modern GPUs.
******
Modern discrete GPUs have been the processors of choice for accelerating compute-intensive applications, but using them in large-scale data processing is extremely challenging. Unfortunately, they do not provide important I/O abstractions long established in the CPU context, such as memory mapped files, which shield programmers from the complexity of buffer and I/O device management. However, implementing these abstractions on GPUs poses a problem: the limited GPU virtual memory system provides no address space management and page fault handling mechanisms to GPU developers, and does not allow modifications to memory mappings for running GPU programs. We implement ActivePointers, a software address translation layer and paging system that introduces native support for page faults and virtual address space management to GPU programs, and enables the implementation of fully functional memory mapped files on commodity GPUs. Files mapped into GPU memory are accessed using active pointers, which behave like regular pointers but access the GPU page cache under the hood, and trigger page faults which are handled on the GPU. We design and evaluate a number of novel mechanisms, including a translation cache in hardware registers and translation aggregation for deadlock-free page fault handling of threads in a single warp. We extensively evaluate ActivePointers on commodity NVIDIA GPUs using microbenchmarks, and also implement a complex image processing application that constructs a photo collage from a subset of 10 million images stored in a 40GB file. The GPU implementation maps the entire file into GPU memory and accesses it via active pointers. The use of active pointers adds only up to 1% to the application's runtime, while enabling speedups of up to 3.9× over a combined CPU+GPU implementation and 2.6× over a 12-core CPU-only implementation which uses AVX vector instructions.
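The behavior of an active pointer can be mimicked in a few lines of host-side code (a loose software analogue under invented page and file sizes; the real system performs the translation in GPU hardware registers and handles faults on the GPU itself):

    # Minimal sketch: an index-like object that faults file pages in on demand.
    import os

    PAGE_SIZE = 4096

    class ActivePointer:
        """Behaves like an index into a file, backed by a software page cache."""
        def __init__(self, path):
            self.path = path
            self.page_cache = {}          # page number -> bytes

        def __getitem__(self, offset):
            page, off = divmod(offset, PAGE_SIZE)
            if page not in self.page_cache:           # "page fault"
                with open(self.path, "rb") as f:      # handled in software here
                    f.seek(page * PAGE_SIZE)
                    self.page_cache[page] = f.read(PAGE_SIZE)
            return self.page_cache[page][off]

    with open("demo.bin", "wb") as f:
        f.write(bytes(range(256)) * 64)               # 16 KiB test file
    ap = ActivePointer("demo.bin")
    print(ap[0], ap[5000], len(ap.page_cache))        # two pages faulted in
    os.remove("demo.bin")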
******
Modern GPUs require tens of thousands of concurrent threads to fully utilize the massive amount of processing resources. However, thread concurrency in GPUs can be diminished either due to shortage of thread scheduling structures (scheduling limit), such as available program counters and single instruction multiple thread stacks, or due to shortage of on-chip memory (capacity limit), such as register file and shared memory. Our evaluations show that in practice concurrency in many general purpose applications running on GPUs is curtailed by the scheduling limit rather than the capacity limit. Maximizing the utilization of on-chip memory resources without unduly increasing the scheduling complexity is a key goal of this paper. This paper proposes a Virtual Thread (VT) architecture which assigns Cooperative Thread Arrays (CTAs) up to the capacity limit, while ignoring the scheduling limit. However, to reduce the logic complexity of managing more threads concurrently, we propose to place CTAs into active and inactive states, such that the number of active CTAs still respects the scheduling limit. When all the warps in an active CTA hit a long latency stall, the active CTA is context switched out and the next ready CTA takes its place. We exploit the fact that both active and inactive CTAs still fit within the capacity limit which obviates the need to save and restore large amounts of CTA state. Thus VT significantly reduces performance penalties of CTA swapping. By swapping between active and inactive states, VT can exploit higher degree of thread level parallelism without increasing logic complexity. Our simulation results show that VT improves performance by 23.9% on average.
******
Increasing transfer rates and decreasing I/O voltage levels make signals more vulnerable to transmission errors. While the data in computer memory are well-protected by modern error checking and correcting (ECC) codes, the clock, control, command, and address (CCCA) signals are weakly protected or even unprotected such that transmission errors leave serious gaps in data-only protection. This paper presents All-Inclusive ECC (AIECC), a memory protection scheme that leverages and augments data ECC to also thoroughly protect CCCA signals. AIECC provides strong end-to-end protection of memory, detecting nearly 100% of CCCA errors and also preventing transmission errors from causing latent memory data corruption. AIECC provides these system-level benefits without requiring extra storage and transfer overheads and without degrading the effective level of data protection.
******
Voltage scaling can effectively reduce processor power, but also reduces the reliability of the SRAM cells in on-chip memories. Therefore, it is often accompanied by the use of an error correcting code (ECC). To enable reliable and efficient memory operation at low voltages, ECCs for on-chip memories must provide both high error coverage and low correction latency. In this paper, we propose error pattern transformation, a novel low-latency error correction technique that allows on-chip memories to be scaled to voltages lower than what has been previously possible. Our technique relies on the observation that the number of on-chip memory errors that many ECCs can correct differs widely depending on the error patterns in the logical words they protect. We propose adaptively rearranging the logical-bit to physical-bit mapping per word according to the BIST-detectable fault pattern in the physical word. The adaptive logical-bit to physical-bit mapping transforms many uncorrectable error patterns in the logical words into correctable error patterns, thereby improving on-chip memory reliability. This reduces the minimum voltage at which on-chip memory can run by 70mV over the best low-latency ECC baseline, leading to a 25.7% core-wide power reduction for an ARM Cortex-A7-like core. Energy per instruction is reduced by 15.7% compared to the best baseline.
******
Memory system reliability is a serious concern in many systems today, and is becoming more worrisome as technology scales and system size grows. Stronger fault tolerance capability is therefore desirable, but often comes at high cost. In this paper, we propose a low-cost, fault-aware, hardware-only resilience mechanism, RelaxFault, that repairs the vast majority of memory faults using a small amount of the LLC to remap faulty memory locations. RelaxFault requires less than 100KiB of LLC capacity and has near-zero impact on performance and power. By repairing faults, RelaxFault relaxes the requirement for high fault tolerance of other mechanisms, such as ECC. A better tradeoff between resilience and overhead is made by exploiting an understanding of memory system architecture and fault characteristics. We show that RelaxFault provides better repair capability than prior work of similar cost, improves memory reliability to a greater extent, and significantly reduces the number of maintenance events and memory module replacements. We also propose a more refined memory fault model than prior work and demonstrate its importance.
******
As processors seek more resource efficiency, they increasingly need to target multiple goals at the same time, such as a level of performance, power consumption, and average utilization. Robust control solutions cannot come from heuristic-based controllers or even from formal approaches that combine multiple single-parameter controllers. Such controllers may end up working against each other. What is needed are control-theoretical MIMO (multiple input, multiple output) controllers, which actuate on multiple inputs and control multiple outputs in a coordinated manner. In this paper, we use MIMO control-theory techniques to develop controllers to dynamically tune architectural parameters in processors. To our knowledge, this is the first work in this area. We discuss three ways in which a MIMO controller can be used. We develop an example MIMO controller and show that it is substantially more effective than controllers based on heuristics or built by combining single-parameter formal controllers. The general approach discussed here is likely to become increasingly relevant as future processors become more resource-constrained and adaptive.
******
Many emerging applications such as the internet of things, wearables, and sensor networks have ultra-low-power requirements. At the same time, cost and programmability considerations dictate that many of these applications will be powered by general purpose embedded microprocessors and microcontrollers, not ASICs. In this paper, we exploit a new opportunity for improving energy efficiency in the ultra-low-power processors expected to drive these applications -- dynamic timing slack. Dynamic timing slack exists when an embedded software application executed on a processor does not exercise the processor's static critical paths. In such scenarios, the longest path exercised by the application has additional timing slack which can be exploited for power savings at no performance cost by scaling down the processor's voltage at the same frequency until the longest exercised paths just meet timing constraints. Paths that cannot be exercised by an application can safely be allowed to violate timing constraints. We show that dynamic timing slack exists for many ultra-low-power applications and that exploiting dynamic timing slack can result in significant power savings for ultra-low-power processors. We also present an automated methodology for identifying dynamic timing slack and selecting a safe operating point for a processor running a particular embedded software application. Our approach for identifying and exploiting dynamic timing slack is non-speculative, requires no programmer intervention and little or no hardware support, and demonstrates potential power savings of up to 32%, 25% on average, over a range of embedded applications running on a common ultra-low-power processor, at no performance cost.
******
Infrastructure as a Service (IaaS) Clouds have grown increasingly important. Recent architecture designs support IaaS providers through fine-grain configurability, allowing providers to orchestrate low-level resource usage. Little work, however, has been devoted to supporting IaaS customers who must determine how to use such fine-grain configurable resources to meet quality-of-service (QoS) requirements while minimizing cost. This is a difficult problem because the multiplicity of configurations creates a non-convex optimization space. In addition, this optimization space may change as customer applications enter and exit distinct processing phases. In this paper, we overcome these issues by proposing CASH: a fine-grain configurable architecture co-designed with a cost-optimizing runtime system. The hardware architecture enables configurability at the granularity of individual ALUs and L2 cache banks and provides unique interfaces to support low-overhead, dynamic configuration and monitoring. The runtime uses a combination of control theory and machine learning to configure the architecture such that QoS requirements are met and cost is minimized. Our results demonstrate that the combination of fine-grain configurability and non-convex optimization provides tremendous cost savings (70% savings) compared to coarse-grain heterogeneity and heuristic optimization. In addition, the system is able to customize configurations to particular applications, respond to application phases, and provide near optimal cost for QoS targets.
******
Despite its promise as a DRAM main memory replacement, Phase Change Memory (PCM) has high write latencies, which can be a serious detriment to its widespread adoption. Apart from slowing down a write request, the consequent high latency can also keep other chips of the same rank that are not involved in this write idle for long periods. There are several practical considerations that make it difficult to allow subsequent reads and/or writes to be served concurrently from the same chips during the long latency write. This paper proposes and evaluates several novel mechanisms - re-constructing data from error correction bits instead of waiting for chips currently busy to serve a read, rotating word mappings across chips of a PCM rank, and rotating the mapping of error detection/correction bits across these chips - to overlap several reads with an ongoing write (RoW) and even a write with an ongoing write (WoW). The paper also presents the necessary micro-architectural enhancements needed to implement these mechanisms without significantly changing the current interfaces. The resulting PCM access parallelism (PCMap) system incorporating these enhancements boosts the intra-rank-level parallelism during such writes from a very low baseline value of 2.4 to average and maximum values of 4.5 and 7.4, respectively (out of a maximum of 8.0), across a wide spectrum of both multiprogrammed and multithreaded workloads. This boost in parallelism results in an average IPC improvement of 15.6% and 16.7% for the multiprogrammed and multithreaded workloads, respectively.
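As a rough illustration of the first mechanism, the sketch below reconstructs the slice of a word held by a busy chip, assuming for simplicity a RAID-style XOR parity organization; the paper's actual error-correction layout and chip rotation differ in detail:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative sketch only: serve a read without waiting for the chip
     * that is busy with a long write, by rebuilding its slice from the
     * remaining chips plus an XOR parity chip. */
    #define CHIPS 8  /* data chips per rank; one extra chip holds parity */

    uint8_t read_with_busy_chip(const uint8_t data[CHIPS], uint8_t parity,
                                int busy_chip) {
        uint8_t rebuilt = parity;          /* parity = XOR of all data slices */
        for (int c = 0; c < CHIPS; c++)
            if (c != busy_chip)
                rebuilt ^= data[c];        /* XOR out every available slice   */
        return rebuilt;                    /* what the busy chip would return */
    }

    int main(void) {
        uint8_t slices[CHIPS] = {1, 2, 3, 4, 5, 6, 7, 8}, parity = 0;
        for (int c = 0; c < CHIPS; c++) parity ^= slices[c];
        printf("reconstructed: %u (expected %u)\n",
               read_with_busy_chip(slices, parity, 3), slices[3]);
        return 0;
    }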
******
Virtualization provides benefits for many workloads, but the overheads of virtualizing memory are not universally low. The cost comes from managing two levels of address translation - one in the guest virtual machine (VM) and the other in the host virtual machine monitor (VMM) - with either nested or shadow paging. Nested paging directly performs a two-level page walk that makes TLB misses slower than in unvirtualized native execution, but enables fast page table changes. Alternatively, shadow paging restores native TLB miss speeds, but requires costly VMM intervention on page table updates. This paper proposes agile paging, which combines both techniques and exceeds the best of both. A virtualized page walk starts with shadow paging and optionally switches within the same page walk to nested paging where frequent page table updates would cause costly VMM interventions. Agile paging enables most TLB misses to be handled as fast as native while most page table changes avoid VMM intervention. It requires modest changes to hardware (e.g., to demarcate when to switch) and VMM policies (e.g., to predict good switching opportunities). We emulate the proposed hardware and prototype the software in Linux with KVM on x86-64. Agile paging performs more than 12% better than the best of the two techniques and comes within 4% of native execution for all workloads.
******
As DRAM data bandwidth increases, tremendous energy is dissipated in the DRAM data bus. To reduce the energy consumed in the data bus, DRAM interfaces with asymmetric termination, such as Pseudo Open Drain (POD) and Low Voltage Swing Terminated Logic (LVSTL), have been adopted in modern DRAMs. In interfaces using asymmetric termination, the amount of termination energy is proportional to the Hamming weight of the data words. In this work, we propose Bitwise Difference Encoding (BD-Encoding), which decreases the Hamming weight of data words, leading to a reduction in energy consumption in the modern DRAM data bus. Since a smaller Hamming weight also reduces switching activity, switching energy and power noise are both reduced as well. BD-Encoding exploits the similarity of data words in the DRAM data bus. We observed that similar data words (i.e., data words whose Hamming distance is small) are highly likely to be sent over the bus at similar times. Based on this observation, the BD-coder stores recently transferred data words in both the memory controller and the DRAMs. The BD-coder then transfers the bitwise difference between the current data word and the most similar stored word. In an evaluation using SPEC 2006, BD-Encoding with a table of the 64 most recent data words reduced termination energy by 58.3% and switching energy by 45.3%. In addition, BD-Encoding decreased L·di/dt noise by 55%.
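A minimal sketch of the encoding step, assuming a 64-entry table of recent words mirrored on both sides of the bus; the table-management and index-signaling details here are our assumptions, not the paper's hardware design:

    #include <stdint.h>
    #include <stdio.h>

    /* Find the most similar recently transferred word (minimum Hamming
     * distance) and put the low-weight XOR difference on the bus.
     * __builtin_popcountll is a GCC/Clang builtin. */
    #define TABLE 64

    uint64_t bd_encode(uint64_t word, uint64_t recent[TABLE], int *chosen) {
        int best = 0, best_w = 65;
        for (int i = 0; i < TABLE; i++) {
            int w = __builtin_popcountll(word ^ recent[i]); /* Hamming dist */
            if (w < best_w) { best_w = w; best = i; }
        }
        uint64_t diff = word ^ recent[best];  /* low-weight word to transmit */
        recent[best] = word;                  /* assumed replacement policy  */
        *chosen = best;
        return diff;
    }

    int main(void) {
        uint64_t recent[TABLE] = {0};
        int idx;
        recent[5] = 0xDEADBEEF00ull;
        uint64_t d = bd_encode(0xDEADBEEF07ull, recent, &idx);
        printf("idx=%d, weight=%d\n", idx, __builtin_popcountll(d));
        return 0;
    }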
******
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X -- 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X -- 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
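For reference, the arithmetic performed by the MAC array boils down to an 8-bit integer matrix multiply with wide accumulators; the sketch below shows that computation in plain C, without the systolic dataflow or the TPU's quantization details:

    #include <stdint.h>
    #include <stdio.h>

    /* 8-bit integer matrix multiply, C = A (m x k) * B (k x n),
     * accumulating into 32-bit sums so partial products cannot overflow. */
    void matmul_u8(int m, int n, int k,
                   const uint8_t *a, const uint8_t *b, int32_t *c) {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                int32_t acc = 0;
                for (int p = 0; p < k; p++)
                    acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
                c[i * n + j] = acc;   /* wide accumulator */
            }
    }

    int main(void) {
        uint8_t a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};  /* 2x2 matrices */
        int32_t c[4];
        matmul_u8(2, 2, 2, a, b, c);
        printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);   /* 19 22 43 50 */
        return 0;
    }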
******
Deep Neural Networks (DNNs) have demonstrated state-of-the-art performance on a broad range of tasks involving natural language, speech, image, and video processing, and are deployed in many real world applications. However, DNNs impose significant computational challenges owing to the complexity of the networks and the amount of data they process, both of which are projected to grow in the future. To improve the efficiency of DNNs, we propose ScaleDeep, a dense, scalable server architecture, whose processing, memory and interconnect subsystems are specialized to leverage the compute and communication characteristics of DNNs. While several DNN accelerator designs have been proposed in recent years, the key difference is that ScaleDeep primarily targets DNN training, as opposed to only inference or evaluation. The key architectural features from which ScaleDeep derives its efficiency are: (i) heterogeneous processing tiles and chips to match the wide diversity in computational characteristics (FLOPs and Bytes/FLOP ratio) that manifest at different levels of granularity in DNNs, (ii) a memory hierarchy and 3-tiered interconnect topology that is suited to the memory access and communication patterns in DNNs, (iii) a low-overhead synchronization mechanism based on hardware data-flow trackers, and (iv) methods to map DNNs to the proposed architecture that minimize data movement and improve core utilization through nested pipelining. We have developed a compiler to allow any DNN topology to be programmed onto ScaleDeep, and a detailed architectural simulator to estimate performance and energy. The simulator incorporates timing and power models of ScaleDeep's components based on synthesis to Intel's 14nm technology. We evaluate an embodiment of ScaleDeep with 7032 processing tiles that operates at 600 MHz and has a peak performance of 680 TFLOPs (single precision) and 1.35 PFLOPs (half-precision) at 1.4KW. Across 11 state-of-the-art DNNs containing 0.65M-14.9M neurons and 6.8M-145.9M weights, including winners from 5 years of the ImageNet competition, ScaleDeep demonstrates 6x-28x speedup at iso-power over the state-of-the-art performance on GPUs.
******
Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs, especially in mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and zero-valued activations that arise from the common ReLU operator. Specifically, SCNN employs a novel dataflow that enables maintaining the sparse weights and activations in a compressed encoding, which eliminates unnecessary data transfers and reduces storage requirements. Furthermore, the SCNN dataflow facilitates efficient delivery of those weights and activations to a multiplier array, where they are extensively reused; product accumulation is performed in a novel accumulator array. On contemporary neural networks, SCNN can improve both performance and energy by a factor of 2.7x and 2.3x, respectively, over a comparably provisioned dense CNN accelerator.
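To illustrate the dataflow idea, the sketch below applies it to a 1-D convolution: only nonzero weights and activations are kept (value plus position), all cross products are formed, and each product is scattered to its output coordinate. The compressed format here is a deliberate simplification of SCNN's:

    #include <stdio.h>

    /* Sparse 1-D convolution in SCNN's style: multiply every nonzero
     * weight by every nonzero activation and scatter the product into
     * the output accumulator. Real SCNN tiles this across a multiplier
     * and accumulator array. */
    typedef struct { float v; int pos; } nz_t;

    void sparse_conv1d(const nz_t *w, int nw, const nz_t *x, int nx,
                       float *out, int out_len) {
        for (int i = 0; i < nw; i++)
            for (int j = 0; j < nx; j++) {
                int o = x[j].pos - w[i].pos;      /* output coordinate */
                if (o >= 0 && o < out_len)
                    out[o] += w[i].v * x[j].v;    /* accumulate product */
            }
    }

    int main(void) {
        nz_t w[] = {{2.0f, 0}, {3.0f, 2}};   /* dense filter: 2, 0, 3     */
        nz_t x[] = {{1.0f, 1}, {4.0f, 3}};   /* dense input: 0, 1, 0, 4, 0 */
        float out[3] = {0};                  /* valid conv: 5 - 3 + 1 = 3  */
        sparse_conv1d(w, 2, x, 2, out, 3);
        printf("%.1f %.1f %.1f\n", out[0], out[1], out[2]);  /* 0 14 0 */
        return 0;
    }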
******
A large number of emerging applications such as implantables, wearables, printed electronics, and IoT have ultra-low area and power constraints. These applications rely on ultra-low-power general purpose microcontrollers and microprocessors, making them the most abundant type of processor produced and used today. While general purpose processors have several advantages, such as amortized development cost across many applications, they are significantly over-provisioned for many area- and power-constrained systems, which tend to run only one or a small number of applications over their lifetime. In this paper, we make a case for bespoke processor design, an automated approach that tailors a general purpose processor IP to a target application by removing all gates from the design that can never be used by the application. Since removed gates are never used by an application, bespoke processors can achieve significantly lower area and power than their general purpose counterparts without any performance degradation. Also, gate removal can expose additional timing slack that can be exploited to increase area and power savings or performance of a bespoke design. Bespoke processor design reduces area and power by 62% and 50%, on average, while exploiting exposed timing slack improves average power savings to 65%.
******
This paper investigates the feasibility of a unified processor architecture to enable error coding flexibility and secure communication in low power Internet of Things (IoT) wireless networks. Error coding flexibility for wireless communication allows IoT applications to exploit the large tradeoff space in data rate, link distance and energy-efficiency. As a solution, we present a light-weight Galois Field (GF) processor to enable energy-efficient block coding and symmetric/asymmetric cryptography kernel processing for a wide range of GF sizes (GF(2^m), m = 2, 3, ..., 233) and arbitrary irreducible polynomials. Program-directed connections among primitive GF arithmetic units enable dynamically configured parallelism to efficiently perform either four-way SIMD 5- to 8-bit GF operations, including multiplicative inverse, or a wide bit-width (e.g., 32-bit) GF product in a single cycle. To illustrate our ideas, we synthesized our GF processor in a 28nm technology. Compared to a baseline software implementation optimized for a general purpose ARM M0+ processor, our processor exhibits a 5-20x speedup for a range of error correction codes and symmetric/asymmetric cryptography applications. Additionally, our proposed GF processor consumes 431μW at 0.9V and 100MHz, and achieves 35.5pJ/b energy efficiency while executing AES operations at 12.2Mbps. We achieve this within an area of 0.01mm^2.
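The core primitive is GF(2^m) multiplication with a configurable irreducible polynomial; a bit-serial software rendering of that operation (which the hardware computes in a single cycle) looks like this:

    #include <stdint.h>
    #include <stdio.h>

    /* Bit-serial GF(2^m) multiplication. 'red' holds the low m bits of
     * the irreducible polynomial (i.e., the polynomial minus its x^m
     * term), so arbitrary irreducible polynomials are supported. */
    uint32_t gf_mul(uint32_t a, uint32_t b, uint32_t red, unsigned m) {
        uint32_t r = 0, msb = 1u << (m - 1);
        uint32_t mask = (msb << 1) - 1;          /* keep values in m bits */
        while (b) {
            if (b & 1) r ^= a;                   /* add (XOR) shifted copy */
            b >>= 1;
            uint32_t carry = a & msb;
            a = (a << 1) & mask;
            if (carry) a ^= red;                 /* reduce modulo the poly */
        }
        return r;
    }

    int main(void) {
        /* GF(2^8) with the AES polynomial x^8+x^4+x^3+x+1 -> red = 0x1B.
         * 0x57 * 0x83 = 0xC1 is the standard FIPS-197 example. */
        printf("0x%02X\n", gf_mul(0x57, 0x83, 0x1B, 8));
        return 0;
    }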
******
Wearable computing systems have spurred many opportunities to continuously monitor human bodies with sensors worn on or implanted in the body. These emerging platforms have started to revolutionize many fields, including healthcare and wellness applications, particularly when integrated with intelligent analytic capabilities. However, a significant challenge that computer architects are facing is how to embed sophisticated analytic capabilities in wearable computers in an energy-efficient way while not compromising system performance. In this paper, we present XPro, a novel cross-end analytic engine architecture for wearable computing systems. The proposed cross-end architecture is able to realize a generic classification design across wearable sensors and a data aggregator with high energy-efficiency. To facilitate the practical use of XPro, we also develop an Automatic XPro Generator that formally generates XPro instances according to specific design constraints. As a proof of concept, we study the design and implementation of XPro with six different health applications. Evaluation results show that, compared with state-of-the-art methods, XPro can increase the battery life of the sensor node by 1.6-2.4X while at the same time reducing system delay by 15.6-60.8% for wearable computing systems.
******
Intel's SGX secure execution technology allows running computations on secret data using untrusted servers. While recent work showed how to port applications and large-scale computations to run under SGX, the performance implications of using the technology remain an open question. We present the first comprehensive quantitative study to evaluate the performance of SGX. We show that straightforward use of SGX library primitives for calling functions adds between 8,200 and 17,000 cycles of overhead, compared to the 150 cycles of a typical system call. We quantify the performance impact of these library calls and show that in applications with high system call frequency, such as memcached, openVPN, and lighttpd, which all have high bandwidth network requirements, the performance degradation may be as high as 79%. We investigate the sources of this performance degradation by leveraging a new set of microbenchmarks for SGX-specific operations such as enclave entry-calls and out-calls, and encrypted memory I/O accesses. We leverage the insights we gain from these analyses to design a new SGX interface framework, HotCalls. HotCalls are based on a synchronization spin-lock mechanism and provide a 13-27x speedup over the default interface. They can easily be integrated into existing code, making them a practical solution. Compared to a baseline SGX implementation of memcached, openVPN, and lighttpd, we show that using the new interface boosts throughput by 2.6-3.7x and reduces application latency by 62-74%.
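A minimal sketch of the spin-based mechanism follows, with invented names and a simplified one-slot protocol; the paper's actual interface and queueing are more elaborate:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* HotCalls-style request slot in shared (untrusted) memory, visible
     * to both the enclave and an untrusted worker thread. The enclave
     * posts a request and spins; the worker spins on 'pending', services
     * the call, and clears it, so no costly enclave exit is needed.
     * Initialize 'lock' with ATOMIC_FLAG_INIT and 'pending' with false. */
    typedef struct {
        atomic_flag lock;        /* protects the slot                   */
        atomic_bool pending;     /* request posted and not yet serviced */
        int         call_id;     /* which untrusted function to run     */
        void       *args;        /* pointer into shared memory          */
    } hotcall_slot;

    /* Enclave side: post the call and wait for completion. */
    void hotcall_request(hotcall_slot *s, int id, void *args) {
        while (atomic_flag_test_and_set(&s->lock)) ;  /* spin-lock acquire */
        s->call_id = id;
        s->args = args;
        atomic_store(&s->pending, true);
        while (atomic_load(&s->pending)) ;  /* spin until worker finishes */
        atomic_flag_clear(&s->lock);
    }

    /* Untrusted worker: poll for work instead of waiting for enclave exits. */
    void hotcall_worker(hotcall_slot *s, void (*dispatch)(int, void *)) {
        for (;;) {
            if (atomic_load(&s->pending)) {
                dispatch(s->call_id, s->args);     /* run the system call */
                atomic_store(&s->pending, false);  /* signal completion   */
            }
        }
    }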
******
A practically feasible low-overhead hardware design that provides strong defenses against memory bus side channels remains elusive. This paper observes that smart memory, i.e., memory with compute capability and a packetized interface, can dramatically simplify this problem. InvisiMem expands the trust base to include the logic layer in the smart memory to implement cryptographic primitives, which aid in addressing several memory bus side channel vulnerabilities efficiently. This allows the secure host processor to send encrypted addresses over the untrusted memory bus, and thereby eliminates the need for expensive address obfuscation techniques based on Oblivious RAM (ORAM). In addition, smart memory enables efficient solutions for ensuring freshness without using expensive Merkle trees, and mitigates the memory bus timing channel using constant heartbeat packets. We demonstrate that InvisiMem designs have one to two orders of magnitude lower overheads in performance, space, energy, and memory bandwidth, compared to prior solutions.
******
Trustworthy software requires strong privacy and security guarantees from a secure trust base in hardware. While chipmakers provide hardware support for basic security and privacy primitives such as enclaves and memory encryption, these primitives do not address hiding of the memory access pattern, knowledge of which may enable attacks on the system or reveal characteristics of sensitive user data. State-of-the-art approaches to protecting the access pattern are largely based on Oblivious RAM (ORAM). Unfortunately, current ORAM implementations suffer from very significant practicality and overhead concerns, including roughly an order of magnitude slowdown, more than 100% memory capacity overheads, and the potential for system deadlock.
Memory technology trends are moving towards 3D and 2.5D integration, enabling significant logic capabilities and sophisticated memory interfaces. Leveraging these trends, we propose a new approach to access pattern obfuscation, called ObfusMem. ObfusMem adds the memory to the trusted computing base and incorporates cryptographic engines within the memory. ObfusMem encrypts commands and addresses on the memory bus, hence the access pattern is cryptographically obfuscated from external observers. Our evaluation shows that ObfusMem incurs an overhead of 10.9% on average, which is about an order of magnitude faster than ORAM implementations. Furthermore, ObfusMem does not incur capacity overheads and does not amplify writes. We analyze and compare the security protections provided by ObfusMem and ORAM, and highlight their differences.
******
Tailoring the operating voltage to fine-grain temporal changes in the power and performance needs of the workload can effectively enhance power efficiency. Therefore, power-limited computing platforms of today widely deploy integrated (i.e., on-chip) voltage regulation which enables fast fine-grain voltage control. Voltage regulators convert and distribute power from an external energy source to the processor. Unfortunately, power conversion loss is inevitable and projected integrated regulator designs are unlikely to eliminate this loss even asymptotically. Reconfigurable power delivery by selective shut-down, i.e., gating, of distributed on-chip regulators in response to spatio-temporal changes in power demand can sustain operation at the minimum conversion loss. However, even the minimum conversion loss is sizable, and as conversion loss gets dissipated as heat, on-chip regulators can easily cause thermal emergencies due to their small footprint.
Although reconfigurable distributed on-chip power delivery is emerging as a new design paradigm to enforce sustained operation at the minimum possible power conversion loss, its thermal implications have been overlooked at the architectural level. This paper therefore provides a thermal characterization. We introduce ThermoGater, an architectural governor for a collection of practical, thermally-aware regulator gating policies that mitigate (if not prevent) regulator-induced thermal emergencies while also considering the potential implications for voltage noise. Practical ThermoGater policies not only sustain minimum power conversion loss throughout execution effectively, but also keep the maximum temperature (thermal gradient) across the chip within 0.6°C (0.3°C) on average of thermally-optimal oracular regulator gating, while the maximum voltage noise stays within 1.0% of the best-case voltage noise profile.
******
Modern user-facing applications consist of multiple processing stages with a number of service instances in each stage. The latency profile of these multi-stage applications is intrinsically variable, making it challenging to provide satisfactory responsiveness. Given a limited power budget, improving the end-to-end latency requires intelligently boosting the bottleneck service across stages using multiple boosting techniques. However, prior work fails to acknowledge the multi-stage nature of user-facing applications and performs poorly in improving responsiveness on power-constrained CMPs, as it is unable to accurately identify the bottleneck service and apply boosting techniques adaptively.
In this paper, we present PowerChief, a runtime framework that 1) jointly monitors service and query latency statistics across service stages and accurately identifies the bottleneck service at runtime; 2) adaptively chooses the boosting technique to accelerate the bottleneck service for improved responsiveness; 3) dynamically reallocates the constrained power budget across service stages to accommodate the chosen boosting technique. Evaluated with real-world multi-stage applications, PowerChief improves the average latency by 20.3x and 32.4x (99% tail latency by 13.3x and 19.4x) for Sirius and Natural Language Processing applications, respectively, compared to stage-agnostic power allocation. In addition, for a given QoS target, PowerChief reduces the power consumption of Sirius and Web Search applications by 23% and 33%, respectively, over prior work.
******
High-performance architectures are over-provisioned with resources to extract the maximum achievable performance out of applications. Two sources of avoidable power dissipation are the leakage power from underutilized resources and the clock power from the clock hierarchy that feeds these resources. Most reconfiguration mechanisms either focus solely on power gating the execution resources or, at most, also turn off the immediate clock tree segment which supplied the clock to those resources. These proposals neither attempt to gate further up the clock hierarchy nor do they involve the clock hierarchy in influencing the reconfiguration decisions. The primary contribution of CHARSTAR is making reconfiguration mechanisms clock-hierarchy aware. Resource gating decisions are cognizant of the power consumed by each node in the clock hierarchy and, additionally, entire branches of the clock tree are greedily shut down whenever possible.
The CHARSTAR design is further optimized for balanced spatio-temporal reconfiguration and also enables efficient joint control of resource and frequency scaling. The proposal is implemented by leveraging the inherent advantages of spatial architectures, utilizing a control mechanism driven by a lightweight offline trained neural predictor. CHARSTAR, when deployed on the CRIB tiled microarchitecture, improves processor energy efficiency by 20-25%, with efficiency improvements of roughly 2x in comparison to a naive power gating mechanism. Alternatively, it improves performance by 10-20% under varying power and energy constraints.
******
An unambiguous and easy-to-understand memory consistency model is crucial for ensuring correct synchronization and guiding future design of heterogeneous systems. In a widely adopted approach, the memory model guarantees sequential consistency (SC) as long as programmers obey certain rules. The popular data-race-free-0 (DRF0) model exemplifies this SC-centric approach by requiring programmers to avoid data races. Recent industry models, however, have extended such SC-centric models to incorporate relaxed atomics. These extensions can improve performance, but are difficult to specify formally and use correctly. This work addresses the impact of relaxed atomics on consistency models for heterogeneous systems in two ways. First, we introduce a new model, Data-Race-Free-Relaxed (DRFrlx), that extends DRF0 to provide SC-centric semantics for the common use cases of relaxed atomics. Second, we evaluate the performance of relaxed atomics in CPU-GPU systems for these use cases. We find mixed results -- for most cases, relaxed atomics provide only a small benefit in execution time, but for some cases, they help significantly (e.g., up to 51% for DRFrlx over DRF0).
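One commonly cited relaxed-atomic use case of the kind such a model must give semantics to is an event counter; a C11 rendering:

    #include <stdatomic.h>
    #include <stdio.h>

    /* An event counter: increments race benignly and need atomicity but
     * no ordering, so memory_order_relaxed suffices. This is safe only
     * because no data is published or consumed through this variable;
     * the final read happens after threads synchronize (e.g., join). */
    atomic_ulong hits;

    void count_event(void) {
        atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
    }

    int main(void) {
        for (int i = 0; i < 1000; i++) count_event();
        printf("%lu\n", atomic_load(&hits));
        return 0;
    }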
******
Byte-addressable non-volatile memory technology is emerging as an alternative to DRAM for main memory. This new Non-Volatile Main Memory (NVMM) allows programmers to store important data in data structures in memory instead of serializing it to the file system, thereby providing a substantial performance boost. However, modern systems reorder memory operations and utilize volatile caches for better performance, making it difficult to ensure a consistent state in NVMM. Intel recently announced a new set of persistence instructions, clflushopt, clwb, and pcommit. These new instructions make it possible to implement fail-safe code on NVMM, but few workloads have been written or characterized using these new instructions.
In this work, we describe how these instructions work and how they can be used to implement write-ahead-logging-based transactions. We implement several common data structures and kernels and evaluate the performance overhead incurred over traditional non-persistent implementations. In particular, we find that persistence instructions occur in clusters along with expensive fence operations, that they have long latency, and that they add significant execution time overhead: on average 20.3% over code with logging but without fence instructions to order persists.
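A minimal sketch of such a write-ahead-logged update using clwb and sfence follows; the log-entry layout is our invention, and we omit pcommit, which was later deprecated:

    #include <stdint.h>
    #include <immintrin.h>   /* _mm_clwb, _mm_sfence; compile with -mclwb */

    /* Undo-log update. The ordering is the point: the log record must be
     * persistent before the in-place write, and the write must be
     * persistent before the log entry is invalidated. */
    typedef struct {
        uint64_t *addr;     /* where the update happens       */
        uint64_t  old_val;  /* undo value                     */
        uint64_t  valid;    /* nonzero while update in flight */
    } log_entry;

    void persistent_update(uint64_t *addr, uint64_t new_val, log_entry *e) {
        e->addr = addr; e->old_val = *addr; e->valid = 1;
        _mm_clwb(e);        /* write back the log entry's cache line */
        _mm_sfence();       /* order: log persists before data write */

        *addr = new_val;
        _mm_clwb(addr);     /* write back the data itself            */
        _mm_sfence();       /* order: data persists before commit    */

        e->valid = 0;       /* retire the log entry                  */
        _mm_clwb(e);
        _mm_sfence();
    }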
To deal with this overhead and alleviate the performance bottleneck, we propose to speculate past long latency persistency operations using checkpoint-based processing. Our speculative persistence architecture reduces the execution time overheads to only 3.6%.
******
In Total Store Order memory consistency (TSO), loads can be speculatively reordered to improve performance. If a load-load reordering is seen by other cores, speculative loads must be squashed and re-executed. In architectures with an unordered interconnection network and directory coherence, this has been the established view for decades. We show, for the first time, that it is not necessary to squash and re-execute speculatively reordered loads in TSO when their reordering is seen. Instead, the reordering can be hidden from other cores by the coherence protocol. The implication is that we can irrevocably bind speculative loads. This allows us to commit reordered loads out-of-order without having to wait (for the loads to become non-speculative) or having to checkpoint committed state (and roll back if needed), just to ensure correctness in the rare case of some core seeing the reordering. We show that by exposing a reordering to the coherence layer and by appropriately modifying a typical directory protocol we can successfully hide load-load reordering without perceptible performance cost and without deadlock. Our solution is cost-effective and increases the performance of out-of-order commit by a sizable margin, compared to the base case where memory operations are not allowed to commit if the consistency model could be violated.
******
This work presents a minimally-intrusive, high-performance, post-silicon validation framework for validating memory consistency in multi-core systems. Our framework generates constrained-random tests that are instrumented with observability-enhancing code for memory consistency verification. For each test, we generate a set of compact signatures reflecting the memory-ordering patterns observed over many executions of the test, with each of the signatures corresponding to a unique memory-ordering pattern. We then leverage an efficient and novel analysis to quickly determine if the observed execution patterns represented by each unique signature abide by the memory consistency model. Our analysis derives its efficiency by exploiting the structural similarities among the patterns observed.
We evaluated our framework, MTraceCheck, on two platforms: an x86-based desktop and an ARM-based SoC platform, both running multi-threaded test programs in a bare-metal environment. We show that MTraceCheck reduces the perturbation introduced by the memory-ordering monitoring activity by 93% on average, compared to a baseline register-flushing approach that saves register state after each load operation. We also reduce the computation requirements of our consistency checking analysis by 81% on average, compared to a conventional topological sorting solution. Finally, we demonstrate the effectiveness of MTraceCheck on buggy designs, by evaluating multiple case studies in which it successfully exposes subtle bugs in a full-system simulation environment.
******
Memory hardware errors may result from transient particle-induced faults as well as device defects due to aging. These errors are an important threat to computer system reliability as VLSI technologies continue to scale. Managing memory hardware errors is a critical component in developing an overall system dependability strategy. Memory error detection and correction are supported by a range of available hardware mechanisms. However, memory protections (particularly the more advanced ones) come at substantial costs in performance and energy usage. Moreover, the protection mechanisms are often a fixed, system-wide choice and cannot easily adapt to the different protection demands of different applications or memory regions.
In this paper, we present a new RAIM (redundant array of independent memory) design that, compared to the state-of-the-art implementation, can easily provide high protection capability and the ability to selectively protect a subset of the memory. A straightforward implementation of the design can incur a substantial memory traffic overhead. We propose a few practical optimizations to mitigate this overhead. With these optimizations, the proposed RAIM design offers significant advantages over existing RAIM designs at lower or comparable costs.
******
The processors that drive embedded systems are getting smaller; meanwhile, the batteries used to power those systems have stagnated. If we are to realize the dream of ubiquitous computing promised by the Internet of Things, processors must shed large, heavy, expensive, and high-maintenance batteries and, instead, harvest energy from their environment. One challenge with this transition is that harvested energy is insufficient for continuous operation. Unfortunately, existing programs fail miserably when executed intermittently.
This paper presents Clank: lightweight architectural support for correct and efficient execution of long-running applications on harvested energy---without programmer intervention or extreme hardware modifications. Clank is a set of hardware buffers and memory-access monitors that dynamically maintain idempotency. Essentially, Clank dynamically decomposes program execution into a stream of restartable sub-executions connected via lightweight checkpoints.
To validate Clank's ability to correctly stretch program execution across frequent, random power cycles, and to explore the associated hardware and software overheads, we implement Clank in Verilog, formally verify it, and then add it to an ARM Cortex M0+ processor which we use to run a set of 23 embedded systems benchmarks. Experiments show run-time overheads as low as 2.5%, with run-time overheads of 6% for a version of Clank that adds 1.7% hardware. Clank minimizes checkpoints so much that re-execution time becomes the dominant contributor to run-time overhead.
******
Early reliability assessment of hardware structures using microarchitecture level simulators can effectively guide major error protection decisions in microprocessor design. Statistical fault injection on microarchitectural structures modeled in performance simulators is an accurate method to measure their Architectural Vulnerability Factor (AVF) but requires excessively long campaigns to obtain high statistical significance.
We propose MeRLiN, a methodology to boost microarchitecture-level injection-based reliability assessment by several orders of magnitude while keeping the accuracy of the assessment unaffected, even for large injection campaigns with very high statistical significance. The core of MeRLiN is the grouping of the faults of an initial list into equivalent classes. All faults in the same group target equivalent vulnerable intervals of program execution that end at the same static instruction reading the faulty entries. Faults in the same group occur at different times and in different entries of a structure, and it is extremely likely that they all have the same effect on program execution; thus, fault injection is performed only on a few representatives from each group.
We evaluate MeRLiN for different sizes of the physical register file, the store queue, and the first-level data cache of a contemporary microarchitecture running MiBench and SPEC CPU2006 benchmarks. For all our experiments, MeRLiN is 2 to 3 orders of magnitude faster than an injection campaign with extremely high statistical significance, reporting the same reliability measurements with negligible loss of accuracy. Finally, we theoretically analyze MeRLiN's statistical behavior to further justify its accuracy.
******
Modern DRAM-based systems suffer from significant energy and latency penalties due to conservative DRAM refresh standards. Volatile DRAM cells can retain information across a wide distribution of times ranging from milliseconds to many minutes, but each cell is currently refreshed every 64ms to account for the extreme tail end of the retention time distribution, leading to a high refresh overhead. Due to poor DRAM technology scaling, this problem is expected to get worse in future device generations. Hence, the current approach of refreshing all cells with the worst-case refresh rate must be replaced with a more intelligent design.
Many prior works propose reducing the refresh overhead by extending the default refresh interval to a higher value, which we refer to as the target refresh interval, across parts or all of a DRAM chip. These proposals handle the small set of failing cells that cannot retain data throughout the entire extended refresh interval via retention failure mitigation mechanisms (e.g., error correcting codes or bit-repair mechanisms). This set of failing cells is discovered via retention failure profiling, which is currently a brute-force process that writes a set of known data to DRAM, disables refresh and waits for the duration of the target refresh interval, and then checks for retention failures across the DRAM chip. We show that this brute-force approach is too slow and is detrimental to system execution, especially with frequent online profiling.
This paper presents reach profiling, a new methodology for retention failure profiling based on the key observation that an overwhelming majority of the DRAM cells that fail at a target refresh interval fail more reliably at both longer refresh intervals and higher temperatures. Using 368 state-of-the-art LPDDR4 DRAM chips from three major vendors, we conduct a thorough experimental characterization of the complex set of tradeoffs inherent in the profiling process. We identify three key metrics to guide design choices for retention failure profiling and mitigation mechanisms: coverage, false positive rate, and runtime. The key idea of reach profiling is to profile failing cells at a longer refresh interval and/or a higher temperature relative to the target conditions, in order to maximize failure coverage while minimizing the false positive rate and profiling runtime. We thoroughly explore the tradeoffs associated with reach profiling and show that there is significant room for improvement in DRAM retention failure profiling beyond the brute-force approach. We show with experimental data that, on average, by profiling at 250ms above the target refresh interval, our first implementation of reach profiling (called REAPER) can attain greater than 99% coverage of failing DRAM cells with less than a 50% false positive rate while running 2.5x faster than the brute-force approach. In addition, our end-to-end evaluations show that REAPER enables significant system performance improvement and DRAM power reduction, outperforming the brute-force approach and enabling high-performance operation at longer refresh intervals that were previously unreasonable to employ due to the high associated profiling overhead.
******
GPUs have been widely adopted in data centers to provide acceleration services to many applications. Sharing a GPU is increasingly important for better processing throughput and energy efficiency. However, quality of service (QoS) among concurrent applications is minimally supported. Previous efforts are too coarse-grained and not scalable with increasing QoS requirements. We propose QoS mechanisms for a fine-grained form of GPU sharing. Our QoS support can provide control over the progress of kernels on a per cycle basis and the amount of thread-level parallelism of each kernel. Due to accurate resource management, our QoS support has significantly better scalability compared with previous best efforts. Evaluations show that, when the GPU is shared by three kernels, two of which have QoS goals, the proposed techniques achieve QoS goals 43.8% more often than previous techniques and have 20.5% higher throughput.
******
Snapshot Isolation (SI) is an established model in the database community, which permits write-read conflicts to pass and aborts transactions only on write-write conflicts. With the Write Skew anomaly correctly eliminated, SI can reduce the occurrence of aborts, save the work done by transactions, and greatly benefit long transactions involving complex data structures.
GPUs are evolving into general-purpose computing devices with growing support for irregular workloads, including transactional memory. The use of snapshot isolation in transactional memory has proven to be greatly beneficial for performance. In this paper, we propose a multi-versioned memory subsystem for hardware-based transactional memory on the GPU, with a method for eliminating the Write Skew anomaly on the fly, and finally incorporate Snapshot Isolation into this system.
The results show that snapshot isolation can effectively boost the performance of dynamically sized data structures such as linked lists, binary trees and red-black trees, sometimes by as much as 4.5x, which results in improved overall performance of benchmarks utilizing these data structures.
******
This paper introduces a method of decoupling affine computations---a class of expressions that produces extremely regular values across SIMT threads---from the main execution stream, so that the affine computations can be performed with greater efficiency and with greater independence from the main execution stream. This decoupling has two benefits: (1) For compute-bound programs, it significantly reduces the dynamic warp instruction count; (2) for memory-bound workloads, it significantly reduces memory latency, since it acts as a non-speculative prefetcher for the data specified by the many memory address calculations that are affine computations.
We evaluate our solution, known as Decoupled Affine Computation (DAC), using GPGPU-sim and a set of 29 GPGPU programs. We find that on average, DAC improves performance by 40% and reduces energy consumption by 20%. For the 11 compute-bound benchmarks, DAC improves performance by 34%, compared with 11% for the previous state-of-the-art. For the 18 memory-bound programs, DAC improves performance by an average of 44%, compared with 16% for the state-of-the-art GPU prefetcher.
******
Long memory operation latency is a prominent performance bottleneck in graphics processing units (GPUs). The small data cache that must be shared across dozens of warps (each a collection of threads) creates significant cache contention and premature data eviction. Prior works have recognized this problem and proposed warp throttling, which reduces the number of active warps contending for cache space. In this paper we discover that individual load instructions in a warp exhibit four different types of data locality behavior: (1) data brought by a warp load instruction is used only once, which is classified as streaming data; (2) data brought by a warp load is reused multiple times within the same warp, called intra-warp locality; (3) data brought by a warp is reused multiple times but across different warps, called inter-warp locality; (4) some data exhibit a mix of intra- and inter-warp locality. Furthermore, each load instruction consistently exhibits the same locality type across all warps within a GPU kernel. Based on this discovery we argue that cache management must be done using per-load locality type information, rather than applying warp-wide cache management policies. We propose Access Pattern-aware Cache Management (APCM), which dynamically detects the locality type of each load instruction by monitoring the accesses from one exemplary warp. APCM then uses the detected locality type to selectively apply cache bypassing and cache pinning of data based on load locality characterization. Using an extensive set of simulations we show that APCM improves the performance of GPUs by 34% for cache-sensitive applications while saving 27% of energy consumption over a baseline GPU.
******
Historically, improvements in GPU-based high performance computing have been tightly coupled to transistor scaling. As Moore's law slows down, and the number of transistors per die no longer grows at historical rates, the performance curve of single monolithic GPUs will ultimately plateau. However, the need for higher performing GPUs continues to exist in many domains. To address this need, in this paper we demonstrate that package-level integration of multiple GPU modules to build larger logical GPUs can enable continuous performance scaling beyond Moore's law. Specifically, we propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs), and integrating them on package using high bandwidth and power efficient signaling technologies. We lay out the details and evaluate the feasibility of a basic Multi-Chip-Module GPU (MCM-GPU) design. We then propose three architectural optimizations that significantly improve GPM data locality and minimize sensitivity to inter-GPM bandwidth. Our evaluation shows that the optimized MCM-GPU achieves 22.8% speedup and 5x inter-GPM bandwidth reduction when compared to the basic MCM-GPU architecture. Most importantly, the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU, and performs within 10% of a hypothetical (and unbuildable) monolithic GPU. Lastly, we show that our optimized MCM-GPU is 26.8% faster than an equally equipped multi-GPU system with the same total number of SMs and DRAM bandwidth.
******
This paper describes EM-Based Detection of Deviations in Program Execution (EDDIE), a new method for detecting anomalies in program execution, such as malware and other code injections, without introducing any overheads, adding any hardware support, changing any software, or using any resources on the monitored system itself. Monitoring with EDDIE involves receiving electromagnetic (EM) emanations that are emitted as a side effect of execution on the monitored system, and it relies on spikes in the EM spectrum that are produced as a result of periodic (e.g. loop) activity in the monitored execution. During training, EDDIE characterizes normal execution behavior in terms of peaks in the EM spectrum that are observed at various points in the program execution, but it does not need any characterization of the malware or other code that might later be injected. During monitoring, EDDIE identifies peaks in the observed EM spectrum, and compares these peaks to those learned during training. Since EDDIE requires no resources on the monitored machine and no changes to the monitored software, it is especially well suited for security monitoring of embedded and IoT devices. We evaluate EDDIE on a real IoT system and in a cycle-accurate simulator, and find that even relatively brief injected bursts of activity (a few milliseconds) are detected by EDDIE with high accuracy, and that it also accurately detects when even a few instructions are injected into an existing loop within the application.
******
In cache-based side channel attacks, a spy that shares a cache with a victim probes cache locations to extract information on the victim's access patterns. For example, in evict+reload, the spy repeatedly evicts and then reloads a probe address, checking if the victim has accessed the address in between the two operations. While there are many proposals to combat these cache attacks, they all have limitations: they either hurt performance, require programmer intervention, or can only defend against some types of attacks.
This paper makes the following observation for an environment with an inclusive cache hierarchy: when the spy evicts the probe address from the shared cache, the address will also be evicted from the private cache of the victim process, creating an inclusion victim. Consequently, to disable cache attacks, this paper proposes to alter the line replacement algorithm of the shared cache, to prevent a process from creating inclusion victims in the caches of cores running other processes. By enforcing this rule, the spy cannot evict the probe address from the shared cache and, hence, cannot glimpse any information on the victim's access patterns. We call our proposal SHARP (Secure Hierarchy-Aware cache Replacement Policy). SHARP efficiently defends against all existing cross-core shared-cache attacks, needs only minimal hardware modifications, and requires no code modifications. We implement SHARP in a cycle-level full-system simulator. We show that it protects against real-world attacks, and that it introduces negligible average performance degradation.
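The replacement rule can be sketched as a victim-selection function over a set's ways, using directory presence bits; SHARP's fallback steps (random choice, alarm counters) are simplified here:

    #include <stdint.h>
    #include <stdbool.h>

    /* Victim selection for an inclusive shared cache in SHARP's style:
     * prefer to evict lines that no core caches privately, so the
     * eviction cannot create an inclusion victim in another core's
     * private cache. Directory details are simplified. */
    #define WAYS 16

    typedef struct {
        uint64_t tag;
        uint32_t presence;    /* bit per core: line present in that L1? */
        bool     valid;
    } llc_line;

    /* Returns the way to evict, or -1 to fall back to the policy's
     * further steps when every way is cached by some other core. */
    int sharp_pick_victim(const llc_line set[WAYS], uint32_t requester_bit) {
        for (int w = 0; w < WAYS; w++)
            if (!set[w].valid)
                return w;                     /* free way: no victim at all */
        for (int w = 0; w < WAYS; w++)
            if (set[w].presence == 0)
                return w;                     /* cached in no private cache */
        for (int w = 0; w < WAYS; w++)
            if ((set[w].presence & ~requester_bit) == 0)
                return w;                     /* only the requester has it  */
        return -1;                            /* all ways would harm others */
    }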
******
Most architectures are designed to mitigate the usually undesirable phenomenon of device wearout. We take a contrarian view and harness this phenomenon to create hardware security mechanisms that resist attacks by statistically enforcing an upper bound on hardware uses, and consequently attacks. For example, let us assume that a user may log into a smartphone a maximum of 50 times a day for 5 years, resulting in approximately 91,250 legitimate uses. If we assume at least 8-character passwords and we require login (and retrieval of the storage decryption key) to traverse hardware that wears out in 91,250 uses, then an adversary has a negligible chance of successful brute-force attack before the hardware wears out, even assuming real-world password cracking by professionals. M-way replication of our hardware and periodic re-encryption of storage can increase the daily usage bound by a factor of M.
The key challenge is to achieve practical statistical bounds on both minimum and maximum uses for an architecture, given that individual devices can vary widely in wearout characteristics. We introduce techniques for architecturally controlling these bounds and perform a design space exploration for three use cases: a limited-use connection, a limited-use targeting system, and one-time pads. These techniques include decision trees, parallel structures, Shamir's secret-sharing mechanism, Reed-Solomon codes, and module replication. We explore the cost in area, energy, and latency of using these techniques to achieve system-level usage targets given device-level wearout distributions. With redundant encoding, for example, we can improve exponential sensitivity to device lifetime variation to linear sensitivity, reducing the total number of NEMS devices by 4 orders of magnitude to about 0.8 million for limited-use connections (compared with 4 billion without redundant encoding).
******
With the end of Dennard scaling, architects have increasingly turned to special-purpose hardware accelerators to improve the performance and energy efficiency of some applications. Unfortunately, accelerators don't always live up to expectations and may under-perform in some situations. Understanding the factors which affect the performance of an accelerator is crucial for both architects and programmers early in the design stage. Detailed models can be highly accurate, but often require low-level details which are not available until late in the design cycle. In contrast, simple analytical models can provide useful insights by abstracting away low-level system details.
In this paper, we propose LogCA---a high-level performance model for hardware accelerators. LogCA helps both programmers and architects identify performance bounds and design bottlenecks early in the design cycle, and provides insight into which optimizations may alleviate these bottlenecks. We validate our model across a variety of kernels, ranging from sub-linear to super-linear complexities, on both on-chip and off-chip accelerators. We also describe the utility of LogCA using two retrospective case studies. First, we discuss the evolution of interface design in Sun/Oracle's encryption accelerators. Second, we discuss the evolution of memory interface design in three different GPU architectures. In both cases, we show that the adopted design optimizations for these machines are similar to LogCA's suggested optimizations. We argue that architects and programmers can use insights from these retrospective studies to improve future designs.
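Although the paper defines the model precisely, a plausible rendering of a LogCA-style speedup estimate, under our reading of the parameters and with made-up values, is the following:

    #include <math.h>
    #include <stdio.h>

    /* Assumed formulation: for offload granularity g, the host would take
     * C(g) = c * g^beta time, while offloading pays a fixed overhead o and
     * a per-unit communication latency L, and runs the kernel A times
     * faster. All parameter values below are illustrative only. */
    double logca_speedup(double g, double o, double L, double c,
                         double beta, double A) {
        double host  = c * pow(g, beta);       /* unaccelerated time */
        double accel = o + L * g + host / A;   /* offloaded time     */
        return host / accel;
    }

    int main(void) {
        /* Speedup grows with granularity until compute dominates the
         * fixed offload overhead, then saturates. Link with -lm. */
        for (double g = 64; g <= 1 << 20; g *= 16)
            printf("g=%8.0f  speedup=%5.2f\n",
                   g, logca_speedup(g, 5000.0, 0.001, 0.1, 1.0, 10.0));
        return 0;
    }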
******
Reconfigurable architectures have gained popularity in recent years as they allow the design of energy-efficient accelerators. Fine-grain fabrics (e.g., FPGAs) have traditionally suffered from performance and power inefficiencies due to bit-level reconfigurable abstractions. Both fine-grain and coarse-grain architectures (e.g., CGRAs) traditionally require low-level programming and suffer from long compilation times. We address both challenges with Plasticine, a new spatially reconfigurable architecture designed to efficiently execute applications composed of parallel patterns. Parallel patterns have emerged from recent research on parallel programming as powerful, high-level abstractions that can elegantly capture data locality, memory access patterns, and parallelism across a wide range of dense and sparse applications.
We motivate Plasticine by first observing key application characteristics captured by parallel patterns that are amenable to hardware acceleration, such as hierarchical parallelism, data locality, memory access patterns, and control flow. Based on these observations, we architect Plasticine as a collection of Pattern Compute Units and Pattern Memory Units. Pattern Compute Units are multi-stage pipelines of reconfigurable SIMD functional units that can efficiently execute nested patterns. Data locality is exploited in Pattern Memory Units using banked scratchpad memories and configurable address decoders. Multiple on-chip address generators and scatter-gather engines make efficient use of DRAM bandwidth by supporting a large number of outstanding memory requests, memory coalescing, and burst mode for dense accesses. Plasticine has an area footprint of 113 mm^2 in a 28nm process, and consumes a maximum power of 49 W at a 1 GHz clock. Using a cycle-accurate simulator, we demonstrate that Plasticine provides an improvement of up to 76.9x in performance-per-Watt over a conventional FPGA over a wide range of dense and sparse applications.
******
The fast and energy-efficient simulation of dynamical systems defined by coupled ordinary/partial differential equations has emerged as an important problem. The accelerated simulation of coupled ODEs/PDEs is critical for the analysis of physical systems as well as for computing with dynamical systems. This paper presents a fast and programmable accelerator for simulating dynamical systems. The computing model of the proposed platform is based on a multilayer cellular nonlinear network (CeNN) augmented with nonlinear function evaluation engines. The platform can be programmed to accelerate wide classes of ODEs/PDEs by modulating the connectivity within the multilayer CeNN engine. An innovative hardware architecture including data reuse, memory hierarchy, and near-memory processing is designed to accelerate the augmented multilayer CeNN. A dataflow model is presented which is supported by an optimized memory hierarchy for efficient function evaluation. The proposed solver is designed and synthesized in a 15nm technology for the hardware analysis. The performance is evaluated and compared to GPU nodes when solving wide classes of differential equations, and the power consumption is analyzed to show orders-of-magnitude improvement in energy efficiency.
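The kind of kernel such an accelerator targets can be illustrated with a minimal explicit-Euler integrator for a coupled two-state ODE system; a CeNN-based engine would evaluate all cell updates in parallel each step:

    #include <stdio.h>

    /* Harmonic oscillator x'' = -x written as two coupled first-order
     * equations, integrated with the explicit Euler method. */
    #define N 2

    void deriv(const double x[N], double dx[N]) {
        dx[0] = x[1];        /* position' = velocity  */
        dx[1] = -x[0];       /* velocity' = -position */
    }

    void euler(double x[N], double h, int steps) {
        double dx[N];
        for (int s = 0; s < steps; s++) {
            deriv(x, dx);
            for (int i = 0; i < N; i++)
                x[i] += h * dx[i];    /* x(t+h) = x(t) + h*f(x(t)) */
        }
    }

    int main(void) {
        double x[N] = {1.0, 0.0};
        euler(x, 0.001, 6283);        /* ~one period (2*pi seconds)    */
        printf("x=%f v=%f\n", x[0], x[1]);  /* close to (1, 0) again   */
        return 0;
    }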
******
Demand for low-power data processing hardware continues to rise inexorably. Existing programmable and "general purpose" solutions (e.g., SIMD, GPGPUs) are insufficient, as evidenced by the order-of-magnitude improvements and industry adoption of application- and domain-specific accelerators in important areas like machine learning, computer vision, and big data. The stark tradeoffs between efficiency and generality at these two extremes pose a difficult question: how could domain-specific hardware efficiency be achieved without domain-specific hardware solutions?
In this work, we rely on the insight that "acceleratable" algorithms have broad common properties: high computational intensity with long phases, simple control patterns and dependences, and simple streaming memory access and reuse patterns. We define a general architecture (a hardware-software interface), called stream-dataflow, that can more efficiently express programs with these properties. The dataflow component of this architecture enables high concurrency, and the stream component enables communication and coordination at very low power and area overhead. This paper explores the hardware and software implications, describes its detailed microarchitecture, and evaluates an implementation. Compared to a state-of-the-art domain-specific accelerator (DianNao) and to fixed-function accelerators for MachSuite, our implementation, Softbrain, can match their performance with only a 2x power overhead on average.
******
To improve system performance, operating systems (OSes) often undertake activities that require modification of virtual-to-physical address translations. For example, the OS may migrate data between physical pages to manage heterogeneous memory devices. We refer to such activities as page remappings. Unfortunately, page remappings are expensive. We show that a big part of this cost arises from address translation coherence, particularly on systems employing virtualization. In response, we propose hardware translation invalidation and coherence, or HATRIC, a readily implementable hardware mechanism to piggyback translation coherence atop existing cache coherence protocols. We perform detailed studies using KVM-based virtualization, showing that HATRIC achieves up to 30% performance and 10% energy benefits, for per-CPU area overheads of 0.2%. We also quantify HATRIC's benefits on systems running Xen and find up to 33% performance improvements.
******
To mitigate excessive TLB misses in large memory applications, techniques such as large pages, variable-length segments, and HW coalescing increase the coverage of limited hardware translation entries by exploiting contiguous memory allocation. However, recent studies show that in non-uniform memory systems, using large pages often leads to performance degradation, and that allocating large chunks of memory becomes more difficult due to memory fragmentation. Although each of the prior techniques favors its own best chunk size, the diverse contiguity of memory allocation in real systems cannot always provide the optimal chunk size for each technique. Under such fragmented and diverse memory allocations, this paper proposes a novel HW-SW hybrid translation architecture which can adapt to different memory mappings efficiently. In the proposed hybrid coalescing technique, the operating system encodes memory contiguity information in a subset of page table entries, called anchor entries. During address translation through TLBs, an anchor entry provides translation for the contiguous pages following it. As a small number of anchor entries can cover a large portion of the virtual address space, the efficiency of the TLB can be significantly improved. The most important benefit of hybrid coalescing is its ability to change the coverage of the anchor entry dynamically, reflecting the current allocation contiguity status. By using the contiguity information directly set by the operating system, the technique can provide scalable translation coverage improvements with minor hardware changes, while allowing flexibility of memory allocation. Our experimental results show that across diverse allocation scenarios with different distributions of contiguous memory chunks, the proposed scheme can effectively reap the potential translation coverage improvement from existing contiguity.
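A sketch of the lookup logic, with simplified structures; the real design keeps anchors in the page table and caches them in an anchor TLB, and the fallback walk here is stubbed so the sketch is self-contained:

    #include <stdint.h>
    #include <stdio.h>

    /* An anchor entry records a virtual page number, its physical page
     * number, and how many pages after it are physically contiguous. */
    typedef struct {
        uint64_t vpn;
        uint64_t ppn;
        uint64_t contiguity;  /* pages after the anchor that are contiguous */
    } anchor_entry;

    /* Fallback full page walk, stubbed here for illustration. */
    uint64_t walk_page_table(uint64_t vpn) { return vpn + 0x10000; }

    uint64_t translate(uint64_t vpn, const anchor_entry *a) {
        uint64_t off = vpn - a->vpn;
        if (vpn >= a->vpn && off <= a->contiguity)
            return a->ppn + off;       /* one anchor entry serves the
                                          whole contiguous region      */
        return walk_page_table(vpn);   /* not covered: normal walk     */
    }

    int main(void) {
        anchor_entry a = { 0x1000, 0x8000, 15 };
        printf("0x%llx\n", (unsigned long long)translate(0x1004, &a)); /* 0x8004 */
        return 0;
    }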
******
In this paper, we introduce the Do-It-Yourself virtual memory translation (DVMT) architecture as a flexible complement for current hardware-fixed translation flows. DVMT decouples the virtual-to-physical mapping process from the access permissions, giving applications freedom in choosing mapping schemes, while maintaining security within the operating system. Furthermore, DVMT is designed to support virtualized environments, as a means to collapse the costly, hardware-assisted two-dimensional translations. We describe the architecture in detail and demonstrate its effectiveness by evaluating several different DVMT schemes on a range of virtualized applications with a model based on measurements from a commercial system. We show that different DVMT configurations preserve the native performance, while achieving speedups of 1.2x to 2.0x in virtualized environments.
******
With increasing deployment of virtual machines for cloud services and server applications, memory address translation overheads in virtualized environments have received great attention. With the radix-4 page tables used in x86 architectures, a TLB miss necessitates up to 24 memory references for a single guest-to-host translation. While dedicated page walk caches and other recent enhancements eliminate many of these memory references, our measurements on Intel Skylake processors indicate that many programs in virtualized execution still spend hundreds of cycles on translations that do not hit in the TLBs.
This paper presents an innovative scheme to reduce the cost of address translations by using a very large Translation Lookaside Buffer that is part of memory, the POM-TLB. In the POM-TLB, only one access is required instead of the up to 24 accesses required by commonly used 2D walks with radix-4 page tables. Even if many of the 24 accesses hit in the page walk caches, the aggregate cost of the many hits plus the overhead of occasional misses from page walk caches still exceeds the cost of one access to the POM-TLB. Since the POM-TLB is part of the memory space, TLB entries (as opposed to multiple page table entries) can be cached in the large L2 and L3 data caches, yielding significant benefits. Through detailed evaluation running SPEC, PARSEC, and graph workloads, we demonstrate that the proposed POM-TLB improves performance by approximately 10% on average. The improvement is more than 16% for 5 of the benchmarks. We further find that a 16MB POM-TLB can eliminate nearly all TLB misses in 8-core systems.
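The 24-reference figure follows from the nested-walk arithmetic: each of the 4 guest page-table pointers is a guest-physical address that itself needs a 4-level host walk, and the final guest physical address must be translated through the host table as well. A one-line check of that count:

    def two_d_walk_refs(guest_levels=4, host_levels=4):
        # a full host walk per guest level + the guest PTE reads themselves
        # + the host walk of the final guest physical address
        return guest_levels * host_levels + guest_levels + host_levels

    assert two_d_walk_refs() == 24  # the radix-4 x86 case cited above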
******
The commercial release of byte-addressable persistent memories, such as Intel/Micron 3D XPoint memory, is imminent. Ongoing research has sought mechanisms to allow programmers to implement recoverable data structures in these new main memories. Ensuring recoverability requires programmer control of the order of persistent stores; recent work proposes persistency models as an extension to memory consistency to specify such ordering. Prior work has considered persistency models at the abstraction of the instruction set architecture. Instead, we argue for extending the language-level memory model to provide guarantees on the order of persistent writes.
We explore a taxonomy of guarantees a language-level persistency model might provide, considering both atomicity and ordering constraints on groups of persistent stores. Then, we propose and evaluate Acquire-Release Persistency (ARP), a language-level persistency model for C++11. We describe how to compile code written for ARP to a state-of-the-art ISA-level persistency model. We then consider enhancements to the ISA-level persistency model that can distinguish memory consistency constraints required for proper synchronization but unnecessary for correct recovery. With these optimizations, we show that ARP increases performance by up to 33.2% (19.8% avg.) over coding directly to the baseline ISA-level persistency model for a suite of persistent-write-intensive workloads.
******
The same flexibility that makes dynamic scripting languages appealing to programmers is also the primary cause of their low performance. To access objects of potentially different types, the compiler creates a dispatcher with a series of if statements, each performing a comparison to a type and a jump to a handler. This induces major overhead in instructions executed and branches mispredicted.
This paper proposes architectural support to significantly improve the efficiency of accesses to objects. The idea is to modify the instruction that calls the dispatcher so that, under most conditions, it skips most of the branches and instructions needed to reach the correct handler, and sometimes even the execution of the handler itself. Our novel architecture, called ShortCut, performs two levels of optimization. Its Plain design transforms the call to the dispatcher into a call to the correct handler --- bypassing the whole dispatcher execution. Its Aggressive design transforms the call to the dispatcher into a simple load or store --- bypassing the execution of both dispatcher and handler. We implement the ShortCut software in the state-of-the-art Google V8 JIT compiler, and the ShortCut hardware in a simulator. We evaluate ShortCut with the Octane and SunSpider JavaScript application suites. Plain ShortCut reduces the average execution time of the applications by 30% running under the baseline compiler, and by 11% running under the maximum level of compiler optimization. Aggressive ShortCut performs only slightly better.
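The dispatcher pattern being bypassed is easy to see in software. The sketch below models a call site whose target is patched to the matched handler after the first dispatch, which is a software analogue of what Plain ShortCut does in hardware; all names are illustrative and this is not V8 code:

    def handler_smi(site, obj):     # handler for small integers
        return obj["value"] + 1

    def handler_string(site, obj):  # handler for strings
        return obj["value"] + "!"

    def dispatcher(site, obj):
        # The if-chain of type tests that costs instructions and mispredictions.
        if obj["type"] == "smi":
            site["target"] = handler_smi      # patch the call site
            return handler_smi(site, obj)
        if obj["type"] == "string":
            site["target"] = handler_string
            return handler_string(site, obj)
        raise TypeError(obj["type"])

    site = {"target": dispatcher}
    print(site["target"](site, {"type": "smi", "value": 41}))  # 42, via dispatcher
    print(site["target"](site, {"type": "smi", "value": 1}))   # 2, dispatcher skipped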
******
PHP is the dominant server-side scripting language used to implement dynamic web content. Just-in-time compilation, as implemented in Facebook's state-of-the-art HipHopVM, helps mitigate the poor performance of PHP, but substantial overheads remain, especially for realistic, large-scale PHP applications. This paper analyzes such applications and shows that there is little opportunity for conventional microarchitectural enhancements. Furthermore, prior approaches for function-level hardware acceleration present many challenges due to the extremely flat distribution of execution time across a large number of functions in these complex applications. In-depth analysis reveals a more promising alternative: targeted acceleration of four fine-grained PHP activities: hash table accesses, heap management, string manipulation, and regular expression handling. We highlight a set of guiding principles and then propose and evaluate inexpensive hardware accelerators for these activities that accrue substantial performance and energy gains across dozens of functions. Our results reflect an average 17.93% improvement in performance and 21.01% reduction in energy while executing these complex PHP workloads on a state-of-the-art software and hardware platform.
******
Heterogeneous memory management combined with server virtualization in datacenters is expected to increase the software and OS management complexity. State-of-the-art solutions rely exclusively on the hypervisor (VMM) for expensive page hotness tracking and migrations, limiting the benefits from heterogeneity. To address this, we design HeteroOS, a novel application-transparent OS-level solution for managing memory heterogeneity in virtualized systems. The HeteroOS design first makes the guest OSes heterogeneity-aware and then extracts rich OS-level information about applications' memory usage to place data in the 'right' memory, avoiding page migrations. When such proactive placements are not possible, HeteroOS combines the power of the guest OSes' information about applications with the VMM's hardware control to track hotness and migrate only performance-critical pages. Finally, HeteroOS also provides efficient heterogeneous memory sharing across multiple guest VMs. Evaluation of HeteroOS with memory, storage, and network-intensive datacenter applications shows up to 2x performance improvement compared to the state-of-the-art VMM-exclusive approach.
******
Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improve the performance and efficiency of CNNs. Current approaches construct a single processor that computes the CNN layers one at a time; the processor is optimized to maximize the throughput at which the collection of layers is computed. However, this approach leads to inefficient designs because the same processor structure is used to compute CNN layers of radically varying dimensions.
We present a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers. Using the same FPGA resources as a single large processor, multiple smaller specialized processors increase computational efficiency and lead to a higher overall throughput. Our design methodology achieves 3.8x higher throughput than the state-of-the-art approach when evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x.
******
As the size of Deep Neural Networks (DNNs) continues to grow to increase accuracy and solve more complex problems, their energy footprint also scales. Weight pruning reduces DNN model size and computation by removing redundant weights. However, we implemented weight pruning for several popular networks on a variety of hardware platforms and observed surprising results. For many networks, the network sparsity caused by weight pruning can actually hurt overall performance despite large reductions in the model size and required multiply-accumulate operations. Also, encoding the sparse format of pruned networks incurs additional storage space overhead. To overcome these challenges, we propose Scalpel, which customizes DNN pruning to the underlying hardware by matching the pruned network structure to the data-parallel hardware organization. Scalpel consists of two techniques: SIMD-aware weight pruning and node pruning. For low-parallelism hardware (e.g., microcontroller), SIMD-aware weight pruning maintains weights in aligned fixed-size groups to fully utilize the SIMD units. For high-parallelism hardware (e.g., GPU), node pruning removes redundant nodes, not redundant weights, thereby reducing computation without sacrificing the dense matrix format. For hardware with moderate parallelism (e.g., desktop CPU), SIMD-aware weight pruning and node pruning are synergistically applied together. Across the microcontroller, CPU, and GPU, Scalpel achieves mean speedups of 3.54x, 2.61x, and 1.25x while reducing the model sizes by 88%, 82%, and 53%. In comparison, traditional weight pruning achieves mean speedups of 1.90x, 1.06x, 0.41x across the three platforms.
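As a concrete illustration of SIMD-aware weight pruning, the NumPy sketch below removes whole SIMD-lane-sized groups of weights ranked by group magnitude, so the survivors stay packed for vector loads; the group size, keep ratio, and saliency metric are our assumptions, not Scalpel's exact criterion:

    import numpy as np

    def simd_aware_prune(weights, group=4, keep_ratio=0.5):
        """Zero out whole aligned groups instead of individual weights."""
        flat = weights.reshape(-1, group)        # aligned fixed-size groups
        score = np.linalg.norm(flat, axis=1)     # per-group saliency
        k = int(len(score) * keep_ratio)
        mask = np.zeros_like(flat)
        mask[np.argsort(score)[-k:]] = 1.0       # keep the strongest groups
        return (flat * mask).reshape(weights.shape)

    w = np.random.randn(8, 8)
    print(simd_aware_prune(w).reshape(-1, 4))    # zeros appear only in whole groups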
******
Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called Buckwild! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11x. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on an FPGA, which is a promising alternative to the CPU for future SGD systems.
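A minimal sketch of one low-precision SGD step, assuming the common choice of stochastically rounding gradients onto a fixed-point grid (the bit width and scaling rule here are placeholders, not the DMGC parameters themselves):

    import numpy as np

    rng = np.random.default_rng(0)

    def stochastic_round(x, scale):
        """Unbiased rounding to the low-precision grid; unbiasedness is what
        keeps low-precision SGD convergent in expectation."""
        y = x / scale
        lo = np.floor(y)
        return (lo + (rng.random(y.shape) < (y - lo))) * scale

    def sgd_step(w, grad, lr=0.1, bits=8):
        scale = np.abs(grad).max() / (2 ** (bits - 1) - 1) + 1e-12
        return w - lr * stochastic_round(grad, scale)  # quantized gradient update

    w = sgd_step(np.zeros(4), np.array([0.3, -1.0, 0.05, 0.7]))
    print(w)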
******
CPU-FPGA heterogeneous platforms offer a promising solution for high-performance and energy-efficient computing systems by providing specialized accelerators with post-silicon reconfigurability. To unleash the power of the FPGA, however, the programmability gap has to be filled so that applications specified in high-level programming languages can be efficiently mapped and scheduled on the FPGA. This problem is even more challenging for irregular applications, whose execution dependencies can only be determined at run time. As a result, existing approaches that rely on compile-time analysis to schedule the computation generate over-serialized accelerators.
In this work, we propose a comprehensive software-hardware co-design framework which captures parallelism in irregular applications and aggressively schedules pipelined execution on the reconfigurable platform. Built on an inherently parallel abstraction that packages parallelism for runtime scheduling, our framework significantly differs from existing works that tend to schedule execution at compile time. An irregular application is formulated as a set of tasks with their dependencies specified as rules describing the conditions under which a subset of tasks can be executed concurrently. Datapaths on the FPGA are then generated by transforming applications in this formulation into task pipelines orchestrated by evaluating rules at runtime, which can exploit fine-grained pipeline parallelism as handcrafted accelerators do.
An evaluation shows that this framework is able to produce datapaths whose quality is close to handcrafted designs. Experiments show that the generated accelerators are dramatically more efficient than those created by current high-level synthesis tools. Meanwhile, accelerators generated for a set of irregular applications attain 0.5x to 1.9x the performance of equivalent software implementations on a server-grade 10-core processor, with the memory subsystem remaining the bottleneck.
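A toy model of the rule-driven execution the framework builds in hardware: each "cycle", every task whose guard rule holds fires, so independent tasks overlap without any compile-time schedule. Task names, rules, and the single-cycle task latency are all illustrative:

    tasks = {t: {"deps": set(), "done": False} for t in "abcd"}
    tasks["c"]["deps"] = {"a"}           # c may start only after a commits
    tasks["d"]["deps"] = {"b", "c"}

    def ready(t):
        """Guard rule: the task has not run and all producers have committed."""
        return not tasks[t]["done"] and all(tasks[d]["done"] for d in tasks[t]["deps"])

    cycle = 0
    while not all(v["done"] for v in tasks.values()):
        fired = [t for t in tasks if ready(t)]  # rules evaluated at runtime
        for t in fired:
            tasks[t]["done"] = True
        cycle += 1
        print(f"cycle {cycle}: fired {fired}")
    # cycle 1 fires a and b together; c waits for a; d waits for b and c.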
******
Most systems that support speculative parallelization, like hardware transactional memory (HTM), do not support nested parallelism. This sacrifices substantial parallelism and precludes composing parallel algorithms. And the few HTMs that do support nested parallelism focus on parallelizing at the coarsest (shallowest) levels, incurring large overheads that squander most of their potential.
We present FRACTAL, a new execution model that supports unordered and timestamp-ordered nested parallelism. FRACTAL lets programmers seamlessly compose speculative parallel algorithms, and lets the architecture exploit parallelism at all levels. FRACTAL can parallelize a broader range of applications than prior speculative execution models. We design a FRACTAL implementation that extends the Swarm architecture and focuses on parallelizing at the finest (deepest) levels. Our approach sidesteps the issues of nested parallel HTMs and uncovers abundant fine-grain parallelism. As a result, FRACTAL outperforms prior speculative architectures by up to 88x at 256 cores.
******
Finite State Machines (FSM) are widely used computation models for many application domains. These embarrassingly sequential applications with irregular memory access patterns perform poorly on conventional von-Neumann architectures. The Micron Automata Processor (AP) is an in-situ memory-based computational architecture that accelerates non-deterministic finite automata (NFA) processing in hardware. However, each FSM on the AP is processed sequentially, limiting potential speedups.
In this paper, we explore the FSM parallelization problem in the context of the AP. Extending classical parallelization techniques to NFAs executing on AP is non-trivial because of high state-transition tracking overheads and exponential computation complexity. We present the associated challenges and propose solutions that leverage both the unique properties of the NFAs (connected components, input symbol ranges, convergence, common parent states) and unique features in the AP (support for simultaneous transitions, low-overhead flow switching, state vector cache) to realize parallel NFA execution on the AP.
We evaluate our techniques against several important benchmarks including NFAs used for network intrusion detection, malware detection, text processing, protein motif searching, DNA sequencing, and data analytics. Our proposed parallelization scheme demonstrates significant speedup (25.5x on average) compared to sequential execution on the AP. Prior work has already shown that sequential execution on the AP is at least an order of magnitude faster than GPUs, multi-core processors, and the Xeon Phi accelerator.
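For contrast, the classical enumerative approach to FSM parallelization (the baseline such work builds on) is sketched below: later input chunks are simulated from every possible start state, and the resulting per-chunk state maps are composed sequentially at the end. The toy transition function is ours:

    STATES = range(3)

    def step(state, sym):                 # toy DFA transition function
        return (state + ord(sym)) % 3

    def run_chunk(chunk):
        """Map each possible entry state to its exit state for this chunk;
        all chunks can be processed in parallel."""
        mapping = {}
        for s in STATES:
            cur = s
            for sym in chunk:
                cur = step(cur, sym)
            mapping[s] = cur
        return mapping

    text = "abcdefgh"
    chunks = [text[i:i + 4] for i in range(0, len(text), 4)]
    maps = [run_chunk(c) for c in chunks]  # the parallelizable part
    state = 0
    for m in maps:                         # cheap sequential composition
        state = m[state]
    assert state == run_chunk(text)[0]     # matches the fully sequential run

The state-transition tracking overhead mentioned above is visible here: every extra chunk multiplies work by the number of states, which is why the NFA properties and AP features the paper exploits are needed to make parallel execution pay off.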
******
Non-Volatile Memories (NVMs) can significantly improve the performance of data-intensive applications. A popular form of NVM is battery-backed DRAM, which is available and in use today, with DRAM's latency and without the endurance problems of emerging NVM technologies. Modern servers can be provisioned with up to 4 TB of DRAM, and provisioning battery backup to write out such large memories is hard because of the large battery sizes and the added hardware and cooling costs. We present Viyojit, a system that exploits the skew in write working sets of applications to provision substantially smaller batteries while still ensuring durability for the entire DRAM capacity. Viyojit achieves this by bounding the number of dirty pages in DRAM based on the provisioned battery capacity and proactively writing out infrequently written pages to an SSD. Even for write-heavy workloads with less skew than we observe in our analysis of real data center traces, Viyojit reduces the required battery capacity to 11% of the original size, with a performance overhead of 7-25%. Thus, Viyojit frees battery-backed DRAM from the stunted growth of battery capacities and enables servers with terabytes of battery-backed DRAM.
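The core invariant is simple to state in code: never let the count of dirty DRAM pages exceed what the provisioned battery can flush. A sketch, with a toy budget and an oldest-dirtied cleaning policy standing in for Viyojit's actual selection of infrequently written pages:

    from collections import OrderedDict

    BUDGET = 2               # dirty pages the battery can cover (toy value)
    dirty = OrderedDict()    # page -> True, in dirtying order

    def write(page, ssd_log):
        if page not in dirty and len(dirty) >= BUDGET:
            victim, _ = dirty.popitem(last=False)  # proactively clean a cold page
            ssd_log.append(victim)                 # written out to the SSD
        dirty[page] = True
        dirty.move_to_end(page)                    # most recently dirtied

    log = []
    for p in [1, 2, 1, 3, 4, 1]:
        write(p, log)
    print("dirty:", list(dirty), "cleaned to SSD:", log)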
******
This paper investigates compression for DRAM caches. As the capacity of a DRAM cache is typically large, prior techniques on cache compression, which solely focus on improving cache capacity, provide only a marginal benefit. We show that more performance benefit can be obtained if the compression of the DRAM cache is tailored to provide higher bandwidth. If a DRAM cache can provide two compressed lines in a single access, and both lines are useful, the effective bandwidth of the DRAM cache would double. Unfortunately, it is not straightforward to compress DRAM caches for bandwidth. The typically used Traditional Set Indexing (TSI) maps consecutive lines to consecutive sets, so the multiple compressed lines obtained from the set are from spatially distant locations and unlikely to be used within a short period of each other. We can change the indexing of the cache to place consecutive lines in the same set to improve bandwidth; however, when the data is incompressible, such spatial indexing reduces effective capacity and causes significant slowdown.
Ideally, we would like to have spatial indexing when the data is compressible and TSI otherwise. To this end, we propose Dynamic-Indexing Cache comprEssion (DICE), a dynamic design that can adapt between spatial indexing and TSI, depending on the compressibility of the data. We also propose low-cost Cache Index Predictors (CIP) that can accurately predict the cache indexing scheme on access in order to avoid probing both indices for retrieving a given cache line. Our studies with a 1GB DRAM cache, on a wide range of workloads (including SPEC and Graph), show that DICE improves performance by 19.0% and reduces energy-delay-product by 36% on average. DICE is within 3% of a design that has double the capacity and double the bandwidth. DICE incurs a storage overhead of less than 1KB and does not rely on any OS support.
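The two indexing schemes DICE adapts between differ only in how a line address is mapped to a set, as the sketch below shows for a toy direct-mapped view; the set count and the two-lines-per-access assumption are illustrative:

    NUM_SETS = 1024
    LINES_PER_SET = 2        # compressed neighbors fetched per access (assumption)

    def tsi_index(line_addr):
        return line_addr % NUM_SETS                     # neighbors spread apart

    def spatial_index(line_addr):
        return (line_addr // LINES_PER_SET) % NUM_SETS  # neighbors share a set

    for a in (100, 101):
        print(a, tsi_index(a), spatial_index(a))
    # Lines 100 and 101 land in different TSI sets but in the same spatial set,
    # which is what lets one compressed access serve both.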
******
The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling. As the performance density of traditional CPU-centric architectures stagnates, advancing compute capabilities necessitates novel architectural approaches. Near-memory processing (NMP) architectures are reemerging as promising candidates to improve computing efficiency through tight coupling of logic and memory. NMP architectures are especially fitting for data analytics, as they provide immense bandwidth to memory-resident data and dramatically reduce data movement, the main source of energy consumption.
Modern data analytics operators are optimized for CPU execution and hence rely on large caches and employ random memory accesses. In the context of NMP, such random accesses result in wasteful DRAM row buffer activations that account for a significant fraction of the total memory access energy. In addition, utilizing NMP's ample bandwidth with fine-grained random accesses requires complex hardware that cannot be accommodated under NMP's tight area and power constraints. Our thesis is that efficient NMP calls for an algorithm-hardware co-design that favors algorithms with sequential accesses to enable simple hardware that accesses memory in streams. We introduce an instance of such a co-designed NMP architecture for data analytics, the Mondrian Data Engine. Compared to a CPU-centric and a baseline NMP system, the Mondrian Data Engine improves the performance of basic data analytics operators by up to 49x and 5x, and efficiency by up to 28x and 5x, respectively.
******
Caches are traditionally organized as a rigid hierarchy, with multiple levels of progressively larger and slower memories. Hierarchy allows a simple, fixed design to benefit a wide range of applications, since working sets settle at the smallest (i.e., fastest and most energy-efficient) level they fit in. However, rigid hierarchies also add overheads, because each level adds latency and energy even when it does not fit the working set. These overheads are expensive on emerging systems with heterogeneous memories, where the differences in latency and energy across levels are small. Significant gains are possible by specializing the hierarchy to applications.
We propose Jenga, a reconfigurable cache hierarchy that dynamically and transparently specializes itself to applications. Jenga builds virtual cache hierarchies out of heterogeneous, distributed cache banks using simple hardware mechanisms and an OS runtime. In contrast to prior techniques that trade energy and bandwidth for performance (e.g., dynamic bypassing or prefetching), Jenga eliminates accesses to unwanted cache levels. Jenga thus improves both performance and energy efficiency. On a 36-core chip with a 1 GB DRAM cache, Jenga improves energy-delay product over a combination of state-of-the-art techniques by 23% on average and by up to 85%.
******
The trend of unsustainable power consumption and large memory bandwidth demands in massively parallel multicore systems, with the advent of the big data era, has ushered in alternative computation paradigms utilizing heterogeneity, specialization, processing-in-memory, and approximation. Approximate computing is being touted as a viable solution for high-performance computation by relaxing the accuracy constraints of applications. This trend has been accentuated by emerging data-intensive applications in domains like image/video processing, machine learning, and big data analytics that tolerate inaccurate outputs within an acceptable variance. Leveraging relaxed accuracy for high throughput in Networks-on-Chip (NoCs), which have rapidly become the accepted method for connecting a large number of on-chip components, has not yet been explored. We propose APPROX-NoC, a hardware data approximation framework with an online data error control mechanism for high-performance NoCs. APPROX-NoC facilitates approximate matching of data patterns, within a controllable value range, to compress them, thereby reducing the volume of data movement across the chip.
Our evaluation shows that APPROX-NoC achieves up to 9% latency reduction and 60% throughput improvement on average, compared with state-of-the-art NoC data compression mechanisms, while maintaining low application error. Additionally, with a data-intensive graph processing application, we achieve a 36.7% latency reduction compared to state-of-the-art compression mechanisms.
******
High-performance computing, enterprise, and datacenter servers are driving demands for higher total memory capacity as well as memory performance. Memory "cubes" with high per-package capacity (from 3D integration) along with high-speed point-to-point interconnects provide a scalable memory system architecture with the potential to deliver both capacity and performance. Multiple such cubes connected together can form a "Memory Network" (MN), but the design space for such MNs is quite vast, including multiple topology types and multiple memory technologies per memory cube.
In this work, we first analyze several MN topologies with different mixes of memory package technologies to understand the key tradeoffs and bottlenecks for such systems. We find that most of an MN's performance challenges arise from the interconnection network that binds the memory cubes together. In particular, the arbitration schemes used to route through MNs, the ratio of NVM to DRAM, and the specific topologies used have a dramatic impact on performance and energy results. Our initial analysis indicates that introducing non-volatile memory to the MN presents a unique tradeoff between memory array latency and network latency. We observe that placing NVM cubes in a specific order in the MN improves performance by reducing the network size/diameter up to a certain NVM-to-DRAM ratio. Novel MN topologies and arbitration schemes also provide performance and energy gains by reducing the hop count of requests and responses in the MN. Based on our analyses, we introduce three techniques to address MN latency issues: (1) a distance-based arbitration scheme to improve queuing latencies throughout the network, (2) a skip-list topology, derived from the classic data structure, to improve network latency and link usage, and (3) the MetaCube, a denser memory cube that leverages advanced packaging technologies to improve latency by reducing MN size.
******
Routing algorithms can improve network performance by maximizing routing adaptiveness but can be problematic in the presence of endpoint congestion. Tree saturation is a well-known behavior caused by endpoint congestion. Adaptive routing can, however, spread the congestion and result in thick branches of the congestion tree, creating Head-of-Line (HoL) blocking and degrading performance. In this work, we identify how ignoring virtual channels (VCs) and their occupancy during adaptive routing results in congestion trees with thick branches, as congestion is spread to all VCs. To address this limitation, we propose the Footprint routing algorithm, a new adaptive routing algorithm that minimizes the size of the congestion tree, both in terms of the number of nodes in the congestion tree and branch thickness. Footprint achieves this by regulating adaptiveness: if the network is congested, packets must follow the path of prior packets to the same destination instead of forking a new path or VC. Thus, the congestion tree is dynamically kept as slim as possible, reducing HoL blocking and congestion spreading while maintaining high adaptivity and maximizing VC buffer utilization. We evaluate the proposed Footprint routing algorithm against other adaptive routing algorithms, and our simulation results show that network saturation throughput can be improved by up to 43% (58%) compared with fully adaptive (partially adaptive) routing algorithms.
******
Hardware prefetching is an effective technique for hiding cache miss latencies in modern processor designs. Prefetcher performance can be characterized by two main metrics that are generally at odds with one another: coverage, the fraction of baseline cache misses which the prefetcher brings into the cache; and accuracy, the fraction of prefetches which are ultimately used. An overly aggressive prefetcher may improve coverage at the cost of reduced accuracy. Thus, performance may be harmed by this over-aggressiveness because many resources are wasted, including cache capacity and bandwidth. An ideal prefetcher would have both high coverage and accuracy.
In this paper, we introduce Perceptron-based Prefetch Filtering (PPF) as a way to increase the coverage of the prefetches generated by an underlying prefetcher without negatively impacting accuracy. PPF enables more aggressive tuning of the underlying prefetcher, leading to increased coverage by filtering out the growing numbers of inaccurate prefetches such an aggressive tuning implies. We also explore a range of features to use to train PPF’s perceptron layer to identify inaccurate prefetches. PPF improves performance on a memory-intensive subset of the SPEC CPU 2017 benchmarks by 3.78% for a single-core configuration, and by 11.4% for a 4-core configuration, compared to the underlying prefetcher alone.
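A hashed-perceptron filter in the spirit of PPF can be sketched in a few lines: each feature indexes a small weight table, and the weight sum decides whether a candidate prefetch is issued. The features, table sizes, and threshold below are placeholders rather than the paper's tuned configuration:

    import numpy as np

    TABLE_BITS, THRESHOLD = 10, 0
    tables = [np.zeros(1 << TABLE_BITS, dtype=np.int32) for _ in range(3)]

    def features(pc, addr, delta):
        # Example features; PPF explores a larger set.
        return [hash(pc) & 1023, hash(addr >> 6) & 1023, hash(delta) & 1023]

    def predict(pc, addr, delta):
        s = sum(t[i] for t, i in zip(tables, features(pc, addr, delta)))
        return s >= THRESHOLD          # issue this candidate prefetch?

    def train(pc, addr, delta, was_useful):
        for t, i in zip(tables, features(pc, addr, delta)):
            t[i] += 1 if was_useful else -1  # saturating counters in hardware

    train(0x400812, 0xDEAD00, 64, was_useful=False)
    print(predict(0x400812, 0xDEAD00, 64))   # False: now filtered out

Because only the filter is trained, the underlying prefetcher can be tuned aggressively for coverage while the perceptron layer absorbs the accuracy loss.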
******
Processors that adapt architecture to workloads at runtime promise compelling performance per watt (PPW) gains, offering one way to mitigate diminishing returns from pipeline scaling. State-of-the-art adaptive CPUs deploy machine learning (ML) models on-chip to optimize hardware by recognizing workload patterns in event counter data. However, despite breakthrough PPW gains, such designs are not yet widely adopted due to the potential for systematic adaptation errors in the field.
This paper presents an adaptive CPU based on Intel Skylake that (1) closes the loop to deployment, and (2) provides a novel mechanism for post-silicon customization. Our CPU performs predictive cluster gating, dynamically setting the issue width of a clustered architecture while clock-gating unused resources. Gating decisions are driven by ML adaptation models that execute on an existing microcontroller, minimizing design complexity and allowing performance characteristics to be adjusted with the ease of a firmware update. Crucially, we show that although adaptation models can suffer from statistical blindspots that risk degrading performance on new workloads, these can be reduced to minimal impact with careful design and training.
Our adaptive CPU improves PPW by 31.4% over a comparable non-adaptive CPU on SPEC2017, and exhibits two orders of magnitude fewer Service Level Agreement (SLA) violations than the state-of-the-art. We show how to optimize PPW using models trained to different SLAs or to specific applications, e.g. to improve datacenter hardware in situ. The resulting CPU meets real world deployment criteria for the first time and provides a new means to tailor hardware to individual customers, even as their needs change.
******
Modern software uses indirect branches for various purposes including, but not limited to, virtual method dispatch and implementation of switch statements. Because an indirect branch's target address cannot be determined prior to execution, high-performance processors depend on highly-accurate indirect branch prediction techniques to mitigate control hazards.
This paper proposes a new indirect branch prediction scheme that predicts target addresses at the bit level. Using a series of perceptron-based predictors, our predictor predicts individual branch target address bits based on correlations within branch history. Our evaluations show this new branch target predictor is competitive with state-of-the-art branch target predictors at an equivalent hardware budget. For instance, over a set of workloads including SPEC and mobile applications, our predictor achieves a misprediction rate of 0.183 mispredictions per 1000 instructions, compared with 0.193 for the state-of-the-art ITTAGE predictor and 0.29 for a VPC-based indirect predictor.
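A minimal sketch of the bit-level idea: one tiny perceptron per target address bit, each fed the same branch-history vector. History length, bit count, and the training loop are illustrative choices, not the paper's configuration:

    import numpy as np

    HIST, BITS = 16, 8                   # history length, predicted target bits
    W = np.zeros((BITS, HIST + 1), dtype=np.int32)  # +1 bias weight per bit

    def predict_target(history):         # history: vector of +/-1, length HIST
        x = np.append(history, 1)
        bits = (W @ x) >= 0              # each perceptron votes on one bit
        return sum(int(b) << i for i, b in enumerate(bits))

    def train(history, actual_target):
        x = np.append(history, 1)
        for i in range(BITS):
            t = 1 if (actual_target >> i) & 1 else -1
            if ((W[i] @ x) >= 0) != (t == 1):   # update only on a mispredict
                W[i] += t * x

    h = np.where(np.random.default_rng(1).random(HIST) < 0.5, 1, -1)
    for _ in range(4):
        train(h, 0xA5)
    print(hex(predict_target(h)))        # converges to 0xa5 for this history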
******
Machine learning and artificial intelligence are invaluable for computer systems optimization: as computer systems expose more resources for management, ML/AI is necessary for modeling these resources' complex interactions. The standard way to incorporate ML/AI into a computer system is to first train a learner to accurately predict the system's behavior as a function of resource usage (e.g., to predict energy efficiency as a function of core usage) and then deploy the learned model as part of a system (e.g., a scheduler). In this paper, we show that (1) continued improvement of learning accuracy may not improve the systems result, but (2) incorporating knowledge of the systems problem into the learning process improves the systems results even though it may not improve overall accuracy. Specifically, we learn application performance and power as a function of resource usage with the systems goal of meeting latency constraints with minimal energy. We propose a novel generative model which improves learning accuracy given scarce data, and we propose a multi-phase sampling technique, which incorporates knowledge of the systems problem. Our results are both positive and negative. The generative model improves accuracy, even for state-of-the-art learning systems, but negatively impacts energy. Multi-phase sampling reduces energy consumption compared to the state-of-the-art, but does not improve accuracy. These results imply that learning for systems optimization may have reached a point of diminishing returns where accuracy improvements have little effect on the systems outcome. Thus we advocate that future work on learning for systems should de-emphasize accuracy and instead incorporate the system problem's structure into the learner.
******
With their strong computation capability, NUMA-based multi-GPU systems are promising candidates to provide sustainable and scalable performance for Virtual Reality (VR) applications and deliver an excellent user experience. However, the entire multi-GPU system is viewed as a single GPU under the single programming model, which largely ignores the data locality among VR rendering tasks during workload distribution, leading to tremendous remote memory accesses among GPU modules (GPMs). The limited inter-GPM link bandwidth (e.g., 64 GB/s for NVLink) becomes the major obstacle when executing VR applications in the multi-GPU system. By conducting comprehensive characterizations on different kinds of parallel rendering frameworks, we observe that distributing a rendering object along with its required data per GPM can reduce inter-GPM memory accesses. However, this object-level rendering still faces two major challenges in NUMA-based multi-GPU systems: (1) the strong data locality between the left and right views of the same object and the data sharing among different objects, and (2) the unbalanced workloads induced by the software-level distribution and composition mechanisms.
To tackle these challenges, we propose an object-oriented VR rendering framework (OO-VR) that conducts software-hardware co-optimization to provide a NUMA-friendly solution for VR multi-view rendering in NUMA-based multi-GPU systems. We first propose an object-oriented VR programming model to exploit the data sharing between the two views of the same object and group objects into batches based on their texture sharing levels. Then, we design an object-aware runtime batch distribution engine and a distributed hardware composition unit to achieve balanced workloads among GPMs and further improve the performance of VR rendering. Finally, evaluations on our VR-featured simulator show that OO-VR provides a 1.58x overall performance improvement and 76% inter-GPM memory traffic reduction over state-of-the-art multi-GPU systems. In addition, OO-VR provides NUMA-friendly performance scalability for future larger multi-GPU scenarios with ever-increasing asymmetric bandwidth between local and remote memory.
******
Web applications are gradually shifting toward resource-constrained mobile devices. As a result, the Web runtime system must simultaneously address two challenges: responsiveness and energy-efficiency. Conventional Web runtime systems fall short due to their reactive nature: they react to a user event only after it is triggered. The reactive strategy leads to local optimizations that schedule event executions one at a time, missing global optimization opportunities.
This paper proposes Proactive Event Scheduling (PES). The key idea of PES is to proactively anticipate future events and thereby globally coordinate scheduling decisions across events. Specifically, PES predicts events that are likely to happen in the near future using a combination of statistical inference and application code analysis. PES then speculatively executes future events ahead of time in a way that satisfies the QoS constraints of all the events while minimizing the global energy consumption. Fundamentally, PES unlocks more optimization opportunities by enlarging the scheduling window, which enables coordination across both outstanding events and predicted events. Hardware measurements show that PES reduces the QoS violation and energy consumption by 61.2% and 26.5%, respectively, over Android's default Interactive CPU governor. It also reduces the QoS violation and energy consumption by 63.1% and 17.9%, respectively, compared to EBS, a state-of-the-art reactive scheduler.
******
Virtual reality (VR) has huge potential to enable radically new applications, and spherical panoramic video processing is one of its backbone techniques. However, current VR systems reuse the techniques designed for processing conventional planar videos, resulting in significant energy inefficiencies. Our characterizations show that operations unique to processing 360° VR content constitute 40% of the total processing energy consumption.
We present EVR, an end-to-end system for energy-efficient VR video processing. EVR recognizes that the major contributor to the VR tax is the projective transformation (PT) operations. EVR mitigates the overhead of PT through two key techniques: semantic-aware streaming (SAS) on the server and hardware-accelerated rendering (HAR) on the client device. EVR uses SAS to reduce the chances of executing projective transformation on VR devices by pre-rendering 360° frames in the cloud. Different from conventional pre-rendering techniques, SAS exploits key semantic information inherent in VR content that was previously ignored. Complementary to SAS, HAR mitigates the energy overhead of on-device rendering through a new hardware accelerator that is specialized for projective transformation. We implement an EVR prototype on an Amazon AWS server instance and an NVIDIA Jetson TX2 board combined with a Xilinx Zynq-7000 FPGA. Real system measurements show that EVR reduces the energy of VR rendering by up to 58%, which translates to up to 42% energy saving for VR devices.
******
Non-Volatile Memory is here and provides an attractive fabric for main memory. Unlike DRAM, non-volatile main memory (NVMM) retains data after power loss. This allows memory to host data persistently across crashes and reboots, but opens up opportunities for attackers to snoop and/or tamper with data between boot episodes. While memory encryption and integrity verification have been well studied for DRAM systems, new challenges surface for NVMM if we want to simultaneously preserve security guarantees, data recovery across crashes/reboots, good persistence performance, and fast recovery.
In this paper, we explore persistency of data with all security metadata (counters, MACs, and Merkle Tree) to achieve secure persistency. We show that to ensure security guarantees, the message authentication code (MAC) and Bonsai Merkle Tree (BMT) need to be maintained in addition to counters, and they account for the majority of the persistency overheads. We analyze the requirements for achieving secure persistency for both persistent and non-persistent memory regions. We find that the non-volatile nature of memory may trigger integrity verification failure at reboot, hence we propose a separate mechanism to support non-persistent memory regions. Finally, we propose designs to make recovery fast. Our evaluation shows that the proposed design, Triad-NVM, can improve throughput by an average of 2× relative to strict persistence. Moreover, Triad-NVM can achieve orders of magnitude faster recovery time compared to systems without security metadata persistence.
******
Graph analytics play a key role in a number of applications such as social networks, drug discovery, and recommendation systems. Given the large size of graphs that may exceed the capacity of the main memory, application performance is bounded by storage access time. Out-of-core graph processing frameworks try to tackle this storage access bottleneck through techniques such as graph sharding and sub-graph partitioning. Even with these techniques, the need to access data across different graph shards or sub-graphs causes storage systems to become a significant performance hurdle. In this paper, we propose a graph semantic aware solid state drive (SSD) framework, called GraphSSD, which is a full system solution for storing, accessing, and performing graph analytics on SSDs. Rather than treating storage as a collection of blocks, GraphSSD considers graph structure while deciding on graph layout, access, and update mechanisms. GraphSSD replaces the conventional logical to physical page mapping mechanism in an SSD with a novel vertex-to-page mapping scheme and exploits the detailed knowledge of the flash properties to minimize page accesses. GraphSSD also supports efficient graph updates (vertex and edge modifications) by minimizing unnecessary page movement overheads. GraphSSD provides a simple programming interface that enables application developers to access graphs as native data in their applications, thereby simplifying the code development. It also augments the NVMe (non-volatile memory express) interface with a minimal set of changes to map the graph access APIs to appropriate storage access mechanisms.
Our evaluation results show that the GraphSSD framework improves the performance by up to 1.85× for the basic graph data fetch functions and on average 1.40×, 1.42×, 1.60×, 1.56×, and 1.29× for the widely used breadth-first search, connected components, random-walk, maximal independent set, and page rank applications, respectively.
******
DRAM has been the dominant technology for architecting main memory for decades. Recent trends in multi-core system design and large-dataset applications have amplified the role of DRAM as a critical system bottleneck. We propose Copy-Row DRAM (CROW), a flexible substrate that enables new mechanisms for improving DRAM performance, energy efficiency, and reliability. We use the CROW substrate to implement 1) a low-cost in-DRAM caching mechanism that lowers DRAM activation latency to frequently-accessed rows by 38% and 2) a mechanism that avoids the use of short-retention-time rows to mitigate the performance and energy overhead of DRAM refresh operations. CROW's flexibility allows the implementation of both mechanisms at the same time. Our evaluations show that the two mechanisms synergistically improve system performance by 20.0% and reduce DRAM energy by 22.3% for memory-intensive four-core workloads, while incurring 0.48% extra area overhead in the DRAM chip and 11.3 KiB storage overhead in the memory controller, and consuming 1.6% of DRAM storage capacity, for one particular implementation.
******
Non-volatile memory (NVM) technologies can manipulate persistent data directly in memory. Ensuring crash consistency of persistent data enforces that data updates reach all the way to NVM, which puts these write requests on the critical path. Recent literature sought to reduce this performance impact. However, prior works have not fully accounted for all the backend memory operations (BMOs) performed at the memory controller that are necessary to maintain persistent data in NVM. These BMOs include support for encryption, integrity protection, compression, deduplication, etc., necessary to provide security, endurance, and lifetime guarantees. These BMOs significantly increase the NVM write latency and exacerbate the performance degradation caused by the critical write requests. The goal of this work is to minimize the BMO overhead of write requests in an NVM system.
The central challenge is to figure out how to optimize these seemingly dependent and monolithic BMOs. Our key insight is to decompose each BMO into a series of sub-operations and then reduce their overall latency through two mechanisms: (i) parallelize sub-operations across BMOs and (ii) pre-execute sub-operations off the critical path as soon as their inputs are ready. We expose a generic software interface that can be used to issue pre-execution requests compatible with common crash-consistency programming models and various BMOs. Based on these ideas, we propose Janus, a hardware-software co-design that parallelizes and pre-executes BMOs in an NVM system. We evaluate Janus in an NVM system that integrates encryption, integrity verification, and deduplication and issues pre-execution requests through the proposed software interface, either manually or using an automated compiler pass. Compared to a system that performs these operations serially, Janus achieves 2.35× and 2.00× speedup using manual and automated instrumentation, respectively.
******
Implementing secure Non-Volatile Memories (NVMs) is challenging, mainly due to the necessity of persisting security metadata along with data. Unlike conventional secure memories, NVM-equipped systems are expected to recover data after crashes, and hence the security metadata must be recoverable as well. While prior work has explored recovery of encryption counters, fewer efforts have focused on recovering integrity-protected systems, in particular on how to recover the Merkle Tree. We observe two major challenges. First, recovering parallelizable integrity trees, e.g., Intel's SGX trees, requires special handling due to inter-level dependencies. Second, recovery at practical NVM sizes (terabytes are expected) would take hours. Most data centers, cloud systems, intermittent-power devices, and even personal computers are anticipated to recover almost instantly after power restoration; in fact, this is one of the major promises of NVMs. In this paper, we propose Anubis, a novel hardware-only solution that speeds up recovery time by nearly six orders of magnitude (from 8 hours to only 0.03 seconds). Moreover, we propose a novel and elegant way to recover inter-level dependent trees, as in Intel's SGX. Most importantly, while ensuring recoverability of one of the most challenging integrity-protection schemes, Anubis incurs performance overhead that is only 2% higher than the state-of-the-art scheme, Osiris, which takes hours to recover systems with a general Merkle Tree and fails to recover SGX-style trees.
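The hours-long baseline is easy to see: with no recoverable tree state, a crash forces rehashing all of memory to rebuild the integrity tree. A toy version of that naive recovery is below, with Anubis's contribution noted in the final comment (our paraphrase of the abstract, not its mechanism in detail):

    import hashlib

    def h(x):
        return hashlib.sha256(x).digest()

    def rebuild_root(blocks):
        """Naive post-crash recovery: hash every block, then every level."""
        level = [h(b) for b in blocks]   # O(N) leaf hashes over all of memory
        while len(level) > 1:
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    blocks = [bytes([i]) * 64 for i in range(8)]  # 8 toy memory blocks
    print(rebuild_root(blocks).hex()[:16])
    # Anubis instead keeps track of the small, recently modified part of the
    # metadata, so recovery touches only that part rather than every block.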
******
Mobile systems-on-chips (SoCs) have become ubiquitous computing platforms, and, in
recent years, they have become increasingly heterogeneous and complex. A typical SoC
includes CPUs, graphics processor units (GPUs), image processors, video encoders/decoders,
AI engines, digital signal processors (DSPs) and 2D engines among others [33, 70,
71]. One of the most significant SoC units in terms of both off-chip memory bandwidth
and SoC die area is the GPU. In this paper, we present Emerald, a simulator that builds
on existing tools to provide a unified model for graphics and GPGPU applications.
Emerald enables OpenGL (v4.5) and OpenGL ES (v3.2) shaders to run on GPGPU-Sim's timing
model and is integrated with gem5 and Android to simulate full SoCs. Emerald thus
provides a platform for studying system-level SoC interactions while including the
impact of graphics.We present two case studies using Emerald. First, we use Emerald's
full-system mode to highlight the importance of system-wide interactions by studying
and analyzing memory organization and scheduling schemes for SoC systems. Second,
we use Emerald's standalone mode to evaluate a novel mechanism for balancing the graphics
shading work assigned to each GPU core.
******
Modern GPUs suffer from cache contention due to the limited cache size that is shared across tens of concurrently running warps. To increase the per-warp cache size, prior techniques proposed warp throttling, which limits the number of active warps. Warp throttling leaves several registers dynamically unused whenever a warp is throttled. Given the stringent cache size limitation in GPUs, this work proposes a new cache management technique named Linebacker (LB) that improves GPU performance by utilizing idle register file space as victim cache space. Whenever a CTA becomes inactive, Linebacker backs up the registers of the throttled CTA to off-chip memory. Then, Linebacker utilizes the corresponding register file space as victim cache space. If any load instruction finds data in the victim cache line, the data is directly copied to the destination register through a simple register-register move operation. To further improve the efficiency of the victim cache, Linebacker allocates victim cache space only to a select few load instructions that exhibit high data locality. Through a careful design of the victim cache indexing and management scheme, Linebacker provides a 29.0% speedup over previously proposed warp throttling techniques.
******
The rapidly growing popularity and scale of data-parallel workloads demand a corresponding
increase in raw computational power of Graphics Processing Units (GPUs). As single-GPU
platforms struggle to satisfy these performance demands, multi-GPU platforms have
started to dominate the high-performance computing world. The advent of such systems
raises a number of design challenges, including the GPU microarchitecture, multi-GPU
interconnect fabric, runtime libraries, and associated programming models. The research
community currently lacks a publicly available and comprehensive multi-GPU simulation
framework to evaluate next-generation multi-GPU system designs. In this work, we present
MGPUSim, a cycle-accurate, extensively validated, multi-GPU simulator, based on AMD's
Graphics Core Next 3 (GCN3) instruction set architecture. MGPUSim comes with in-built
support for multi-threaded execution to enable fast, parallelized, and accurate simulation.
In terms of performance accuracy, MGPUSim differs by only 5.5% on average from the
actual GPU hardware. We also achieve a 3.5x and a 2.5x average speedup running functional
emulation and detailed timing simulation, respectively, on a 4-core CPU, while delivering
the same accuracy as serial simulation. We illustrate the flexibility and capability
of the simulator through two concrete design studies. In the first, we propose the
Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in the multi-GPU
memory. In the second design study, we propose Progressive Page Splitting Migration (PASI), a customized multi-GPU memory management system enabling
the hardware to progressively improve data placement. For a discrete 4-GPU system,
we observe that the Locality API can speed up the system by 1.6x (geometric mean),
and PASI can improve the system performance by 2.6x (geometric mean) across all benchmarks,
compared to a unified 4-GPU platform.
******
Data transfer overhead between computing cores and memory hierarchy has been a persistent
issue for von Neumann architectures and the problem has only become more challenging
with the emergence of manycore systems. A conceptually powerful approach to mitigate
this overhead is to bring the computation closer to data, known as Near Data Computing
(NDC). Recently, NDC has been investigated in different flavors for CPU-based multicores,
while the GPU domain has received little attention. In this paper, we present a novel
NDC solution for GPU architectures with the objective of minimizing on-chip data transfer
between the computing cores and Last-Level Cache (LLC). To achieve this, we first
identify frequently occurring Load-Compute-Store instruction chains in GPU applications.
These chains, when offloaded to a compute unit closer to where the data resides, can
significantly reduce data movement. We develop two offloading techniques, called LLC-Compute
and Omni-Compute. The first technique, LLC-Compute, augments the LLCs with computational
hardware for handling the computation offloaded to them. The second technique (Omni-Compute)
employs simple bookkeeping hardware to enable GPU cores to compute instructions offloaded
by other GPU cores. Our experimental evaluations on nine GPGPU workloads indicate
that the LLC-Compute technique provides, on average, 19% performance improvement
(IPC), 11% performance/watt improvement, and 29% reduction in on-chip data movement
compared to the baseline GPU design. The Omni-Compute design boosts these benefits
to 31%, 16% and 44%, respectively.
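The offload candidates are recognizable by a simple pattern over the instruction stream: a load feeding one ALU operation whose result is immediately stored. A toy scan for such chains follows; the pseudo-ISA and tuple fields are made up for illustration:

    trace = [
        ("LD",  "r1", "0x100"),
        ("ADD", "r2", "r1", "r3"),
        ("ST",  "r2", "0x100"),
        ("MUL", "r4", "r5", "r6"),
    ]

    def find_ld_op_st(trace):
        chains = []
        for i in range(len(trace) - 2):
            ld, op, st = trace[i], trace[i + 1], trace[i + 2]
            if (ld[0] == "LD" and st[0] == "ST"
                    and ld[1] in op[2:]      # the load result feeds the ALU op
                    and st[1] == op[1]):     # the ALU result is what gets stored
                chains.append((i, op[0]))
        return chains

    print(find_ld_op_st(trace))   # [(0, 'ADD')] -> candidate to run near the LLC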
******
Memory capacity in GPGPUs is a major challenge for data-intensive applications with
their ever increasing memory requirement. To fit a workload into the limited GPU memory
space, a programmer needs to manually divide the workload by tiling the working set
and perform user-level data migration. To relieve the programmer from this burden,
Unified Virtual Memory (UVM) was developed to support on-demand paging and migration,
transparent to the user. It further takes care of the memory over-subscription issue
by automatically performing page replacement in an oversubscribed GPU memory situation.
However, we found that naïve handling of page faults can cause orders-of-magnitude slowdown in performance. Moreover, we observed that although prefetching of data from
CPU to GPU can hide the page fault latency, the difference among various prefetching
mechanisms can lead to drastically different performance results. To this end, we
performed extensive experiments on GeForce GTX 1080 Ti GPUs with PCIe 3.0 x16 to discover
that there exists an effective prefetch mechanism to enhance locality in GPU memory.
However, as the GPU memory is filled to its capacity, such a prefetching mechanism quickly proves counterproductive due to the locality-unaware eviction policy. This necessitates the design of new eviction policies that are aware of the hardware prefetcher's semantics.
We propose two new programmer-agnostic, locality-aware pre-eviction policies which
leverage the mechanics of existing hardware prefetcher and thus incur no additional
implementation and performance overhead. We demonstrate that combining the proposed
tree-based pre-eviction policy with the hardware prefetcher provides an average of
93% and 18.5% performance speed-up compared to LRU based 4KB and 2MB page replacement
strategies, respectively. We further examine the memory access pattern of GPU workloads
under consideration to analyze the achieved performance speed-up.
******
Exploiting model sparsity to reduce ineffectual computation is a commonly used approach
to achieve energy efficiency for DNN inference accelerators. However, due to the tightly
coupled crossbar structure, exploiting sparsity for ReRAM-based NN accelerators is a less explored area. Existing architectural studies on ReRAM-based NN accelerators
assume that an entire crossbar array can be activated in a single cycle. However,
due to inference accuracy considerations, matrix-vector computation must be conducted at a smaller granularity in practice, called an Operation Unit (OU). An OU-based architecture
creates a new opportunity to exploit DNN sparsity. In this paper, we propose the first
practical Sparse ReRAM Engine that exploits both weight and activation sparsity. Our
evaluation shows that the proposed method is effective in eliminating ineffectual
computation, and delivers significant performance improvement and energy savings.
******
Memory-augmented neural networks are getting more attention from many researchers
as they can make an inference with the previous history stored in memory. In particular, among these memory-augmented neural networks, memory networks are known for their huge reasoning power and their capability to learn from a large number of inputs, compared to other networks. As the size of input datasets rapidly grows, the necessity of
large-scale memory networks continuously arises. Such large-scale memory networks
provide excellent reasoning power; however, the current computer infrastructure cannot
achieve scalable performance due to its limited system architecture. In this paper,
we propose MnnFast, a novel system architecture for large-scale memory networks to
achieve fast and scalable reasoning performance. We identify the performance problems
of the current architecture by conducting extensive performance bottleneck analysis.
Our in-depth analysis indicates that the current architecture suffers from three major
performance problems: high memory bandwidth consumption, heavy computation, and cache
contention. To overcome these performance problems, we propose three novel optimizations.
First, to reduce the memory bandwidth consumption, we propose a new column-based algorithm
with streaming which minimizes the size of data spills and hides most of the off-chip
memory accessing overhead. Second, to decrease the high computational overhead, we
propose a zero-skipping optimization to bypass a large amount of output computation.
Lastly, to eliminate the cache contention, we propose an embedding cache dedicated
to efficiently caching the embedding matrix. Our evaluations show that MnnFast is
significantly effective on various types of hardware: CPU, GPU, and FPGA. MnnFast
improves the overall throughput by up to 5.38x, 4.34x, and 2.01x on CPU, GPU, and
FPGA, respectively. Also,
compared to CPU-based MnnFast, our FPGA-based MnnFast achieves 6.54x higher energy
efficiency.
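A hedged sketch of the zero-skipping optimization in isolation (a toy single-hop
model; the threshold eps and the softmax formulation are our assumptions):

import numpy as np

def hop(query, in_mem, out_mem, eps=1e-6):
    """One memory-network hop with zero-skipping on the output side."""
    scores = in_mem @ query                    # match query against inputs
    p = np.exp(scores - scores.max())
    p /= p.sum()
    keep = p > eps                             # weights that round to ~0
    # Zero-skipping: rows with negligible attention weight contribute
    # nothing, so their output-memory reads and multiplies are bypassed.
    return p[keep] @ out_mem[keep]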
******
In the era of artificial intelligence (AI), deep neural networks (DNNs) have emerged
as the most important and powerful AI technique. However, large DNN models are both
storage and computation intensive, posing significant challenges for adopting DNNs
in resource-constrained scenarios. Thus, model compression becomes a crucial technique
to ensure wide deployment of DNNs. This paper advances the state of the art by considering
tensor train (TT) decomposition, a very promising but as yet unexplored compression
technique in the architecture domain. The method features an extremely high compression
ratio. However, the challenge is that inference on TT-format DNN models inherently
incurs a massive amount of redundant computation, causing significant energy consumption.
Thus, the straightforward application of TT decomposition is not feasible. To address
this fundamental challenge, this paper develops a computation-efficient inference
scheme for TT-format DNNs, which enjoys two key merits: 1) it achieves the theoretical
limit on the number of multiplications, thus eliminating all redundant computations; and
2) its multi-stage processing scheme reduces the intensive memory accesses to all tensor
cores, bringing significant energy savings. Based on the novel inference scheme, we
develop TIE, a TT-format compressed DNN-targeted inference engine. TIE is highly flexible,
supporting different types of networks for different needs. A 16-processing-element
(PE) prototype is implemented using CMOS 28nm technology. Operating at 1000 MHz, the
TIE accelerator occupies 1.74 mm2 and consumes 154.8 mW. Compared with EIE, TIE achieves 7.22x
~ 10.66x better area efficiency and 3.03x ~ 4.48x better energy efficiency on different
workloads, respectively. Compared with CirCNN, TIE achieves 5.96x and 4.56x higher
throughput and energy efficiency, respectively. The results show that TIE exhibits
significant advantages over state-of-the-art solutions.
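For intuition about why TT-format inference need not repeat work, here is a minimal
NumPy sketch of a TT matrix-vector product that contracts each tensor core exactly
once (our illustration under assumed core layout and row-major mode ordering, not
TIE's dataflow):

import numpy as np

def tt_matvec(cores, x):
    # cores[k] has shape (r_prev, m_k, n_k, r_next) with r_0 = r_d = 1;
    # x has length prod(n_k), multi-indexed row-major.
    z = x.reshape(1, -1, 1)        # (rank, unprocessed inputs, outputs so far)
    for core in cores:
        r_prev, m, n, r_next = core.shape
        z = z.reshape(r_prev, n, -1, z.shape[2])
        # Contract the rank and the current input mode in one step, so
        # every partial product is computed exactly once.
        z = np.einsum('pmnq,pnrM->qrMm', core, z)
        z = z.reshape(r_next, z.shape[1], -1)
    return z.ravel()               # y = W @ x, outputs ordered (m_1,...,m_d)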
******
Reinforcement learning (RL) has attracted much attention recently, as new and emerging
AI-based applications demand the capability to intelligently react to environment
changes. Unlike distributed deep neural network (DNN) training, distributed RL
training has unique workload characteristics - it generates orders of magnitude
more iterations with much smaller-sized but more frequent gradient aggregations. More
specifically, our study with typical RL algorithms shows that their distributed training
is latency critical and that the network communication for gradient aggregation occupies
up to 83.2% of the execution time of each training iteration. In this paper, we present
iSwitch, an in-switch acceleration solution that moves gradient aggregation from
server nodes into the network switches, thereby reducing the number of network hops
for gradient aggregation. This not only reduces the end-to-end network latency for
synchronous training, but also improves the convergence with faster weight updates
for asynchronous training. Upon the in-switch accelerator, we further reduce the synchronization
overhead by conducting on-the-fly gradient aggregation at the granularity of network
packets rather than gradient vectors. Moreover, we rethink the distributed RL training
algorithms and also propose a hierarchical aggregation mechanism to further increase
the parallelism and scalability of distributed RL training at rack scale. We implement
iSwitch on a real-world programmable switch, a NetFPGA board. We extend the control
and data plane of the programmable switch to support iSwitch without affecting its
regular network functions. Compared with state-of-the-art distributed training approaches,
iSwitch offers a system-level speedup of up to 3.66x for synchronous distributed training
and 3.71x for asynchronous distributed training, while achieving better scalability.
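The packet-granularity aggregation can be pictured with a small Python model
(hypothetical framing; the worker count, sequence numbering, and chunking are
assumptions, not the NetFPGA implementation):

def aggregate(packets, n_workers=4):
    """packets: iterable of (worker_id, seq, chunk) in arrival order,
    where each worker sends every chunk of its gradient exactly once."""
    acc, seen = {}, {}
    for worker_id, seq, chunk in packets:
        acc[seq] = acc.get(seq, 0) + chunk     # add as each packet arrives
        seen[seq] = seen.get(seq, 0) + 1
        if seen[seq] == n_workers:             # all workers contributed
            del seen[seq]
            yield seq, acc.pop(seq)            # aggregated packet leaves switch

The point of the model: the switch forwards each aggregated packet as soon as its
last contribution arrives, instead of buffering any worker's full gradient vector.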
******
Today's big and fast data and changing circumstances require fast training of Deep
Neural Networks (DNNs) in various applications. However, training a DNN with a huge
number of parameters involves intensive computation. Motivated by the fact that redundancy
exists in DNNs and by the observation that the ranking of the significance of the weights
changes only slightly during training, we propose Eager Pruning, which speeds up DNN
training by moving pruning to an early stage. Eager Pruning is supported by an algorithm and
architecture co-design. The proposed algorithm dictates the architecture to identify
and prune insignificant weights during training without accuracy loss. A novel architecture
is designed to transform the reduced training computation into performance improvement.
Our proposed Eager Pruning system gains an average 1.91x speedup over a state-of-the-art
hardware accelerator and 6.31x better energy efficiency than Nvidia GPUs.
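A minimal sketch of the eager-pruning idea (toy SGD; the pruning step, fraction, and
magnitude criterion are illustrative assumptions, not the paper's co-designed schedule):

import numpy as np

def train_step(w, grad, mask, lr=0.01, prune_frac=0.5, step=0, prune_at=1000):
    """One toy SGD step with eager pruning: once the magnitude ranking
    has stabilized (modeled here as a fixed step count), the smallest
    weights are masked and stay zero for the rest of training."""
    if step == prune_at and mask is None:
        k = int(prune_frac * w.size)
        thresh = np.partition(np.abs(w).ravel(), k)[k]
        mask = (np.abs(w) >= thresh).astype(w.dtype)
    w = w - lr * grad
    if mask is not None:
        w = w * mask          # pruned weights contribute no computation
    return w, mask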
******
We present a method for transparently identifying ineffectual computations during
inference with Deep Learning models. Specifically, by decomposing multiplications
down to the bit level, the amount of work needed by multiplications during inference
can potentially be reduced by at least 40x across a wide selection of neural networks
(8b and 16b). This method produces numerically identical results and does not affect
overall accuracy. We present Laconic, a hardware accelerator that implements this
approach to boost energy efficiency for inference with Deep Learning Networks. Laconic
judiciously gives up some of the work reduction potential to yield a low-cost, simple,
and energy efficient design that outperforms other state-of-the-art accelerators:
an optimized DaDianNao-like design [13], Eyeriss [15], SCNN [71], Pragmatic [3], and
BitFusion [83]. We study 16b, 8b, and 1b/2b fixed-point quantized models.
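The bit-level decomposition behind this count can be sketched in a few lines
(unsigned fixed-point only; the bit width is an assumption):

def effectual_terms(a, w, bits=16):
    """Decompose a*w into one partial product per pair of set bits;
    only these terms represent effectual work."""
    ai = [i for i in range(bits) if (a >> i) & 1]
    wj = [j for j in range(bits) if (w >> j) & 1]
    terms = [i + j for i in ai for j in wj]
    assert sum(1 << t for t in terms) == a * w    # numerically identical
    return len(terms)      # vs. bits*bits bit pairs in a dense multiplier

print(effectual_terms(12, 10))   # 4 terms instead of 256 bit pairs

A dense 16b x 16b multiplier processes all 256 bit pairs regardless of content, while
only the effectual terms above are actually needed.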
******
The popularity of hardware-based Trusted Execution Environments (TEEs) has recently
skyrocketed with the introduction of Intel's Software Guard Extensions (SGX). In SGX,
the user process is protected from supervisor software, such as the operating system,
through an isolated execution environment called an enclave. Despite the isolation
guarantees provided by TEEs, numerous microarchitectural side channel attacks have
been demonstrated that bypass their defense mechanisms. But not all hope is lost
for defenders: many modern fine-grain, high-resolution side channels---e.g., execution
unit port contention---introduce large amounts of noise, complicating the adversary's
task of reliably extracting secrets. In this work, we introduce Microarchitectural Replay
Attacks, whereby an SGX adversary can denoise nearly arbitrary microarchitectural
side channels in a single run of the victim by causing the victim to repeatedly replay
a page-faulting instruction. We design, implement, and demonstrate our ideas in
a framework, called MicroScope, and use it to denoise notoriously noisy side channels.
Our main result shows how MicroScope can denoise the execution unit port contention
channel. Specifically, we show how MicroScope can reliably detect the presence or
absence of as few as two divide instructions in a single logical run of the victim
program. Such an attack could be used to detect subnormal input to individual floating-point
instructions, or infer branch directions in an enclave despite today's countermeasures
that flush the branch predictor at the enclave boundary. We also use MicroScope to
single-step and denoise a cache-based attack on the OpenSSL implementation of AES.
Finally, we discuss the broader implications of microarchitectural replay attacks---as
well as discuss other mechanisms that can cause replays.
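The statistical core of denoising-by-replay can be illustrated in a few lines (a
purely synthetic model; the signal and noise magnitudes are invented for illustration):

import random, statistics

def timing_sample(secret_divide_present):
    signal = 5 if secret_divide_present else 0   # port-contention signal
    return signal + random.gauss(0, 20)          # noise dwarfs the signal

one_shot = timing_sample(True)                           # unreliable alone
replayed = statistics.mean(timing_sample(True) for _ in range(1000))
print(round(one_shot, 1), round(replayed, 1))    # the mean hugs the signal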
******
Directories for cache coherence have recently been shown to be vulnerable to conflict-based
side-channel attacks. By forcing directory conflicts, an attacker can evict victim
directory entries, which in turn trigger the eviction of victim cache lines from private
caches. This evidence strongly suggests that directories need to be redesigned for
security. The key to a secure directory is to block interference between processes.
Sadly, in an environment with many cores, this is hard or expensive to do. This paper
presents the first design of a scalable secure directory. We call it SecDir. SecDir
takes part of the storage used by a conventional directory and reassigns it to per-core
private directory areas used in a victim-cache manner, called Victim Directories (VDs).
The partitioned nature of VDs prevents directory interference across cores, defeating
directory side-channel attacks. The VD of a core is distributed, and holds as many
entries as lines in the private L2 cache of the core. To minimize victim self-conflicts
in a VD during an attack, a VD is organized as a cuckoo directory. Such a design also
obscures the victim's conflict patterns from the attacker. For our evaluation, we
use simulations to model the directory of an Intel Skylake-X server with and without
SecDir. Our results show that SecDir has a negligible performance overhead. Furthermore,
SecDir is area-efficient.
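For readers unfamiliar with cuckoo organization, a toy sketch of the insertion
mechanics (illustrative only; the hash functions, single-entry slots, and kick limit
are assumptions, not SecDir's VD design):

import random

class CuckooVD:
    """Two candidate slots per tag; conflicts relocate resident tags."""
    def __init__(self, n_slots, seed=12345):
        self.table = [None] * n_slots
        self.n, self.seed = n_slots, seed

    def _slots(self, tag):
        return (hash((tag, self.seed)) % self.n,
                hash((self.seed, tag)) % self.n)

    def insert(self, tag, max_kicks=32):
        for _ in range(max_kicks):
            s1, s2 = self._slots(tag)
            for s in (s1, s2):
                if self.table[s] is None:
                    self.table[s] = tag
                    return None                # placed without eviction
            s = random.choice((s1, s2))        # displace a resident tag
            tag, self.table[s] = self.table[s], tag
        return tag                             # entry finally evicted

Because each tag has two possible homes and conflicts trigger relocation rather than
immediate eviction, an attacker observes far fewer, and less regular, self-conflicts.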
******
This paper focuses on a new attack vector in modern processors: the timing-based side
and covert channel attacks due to the Translation Look-aside Buffers (TLBs). This
paper first presents a novel three-step modeling approach that is used to exhaustively
enumerate all possible TLB timing-based vulnerabilities. Building on the three-step
model, this paper then shows how to automatically generate micro security benchmarks
that test for the TLB vulnerabilities. After showing the insecurity of standard TLBs,
two new secure TLB designs are presented: a Static-Partition (SP) TLB and a Random-Fill
(RF) TLB. The new secure TLBs are evaluated using the Rocket Core implementation of
the RISC-V processor architecture enhanced with the two new designs. The three-step
model and the security benchmarks are used to analyze the security of the new designs
in simulation. Based on the analysis, the proposed secure TLBs can defend not only
against the previously publicized attacks but also against other new timing-based
attacks in TLBs found using the new three-step model. The performance overhead is
evaluated on an FPGA-based setup and shows, for example, that the RF TLB has less
than 10% overhead while defending against all the attacks.
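The SP TLB's isolation property is easy to picture with a toy model (the way counts,
set-index function, and random fill policy are our assumptions, not the Rocket Core
implementation):

import random

class SPTLB:
    """Each security domain owns a fixed slice of the ways, so one
    domain's fills can never evict another domain's entries."""
    def __init__(self, n_sets=16, n_ways=4, n_domains=2):
        self.ways_per_dom = n_ways // n_domains
        self.sets = [[None] * n_ways for _ in range(n_sets)]

    def access(self, domain, vpn):
        tlb_set = self.sets[vpn % len(self.sets)]
        lo = domain * self.ways_per_dom        # this domain's private ways
        ways = range(lo, lo + self.ways_per_dom)
        if any(tlb_set[w] == (domain, vpn) for w in ways):
            return True                        # TLB hit
        tlb_set[random.choice(list(ways))] = (domain, vpn)  # fill in-slice
        return False                           # miss: walk the page table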
******
Conflict-based cache attacks can allow an adversary to infer the access pattern of
a co-running application by orchestrating evictions via cache conflicts. Such attacks
can be mitigated by randomizing the location of the lines in the cache. Our recent
proposal, CEASER, makes cache randomization practical by accessing the cache using
an encrypted address and periodically changing the encryption key. CEASER was analyzed
with the state-of-the-art algorithm on forming eviction sets, and the analysis showed
that CEASER with a Remap-Rate of 1% is sufficient to tolerate years of attack. In this
paper, we present two new attacks that significantly push the state of the art in
forming eviction sets. Our first attack reduces the time required to form an eviction
set from O(L^2) to O(L), where L is the number of lines in the attack. This attack
is 35x faster than the best-known attack and requires that the Remap-Rate of CEASER
be increased to 35%. Our second attack exploits the replacement policy (we analyze
LRU, RRIP, and Random) to form eviction sets quickly and requires that the Remap-Rate
of CEASER be increased to more than 100%, incurring impractical overheads. To improve
the robustness of CEASER against these attacks in a practical manner, we propose Skewed-CEASER
(CEASER-S), which divides the cache ways into multiple partitions and maps the cache
line to be resident in a different set in each partition. This design significantly
improves the robustness of CEASER, as the attacker must form an eviction set that
can dislodge the line from multiple possible locations. We show that CEASER-S can
tolerate years of attacks while retaining a Remap-Rate of 1%. CEASER-S incurs negligible
slowdown (within 1%) and a storage overhead of less than 100 bytes for the newly added
structures.
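The skewed lookup can be sketched as follows (a toy model: Python's hash stands in
for CEASER's low-latency block cipher, and the set/way counts and random skew choice
on a miss are assumptions):

import random

def keyed_index(addr, key, n_sets):
    # Stand-in for the encrypted set index; a real design uses a cipher.
    return hash((addr, key)) % n_sets

def access(cache, addr, keys, ways_per_skew=8):
    """cache[set][skew] is a list of resident lines; build the cache as
    cache = [[[] for _ in keys] for _ in range(n_sets)]. Each skew maps
    the address with its own key, so a line may live in a different set
    in every partition."""
    n_sets = len(cache)
    slots = [(keyed_index(addr, k, n_sets), skew)
             for skew, k in enumerate(keys)]
    if any(addr in cache[s][skew] for s, skew in slots):
        return 'hit'
    s, skew = random.choice(slots)             # install into a random skew
    ways = cache[s][skew]
    if len(ways) >= ways_per_skew:
        ways.pop(0)                            # evict within that skew only
    ways.append(addr)
    return 'miss'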
******
State-of-the-art secure processors like Intel SGX remain susceptible to leaking the
page-level address trace of an application via the page fault channel, in which a
malicious OS induces spurious page faults and deduces the application's secrets from
them. Prior works that fix this vulnerability do not provision for OS demand paging
to be oblivious. In this work, we present InvisiPage, which obfuscates the page fault
channel while simultaneously making OS demand paging oblivious. To do so, InvisiPage
first carefully distributes page management actions between the application and the
OS. Second, InvisiPage secures the application's page management interactions with
the OS using a novel construct which is derived from Oblivious RAM (ORAM) but
customized for page management. Finally, we lower the overheads of our approach by
reducing page management interactions with the OS via a novel memory partition. For
a suite of cloud applications that process sensitive data, we show that the page
fault channel can be tackled while enabling oblivious demand paging at low overheads.
******
Computer systems using DRAM are exposed to row-hammer (RH) attacks, which can flip
data in a DRAM row without directly accessing it, but by frequently activating its
adjacent rows. There have been a number of proposals to prevent RH, but they either
incur large area overhead, suffer from noticeable performance drops on adversarial
memory access patterns, or provide probabilistic protection with no capability to
detect attacks. In this paper, we propose a new counter-based RH prevention solution
named Time Window Counter (TWiCe) based row refresh, which accurately detects potential
RH attacks using only a small number of counters with minimal performance impact.
We first make a key observation that the number of rows that can cause RH is limited