ACM Transactions on Design Automation of Electronic Systems

Material type: Text
Series: ACM Transactions on Design Automation of Electronic Systems, Volume 22, Issue 4, 2017
Publication details: New York : Association for Computing Machinery, 2017
Description: various pagings : illustrations ; 25 cm
ISSN:
  • 1084-4309
Subject(s):
Contents:
Exploring Energy-Efficient Cache Design in Emerging Mobile Platforms -- Scalable Bandwidth Shaping Scheme via Adaptively Managed Parallel Heaps in Manycore-Based Network Processors -- Optimal Scheduling and Allocation for IC Design Management and Cost Reduction -- Proof-Carrying Hardware via Inductive Invariants -- Automated Integration of Dual-Edge Clocking for Low-Power Operation in Nanometer Nodes -- Design Methodology of Fault-Tolerant Custom 3D Network-on-Chip -- Approximate Energy-Efficient Encoding for Serial Interfaces -- Parallel High-Level Synthesis Design Space Exploration for Behavioral IPs of Exact Latencies -- Generating Current Constraints to Guarantee RLC Power Grid Safety -- Test Modification for Reduced Volumes of Fail Data -- Multiharmonic Small-Signal Modeling of Low-Power PWM DC-DC Converters -- Training Fixed-Point Classifiers for On-Chip Low-Power Implementation -- Efficient Mapping of Applications for Future Chip-Multiprocessors in Dark Silicon Era -- Spatio-Temporal Scheduling of Preemptive Real-Time Tasks on Partially Reconfigurable Systems -- Measurement-Based Worst-Case Execution Time Estimation Using the Coefficient of Variation -- NoC-HMP: A Heterogeneous Multicore Processor for Embedded Systems Designed in SystemJ -- Time-Triggered Scheduling of Mixed-Criticality Systems -- Incremental Layer Assignment for Timing Optimization.
Summary: [Article Title: Exploring Energy-Efficient Cache Design in Emerging Mobile Platforms/ Kaige Yan, Lu Peng, Mingsong Chen and Xin Fu, p. 58:1-58:20] Abstract: Mobile devices are quickly becoming the most widely used computing platforms in consumer electronics. Since their primary power supply is a battery, energy-efficient computing is highly desired. In this article, we focus on energy-efficient cache design in emerging mobile platforms. We observe that more than 40% of L2 cache accesses are OS kernel accesses in interactive smartphone applications. Such frequent kernel accesses cause serious interference between the user and kernel blocks in the L2 cache, leading to unnecessary block replacements and a high L2 cache miss rate. We first propose to statically partition the L2 cache into two separate segments, which can be accessed only by user code and kernel code, respectively. Meanwhile, the overall size of the two segments is shrunk, which reduces energy consumption while still maintaining a similar cache miss rate. We then find completely different access behaviors in the separated kernel and user segments and explore multi-retention STT-RAM-based user and kernel segments to obtain higher energy savings in this static partition-based cache design. Finally, we propose to dynamically partition the L2 cache into user and kernel segments to minimize the overall cache size. We also integrate short-retention STT-RAM into this dynamic partition-based cache design for maximal energy savings. The experimental results show that our static technique reduces cache energy consumption by 75% with 2% performance loss, and our dynamic technique further reduces cache energy consumption by 85% with only 3% performance loss.;[Article Title: Scalable Bandwidth Shaping Scheme via Adaptively Managed Parallel Heaps in Manycore-Based Network Processors/ Taehyun Kim, Jongbum Lim, Jinku Kim, Woo-Cheol Cho, Eui-Young Chung and Hyuk-Jun Lee, p. 59:1-59:26] Abstract: Scalability of network processor-based routers heavily depends on limitations imposed by memory accesses and the associated power consumption. Bandwidth shaping of a flow is a key function; it requires a token bucket per output queue and consumes substantial memory bandwidth. As the number of output queues increases, managing token buckets becomes prohibitively expensive and limits scalability. In this work, we propose a scalable software-based token bucket management scheme that can reduce memory accesses and power consumption significantly. To satisfy real-time and low-cost constraints, we propose novel parallel heap data structures running on a manycore-based network processor. By using cache locking, the performance of heap processing is enhanced significantly and is more predictable. In addition, we quantitatively analyze the performance and memory footprint of the proposed software scheme using stochastic modeling and the Lyapunov central limit theorem. Finally, the proposed scheme provides an adaptive method to limit the size of heaps in the case of oversubscribed queues, which can successfully isolate queues exhibiting non-ideal behavior. The proposed scheme reduces memory accesses by up to three orders of magnitude for one million queues sharing a 100 Gbps interface of the router while maintaining stability under stressful scenarios.
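The token-bucket mechanism at the core of the bandwidth-shaping article above can be illustrated with a minimal sketch. This is the textbook algorithm, not the authors' adaptively managed parallel-heap scheme, and the rate and burst parameters are hypothetical:

    import time

    class TokenBucket:
        # Textbook token-bucket shaper: tokens accrue at `rate` bytes/s,
        # capped at `burst`; a packet departs only if enough tokens exist.
        def __init__(self, rate, burst):
            self.rate = rate
            self.burst = burst
            self.tokens = burst
            self.stamp = time.monotonic()

        def allow(self, nbytes):
            now = time.monotonic()
            # Lazy refill on access: add tokens for the elapsed interval.
            self.tokens = min(self.burst,
                              self.tokens + (now - self.stamp) * self.rate)
            self.stamp = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return True
            return False

    bucket = TokenBucket(rate=1_250_000, burst=15_000)  # ~10 Mbps, hypothetical
    print(bucket.allow(1500))  # True: a 1500-byte packet fits the initial burst

Refilling lazily on access is what keeps per-queue state cheap; the article's contribution lies in organizing millions of such buckets in parallel heaps so that managing them stays within memory-bandwidth and real-time budgets.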
;[Article Title: Optimal Scheduling and Allocation for IC Design Management and Cost Reduction/ Prabhav Agrawal, Mike Broxterman, Biswadeep Chatterjee, Patrick Cuevas, Kathy H. Hayashi, Andrew B. Kahng, Pranay K. Myana and Siddhartha Nath, p. 60:1-60:30] Abstract: A large semiconductor product company spends hundreds of millions of dollars each year on design infrastructure to meet tapeout schedules for multiple concurrent projects. Resources (servers, electronic design automation tool licenses, engineers, and so on) are limited and must be shared, and the cost per day of schedule slip can be enormous. Co-constraints between resource types (e.g., one license per every two cores (threads)) and dedicated versus shareable resource pools make scheduling and allocation hard. In this article, we formulate two mixed integer-linear programs for optimal multi-project, multi-resource allocation with task precedence and resource co-constraints. Application to a real-world three-project scheduling problem extracted from a leading-edge design center of anonymized Company X shows substantial compute and license cost savings. Compared to the product company's schedule, our solution shows that the makespan of all projects can be reduced by seven days, which not only saves ∼2.7% of annual labor and infrastructure costs but also enhances market competitiveness. We also demonstrate the capability of scheduling over two dozen chip development projects at the design center level, subject to resource and datacenter capacity limits as well as per-project penalty functions for schedule slips. The design center ended up purchasing 600 additional servers, whereas our solution demonstrates that the schedule can be met without purchasing any additional servers. Application to a four-project scheduling problem extracted from a leading-edge design center in a non-US location shows that up to ∼37% headcount reduction is available during a half-year schedule for just one type of chip design activity.;[Article Title: Proof-Carrying Hardware via Inductive Invariants/ Tobias Isenberg, Marco Platzner, Heike Wehrheim and Tobias Wiersema, p. 61:1-61:23] Abstract: Proof-carrying hardware (PCH) is a principle for achieving safety for dynamically reconfigurable hardware systems. The producer of a hardware module invests substantial effort in creating a proof for a safety policy. The proof is then transferred as a certificate, together with the configuration bitstream, to the consumer of the hardware module, who can quickly verify the given proof. Previous work utilized SAT solvers and resolution traces to set up a PCH technology and corresponding tool flows. In this article, we present a novel technology for PCH based on inductive invariants. For sequential circuits, our approach is fundamentally stronger than the previous SAT-based one, since we avoid the limitations of bounded unrolling. We contrast our technology with existing ones and show that it fits into previously proposed tool flows. We conduct experiments with four categories of benchmark circuits and report consumer and producer runtime and peak memory consumption, as well as the size of the certificates and the distribution of the workload between producer and consumer. Experiments clearly show that our new induction-based technology is superior for sequential circuits, whereas the previous SAT-based technology is the better choice for combinational circuits.
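The inductive-invariant check that the consumer performs in this PCH approach can be stated compactly. For a circuit with initial-state predicate I, transition relation T, and safety property P, a certificate invariant \varphi is validated by three implications (a standard formulation, written in LaTeX; the article's exact encoding may differ):

    I(s) \Rightarrow \varphi(s), \qquad
    \varphi(s) \land T(s, s') \Rightarrow \varphi(s'), \qquad
    \varphi(s) \Rightarrow P(s)

Each implication is a single satisfiability query over one or two copies of the state, which is why the consumer-side check is fast and why, unlike bounded unrolling, the argument covers executions of arbitrary length.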
;[Article Title: Automated Integration of Dual-Edge Clocking for Low-Power Operation in Nanometer Nodes/ Andrea Bonetti, Nicholas Preyss, Adam Teman and Andreas Burg, p. 62:1-62:20] Abstract: Clocking power, including both clock distribution and registers, has long been one of the primary factors in the total power consumption of many digital systems. One straightforward approach to reducing this power consumption is to apply dual-edge-triggered (DET) clocking, as sequential elements then operate at half the clock frequency while maintaining the same throughput as with conventional single-edge-triggered (SET) clocking. However, the DET approach is rarely taken in modern integrated circuits, primarily due to the perceived complexity of integrating such a clocking scheme. In this article, we first identify the most promising conditions for achieving low-power operation with DET clocking and then introduce a fully automated design flow for applying DET to a conventional SET design. The proposed design flow is demonstrated on three benchmark circuits in a 40nm CMOS technology, providing as much as a 50% reduction in clock distribution and register power consumption.
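The first-order power argument behind dual-edge clocking can be made explicit with the standard dynamic-power relation (a back-of-the-envelope estimate, not the article's detailed model):

    P_{\text{clk}} = \alpha\, C_{\text{clk}}\, V_{DD}^{2}\, f, \qquad
    f_{\text{DET}} = f_{\text{SET}} / 2

DET registers capture data on both clock edges, so the clock network toggles at half the frequency for the same data throughput; to first order, the clock-distribution component of dynamic power is halved, consistent with the up-to-50% reduction reported above.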
;[Article Title: Design Methodology of Fault-Tolerant Custom 3D Network-on-Chip/ Katherine Shu-Min Li and Sying-Jyan Wang, p. 63:1-63:20] Abstract: A systematic design methodology is presented for custom Network-on-Chip (NoC) in three-dimensional integrated circuits (3D-ICs). In addition, fault tolerance is supported in the NoC if extra links are included in the NoC topology. In the proposed method, processors and the communication architecture are synthesized simultaneously in the 3D floorplanning process. 3D-IC technology enables ICs to be implemented in a smaller size with higher performance; on the flip side, 3D-ICs suffer yield loss due to multiple dies in a 3D stack and the lower manufacturing yield of through-silicon vias (TSVs). To alleviate this problem, a known-good-dies (KGD) test can be applied to ensure that every die to be packaged into a 3D-IC is fault-free. However, faulty TSVs cannot be detected by the KGD test. In this article, the proposed method deals with the problem by providing fault tolerance in the NoC topology. The efficiency of the proposed method is evaluated using several benchmark circuits, and the experimental results show that the proposed method produces 3D NoCs with performance comparable to previous methods when fault-tolerant features are not realized. With fault tolerance in NoCs, higher yield can be achieved at the cost of a performance penalty and an elevated power level.;[Article Title: Approximate Energy-Efficient Encoding for Serial Interfaces/ Daniele Jahier Pagliari, Enrico Macii and Massimo Poncino, p. 64:1-64:25] Abstract: Serial buses are ubiquitous interconnections in embedded computing systems that are used to interface processing elements with peripherals, such as sensors, actuators, and I/O controllers. Despite their limited wiring, as off-chip connections they can account for a significant amount of the total power consumption of a system-on-chip device. Encoding the information sent on these buses is the most intuitive and affordable way to reduce their power contribution; moreover, the encoding can be made even more effective by exploiting the fact that many embedded applications can tolerate intermediate approximations without a significant impact on the final quality of results, thus trading off accuracy for power consumption. We propose a simple yet very effective approximate encoding for reducing dynamic energy in serial buses. Our approach uses differential encoding as a baseline scheme and extends it with bounded approximations to overcome the intrinsic limitations of differential encoding for data with low temporal correlation. We show that the proposed scheme, in addition to yielding extremely compact codecs, is superior to all state-of-the-art approximate serial encodings over a wide set of traces representing data received or sent from/to sensors or actuators.
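The bounded-approximation idea in the serial-encoding article above can be sketched in a few lines. This is a simplified caricature of approximate differential encoding, not the authors' codec, and the bound is a hypothetical knob:

    BOUND = 8  # hypothetical cap on the transmitted delta (approximation knob)

    def encode(samples):
        # Differential encoding with clamped deltas: poorly correlated data
        # produces large deltas, which are approximated by the bound.
        prev, out = 0, []
        for s in samples:
            delta = max(-BOUND, min(BOUND, s - prev))
            out.append(delta)
            prev += delta  # track what the decoder will reconstruct
        return out

    def decode(deltas):
        prev, out = 0, []
        for d in deltas:
            prev += d
            out.append(prev)
        return out

    data = [0, 3, 5, 40, 42]     # the jump of 35 exceeds the bound
    print(decode(encode(data)))  # [0, 3, 5, 13, 21]: small deltas, bounded error

Clamping keeps each transmitted symbol small at the cost of a bounded reconstruction error, which is the accuracy-for-energy tradeoff the abstract describes.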
;[Article Title: Parallel High-Level Synthesis Design Space Exploration for Behavioral IPs of Exact Latencies/ Benjamin Carrion Schafer, p. 65:1-65:20] Abstract: This work presents a Design Space Exploration (DSE) method for Behavioral IPs (BIPs) given in ANSI-C or SystemC to find the smallest micro-architecture for a specific target latency. Previous work on High-Level Synthesis (HLS) DSE mainly focused on finding a tradeoff curve with Pareto-optimal designs. HLS is, however, a single-process (component) synthesis method, and a component must often meet a specific fixed latency when inserted within a larger system. This work presents a fast multi-threaded method to find the smallest micro-architecture for a given BIP and target latency by discriminating between the different exploration knobs and exploring them concurrently. Experimental results show that our proposed method is very effective, and comprehensive results compare the quality of results against the speedup of our proposed explorer.;[Article Title: Generating Current Constraints to Guarantee RLC Power Grid Safety/ Zahi Moudallal and Farid N. Najm, p. 66:1-66:39] Abstract: A critical task during early chip design is the efficient verification of the chip power distribution network. Vectorless verification, developed since the mid-2000s as an alternative to traditional simulation-based methods, requires the user to specify current constraints (budgets) for the underlying circuitry and checks whether the corresponding voltage variations on all grid nodes are within a user-specified margin. This framework is extremely powerful, as it allows for efficient and early verification, but specifying/obtaining current constraints remains a burdensome task for users and a hurdle to adoption of this framework by industry. Recently, the inverse problem has been introduced: generate circuit current constraints that, if satisfied by the underlying logic circuitry, would guarantee grid safety from excessive voltage variations. This approach has many potential applications, including various grid quality metrics, as well as voltage drop-aware placement and floorplanning. So far, this framework has been developed assuming only resistive and capacitive (RC) elements in the power grid model. Inductive effects are becoming a significant component of power supply noise and can no longer be ignored. In this article, we extend the constraints generation approach to allow for inductance. We give a rigorous problem definition and develop some key theoretical results related to the maximality of the current space defined by the constraints. Based on this, we then develop three constraints generation algorithms that target the peak total chip power that is allowed by the grid, the uniformity of current distribution across the die area, and a combination of both metrics.;[Article Title: Test Modification for Reduced Volumes of Fail Data/ Irith Pomeranz, M. Enamul Amyeen and Srikanth Venkataraman, p. 67:1-67:17] Abstract: As part of a yield improvement process, fail data is collected from faulty units. Several approaches exist for reducing the tester time and the volume of fail data that needs to be collected, based on the observation that a subset of the fail data is sufficient for accurate defect diagnosis. This article addresses the volume of fail data by considering the test set that is used for collecting it. It observes that certain faults from a set of target faults produce significantly larger numbers of faulty output values (and therefore significantly larger volumes of fail data) than other faults under a given test set. Based on this observation, it describes a procedure for modifying the test set to reduce the maximum number of faulty output values that a target fault produces. When defects are considered in a simulation experiment, and a defect diagnosis procedure is applied to the fail data that they produce, two effects are observed: the maximum and average numbers of faulty output values per defect are reduced significantly with the modified test set, and the quality of diagnosis is similar or even improved with the modified test set.;[Article Title: Multiharmonic Small-Signal Modeling of Low-Power PWM DC-DC Converters/ Ya Wang, Di Gao, Dani Tannir, Ning Dong, G. Peter Fang, Wei Dong and Peng Li, p. 68:1-68:16] Abstract: Small-signal models of pulse-width modulation (PWM) converters are widely used for analyzing stability and play an important role in converter design and control. However, existing small-signal models either are based on averaged DC behaviors, and hence are unable to capture frequency responses that are faster than the switching frequency, or grossly approximate these high-frequency responses. We address the severe limitations of the existing models by proposing a multiharmonic model that provides a complete small-signal characterization of both DC averages and high-order harmonic responses. The proposed model captures important high-frequency overshoots and undershoots of the converter response, which are otherwise unaccounted for by existing techniques. In two converter examples, the proposed model corrects the misleading results of the existing models by providing a truthful characterization of the overall converter AC response and offers important guidance for converter design and closed-loop control.;[Article Title: Training Fixed-Point Classifiers for On-Chip Low-Power Implementation/ Hassan Albalawi, Yuanning Li and Xin Li, p. 69:1-69:18] Abstract: In this article, we develop several novel algorithms to train classifiers that can be implemented on chip with low-power fixed-point arithmetic with extremely small word length. These algorithms are based on Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), and Logistic Regression (LR), and are referred to as LDA-FP, SVM-FP, and LR-FP, respectively. They incorporate the nonidealities (i.e., rounding and overflow) associated with fixed-point arithmetic into the offline training process so that the resulting classifiers are robust to these nonidealities. Mathematically, LDA-FP, SVM-FP, and LR-FP are formulated as mixed integer programming problems that can be robustly solved by the branch-and-bound methods described in this article. Our numerical experiments demonstrate that LDA-FP, SVM-FP, and LR-FP substantially outperform the conventional approaches for the emerging biomedical applications of brain decoding.
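The two fixed-point nonidealities these training algorithms model, rounding and overflow, are easy to state concretely. The following is a generic signed fixed-point quantizer, not the article's formulation, with hypothetical word and fraction lengths:

    def to_fixed(x, word_len=8, frac_len=4):
        # Round to the nearest multiple of 2^-frac_len (rounding), then
        # saturate to the representable range (overflow).
        step = 2.0 ** -frac_len
        lo = -(2 ** (word_len - 1)) * step       # most negative code
        hi = (2 ** (word_len - 1) - 1) * step    # most positive code
        q = round(x / step) * step
        return min(hi, max(lo, q))

    # A weight of 0.30 becomes 0.3125; 9.7 saturates to 7.9375.
    print(to_fixed(0.30), to_fixed(9.7))

Folding this mapping into training, rather than quantizing a classifier trained in floating point, is what makes the resulting classifiers robust at extremely small word lengths.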
;[Article Title: Efficient Mapping of Applications for Future Chip-Multiprocessors in Dark Silicon Era/ Mohaddeseh Hoveida, Fatemeh Aghaaliakbari, Ramin Bashizade, Mohammad Arjomand and Hamid Sarbazi-Azad, p. 70:1-70:26] Abstract: The failure of Dennard scaling has led to the utilization wall that is the source of dark silicon and limits the percentage of a chip that can actively switch within a given power budget. To address this issue, a structure is needed that guarantees the limited power budget while providing sufficient flexibility and performance for different applications with various communication requirements. In this article, we present a general-purpose platform for future many-core Chip-Multiprocessors (CMPs) that benefits from the advantages of clustering, Network-on-Chip (NoC) resource sharing among cores, and power gating of the unused components of clusters. We also propose two task mapping methods for the proposed platform in which active and dark cores are dispersed appropriately, so that an excess power budget can be obtained. Our evaluations reveal that the first and second proposed mapping mechanisms respectively reduce execution time by up to 28.6% and 39.2% and NoC power consumption by up to 11.1% and 10%, and gain an excess power budget of up to 7.6% and 13.4% over the baseline architecture.;[Article Title: Spatio-Temporal Scheduling of Preemptive Real-Time Tasks on Partially Reconfigurable Systems/ Sangeet Saha, Arnab Sarkar and Amlan Chakrabarti, p. 71:1-71:26] Abstract: Reconfigurable devices, which promise the twin benefits of the flexibility of general-purpose processors and the efficiency of dedicated hardware, often provide a lucrative solution for many of today's highly complex real-time embedded systems. However, online scheduling of dynamic hard real-time tasks on such systems with efficient resource utilization, in terms of both space and time, poses an enormously challenging problem. We attempt to solve this problem using a combined offline-online approach. The offline component generates and stores various optional feasible placement solutions for different subsets of tasks that may possibly be co-mapped together. Given a set of periodic preemptive real-time tasks to be executed at runtime, the online scheduler first carries out an admission control procedure and then produces a schedule that is guaranteed to meet all timing constraints, provided it is spatially feasible to place designated subsets of these tasks at specified scheduling points within a future time interval. These feasibility checks are done, and actual placement solutions are obtained, through a low-overhead search of the statically precomputed placement solutions. Based on this approach, we propose a periodic preemptive real-time scheduling methodology for runtime partially reconfigurable devices. The effectiveness of the proposed strategy has been verified through simulation-based experiments, and we observed that the strategy achieves high resource utilization with low task rejection rates over various simulation scenarios.
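The offline-online split in the reconfigurable-scheduling article above can be caricatured as a table lookup: the offline phase enumerates feasible placements per co-mappable task subset, and the online admission test only consults that table. A structural sketch, with hypothetical task and placement data:

    # Offline: precomputed feasible placements, keyed by co-mapped task subset.
    placements = {
        frozenset({"t1", "t2"}): [{"t1": (0, 0), "t2": (4, 0)}],
        frozenset({"t1", "t3"}): [],  # no feasible co-placement found offline
    }

    def admit(task_set):
        # Online admission control: accept only if a precomputed placement
        # exists, so runtime cost is a lookup rather than a placement search.
        options = placements.get(frozenset(task_set), [])
        return options[0] if options else None

    print(admit({"t1", "t2"}))  # stored placement -> admitted
    print(admit({"t1", "t3"}))  # None -> rejected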
;[Article Title: Measurement-Based Worst-Case Execution Time Estimation Using the Coefficient of Variation/ Jaume Abella, Maria Padilla, Joan Del Castillo and Francisco J. Cazorla, p. 72:1-72:29] Abstract: Extreme Value Theory (EVT) has historically been used in domains such as finance and hydrology to model worst-case events (e.g., major stock market incidents). EVT takes as input a sample of the distribution of the variable to model and fits the tail of that sample to either the Generalised Extreme Value (GEV) distribution or the Generalised Pareto Distribution (GPD). Recently, EVT has become popular in real-time systems for deriving worst-case execution time (WCET) estimates of programs. However, the application of EVT is not straightforward and requires a detailed analysis of, and customisation for, the particular problem at hand. In this article, we tailor the application of EVT to timing analysis. To that end, (1) we analyse the response time of different hardware resources (e.g., cache memories) and identify those that may lead to radically different types of execution time distributions. (2) We show that one of these distributions, known as a mixture distribution, causes problems in the use of EVT. In particular, mixture distributions challenge not only properly selecting the GEV/GPD parameters (i.e., location, scale and shape) but also determining the size of the sample to ensure that enough tail values are passed to EVT and that only tail values are used by EVT to fit the GEV/GPD. Failing to select these parameters properly has a negative impact on the quality of the derived WCET estimates. We tackle these problems by (3) proposing Measurement-Based Probabilistic Timing Analysis using the Coefficient of Variation (MBPTA-CV), a new mixture-distribution-aware, WCET-suited MBPTA method that builds on recent EVT developments in other fields (e.g., finance) to automatically select the distribution parameters that best fit the maxima of the observed execution times. Our results on a simulation environment and a real board show that MBPTA-CV produces high-quality WCET estimates.
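The coefficient of variation that gives MBPTA-CV its name can be computed directly from the exceedances over a threshold; for an exponential tail (GPD shape parameter zero) the CV tends to 1, which is the property the method exploits. A simplified numerical sketch with hypothetical measurements:

    import statistics

    def tail_cv(samples, threshold):
        # CV of the peaks-over-threshold residuals; values near 1 are
        # consistent with an exponential tail (simplified illustration).
        exceed = [x - threshold for x in samples if x > threshold]
        if len(exceed) < 2:
            return None
        return statistics.stdev(exceed) / statistics.mean(exceed)

    times = [102, 99, 104, 110, 98, 125, 131, 101, 142, 107, 100, 119]
    print(tail_cv(times, threshold=105))

The full method automates the choice of threshold and sample size; this sketch only shows the statistic at its core.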
;[Article Title: NoC-HMP: A Heterogeneous Multicore Processor for Embedded Systems Designed in SystemJ/ Zoran Salcic, Heejong Park, Jürgen Teich, Avinash Malik and Muhammad Nadeem, p. 73:1-73:25] Abstract: Scalability and performance in multicore processors for embedded and real-time systems usually do not go hand in hand. Networks on Chip (NoCs) provide scalable execution platforms suitable for such kinds of embedded systems. This article presents a NoC-based Heterogeneous Multi-Processor system, called NoC-HMP, which is a scalable platform for embedded systems developed in the GALS language SystemJ. NoC-HMP uses a time-predictable TDMA-MIN NoC to guarantee latencies and communication time between the two types of time-predictable cores, and it can be customized for a specific performance goal through the execution strategy and scheduling of the SystemJ program deployed across multiple cores. Examples of different execution strategies are introduced, explored, and analyzed via measurements. The number of cores used can be minimized to achieve the target performance of the application. TDMA-MIN allows easy extension of NoC-HMP with other cores or IP blocks. Experiments show a significant improvement in performance over a single-core system and demonstrate how the addition of cores affects the performance of the designed system.;[Article Title: Time-Triggered Scheduling of Mixed-Criticality Systems/ Lalatendu Behera and Purandar Bhaduri, p. 74:1-74:25] Abstract: Real-time and embedded systems are moving from the traditional design paradigm to the integration of multiple functionalities onto a single computing platform. Some of the functionalities are safety critical and subject to certification; the rest are non-safety-critical and do not need to be certified. Designing efficient scheduling algorithms that can be used to meet the certification requirement is challenging. Our research considers the time-triggered approach to scheduling mixed-criticality jobs with two criticality levels. The first proposed algorithm for the time-triggered approach is based on the OCBP scheduling algorithm, which finds a fixed-priority order of jobs. Based on this priority order, the existing algorithm constructs two scheduling tables, S_LO and S_HI, and the scheduler uses these tables to find a scheduling strategy (a toy sketch of this two-table dispatching mechanism follows the final abstract below). Another time-triggered algorithm, called MCEDF, was proposed as an improvement over the OCBP-based algorithm. Here we propose an algorithm that directly constructs the two scheduling tables without using a priority order. Furthermore, we show that our algorithm schedules a strict superset of the instances that can be scheduled by the OCBP-based algorithm as well as by MCEDF, and that it outperforms both in the number of instances scheduled in a randomly generated set of instances. We generalize our algorithm to jobs with m criticality levels. Subsequently, we extend our algorithm to find scheduling tables for periodic and dependent jobs. Finally, we show that our algorithm is also applicable to mixed-criticality synchronous programs on uniprocessor platforms and schedules a larger set of instances than the existing algorithm.;[Article Title: Incremental Layer Assignment for Timing Optimization/ Derong Liu, Bei Yu, Salim Chowdhury and David Z. Pan, p. 75:1-75:25] Abstract: With VLSI technology nodes scaling into the nanometer regime, interconnect delay plays an increasingly critical role in timing. For layer assignment, most works deal with via counts or total net delays, ignoring the timing-critical paths of each net and resulting in potential timing issues. In this article, we propose an incremental layer assignment framework targeting delay optimization for the timing-critical paths of each net. A set of novel techniques is presented: self-adaptive quadruple partitioning based on K × K division benefits the runtime; semidefinite programming is utilized within each partition; a sequential mapping algorithm guarantees integer solutions while satisfying edge capacities; and concurrent mapping offers a global view of assignment, while post-assignment delay optimization reduces path timing violations. The effectiveness of our work is verified on the ISPD'08 benchmarks.
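The two-table dispatching mechanism referenced in the mixed-criticality abstract can be sketched schematically. This illustrates the standard S_LO/S_HI semantics, not the article's table-construction algorithm, and the jobs and budgets are hypothetical:

    def run(s_lo, s_hi, signals_done):
        # Follow the LO table slot by slot; on the first slot where the
        # scheduled job has not signalled completion within its LO budget,
        # switch to the HI table (criticality mode switch).
        mode, trace = "LO", []
        for t in range(len(s_lo)):
            job = (s_lo if mode == "LO" else s_hi)[t]
            trace.append((t, mode, job))
            if mode == "LO" and job is not None and not signals_done(job, t):
                mode = "HI"
        return trace

    # Hypothetical 4-slot tables: J1 is HI-criticality, J2 is LO-criticality.
    S_LO = ["J1", "J2", "J1", "J2"]
    S_HI = ["J1", "J1", "J1", None]  # the HI table drops the LO job J2
    done = lambda job, t: not (job == "J1" and t == 0)  # J1 overruns slot 0
    print(run(S_LO, S_HI, done))

As long as no job overruns its LO budget, execution follows S_LO and both criticality levels meet their deadlines; after an overrun, S_HI guarantees only the HI-criticality jobs.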
Item type: Serials
Holdings
Item type: Serials
Current library: National University - Manila LRC - Main
Shelving location: Periodicals
Collection: Gen. Ed. - CCIT
Call number: ACM Transactions on Design Automation of Electronic Systems, Volume 22, Issue 4, 2017
Copy number: c.1
Status: Available
Barcode: PER000000029
Browsing LRC - Main shelves, Shelving location: Periodicals, Collection: Gen. Ed. - CCIT
  • ACM Transactions on Design Automation of Electronic Systems, Volume 22, Issue 1, 2017
  • ACM Transactions on Design Automation of Electronic Systems, Volume 22, Issue 2, 2017
  • ACM Transactions on Design Automation of Electronic Systems, Volume 22, Issue 3, 2017
  • ACM Transactions on Design Automation of Electronic Systems, Volume 22, Issue 4, 2017
  • ACM Transactions on Intelligent Systems and Technology, Volume 12, Issue 1, 2021
  • ACM Transactions on Intelligent Systems and Technology, Volume 12, Issue 3, June 2021
  • ACM Transactions on Intelligent Systems and Technology, Volume 12, Issue 4, 2021

Includes bibliographical references.

