Per Stenström
Per Stenström has been Professor of Computer Engineering at Chalmers University of Technology, Sweden, since 1995. His research focuses on computer architecture, and he has made major contributions to high-performance memory systems in particular. He has authored or co-authored four textbooks, more than 200 publications in international journals and conferences, and 20 patents. He regularly serves on the program committees of major conferences in the computer architecture field and is currently a Senior Associate Editor of ACM Transactions on Architecture and Code Optimization and Topical Associate Editor of IEEE Transactions on Computers. He served the Journal of Parallel and Distributed Computing for 30 years as Editor and Associate Editor-in-Chief. He has also served IEEE Transactions on Parallel and Distributed Processing, the IEEE TCCA Computer Architecture Letters, and others. He co-founded the HiPEAC Network of Excellence, funded by the European Commission. He has acted as Program Chair for a large number of conferences, including the ACM/IEEE Int. Symposium on Computer Architecture, the IEEE High-Performance Computer Architecture Symposium, the IEEE International Parallel and Distributed Processing Symposium, the ACM International Conference on Supercomputing, and IEEE/ACM PACT. He has taken part in more than ten European projects and has graduated about 25 PhDs. He earned his PhD degree in computer engineering in 1990 from Lund University, Sweden. He is a Fellow of the ACM and of the IEEE, and a member of Academia Europaea and the Royal Swedish Academy of Engineering Sciences, among other academies.
Showing 197 publications
BATCH-DNN: Adaptive and Dynamic Batching for Multi-DNN Accelerators
ASaP: Automatic Software Prefetching for Sparse Tensor Computations in MLIR
DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN Accelerators
HMComp: Extending Near-Memory Capacity using Compression in Hybrid Memory
SoK: Analysis of Root Causes and Defense Strategies for Attacks on Microarchitectural Optimizations
eProcessor: European, Extendable, Energy-Efficient, Extreme-Scale, Extensible, Processor Ecosystem
SCALE: Secure and Scalable Cache Partitioning
GBDI: Going Beyond Base-Delta-Immediate Compression with Global Bases
Bounding the execution time of parallel applications on unrelated multiprocessors
CBP: Coordinated management of cache partitioning, bandwidth partitioning and prefetch throttling
Federated Scheduling of Sporadic DAGs on Unrelated Multiprocessors
DELTA: Distributed Locality-Aware Cache Partitioning for Tile-based Chip Multiprocessors
A GPU Register File using Static Data Compression
QoS-driven coordinated management of resources to save energy in multi-core systems
Trends on heterogeneous and innovative hardware and software systems
SaC: Exploiting execution-time slack to save energy in heterogeneous multicore systems
SimICS/sun4m: A virtual workstation
Global dead-block management for task-parallel programs
Scheduling parallel real-time recurrent tasks on multicore platforms
ProFess: A Probabilistic Hybrid Main Memory Management Framework for High Performance and Fairness
SLOOP: QoS-Supervised Loop Execution to Reduce Energy on Heterogeneous Architectures
Runtime-Assisted Global Cache Management for Task-based Parallel Programs
Rock: A framework for pruning the design space of hybrid main memory systems
Timing-anomaly free dynamic scheduling of task-based parallel applications
ProF: Probabilistic Hybrid Main Memory Management for High Performance and Fairness
PATer: A Hardware Prefetching Automatic Tuner on IBM POWER8 Processor
RADAR: Runtime-assisted dead region management for last-level caches
Adaptive row addressing for cost-efficient parallel memory protocols in large-capacity memories
Timing-anomaly free dynamic scheduling of task-based parallel applications
RADAR: Run-time assisted Dead-Region Management for Last-Level Caches
EUROSERVER: Share-anything scale-out micro-server design
A Case for Runtime-Assisted Global Cache Management
A Primer on Compression in the Memory Hierarchy
Enhancing Garbage Collection Synchronization using Explicit Bit Barriers
RADAR: Runtime-Assisted Dead Region Management for Last-Level Caches
Performance Impact of Batching Web Application Requests using Hot-spot Processing on GPUs
HyComp: A Hybrid Cache Compression Method for Selection of Data-Type-Specific Compression Methods
Proceedings of the 2014 ACM International Conference on Supercomputing
Removal of Conflicts in Hardware Transactional Memory Systems
Overhead-Aware Temporal Partitioning on Multicore Processors
Introduction to the JPDC special issue on Perspectives on Parallel and Distributed Processing
Characterizing and Exploiting Small-Value Memory Instructions
ZEBRA: Data-Centric Contention Management in Hardware Transactional Memory
Crystal: A design-time resource partitioning method for hybrid main memory
A Case for a Value-Aware Cache
A Design-Time Resource Partitioning Method for Hybrid Main Memory
Temporal Partitioning on Multicore Platform
Performance and energy analysis of the restricted transactional memory implementation on haswell
SC2: A statistical compression cache scheme
Effective Resource Management Towards Efficient Computing
Runtime-guided cache coherence optimizations in multi-core architectures
Runtime-Guided Cache Coherence Optimizations in Multi-core Architectures
Moving from Petaflops to Petadata
Improving Data Access Efficiency by Using a Tagless Access Buffer (TAB)
Efficient Forwarding of Producer-Consumer Data in Task-based Programs
Towards automatic resource management in parallel architectures.
Eager Beats Lazy: Improving Store Management in Eager Hardware Transactional Memory
Efficient Forwarding of Producer-Consumer Data in Task-based Programs
HARP: Adaptive Abort Recurrence Prediction for Hardware Transactional Memory
Transactions on Architectures and Code Optimizations
Transactional Prefetching: Narrowing the Window of Contention in Hardware Transactional Memory
Transactional Prefetching: Narrowing the Window of Contention in Hardware Transaction Memory
Pi-TM: Pessimistic Invalidation for Scalable Lazy Hardware Transactional Memory
Critical lock analysis: Diagnosing critical section bottlenecks in multithreaded applications
Transactional prefetching: Narrowing the window of contention in hardware transactional memory
A Data Forwarding Scheme for Task-based Programming Models
Coherence-Less Model for Shared-Memory, Speculative Multi-core Processors
Transactions on High Performance and Embedded Architectures and Compilers - Vol 4
ZEBRA: A data-centric, hybrid-policy hardware transactional memory design
Techniques for Reduction of Conflicts in Hardware Transactional Memory.
Transactions on High-Performance Embedded Architectures and Compilers Vol 3
Transaction on Architectures and Code Optimization
Implications of Merging Phases on Scalability of Multi-core Architectures
Implications of Merging Phases on Scalability of Multicore Architectures
The impact of non-coherent buffers on lazy hardware transactional memory systems
A Unified Approach to Eliminate Memory Accesses Early
Diagnosing Critical Section Bottlenecks in Multithreaded Applications
Eager meets lazy: The impact of write-buffering on hardware transactional memory
The Impact of Non-coherent Buffers on Lazy Hardware Transactional Memory Systems
Pi-TM: Pessimistic Invalidation for Scalable Lazy Hardware Transactional Memory
Hints Based Speculative Execution for Exploiting Probabilistic Parallel Execution.
Classification and Elimination of Conflicts in Hardware-Transactional Memory Systems
LV*: A Class of Lazy-Versioning HTMs for Low-Cost Integration of Transactional Memory Systems
Semantic Information Driven Speculative Execution
Characterization and Exploitation of Narrow-Width Loads: The Narrow-Width Cache Approach
Semantic based speculative parallel execution.
Characterization and Exploitation of Silent Loads
Diagnosing Serialization Bottlenecks in Multi-threaded Applications on Multi-core Processors
LV*: A Low Complexity Lazy Versioning HTM Infrastructure
Simple Performance Optimization Techniques for Hardware Transactional Memory Systems
The VELOX Transactional Memory Stack
Generating and Comparing Memory Access Ranges for Speculative Throughput Computing
Using Hoarding to Increase the Availability in Shared File Systems
FlexCore: Utilizing Exposed Datapath Control for Efficient Computing
Zero-Value Caches: Cancelling Loads that Return Zero.
Zero-Value Caches: Cancelling Loads that Return Zero
A Flexible Code-Compression Scheme using Partitioned Look-Up Tables
Cancellation of Loads that Return Zero Using Zero-Value Caches
Semantic information driven speculative execution
Schemes for avoiding starvation in transactional memory systems
Transactions on High-Performance Embedded Architectures and Compilers
Leveraging data promotion for low power D-NUCA caches
A Flexible Code Compression Scheme using Partitioned Look-Up Tables
Intermediate Checkpointing with Conflicting Access Prediction in Transactional Memory Systems
Early Detection and Bypassing of Trivial Operations to Improve Energy Efficiency of Processors
Efficient Management of Speculative Data in Hardware Transactional Memory Systems
A Micro-Architectural Power-Saving Technique for D-NUCA Caches
Simple Penalty-Sensitive Cache Replacement Policies
Memory Link Compression Schemes: A Value Locality Perspective
Zero Loads: Canceling Load Requests by Tracking Zero Values
The worst-case execution-time problem - overview of methods and survey of tools
Reducing Roll-back Overhead in Transactional Memory Systems by Checkpointing Conflicting Accesses
Accommodation of the Bandwidth of Large Cache Blocks using Cache/Memory Link Compression
Proceedings of the 14th IEEE Symp. on High-Performance Computer Architecture
Effectiveness of Caching in a Distributed Digital Library.
Implicit Transactional Memory in Kilo-Instruction Processors
Loop-Level Speculative Parallelism in Embedded Applications.
Starvation-Free Transactional Memory System Protocols.
An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors
Improving Power Efficiency of D-NUCA Caches
FlexCore: Utilizing Exposed Datapath Control for Efficient Computing
Characterization of Apache web server with Specweb2005
The Paradigm Shift to Multi-Cores: Opportunities and Challenges
Starvation-Free Commit Arbitration Policies for Transactional Memory Systems.
SimWattch: Integrating complete-system and user-level performance and power simulators
Limits on Thread-Level Speculative Parallelism in Embedded Applications
Proceedings of the 2007 International Conference on HiPEAC
Efficient Management of Speculative Data in Hardware Transactional Memory Systems
Intermediate Checkpointing with Conflicting Access Prediction in Transactional Memory Systems
Exposed Datapath for Efficient Computing
Proceedings of the 2007 ACM International Conference on Computing Frontiers
Enhancing Lower Level Cache Performance by Early Miss Determination and Bypassing.
High-Performance Embedded Architecture and Compilation Roadmap
Value-Cache Based Compression Schemes for Multiprocessors
Exposed Datapath for Efficient Computing
A Cache Replacement Algorithm based on Frequency and Recency for Chip Multiprocessors.
Starvation-Free Commit Arbitration Policies for Transactional Memory Systems.
A Cache-Partition Aware Replacement Policy for Chip Multiprocessors.
Exploitation of Value Locality for Memory Link Compression
Two Threads in the Machine is Better than Eight in the Bush
Penalty-Sensitive Replacement Policies for Caches.
Implementing Kilo-Instruction Multiprocessors
A Cost-Effective Memory Organization for Future Servers
Keynote 2: The chip-multiprocessing paradigm shift: Opportunities and challenges
Evaluation of Extended Dictionary-Based Static Code Compression Techniques
Enhancing Simulation Speed using Matched-Pair Comparison
Reducing Misspeculation Overhead for Module-Level Speculative Execution
Self-Correcting LRU Replacement Policies.
A Cache Block Reuse Prediction Scheme
Performance and Power Impact of Issue-width in Chip-Multiprocessor Cores
Improving Speculative Thread-Level Parallelism Through Module Run-Length Prediction
FlexSoC: Combining Flexibility and Efficiency in SoC Designs
An Evaluation of Document Prefetching in a Distributed Digital Library
SimWattch: An Approach to Integrate Complete-System with User-Level Performance/Power Simulators
A Novel Approach to Cache Block Reuse Prediction
An Evaluation of Document Prefetching in a Distributed Digital Library
Speculative Lock Reordering: Optimistic Out-of-Order Execution of Critical Sections
Reducing Misspeculation Overhead for Module-Level Speculative Execution
Evaluation of Document Prefetching in a Distributed Digital Library.
Coherence Predictor Cache: A Resource Efficient Coherence Message Prediction Infrastructure.
The FAB Predictor: Using Fourier Analysis to Predict the Outcome of a Conditional Branch
An All-Software Thread-Level Data Dependence Speculation System for Multiprocessors
Improvement of energy-efficiency in off-chip caches by selective prefetching
TLB and Snoop Energy-Reduction using Virtual Caches for Low-Power Chip-Multiprocessors
Showing 20 research projects
classIC - Chalmers Lund Center for Advanced Semiconductor System Design
Pilot using Independent Local & Open Technologies (The European PILOT)
Principles for Computing Memory Devices (PRIDE)
eProcessor: European, extendable, energy-efficient, extreme-scale, extensible, Processor Ecosystem
PRIME: Principled Designs of Processing-in-Memory Parallel Systems
High Performance Embedded Architecture and Compilation
The European Processor Initiative (EPI)
High Performance and Embedded Architecture and Compilation (HiPEAC5)
TEchnology TRAnsfer via Multinational Application eXperiments (TETRAMAX)
High Performance and Embedded Architecture and Compilation (HiPEAC4)
ACE: Approximate Algorithms and Computing Systems
Meeting Challenges in Computer Architecture (MECCA)
Green Computing Node for European micro-servers (EUROSERVER)
A Framework for Fine-Grain Resource Management in Heterogeneous Parallel Architectures
High Performance and Embedded Architecture and Compilation (HiPEAC)