Schedule
Abstract
Modern datacenters are growing at speeds and to sizes that would have been considered impractical just ten years ago; the latest mega-sites are now 250 MW and growing. Memory has also emerged in recent years as the most precious silicon in datacenters, because online services host data in memory to meet tight latency constraints and containerised third-party workloads often run in memory for faster turnaround. Unfortunately, memory capacity scaling has slowed down along with Moore’s Law, and modern software stacks and services heavily fragment memory, further exacerbating the pressure on it. In this talk, I will make the case that today’s server blades are derived from the desktop PC and OS of the 1980s, with the CPU dominating access to memory and the OS orchestrating movement to and from memory through legacy abstractions and interfaces. I will then present promising avenues for a clean-slate server design with novel abstractions for a tighter integration of memory not just with an accelerator ecosystem but also with network and storage.
Bio
Babak Falsafi is Professor in the School of Computer and Communication Sciences and the founding director of the EcoCloud research center at EPFL. He has worked on server architecture for many years and has contributed to several industrial platforms, including the WildFire/WildCat family of multiprocessors by Sun Microsystems (now Oracle), memory system technologies for IBM BlueGene/P and Q and for ARM cores, and server evaluation methodologies in use by AMD, HP and Google (PerfKit). His recent work on scale-out server processor design laid the foundation for the first server-grade ARM CPU, Cavium ThunderX. He is a fellow of ACM and IEEE.
Abstract
The increasing demand of big data analytics for more main memory capacity in datacenters and exascale computing environments is driving the integration of heterogeneous memory technologies. The new technologies exhibit vastly greater differences in access latencies, bandwidth and capacity compared to traditional NUMA systems. Leveraging this heterogeneity while also delivering application performance enhancements requires intelligent data placement. We present Kleio, a page scheduler with machine intelligence for applications that execute across hybrid memory components. Kleio is a hybrid page scheduler that combines existing, lightweight, history-based data tiering methods for hybrid memory with novel intelligent placement decisions based on deep neural networks. We contribute new understanding of the scope of benefits that can be achieved by using intelligent page scheduling in comparison to existing history-based approaches, and of the choice of deep learning algorithms and their parameters that are effective for this problem space. Kleio incorporates a new method for prioritizing the pages that lead to the highest performance boost, while limiting the resulting system resource overheads. Our performance evaluation indicates that Kleio closes on average 80% of the performance gap between existing solutions and an oracle with knowledge of future access patterns. Kleio provides hybrid memory systems with fast and effective neural network training and prediction accuracy levels, which bring significant application performance improvements with limited resource overheads, laying the groundwork for its practical integration in future systems.
Kleio was a best paper award finalist at HPDC ’19 (28th International Symposium on High-Performance Parallel and Distributed Computing, Phoenix, AZ, USA – June 22 – 29, 2019).
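To make the scheduling loop concrete, here is a minimal Python sketch of the kind of hybrid policy described above: a small set of prioritized pages is handled by a learned predictor of per-page access counts, while the remaining pages fall back to a lightweight history-based estimate. The capacities, toy access counts and placeholder predictors are illustrative assumptions, not Kleio's actual models or parameters.

# Minimal sketch of a hybrid page scheduler in the spirit of Kleio.
# The "learned" predictor stands in for per-page neural networks; here it is
# a trivial placeholder that just repeats the last observed access count.

FAST_TIER_CAPACITY = 4      # pages that fit in the fast tier (illustrative)
ML_MANAGED_PAGES = 2        # pages handed to the learned predictor (illustrative)

def history_prediction(history):
    # Lightweight history-based estimate: average of recent access counts.
    return sum(history[-4:]) / len(history[-4:])

def learned_prediction(history):
    # Placeholder for a per-page model: predict the next-period access count.
    return history[-1]

def schedule_period(access_history):
    """access_history: dict page_id -> list of per-period access counts."""
    # Prioritize the pages whose placement matters most (here: most accessed overall).
    ranked = sorted(access_history, key=lambda p: sum(access_history[p]), reverse=True)
    ml_pages = set(ranked[:ML_MANAGED_PAGES])
    predictions = {
        page: (learned_prediction(h) if page in ml_pages else history_prediction(h))
        for page, h in access_history.items()
    }
    # Place the pages predicted to be hottest in the fast tier for the next period.
    hot = sorted(predictions, key=predictions.get, reverse=True)[:FAST_TIER_CAPACITY]
    return set(hot)

history = {p: [p % 3, (p * 7) % 5, p % 4] for p in range(8)}   # toy access counts
print(schedule_period(history))

In Kleio the learned predictor is a per-page recurrent neural network; the placeholder above only marks where such a model would plug in.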
Bio
Thaleia Doudali is a PhD student in Computer Science at Georgia Tech, advised by Ada Gavrilovska. Her current research focuses on building system-level solutions that optimize application performance on systems with heterogeneous memory components, such as DRAM and Non-Volatile Memory. Thaleia has industry experience and patents from internships at AMD, VMware and Dell EMC. Prior to Georgia Tech, she received an undergraduate diploma in Electrical and Computer Engineering from the National Technical University of Athens, where she was advised by Nectarios Koziris and Ioannis Konstantinou. Currently, Thaleia is serving on the Shadow Program Committee of EuroSys 2020.
Abstract
With explosive growth in dataset sizes and increasing machine memory capacities, per-application memory footprints are commonly reaching into hundreds of GBs. Such huge datasets pressure the TLB, resulting in frequent misses that must be resolved through a page walk: a long-latency pointer chase through multiple levels of the in-memory radix-tree-based page table. To accelerate page walks, we introduce Address Translation with Prefetching (ASAP), a lightweight technique for directly indexing individual levels of the page-table radix tree. Direct indexing enables ASAP to fetch nodes from deeper levels of the page table without first accessing the preceding levels, thus lowering the page walk latency. ASAP is non-speculative and fully legacy-preserving, requiring no modifications to the existing radix-tree-based page table, TLBs, or other software and hardware mechanisms for address translation.
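As a concrete illustration of direct indexing, the Python sketch below splits an x86-64 virtual address into its four radix-tree indices and shows how, if each level's nodes were laid out contiguously from a known base (an assumption made here purely for illustration, not a description of ASAP's actual hardware structures), the address of a deep-level entry could be computed, and therefore fetched early, without first reading the preceding levels.

# Illustrative sketch of direct page-table indexing for a 4-level x86-64 radix tree.
# Index fields of a 48-bit virtual address: 9 bits per level, 12-bit page offset.

PTE_SIZE = 8  # bytes per page-table entry

def radix_indices(vaddr):
    """Return the (L4, L3, L2, L1) indices used by a conventional page walk."""
    return tuple((vaddr >> shift) & 0x1FF for shift in (39, 30, 21, 12))

def entry_address(level_bases, vaddr, level):
    """Address of the entry for `vaddr` at `level` (4 = root, 1 = leaf),
    assuming each level's nodes are packed contiguously starting at
    level_bases[level], which is the illustrative assumption that makes
    direct indexing (and hence prefetching deeper levels) possible."""
    l4, l3, l2, l1 = radix_indices(vaddr)
    flat_index = {4: l4,
                  3: (l4 << 9) | l3,
                  2: (l4 << 18) | (l3 << 9) | l2,
                  1: (l4 << 27) | (l3 << 18) | (l2 << 9) | l1}[level]
    return level_bases[level] + flat_index * PTE_SIZE

bases = {4: 0x1000_0000, 3: 0x2000_0000, 2: 0x3000_0000, 1: 0x4000_0000}
va = 0x00007F5B_2A3C_4D10
# With direct indexing, the L1 (leaf) and L2 entries can be fetched in parallel
# with the upper levels instead of serially after them.
print(hex(entry_address(bases, va, 1)), hex(entry_address(bases, va, 2)))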
Bio
Dmitrii is a senior PhD student at the University of Edinburgh (UoE), co-advised by Prof. Boris Grot (UoE) and Prof. Edouard Bugnion (EPFL). His research interests span computer systems and architecture, with a focus on software and hardware support for memory systems.
Abstract
We propose synergistic software and hardware mechanisms that alleviate the address translation overhead in virtualized systems. On the software side, we propose contiguity-aware (CA) paging, a novel physical memory allocation technique that creates larger-than-a-page contiguous mappings while preserving the flexibility of demand paging. CA paging is applicable to the hypervisor and guest OS memory manager independently, as well as in native systems. On the hardware side, we propose SpOT, a simple micro-architectural mechanism to hide TLB miss latency by exploiting the regularity of large contiguous mappings to predict address translations. We implement and emulate the proposed techniques for the x86-64 architecture in Linux and KVM, and evaluate them across a variety of memory-intensive workloads. Our results show that: (i) CA paging is highly effective at creating vast contiguous mappings in both native and virtualized scenarios, even when memory is fragmented, and (ii) SpOT exploits the provided contiguity and reduces address translation overhead of nested paging from ~18% to ~1.2%.
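The prediction idea reduces to simple offset arithmetic: once CA paging has built a large contiguous virtual-to-physical mapping, every address inside it translates by a constant offset, so a TLB-missing address can be translated speculatively while the page walk verifies the result. The Python sketch below models only that arithmetic; the region bookkeeping and the names used are assumptions for illustration, not SpOT's microarchitecture.

# Minimal model of offset-based translation prediction over contiguous mappings.

PAGE = 4096

class ContiguousRegion:
    def __init__(self, va_base, pa_base, num_pages):
        self.va_base, self.pa_base, self.num_pages = va_base, pa_base, num_pages

    def contains(self, vaddr):
        return self.va_base <= vaddr < self.va_base + self.num_pages * PAGE

    def predict(self, vaddr):
        # Inside a contiguous mapping, PA = PA_base + (VA - VA_base).
        return self.pa_base + (vaddr - self.va_base)

def predict_translation(regions, vaddr):
    """Return a predicted physical address, or None if no region matches.
    A real design verifies the prediction against the page walk before
    committing anything that depends on it."""
    for region in regions:
        if region.contains(vaddr):
            return region.predict(vaddr)
    return None

regions = [ContiguousRegion(va_base=0x7000_0000, pa_base=0x1_2000_0000, num_pages=512)]
print(hex(predict_translation(regions, 0x7000_5A10)))   # predicted, then verified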
Bio
Chloe Alverti is a 2nd-year PhD student at the Computing Systems Laboratory (CSLab) of the School of Electrical and Computer Engineering (ECE, NTUA) under the supervision of Professor Georgios Goumas. Her research focuses on efficient virtual memory mechanisms, covering both operating system and architectural support optimizations. Preliminary results of her current work were presented in the PACT 2018 ACM Student Research Competition poster session and awarded 1st place. Prior to her PhD studies, she was employed for two years as a research assistant at Chalmers University of Technology, Sweden, working on the FP7 EU project EuroServer under the supervision of Professor Per Stenstrom. Her main research area was system software support for main memory compression. Research findings of EuroServer led to the foundation of ZeroPoint Technologies AB, a start-up company designing novel memory compression technology, where she was employed as a system software engineer for a year. She still collaborates with ZeroPoint Technologies as they work jointly on the EU H2020 project EuroExa.
Abstract
Pushing functionality to the hardware layer offers numerous advantages, such as increased performance, transparency of functionality, software simplification, and energy efficiency. In this presentation, I will talk about two solutions that leverage hardware to increase both security and performance. The first introduces microarchitectural changes inside the CPU to implement Instruction Set Randomization and Control Flow Integrity. The second leverages heterogeneous hardware architectures, utilizing GPU-CPU hybrid systems to increase computational performance. I will close the presentation with an outlook on future research directions.
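As rough intuition for the first solution, Instruction Set Randomization encrypts a program's instruction stream under a secret key and decrypts it on the fetch/decode path, so injected code that was not encrypted with the key decodes, with high probability, to illegal instructions. The toy Python model below illustrates only that principle on a made-up two-instruction ISA; it is not the microarchitectural design presented in the talk.

# Toy model of Instruction Set Randomization (ISR): code is XOR-encrypted with a
# per-process key and the "decoder" decrypts at fetch time. Injected plaintext
# opcodes decrypt to invalid instructions with high probability and are rejected.

import os

VALID_OPCODES = {0x01: "ADD", 0x02: "PRINT"}   # made-up two-instruction ISA

def encrypt(code, key):
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(code))

def fetch_decode_execute(memory, key):
    for i, byte in enumerate(memory):
        opcode = byte ^ key[i % len(key)]      # decrypt at fetch time
        if opcode not in VALID_OPCODES:
            raise RuntimeError(f"illegal instruction at {i}: possible injected code")
        print("executing", VALID_OPCODES[opcode])

key = os.urandom(4)
program = bytes([0x01, 0x02, 0x01])            # legitimate code, encrypted at load time
memory = bytearray(encrypt(program, key))
fetch_decode_execute(memory, key)

# An attacker injects plaintext opcodes without knowing the key; with overwhelming
# probability they decrypt to illegal instructions and are caught.
memory[0:3] = bytes([0x02, 0x01, 0x02])
try:
    fetch_decode_execute(memory, key)
except RuntimeError as err:
    print(err)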
Bio
Dr. Sotiris Ioannidis received a BSc degree in Mathematics and an MSc degree in Computer Science from the University of Crete in 1994 and 1996, respectively. In 1998 he received an MSc degree in Computer Science from the University of Rochester, and in 2005 he received his PhD from the University of Pennsylvania. Ioannidis held a Research Scholar position at the Stevens Institute of Technology until 2007, and he is a Research Director at the Institute of Computer Science (ICS) of the Foundation for Research and Technology – Hellas (FORTH). In 2019 he was elected Associate Professor at the School of Electrical and Computer Engineering of the Technical University of Crete (TUC). He has been a Member of the ENISA Permanent Stakeholders Group (PSG) since 2017. His research interests are in the areas of systems and network security, security policy, privacy and high-speed networks. Ioannidis has authored more than 150 publications in international conferences and journals, as well as book chapters, and has chaired and served on numerous program committees of prestigious conferences such as ACM CCS and IEEE S&P. Ioannidis is a Marie-Curie Fellow and has participated in numerous international and European projects. He has coordinated several European and national projects (e.g., PASS, EU-INCOOP, GANDALF, SHARCS), and is currently the project Coordinator of the CYBERSURE, IDEAL-CITIES, I-BIDAAS, THREAT-ARREST, C4IIoT and BIO-PHOENIX H2020 projects, as well as the CERTCOOP INEA/CEF European project. Finally, Dr. Ioannidis is the deputy Coordinator of CONCORDIA, one of the four Cybersecurity Pilot Projects funded by the EU H2020 programme.
Abstract
The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of non-speculative automatic parallelization. The introduction of speculative automatic parallelization overcame these applicability limitations, but, even in the absence of misspeculation, these speculative techniques exhibit high communication and bookkeeping costs for validation and commit. This paper presents Perspective, a speculative-DOALL parallelization framework that maintains the applicability of speculative techniques while approaching the efficiency of non-speculative ones. Unlike current approaches, which apply speculative techniques after the fact to overcome the imprecision of memory analysis, Perspective combines the first speculation-aware memory analyzer, new efficient speculative privatization methods, and a parallelization planner to find the best-performing set of parallelization techniques. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, Perspective obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory machine) than Privateer (11.5x), the most applicable prior automatic speculative-DOALL system.
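To give a flavor of speculative DOALL with privatization, the Python sketch below runs a loop's iterations in parallel chunks, gives each worker a private accumulator for an array that static analysis could not prove free of cross-iteration conflicts, and validates at commit time that no two workers wrote the same element; on a conflict the speculative results are discarded and the loop re-executes sequentially. This is a generic illustration of the technique, not Perspective's planner, privatization methods or runtime.

# Generic sketch of speculative DOALL parallelization with privatization and
# commit-time validation.

from concurrent.futures import ThreadPoolExecutor

def loop_body(i, arr):
    arr[i % len(arr)] += i          # writes that static analysis may not prove independent

def speculative_doall(n_iters, shared, n_workers=4):
    chunks = [range(start, n_iters, n_workers) for start in range(n_workers)]

    def run_chunk(chunk):
        private = [0] * len(shared)             # privatized accumulator for the shared array
        written = set()
        for i in chunk:
            loop_body(i, private)
            written.add(i % len(private))
        return private, written

    with ThreadPoolExecutor(n_workers) as pool:
        results = list(pool.map(run_chunk, chunks))

    # Commit phase: validate that no element was written by two different workers
    # (a simplified stand-in for full dependence validation).
    seen = set()
    for _, written in results:
        if seen & written:
            # Misspeculation: discard speculative state and re-execute sequentially.
            for i in range(n_iters):
                loop_body(i, shared)
            return shared
        seen |= written

    # Validation passed: merge the private accumulators into the shared array.
    for private, written in results:
        for idx in written:
            shared[idx] += private[idx]
    return shared

print(speculative_doall(16, [0] * 8))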
Bio
I am a PhD student in the Liberty Research group at Princeton University, under the supervision of Prof. David I. August. I have also collaborated with Prof. Simone Campanoni from Northwestern University. My research focus is on compilers and automatic parallelization. During my time at Princeton, I have also worked on system security and computer architecture. During my internships at Facebook (summer 2018) and Intel (summer 2017), I worked on binary analysis. Before joining Princeton, I earned my diploma in Electrical and Computer Engineering at the National Technical University of Athens, Greece. For my undergraduate thesis, I worked with Georgios Goumas, Nikela Papadopoulou, and Nectarios Koziris on performance prediction of large-scale systems.
Abstract
Extreme heterogeneity in high-performance computing has led to a plethora of programming models for intra-node programming. The increasing complexity of these approaches and the lack of a unifying model has rendered the task of developing performance-portable applications intractable. To address these challenges, we present the Data-centric Parallel Programming (DAPP) concept, which decouples program definition from its optimized implementation. The latter is realized through the Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that combines fine-grained data dependencies with high-level control flow and is amenable to program transformations. We demonstrate the potential of the data-centric viewpoint with OMEN, a state-of-the-art quantum transport (QT) solver. We reduce the original C++ code of OMEN from 15k lines to 3k lines of Python code and 2k SDFG nodes. We subsequently tune the generated code for two of the fastest supercomputers in the world (as of June ’19) and achieve up to two orders of magnitude higher performance: a sustained 85.45 Pflop/s on 4,560 nodes of Summit (42.55% of peak) in double precision, and 90.89 Pflop/s in mixed precision.
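For readers unfamiliar with the data-centric frontend, the snippet below sketches the programming style: numerical Python annotated with dace types, from which the SDFG is generated and then transformed and tuned separately. It follows the publicly available DaCe framework's @dace.program interface, but the example and names are ours and the exact API should be checked against the DaCe documentation.

# Sketch of the data-centric programming style: numerical Python annotated with
# dace types, from which a Stateful DataFlow multiGraph (SDFG) is generated and
# can then be transformed and tuned independently of the program definition.

import numpy as np
import dace

N = dace.symbol('N')                    # symbolic size, resolved when the program is called

@dace.program
def axpy(a: dace.float64, x: dace.float64[N], y: dace.float64[N]):
    y[:] = a * x + y                    # data movement is captured explicitly in the SDFG

sdfg = axpy.to_sdfg()                   # the data-centric IR, amenable to transformations

x = np.random.rand(1024)
y = np.random.rand(1024)
axpy(2.0, x, y)                         # compiles the SDFG and runs the generated code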
Bio
Alex is a PhD student at the Scalable Parallel Computing Laboratory at ETH Zurich, under the supervision of Prof. Torsten Hoefler. He received his Diploma in Electrical and Computer Engineering from the National Technical University of Athens, under the supervision of Prof. Georgios Goumas. His research interests lie in performance optimization and modeling for parallel and distributed computing systems. Recently, he has been working on data-centric representations and optimizations for High-Performance Computing applications. He was awarded the 2019 Gordon Bell prize for his work on optimizing Quantum Transport Simulations.
Abstract
Remote Procedure Calls are widely used to connect datacenter applications with strict tail-latency service level objectives on the scale of microseconds. Existing solutions utilize streaming or datagram-based transport protocols for RPCs that impose overheads and limit design flexibility. Our work exposes the RPC abstraction to the endpoints and the network, making RPCs first-class datacenter citizens and allowing for in-network RPC scheduling. We propose R2P2, a UDP-based transport protocol specifically designed for RPCs inside a datacenter. R2P2 exposes pairs of requests and responses and allows efficient and scalable RPC routing by separating RPC target selection from request and reply streaming. Leveraging R2P2, we implement a novel join-bounded-shortest-queue (JBSQ) RPC load balancing policy, which lowers tail latency by centralizing pending RPCs in the router and ensuring that requests are only routed to servers with a bounded number of outstanding requests. The R2P2 router logic can be implemented either in a software middlebox or within a P4 switch ASIC pipeline. Our evaluation, using a range of microbenchmarks, shows that the protocol is suitable for μs-scale RPCs and that its tail latency outperforms both random selection and classic HTTP reverse proxies. The P4-based implementation of R2P2 on a Tofino ASIC adds less than 1μs of latency, whereas the software middlebox implementation adds 5μs of latency and requires only two CPU cores to route RPCs at 10 Gbps line rate. R2P2 improves the tail latency of web index searching on a cluster of 16 workers operating at 50% of capacity by 5.7× over NGINX. R2P2 improves the throughput of the Redis key-value store on a 4-node cluster with master/slave replication, for a tail-latency service-level objective of 200μs, by more than 4.8× compared to vanilla Redis.
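The JBSQ policy itself is compact enough to sketch: the router forwards a new RPC to the least-loaded server only if that server has fewer than n outstanding requests, and otherwise queues the RPC centrally, dispatching from that queue whenever a reply frees a slot. The Python model below captures that logic; the class and names are illustrative and correspond to neither the P4 nor the middlebox implementation from the talk.

# Minimal model of the join-bounded-shortest-queue, JBSQ(n), routing policy.

from collections import deque

class JBSQRouter:
    def __init__(self, servers, bound):
        self.bound = bound                          # max outstanding RPCs per server
        self.outstanding = {s: 0 for s in servers}  # in-flight RPCs per server
        self.central_queue = deque()                # RPCs waiting at the router

    def on_request(self, rpc):
        # Send to the least-loaded server, but only if it is below the bound;
        # otherwise keep the RPC centrally so it can go to whichever server frees up first.
        server = min(self.outstanding, key=self.outstanding.get)
        if self.outstanding[server] < self.bound:
            self.outstanding[server] += 1
            return server                            # forward the RPC to this server
        self.central_queue.append(rpc)
        return None                                  # queued at the router

    def on_reply(self, server):
        self.outstanding[server] -= 1
        if self.central_queue:
            rpc = self.central_queue.popleft()
            self.outstanding[server] += 1
            return rpc, server                       # dispatch a pending RPC
        return None

router = JBSQRouter(servers=["s1", "s2"], bound=2)
for i in range(5):
    print(i, router.on_request(f"rpc{i}"))           # the 5th request queues centrally
print(router.on_reply("s1"))                          # a reply frees a slot and dispatches it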
Bio
Marios Kogias is a 5th-year PhD student at EPFL working with Edouard Bugnion. His research focuses on datacenter systems, and specifically on microsecond-scale Remote Procedure Calls. He is interested in improving the tail latency of networked systems by rethinking both operating system mechanisms, e.g. schedulers, and networking, e.g. transport protocols, while leveraging new emerging datacenter hardware for in-network compute. Marios has interned at Microsoft Research, Google, and CERN, and he is an IBM PhD Fellow.
Abstract
Over the past decade, a plethora of systems have emerged to support data analytics in various domains, such as SQL and machine learning. In each of these domains, there are now many different specialized systems that leverage domain-specific optimizations to efficiently execute their workloads. An alternative approach is to build a general-purpose data analytics system that uses a common execution engine and programming model to support workloads in different domains. In this work, we choose representative systems of each class (Spark, TensorFlow, Presto and Hive) and benchmark their performance on a wide variety of machine learning and SQL workloads. We perform an extensive comparative analysis of the strengths and limitations of each system and highlight major areas for improvement for all systems. We believe that the major insights gained from this study will be useful for developers to improve the performance of these systems.
Bio
Evdokia Kassela is a PhD student and researcher at the Computing Systems Laboratory of the National Technical University of Athens (NTUA). She received her diploma in Electrical and Computer Engineering from NTUA in 2013 and is currently in the fourth year of her PhD studies. Her research interests lie in the fields of distributed systems, cloud computing and big-data technologies. She has co-authored 3 papers and 2 posters, and over the last five years she has been involved in the development of various EU- and Greek-funded research and industrial projects.
Nikodimos Provatas was born in 1993 in Athens, Greece. He graduated from the School of Electrical and Computer Engineering of the National Technical University of Athens in 2016 with a GPA of 8.78/10 and a diploma thesis in the field of distributed systems entitled “Exact and Approximate Algorithms for Multiengine Optimization on the Cloud”. In 2016 he started his PhD at the Computing Systems Laboratory. His research interests focus on machine learning on big data. In 2018 he also started the M.Sc. programme of the National Technical University of Athens entitled “Data Science and Machine Learning”.
Abstract
Preserving the growth of computational performance after the end of traditional CMOS scaling relies on making the most of emerging technologies such as new devices, memories, photonics, and specialized architectures. However, if we are wildly successful in accelerating computation, the bottleneck will quickly shift to communication or to the system management of heterogeneous resources. In this talk, I will discuss my current and future approach to changing the computational model and architectures to better fit emerging devices, as well as providing the capability to evaluate emerging devices rapidly at the system scale to better guide future device and architecture research. Then, I will discuss communication, where photonics can provide the means to efficiently perform resource disaggregation and also to increase the performance-per-watt efficiency of hierarchical networks by better matching network connectivity to application demands.
Bio
George Michelogiannakis is a research scientist at Lawrence Berkeley National Laboratory and an adjunct professor at Stanford University. He has extensive work on networking (both off- and on-chip) and computer architecture. His latest work focuses on the post-Moore’s-law era, looking into specialization, emerging devices (transistors), memories, photonics, and 3D integration. He is also currently working on optics and architecture for HPC and datacenter networks.