Parallel Computing

Our research focuses on hardware architectures, runtime management environments, system software, programming models and applications for computing platforms ranging from multicore systems to datacenters and supercomputers.

Hardware Architectures

We are always keen on exploring modern and emerging architecture designs, evaluating their performance and energy efficiency. The main focus of our work has been on the effective utilization of the multiple hardware contexts of today’s processors. We have successfully employed them for Thread-Level Parallelism (TLP) or Speculative Execution [ICPP06, HPCC06, SCJ08], while looking for ways to balance their conflicting requirements for high responsiveness and low resource consumption [MTAAP08]. Our ongoing research targets energy efficiency [HPPAC13].

Our group also focuses on various challenges regarding on- and off-chip memory hierarchies of modern multicore systems. We are interested both in proposing new cache designs (insertion/replacement policies, cache partitioning [SAMOS08]) and in evaluating the hierarchies integrated in today’s systems and prototypes (NUMA [PRACE Whitepaper]). Our current research efforts target the use of data compression techniques to reduce the latency introduced by data transfers and/or increase the effective storage capacity of memory modules.

Parallel Programming

CSLab has a long tradition in parallel programming, dealing with several aspects such as programming languages, tools, compilers and run-time systems. CSLab members have proposed one of the first methodologies to automatically generate message-passing code for stencil computations based on the polyhedral model [TPDS03, PARCO06] and written one of the seminal papers in hybrid MPI-OpenMP programming [IPDPS04].

Our ongoing research in the field focuses on Transactional Memory (TM) and its implications for parallel algorithms [MTAAP09, ICPP09] and concurrent data structures [TRANSACT15, PDP16]. Hardware support for TM incorporated in modern processors has given a further boost to our motivation, enabling us to apply our research on hardware platforms awarded by Intel & IBM.

Resource-aware Scheduling

Scheduling tasks on compute engines within the context of parallel computing is one of the longest-running activities in CSLab. We have worked on task-graph mapping on multi-processor architectures [PDP00], hyperplane schedules to enable computation and communication overlapping [IPDPS01, JPDC03], communication-aware scheduling [TPDS09], and (memory) bandwidth-aware scheduling [ICPADS06].
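As an illustration of the general idea behind a hyperplane schedule, the sketch below groups the points of a 2D iteration space by the time step t = i + j. It is a minimal textbook example, not the method of the cited papers: the dependence pattern (each point depends on its upper and left neighbours), the function names `wavefront_schedule` and `run_wavefront`, and the update rule are all assumptions chosen for illustration, and communication overlapping is not modelled.

```python
def wavefront_schedule(n_rows, n_cols):
    """Group the points (i, j) of an n_rows x n_cols iteration space
    by the hyperplane t = i + j they lie on (hypothetical helper)."""
    steps = [[] for _ in range(n_rows + n_cols - 1)]
    for i in range(n_rows):
        for j in range(n_cols):
            steps[i + j].append((i, j))
    return steps


def run_wavefront(grid):
    """Apply grid[i][j] += grid[i-1][j] + grid[i][j-1] in place,
    visiting the points one hyperplane at a time."""
    n_rows, n_cols = len(grid), len(grid[0])
    for step in wavefront_schedule(n_rows, n_cols):
        # Points on the same hyperplane do not depend on each other,
        # so this inner loop could be distributed across processors.
        for i, j in step:
            up = grid[i - 1][j] if i > 0 else 0
            left = grid[i][j - 1] if j > 0 else 0
            grid[i][j] += up + left
    return grid


def run_sequential(grid):
    """Reference: the same update in plain row-major order."""
    for i in range(len(grid)):
        for j in range(len(grid[0])):
            up = grid[i - 1][j] if i > 0 else 0
            left = grid[i][j - 1] if j > 0 else 0
            grid[i][j] += up + left
    return grid
```

Because every dependence of a point lies on an earlier hyperplane, the wavefront order produces the same result as the sequential order while exposing the parallelism within each time step.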

Currently, our focus is on interference-aware scheduling for CMP architectures, where several critical hardware components (e.g. cache and memory link) are shared among the various applications (single- or multi-threaded). Within this framework, we are also dealing with the scheduling challenges within the ecosystems of cloud infrastructures and extreme-scale supercomputers. For instance, deploying multiple Virtual Machines (VMs) running diverse types of workloads on current many-core cloud computing infrastructures raises an important challenge: the Virtual Machine Monitor (VMM) has to efficiently multiplex VM accesses to the hardware. Having a mixture of VMs with different types of workloads running concurrently leads to poor I/O performance. We argue that altering the scheduling concept can optimize the system’s overall performance and energy consumption. We focus on a system with multiple scheduling policies that co-exist and service VMs according to their workload characteristics. To this end, we have designed a framework [VHPC2011] that provides three basic coexisting scheduling policies and implemented it in the Xen paravirtualized environment.


We believe that modern High Performance Interconnection Networks provide abstractions that can be exploited in Virtual Machine execution environments but lack support in sharing architectures. Previous work [VHPC2011] has shown that integrating the semantics of Virtualization in specialized software that runs on Network Processors can isolate and ultimately minimize the overhead of the Hypervisor regarding access to the device by Guest VMs. Direct I/O has been proposed as the solution to the CPU overhead imposed by guest-VM-transparent services, which can lead to low throughput on high-bandwidth links. However, minimizing CPU overhead comes at the cost of giving away the benefits of the split-driver model [VHPC2010]. Integrating protocol offload support (present in most modern NICs) in virtual network device drivers can lead to performance improvement. Bypassing the Hypervisor in data movement can also minimize the overhead imposed by heavy I/O, but at the cost of safety and memory protection.

In this direction, we have developed Xen2MX [JSS2014], a paravirtual interconnection framework, binary compatible with Myrinet/MX and wire compatible with MXoE. Xen2MX features zero-copy communication as well as memory sharing techniques in order to construct the most efficient data path for high-performance communication in virtualized environments that can be achieved with software techniques.

Additionally, our team has focused on optimizing inter-VM, intra-node communication. We examine both memory mapping (libxenvchan) as well as copy techniques (V4VSockets) to provide an efficient data-transfer mechanism for VMs residing on the same container. Working in this direction will enable us to study the effects of hypervisor involvement in the data path and examine the trade-offs that occur. V4VSockets encompasses our work on this topic, providing socket semantics to applications running on co-existing VMs and achieving nearly 4x the performance of the generic approach (TCP/IP communication).

Systems support for data-intensive applications

With the advent of Cloud computing, deploying data-intensive applications in virtualized environments has become a common and popular choice. In this case, the performance of the I/O subsystem varies, depending on the underlying infrastructure and the application characteristics. Additionally, this increases the complexity of I/O management as well as its characterization/prediction.

Moreover, our team focuses on the effects of the diversity of I/O subsystems present in popular private or public cloud providers. To this end, we are working towards an I/O benchmarking scheme and its subsequent modelling for data-intensive applications in Cloud environments. We believe that such work will offer valuable information related to which application workloads and storage characteristics affect high-end I/O performance, and thus the application’s behavior.

Previous work [JCC2010] has focused on minimizing the impact of remote block I/O operations due to memory and peripheral bus bandwidth limitations on the server side; to overcome these limitations, we have developed GMBlock, a scalable block-level storage sharing system over Myrinet, so that shared disk filesystems may be deployed over a shared-nothing architecture. In this case, every cluster node assumes a dual role; it is both a compute and a storage node.

Sparse Matrix-Vector Multiplication (SpMV)

Greatly motivated by the specific demands of real-life, resource-demanding applications, we have focused our research effort on one of the most important and demanding computational kernels, Sparse Matrix-Vector Multiplication (SpMV). SpMV is extremely memory-bandwidth hungry and scales poorly on modern multicore and manycore platforms. We have performed a thorough evaluation to shed light on the subtle performance issues of SpMV [SUPE09] and proposed a new efficient storage format for sparse matrices called Compressed Sparse eXtended (CSX) [PPOPP11, TPDS13, IPDPS13]. CSX is the basic building block of SparseX, a new, optimized library for sparse computations.
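To make the kernel concrete, here is a minimal sketch of SpMV over the widely used Compressed Sparse Row (CSR) format. It is a baseline illustration only: it is not CSX and does not use the SparseX API, and the function name `spmv_csr` and the example matrix are assumptions made for this sketch. Each nonzero costs one value load and one column-index load for a single multiply-add, which hints at why the kernel tends to be limited by memory bandwidth rather than by arithmetic.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx[k] and values[k] give the column and value of nonzero k.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access to x through col_idx is what makes the
            # memory access pattern irregular.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example matrix (an assumption for illustration):
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)
print(y)  # [7.0, 4.0, 18.0]
```

The index array `col_idx` is pure metadata that must be streamed from memory alongside the values, so shrinking that metadata is one natural way to reduce the kernel's memory traffic.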

Performance Modeling

Performance modeling is a key activity that supports critical decision-making components in the parallel computing ecosystem. Recently, we have worked on performance models for SpMV [ICPP09] and a synergistic scalability prediction approach for multi-threaded applications [MUCOCOS2013]. Currently, we are taking on the challenge of devising a new communication prediction model for large-scale parallel applications running on supercomputers.


GPUs have become more powerful and more general-purpose, allowing them to be applied to general-purpose parallel computing tasks with excellent power efficiency. As a result, accelerator-based systems have become the mainstream solution in heterogeneous platforms in the context of High Performance Computing (HPC).

CSLab has focused on integrating the benefits of heterogeneous systems into a distributed environment; we work towards enabling a generic compute cluster to benefit from accelerated application execution by offloading computation to a remote GPU node. As data transfers (to/from the GPU node) are realized through the network, we study methods to minimize the communication overhead between the node running the application and the GPU node. To this end, we examine several high-performance interconnection frameworks, such as Infiniband and 10GbE. We have developed ib_gpudirect, a prototype for Infiniband networks supporting the CUDA framework which transfers data directly from the GPU memory to the network, bypassing the local memory hierarchy.