Distributed Computing

Our research lies in the fields of Cloud Computing and Big Data. We particularly focus on Big Data management, on performance and anonymity aspects of Big Data Analytics, and on the elasticity of Cloud Computing platforms.

Elasticity of Cloud Applications/Platforms

One of the most appealing (albeit challenging) characteristics of cloud computing is the ability to support elastic computing. Elasticity refers to the ability of the infrastructure (IaaS), platform (PaaS) or software (SaaS) to expand or contract dedicated resources in order to meet the exact demand at runtime. Elasticity can be provided through simple yet customizable rules, so that application performance can be throttled in a multi-grained, controlled manner, benefiting both cloud providers and customers.

Members of the distributed systems group have designed and implemented TIRAMOLA [CIKM11, SIGMOD12, SWIM12, CCGRID13], a framework to exploit the elastic behavior of modern NoSQL databases in the cloud. It currently supports creating and monitoring the DB cluster, as well as increasing and decreasing its size based on user-defined metrics. TIRAMOLA is open-sourced and can be downloaded from tiramola.googlecode.com.
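To make the idea of rule-driven cluster resizing concrete, the following is a minimal sketch and not TIRAMOLA's actual decision logic. The `ScalingRule` class, the `decide_cluster_size` function, the metric names and all thresholds are hypothetical and chosen only for illustration: the cluster grows by one node when any monitored metric exceeds its upper threshold and shrinks by one node when every metric is below its lower threshold.

```python
from dataclasses import dataclass


@dataclass
class ScalingRule:
    metric: str          # name of a monitored metric (hypothetical, e.g. "cpu")
    add_above: float     # grow the cluster when the metric exceeds this value
    remove_below: float  # shrinking is allowed when the metric is below this value


def decide_cluster_size(current_nodes, metrics, rules, min_nodes=1, max_nodes=16):
    """Apply the rules once and return the new cluster size."""
    # Grow if ANY rule reports pressure; shrink only if ALL rules report slack.
    grow = any(metrics[r.metric] > r.add_above for r in rules)
    shrink = bool(rules) and all(metrics[r.metric] < r.remove_below for r in rules)
    if grow:
        return min(current_nodes + 1, max_nodes)
    if shrink:
        return max(current_nodes - 1, min_nodes)
    return current_nodes


# Illustrative rules: thresholds are assumptions, not measured values.
rules = [ScalingRule("cpu", 0.80, 0.30), ScalingRule("latency_ms", 50.0, 10.0)]
new_size = decide_cluster_size(4, {"cpu": 0.90, "latency_ms": 20.0}, rules)
```

Requiring all metrics to be low before shrinking, while letting a single high metric trigger growth, is one simple way to avoid oscillating between sizes; a real system would also have to account for the cost of moving data when nodes join or leave.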

Aspects of Distributed Big Data Analytics

Nowadays data are produced at an astounding rate: market globalization, business process automation, the growing use of sensors and other data-producing devices, along with the increasing affordability of hardware, have contributed to this continuous trend. Large companies, scientific organizations, as well as more specialized ones rely heavily on data analysis in order to identify behavioral patterns and discover interesting trends and associations. Our group focuses on various aspects of Big Data Analytics.

CsLab members have proposed and implemented various tree and hypercube structures to efficiently store, index and query multidimensional [HPDC10, CIKM10, JPDC11b], hierarchical [WIDM08, P2P08, CooPIS08, HPDC09, CC10, DEXA11, JPDC11a, TLDKS12] and time-series data [MDAC10] in a distributed manner. The key feature of these structures is their adaptability to the incoming workload, which allows for efficient OLAP operations with minimum storage overhead, while ensuring fault-tolerance.

Current research efforts exploit multi-engine environments for big data analytics, combining the different models, cost and quality of existing analytics engines. Such environments require an intelligent management system for orchestrating complex analytics tasks over the various engines.

Furthermore, group members currently deal with privacy issues arising from the analysis of Big Data in Distributed/Cloud environments [PersDB11]. The goal is to ensure that individuals are unidentifiable in released data, even when the data are combined with external sources and background information.
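One common way to formalize "unidentifiable in released data" is k-anonymity: every combination of quasi-identifier values (attributes that could be linked with external sources) must be shared by at least k records. The text above does not say which privacy model the group uses, so the sketch below is only an illustration of that general notion; the function name, attribute names and records are hypothetical.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    # Group records by their quasi-identifier values and count each group.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical, already-generalized records (ZIP codes and ages are coarsened).
released = [
    {"zip": "157**", "age": "20-30", "diagnosis": "flu"},
    {"zip": "157**", "age": "20-30", "diagnosis": "cold"},
    {"zip": "106**", "age": "30-40", "diagnosis": "flu"},
]
safe = is_k_anonymous(released, ["zip", "age"], k=2)
```

Here the third record is the only one in its group, so the release is not 2-anonymous: anyone who knows a person's approximate ZIP code and age range could single that record out.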

Distributed RDF Databases

The pace at which data are described, queried and exchanged using the RDF specification has been ever increasing with the proliferation of the Semantic Web. Modern, large-scale RDF datasets cannot be effectively managed by centralized RDF databases and thus call for efficient and scalable distributed solutions. Indexing and querying huge RDF datasets using modern distributed computing frameworks is one of the main interests of our work.

Our group has successfully implemented H2RDF [WWW12, SWIM12, SWIM14], a fully distributed RDF store that combines the MapReduce processing framework with HBase, a distributed NoSQL database. H2RDF can process complex join queries in a highly scalable fashion through an adaptive query planner that makes informed, cost-based decisions over both the join order and the execution type. Joins are executed as either single-machine or distributed jobs, using an amount of resources proportional to the amount of processing required.
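The kind of decision such a planner makes can be sketched as follows. This is a simplified illustration under stated assumptions, not H2RDF's actual planner or cost model: the `plan_joins` function, the greedy smallest-first join order, the size thresholds and the assumption that a join's output is no larger than its smaller input are all hypothetical.

```python
import math


def plan_joins(pattern_sizes, centralized_limit=100_000, rows_per_worker=1_000_000):
    """Order joins greedily by estimated size and pick an execution type for each.

    pattern_sizes maps a triple-pattern label to its estimated number of matches.
    Returns a list of (left, right, mode, workers) tuples.
    """
    order = sorted(pattern_sizes, key=pattern_sizes.get)  # most selective first
    if not order:
        return []
    plan = []
    left, left_size = order[0], pattern_sizes[order[0]]
    for right in order[1:]:
        input_size = left_size + pattern_sizes[right]
        if input_size <= centralized_limit:
            mode, workers = "centralized", 1
        else:
            # Resources grow in proportion to the amount of input to process.
            mode, workers = "distributed", math.ceil(input_size / rows_per_worker)
        plan.append((left, right, mode, workers))
        # Crude assumption: the join result is no larger than its smaller input.
        left_size = min(left_size, pattern_sizes[right])
        left = f"({left} JOIN {right})"
    return plan


# Hypothetical cardinality estimates for three triple patterns.
sizes = {"?x type Person": 5_000_000, "?x name ?n": 40_000, "?x worksAt ?c": 20_000}
plan = plan_joins(sizes)
```

With these estimates, the two selective patterns are joined first on a single machine, and only the final join against the large pattern is run as a distributed job.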

Subsequently, our group proposed H2RDF+ [BD13, SIGMOD14], which extends H2RDF and performs efficient, highly scalable distributed Merge and Sort-Merge RDF joins. Both the H2RDF and H2RDF+ databases are open-sourced and can be downloaded from h2rdf.googlecode.com.

Load Balancing of Distributed Range Queriable Data Structures

Systems such as P2P overlays or NoSQL databases have been shown to efficiently support the processing of range queries over large numbers of participating hosts. In such systems, uneven load allocation has to be effectively tackled in order to minimize the number of overloaded peers and optimize their performance. Researchers of our group have identified two basic methodologies to achieve load-balancing: iterative key re-distribution between neighbors, and node migration. NIXMIG is a hybrid method that adaptively utilizes these two extremes to achieve both fast and cost-effective load-balancing [CoopIS08, P2P09, TPDS11, SIGMOD13].
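The two basic methodologies can be contrasted with a toy model. The sketch below is not the NIXMIG algorithm; it is a hypothetical single balancing step over a list of per-node loads ordered by key range, in which the function name, the load threshold and the choice of when to migrate are assumptions made for illustration. An overloaded node first tries to hand its excess keys to a neighbor; if neither neighbor can absorb them, the least-loaded remote node leaves its place, passes its keys to one of its own neighbors, and rejoins next to the overloaded node to take half of its load.

```python
def rebalance_step(loads, threshold):
    """Perform one balancing step; return (action, new_loads)."""
    loads = list(loads)
    hot = max(range(len(loads)), key=loads.__getitem__)
    if loads[hot] <= threshold:
        return "balanced", loads
    neighbors = [j for j in (hot - 1, hot + 1) if 0 <= j < len(loads)]
    if not neighbors:
        return "overloaded", loads
    excess = loads[hot] - threshold

    # Option 1: key re-distribution with the lighter adjacent node.
    light = min(neighbors, key=loads.__getitem__)
    if loads[light] + excess <= threshold:
        loads[hot] -= excess
        loads[light] += excess
        return "redistribute", loads

    # Option 2: node migration. The least-loaded other node hands its keys to
    # one of its own neighbors and rejoins right after the overloaded node.
    helper = min((j for j in range(len(loads)) if j != hot), key=loads.__getitem__)
    heirs = [j for j in (helper - 1, helper + 1) if 0 <= j < len(loads) and j != hot]
    if not heirs:
        return "overloaded", loads
    heir = min(heirs, key=loads.__getitem__)
    loads[heir] += loads[helper]
    half = loads[hot] // 2
    loads[hot] -= half
    result = []
    for j, load in enumerate(loads):
        if j == helper:
            continue  # the helper has left its old position
        result.append(load)
        if j == hot:
            result.append(half)  # the helper's new position, with half the load
    return "migrate", result


action, new_loads = rebalance_step([45, 100, 45, 5, 10], threshold=50)
```

Re-distribution is cheap but only moves load one hop at a time, whereas migration relieves a hotspot immediately at the cost of relocating a node; a hybrid scheme chooses between them depending on how loaded the neighborhood is.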

Big Data Indexing Techniques

Members of CsLab have created a distributed processing platform suitable for indexing, storing and serving large amounts (on the order of terabytes and more) of content data under heavy request load [MDAC10].

Subsequently, the system has been enhanced with a mechanism for fast and frequent updates on web-scale inverted indexes [DMC12]. The proposed update technique allows incremental processing of new or modified data and minimizes the changes required to the index, significantly reducing the update time, which is now independent of the existing index size. The system is open-sourced and can be downloaded from hmr-index-updater.googlecode.com.
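The idea of an incremental update can be illustrated with a small in-memory example. This is only a sketch under simplifying assumptions and not the distributed implementation described above: the `InvertedIndex` class, its method names and the whitespace tokenizer are hypothetical. When a document is added or modified, only the posting lists of terms that actually changed are touched, so the work depends on the size of the change rather than on the size of the index.

```python
from collections import defaultdict


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.doc_terms = {}               # doc id -> terms currently indexed for it

    def upsert(self, doc_id, text):
        """Index a new or modified document; return how many posting lists changed."""
        new_terms = set(text.lower().split())
        old_terms = self.doc_terms.get(doc_id, set())
        for term in old_terms - new_terms:  # terms removed from the document
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]
        for term in new_terms - old_terms:  # terms added to the document
            self.postings[term].add(doc_id)
        self.doc_terms[doc_id] = new_terms
        return len(old_terms ^ new_terms)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), ()))


index = InvertedIndex()
index.upsert("d1", "cloud elastic storage")
index.upsert("d2", "cloud index")
touched = index.upsert("d1", "cloud elastic index")  # modify an existing document
```

The modification of `d1` replaces one term with another, so only two posting lists are rewritten; the entries for the unchanged terms are left alone.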

Profiling of Cloud Applications

Cloud computing has enabled numerous companies to develop and deploy their applications over cloud infrastructures for a wealth of reasons including (but not limited to) decreasing costs, avoiding administrative effort and rapidly allocating new resources. Virtualization, however, adds an extra layer to the software stack, making it harder to predict the relation between allocated resources and application performance, which is a key factor for every industry.

To address this challenge, members of the group are currently designing and implementing a profiling system able to automatically deploy cloud applications in representative resource combinations and predict the application performance for all acceptable deployment combinations.