Diagnosing Performance Bottlenecks in Massive Data Parallel Programs

Vinícius Dias, Ruens Moreira, Wagner Meira Jr., and Dorgival Guedes

In: 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)


The increasing amount of data being stored and the variety of applications recently proposed to make use of those data have enabled a whole new generation of parallel programming environments and paradigms. Although most of these novel environments provide abstract programming interfaces and embed several run-time strategies that simplify typical tasks in parallel and distributed systems, achieving good performance is still a challenge. In this paper we identify some common sources of performance degradation in the Spark programming environment and discuss some diagnosis dimensions that can be used to better understand such degradation. We then describe our experience in using those dimensions to drive the identification of performance problems, and suggest how their impact may be minimized in real applications.

Characterization of the Taxi Service Based on Rides Requested Through a Smartphone App

Átila M. Silva Júnior, Miguel L. M. Sousa, Faber Z. Xavier, Wender Z. Xavier, Jussara M. Almeida, Artur Ziviani, Francisco Rangel, Cláudio Ávila, Humberto T. Marques-Neto

In: XXXIV Simpósio Brasileiro de Redes de Computadores e Sistemas Distribuídos (SBRC '16)


The use of mobile apps to support a plethora of services offers a valuable opportunity to study population dynamics in urban areas, which in turn can guide improvements both to mobile Internet access and to the systems themselves. In particular, data obtained from mobile applications for requesting taxi rides allows for the study of human mobility. Moreover, the analysis of the user behavior patterns of such services yields valuable insights into user needs. In this paper, we present a characterization study of taxi rides based on mobile app data. In total, we analyze 37,183 rides requested by 16,442 users and serviced by 3,663 distinct taxi drivers in Belo Horizonte during one week. We find, for instance, that 51% of the rides were requested in the southern part of downtown and that, among the cancellations, 49% occurred within the first minute of waiting for the taxi.
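The kind of characterization described above can be sketched as simple aggregations over ride records. The snippet below is a minimal illustration only: the field names (`region`, `cancelled_after_s`) and the sample records are invented, not the paper's actual schema or data.

```python
from collections import Counter

# Hypothetical ride records; field names and values are illustrative.
# cancelled_after_s is None for completed rides, otherwise seconds waited
# before the user cancelled.
rides = [
    {"user": "u1", "region": "south-downtown", "cancelled_after_s": None},
    {"user": "u2", "region": "north",          "cancelled_after_s": 40},
    {"user": "u1", "region": "south-downtown", "cancelled_after_s": 130},
    {"user": "u3", "region": "south-downtown", "cancelled_after_s": 15},
]

def ride_stats(rides):
    """Requests per region, and fraction of cancellations in the first minute."""
    by_region = Counter(r["region"] for r in rides)
    cancels = [r["cancelled_after_s"] for r in rides
               if r["cancelled_after_s"] is not None]
    early = sum(1 for t in cancels if t < 60)
    return by_region, (early / len(cancels) if cancels else 0.0)

regions, early_cancel_frac = ride_stats(rides)
print(regions.most_common(1))  # most requested region
print(early_cancel_frac)       # share of cancellations within one minute
```

With real data, the same two aggregations yield the paper's headline figures (share of requests per area, share of early cancellations).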

Distributed and cloud-based multi-model analytics experiments on large volumes of climate change data in the Earth System Grid Federation eco-system

S. Fiore, M. Płóciennik, C. Doutriaux, C. Palazzo, J. Boutte, T. Żok, D. Elia, M. Owsiak, A. D’Anca, Z. Shaheen, R. Bruno, M. Fargetta, M. Caballer, G. Moltó, I. Blanquer, R. Barbera, M. David, G. Donvito, D. N. Williams, V. Anantharaj, D. Salomoni, and G. Aloisio

In: IEEE Big Data 2016 Conference (Washington), December 5-9, 2016


A case study on climate model intercomparison data analysis addressing several classes of multi-model experiments is being implemented in the context of the EU H2020 INDIGO-DataCloud project. Such experiments require the availability of large amounts of data (on the order of multiple terabytes) related to the output of several climate model simulations, as well as the exploitation of scientific data management tools for large-scale data analytics. More specifically, the paper discusses in detail a use case on precipitation trend analysis in terms of requirements, architectural design, and infrastructural implementation. The experiment has been tested and validated on CMIP5 datasets in the context of a large-scale distributed testbed across the EU and US involving three ESGF sites (LLNL, ORNL, and CMCC) and one central orchestrator site (PSNC).

Testing Web Applications Using Poor Quality Data

Nuno Laranjeiro, Seyma Nur Soydemir, Jorge Bernardino

In: Latin-American Symposium on Dependable Computing (LADC 2016)


Web applications are nowadays used to support enterprise-level business operations and usually rely on back-end databases to deliver service to clients. Research and industry reports indicate the huge impact the quality of the data can have on businesses, especially when applications are not prepared to handle low-quality data. In fact, even in widely tested and used applications, the presence of poor data can sometimes result in severe failures and disastrous consequences for clients and providers, including financial or reputation losses. In this paper, we present an approach based on the runtime injection of poor-quality data into the database interface used by web applications, which allows understanding how vulnerable the application is to the presence of poor-quality data. Results indicate that the approach can be easily used to disclose critical problems in web applications and supporting middleware, helping developers build more reliable services.
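The general idea of injecting poor-quality data at the database interface can be sketched as a wrapper that degrades values before they reach the real INSERT. This is only an illustration of the concept: the mutation operators below (truncation, padding, emptying, sign flips, out-of-range values) are assumptions, not the paper's actual fault model.

```python
import random

def inject_poor_quality(value, rng):
    """Return a degraded variant of a value, mimicking common data-quality faults.
    The fault set here is illustrative, not taken from the paper."""
    if isinstance(value, str):
        fault = rng.choice(["truncate", "whitespace", "empty"])
        if fault == "truncate":
            return value[: max(1, len(value) // 2)]
        if fault == "whitespace":
            return "  " + value + "  "
        return ""
    if isinstance(value, (int, float)):
        # Sign-flipped or wildly out-of-range numeric values.
        return rng.choice([-value, value * 10**6])
    return None  # unexpected NULL for any other type

def corrupting_insert(row, rng, rate=0.5):
    """Corrupt each field with probability `rate` before the real INSERT runs."""
    return {k: (inject_poor_quality(v, rng) if rng.random() < rate else v)
            for k, v in row.items()}

rng = random.Random(42)
print(corrupting_insert({"name": "Alice Smith", "balance": 100}, rng))
```

Running an application's workload through such a corrupting layer, and observing whether it fails cleanly or silently accepts the bad data, is the essence of this style of robustness testing.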

Towards Understanding the Value of False Positives in Static Code Analysis

Carlo Dimastrogiovanni, Nuno Laranjeiro

In: Latin-American Symposium on Dependable Computing (LADC 2016)


Static code analysis is a well-known technique used to detect potential software security issues. Nowadays, given the large variety of vulnerabilities and the increasing complexity of web applications, it is difficult for static code analyzers to identify vulnerabilities in a precise manner. The main problem is the typically high number of false positives reported by these tools, which refer to vulnerabilities that, in practice, do not exist. The common view is that the information regarding false positives is useless. In this paper we take an initial step towards investigating the hypothesis that false positives may be, in fact, a link to potential security problems. We analyzed 3 open-source web applications using a well-known static analyzer, then identified false positives and linked them to potential security problems. Preliminary results suggest that, in many cases, the presence of a false positive indicates a fragility of the application, which is prone, to different degrees, to turn into a real vulnerability.

Experimenting Machine Learning Techniques to Predict Vulnerabilities

Henrique Alves, Baldoino Fonseca, Nuno Antunes

In: Latin-American Symposium on Dependable Computing (LADC 2016)


Software metrics can be used as an indicator of the presence of software vulnerabilities. These metrics have been used with machine learning to predict source code prone to contain vulnerabilities. Although it is not possible to find the exact location of the flaws, the models can show which components require more attention during inspections and testing. Each new technique uses its own evaluation dataset, which often has limited size and representativeness. In this experience report, we use a large and representative dataset to evaluate several state-of-the-art vulnerability prediction techniques. This dataset was built with information on 2,186 vulnerabilities from five widely used open-source projects. Results show that the dataset can be used to distinguish which techniques are best. It is also shown that some of the techniques can predict nearly all of the vulnerabilities present in the dataset, although with very low precision. Finally, accuracy, precision, and recall are not the most effective metrics to characterize the effectiveness of these tools.
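The high-recall/low-precision behavior mentioned above is easy to see with the standard definitions of these metrics. The sketch below uses made-up labels (not the paper's dataset) to show how a predictor that flags nearly every component reaches near-perfect recall while its precision collapses.

```python
def precision_recall(y_true, y_pred):
    """Standard precision/recall over binary labels (1 = vulnerable)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented ground truth: 2 vulnerable components out of 10.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# A degenerate predictor that flags everything: it "finds" all
# vulnerabilities (recall 1.0) but 8 of its 10 alarms are false (precision 0.2).
flag_all = [1] * 10
p, r = precision_recall(y_true, flag_all)
print(p, r)  # 0.2 1.0
```

This is why the abstract argues that accuracy, precision, and recall alone can mischaracterize such tools: on imbalanced vulnerability data, trivial predictors score deceptively well on individual metrics.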

Challenges on Anonymity, Privacy and Big Data

Tania Basso, Roberta Matsunaga, Regina Moraes, Nuno Antunes

In: Workshop on Dependability in Evolving Systems, 2016, Cali, Colombia


This paper provides an overview of the tools and methodologies for data privacy protection that can cope with the challenges raised by Big Data storage and analytics processing, with a focus on anonymity. Preserving individual privacy is one of the major issues in the context of Big Data: while handling huge volumes of data, it is possible that sensitive or personally identifiable information ends up disclosed. In fact, even when dealing with anonymized raw data, sensitive information may be extracted through analytics. Preserving anonymity is particularly difficult because it must be done while still allowing the analytics to produce useful insight about the data. We further discuss these challenges and future research directions for performing big data analytics in a privacy-compliant way.

Experimental Assessment of NoSQL Databases Dependability

Luís Ventura, Nuno Antunes

In: 12th European Dependable Computing Conference (EDCC 2016)


NoSQL databases are the response to the sheer volume of data being generated, stored, and analysed by modern users and applications. They are extremely scalable horizontally and store data with less rigid structures than the relational ones. NoSQL databases are known to compromise consistency in favour of availability, partition tolerance, and performance. Several studies evaluated the performance of these databases, but users also need to understand how they behave in the presence of faults and to quantify the impact of those faults. This paper presents an experimental evaluation of NoSQL databases' dependability using fault injection, comparing three widely used NoSQL engines based on how they perform in the presence of operator faults. The results clearly show that the integrity of the data is often affected, even in the presence of simple faults. It is also shown that different databases handle the workloads and the faults differently, evidencing that users must carefully select the solution to use in their systems.

Traffic Accident Diagnosis in the Last Decade - A Case Study

Isabelle C. Luís, Nádia P. Kozievitch, Tatiana M. C. Gadda

In: 19th International Conference on Intelligent Transportation Systems


Accident reduction projects must begin with research into the current situation and trends based on a historical record of accidents. Identification of accident hot spots is a vital step in safety management, since roadway improvements should not be applied to random locations. In this context we present a diagnosis of the traffic accidents in the Cidade Industrial de Curitiba, CIC (Brazil), and identify several measures that could contribute to the reduction of accidents. We take advantage of GIS and exploratory analysis to identify road accident hot spots both visually and statistically. Finally, some guidelines are suggested, using 10 years of data in a case study.
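A basic statistical form of the hot-spot identification mentioned above is to bin accident coordinates into a spatial grid and rank the cells by count. The sketch below is a simplified stand-in for the paper's GIS-based analysis: the coordinates and cell size are invented for illustration.

```python
from collections import Counter

def hot_spots(points, cell=0.01, top=3):
    """Bin (x, y) accident locations into a square grid of side `cell`
    and return the `top` most accident-dense cells with their counts."""
    cells = Counter((int(x // cell), int(y // cell)) for x, y in points)
    return cells.most_common(top)

# Invented accident locations: three fall in the same grid cell, one is isolated.
accidents = [(1.001, 2.003), (1.004, 2.008), (1.002, 2.001), (5.0, 5.0)]
hot = hot_spots(accidents, cell=0.01, top=1)
print(hot)  # the densest cell and its accident count
```

Real analyses typically refine this with kernel density estimation or statistical significance tests over the counts, but grid binning already separates systematic hot spots from isolated incidents.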

Combining K-means Method and Complex Network Analysis to Evaluate City Mobility

Emerson L. C. da Silva, Marcelo de Oliveira Rosa, Keiko V. O. Fonseca, Ricardo Luders, Nádia P. Kozievitch

In: 19th International Conference on Intelligent Transportation Systems


Complex networks have been used to model public transportation systems (PTS) considering the relationship between bus lines and bus stops. Previous works focused on statistically characterizing either the whole network or its individual bus stops and lines. The present work statistically characterizes different regions of a city (Curitiba, Brazil), assuming that a passenger can easily access different unconnected bus stops within a geographic area. The K-means algorithm was used to partition the bus stops into K = 2 to 40 clusters of similar geographic area. Results showed a strong inverse relationship (p < 2 × 10⁻¹⁶ and R² = 0.74 for K = 40 in a log model) between the degree and the average path length of clustered bus stops. For Curitiba, this revealed well-served and badly served regions (the downtown area, and a few suburbs in Southern and Western Curitiba, respectively). Some of the well-served regions showed quantitative indication of potential bus congestion. By varying K, city planners can obtain a zoomed view of the behavior of their PTS in terms of complex network metrics.
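The partitioning step described above can be sketched with a plain Lloyd's-algorithm K-means over stop coordinates. This is a minimal standalone illustration: the bus-stop coordinates are invented, and the subsequent complex-network metrics (degree, average path length per cluster) are not computed here.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's K-means over 2-D points (e.g. bus-stop coordinates).
    Returns the clusters and their final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster went empty).
        centroids = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centroids

# Two well-separated groups of invented "stops" should come out as two clusters.
stops = [(0.0, 0.0), (0.1, 0.1), (0.05, 0.0),
         (10.0, 10.0), (10.1, 9.9), (9.95, 10.05)]
clusters, _ = kmeans(stops, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

In the paper's setting, each resulting cluster would then be treated as a region whose stops' degree and average path length are measured on the PTS network, which is where the reported inverse relationship appears.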