Giovanni Paolo Gibilisco, Min Li, Li Zhang, Danilo Ardagna
In: Proceedings of the IEEE International Conference on Cloud Computing
Spark has grown in both popularity and complexity in recent years. To use available resources efficiently, users need to understand how the behavior of their applications is affected by the size of the datasets and by various configuration settings. Indeed, Spark allows users to specify many configuration parameters, and understanding the impact of these choices on application execution time is not easy. An accurate estimate of application execution time is important for cluster capacity planning and runtime scheduling. In this work we propose a gray-box approach to analyze the performance of Spark applications deployed in public cloud infrastructures. The approach is divided into two phases. During the profiling phase, the application is executed multiple times against different subsets of the input datasets to understand the effect of the data size on the execution time and its dependency on the main configuration parameters. During the estimation phase, we use the data gathered in the profiling phase to predict the execution time of the application when run against the entire dataset. The prediction approach builds several models in order to estimate separately how the time required to execute each stage of the application grows. Finally, the DAG used by Spark to schedule the execution of stages is analyzed to aggregate the predicted stage execution times into the overall application execution time. Both phases are supported by our open source tool, SLAP. Experimental results show that our model predicts application execution time accurately. The approach outperforms pure black-box polynomial regression methods, achieving a relative error of 1-3%.
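The two-phase idea can be illustrated with a minimal sketch. This is not the SLAP implementation. It assumes a simple linear model per stage (time = a + b * size) and aggregates stage predictions along the longest path of the stage DAG; both choices are assumptions made here for illustration. All function names, stage names, and timings below are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def fit_line(sizes, times):
    """Ordinary least-squares fit of time = a + b * size (assumed model)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_stage_times(profiles, full_size):
    """Fit one model per stage and extrapolate it to the full data size.

    profiles maps a stage name to a list of (data_size, measured_time)
    pairs collected during profiling runs on subsets of the input.
    """
    predictions = {}
    for stage, runs in profiles.items():
        sizes, times = zip(*runs)
        a, b = fit_line(sizes, times)
        predictions[stage] = a + b * full_size
    return predictions


def aggregate_over_dag(stage_times, parents):
    """Combine stage predictions along the DAG (assumed: critical path).

    parents maps each stage to the stages it depends on. A stage starts
    when its slowest parent finishes; the application time is the latest
    finish time over all stages.
    """
    finish = {}
    for stage in TopologicalSorter(parents).static_order():
        start = max((finish[p] for p in parents.get(stage, ())), default=0.0)
        finish[stage] = start + stage_times[stage]
    return max(finish.values())


# Made-up profiling data: (data size in GB, stage time in seconds).
profiles = {
    "s0": [(1, 12.0), (2, 22.0), (4, 42.0)],
    "s1": [(1, 5.0), (2, 8.0), (4, 14.0)],
    "s2": [(1, 7.0), (2, 12.0), (4, 22.0)],
    "s3": [(1, 3.0), (2, 4.0), (4, 6.0)],
}
# Hypothetical stage DAG: s1 and s2 both depend on s0, s3 joins them.
parents = {"s0": [], "s1": ["s0"], "s2": ["s0"], "s3": ["s1", "s2"]}

stage_times = predict_stage_times(profiles, full_size=10)
app_time = aggregate_over_dag(stage_times, parents)
print(f"predicted application time: {app_time:.1f} s")
```

Modeling each stage separately lets stages with different growth rates be extrapolated independently before they are combined. The critical-path rule assumes that independent stages run fully in parallel, which holds only when the cluster has enough free executors.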