Data Analytics development framework
DQaaS is a service that aims to provide information about the quality of a requested dataset. Data Quality helps applications and users understand the degree to which a dataset is suitable for their goals. In particular, given a dataset, the service (i) offers access to different quality metrics that are periodically evaluated and (ii) allows applications and users to perform additional quality evaluations at different granularities. Politecnico di Milano (POLIMI) designed and developed the data quality assessment algorithms by considering the datasets available within the EUBra-BIGSEA project.
Developed for Data Quality-oriented preprocessing.
Big Data scientists and big data software developers, especially from the business analytics community, who need to analyse the available data sources and deal with big data issues (i.e., volume, velocity, and variety).
A researcher aims to analyse available data sources by applying data mining algorithms. Invoking DQaaS allows the researcher to understand the characteristics of a data source, to evaluate whether its quality level is suitable for the intended analysis, and to discard the data that do not satisfy her/his requirements. In this specific case, (i) the data volume should be large enough to yield significant results, (ii) accuracy and completeness should be high in order to obtain a correct output, and (iii) timeliness is less relevant, since data mining algorithms usually analyse historical data and impose no specific time constraints. DQaaS is able to evaluate these dimensions and inform the researcher about the appropriateness of the considered sources. In summary, the DQaaS module supports data preprocessing operations by detecting errors, inconsistencies, or missing data, preventing such issues from negatively affecting the results of the mining algorithms.
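As an illustration of the kind of metric DQaaS computes, the sketch below (plain Python, not the actual DQaaS implementation) shows how a completeness score can be evaluated at two granularities: per attribute and for the whole dataset. The record layout and function names are hypothetical.

```python
def attribute_completeness(records, attribute):
    """Fraction of records with a non-missing value for the given attribute."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(attribute) not in (None, ""))
    return present / len(records)


def dataset_completeness(records, attributes):
    """Average of the attribute-level completeness scores."""
    return sum(attribute_completeness(records, a) for a in attributes) / len(attributes)


# Toy records with a missing sensor reading and an empty timestamp.
records = [
    {"id": 1, "speed": 42.0, "timestamp": "2017-04-01T10:00"},
    {"id": 2, "speed": None, "timestamp": "2017-04-01T10:05"},
    {"id": 3, "speed": 38.5, "timestamp": ""},
]

print(attribute_completeness(records, "speed"))                    # 2 of 3 records
print(dataset_completeness(records, ["speed", "timestamp"]))
```

A researcher could then discard sources whose dataset-level score falls below a threshold dictated by the planned analysis.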
A first release of DQaaS was completed in April 2017. The latest version introduces new features such as the evaluation of quality metrics at the tuple level and the possibility of filtering the considered data sources according to data quality requirements. The current version of the tool can be downloaded from the following link: https://github.com/eubr-bigsea/DQaaS
The service can be invoked by preparing a configuration file, i.e., a JSON file in which the user specifies the source to analyse, the dimensions to assess, and the desired granularity levels. No specific skills are required. Users only need a basic knowledge of data quality analysis in order to express their requirements appropriately with respect to the intended use of the data.
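A minimal sketch of such a configuration file is shown below. The field names and values are illustrative assumptions, not the tool's actual schema: the idea is simply that the user names a source, lists the quality dimensions to assess, and picks the granularity levels.

```python
import json

# Hypothetical DQaaS configuration: source to analyse, dimensions to
# assess, and granularity levels (field names are illustrative only).
config = {
    "source": "hdfs:///data/bus_positions.csv",
    "dimensions": ["completeness", "accuracy", "timeliness"],
    "granularity": ["dataset", "attribute", "tuple"],
}

# Write the JSON file that would be passed to the service.
with open("dq_config.json", "w") as f:
    json.dump(config, f, indent=2)

print(json.dumps(config, indent=2))
```

For the real schema, refer to the documentation in the GitHub repository linked above.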
The tool has been implemented in Python using the Spark Python API (PySpark). It is released under the Apache 2.0 license, which allows free use of the code.
The execution costs relate to the time and resources required to complete the assessment operations. On average, a complete analysis of 1 GB of raw data takes about one hour.
Dr. Cinzia Cappiello of Politecnico di Milano: firstname.lastname@example.org
Related publications:
- Tiago Brasileiro Araújo, Cinzia Cappiello, Nádia Puchalski Kozievitch, Demetrio Gomes Mestre, Carlos Eduardo Santos Pires, Monica Vitali: Towards Reliable Data Analyses for Smart Cities. IDEAS 2017: 304-308.
- Cinzia Cappiello, Monica Vitali: Quality Awareness for a Successful Big Data Exploitation. IDEAS 2018.
- Danilo Ardagna, Cinzia Cappiello, Walter Samà, Monica Vitali: Context-aware Data Quality Assessment for Big Data.