EUROPE - BRAZIL COLLABORATION OF BIG DATA SCIENTIFIC RESEARCH THROUGH CLOUD-CENTRIC APPLICATIONS

DQaaS - Data Quality-As-A-Service

CATEGORY

Data Analytic development framework


DQaaS is a service that provides information about the quality of a requested dataset. Data Quality helps applications and users understand the degree to which a dataset is suitable for their goals. In particular, for a given dataset, the service (i) offers access to different quality metrics that are evaluated periodically and (ii) allows applications and users to perform additional quality evaluations at different granularities. Politecnico di Milano (POLIMI) designed and developed the data quality assessment algorithms by considering the datasets available within the EUBra-BIGSEA project.

 

Who should use it? 


DQaaS was developed for Data Quality-oriented preprocessing.

It is intended for Big Data scientists and big data software developers, especially those from the business analytics software development community, who need to analyse the available data sources and deal with big data issues (i.e., volume, velocity and variety).

User scenario

A researcher wants to analyse the available data sources by applying data mining algorithms. Invoking DQaaS allows the researcher to understand the characteristics of a data source, to evaluate whether its quality level suits the analysis that s/he intends to perform, and to discard the data that do not satisfy her/his requirements. In this specific case, (i) data volume should be large enough to yield significant results, (ii) accuracy and completeness should be high in order to obtain a correct output, and (iii) timeliness is less relevant, since data mining algorithms usually perform historical evaluations and impose no specific time constraints. DQaaS is able to evaluate these dimensions and to inform the researcher about the appropriateness of the considered sources. In summary, the DQaaS module supports data preprocessing operations by detecting errors, inconsistencies or missing data, and by preventing such issues from negatively affecting the results provided by the mining algorithms.
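The dimensions named in this scenario can be illustrated with a small sketch. It is written in plain Python rather than in the PySpark code of DQaaS, and the record layout, the function names and the formulas (record count for volume, share of non-missing values for completeness, age of the newest record as a proxy for timeliness) are assumptions chosen for illustration, not the definitions used by the project. Accuracy is left out because it needs reference values to compare against.

```python
from datetime import datetime

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "route": "A10", "timestamp": datetime(2017, 3, 1, 8, 0)},
    {"id": 2, "route": None, "timestamp": datetime(2017, 3, 1, 9, 0)},
    {"id": 3, "route": "B22", "timestamp": None},
    {"id": 4, "route": "A10", "timestamp": datetime(2017, 3, 2, 8, 30)},
]


def volume(rows):
    """Volume, taken here simply as the number of records."""
    return len(rows)


def completeness(rows, attributes):
    """Share of attribute values that are not missing (assumed definition)."""
    total = len(rows) * len(attributes)
    if total == 0:
        return 0.0
    present = sum(
        1 for row in rows for name in attributes if row.get(name) is not None
    )
    return present / total


def age_in_days(rows, time_attribute, now):
    """Age of the newest record in days, a simple stand-in for timeliness."""
    stamps = [
        row[time_attribute] for row in rows if row.get(time_attribute) is not None
    ]
    if not stamps:
        return None
    return (now - max(stamps)).total_seconds() / 86400


attributes = ["id", "route", "timestamp"]
report = {
    "volume": volume(records),
    "completeness": completeness(records, attributes),
    "age_days": age_in_days(records, "timestamp", datetime(2017, 3, 3, 8, 30)),
}
print(report)
```

A researcher in the scenario above would compare such values against her/his own thresholds, weighting volume and completeness more heavily than the age of the data.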

 

Download & Resources


A first release of DQaaS was completed in April 2017. The latest version adds new features, such as the evaluation of quality metrics at the tuple level and the possibility of filtering the considered data sources according to data quality requirements. The current version of the tool can be downloaded from the following link: https://github.com/eubr-bigsea/DQaaS
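As a rough illustration of what tuple-level evaluation and quality-based filtering can look like, the sketch below scores each record on its own and keeps only the records that meet a minimum requirement. It is plain Python, not the DQaaS implementation; the helper names `tuple_completeness` and `filter_by_quality`, the sample rows and the threshold are all hypothetical.

```python
def tuple_completeness(row, attributes):
    """Share of the listed attributes that are present in one record."""
    present = sum(1 for name in attributes if row.get(name) is not None)
    return present / len(attributes)


def filter_by_quality(rows, attributes, min_completeness):
    """Keep only the records whose completeness meets the requirement."""
    return [
        row
        for row in rows
        if tuple_completeness(row, attributes) >= min_completeness
    ]


# Hypothetical sample data; None marks a missing value.
rows = [
    {"stop": "S1", "lat": -25.43, "lon": -49.27},
    {"stop": "S2", "lat": None, "lon": -49.30},
    {"stop": None, "lat": None, "lon": -49.31},
]
attributes = ["stop", "lat", "lon"]

scores = [tuple_completeness(row, attributes) for row in rows]
kept = filter_by_quality(rows, attributes, min_completeness=0.6)
print(scores)
print([row["stop"] for row in kept])
```

Scoring at the tuple level makes it possible to discard only the poor records instead of rejecting a whole source.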

The service is invoked by preparing a configuration file, i.e., a JSON file in which the user specifies the source to analyse, the dimensions to assess and the desired granularity levels. No specific skills are required beyond adequate knowledge of DQ analysis, which users need in order to express their requirements in a way that suits their intended use of the data.
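The sketch below shows how a configuration with these three pieces of information might be assembled and serialised as JSON. Every key name and value in it is a hypothetical placeholder and not the documented DQaaS schema; the actual format should be taken from the repository linked above.

```python
import json

# Hypothetical configuration: the keys and values are illustrative
# placeholders, NOT the real DQaaS configuration schema.
config = {
    "source": "path/to/dataset.csv",
    "dimensions": ["completeness", "accuracy", "timeliness", "volume"],
    "granularity": ["global", "attribute", "tuple"],
}

REQUIRED_KEYS = ("source", "dimensions", "granularity")


def validate(cfg):
    """Check that the three pieces of information named in the text are set."""
    missing = [key for key in REQUIRED_KEYS if not cfg.get(key)]
    if missing:
        raise ValueError("missing configuration entries: " + ", ".join(missing))
    return cfg


# Serialise the validated configuration as the JSON text to save to a file.
config_text = json.dumps(validate(config), indent=2)
print(config_text)
```

Validating the file before submitting a job is a design choice of this sketch: an incomplete request fails immediately instead of after a long assessment run.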

License


The tool is implemented in Python using the Spark Python API (PySpark). It is released under the Apache 2.0 license, which allows free use of the code.

The execution costs depend on the time and resources required to complete the assessment operations. On average, a complete analysis of 1 GB of raw data takes about 1 hour.

 

Contact


Dr. Cinzia Cappiello of Politecnico di Milano: cinzia.cappiello@polimi.it

 

Want to learn more?


View related publications 

Tiago Brasileiro Araújo, Cinzia Cappiello, Nádia Puchalski Kozievitch, Demetrio Gomes Mestre, Carlos Eduardo Santos Pires, Monica Vitali: Towards Reliable Data Analyses for Smart Cities. IDEAS 2017: 304-308.

Cinzia Cappiello, Monica Vitali: Quality awareness for a Successful Big Data Exploitation. IDEAS 2018.

Danilo Ardagna, Cinzia Cappiello, Walter Samà, Monica Vitali: Context-aware Data Quality Assessment for Big Data.


EUBra-BIGSEA is funded by the European Commission under the Cooperation Programme, Horizon 2020 grant agreement No 690116. This project results from the 3rd BR-EU Coordinated Call in Information and Communication Technologies (ICT), announced by the Ministry of Science, Technology and Innovation (MCTI).
