Data scientists & domain researchers
Data analytics development framework
Ophidia (http://ophidia.cmcc.it) is a CMCC Foundation research project addressing big data challenges in eScience. It exploits advanced parallel computing techniques and a hierarchical storage organization to run intensive data analysis over multi-terabyte datasets.
Ophidia provides a big data analytics framework for parallel I/O and the analysis of multi-dimensional datasets. It leverages the datacube abstraction and comes with an extensive set of OLAP-oriented parallel operators that support, for example, datacube sub-setting, datacube aggregation, NetCDF file import and export, and datacube intercomparison. It also provides several primitives that operate on n-dimensional arrays and allow, among other things, sub-setting, data aggregation, array concatenation, algebraic expressions, predicate evaluation, statistical analysis and regression.
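To make the datacube operations above more concrete, here is a minimal pure-Python sketch of sub-setting, aggregation and an algebraic expression on a toy datacube. It does not use Ophidia or its API: the dictionary layout (a latitude/longitude key mapping to a time-series array), the helper names `subset`, `aggregate` and `apply`, and the sample values are all assumptions made for illustration.

```python
from statistics import mean

# Toy datacube (hypothetical layout): (lat, lon) -> temperatures in kelvin
# along the time dimension.
cube = {
    (40.0, 15.0): [280.0, 282.5, 285.0, 283.5],
    (40.0, 16.0): [279.0, 281.0, 284.0, 282.0],
    (41.0, 15.0): [278.5, 280.5, 283.5, 281.5],
    (41.0, 16.0): [277.0, 279.0, 282.0, 280.0],
}


def subset(cube, lat_range, lon_range, time_slice):
    """Keep cells inside the lat/lon box and slice each array along time."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    return {
        (lat, lon): series[time_slice]
        for (lat, lon), series in cube.items()
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
    }


def aggregate(cube, func):
    """Reduce each array to a single value (for example max or mean)."""
    return {key: func(series) for key, series in cube.items()}


def apply(cube, expr):
    """Evaluate an algebraic expression on every element of every array."""
    return {key: [expr(v) for v in series] for key, series in cube.items()}


# Sub-set: latitude 40 only, both longitudes, time steps 1 and 2.
box = subset(cube, (40.0, 40.0), (15.0, 16.0), slice(1, 3))
# Algebraic expression: convert kelvin to degrees Celsius.
celsius = apply(box, lambda kelvin: kelvin - 273.15)
# Aggregation along time: maximum and average per cell.
peak = aggregate(box, max)
avg = aggregate(box, mean)
```

In this sketch every operation returns a new datacube and leaves its input untouched, which is one simple way to let operations be chained; whether a real deployment behaves the same way should be checked against the Ophidia documentation.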
Ophidia is used mainly in scientific domains such as climate change. It has been extended and used in several research projects, including FP7 EUBRazilCloudConnect, FP7 CLIP-C and H2020 INDIGO-DataCloud. Ophidia is part of the EOSC-Hub project service portfolio as a big data service addressing the analytics needs of scientific communities. It could also be applied in other scientific domains, such as weather and astrophysics.
The typical exploitation scenario for Ophidia is a scientific user who wants to analyse huge amounts of data from large experiments by performing server-side, parallel data analysis. In eScience contexts, data analysis and mining on large data volumes have become key tasks in many scientific domains. Often, such data (e.g., in life sciences, climate, astrophysics, engineering) are multidimensional and require specific primitives for subsetting (e.g., slicing and dicing), data reduction (e.g., by aggregation), pivoting, statistical analysis, and so forth. Large volumes of scientific data therefore call for the same kind of On-Line Analytical Processing (OLAP) primitives typically used to carry out data analysis and mining. However, current general-purpose OLAP systems are not adequate in big-data scientific contexts for several reasons:
In several disciplines, scientific data analysis on multidimensional data is made possible by using domain-specific tools, libraries, and command line interfaces that provide the needed analytics primitives. However, these tools often fail at the tera- to petabyte scale because (i) they are not available in parallel versions, (ii) they do not rely on scalable storage models (e.g., exploiting partitioning, distribution and replication of data) to deal with large volumes of data, (iii) they do not provide a declarative language for complex dataflow submission, and/or (iv) they do not expose a server interface for remote processing (usually they run on desktop machines through command line interfaces and need, as a preliminary step, the download of the entire raw data).
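The role of partitioning in point (ii) can be illustrated with a small sketch: the data are split into fragments, each fragment is reduced independently by a worker, and the partial results are then combined. This is not Ophidia code. The function names, the fragment count and the use of a thread pool as a stand-in for parallel workers are assumptions chosen to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, n_fragments):
    """Split a list into n_fragments contiguous, near-equal fragments."""
    size, extra = divmod(len(values), n_fragments)
    fragments, start = [], 0
    for i in range(n_fragments):
        end = start + size + (1 if i < extra else 0)
        fragments.append(values[start:end])
        start = end
    return fragments


def partial_stats(fragment):
    """Per-fragment partial result: (sum, count, max)."""
    return sum(fragment), len(fragment), max(fragment)


def parallel_mean_max(values, n_fragments=4):
    """Compute mean and max by reducing fragments independently."""
    if not values:
        raise ValueError("values must not be empty")
    # Drop empty fragments, which occur when n_fragments > len(values).
    fragments = [f for f in partition(values, n_fragments) if f]
    # Threads stand in for parallel workers in this sketch.
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        partials = list(pool.map(partial_stats, fragments))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count, max(p[2] for p in partials)


data = [float(v) for v in range(1, 11)]
mean_value, max_value = parallel_mean_max(data, n_fragments=3)
```

The approach works because sum, count and max can be computed per fragment and merged afterwards; statistics that cannot be decomposed this way (an exact median, for instance) need a different combining step.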
Ophidia addresses large-scale scientific data analytics by providing a declarative, programmable, server-side and fast parallel processing approach for both HPC and cloud environments.
The latest Ophidia release is v1.2.0 (February 2018). The current release includes all the extensions developed during the EUBra-BIGSEA project, e.g. the improved version of the PyOphidia module used for the integration with the abstraction layer, as well as the extensions to support cloud-based elasticity, security and data privacy.
The source code of the Ophidia components is available on GitHub at https://github.com/OphidiaBigData. Binary packages (RPMs and DEBs) are also available at https://download.ophidia.cmcc.it/rpm/ and https://download.ophidia.cmcc.it/deb/. Complete documentation on the usage and installation of the latest release can be found at http://ophidia.cmcc.it/documentation/.
Ophidia is an open source framework released under the GPLv3 license. We support the development of new releases by applying to competitive calls for proposals in national, trans-national, and international projects.
Related publications
--> A. D’Anca, C. Palazzo, D. Elia, S. Fiore, I. Bistinas, K. Böttcher, V. Bennett, G. Aloisio, “On the Use of In-Memory Analytics Workflows to Compute eScience Indicators from Large Climate Datasets”. 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Madrid, Spain, May 14-17, 2017.
--> S. Fiore, et al., “Distributed and cloud-based multi-model analytics experiments on large volumes of climate change data in the earth system grid federation eco-system”. 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, 2016, pp. 2911-2918.
--> D. Elia, S. Fiore, A. D’Anca, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, “An in-memory based framework for scientific data analytics”. In Proceedings of the ACM International Conference on Computing Frontiers (CF ’16), May 16-19, 2016, Como, Italy, pp. 424-429.
A complete list of papers can be found at: http://ophidia.cmcc.it/overview/