EMaaS - Entity Matching-as-a-Service
Entity Matching-as-a-Service (EMaaS) targets the problem of identifying records that refer to the same real-world entity.
This task is challenging because it relies on pairwise comparison of records, so its cost grows quickly with dataset size, especially when the datasets involved are large (Big Data). Because Entity Matching (EM) is critical for data cleaning and integration, e.g., to find duplicate points of interest in different databases, there is growing demand for studies of how EM can benefit from modern parallel computing programming models, such as Apache Spark (Spark), and of the challenges involved.
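To make the pairwise cost concrete, the following minimal Python sketch compares every pair of point-of-interest records, so n records require n(n-1)/2 comparisons. The record layout, the sample data, the thresholds (50 m and 0.8 name similarity) and the function names are illustrative assumptions, not part of EMaaS.

```python
from difflib import SequenceMatcher
from itertools import combinations
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6_371_000  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))


def match_pairs(records, max_dist_m=50.0, min_name_sim=0.8):
    """Compare every pair of records.

    Returns the (id, id) pairs judged to be duplicates and the number of
    comparisons performed. Both thresholds are hypothetical defaults.
    """
    matches = []
    comparisons = 0
    for a, b in combinations(records, 2):
        comparisons += 1
        close = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_dist_m
        similar = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio() >= min_name_sim
        if close and similar:
            matches.append((a["id"], b["id"]))
    return matches, comparisons


# Made-up sample records: 1 and 2 are near-duplicates, 4 shares a name but is far away.
POIS = [
    {"id": 1, "name": "Central Bus Station", "lat": -7.2200, "lon": -35.8850},
    {"id": 2, "name": "Central Bus Staton", "lat": -7.2201, "lon": -35.8851},
    {"id": 3, "name": "City Museum", "lat": -7.2300, "lon": -35.9000},
    {"id": 4, "name": "Central Bus Station", "lat": -7.3000, "lon": -35.8000},
]

if __name__ == "__main__":
    found, n_comparisons = match_pairs(POIS)
    print(found, n_comparisons)
```

With four records the loop already performs six comparisons; at millions of records the quadratic growth is what motivates a parallel execution model.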
For this reason, the EMaaS service, to be provided through the main API of EUBra-BIGSEA, consists of a set of tools and functions that process Entity Matching tasks (e.g., geo/spatial matching) in parallel using Apache Spark.
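One common way to parallelize matching, shown here only as a hedged sketch, is to group records by a blocking key and compare records within each group independently. The example below uses a hypothetical 0.01-degree grid cell as the key and a thread pool as a stand-in for Spark executors; it is not the EMaaS implementation, and a real job would also need to handle pairs that straddle cell boundaries.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

CELL_DEG = 0.01  # hypothetical grid cell size in degrees


def block_key(rec):
    """Map a record to the grid cell that contains it (the blocking key)."""
    return (int(rec["lat"] // CELL_DEG), int(rec["lon"] // CELL_DEG))


def candidate_pairs(block):
    """Enumerate the candidate pairs inside a single block."""
    return [(a["id"], b["id"]) for a, b in combinations(block, 2)]


def blocked_candidates(records, workers=4):
    """Group records by cell, then enumerate pairs per cell in parallel."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    # Each block is independent, so blocks can be processed concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_block = list(pool.map(candidate_pairs, blocks.values()))
    return sorted(pair for pairs in per_block for pair in pairs)


# Made-up sample records: 1 and 2 share a cell, 3 and 5 share a cell, 4 is alone.
SAMPLE = [
    {"id": 1, "lat": -7.2205, "lon": -35.8853},
    {"id": 2, "lat": -7.2207, "lon": -35.8856},
    {"id": 3, "lat": -7.2305, "lon": -35.9005},
    {"id": 4, "lat": -7.3005, "lon": -35.8005},
    {"id": 5, "lat": -7.2306, "lon": -35.9004},
]

if __name__ == "__main__":
    print(blocked_candidates(SAMPLE))
```

For these five records, blocking reduces the candidate set from ten pairs to two, and each block could be handled by a different worker.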
The EMaaS service will handle requests from applications and systems that want to submit Entity Matching tasks to the cluster environment. To this end, the service will connect to the Hadoop ecosystem to perform the necessary operations, such as uploading artifacts (e.g., datasets) to HDFS or starting the execution of Spark jobs.
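The two operations mentioned above can be sketched with the standard Hadoop and Spark command-line clients. The snippet below only builds the `hdfs dfs -put` and `spark-submit` commands; the file names, HDFS directory, application script and helper function names are hypothetical, and the actual EMaaS service may interact with the cluster differently.

```python
import shlex


def hdfs_put_cmd(local_path, hdfs_dir):
    """Command that uploads a local file to an HDFS directory (-f overwrites)."""
    return ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir]


def spark_submit_cmd(app, app_args=(), master="yarn", deploy_mode="cluster"):
    """Command that launches a Spark application with the given arguments."""
    return ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, app, *app_args]


def plan_submission(dataset, hdfs_dir, app):
    """Return the commands for one task: upload the dataset, then start the job."""
    hdfs_path = f"{hdfs_dir.rstrip('/')}/{dataset.rsplit('/', 1)[-1]}"
    return [hdfs_put_cmd(dataset, hdfs_dir), spark_submit_cmd(app, [hdfs_path])]


if __name__ == "__main__":
    # Hypothetical paths; on a machine with the Hadoop and Spark clients installed,
    # each command could be executed with subprocess.run(cmd, check=True).
    for cmd in plan_submission("data/pois.csv", "/user/emaas/input", "match_pois.py"):
        print(shlex.join(cmd))
```

Keeping command construction separate from execution makes the request-handling logic easy to check without a running cluster.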
- Automatic parallelization of sequential code without the need to adopt any specific Application Programming Interface (API). Support for Java, C/C++ and Python. The same code can be executed transparently, regardless of the underlying infrastructure.
- Automatic scaling and elasticity features so the number of available resources can be adapted to the actual execution needs.
- Interoperability with different cloud providers to run computational loads on multi-cloud environments without the need to adapt the code.
- Availability of tools that ease the implementation of COMPSs applications, by means of an Integrated Development Environment (IDE); the deployment of applications in distributed infrastructures, by means of the Programming Model Enactment Service (PMES); and the monitoring of executions, by means of the Monitoring and Tracing tools.
- COMPSs and PMES are constantly maintained and updated. The software is available as install packages and source code. They are also available on the EGI Application Database (AppDB) for virtual appliances.
- Open source and licensed under Apache 2.
- The installation of the packages automatically resolves the dependencies.
- COMPSs does not include any API; the idea is that the user code is optimized by the runtime. The user is only required to provide information on the tasks composing the application. Tutorials and manuals are available on the BSC website at http://www.bsc.es/computer-sciences/grid-computing/comp-superscalar.
EMaaS targets every scientific/industrial sector interested in a service to process large-scale geo/spatial data matching.