Linked Open Data Integration Benchmark (LODIB) Specification - V1.0

Authors: Carlos R. Rivero (University of Sevilla, Spain); Andreas Schultz (Freie Universität Berlin, Germany); Chris Bizer (Freie Universität Berlin, Germany)
This version: http://www4.wiwiss.fu-berlin.de/bizer/lodib/20120220
Latest version: http://www4.wiwiss.fu-berlin.de/bizer/lodib/
Publication Date: 02/20/2012

Abstract

Linked Data sources on the Web use a wide range of different vocabularies to represent data describing the same type of entity. For some types of entities, like people or bibliographic records, common vocabularies have emerged that are used by multiple data sources. But even for representing data of these common types, different user communities use different, competing common vocabularies. Linked Data applications that want to understand as much data from the Web as possible thus need to overcome vocabulary heterogeneity and translate the original data into a single target vocabulary. To support application developers with this integration task, several Linked Data translation systems have been developed. These systems provide languages to express declarative mappings that are used to translate heterogeneous Web data into a single target vocabulary. This document specifies the Linked Open Data Integration Benchmark (LODIB). LODIB is a benchmark for comparing the expressivity as well as the runtime performance of Linked Data translation systems. The benchmark aims to reflect the real-world heterogeneities that exist on the Web of Linked Data and has thus been designed based on statistics that were derived from the LOD Cloud.

1. Motivation
2. Mapping Patterns

2.1. Rename Class
2.2. Rename Property
2.3. Rename Class Based on Property
2.4. Rename Class Based on Value
2.5. Reverse Property
2.6. Resourcesify
2.7. Deresourcesify
2.8. Transform – 1:1 Value to Value
2.9. Transform – Value to URI
2.10. Transform – URI to Value
2.11. Transform – Change Datatype
2.12. Transform – Add Language Tag
2.13. Transform – Remove Language Tag
2.14. Transform – N:1 Value to Value
2.15. Aggregate
2.16. Summary Table

3. LODIB Grounding
4. Benchmark Data Set

4.1. Namespaces
4.2. Source 1
4.3. Source 2
4.4. Source 3
4.5. Target
4.6. Source 1 - Target
4.7. Source 2 - Target
4.8. Source 3 - Target
4.9. Source to Target Data Translation Example

5. Scaling and Dataset Population

5.1. Source 1
5.2. Source 2
5.3. Source 3

6. Data Set Generator and Tools

6.1. Data set Generator
6.2. Qualification

7. Bibliography

Changes

Acknowledgements

1. Motivation

The Web of Linked Data is growing rapidly and covers a wide range of different domains, such as media, life sciences, publications, governments, or geographic data [BizerHB09], [HeathB11]. Linked Data sources use vocabularies to publish their data, which consist of more or less complex data models that are represented using RDFS or OWL [HeathB11]. Some data sources try to reuse as much from existing vocabularies as possible in order to ease the integration of data from multiple sources [BizerHB09]. Other data sources use completely proprietary vocabularies to represent their content or use a mixture of common and proprietary terms [BizerS10].

Due to these facts, there exists heterogeneity amongst vocabularies in the context of Linked Data. According to [LODTerms11], on the one hand, 104 out of the 295 data sources in the LOD Cloud only use proprietary vocabularies. On the other hand, the rest of the sources (191) use common vocabularies to represent some of their content, but also often extend and mix common vocabularies with proprietary terms to represent other parts of their content. Some examples of the use of common vocabularies are the following: regarding publications, 31.19% of the data sources use the Dublin Core vocabulary, 4.75% use the Bibliographic Ontology, and 2.03% use the Functional Requirements for Bibliographic Records; in the context of people information, 27.46% of the data sources use the Friend of a Friend vocabulary, 3.39% use the vCard ontology, and 3.39% use the Semantically-Interlinked Online Communities ontology; finally, regarding geographic data sets, 8.47% of the data sources use the Geo Positioning vocabulary, and 2.03% use the GeoNames ontology.

To solve these heterogeneity problems, mappings are used to perform data translation, i.e., exchanging data from the source data set to the target data set [RiveroHR11A], [RiveroHR11B]. Data translation, a.k.a. data exchange, is a major research topic in the database community, and it has been studied for relational, nested relational, and XML data models [ArenasL08], [FaginKMP05], [FuxmanHH06]. Current approaches to perform data translation rely on two types of mappings that are specified at different levels, namely: correspondences (modelling level) and executable mappings (implementation level). Correspondences are represented as declarative mappings that are then combined into executable mappings, which consist of queries that are executed over a source and translate the data into a target [BizerS10], [QinDL07], [RiveroHR11A].

In the context of executable mappings, there exist a number of approaches to define and also automatically generate them. Qin et al. [QinDL07] devised a semi-automatic approach to generate executable mappings that relies on data-mining; Euzenat et al. [EuzenatPS08] and Polleres et al. [PolleresSS07] presented preliminary ideas on the use of executable mappings in SPARQL to perform data translation; Parreiras et al. [ParreirasSS08] presented a Model-Driven Engineering approach that automatically transforms handcrafted mappings in MBOTL (a mapping language by means of which users can express executable mappings) into executable mappings in SPARQL or Java; Bizer and Schultz [BizerS10] proposed a SPARQL-like mapping language called R2R, which is designed to publish expressive, named executable mappings on the Web, and to flexibly combine partial executable mappings to perform data translation. Finally, Rivero et al. [RiveroHR11A] devised an approach called Mosto to automatically generate executable mappings in SPARQL based on constraints of the source and target data models, and also correspondences between these data models. In addition, translating amongst vocabularies by means of mappings is one of the main research challenges in the context of Linked Data, and it is expected that research efforts on mapping approaches will increase in the coming years [BizerHB09]. In conclusion, a benchmark to test data translation systems in this context seems highly relevant.

To the best of our knowledge, there exist two benchmarks to test data translation systems: STBenchmark and DTSBench. STBenchmark [AlexeTV08] provides eleven patterns that occur frequently when integrating nested relational models; at least some of these patterns are difficult to extrapolate to our context due to a number of inherent differences between nested relational models and the graph-based RDF data model that is used in the context of Linked Data [MotikHS09]. DTSBench [RiveroHR11B] allows testing data translation systems in the context of Linked Data using synthetic data translation tasks only, without taking real-world data from Linked Data sources into account.

This document describes a benchmark to test data translation systems in the context of Linked Data. Our benchmark provides a catalogue of fifteen data translation patterns, each of which is a common data translation problem in the context of Linked Data. To show that these patterns are common in practice, we have analyzed 84 random examples of data translation in the Linked Open Data Cloud. After this analysis, we have studied the distribution of the patterns in these examples, and have designed LODIB, the Linked Open Data Integration Benchmark, to reflect this real-world heterogeneity that exists on the Web of Data.

The benchmark provides a data generator that produces three different synthetic data sets, which reflect the pattern distribution. These source data sets need to be translated into a single target vocabulary by the system under test. This generator allows us to scale source data and it also automatically generates the expected target data, i.e., after performing data translation over the source data. The data sets reflect the same e-commerce scenario that we already used for the BSBM benchmark [BizerS09].

LODIB is designed to measure the following: 1) Expressivity: the number of mapping patterns that can be expressed in a specific data translation system; 2) Time performance: the time needed to perform the data translation, i.e., loading the source file, executing the mappings, and serializing the result into a target file. In this context, LODIB provides a validation tool that examines whether the source data is represented correctly in the target data set: the benchmark generates the expected target data for a particular scenario, and these are compared with the target data that a particular system produces when performing the data translation.
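The two measurements can be illustrated with a minimal Python sketch. It is not the LODIB validation tool: the function name `run_and_validate`, the toy system `toy_translate`, and the class name `tgt:Product` are assumptions made for illustration. Triples are modelled as plain tuples, and only the mapping-execution step is timed (loading and serialization are left out).

```python
import time


def run_and_validate(translate, source_triples, expected_triples):
    """Time one translation run and compare its output with the expected triples."""
    start = time.perf_counter()
    produced = set(translate(source_triples))
    elapsed = time.perf_counter() - start
    expected = set(expected_triples)
    return {
        "seconds": elapsed,
        "missing": expected - produced,     # expected but not produced
        "unexpected": produced - expected,  # produced but not expected
    }


def toy_translate(triples):
    """Hypothetical system under test: renames a single class."""
    return [
        (s, p, "tgt:Product" if o == "src1:Product" else o)
        for (s, p, o) in triples
    ]


source = [("src1-data:Canon-Ixus-20010", "a", "src1:Product")]
expected = [("src1-data:Canon-Ixus-20010", "a", "tgt:Product")]

report = run_and_validate(toy_translate, source, expected)
print(report["missing"], report["unexpected"])
```

A system passes this check for a scenario when both difference sets are empty.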

This specification is organised as follows: Section 2 presents the mapping patterns of our benchmark; in Section 3, we describe the 84 data translation examples from the LOD Cloud that we have analyzed, and the counting of the occurrences of mapping patterns in the examples; Section 4 deals with the design of our benchmark; Section 5 describes how we scale three data translation scenarios that we have devised, and how we automatically populate them; and, finally, Section 6 presents the tools that we have devised to implement our benchmark.

2. Mapping Patterns

A mapping pattern represents a common data translation problem that should be supported by any data translation system in the context of Linked Data. Our benchmark provides a catalogue of fifteen mapping patterns that we have repeatedly discovered as we analyzed the heterogeneity between different data sources in the Linked Open Data Cloud (see Section 3). In the following sections, we describe each mapping pattern in detail.

2.1 Rename Class

Description:

Every source instance of a class C is reclassified into the same instance of the renamed class C' in the target. An example of this pattern is the renaming of class location.citytown in Freebase into class City in DBpedia.

Example:

Source triples

@prefix fb: <http://rdf.freebase.com/ns/> .
 
fb:en.toledo_oregon
	a	fb:location.citytown .

Target triples

@prefix dbp: <http://dbpedia.org/ontology/> .
 
fb:en.toledo_oregon
	a	dbp:City .
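As a rough illustration of this pattern, the following Python sketch models triples as `(subject, predicate, object)` tuples of prefixed names. The helper `rename_class` is a hypothetical name and is not part of any LODIB tool.

```python
RDF_TYPE = "a"  # Turtle shorthand for rdf:type


def rename_class(triples, source_class, target_class):
    """Reclassify every instance of source_class as an instance of target_class."""
    return [
        (s, RDF_TYPE, target_class)
        for (s, p, o) in triples
        if p == RDF_TYPE and o == source_class
    ]


source = [("fb:en.toledo_oregon", "a", "fb:location.citytown")]
target = rename_class(source, "fb:location.citytown", "dbp:City")
print(target)
```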

2.2 Rename Property

Description:

We wish to change the URI of a property from the source into a new URI in the target. In the following example, we rename property elevation in DBpedia into ele in LinkedGeoData.

Example:

Source triples

@prefix dbp: <http://dbpedia.org/ontology/> .
@prefix dbp-r: <http://dbpedia.org/resource/> .

dbp-r:John_F._Kennedy_International_Airport
	dbp:elevation	3.9624 .

Target triples

 
@prefix lgd: <http://linkedgeodata.org/triplify/> .
@prefix lgdo: <http://linkedgeodata.org/ontology/> .

lgd:node369043778
	lgdo:ele	3.9624 .

2.3 Rename Class Based on Property

Description:

In this case, we reclassify an instance of a source class into a target class if and only if that source instance is the domain or range of a certain source property. In the following example, we rename class Person in DBpedia into people.deceased_person in Freebase if and only if the instances of Person are related by an instance of deathDate property.

Example:

Source triples

@prefix dbp: <http://dbpedia.org/ontology/> .
@prefix dbp-r: <http://dbpedia.org/resource/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

dbp-r:William_Shakespeare
	a		dbp:Person ;
	dbp:deathDate	"1616-04-23"^^xsd:date .

Target triples

@prefix dbp-r: <http://dbpedia.org/resource/> .
@prefix fb: <http://rdf.freebase.com/ns/> .
 
dbp-r:William_Shakespeare
	a	fb:people.deceased_person .
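The condition "is the domain or range of a certain source property" can be sketched in Python as follows. Triples are modelled as tuples of prefixed names; the function name and the second instance (`dbp-r:Some_Living_Person`) are hypothetical and only serve to show that instances without the property are not reclassified.

```python
def rename_class_based_on_property(triples, source_class, prop, target_class):
    """Emit '?x a target_class' for instances of source_class that use prop."""
    instances = {s for (s, p, o) in triples if p == "a" and o == source_class}
    # An instance qualifies when it appears as subject (domain) or
    # object (range) of at least one triple whose predicate is prop.
    related = set()
    for (s, p, o) in triples:
        if p == prop:
            related.update((s, o))
    return sorted((x, "a", target_class) for x in instances & related)


source = [
    ("dbp-r:William_Shakespeare", "a", "dbp:Person"),
    ("dbp-r:William_Shakespeare", "dbp:deathDate", '"1616-04-23"^^xsd:date'),
    # Hypothetical instance without a death date; it must not be reclassified.
    ("dbp-r:Some_Living_Person", "a", "dbp:Person"),
]
target = rename_class_based_on_property(
    source, "dbp:Person", "dbp:deathDate", "fb:people.deceased_person"
)
print(target)
```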

2.4 Rename Class Based on Value

Description:

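In this pattern, we wish to reclassify an instance from a source class into a target class if and only if that instance is the domain of a certain source property and that property has a specific value. In the following example, we rename class Person in GovWILD into government.politician in Freebase if and only if the value of property profession is "politician".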

Example:

Source triples

@prefix gw: <http://govwild.org/ontology/> .
@prefix gw-p: <http://govwild.org/id/person/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

gw-p:Kurt_Joachim_Lauk_euParliament_1840_P
	a		gw:Person ;
	gw:profession	"politician"^^xsd:string .

Target triples

@prefix gw-p: <http://govwild.org/id/person/> .
@prefix fb: <http://rdf.freebase.com/ns/> .
 
gw-p:Kurt_Joachim_Lauk_euParliament_1840_P
	a	fb:government.politician .

2.5 Reverse Property

Description:

The goal of this pattern is to reverse the domain and range of a property, i.e., a source property has a class A as domain and another class B as range, thus we reverse the source property into a target property that has class B as domain and class A as range. In the following example, we reverse property airports_operated in Freebase into operator in DBpedia.

Example:

Source triples

@prefix fb: <http://rdf.freebase.com/ns/> .
 
fb:en.aena
	fb:aviation.airport_operator.airports_operated	fb:en.madrid_barajas_international_airport .

Target triples

@prefix fb: <http://rdf.freebase.com/ns/> .
@prefix dbp: <http://dbpedia.org/ontology/> .
 
fb:en.madrid_barajas_international_airport
	dbp:operator	fb:en.aena .
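A minimal Python sketch of this pattern, with triples modelled as tuples of prefixed names, simply swaps subject and object while renaming the property. The function name `reverse_property` is an assumption for illustration.

```python
def reverse_property(triples, source_prop, target_prop):
    """Swap subject and object of every source_prop triple and rename the property."""
    return [(o, target_prop, s) for (s, p, o) in triples if p == source_prop]


source = [
    (
        "fb:en.aena",
        "fb:aviation.airport_operator.airports_operated",
        "fb:en.madrid_barajas_international_airport",
    )
]
target = reverse_property(
    source, "fb:aviation.airport_operator.airports_operated", "dbp:operator"
)
print(target)
```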

2.6 Resourcesify

Description:

In this pattern, the target needs more instances than the source to represent the same information, so new instances have to be created in the target. Since we want these resources to be addressable, this boils down to generating new URIs for the target. In the following example, we resourcesify property runtime in DBpedia into duration in BBC by creating a new instance of Version.

Example:

Source triples

@prefix dbp-r: <http://dbpedia.org/resource/> .
@prefix dbp: <http://dbpedia.org/ontology/> .

dbp-r:Unusual_Suspects
	dbp:runtime		2580.0 .

Target triples

@prefix dbp-r: <http://dbpedia.org/resource/> .
@prefix po: <http://purl.org/ontology/po/> .
@prefix bbc: <http://www.bbc.co.uk/programmes/> .

dbp-r:Unusual_Suspects
	po:version		bbc:Unusual_Suspects_Version .

bbc:Unusual_Suspects_Version
	a		po:Version ;
	po:duration		2580.0 .
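The key difficulty of this pattern is minting a new URI that is unique and stable across runs. The following Python sketch derives it deterministically from the source subject; the naming scheme in `mint_version_uri` is an assumption chosen to reproduce the example above, not a rule prescribed by the benchmark.

```python
def mint_version_uri(subject):
    """Hypothetical scheme: reuse the local name of the subject plus a suffix."""
    local_name = subject.split(":", 1)[1]
    return f"bbc:{local_name}_Version"


def resourcesify(triples, source_prop, link_prop, new_class, value_prop, mint_uri):
    """Replace '?x source_prop ?y' by a new resource ?z that carries the value."""
    out = []
    for (s, p, o) in triples:
        if p != source_prop:
            continue
        z = mint_uri(s)  # same subject always yields the same new URI
        out.append((s, link_prop, z))
        out.append((z, "a", new_class))
        out.append((z, value_prop, o))
    return out


source = [("dbp-r:Unusual_Suspects", "dbp:runtime", 2580.0)]
target = resourcesify(
    source, "dbp:runtime", "po:version", "po:Version", "po:duration", mint_version_uri
)
for triple in target:
    print(triple)
```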

2.7 Deresourcesify

Description:

In this pattern, two different instances related by an object property in the source are translated into a single instance in the target. In the following example, we deresourcesify property city in DBpedia into city_served in LinkedGeoData.

Example:

Source triples

@prefix dbp-r: <http://dbpedia.org/resource/> .
@prefix dbp: <http://dbpedia.org/ontology/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

dbp-r:John_F._Kennedy_International_Airport
	dbp:city		dbp-r:New_York_City .

dbp-r:New_York_City
	rdfs:label		"New York City" .

Target triples

@prefix dbp-r: <http://dbpedia.org/resource/> .
@prefix lgdp: <http://linkedgeodata.org/property/> .

dbp-r:John_F._Kennedy_International_Airport
	lgdp:city_served		"New York City" .
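This pattern amounts to a join over the intermediate resource. The Python sketch below models triples as tuples; the function name is hypothetical, and the target property `lgdp:city_served` follows the description of this pattern.

```python
def deresourcesify(triples, link_prop, value_prop, target_prop):
    """Collapse '?x link_prop ?z . ?z value_prop ?y' into '?x target_prop ?y'."""
    values = {}
    for (s, p, o) in triples:
        if p == value_prop:
            values.setdefault(s, []).append(o)
    return [
        (s, target_prop, v)
        for (s, p, o) in triples
        if p == link_prop
        for v in values.get(o, [])
    ]


source = [
    ("dbp-r:John_F._Kennedy_International_Airport", "dbp:city", "dbp-r:New_York_City"),
    ("dbp-r:New_York_City", "rdfs:label", '"New York City"'),
]
target = deresourcesify(source, "dbp:city", "rdfs:label", "lgdp:city_served")
print(target)
```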

2.8 Transform – 1:1 Value to Value

Description:

In this pattern, a source value is transformed into a target value by means of a transformation function. In the following example, we transform property runtime in DBpedia into runtime in LinkedMDB; the example values are consistent with a conversion from seconds to minutes (8520 / 60 = 142).

Example:

Source triples

@prefix dbp-r: <http://dbpedia.org/resource/> .
@prefix dbp: <http://dbpedia.org/ontology/> .

dbp-r:The_Shining_(film)
	dbp:runtime		8520.00 .

Target triples

@prefix dbp-r: <http://dbpedia.org/resource/> .
@prefix movie: <http://data.linkedmdb.org/resource/movie/> .

dbp-r:The_Shining_(film)
	movie:runtime		142 .
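A possible reading of this example is a unit conversion from seconds to minutes, since 8520 / 60 = 142. The Python sketch below encodes that assumption in `seconds_to_minutes`; both function names are hypothetical, and any other 1:1 function could be passed instead.

```python
def seconds_to_minutes(seconds):
    """Assumed transformation function: whole minutes from a duration in seconds."""
    return int(seconds // 60)


def transform_value(triples, source_prop, target_prop, fn):
    """Apply fn to the value of every source_prop triple."""
    return [(s, target_prop, fn(o)) for (s, p, o) in triples if p == source_prop]


source = [("dbp-r:The_Shining_(film)", "dbp:runtime", 8520.00)]
target = transform_value(source, "dbp:runtime", "movie:runtime", seconds_to_minutes)
print(target)
```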

2.9 Transform – Value to URI

Description:

The goal of this pattern is to transform a value into a URI. In the following example, we transform the value of property omim in DBpedia into a URI in Bio2RDF.

Example:

Source triples

@prefix dbp-r: <http://dbpedia.org/resource/> .
@prefix dbp: <http://dbpedia.org/ontology/> .

dbp-r:Von_Willebrand_disease
	dbp:omim		193400 .

Target triples

<http://bio2rdf.org/omim:193400>
	a		<http://bio2rdf.org/omim_resource:MendelianDisorders> .

2.10 Transform – URI to Value

Description:

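The goal of this pattern is to transform a URI into a value. In the following example, we transform a URI in Bio2RDF into the value of property omim in DBpedia.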

Example:

Source triples

<http://bio2rdf.org/omim:193400>
	a		<http://bio2rdf.org/omim_resource:MendelianDisorders> .

Target triples

@prefix dbp: <http://dbpedia.org/ontology/> .

<http://bio2rdf.org/omim:193400>
	dbp:omim		193400 .
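The Value to URI and URI to Value patterns can be read as inverse functions. The Python sketch below assumes that the Bio2RDF URI is obtained by appending the OMIM number to a fixed base, as the two examples suggest; the function names are hypothetical.

```python
OMIM_BASE = "http://bio2rdf.org/omim:"  # base taken from the examples above


def value_to_uri(value, base=OMIM_BASE):
    """Build a URI (in Turtle angle-bracket form) from a literal value."""
    return f"<{base}{value}>"


def uri_to_value(uri, base=OMIM_BASE):
    """Recover the integer value encoded in a URI built by value_to_uri."""
    bare = uri.strip("<>")
    if not bare.startswith(base):
        raise ValueError(f"URI does not start with {base}: {uri}")
    return int(bare[len(base):])


uri = value_to_uri(193400)
value = uri_to_value(uri)
print(uri, value)
```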

2.11 Transform – Change Datatype

Description:

This pattern focuses on changing the datatype of a particular data property from the source into the target. In the following example, we change the datatype of source property people.person.date_of_birth in Freebase into target property birthDate in DBpedia.

Example:

Source triples

@prefix fb: <http://rdf.freebase.com/ns/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

fb:en.clint_eastwood
	fb:people.person.date_of_birth		"May 31, 1930"^^xsd:dateTime .

Target triples

@prefix fb: <http://rdf.freebase.com/ns/> .
@prefix dbp: <http://dbpedia.org/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

fb:en.clint_eastwood
	dbp:birthDate		"May 31, 1930"^^xsd:date .

2.12 Transform – Add Language Tag

Description:

This pattern focuses on adding a language tag to a particular data property from the source into the target. In the following example, we add a language tag to source property genericName in Drug Bank into property label in DBpedia.

Example:

Source triples

@prefix db: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/> .
@prefix dg: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/> .

dg:DB00001
	db:genericName		"Lepirudin" .

Target triples

@prefix dg: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

dg:DB00001
	rdfs:label		"Lepirudin"@en .

2.13 Transform – Remove Language Tag

Description:

In this pattern, we remove a language tag from a particular data property when translating from the source into the target. In the following example, we remove the language tag from source property altLabel in DataGov Statistics to obtain property altLabel in Ordnance Survey.

Example:

Source triples

@prefix county: <http://statistics.data.gov.uk/doc/county/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

county:41
	skos:altLabel		"Staffordshire"@en .

Target triples

@prefix county: <http://statistics.data.gov.uk/doc/county/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

county:41
	skos:altLabel		"Staffordshire" .
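The three literal-level patterns (Change Datatype, Add Language Tag, and Remove Language Tag) only touch the annotation of a literal, not its lexical form. The Python sketch below makes this explicit with a small `Literal` type; the type and the function names are assumptions for illustration and do not correspond to any particular RDF library.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    value: str
    datatype: Optional[str] = None
    lang: Optional[str] = None


def change_datatype(lit, new_datatype):
    """Swap the datatype annotation; the lexical form is copied verbatim."""
    return lit._replace(datatype=new_datatype, lang=None)


def add_language_tag(lit, tag):
    """Attach a language tag to the literal."""
    return lit._replace(lang=tag, datatype=None)


def remove_language_tag(lit):
    """Drop the language tag, leaving a plain literal."""
    return lit._replace(lang=None)


print(change_datatype(Literal("May 31, 1930", datatype="xsd:dateTime"), "xsd:date"))
print(add_language_tag(Literal("Lepirudin"), "en"))
print(remove_language_tag(Literal("Staffordshire", lang="en")))
```

Note that `change_datatype` does not check whether the lexical form is valid for the new datatype; a real system may also need to convert the value.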

2.14 Transform – N:1 Value to Value

Description:

This pattern focuses on transforming the value of a number of source properties into a single value in a target property using a particular transformation function. In the following example, we concatenate source properties givenName and surname in DBpedia into target property type.object.name in Freebase.

Example:

Source triples

@prefix dbp-r: <http://dbpedia.org/resource/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

dbp-r:William_Shakespeare
	foaf:givenName		"William" ; 
	foaf:surname		"Shakespeare" .

Target triples

@prefix dbp-r: <http://dbpedia.org/resource/> .
@prefix fb: <http://rdf.freebase.com/ns/> .

dbp-r:William_Shakespeare
	fb:type.object.name		"William Shakespeare" .
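The concatenation in this example can be sketched in Python as an N-ary function applied to the values of several properties of the same subject. The helper `merge_values` is a hypothetical name; subjects that lack one of the source properties are skipped in this sketch.

```python
def merge_values(triples, source_props, target_prop, fn):
    """Combine the values of source_props (in order) into one target value per subject."""
    by_subject = {}
    for (s, p, o) in triples:
        if p in source_props:
            by_subject.setdefault(s, {})[p] = o
    out = []
    for s, values in by_subject.items():
        # Only translate subjects for which every source property is present.
        if all(p in values for p in source_props):
            out.append((s, target_prop, fn(*(values[p] for p in source_props))))
    return out


source = [
    ("dbp-r:William_Shakespeare", "foaf:givenName", "William"),
    ("dbp-r:William_Shakespeare", "foaf:surname", "Shakespeare"),
]
target = merge_values(
    source,
    ["foaf:givenName", "foaf:surname"],
    "fb:type.object.name",
    lambda given, family: f"{given} {family}",
)
print(target)
```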

2.15 Aggregate

Description:

In this pattern, we need to count the number of source instances to translate into the target. In the following example, we aggregate source property metropolitan_transit.transit_system.transit_lines in Freebase into numberOfLines in DBpedia.

Example:

Source triples

@prefix fb: <http://rdf.freebase.com/ns/> .

fb:en.berlin_u-bahn
	fb:metropolitan_transit.transit_system.transit_lines		fb:en.u1 ; 
	fb:metropolitan_transit.transit_system.transit_lines		fb:en.u3 .

Target triples

@prefix fb: <http://rdf.freebase.com/ns/> .
@prefix dbp: <http://dbpedia.org/ontology/> .

fb:en.berlin_u-bahn
	dbp:numberOfLines		2 .
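Counting can be sketched in Python by grouping the objects of the source property per subject. The function name `aggregate_count` is an assumption; distinct objects are counted, so a repeated triple does not inflate the result.

```python
def aggregate_count(triples, source_prop, target_prop):
    """Count the distinct objects of source_prop for each subject."""
    objects = {}
    for (s, p, o) in triples:
        if p == source_prop:
            objects.setdefault(s, set()).add(o)
    return [(s, target_prop, len(objs)) for s, objs in objects.items()]


LINES = "fb:metropolitan_transit.transit_system.transit_lines"
source = [
    ("fb:en.berlin_u-bahn", LINES, "fb:en.u1"),
    ("fb:en.berlin_u-bahn", LINES, "fb:en.u3"),
]
target = aggregate_count(source, LINES, "dbp:numberOfLines")
print(target)
```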

2.16 Summary Table

Finally, we present a summary of these mapping patterns in the following table. The first column of this table stands for the code of each pattern; the second column gives its name; the third and fourth columns establish the triples to be retrieved in the source and the triples to be constructed in the target using a SPARQL-like notation. Note that properties are represented as P and Q, classes as C, constant values as v, language tags as TAG, and data types as TYPE.

Code	Name	Source triples	Target triples
RC	Rename Class	?x a C	?x a C'
RP	Rename Property	?x P ?y	?x P' ?y
RCP	Rename Class based on Property	?x a C FILTER EXISTS { {?x P ?y} UNION {?y P ?x} }	?x a C'
RCV	Rename Class based on Value	?x a C . ?x P v	?x a C'
RvP	Reverse Property	?x P ?y	?y P' ?x
Rsc	Resourcesify	?x P ?y	?x Q ?z . ?z P' ?y [?z must be a unique, consistent URI in the target data set]
DRsc	Deresourcesify	?x Q ?z . ?z P ?y	?x P' ?y
1:1	Transform – 1:1 Value to Value	?x P ?y	?x P' f(?y)
VtU	Transform – Value to URI	?x P ?y	?x P' toURI(?y)
UtV	Transform – URI to Value	?x P ?y	?x P' toLiteral(?y)
CD	Transform – Change Datatype	?x P ?y^^TYPE	?x P' ?y^^TYPE'
ALT	Transform – Add Language Tag	?x P ?y	?x P' ?y@TAG
RLT	Transform – Remove Language Tag	?x P ?y@TAG	?x P' ?y
N:1	Transform – N:1 Value to Value	?x P1 ?a . ?x P2 ?b . ... . ?x PK ?k	?x P' f(?a, ?b, ..., ?k)
Agg	Transform – Aggregate	?x P ?y	?x Q count(?y)

3. LODIB Grounding

To show that the mapping patterns LODIB comprises are common in practice, we analyzed several real-world examples in the LOD cloud to find evidence of these patterns. First, we selected different Linked Data sources by exploring the LOD data set catalog maintained on CKAN. The criterion we followed was to choose sources that comprise a large number of sameAs links with other Linked Data sources, i.e., more than 25,000. The selected Linked Data sources are the following: ACM (RKB Explorer), DBLP (RKB Explorer), Dailymed, Drug Bank, DataGov Statistics, Ordnance Survey, DBpedia, GeoNames, Linked GeoData, LinkedMDB, New York Times, Music Brainz, Sider, GovWILD, ProductDB, and OpenLibrary. Note that, taken together, these Linked Data sources cover all domains of the LOD cloud except the domain of user-generated content.

After selecting these sources, we randomly selected 42 examples using sameAs links, each of which comprises two instances. For each of these examples, we analyzed both directions: one instance is the source and the other instance is the target, and vice versa. Therefore, the total number of examples we analyzed was 84. Then, we manually counted the occurrences of each pattern in these examples (see our statistic files).

In the next step, we computed the averages of our mapping patterns grouped by the pair of source and target data set. To compute them, in some cases, we analyzed the translation of one single instance since the data set of the Linked Data source comprises only a couple of classes, such as Drug Bank or Ordnance Survey. In other cases, we analyzed more than one instance since the data set comprises a large number of classes, such as DBpedia or Freebase.

The following table shows the statistics of the mapping patterns that we found in the LOD Cloud. The first two columns stand for the source and target Linked Data data sets; the following columns contain the averages of each mapping pattern according to the source and the target, i.e., we count the occurrences of mapping patterns in a number of examples and compute the average. Note that, for certain data sets, we analyzed several examples of the same type; therefore, the values in these columns are real numbers rather than integers. Finally, the last column contains the total number of instances that we analyzed for each pair of Linked Data data sets.

Source	Target	RC	RP	RCP	RCV	RvP	Rsc	DRsc	1:1	VtU	UtV	CD	ALT	RLT	N:1	Agg	Total Instances
ACM (RKB Explorer)	DBLP (RKB Explorer)	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1
Dailymed	Drug Bank	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1
DataGov Statistics	Ordnance Survey	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00	0.00	0.00	0.00	1
DBLP (RKB Explorer)	ACM (RKB Explorer)	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1
DBpedia	Freebase	2.14	8.57	0.64	0.00	2.21	2.29	0.00	0.57	0.00	0.00	1.14	0.00	0.00	0.14	0.07	14
DBpedia	GeoNames	1.00	3.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	3
DBpedia	Linked GeoData	1.00	3.50	0.00	0.00	0.00	0.00	1.50	0.13	0.00	0.00	2.25	0.00	0.00	0.00	0.00	8
DBpedia	LinkedMDB	1.00	5.50	0.33	0.00	0.33	0.00	0.00	0.33	0.00	0.00	0.00	0.00	0.00	0.00	0.00	6
DBpedia	Drug Bank	1.00	1.00	0.00	0.00	0.00	0.00	0.00	1.00	1.00	0.00	0.00	0.00	3.00	0.00	0.00	1
DBpedia	New York Times	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00	0.00	1
DBpedia	Music Brainz	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1
Drug Bank	DBpedia	1.00	1.00	0.00	0.00	0.00	0.00	0.00	1.00	0.00	1.00	0.00	3.00	0.00	1.00	0.00	1
Drug Bank	Freebase	1.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1
Drug Bank	Sider	1.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1
Drug Bank	Dailymed	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1
Freebase	DBpedia	2.14	8.57	0.29	0.07	2.21	0.00	2.29	0.79	0.00	0.00	1.14	0.00	0.00	0.00	0.14	14
Freebase	GovWILD	1.00	4.50	0.00	0.00	0.00	0.00	2.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	2
Freebase	Drug Bank	1.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1
GeoNames	DBpedia	1.00	3.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	3
GovWILD	Freebase	1.00	4.50	0.00	0.50	0.00	2.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.50	0.00	2
Linked GeoData	DBpedia	1.00	3.50	0.88	0.75	0.00	1.50	0.00	0.13	0.00	0.00	2.25	0.00	0.00	0.00	0.00	8
LinkedMDB	DBpedia	1.00	5.50	0.00	0.00	0.33	0.00	0.00	0.33	0.00	0.00	0.00	0.00	0.00	0.00	0.00	6
Music Brainz	DBpedia	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1
New York Times	DBpedia	0.00	0.00	0.00	0.00	0.00	0.00	0.00	3.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1
OpenLibrary	ProductDB	0.00	0.00	0.00	0.00	0.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1
Ordnance Survey	DataGov Statistics	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00	0.00	0.00	1
ProductDB	OpenLibrary	0.00	0.00	0.00	0.00	0.00	0.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1
Sider	Drug Bank	1.00	1.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1

On the one hand, Rename Class and Rename Property mapping patterns appear in the vast majority of the analyzed examples, since these patterns are very common in practice. On the other hand, there are some patterns that are not so common, e.g., Value to URI and URI to Value patterns appear only once in all analyzed examples (between DBpedia and Drug Bank).

Finally, the following table presents the average occurrences of the LODIB mapping patterns over all analyzed examples.

RC	RP	RCP	RCV	RvP	Rsc	DRsc	1:1	VtU	UtV	CD	ALT	RLT	N:1	Agg
0.87	2.01	0.08	0.05	0.18	0.24	0.24	0.30	0.04	0.04	0.24	0.14	0.14	0.09	0.01

RC: Rename Class
RP: Rename Property
RCP: Rename Class based on Property
RCV: Rename Class based on Value
RvP: Reverse Property
Rsc: Resourcesify
DRsc: Deresourcesify
1:1: Transform – 1:1 Value to Value
VtU: Transform – Value to URI
UtV: Transform – URI to Value
CD: Transform – Change Datatype
ALT: Transform – Add Language Tag
RLT: Transform – Remove Language Tag
N:1: Transform – N:1 Value to Value
Agg: Transform – Aggregate

4. Benchmark Data Set

Based on the previously described statistics, we have designed the LODIB Benchmark. The benchmark consists of three different source data sets that need to be translated by the system under test into a single target vocabulary. The data sets reflect the same e-commerce scenario that we already used for the BSBM Benchmark. They describe products, reviews, people, and some more lightweight classes, such as product price, using different source vocabularies. To translate the representation of an instance in the source data sets into the target vocabulary, data translation systems need to apply several of the presented mapping patterns.

In this section, we describe these data sets, which take the previously computed averages of Table 4 into account: we multiplied each average by a constant (11) and divided the result by another constant (3, the total number of data translation tasks, i.e., one from each source data set to the target data set). As a result, each of the three data translation tasks comprises a number of mapping patterns, and we present the numbers in the following table, in which the total number of mapping patterns for each task is 18:

Scenario	RC	RP	RCP	RCV	RvP	Rsc	DRsc	1:1	VtU	UtV	CD	ALT	RLT	N:1	Agg	Total number of patterns
Source 1 - Target	3	7	1	0	1	1	1	1	0	1	1	0	1	0	0	18
Source 2 - Target	3	7	0	1	0	1	1	1	1	0	1	1	0	1	0	18
Source 3 - Target	3	7	0	0	1	1	1	1	0	0	1	1	1	0	1	18
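The relation between the averages and the per-task counts can be checked with a short Python sketch. It multiplies each average by 11 and compares the result with the number of times the pattern is assigned across the three tasks. The rounding rule is not stated in this specification, so the sketch only prints both values side by side instead of trying to reproduce the counts; the variable names are illustrative.

```python
PATTERNS = ["RC", "RP", "RCP", "RCV", "RvP", "Rsc", "DRsc", "1:1",
            "VtU", "UtV", "CD", "ALT", "RLT", "N:1", "Agg"]

# Average occurrences over all analyzed examples (Section 3).
AVERAGES = [0.87, 2.01, 0.08, 0.05, 0.18, 0.24, 0.24, 0.30,
            0.04, 0.04, 0.24, 0.14, 0.14, 0.09, 0.01]

# Pattern counts per data translation task (table above).
SCENARIOS = {
    "Source 1 - Target": [3, 7, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0],
    "Source 2 - Target": [3, 7, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0],
    "Source 3 - Target": [3, 7, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1],
}

SCALE = 11  # constant used to scale the averages

scaled = [avg * SCALE for avg in AVERAGES]
assigned = [sum(column) for column in zip(*SCENARIOS.values())]
totals = {name: sum(row) for name, row in SCENARIOS.items()}

for code, s, a in zip(PATTERNS, scaled, assigned):
    print(f"{code:5s} scaled={s:5.2f} assigned={a}")
print(totals)
```

In the table, every pattern is assigned to at least one task, including patterns whose scaled average is well below 1 (for example Agg), so each of the fifteen patterns is exercised by the benchmark.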

In the following subsections, we describe each data set in detail and, finally, we present the executable mappings in these tasks. Specifically, Section 4.1 presents the namespaces that we use to describe the sources; Sections 4.2, 4.3, 4.4, and 4.5 describe the three sources and target data sets that we have devised, respectively; furthermore, Sections 4.6, 4.7 and 4.8 present the mappings between each source and the target data sets; finally, Section 4.9 shows an example in this context.

4.1. Namespaces

Prefix	Namespace
rdfs:	http://www.w3.org/2000/01/rdf-schema#
foaf:	http://xmlns.com/foaf/0.1/
dc:	http://purl.org/dc/elements/1.1/
xsd:	http://www.w3.org/2001/XMLSchema#
src1:	http://www4.wiwiss.fu-berlin.de/bizer/lodib/vocabulary/source1/
src2:	http://www4.wiwiss.fu-berlin.de/bizer/lodib/vocabulary/source2/
src3:	http://www4.wiwiss.fu-berlin.de/bizer/lodib/vocabulary/source3/
tgt:	http://www4.wiwiss.fu-berlin.de/bizer/lodib/vocabulary/target/

4.2. Source 1

Class src1:Product

src1:name (xsd:string)
src1:description (xsd:string)
src1:vendor (xsd:string)
src1:hasPrice (src1:ProductPrice)
src1:hasReview (src1:Review)

Example RDF Instance:

src1-data:Canon-Ixus-20010
	a			src1:Product ;
	src1:name		"Canon Ixus 20010"^^xsd:string ;
	src1:description	"Canon Ixus camera"^^xsd:string ;
	src1:vendor		"Camerama"^^xsd:string ;
	src1:hasPrice		src1-data:Price-210 ;
	src1:hasReview		src1-data:Review-CI-001 .

Class src1:ProductPrice

src1:productPrice (xsd:double)

Example RDF Instance:

src1-data:Price-210
	a			src1:ProductPrice ;
	src1:productPrice	"210.99"^^xsd:double .

Class src1:Review

src1:revText (xsd:string, @en)
dc:date (xsd:string)
dc:title (xsd:string)
src1:author (src1:Person)

Example RDF Instance:

src1-data:Review-CI-001
	a		src1:Review ;
	src1:revText	"This camera is awesome!"@en ;
	dc:date		"01/10/2011"^^xsd:string ;
	dc:title	"Canon Ixus 20010"^^xsd:string ;
	src1:author	src1-data:Smith-W .

Class src1:Person

foaf:name (xsd:string)
foaf:mbox (xsd:string)
src1:personHomepage (foaf:Document)
src1:birthDate (xsd:date)
src1:created (xsd:date)

Example RDF Instance:

src1-data:Smith-W
	a			src1:Person ;
	foaf:name		"Walter Smith"^^xsd:string ;
	foaf:mbox		"wsmith@example.org"^^xsd:string ;
	src1:personHomepage	<http://www.example.org/WSmith> ;
	src1:birthDate		"06/07/1979"^^xsd:date ;
	src1:created		"10/12/2008"^^xsd:date .

4.3. Source 2

Class src2:Product

src2:name (xsd:string)
src2:description (xsd:string)
src2:vendor (xsd:string)
src2:price (xsd:double)
src2:productHomepage (xsd:string)
src2:outdated (xsd:string)
src2:hasReview (src2:Review)

Example RDF Instance:

src2-data:HTC-Wildfire-S
	a			src2:Product ;
	src2:name		"HTC Wildfire S"^^xsd:string ;
	src2:description	"Phone from HTC"^^xsd:string ;
	src2:vendor		"Phone for You"^^xsd:string ;
	src2:price		"199.99"^^xsd:double ;
	src2:productHomepage	"http://htc.com/"^^xsd:string ;
	src2:outdated		"No"^^xsd:string ;
	src2:hasReview		src2-data:Review-HTC-W-S .

Class src2:Review

dc:date
dc:title (xsd:string)
src2:hasText (src2:ReviewText)

Example RDF Instance:

src2-data:Review-HTC-W-S
	a		src2:Review ;
	dc:date		"13/09/2011" ;
	dc:title	"HTC Wildfire S"^^xsd:string ;
	src2:hasText	src2-data:Review-HTC-W-S-Text .

Class src2:ReviewText

src2:revText (xsd:string)

Example RDF Instance:

src2-data:Review-HTC-W-S-Text
	a		src2:ReviewText ;
	src2:revText	"I like this phone a lot"^^xsd:string .

Class src2:Person

foaf:firstName (xsd:string)
foaf:surname (xsd:string)
src2:mbox (xsd:string)
src2:mini-cv (xsd:string)
src2:birthDate (xsd:date)
src2:review (src2:Review)

Example RDF Instance:

src2-data:Doe-J
	a			src2:Person ;
	foaf:firstName		"John"^^xsd:string ;
	foaf:surname		"Doe"^^xsd:string ;
	src2:mbox		"john-doe@example.org"^^xsd:string ;
	src2:mini-cv		"I like pizza."^^xsd:string ;
	src2:birthDate		"06/07/1979"^^xsd:date ;
	src2:review		src2-data:Review-HTC-W-S .

4.4. Source 3

Class src3:Product

src3:name (xsd:string)
src3:description (xsd:string)
src3:vendor (xsd:string)
src3:comment (xsd:string)
src3:hasPrice (src3:ProductPrice)
src3:hasReview (src3:Review)

Example RDF Instance:

src3-data:VPCSE1S9E
	a			src3:Product ;
	src3:name		"Sony Vaio VPCSE1S9E"^^xsd:string ;
	src3:description	"Sony Vaio S series VPCSE1S9E"^^xsd:string ;
	src3:vendor		"Cheap Laptops"^^xsd:string ;
	src3:comment		"Out of stock"^^xsd:string ;
	src3:hasPrice		src3-data:Price-1145 ;
	src3:hasReview		src3-data:Review-VPCSE1S9E .

Class src3:ProductPrice

src3:productPrice (xsd:double)

Example RDF Instance:

src3-data:Price-1145
	a			src3:ProductPrice ;
	src3:productPrice	"1145.99"^^xsd:double .

Class src3:Review

src3:revText (xsd:string, @en)
dc:date (xsd:date)
dc:title (xsd:string)
src3:author (src3:Person)

Example RDF Instance:

src3-data:Review-VPCSE1S9E
	a		src3:Review ;
	src3:revText	"This laptop has a good quality and a cheap price"@en ;
	dc:date		"03/03/2011"^^xsd:string ;
	dc:title	"Sony Vaio VPCSE1S9E"^^xsd:string ;
	src3:author	src3-data:Johnson-P .

Class src3:Person

foaf:name (xsd:string)
src3:mbox (xsd:string)
src3:mini-cv (xsd:string)
src3:birthDate (xsd:date)
src3:memberSince (xsd:date)

Example RDF Instance:

src3-data:Johnson-P
	a			src3:Person ;
	foaf:name		"Paul Johnson"^^xsd:string ;
	foaf:mbox		"pj@example.org"^^xsd:string ;
	src3:mini-cv		"I live in the US."^^xsd:string ;
	src3:birthDate		"24/03/1965"^^xsd:date ;
	src3:memberSince	"04/11/2005"^^xsd:date .

4.5. Target

Class tgt:Product

rdfs:label (xsd:string)
rdfs:comment (xsd:string)
tgt:productVendor (xsd:string)
tgt:productPrice (xsd:double)
tgt:productHomepage (foaf:Document)
tgt:totalReviews (xsd:integer)
tgt:productReview (tgt:Review)

Example RDF Instance:

tgt-data:HTC-Wildfire-S
	a			tgt:Product ;
	rdfs:label		"HTC Wildfire S"^^xsd:string ;
	rdfs:comment		"Phone from HTC"^^xsd:string ;
	tgt:productVendor	"Phone for You"^^xsd:string ;
	tgt:productPrice	"199.99"^^xsd:double ;
	tgt:productHomepage	<http://htc.com/> ;
	tgt:totalReviews	"25"^^xsd:integer ;
	tgt:productReview	tgt-data:Review-HTC-W-S .

Class tgt:OutdatedProduct

subclass of tgt:Product

Example RDF Instance:

tgt-data:Out-D-Product
	a	tgt:OutdatedProduct .

Class tgt:Review

tgt:text (xsd:string)
dc:date (xsd:date)
tgt:reviewSubject (xsd:string)

Example RDF Instance:

tgt-data:Review-HTC-W-S
	a			tgt:Review ;
	tgt:text		"I like this phone a lot"^^xsd:string ;
	dc:date			"03/03/2011"^^xsd:date ;
	tgt:reviewSubject	"HTC Wildfire S"^^xsd:string .

Class tgt:Person

tgt:name (xsd:string)
foaf:mbox (xsd:string)
tgt:bio (xsd:string, @en)
tgt:personHomepage (xsd:string)
tgt:initialDate (xsd:date)
tgt:author (tgt:Review)
tgt:birth (tgt:Birth)

Example RDF Instance:

tgt-data:Doe-J
	a			tgt:Person ;
	tgt:name		"John Doe"^^xsd:string ;
	foaf:mbox		"john-doe@example.org"^^xsd:string ;
	tgt:bio			"I like pizza."@en ;
	tgt:personHomepage	"http://johndoe.org"^^xsd:string ;
	tgt:initialDate		"17/05/2008"^^xsd:date ;
	tgt:author		tgt-data:Review-HTC-W-S ;
	tgt:birth		tgt-data:Doe-J-Birth .

Class tgt:Birth

tgt:birthDate (xsd:date)

Example RDF Instance:

tgt-data:Doe-J-Birth
	a		tgt:Birth ;
	tgt:birthDate	"05/06/1967"^^xsd:date .

Class tgt:Reviewer

subclass of tgt:Person

Example RDF Instance:

tgt-data:Doe-J
	a	tgt:Reviewer .

4.6. Source 1 - Target

The following table shows the executable mappings between Source 1 and Target:

Code	Source triples	Target triples
RC	?x a src1:Product	?x a tgt:Product
RC	?x a src1:Review	?x a tgt:Review
RC	?x a src1:Person	?x a tgt:Person
RP	?x src1:name ?y	?x rdfs:label ?y
RP	?x src1:description ?y	?x rdfs:comment ?y
RP	?x src1:vendor ?y	?x tgt:productVendor ?y
RP	?x src1:hasReview ?y	?x tgt:productReview ?y
RP	?x dc:title ?y	?x tgt:reviewSubject ?y
RP	?x src1:created ?y	?x tgt:initialDate ?y
RP	?x foaf:name ?y	?x tgt:name ?y
RCP	?x a src1:Person . ?y src1:author ?x	?x a tgt:Reviewer
RvP	?x src1:author ?y	?y tgt:author ?x
Rsc	?x src1:birthDate ?y	?x tgt:birth ?z . ?z tgt:birthDate ?y
DRsc	?x src1:hasPrice ?z . ?z src1:productPrice ?y	?x tgt:productPrice ?y
1:1	?x foaf:mbox ?y	?x foaf:mbox f(?y) where f:=sha1sum
UtV	?x src1:personHomepage ?y	?x tgt:personHomepage toLiteral(?y)
CD	?x dc:date ?y^^xsd:string	?x dc:date ?y^^xsd:date
RLT	?x src1:revText ?y@en	?x tgt:text ?y
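
The renaming rows of this table can be read as simple lookup tables. The following Python sketch is only an illustration and not part of the benchmark tools: it models triples as plain string tuples, and `translate` is a hypothetical helper name. It applies the RC, RP, RvP and RCP rows to the Source 1 example instances from Section 4.2.

```python
# Illustrative sketch only; triples are (subject, predicate, object) string tuples.
# The two lookup tables copy the RC and RP rows of the Source 1 table.
RENAME_CLASS = {
    "src1:Product": "tgt:Product",
    "src1:Review": "tgt:Review",
    "src1:Person": "tgt:Person",
}
RENAME_PROPERTY = {
    "src1:name": "rdfs:label",
    "src1:description": "rdfs:comment",
    "src1:vendor": "tgt:productVendor",
    "src1:hasReview": "tgt:productReview",
    "dc:title": "tgt:reviewSubject",
    "src1:created": "tgt:initialDate",
    "foaf:name": "tgt:name",
}


def translate(triples):
    """Apply the RC, RP, RvP and RCP rows to a list of source triples."""
    persons = {s for s, p, o in triples if p == "a" and o == "src1:Person"}
    target = []
    for s, p, o in triples:
        if p == "a" and o in RENAME_CLASS:
            target.append((s, "a", RENAME_CLASS[o]))       # RC
        elif p in RENAME_PROPERTY:
            target.append((s, RENAME_PROPERTY[p], o))      # RP
        elif p == "src1:author":
            target.append((o, "tgt:author", s))            # RvP: swap subject and object
            reviewer = (o, "a", "tgt:Reviewer")
            if o in persons and reviewer not in target:    # RCP: an authoring person is a reviewer
                target.append(reviewer)
    return target


source = [
    ("src1-data:Smith-W", "a", "src1:Person"),
    ("src1-data:Smith-W", "foaf:name", "Walter Smith"),
    ("src1-data:Review-CI-001", "a", "src1:Review"),
    ("src1-data:Review-CI-001", "dc:title", "Canon Ixus 20010"),
    ("src1-data:Review-CI-001", "src1:author", "src1-data:Smith-W"),
]
for triple in translate(source):
    print(triple)
```

The remaining rows (Rsc, DRsc, 1:1, UtV, CD, RLT) change values or graph structure and therefore need more than a lookup.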

4.7. Source 2 - Target

The following table shows the executable mappings between Source 2 and Target:

Code	Source triples	Target triples
RC	?x a src2:Product	?x a tgt:Product
RC	?x a src2:Review	?x a tgt:Review
RC	?x a src2:Person	?x a tgt:Person
RP	?x src2:name ?y	?x rdfs:label ?y
RP	?x src2:description ?y	?x rdfs:comment ?y
RP	?x src2:vendor ?y	?x tgt:productVendor ?y
RP	?x src2:hasReview ?y	?x tgt:productReview ?y
RP	?x dc:title ?y	?x tgt:reviewSubject ?y
RP	?x src2:mbox ?y	?x foaf:mbox ?y
RP	?x src2:review ?y	?x tgt:author ?y
RCV	?x a src2:Product . ?x src2:outdated "Yes"	?x a tgt:OutdatedProduct
Rsc	?x src2:birthDate ?y	?x tgt:birth ?z . ?z tgt:birthDate ?y
DRsc	?x src2:hasText ?z . ?z src2:revText ?y	?x tgt:text ?y
1:1	?x src2:price ?y	?x tgt:productPrice usDollarsToEuros(?y) where usDollarsToEuros(x):=x*0.767754319
VtU	?x src2:productHomepage ?y	?x tgt:productHomepage toURI(?y)
CD	?x dc:date ?y	?x dc:date ?y^^xsd:date
ALT	?x src2:mini-cv ?y	?x tgt:bio ?y@en
N:1	?x foaf:firstName ?y . ?x foaf:surname ?z	?x tgt:name concat(?y, ' ', ?z)
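
The value-level rows of this table (1:1, N:1 and RCV) amount to small functions. The following is a minimal Python sketch, assuming hypothetical helper names and plain Python values instead of RDF literals; the conversion factor is the one given in the 1:1 row.

```python
# Illustrative sketch only; the function names are hypothetical.
USD_TO_EUR = 0.767754319  # fixed factor taken from the 1:1 row of the table


def us_dollars_to_euros(amount):
    """1:1: convert a src2:price value into a tgt:productPrice value."""
    return amount * USD_TO_EUR


def full_name(first_name, surname):
    """N:1: concatenate foaf:firstName and foaf:surname into one name."""
    return f"{first_name} {surname}"


def product_classes(outdated):
    """RC plus RCV: every product is a tgt:Product; 'Yes' adds tgt:OutdatedProduct."""
    classes = ["tgt:Product"]
    if outdated == "Yes":
        classes.append("tgt:OutdatedProduct")
    return classes


print(full_name("John", "Doe"))
print(product_classes("No"))
print(us_dollars_to_euros(199.99))
```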

4.8. Source 3 - Target

The following table shows the executable mappings between Source 3 and Target:

Code	Source triples	Target triples
RC	?x a src3:Product	?x a tgt:Product
RC	?x a src3:Review	?x a tgt:Review
RC	?x a src3:Person	?x a tgt:Person
RP	?x src3:name ?y	?x rdfs:label ?y
RP	?x src3:description ?y	?x rdfs:comment ?y
RP	?x src3:vendor ?y	?x tgt:productVendor ?y
RP	?x src3:hasReview ?y	?x tgt:productReview ?y
RP	?x dc:title ?y	?x tgt:reviewSubject ?y
RP	?x src3:memberSince ?y	?x tgt:initialDate ?y
RP	?x foaf:name ?y	?x tgt:name ?y
RvP	?x src3:author ?y	?y tgt:author ?x
Rsc	?x src3:birthDate ?y	?x tgt:birth ?z . ?z tgt:birthDate ?y
DRsc	?x src3:hasPrice ?z . ?z src3:productPrice ?y	?x tgt:productPrice ?y
1:1	?x foaf:mbox ?y	?x foaf:mbox f(?y) where f:=sha1sum
CD	?x dc:date ?y^^xsd:string	?x dc:date ?y^^xsd:date
ALT	?x src3:mini-cv ?y	?x tgt:bio ?y@en
RLT	?x src3:revText ?y@en	?x tgt:text ?y
Agg	?x a src3:Product . ?x src3:hasReview ?y	?x a tgt:Product . ?x tgt:totalReviews sum(?y) [Group by ?x]
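
The Agg row groups the review links by product. The table writes the aggregate as sum(?y); the Python sketch below assumes that this is meant as the number of src3:hasReview triples per product, which is our reading of tgt:totalReviews and not something the table states explicitly. The function name and the second review identifier are hypothetical.

```python
from collections import Counter


def total_reviews(triples):
    """Agg: count the src3:hasReview triples of each src3:Product (grouped by product)."""
    products = {s for s, p, o in triples if p == "a" and o == "src3:Product"}
    counts = Counter(
        s for s, p, o in triples if p == "src3:hasReview" and s in products
    )
    # Products without any review do not match the source pattern and yield no triple.
    return [(product, "tgt:totalReviews", counts[product]) for product in sorted(counts)]


source = [
    ("src3-data:VPCSE1S9E", "a", "src3:Product"),
    ("src3-data:VPCSE1S9E", "src3:hasReview", "src3-data:Review-VPCSE1S9E"),
    ("src3-data:VPCSE1S9E", "src3:hasReview", "src3-data:Review-2"),  # hypothetical second review
]
print(total_reviews(source))
```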

4.9. Source to Target Data Translation Example

This section shows concrete examples of the source and target data produced by our data generator and explains how the mapping patterns are reflected in the data sets. First, we show how example data is represented under one of the source schemas:

# Source data set
# The numbers on the left enumerate the triples

      src1:product1
01:     a src1:Product ;
02:     src1:description "lkdkmw xk iqkrebe..."^^xsd:string ; # The literal is shortened for presentation
03:     src1:hasPrice src1:productPrice1 ;
04:     src1:hasReview src1:review5,
05:                    src1:review6 ;
06:     src1:name "uxyz"^^xsd:string ;
07:     src1:vendor "jasd dobzxkv ootvn wtxjlt vqyen"^^xsd:string .

      src1:productPrice1
08:     a src1:ProductPrice ;
09:     src1:productPrice 311.167906992248 .

      src1:person2
10:     a src1:Person ;
11:     src1:personHomepage src1:homepage5 ;
12:     src1:birthDate "1976-12-01"^^xsd:date ;
13:     foaf:mbox "ajjoxmy"^^xsd:string,
14:               "qnuf"^^xsd:string ;
15:     foaf:name "vviret"^^xsd:string .

      src1:review1
16:     a src1:Review ;
17:     src1:author src1:person2 .

      src1:review4
18:     a src1:Review ;
19:     src1:author src1:person2 .

Next, we show how this excerpt of the source data set is represented under the target schema after translation:

# Target data set
# The numbers on the left enumerate the triples

      src1:product1
01:     a tgt:Product ;
02:     rdfs:comment "lkdkmw xk iqkrebe..."^^xsd:string ;
03:     rdfs:label "uxyz"^^xsd:string ;
04:     tgt:productPrice 311.167906992248 ;
05:     tgt:productReview src1:review5,
06:                       src1:review6 ;
07:     tgt:productVendor "jasd dobzxkv ootvn wtxjlt vqyen"^^xsd:string .

      src1:person2
08:     a tgt:Person,
09:       tgt:Reviewer ;
10:     tgt:author src1:review1,
11:                src1:review4 ;
12:     tgt:name "vviret"^^xsd:string ;
13:     tgt:personHomepage "http://www4.wiwiss.fu-berlin.de/bizer/lodib/vocabulary/source1/homepage5"^^xsd:string ;
14:     foaf:mbox "0da0dba8f828d6bd039445e6e6be16bfa03ab723"^^xsd:string ,
15:               "63e5f09f5aae3de577492f68abebae79479fdace"^^xsd:string ;
16:     tgt:birth _:A35e13de6X3aX135aad3a9adX3aXX2dX7ffb .

      _:A35e13de6X3aX135aad3a9adX3aXX2dX7ffb
17:     tgt:birthDate "1976-12-01"^^xsd:date .

As can be seen, much of the translated data results from applying simple renaming mapping patterns for classes and properties. In addition to these simple mappings, the following mapping patterns have to be executed to translate this source data set into its target representation:

Rename Class based on Property (RCP): Triples 10 and 17 of the source data set are mapped to triple 9 in the target. In words: "A person who reviewed something (source) becomes a reviewer (target)".
Reverse Property (RvP): Triples 17 and 19 of the source data set are reversed in the target data set, which results in triples 10 and 11.
Deresourcify (DRsc): Triples 3 and 9 of the source data set are combined into triple 4 of the target data set.
1:1 Value Transformation (1:1): The foaf:mbox values from the source (triples 13, 14) are translated to their SHA1 checksum values in the target (triples 14, 15).
Resourcify (Rsc): The birth date property triple in the source (triple 12) is translated to the two triples in the target involving a newly created resource (triples 16 and 17). Instead of a blank node, generating a URI would also be a valid option.
URI to Value (UtV): A person's homepage is represented as a resource in the source (triple 11), whereas it is a string literal in the target (triple 13).
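
The structural patterns in this list can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of any benchmarked system: triples are plain tuples, the helper names and the blank node labels are hypothetical, and the mailbox string is assumed to be UTF-8 encoded before hashing.

```python
import hashlib
import itertools

_blank_ids = itertools.count(1)  # hypothetical source of fresh blank node labels


def resourcify_birth(person, birth_date):
    """Rsc: one source triple becomes two target triples linked by a new node."""
    node = f"_:birth{next(_blank_ids)}"
    return [(person, "tgt:birth", node), (node, "tgt:birthDate", birth_date)]


def deresourcify_price(triples):
    """DRsc: collapse ?x src1:hasPrice ?z . ?z src1:productPrice ?y into one triple."""
    price_of = {s: o for s, p, o in triples if p == "src1:productPrice"}
    return [
        (s, "tgt:productPrice", price_of[o])
        for s, p, o in triples
        if p == "src1:hasPrice" and o in price_of
    ]


def sha1_mbox(mbox):
    """1:1: replace a mailbox string by its hexadecimal SHA-1 digest."""
    return hashlib.sha1(mbox.encode("utf-8")).hexdigest()


source = [
    ("src1:product1", "src1:hasPrice", "src1:productPrice1"),
    ("src1:productPrice1", "src1:productPrice", 311.167906992248),
]
print(deresourcify_price(source))
print(resourcify_birth("src1:person2", "1976-12-01"))
print(sha1_mbox("ajjoxmy"))
```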

5. Scaling and Dataset Population

We define a number of rules to populate and scale the three source data sets specified in the previous section. The data sets are scaled by the number of product instances, and the rules are implemented by the data generator tool.

Note that our rules entail the automatic generation of properties and values. To generate properties, we use a uniform distribution to select how many properties are related to a particular instance. In addition, for object properties that relate two URIs, we also use a uniform distribution to randomly select the range URI to be related to a given domain URI. To generate constants, we rely on a number of generators that apply different techniques to create these values, such as random alphabetic strings, random doubles, or random dates. Finally, our tool provides 44 statistical distributions, including Uniform, Normal, Exponential, Zipf, Pareto and empirical distributions.
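
To make the generation rules more concrete, the following Python sketch shows one possible shape of such value generators. It is not the code of the data generator tool: the function names are hypothetical, and the word length of 2 to 8 letters is an assumption, since the rules only fix the number of words.

```python
import random
import string
from datetime import date, timedelta


def cardinality(rng, minimum, maximum):
    """Pick how many values a property gets, uniformly from [minimum, maximum]."""
    return rng.randint(minimum, maximum)


def random_words(rng, min_words, max_words):
    """Random alphabetic string with a uniformly chosen number of words."""
    count = rng.randint(min_words, max_words)
    return " ".join(
        "".join(rng.choice(string.ascii_lowercase) for _ in range(rng.randint(2, 8)))
        for _ in range(count)
    )


def random_double(rng, low, high):
    """Random double drawn uniformly between low and high."""
    return rng.uniform(low, high)


def random_date(rng, start, end):
    """Random date drawn uniformly between start and end (both inclusive)."""
    return start + timedelta(days=rng.randint(0, (end - start).days))


rng = random.Random(42)  # fixed seed so that a run can be repeated
print(random_words(rng, 1, 3))
print(random_double(rng, 0.05, 500.0))
print(random_date(rng, date(2005, 1, 1), date(2011, 12, 31)))
print(cardinality(rng, 0, 5))
```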

5.1. Source 1

Class src1:Product

Number of instances: #NumberOfProducts
Property src1:name

Cardinality: Exact 1
Value: String of 1-3 words

Property src1:description

Cardinality: Exact 1
Value: String of 50-150 words

Property src1:vendor

Cardinality: Min 1 using Uniform distribution
Value: String of 1-5 words

Property src1:hasPrice

Cardinality: Exact 1

Class src1:ProductPrice

Number of instances: 1
Property src1:productPrice

Cardinality: Exact 1
Value: Double between 0.05 and 500.0

Property src1:hasReview

Cardinality: Max 50

Class src1:Review

Number of instances: random number between 0-50 using Uniform distribution
Property src1:revText

Cardinality: Exact 1
Value: String of 25-500 words, @en tag

Property dc:date

Cardinality: Exact 1
Value: Date between 2005-01-01 and 2011-12-31

Property dc:title

Cardinality: Max 1 using Uniform distribution
Value: String of 1-15 words

Property src1:author

Cardinality: Exact 1

Class src1:Person

Number of instances: 1
Property foaf:name

Cardinality: Exact 1
Value: String of 1-5 words

Property foaf:mbox

Cardinality: Max 5 using Uniform distribution
Value: String of 1 word

Property src1:personHomepage

Cardinality: Max 3 using Uniform distribution
Value: String of 1 word

Property src1:birthDate

Cardinality: Max 1 using Uniform distribution
Value: Date between 1945-01-01 and 1993-12-31

Property src1:created

Cardinality: Exact 1
Value: Date between 2005-01-01 and 2011-12-31

5.2. Source 2

Class src2:Product

Number of instances: #NumberOfProducts
Property src2:name

Cardinality: Exact 1
Value: String of 1-3 words

Property src2:description

Cardinality: Exact 1
Value: String of 50-150 words

Property src2:vendor

Cardinality: Exact 1
Value: String of 1-3 words

Property src2:price

Cardinality: Exact 1
Value: Double between 0.05 and 500.0

Property src2:productHomepage

Cardinality: 0 or 1
Value: String of 1 word

Property src2:outdated

Cardinality: Exact 1
Value: "Yes" or "No" using Uniform distribution

Property src2:hasReview

Cardinality: Avg 2

Class src2:Review

Number of instances: 2 * #NumberOfProducts
Property dc:date

Cardinality: Exact 1
Value: Date between 2005-01-01 and 2011-12-31

Property dc:title

Cardinality: Max 1 using Uniform distribution
Value: String of 1-15 words

Property src2:hasText

Cardinality: Exact 1

Class src2:ReviewText

Number of instances: same as nr. of reviews
Property src2:revText

Cardinality: Exact 1
Value: String of 20-100 words

Class src2:Person

Nr. of instances: 3/5 * #NumberOfProducts
Property foaf:firstName (range xsd:string)

Cardinality: Exact 1
Value: String of one word

Property foaf:surname (range xsd:string)

Cardinality: Exact 1
Value: String of one word

Property src2:mbox (range xsd:string)

Cardinality: 1 - 2 using Uniform Distribution
Value: String of one word

Property src2:mini-cv (range xsd:string)

Cardinality: 0 or 1 using Uniform Distribution
Value: String of 10 - 30 words

Property src2:birthDate (range xsd:date)

Cardinality: 0 or 1 using Uniform Distribution
Value: date value between 1945-01-01 and 1993-12-31 using Uniform Distribution

Property src2:review (range src2:Review)

Cardinality: Avg. 10/3, uniformly distributed over src2:Person
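
The instance counts of Source 2 follow from #NumberOfProducts, and the average of 10/3 reviews per person can be derived from them: 2 * N reviews divided by 3/5 * N persons gives 10/3. The Python sketch below checks this arithmetic with exact fractions. It assumes that every review is linked to exactly one person through src2:review; the function name is hypothetical.

```python
from fractions import Fraction


def source2_instance_counts(number_of_products):
    """Instance counts of the Source 2 classes for a given number of products."""
    return {
        "src2:Product": number_of_products,
        "src2:Review": 2 * number_of_products,
        "src2:ReviewText": 2 * number_of_products,  # same as the number of reviews
        "src2:Person": Fraction(3, 5) * number_of_products,
    }


counts = source2_instance_counts(1000)
# Average number of src2:review links per person, assuming one person per review.
reviews_per_person = Fraction(counts["src2:Review"]) / counts["src2:Person"]
print(counts["src2:Person"], reviews_per_person)
```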

5.3. Source 3

Class src3:Product

Nr. of instances: #NrOfProducts
Property src3:name (xsd:string), src3:description (xsd:string)

Cardinality: Exact 1
Value: String of 1-3 words

Property src3:vendor (xsd:string)

Cardinality: Exact 1
Value: String of 1-3 words

Property src3:comment (xsd:string)

Cardinality: 0 or 1 uniformly distributed
Value: String of 1-3 words

Property src3:hasPrice (src3:ProductPrice)

Cardinality: Exactly 1

Property src3:hasReview (src3:Review)

Cardinality: Avg. 2, uniformly distributed over src3:Product

Class src3:ProductPrice

Nr. of instances: #NrOfProducts
Property src3:productPrice (xsd:double)

Cardinality: Exact 1
Value: Double between 0.05 and 500.0

Class src3:Review

Nr. of instances: #NrOfProducts * 2
Property src3:revText (xsd:string, @en)

Cardinality: Exact 1
Value: String of 25-500 words, @en tag

Property dc:date (xsd:date)

Cardinality: Exact 1
Value: Date between 2005-01-01 and 2011-12-31

Property dc:title (xsd:string)

Cardinality: Exactly 1
Value: String of 1-15 words

Property src3:author (src3:Person)

Cardinality: Exactly 1

Class src3:Person

Nr. of instances: #NrOfProducts * 3/5
Property foaf:name (xsd:string)

Cardinality: Exactly 1
Value: String of 2 - 3 words

Property src3:mbox (xsd:string)

Cardinality: Exactly 1
Value: String of 1 word

Property src3:mini-cv (xsd:string)

Cardinality: 0 or 1 using Uniform Distribution
Value: String of 10 - 30 words

Property src3:birthDate (xsd:date)

Cardinality: Exactly 1
Value: date value between 1945-01-01 and 1993-12-31 using Uniform Distribution

Property src3:memberSince (xsd:date)

Cardinality: Exactly 1
Value: date value between 2007-03-14 and 2011-01-12 using Uniform Distribution

6. Data Set Generator and Test Tools

The data set generator and all other tools for the LODIB benchmark can be downloaded here.

6.1. Benchmark Data Set Generator

A small demo use case with only around 100 triples for each source data set can be generated with the following command:

bin/generateDemoUseCase demo

This will create the source data sets in demo/sources and the target data sets in demo/targets.

Larger use cases can be generated with the generateUseCases script. Its first argument is the output directory for the generated files. It is followed by one or more integer parameters, each of which gives the overall number of triples in the source data sets in millions. The following example generates two use cases, one with about 50 million source triples and one with about 100 million source triples:

bin/generateUseCases usecases 50 100

This will generate all three source data sets and the expected target data set for each of the use cases. For this example it will generate the directories usecases/50M and usecases/100M, each including the proper source and target data sets. This can take several hours - mainly due to the generation of the target data sets.

6.2. Qualification

Besides looking at the expressivity of the mapping language used, we also validated the results produced by each system. To validate a system, the following steps have to be executed:

Generate a small test use case by typing:

bin/generateDemoUseCase demo       # this will generate a small demo use case in the directory demo

Map the source data sets found in demo/sources/X, where X is either 1, 2 or 3, with the data translation system that should be validated. Let's assume the output data sets are called targetX.nt (X stands for 1, 2 or 3).
For each pair of given target data set and mapped target data set, run the following command, here exemplified with the target data set generated from Source 1:
```
bin/validator.sh demo/targets/1/data.nt target1.nt validationResult.txt
```
If the validator has found any problems, you can use the verbose flag to output more detailed descriptions of the problems.
```
bin/validator.sh -v demo/targets/1/data.nt target1.nt validationResult.txt
```

Note that the validator does not compare the results in a datatype-aware way. Different lexical representations of the same value, e.g. of a double (2.1 vs. 2.1e0), can therefore be reported as problems although the value is the same in both data sets. These problems can usually be ruled out by looking at the verbose output of the validator.
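
When inspecting such reported differences by hand, a small datatype-aware comparison can help. The Python sketch below is not part of the validator. It assumes that both literals are given in N-Triples syntax with a full datatype URI and only treats xsd:double and xsd:float specially; the function name is hypothetical.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"
FLOATING_POINT = {XSD + "double", XSD + "float"}
# Matches a typed literal such as "2.1"^^<http://www.w3.org/2001/XMLSchema#double>
_TYPED_LITERAL = re.compile(r'^"(?P<lex>.*)"\^\^<(?P<dt>[^>]+)>$')


def same_literal(a, b):
    """Return True if two N-Triples literals are equal, comparing doubles by value."""
    if a == b:
        return True
    ma, mb = _TYPED_LITERAL.match(a), _TYPED_LITERAL.match(b)
    if not (ma and mb) or ma["dt"] != mb["dt"] or ma["dt"] not in FLOATING_POINT:
        return False
    try:
        return float(ma["lex"]) == float(mb["lex"])
    except ValueError:
        return False  # not a parsable number, so fall back to "different"


expected = '"2.1"^^<http://www.w3.org/2001/XMLSchema#double>'
produced = '"2.1e0"^^<http://www.w3.org/2001/XMLSchema#double>'
print(same_literal(expected, produced))
```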

7. Bibliography

[AlexeTV08] B. Alexe, W. C. Tan, and Y. Velegrakis. STBenchmark: towards a benchmark for mapping systems. PVLDB, 1(1):230-244, 2008.
[ArenasL08] M. Arenas, and L. Libkin. XML data exchange: Consistency and query answering. J. ACM, 55 (2). 2008.
[BizerHB09] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - the story so far. Int. J. Semantic Web Inf. Syst., 2009
[BizerS09] C. Bizer, and A. Schultz. The Berlin SPARQL Benchmark. Int. J. Semantic Web Inf. Syst., 5 (2): 1-24. 2009.
[BizerS10] C. Bizer, and A. Schultz. The R2R Framework: Publishing and Discovering Mappings on the Web. In COLD, 2010.
[EuzenatPS08] J. Euzenat, A. Polleres, and F. Scharffe. Processing Ontology Alignments with SPARQL. In CISIS, pages 913-917, 2008.
[FaginKMP05] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: semantics and query answering. Theor. Comput. Sci., 336 (1): 89-124. 2005.
[FuxmanHH06] A. Fuxman, M. A. Hernández, H. Ho, R. J. Miller, P. Papotti, and L. Popa. Nested Mappings: Schema Mapping Reloaded. In VLDB, pages 67-78. 2006.
[HeathB11] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool, 2011
[MotikHS09] B. Motik, I. Horrocks, and U. Sattler. Bridging the gap between OWL and relational databases. J. Web Sem., 7(2):74-89, 2009.
[LODTerms11] C. Bizer, A. Jentzsch, and R. Cyganiak. State of the LOD Cloud. 2011.
[ParreirasSS08] F. S. Parreiras, S. Staab, S. Schenk, and A. Winter. Model driven specification of ontology translations. In ER, pages 484–497, 2008.
[PolleresSS07] A. Polleres, F. Scharffe, and R. Schindlauer. SPARQL++ for Mapping Between RDF Vocabularies. In ODBASE, pages 878-896, 2007.
[QinDL07] H. Qin, D. Dou, and P. LePendu. Discovering executable semantic mappings between ontologies. In ODBASE, pages 832–849, 2007.
[RiveroHR11A] C. R. Rivero, I. Hernández, D. Ruiz, and R. Corchuelo. Generating SPARQL executable mappings to integrate ontologies. In ER, pages 118-131, 2011.
[RiveroHR11B] C. R. Rivero, I. Hernández, D. Ruiz, and R. Corchuelo. On benchmarking data translation systems for semantic-web ontologies. In CIKM, pages 1613-1618, 2011.

Appendix A: Changes

2012-04-04: LODIB version 1.0
2012-02-20: Initial version of this document

Appendix B: Acknowledgements

This work was supported by the EU FP7 grants LOD2 - Creating Knowledge out of Interlinked Data (Grant No. 257943), the European Commission (FEDER), the Spanish and the Andalusian R&D&I programmes (grants P07-TIC-2602, P08-TIC-4100, TIN2008-04718-E, TIN2010-21744, TIN2010-09809-E, TIN2010-10811-E, and TIN2010-09988-E).
