DataIO

Abstract

This document describes the Logical Source and the Logical Target, which specify how data sources are accessed and how targets are written to.

A Logical Source is a formal model and common representation for describing access to data sources. A Logical Target is a formal model and a common representation for specifying how a Knowledge Graph should be exported to a given target.

Logical Source and Logical Target reuse existing data access descriptions and are therefore not limited to a specific set of targets or data sources. The current document describes the Logical Source and Logical Target concepts through definitions and examples.

The version of this document is v0.1.

Prefix	Namespace
`rml`	http://semweb.mmlab.be/ns/rml#
`formats`	https://www.w3.org/ns/formats/
`comp`	http://semweb.mmlab.be/ns/rml-compression#
`void`	http://rdfs.org/ns/void#
`sd`	http://www.w3.org/ns/sparql-service-description#
`dcat`	http://www.w3.org/ns/dcat#
`td`	https://www.w3.org/2019/wot/td#
`hctl`	https://www.w3.org/2019/wot/hypermedia#
`htv`	http://www.w3.org/2011/http#

The Logical Source vocabulary namespace is http://semweb.mmlab.be/ns/rml-source# and its prefix is rml.

The Logical Source vocabulary consists of two classes:

rml:LogicalSource describes how data of a source can be referenced.
rml:Source describes how a source can be accessed; it is part of an rml:LogicalSource.

A Logical Source is any data source providing data to be mapped to RDF triples.

A Logical Source (rml:LogicalSource) MUST contain the following properties:

The source (rml:source) specifies how a source is accessed through an rml:Source.
The reference formulation (rml:referenceFormulation) defines how the elements of a data source are referenced. The reference formulation must be specified in the case of databases, CSV, TSV, XML, and JSON data sources. The defaults are rr:SQL2008 for databases, ql:CSV for CSV and TSV data sources, ql:XPath for XML data sources, and ql:JSONPath for JSON and JSONL data sources.
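
The defaults listed above can be read as a simple lookup. The following is a minimal sketch, not part of the vocabulary: the source-kind labels ("database", "csv", and so on) and the function name are assumptions made for illustration only.

```python
# Default reference formulations as listed in the text above.
# The source-kind keys are hypothetical labels chosen for this sketch.
DEFAULT_REFERENCE_FORMULATION = {
    "database": "rr:SQL2008",
    "csv": "ql:CSV",
    "tsv": "ql:CSV",
    "xml": "ql:XPath",
    "json": "ql:JSONPath",
    "jsonl": "ql:JSONPath",
}


def resolve_reference_formulation(source_kind, declared=None):
    """Return the declared formulation, or the default for the source kind."""
    if declared is not None:
        return declared
    try:
        return DEFAULT_REFERENCE_FORMULATION[source_kind.lower()]
    except KeyError:
        raise ValueError(f"no default reference formulation for {source_kind!r}")
```

An explicitly declared reference formulation always takes precedence; the lookup is only consulted when none is given.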

The following properties MAY be specified in a Logical Source:

The logical iterator (rml:iterator) defines the iteration loop used to map the data of the input source. The iterator defines how to refer to any of the following:
- a row, in the case of databases and CSV or TSV data sources,
- a repetition pattern expressed as an element, in the case of XML documents,
- a repetition pattern expressed as an object, in the case of JSON data sources,
- and so on for other data formats.

If the iterator is not specified, the following defaults apply:

In the case of databases, CSV, or TSV data sources, the value of rml:iterator is a "row".
In the case of XML and JSON data sources, it is a valid reference to an element or an object, respectively, according to the specified reference formulation.

The Logical Source definition requires only the source (rml:source) to be specified; all other properties are optional. If a property is specified, it MUST NOT be specified multiple times.
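
The two cardinality rules stated above (rml:source is required, and no property may occur more than once) can be checked mechanically. The sketch below assumes a Logical Source has already been read into a list of (property, value) pairs; that representation and the function name are hypothetical.

```python
from collections import Counter

REQUIRED = {"rml:source"}
OPTIONAL = {"rml:referenceFormulation", "rml:iterator"}


def validate_logical_source(statements):
    """Check (property, value) pairs describing one Logical Source.

    Returns a list of human-readable problems; an empty list means the
    description satisfies the two cardinality rules sketched here.
    """
    counts = Counter(prop for prop, _ in statements)
    problems = []
    for prop in sorted(REQUIRED):
        if counts[prop] == 0:
            problems.append(f"{prop} is required")
    for prop in sorted(REQUIRED | OPTIONAL):
        if counts[prop] > 1:
            problems.append(f"{prop} is specified {counts[prop]} times")
    return problems
```

A description with a single rml:source yields no problems, while one that repeats rml:iterator and omits rml:source yields two.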

Property	Domain	Range
`rml:source`	`rml:LogicalSource`	`rml:Source`
`rml:referenceFormulation`	`rml:LogicalSource`	`ql:ReferenceFormulation`
`rml:iterator`	`rml:LogicalSource`	`Literal`

Figure 1: The structure of Source

Each Logical Source has a reference formulation that defines how to refer to elements of the input source's data. Several reference formulations (rml:ReferenceFormulation) are defined in this specification:

rr:SQL2008: SQL 2008 standard for relational databases
ql:CSV: CSV or TSV data sources
ql:JSONPath: JSON documents
ql:XPath: XML documents, a shortcut for ql:XPathReferenceFormulation with default parameters
ql:XPathReferenceFormulation: XML documents with optionally the definition of XML namespaces used in references. By default, no namespaces are defined.

ql:XPathReferenceFormulation may specify zero or more ql:namespace properties with a ql:Namespace. A ql:Namespace contains the following required properties:

ql:namespacePrefix: A Literal with the prefix used for the XML namespace.
ql:namespaceURL: A Literal with the URL identifying the XML namespace.

@prefix dcat: <http://www.w3.org/ns/dcat#> .
<#XMLNamespace> a rml:LogicalSource;
     rml:source [ a rml:Source;
       rml:access [ a dcat:Dataset;
         dcat:distribution [ a dcat:Distribution;
           dcat:accessURL <file:///path/to/data.xml>;
         ];
       ];
     ];
     rml:referenceFormulation [ a ql:XPathReferenceFormulation;
       ql:namespace [ a ql:Namespace;
         ql:namespacePrefix "ex";
         ql:namespaceURL "http://example.org";
       ];
     ];
     rml:iterator "/xpath/ex:namespace/expression";
.
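
To illustrate why a prefix and URL pair is needed before a namespaced iterator can be evaluated, the following sketch uses Python's standard `xml.etree.ElementTree`. The XML document, the element names, and the helper function are assumptions for illustration. ElementTree supports only a subset of XPath and does not accept absolute paths on an element, so the iterator here is relative to the document root.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; the "ex" prefix and URL match the example above.
XML = """<root xmlns:ex="http://example.org">
  <ex:person><ex:name>Alice</ex:name></ex:person>
  <ex:person><ex:name>Bob</ex:name></ex:person>
</root>"""

# Prefix-to-URL map, playing the role of ql:namespacePrefix / ql:namespaceURL.
NAMESPACES = {"ex": "http://example.org"}


def iterate_names(xml_text, iterator, reference, namespaces):
    """Apply the iterator, then resolve one reference in each iteration."""
    root = ET.fromstring(xml_text)
    return [item.findtext(reference, namespaces=namespaces)
            for item in root.findall(iterator, namespaces)]


names = iterate_names(XML, "ex:person", "ex:name", NAMESPACES)
```

Without the namespace map, the prefixed paths cannot be resolved, which is the situation ql:XPathReferenceFormulation is meant to address.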

SQL databases require a SQL query to be performed to retrieve a table or view from the database. This is specified through the rr:SQL2008 reference formulation from the W3C R2RML recommendation.

@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
<#SQLDatabase> a rml:LogicalSource;
     rml:source [ a rml:Source;
       rml:access [ a d2rq:Database;
          d2rq:jdbcDSN "jdbc:mysql://localhost/example";
          d2rq:jdbcDriver "com.mysql.jdbc.Driver";
          d2rq:username "user";
          d2rq:password "password";
       ];
     ];
     rml:referenceFormulation rr:SQL2008;
     rml:query "SELECT name FROM student;"
.
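
The effect of the query can be sketched with Python's standard `sqlite3` module: each result row is one iteration, and columns are referenced by name. The in-memory SQLite database and the sample rows are assumptions standing in for the MySQL database of the example, which would need a running server.

```python
import sqlite3

# In-memory stand-in for the database in the example above (an assumption:
# the example uses MySQL over JDBC, which needs a running server).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets columns be referenced by name
conn.execute("CREATE TABLE student (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

# The query from rml:query; each result row is one iteration.
rows = conn.execute("SELECT name FROM student;").fetchall()
names = [row["name"] for row in rows]
conn.close()
```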

Tabular data are widely used and described by existing standards such as the W3C CSVW recommendation. These data can be referenced by column name through the ql:CSV reference formulation.

In the following example, a CSV file is accessed, but the CSV reference formulation is not limited to files. Other types of data sources in CSV format can use the same reference formulation.

@prefix csvw: <http://www.w3.org/ns/csvw#> .
<#CSVFile> a rml:LogicalSource;
     rml:source [ a rml:Source;
       rml:access [ a csvw:Table;
         csvw:url "file:///data/file.csv" ;
         csvw:dialect [ a csvw:Dialect;
           csvw:delimiter ";";
           csvw:encoding "UTF-8";
           csvw:header "1"^^xsd:boolean;
         ];
       ];
     ];
     rml:referenceFormulation ql:CSV;
.
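
How the dialect settings translate into row-wise iteration can be sketched with Python's standard `csv` module. The file content and the helper name below are hypothetical; only the ";" delimiter and the presence of a header row are taken from the example.

```python
import csv
import io

# Hypothetical file content using the ";" delimiter from the dialect above.
CSV_TEXT = "id;name\n1;Alice\n2;Bob\n"


def iterate_rows(text, delimiter=";"):
    """Yield one dict per data row, keyed by the header's column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    yield from reader


rows = list(iterate_rows(CSV_TEXT))
names = [row["name"] for row in rows]
```

Each dictionary corresponds to one iteration, and a reference such as `name` is simply a lookup by column name.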

JSON data is hierarchical and can be referred to using JSONPath, which is specified through the ql:JSONPath reference formulation.

In the following example, a JSON file is accessed, but the JSONPath reference formulation is not limited to files. Other types of data sources in JSON format can use the same reference formulation.

@prefix dcat: <http://www.w3.org/ns/dcat#> .
<#JSONFile> a rml:LogicalSource;
     rml:source [ a rml:Source;
       rml:access [ a dcat:Dataset;
         dcat:distribution [ a dcat:Distribution;
           dcat:downloadURL <http://example.org/file.json>;
         ];
       ];
     ];
     rml:referenceFormulation ql:JSONPath;
.

XML data is hierarchical and can be referred to using XPath, which is specified through the ql:XPath reference formulation. If an XML namespace needs to be specified, the ql:XPathReferenceFormulation class can be used, which allows one or more XML namespaces to be defined.

In the following example, an XML file is accessed, but the XPath reference formulation is not limited to files. Other types of data sources in XML format can use the same reference formulation.

@prefix dcat: <http://www.w3.org/ns/dcat#> .
<#XMLNamespace> a rml:LogicalSource;
     rml:source [ a rml:Source;
       rml:access [ a dcat:Dataset;
         dcat:distribution [ a dcat:Distribution;
           dcat:accessURL <file:///path/to/data.xml>;
         ];
       ];
     ];
     rml:referenceFormulation [ a ql:XPathReferenceFormulation;
       ql:namespace [ a ql:Namespace;
         ql:namespacePrefix "ex";
         ql:namespaceURL "http://example.org";
       ];
     ];
     rml:iterator "/xpath/ex:namespace/expression";
.

A Source (rml:Source) defines how a data source should be accessed. It MUST contain the following property:

rml:access describes where a source is located. It is a URI [RFC3986] or Literal [RDF-Concepts] that represents the input data source's location. External vocabularies such as DCAT, VoID, and SD may be used here.

Optionally, the following properties MAY be specified:

rml:encoding specifies the encoding of the data inside the source. Defaults to enc:UTF-8 if not specified.
rml:null describes which data values inside the source should be considered NULL. Defaults to the data format's own NULL value, if it has one. If the format has none, no values are considered NULL unless specified. For example, CSV has no default NULL value, so no value is considered NULL by default. JSON, however, defines the NULL value null, which is used together with any values specified through rml:null.
rml:compression specifies whether the source is compressed and, if so, which compression algorithm is used. Defaults to no compression.

<#JSON> a rml:LogicalSource;
     rml:source [ a rml:Source;
       rml:access [ a dcat:Dataset;
         dcat:distribution [ a dcat:Distribution;
           dcat:accessURL <file:///path/to/data.json.gz>;
         ];
       ];
       rml:null ""; # empty string as NULL besides default null character
       rml:compression comp:gzip; # GZip compression
       rml:encoding enc:UTF-16; # UTF-16 encoding
     ];
     rml:referenceFormulation ql:JSONPath;
     rml:iterator "$.jsonpath.expression";
.
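
The combined effect of rml:compression, rml:encoding, and rml:null in the example above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the function names, the sample records, and the in-memory payload standing in for data.json.gz are all hypothetical.

```python
import gzip
import json


def read_json_source(raw_bytes, encoding="utf-8", gzip_compressed=False, nulls=()):
    """Decompress, decode, and parse a JSON source.

    `nulls` holds the extra values declared through rml:null.
    """
    if gzip_compressed:
        raw_bytes = gzip.decompress(raw_bytes)
    return json.loads(raw_bytes.decode(encoding)), tuple(nulls)


def is_null(value, nulls):
    """JSON null (None in Python) is always NULL; declared values are added to it."""
    return value is None or value in nulls


# Build a small gzip-compressed UTF-16 payload to stand in for data.json.gz.
sample = [{"name": "Alice"}, {"name": ""}, {"name": None}]
payload = gzip.compress(json.dumps(sample).encode("utf-16"))

records, nulls = read_json_source(
    payload, encoding="utf-16", gzip_compressed=True, nulls=[""]
)
kept = [r["name"] for r in records if not is_null(r["name"], nulls)]
```

Here both the JSON null and the declared empty string are treated as NULL, so only one value remains.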

Property	Domain	Range
`rml:access`	`rml:Source`	`URI or Literal`
`rml:encoding`	`rml:Source`	`enc:Encoding`
`rml:null`	`rml:Source`	`Literal`
`rml:compression`	`rml:Source`	`comp:Compression`

The following example shows a Source for a CSV file.

<#CSV> a rml:LogicalSource;
     rml:source [ a csvw:Table;
         csvw:url "/path/to/data.csv";
     ];
     rml:referenceFormulation ql:CSV;
.

Note that no rml:iterator is present because its default is a row.

The following example shows a Source specified for a database.

<#RDB> a rml:LogicalSource;
     rml:source [ a d2rq:Database;
        d2rq:jdbcDSN "jdbc:mysql://localhost/example";
        d2rq:jdbcDriver "com.mysql.jdbc.Driver";
        d2rq:username "user";
        d2rq:password "password";
     ];
     rml:referenceFormulation rr:SQL2008;
.

Note that no rml:iterator is present because its default is a row.

The following example shows a Source for an XML file.

<#XML> a rml:LogicalSource;
     rml:source [ a dcat:Dataset;
       dcat:distribution [ a dcat:Distribution;
         dcat:accessURL <file:///path/to/data.xml>;
       ];
     ];
     rml:referenceFormulation ql:XPath;
     rml:iterator "/xpath/iterator/expression";
.

The following example shows a Source for a JSON file.

<#JSON> a rml:LogicalSource;
     rml:source [ a dcat:Dataset;
       dcat:distribution [ a dcat:Distribution;
         dcat:accessURL <file:///path/to/data.json>;
       ];
     ];
     rml:referenceFormulation ql:JSONPath;
     rml:iterator "$.jsonpath.expression";
.

The Target vocabulary namespace is http://semweb.mmlab.be/ns/rml-target# and its prefix is rml.

The Target vocabulary consists of a single class, rml:LogicalTarget, which describes how a knowledge graph must be exported after generation.

A Target is any destination to which RDF triples are exported.

A Target (rml:LogicalTarget) contains the following properties:

The target (rml:target) locates the output target. It is a URI [RFC3986] or Literal [RDF-Concepts] that represents the target's location. External vocabularies such as DCAT, VoID, and SD may be used here. Each rml:LogicalTarget MUST have one rml:target property. The target MAY be a Literal containing the path of the file to which the knowledge graph is exported; this is allowed to remain backwards compatible with existing data access descriptions.
The serialization format (rml:serialization) MAY specify the serialization format for exporting a knowledge graph. The serialization format is described using the W3C formats namespace. By default, the serialization format is N-Quads [N-Quads].
The compression algorithm (rml:compression) MAY describe the compression algorithm to apply when exporting a knowledge graph. The compression format is specified through the comp namespace. By default, no compression is applied.
The encoding (rml:encoding) MAY specify which encoding must be used when exporting a knowledge graph. The encoding is specified through the enc namespace.

The Target definition requires only the target (rml:target) to be specified; all other properties are optional.
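
The required property and the stated defaults can be summarised in a small data structure. The class below is a hypothetical in-memory view, assumed for illustration only. Since the text gives no default for the encoding, the sketch leaves it unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogicalTarget:
    """Hypothetical in-memory view of an rml:LogicalTarget."""
    target: str                             # rml:target, the only required property
    serialization: str = "formats:N-Quads"  # default stated in the text
    compression: Optional[str] = None       # None means no compression (the default)
    encoding: Optional[str] = None          # the text states no default, so left unset


# Only the target is given, so the defaults apply.
plain = LogicalTarget(target="file:///data/dump.nq")
# All properties are given explicitly.
dump = LogicalTarget("file:///data/dump.ttl", "formats:Turtle", "comp:gzip", "enc:UTF-8")
```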

Property	Domain	Range
`rml:target`	`rml:LogicalTarget`	`URI or Literal`
`rml:serialization`	`rml:LogicalTarget`	`formats:Format`
`rml:compression`	`rml:LogicalTarget`	`comp:Compression`
`rml:encoding`	`rml:LogicalTarget`	`enc:Encoding`

Figure 2: The structure of Target

The following example shows a Target of an RDF dump in Turtle [Turtle] format with GZip compression and UTF-8 encoding:

<#VoIDDump> a rml:LogicalTarget;
     rml:target [ a void:Dataset;
         void:dataDump <file:///data/dump.ttl>;
     ];
     rml:serialization formats:Turtle;
     rml:compression comp:gzip;
     rml:encoding enc:UTF-8;
.

The following example shows a Target of a [SPARQL] endpoint with SPARQL UPDATE:

<#SPARQLEndpoint> a rml:LogicalTarget;
     rml:target [ a sd:Service;
       sd:endpoint  <http://example.com/sparql-update>;
       sd:supportedLanguage sd:SPARQL11Update ;
     ];
.

The following example shows a Target of a DCAT dataset in N-Quads format with Zip compression:

<#DCATDump> a rml:LogicalTarget;
     rml:target [ a dcat:Dataset;
       dcat:distribution [ a dcat:Distribution;
         dcat:accessURL <http://example.org/dcat-access-url>;
       ];
     ];
     rml:serialization formats:N-Quads;
     rml:compression comp:zip;
.

The following example shows a Target of an MQTT stream in N-Quads format without compression:

<#MQTTStream> a rml:LogicalTarget;
     rml:target [ a td:Thing;
       td:hasPropertyAffordance [
         td:hasForm [
           # URL and content type
           hctl:hasTarget "mqtt://localhost/topic";
           hctl:forContentType "application/n-quads";
           # Set MQTT parameters through W3C WoT Binding Template for MQTT
           mqv:controlPacketValue "SUBSCRIBE";
           mqv:options ([ mqv:optionName "qos"; mqv:optionValue "1" ] [ mqv:optionName "dup" ]);
         ];
       ];
     ];
     rml:serialization formats:N-Quads;
.

The following example shows a Target of a TCP stream in N-Quads format without compression:

<#TCPStream> a rml:LogicalTarget;
     rml:target [ a td:Thing;
       td:hasPropertyAffordance [
         td:hasForm [
           # URL and content type
           hctl:hasTarget "tcp://localhost:1234/topic";
           hctl:forContentType "application/n-quads";
         ];
       ];
     ];
     rml:serialization formats:N-Quads;
.

The following example shows a Target of a Kafka stream in N-Quads format without compression:

<#KafkaStream> a rml:LogicalTarget;
     rml:target [ a td:Thing;
       td:hasPropertyAffordance [
         td:hasForm [
           # URL and content type
           hctl:hasTarget "kafka://localhost:8089/topic";
           hctl:forContentType "application/n-quads";
           # Kafka parameters through W3C WoT Binding Template for Kafka
           kafka:groupId "MyAwesomeGroup";
         ];
       ];
     ];
     rml:serialization formats:N-Quads;
.

The following example shows a Target of an HTTP Server Sent Events stream in N-Quads format without compression:

<#HTTPSSEStream> a rml:LogicalTarget;
     rml:target [ a td:Thing;
       td:hasPropertyAffordance [
         td:hasForm [
           # URL and content type
           hctl:hasTarget "http://localhost:4242/";
           hctl:forContentType "application/n-quads";
           # Set HTTP method and headers through W3C WoT Binding Template for HTTP
           htv:methodName "POST";
           htv:headers ([
             htv:fieldName "User-Agent";
             htv:fieldValue "Processor";
           ]);
           # Max-Age CoAP property has number 14. Value is in seconds RFC7252
           cov:options ([ cov:optionName "14"; cov:optionValue "360" ]);
         ];
       ];
     ];
     rml:serialization formats:N-Quads;
.

The following example shows a Target of an HTTP Server Sent Events stream in N-Quads format without compression:

<#HTTPSSEStream> a rml:LogicalTarget;
     rml:target [ a td:Thing;
       td:hasPropertyAffordance [
         td:hasForm [
           # URL and content type
           hctl:hasTarget "http://localhost:4242/";
           hctl:forContentType "text/event-stream";
         ];
       ];
     ];
     rml:serialization formats:N-Quads;
.

The following example shows a Target of a WebSocket in N-Quads format without compression:

<#WebSocketStream> a rml:LogicalTarget;
     rml:target [ a td:Thing;
       td:hasPropertyAffordance [
         td:hasForm [
           # URL and content type
           hctl:hasTarget "ws://localhost:5555/";
           hctl:forContentType "application/n-quads";
         ];
       ];
     ];
     rml:serialization formats:N-Quads;
.
