Knowledge Graph Construction
W3C Community Group

Kick-Off, TPAC Meetings

Virtual, 12-16 October 2020

See Presentation

Knowledge Graph Construction W3C Community Group

The overall goal of this community group is to support its participants in developing better methods for Knowledge Graph construction. The Community Group will (i) study current Knowledge Graph construction methods and implementations, (ii) identify the corresponding requirements and issues that hinder broader Knowledge Graph construction, (iii) discuss use cases, (iv) formulate guidelines, best practices and test cases for Knowledge Graph construction, (v) develop methods, resources and tools for evaluating Knowledge Graph construction, and in general (vi) continue the development of the W3C-recommended R2RML language beyond relational databases. The proposed Community Group could be instrumental in advancing research, raising the level of education and awareness, and enabling learning and participation with respect to Knowledge Graph construction.

Agenda

Schedule*

Session 1 - 12th October (14:00 - 15:45 UTC+2): Welcome

Click here to join the session via Zoom
  • 14:00 - 14:15 Welcome
  • 14:15 - 14:45 Introductions
  • 14:45 - 15:15 Goals of the Community Group
  • 15:15 - 15:30 Organizational decisions
  • 15:30 - 15:40 Closing

Session 2 - 12th October (16:00 - 18:00 UTC+2): Retrospectives from the RDB2RDF WG

Click here to join the session via Zoom
  • 16:00 - 16:15 Welcome
  • 16:15 - 17:30 Panel with R2RML and Direct Mapping editors:
    • Boris Villazón-Terrazas, Tinámica.
    • Juan Sequeda, Data.World
    • Marcelo Arenas, Pontificia Universidad Católica de Chile
    • Souripriya Das, Oracle
  • 17:30 - 17:50 Open Discussion
  • 17:50 - 18:00 Closing

Session 3 - 13th October (14:00 - 16:00 UTC+2): Mapping Languages

Click here to join the session via Zoom
  • 14:00 - 14:15 Welcome
  • 14:15 - 14:45 Presentation of existing languages**
  • 14:45 - 15:15 Pitching challenges**
  • 15:15 - 15:45 Open discussion
  • 15:45 - 16:00 Closing

Session 4 - 15th October (14:00 - 16:00 UTC+2): Tools and Demos

Click here to join the session via Zoom
  • 14:00 - 14:15 Welcome
  • 14:15 - 15:30 Presentation of existing tools, split by category**
  • 15:30 - 15:50 Open discussion and questions
  • 15:50 - 16:00 Closing

Session 5 - 16th October (14:00 - 16:00 UTC+2): Use Cases

Click here to join the session via Zoom
  • 14:00 - 14:15 Welcome
  • 14:15 - 15:30 Presentation of use cases, split by category**
  • 15:30 - 15:50 Next steps: requirements extraction
  • 15:50 - 16:00 Closing

*Links to each video call will be added soon

**We contacted a few people that we expected would present their languages/tools/use cases/challenges, based on the contributions on GitHub, but we might have missed someone! If we didn't contact you and you want to present, please drop us an email!

Important dates

07 October, 2020

Contribute

Do you have a real use case for constructing Knowledge Graphs? Tools, mapping language specifications or benchmarks? Contribute your resources via our GitHub repository

08 October, 2020
12-16 October, 2020

Kick-Off Meeting

Attend the five sessions to discuss and define with us the future of Knowledge Graph Construction

Organization

Anastasia Dimou

Senior Researcher, imec - IDLab (UGent)

David Chaves Fraga

PhD Student, OEG - UPM


And almost 90 participants....

Report

Last week we had the kick-off of the Knowledge Graph Construction Community Group (KG-Construct CG)! We missed the chance to count the exact number of participants, as we were overwhelmed by a turnout that exceeded our expectations! At the time of the kick-off, the CG had about 80 registered members, and attendance on the first day exceeded 50 participants! In total, we are confident that across all days we had around 60 participants, which seems like a great turnout! Thank you all for joining!


history

This CG builds further on the work from the RDB2RDF W3C Working Group and the Knowledge Graph Building Workshop. The RDB2RDF W3C Working Group published two W3C recommendations: A Direct Mapping of Relational Data to RDF and R2RML: RDB to RDF Mapping Language. The Knowledge Graph Building workshop aimed to bring together the community of people working on Knowledge Graph construction beyond DBs only. Given its success, forming such a CG was the only logical next step!


aim, goals and output

The aim of this CG is to bring together people working on KG construction from semi-structured data, where the KG is in RDF (at the moment) and the semi-structured data is in XML, CSV, JSON, etc. During the kick-off day, the specific goals of the CG were also identified: (i) study current methods and implementations, (ii) discuss use cases and derive requirements that are not yet covered, (iii) formulate guidelines and best practices, and (iv) develop methods, resources and tools for evaluation. The expected outcomes of the CG include white papers, guidelines, best practices and future challenges. Of course, the goals and outputs are open to discussion and can be shaped more concretely as the CG proceeds.


audience

There was a broad discussion regarding the audience of the CG. Juan Sequeda pointed to the lack of industry adoption and, thus, the need to involve people from industry. Maria Esther Vidal raised some objections to Juan's point, emphasizing the importance of keeping the CG focused. She mentioned that we should have scientific contributions that work in real-world scenarios. Both are very good points; a middle ground could be that we do need people from industry, without losing our CG's focus. In any case, these remarks will keep us busy during our bi-weekly meetings! It should become clearer what the CG's audience is!


retrospectives with representatives from the RDB2RDF WG

The first day, not only did the CG kick off, but we also had a retrospective panel with representatives of the RDB2RDF WG! Juan Sequeda and Marcelo Arenas, editors of the Direct Mapping recommendation, Souripriya (Souri) Das, editor of the R2RML recommendation, and Boris Villazón-Terrazas, editor of the R2RML Implementation Report. We were very excited about the discussions this panel could trigger, and it exceeded our expectations!

With this panel, we wanted to bridge the gap between two generations, i.e., the generation that worked on constructing KGs from databases and the new generation that works on constructing KGs from heterogeneous data, beyond databases. Apparently, after the recommendations were published, the community faded out and there was no continuation of the work. The panelists agreed that this was unfortunate and emphasized the importance of building a community and disseminating, which is easier nowadays thanks to the advancement of technologies! In the case of Machine Learning (ML), the community spirit prevailed, and this is reflected in the results. Good news: we are headed in the right direction!

The motivations and expectations were very well-defined when the RDB2RDF Working Group was formed, the panelists mentioned. The expectation was clear: a recommendation to generate KGs from data in relational databases. So was the motivation: keep it simple but powerful! There was also a good mixture of industry and academia. The R2RML approach felt natural, but in the process it became clear that an alternative allowing a quick start was also necessary, and that's how the Direct Mapping alternative emerged. In both cases the goal was well-set; only how to express it was vague. advice for our CG: set clear goals!


R2RML

During the retrospectives, a lot of questions that we had about R2RML were answered, especially regarding joins, lists, and, of course, inverse expressions!

Joins: It was explained how the joins were chosen. Juan mentioned that the D2R language, R2RML's predecessor, had condition constructs separate from the SQL query, and Souri joked that he got scared when he saw it :-) In the end, a simple join condition was kept in R2RML and more complicated data transformations and joins were pushed to SQL. This approach needs reconsidering, though, when heterogeneous data is used to construct a KG: not all data formats have query languages as powerful as SQL.
RDF lists: It is a fact that RDF lists cannot be generated with R2RML. Boris, Marcelo and Juan didn't remember RDF lists being discussed. Souri mentioned that fully supporting RDF lists was just too hard, although some constructs can be generated with R2RML. The question is: how necessary is it? Should we be able to generate RDF lists from heterogeneous data?
Inverse expressions: It was also explained that the inverse expressions were a last-minute addition, and the panelists apologized for including them! advice for our CG: ignore the inverse expressions, with the editors' blessing! We are sure a lot of people are relieved by this statement! :-)
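To make the RDF-lists point concrete: an RDF list is a chain of blank nodes linked by rdf:first/rdf:rest, so each element's triples depend on the next element in the chain, which is state that a per-row template mapping cannot easily carry. A minimal sketch of that chaining in plain Python (the function and node names are illustrative, not part of any mapping language):

```python
# The rdf:first / rdf:rest properties from the RDF vocabulary.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def as_rdf_list(values):
    """Encode `values` as an RDF collection in N-Triples syntax."""
    triples = []
    # One blank node per list cell, e.g. _:n0, _:n1, ...
    nodes = [f"_:n{i}" for i in range(len(values))]
    for i, value in enumerate(values):
        # Each cell points to the next cell, or to rdf:nil at the end -
        # this cross-row dependency is what per-row templates lack.
        rest = nodes[i + 1] if i + 1 < len(values) else f"<{RDF}nil>"
        triples.append(f'{nodes[i]} <{RDF}first> "{value}" .')
        triples.append(f'{nodes[i]} <{RDF}rest> {rest} .')
    return triples

chain = as_rdf_list(["a", "b"])
# Two cells produce four triples; the last one closes the list with rdf:nil.
```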


advice and perspectives

Throughout the discussions, the panelists gave their advice and hinted at points worth investigating! Marcelo suggested looking into creating libraries, as they do in ML, and Juan agreed, adding that implementing good tools matters for the adoption of standards, i.e., the adoption of the standards depends on the tools. Juan also suggested keeping an eye on work outside the Semantic Web community. Souri suggested looking at generating RDF (materialized) views on top of natural language. He also provided some well-documented suggestions in a slide deck.

The panelists also agreed that defining rules is not a technical issue but a social one. Different views may emerge from the same data depending on the user and the intended use. Juan pointed out something we as computer scientists do not know: who is the user? We don't know. User studies, perhaps? Taking care of the users, e.g., by providing examples of how things should be done and corresponding solutions, is one of the things they regret not doing after R2RML became a recommendation. advice for our CG: put emphasis on providing material!


mapping languages

Besides R2RML, mapping languages for constructing KGs from heterogeneous data were discussed in a dedicated session. There was a good variety of alternatives presented, representing the dominant trends. In all cases, the mapping languages build on an existing specification and reuse existing grammars to refer to the input data: RML and xR2RML are based on the R2RML W3C recommendation, SPARQL-Generate on the SPARQL W3C recommendation, and ShExML on the ShEx specification. The Ontop mapping language was also presented, as well as YARRRML, a human-friendly representation of [R2]RML.

While there are quite some similarities, there are also differences between the mapping languages proposed so far. RML was proposed with minimal changes over R2RML in mind, to prove that it is possible to construct KGs from heterogeneous data with the same mapping language. xR2RML follows the RML principles but dives into more detail on limitations that RML inherited from R2RML, e.g., accessing outer fields while iterating over hierarchical data, or supporting RDF lists/containers. SPARQL-Generate follows the same principles but targets an audience more familiar with SPARQL. Lastly, ShExML tries to bring generation and validation one step closer, in a different way than combining RML and RDFUnit did. Franck Michel suggested working towards a generic mapping model from a scientific point of view, with concrete guidelines on how to achieve this from a pragmatic point of view. It was concluded that we not only need to investigate the differences but also build alignments between the languages and corresponding implementations, to easily switch from one language to another.
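At their core, all the declarative languages above express the same idea: a subject template plus predicate-object rules that turn each source record into triples. A hand-rolled sketch of that idea in Python, for a CSV source (the namespace, template and data below are purely illustrative, not taken from any of the languages or the meeting):

```python
import csv
import io

# Illustrative input: a tiny CSV source.
SOURCE = """id,name
10,Alice
20,Bob
"""

# The declarative part a mapping language would express:
# a subject IRI template and one predicate-object rule per column.
SUBJECT_TEMPLATE = "http://example.org/person/{id}"  # hypothetical namespace
NAME_PREDICATE = "http://xmlns.com/foaf/0.1/name"    # FOAF name property

def map_rows(csv_text):
    """Yield one N-Triples line per row, by filling the templates."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = SUBJECT_TEMPLATE.format(**row)
        yield f'<{subject}> <{NAME_PREDICATE}> "{row["name"]}" .'

triples = list(map_rows(SOURCE))
# Each row yields one triple, e.g.:
# <http://example.org/person/10> <http://xmlns.com/foaf/0.1/name> "Alice" .
```

The languages in this session differ mainly in how the source access (here, hard-coded CSV reading) and the iteration are declared, not in this basic record-to-triples shape.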

Of course, behind the success stories of mapping languages there are still challenges to be addressed, as brought up in the past by Vladimir Alexiev, Pano Maria and Ben De Meester. These challenges relate to, e.g., generating language and datatype tags based on the data, data transformations, multi-valued subject maps, joins (especially for hierarchical data) and when to process them (even though the latter is an implementation aspect), data streams as input, etc. https://github.com/kg-construct/mapping-challenges summarizes all challenges, and we hope to soon see more challenges listed and lots of updates on mapping languages!


tools and use cases

There were 17 tools and 12 use cases presented during the TPAC meetings, but more have been collected in the repositories (https://github.com/kg-construct/resources and https://github.com/kg-construct/use-cases) and hopefully even more will come in the future! So, feel free to add your tool or use case to the repositories!

Discussions were raised about tools and what we further need. To mention a few of the questions that popped up: Francesco Osborne raised the point that we need a list of properly-tested tools. Which tools are still missing? Similarly, Vladimir suggested that we need an evaluation of tools, not only with respect to their conformance to the test cases, but also with respect to their functionality and performance. Should we have a testing infrastructure?

Despite the exhausting week, there were a lot of participants even on the last day, and interesting remarks emerged around the use cases. Apparently, it is common practice to manually preprocess the data to clean it up before the mapping rules are defined. The mapping rules are always manually defined, as is the model's/ontology's definition. It seems that preprocessing steps per domain are alike, so it might be a good idea to look into reusable libraries and packages. It was also noted that we don't build tools but pipelines, so we need to invest more work in reproducible workflows. Lastly, Francesco mentioned that we should consider heterogeneous data sources in combination with existing KGs and NLP techniques for entity recognition, to boost some automation.


logistics

The CG decided to continue its work with bi-weekly meetings and smaller groups working on more focused topics. The bi-weekly meetings will take place on Mondays at 15:00 UTC+2, starting from Monday 26/10. Some smaller groups were already identified with respect to (i) languages, (ii) the R2RML implementation report and (iii) test cases. Regarding use cases, there was a great variety, and some conclusions indicated that certain preprocessing steps may need to be addressed per domain. We need to dive further into the use cases to decide how we proceed, and we still need to identify more use cases, in particular from industry.