Import RDF files to Neo4j database

1. Introduction

In this post, I will show how to import RDF (Resource Description Framework) files into a graph database (Neo4j). RDF is a framework for describing resources on the web, and it is commonly serialized as XML (RDF/XML). It is designed to be machine-readable, which also means that RDF files are not easy for people to read. To visualize and analyze data represented in RDF more deeply, we need to transfer it into a Neo4j database.

2. How data is stored in RDF files

An RDF statement is a triple: a combination of a resource, a property, and a property value, also referred to as subject, predicate, and object. It is written as

(Subject, Predicate, Object)

Subjects and predicates are resources, while an object can be either a resource or a literal. If the object is a literal, it cannot appear as the subject of another RDF statement.

We need to define a set of rules that make importing RDF/XML possible.

2.1.1 Rule One

Every subject has its own URI (Uniform Resource Identifier), so the subjects of triples are mapped to nodes in the graph database. A node in Neo4j that represents an RDF resource has the label Resource and a property uri that holds the resource's URI.

(Subject, Predicate, Object) => (:Resource {uri:Subject})

2.1.2 Rule Two

If the object of an RDF triple is a literal, the predicate and the object are mapped to a node property and its value.

(Subject,Predicate,Object) && isLiteral(Object) => (:Resource {uri:Subject, Predicate:Object})

2.1.3 Rule Three

If the object of an RDF triple is also a resource, it is mapped to another node identified by its URI, and the predicate becomes the relationship between the subject node and the object node.

(Subject,Predicate,Object) && !isLiteral(Object) => (:Resource {uri:Subject})-[:Predicate]->(:Resource {uri:Object})
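The three rules can be sketched in a few lines of Java. This is only an illustration of the mapping, not the code that any importer actually runs: the class TripleMapper, the nested Triple type, and the toPattern method are hypothetical names, and the strings it builds are meant for display, so no escaping of quotes is attempted.

```java
/** Minimal sketch of mapping rules one to three; every name here is hypothetical. */
final class TripleMapper {

    /** An RDF triple whose object is either a resource URI or a literal value. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;
        final boolean objectIsLiteral;

        Triple(String subject, String predicate, String object, boolean objectIsLiteral) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
            this.objectIsLiteral = objectIsLiteral;
        }
    }

    /** Renders the Cypher-style pattern that the rules assign to one triple. */
    static String toPattern(Triple t) {
        // Rule One: the subject always becomes a :Resource node keyed by its URI.
        String subjectNode = "(:Resource {uri: '" + t.subject + "'";
        if (t.objectIsLiteral) {
            // Rule Two: a literal object becomes a property on the subject node.
            return subjectNode + ", `" + t.predicate + "`: '" + t.object + "'})";
        }
        // Rule Three: a resource object becomes a second node plus a relationship.
        return subjectNode + "})-[:`" + t.predicate + "`]->(:Resource {uri: '" + t.object + "'})";
    }

    public static void main(String[] args) {
        System.out.println(toPattern(new Triple(
                "http://sentic.net/api/en/concept/a_little",
                "http://sentic.net/text", "a little", true)));
    }
}
```

The only decision the sketch makes is whether the object is a literal, which is exactly the distinction between Rule Two and Rule Three.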

2.2 Some Exceptions

The previous three rules define a generic way of mapping RDF triples to nodes and relationships in a Neo4j database. However, there are some exceptions.

2.2.1 Exception One

RDF files contain rdf:type statements, which are usually mapped to categories in Neo4j. Under Rule Three, many nodes would end up linked to one specific category node. To have fewer relationships and make retrieval easier, we treat the object as the label of the subject node instead.

(Subject, rdf:type, Category) => (:Category {uri:Subject})
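A small Java sketch of this exception follows. It assumes that the label is taken from the part of the type URI after the last '/' or '#'; that choice, and the names TypeLabels, localName, and toPattern, are my own assumptions for illustration.

```java
/** Sketch of Exception One: an rdf:type object is turned into a node label. */
final class TypeLabels {
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    /** Returns the text after the last '#' or '/' of a URI (the whole URI if neither occurs). */
    static String localName(String uri) {
        int cut = Math.max(uri.lastIndexOf('#'), uri.lastIndexOf('/'));
        return uri.substring(cut + 1);
    }

    /** True when the predicate is rdf:type, i.e. when the exception applies. */
    static boolean isTypeStatement(String predicate) {
        return RDF_TYPE.equals(predicate);
    }

    /** Renders the labelled node for a (subject, rdf:type, typeUri) triple. */
    static String toPattern(String subject, String typeUri) {
        return "(:`" + localName(typeUri) + "` {uri: '" + subject + "'})";
    }

    public static void main(String[] args) {
        System.out.println(toPattern(
                "http://sentic.net/api/en/concept/a_little",
                "http://sentic.net/api/concept"));
    }
}
```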

2.2.2 Exception Two

In RDF/XML, a property element cannot directly contain further property elements. To attach a group of objects to a subject through one property, we use a blank node as an anonymous resource that the objects link to. To retrieve these anonymous nodes easily, we create an index on them. Each blank node is given a unique id on import to avoid clashes.

The Cypher statement for creating the index is as follows:

CREATE INDEX ON :BNode(uri)

2.2.3 Exception Three

Literal objects usually have datatypes associated with them. If the datatype is not explicitly declared in the RDF triple, the value is loaded as a String by default.
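The default-to-String behaviour can be sketched as follows. The set of XML Schema datatypes handled here is a small, arbitrary selection, and the class LiteralValues and its method toJavaValue are hypothetical names; a real importer may support more types or convert them differently.

```java
/** Sketch of Exception Three: typed literals are converted, untyped ones stay Strings. */
final class LiteralValues {
    static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    /** Converts a literal's lexical form using its datatype URI; null means "no datatype". */
    static Object toJavaValue(String lexical, String datatypeUri) {
        if (datatypeUri == null || !datatypeUri.startsWith(XSD)) {
            return lexical; // no (known) datatype: load as String
        }
        String text = lexical.trim();
        switch (datatypeUri.substring(XSD.length())) {
            case "float":
            case "double":
                return Double.parseDouble(text);
            case "integer":
            case "int":
            case "long":
                return Long.parseLong(text);
            case "boolean":
                // xsd:boolean allows both "true" and "1" as true values
                return "true".equals(text) || "1".equals(text);
            default:
                return lexical; // unsupported datatype: fall back to String
        }
    }

    public static void main(String[] args) {
        System.out.println(toJavaValue("-0.99", XSD + "float"));
        System.out.println(toJavaValue("negative", null));
    }
}
```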

3. Existing Tools for Importing RDF into Neo4j

If an existing tool already parses RDF files and imports them into Neo4j, we do not need to implement these rules from scratch. As clients, all we need to do is call its API. I recommend neosemantics, which describes itself as "Graph+Semantics: Import/Export RDF from Neo4j. Model mapping, inferencing and more…"

3.1 Installation

At the time of writing, the latest stable version of Neo4j is 3.5.5, so please download the neosemantics-3.5.0.2.jar file from the releases area. If you want to know how it is implemented, you can download the source code and build the jar package yourself.

  1. Copy the jar file to

    [neo4j_home]/plugins

  2. Add the following line to [neo4j_home]/conf/neo4j.conf

    dbms.unmanaged_extension_classes=semantics.extension=/rdf

  3. Restart the server and run the following statement. If the response is “ok”, the plugin has been installed successfully.

    :GET /rdf/ping

3.2 Using neosemantics to Import RDF Files into Neo4j

In my project, the goal is to import SenticNet 5 [https://sentic.net/downloads/] into the graph database. Let us explore SenticNet by looking at its RDF file.

SenticNet is essentially a large dictionary of concepts (words and short phrases), and every concept is linked to semantically related concepts.

<rdf:Description rdf:about="http://sentic.net/api/en/concept/a_little">
        <rdf:type rdf:resource="http://sentic.net/api/concept"/>
        <text xmlns="http://sentic.net">a little</text>
        <semantics xmlns="http://sentic.net" >
            <concept xmlns="http://sentic.net" rdf:resource="http://sentic.net/api/en/concept/least"/>
            <concept xmlns="http://sentic.net" rdf:resource="http://sentic.net/api/en/concept/little"/>
            <concept xmlns="http://sentic.net" rdf:resource="http://sentic.net/api/en/concept/small_amount"/>
            <concept xmlns="http://sentic.net" rdf:resource="http://sentic.net/api/en/concept/shortage"/>
            <concept xmlns="http://sentic.net" rdf:resource="http://sentic.net/api/en/concept/scarce"/>
        </semantics>
        <sentics xmlns="http://sentic.net" >
            <pleasantness xmlns="http://sentic.net" rdf:datatype="http://www.w3.org/2001/XMLSchema#float">-0.99</pleasantness>
            <attention xmlns="http://sentic.net" rdf:datatype="http://www.w3.org/2001/XMLSchema#float">0</attention>
            <sensitivity xmlns="http://sentic.net" rdf:datatype="http://www.w3.org/2001/XMLSchema#float">0</sensitivity>
            <aptitude xmlns="http://sentic.net" rdf:datatype="http://www.w3.org/2001/XMLSchema#float">-0.70</aptitude>
        </sentics>
        <moodtags xmlns="http://sentic.net">
            <concept xmlns="http://sentic.net" rdf:resource="http://sentic.net/api/en/concept/sadness"/>
            <concept xmlns="http://sentic.net" rdf:resource="http://sentic.net/api/en/concept/disgust"/>
        </moodtags>
        <polarity xmlns="http://sentic.net">
            <value xmlns="http://sentic.net">negative</value>
            <intensity xmlns="http://sentic.net" rdf:datatype="http://www.w3.org/2001/XMLSchema#float">-0.84</intensity>
        </polarity>
</rdf:Description>

Every word has several semantics, which are themselves resources, and moodtags, which are the categories of the word. In addition, each word has four-dimensional sentics (pleasantness, attention, sensitivity, and aptitude), each with a value from -1 to 1. The polarity element holds an overall score derived from them.

Although the file looks like RDF/XML, it is actually not valid RDF/XML: a property element such as semantics cannot directly contain other property elements, because a node is needed in between. We need to refactor the file to make it valid first.

The concept tags under semantics or moodtags need a form that allows the rdf:Description to be omitted. This is done by putting an rdf:parseType="Resource" attribute on the containing property element, which turns it into a property-and-node element that can itself have property elements. The node created this way is a blank node.
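Before editing a file, it can help to list which property elements are affected. The following is a heuristic sketch that uses only the XML parser shipped with the JDK: it reports every direct child of an rdf:Description that has nested element children (outside the RDF namespace) but no rdf:parseType attribute. The class and method names are hypothetical, and the check is not a full RDF/XML validator.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Heuristic check for property elements that need rdf:parseType="Resource". */
final class NestingCheck {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns the local names of property elements that nest elements without rdf:parseType. */
    static List<String> propertiesNeedingParseType(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rdfXml)));
            List<String> names = new ArrayList<>();
            NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
            for (int i = 0; i < descriptions.getLength(); i++) {
                Node description = descriptions.item(i);
                for (Node p = description.getFirstChild(); p != null; p = p.getNextSibling()) {
                    if (p instanceof Element && needsParseType((Element) p)) {
                        names.add(p.getLocalName());
                    }
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse RDF/XML", e);
        }
    }

    /** True if the element nests non-RDF child elements and has no rdf:parseType attribute. */
    private static boolean needsParseType(Element property) {
        if (property.hasAttributeNS(RDF_NS, "parseType")) {
            return false;
        }
        for (Node c = property.getFirstChild(); c != null; c = c.getNextSibling()) {
            // children in the RDF namespace (e.g. rdf:Description) are legitimate node elements
            if (c instanceof Element && !RDF_NS.equals(c.getNamespaceURI())) {
                return true;
            }
        }
        return false;
    }
}
```

For the SenticNet entry shown earlier, such a check would be expected to report semantics, sentics, moodtags, and polarity.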

<rdf:Description rdf:about="http://sentic.net/api/en/concept/a_little">
        <rdf:type rdf:resource="http://sentic.net/api/concept"/>
        <text xmlns="http://sentic.net/">a little</text>
        <semantics xmlns="http://sentic.net/" rdf:parseType="Resource">
            <concept xmlns="http://sentic.net/" rdf:resource="http://sentic.net/api/en/concept/least"/>
            <concept xmlns="http://sentic.net/" rdf:resource="http://sentic.net/api/en/concept/little"/>
            <concept xmlns="http://sentic.net/" rdf:resource="http://sentic.net/api/en/concept/small_amount"/>
            <concept xmlns="http://sentic.net/" rdf:resource="http://sentic.net/api/en/concept/shortage"/>
            <concept xmlns="http://sentic.net/" rdf:resource="http://sentic.net/api/en/concept/scarce"/>
        </semantics>
        <sentics xmlns="http://sentic.net/" rdf:parseType="Resource">
            <pleasantness xmlns="http://sentic.net/" rdf:datatype="http://www.w3.org/2001/XMLSchema#float">-0.99</pleasantness>
            <attention xmlns="http://sentic.net/" rdf:datatype="http://www.w3.org/2001/XMLSchema#float">0</attention>
            <sensitivity xmlns="http://sentic.net/" rdf:datatype="http://www.w3.org/2001/XMLSchema#float">0</sensitivity>
            <aptitude xmlns="http://sentic.net/" rdf:datatype="http://www.w3.org/2001/XMLSchema#float">-0.70</aptitude>
        </sentics>
        <moodtags xmlns="http://sentic.net/" rdf:parseType="Resource">
            <concept xmlns="http://sentic.net/" rdf:resource="http://sentic.net/api/en/concept/sadness"/>
            <concept xmlns="http://sentic.net/" rdf:resource="http://sentic.net/api/en/concept/disgust"/>
        </moodtags>
        <polarity xmlns="http://sentic.net/" rdf:parseType="Resource">
            <value xmlns="http://sentic.net/">negative</value>
            <intensity xmlns="http://sentic.net/" rdf:datatype="http://www.w3.org/2001/XMLSchema#float">-0.84</intensity>
        </polarity>
</rdf:Description>

Simply call the API, and the triples are imported into the graph database. For more details about using the API, the author of neosemantics gives a very detailed explanation on the GitHub page and in his blog posts.

CALL semantics.importRDF("file:///C:/Users/OneDrive/Desktop/try.xml",
"RDF/XML",
{ shortenUrls: true, typesToLabels: true, commitSize: 9000 })

Note: be careful with the file path. A valid file URI on Windows looks like ‘file:///C:/Documents%20and%20Settings/davris/FileSchemeURIs.doc’.
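If you are unsure how to write such a URI by hand, the JDK can produce it from an ordinary path. This is a minimal sketch; the class name FileUris and the sample path are placeholders of mine.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Builds a file URI from a local path so that spaces and similar characters are encoded. */
final class FileUris {

    /** Resolves the path against the working directory and returns it as a file URI string. */
    static String toFileUri(String path) {
        Path absolute = Paths.get(path).toAbsolutePath().normalize();
        return absolute.toUri().toString();
    }

    public static void main(String[] args) {
        // prints a URI of the form file:///.../my%20data/try.xml
        System.out.println(toFileUri("my data/try.xml"));
    }
}
```

The resulting string can then be passed as the first argument of semantics.importRDF.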


After the import, the parser has automatically generated namespace prefixes for the labels of nodes and relationships. Prefixes such as ns0, ns1, … are hard to read and query. Hence, let us define namespace prefixes before importing the RDF triples.

CREATE (:NamespacePrefixDefinition {`http://sentic.net/`: 'sentic_net', `http://sentic.net/api/`: 'keyword'});

Here ‘http://sentic.net/’ is mapped to the prefix sentic_net, and ‘http://sentic.net/api/’ is mapped to keyword, a name of my own choosing.
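To see why the two overlapping namespaces do not conflict, here is a sketch of how a URI could be shortened with such a prefix table. It assumes that the longest matching namespace wins and that prefix and local name are joined with an underscore; both are assumptions of mine for illustration, and neosemantics may implement this differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of URI shortening with a namespace-to-prefix table (longest match wins). */
final class PrefixShortener {

    /** Replaces the longest registered namespace at the start of the URI with its prefix. */
    static String shorten(String uri, Map<String, String> prefixes) {
        String best = null;
        for (String namespace : prefixes.keySet()) {
            if (uri.startsWith(namespace) && (best == null || namespace.length() > best.length())) {
                best = namespace;
            }
        }
        if (best == null) {
            return uri; // no registered namespace matches: leave the URI unchanged
        }
        return prefixes.get(best) + "_" + uri.substring(best.length());
    }

    public static void main(String[] args) {
        Map<String, String> prefixes = new LinkedHashMap<>();
        prefixes.put("http://sentic.net/", "sentic_net");
        prefixes.put("http://sentic.net/api/", "keyword");
        System.out.println(shorten("http://sentic.net/concept", prefixes));
        System.out.println(shorten("http://sentic.net/api/concept", prefixes));
    }
}
```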


The result now looks more reasonable. When we want to find all words similar to a given word, we follow only two relationships: sentic_net_semantics to reach the blank node, and sentic_net_concept to reach all similar word resources.

To retrieve blank nodes, resources, and other nodes quickly, do not forget to create indexes on the labels that are queried frequently:

CREATE INDEX ON :Resource(uri)

CREATE INDEX ON :URI(uri)

CREATE INDEX ON :BNode(uri)

CREATE INDEX ON :Class(uri)

We would like to retrieve all words whose pleasantness, attention, and so on lie in some range or equal a specific value. Instead of importing these values as key-value properties, we treat them as resources of their own.

Therefore, we slightly change the RDF file.

<sentics xmlns="http://sentic.net/" rdf:parseType="Resource">
    <pleasantness xmlns="http://sentic.net/" rdf:resource="http://sentic.net/pleasantness/-0.99"/>
    <attention xmlns="http://sentic.net/" rdf:resource="http://sentic.net/attention/0"/>
    <sensitivity xmlns="http://sentic.net/" rdf:resource="http://sentic.net/sensitivity/0"/>
    <aptitude xmlns="http://sentic.net/" rdf:resource="http://sentic.net/aptitude/-0.70"/>
</sentics>

<polarity xmlns="http://sentic.net/" rdf:parseType="Resource">
    <value xmlns="http://sentic.net/">negative</value>
    <intensity xmlns="http://sentic.net/" rdf:resource="http://sentic.net/intensity/-0.84"/>
</polarity>
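The conversion between a float value and its resource URI can be sketched as below. The URI layout follows the edited XML above; the class ValueUris and its methods are hypothetical helpers of mine. Note that the lexical form is used as written, so "0" and "0.0" would produce two different resources.

```java
/** Sketch of turning a sentics value into a resource URI and reading it back. */
final class ValueUris {
    static final String BASE = "http://sentic.net/";

    /** Builds the resource URI for one dimension, e.g. pleasantness and "-0.99". */
    static String toUri(String dimension, String lexicalValue) {
        return BASE + dimension + "/" + lexicalValue.trim();
    }

    /** Parses the numeric value back out of the last path segment of the URI. */
    static double valueOf(String uri) {
        return Double.parseDouble(uri.substring(uri.lastIndexOf('/') + 1));
    }

    /** True if the value encoded in the URI lies in the closed interval [low, high]. */
    static boolean inRange(String uri, double low, double high) {
        double value = valueOf(uri);
        return value >= low && value <= high;
    }

    public static void main(String[] args) {
        String uri = toUri("pleasantness", "-0.99");
        System.out.println(uri + " in [-1, -0.5]: " + inRange(uri, -1.0, -0.5));
    }
}
```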

In addition, define a namespace prefix for the URL of each attribute as follows:

CREATE (:NamespacePrefixDefinition { `http://sentic.net/`: 'sentic_net',
`http://sentic.net/api/`: 'keyword',
`http://sentic.net/pleasantness/`: 'pleasantness',
`http://sentic.net/attention/`: 'attention',
`http://sentic.net/sensitivity/`: 'sensitivity',
`http://sentic.net/aptitude/`: 'aptitude',
`http://sentic.net/intensity/`: 'intensity'
});

Rerun the import to get the updated graph.


Those URIs describe pleasantness, attention, and the other attributes with the correct values.

4. Automatically Refactoring the RDF with dom4j

We have successfully imported the manually edited RDF file into the Neo4j database. The key problem now is to write code that automatically modifies each RDF description into the format we want.

Here, I use the dom4j framework to manipulate the XML/RDF file. dom4j is an open source framework for processing XML; it is integrated with XPath and fully supports DOM, SAX, JAXP, and the Java platform, such as Java 2 Collections. (For more about dom4j, visit the GitHub page: https://github.com/dom4j/dom4j.) I use Maven as the project management tool.

In pom.xml, we add the dependencies

<dependencies>
        <!-- https://mvnrepository.com/artifact/org.dom4j/dom4j -->
        <dependency>
            <groupId>org.dom4j</groupId>
            <artifactId>dom4j</artifactId>
            <version>2.1.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/jaxen/jaxen -->
        <dependency>
            <groupId>jaxen</groupId>
            <artifactId>jaxen</artifactId>
            <version>1.2.0</version>
        </dependency>
</dependencies>

Note that in my setup jaxen, the XPath engine for Java, was not compatible with the dom4j version used here. If you want to use jaxen to find nodes in the DOM tree quickly, an earlier version of dom4j may be an option. Since the RDF file is not that complicated, I simply use iterators to find the tags.
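For readers who do want XPath, here is a sketch that avoids the jaxen question altogether by using the XPath support built into the JDK (javax.xml.xpath) instead of dom4j. This is a swapped-in technique, not what the rest of this post uses, and the class name XPathLookup is hypothetical. The expression matches on local-name() so that no namespace context has to be configured.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Finds the children of every sentics element with the JDK's built-in XPath engine. */
final class XPathLookup {

    /** Returns the local names of all child elements of sentics elements, in document order. */
    static List<String> senticsChildren(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rdfXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//*[local-name()='sentics']/*", doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getLocalName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("XPath lookup failed", e);
        }
    }
}
```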

First, write a utility class for reading and writing XML files.

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.io.SAXReader;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class RdfUtil {

    // Parse the XML file at fileName into a dom4j Document.
    public static Document readDoc(String fileName) throws DocumentException {
        SAXReader reader = new SAXReader();
        return reader.read(new File(fileName));
    }

    // Write the document to outputFileName; try-with-resources closes the writer even on failure.
    public static void writeDoc(String outputFileName, Document document) throws IOException {
        try (Writer out = new FileWriter(outputFileName)) {
            document.write(out);
        }
    }
}

The enum that holds the designed URIs used to convert the float values:

public enum  URIEnum {
    Pleasantness_URI("http://sentic.net/pleasantness/"),
    Attention_URI("http://sentic.net/attention/"),
    Sensitivity_URI("http://sentic.net/sensitivity/"),
    Aptitude_URI("http://sentic.net/aptitude/"),
    Intensity_URI("http://sentic.net/intensity/");

    private String uri;
    URIEnum(String uri) {
        this.uri = uri;
    }

    public String getUri() {
        return uri;
    }
}

Traverse the tree and modify the document. To remove the text of a tag, use setText() to set the text to an empty string.

import org.dom4j.*;

import java.util.Iterator;

public class traverseRdf {

    public static Document traverseAndModify(Document document){
        Element root = document.getRootElement();
        Element description  = root.elementIterator().next();
        // iterate through one rdf:description
        for (Iterator<Element> it = description.elementIterator(); it.hasNext();) {
            Element element = it.next();
            String elementTag = element.getName();
            // add rdf:parseType = "Resource" to create blankNode
            if (elementTag.equals("semantics") || elementTag.equals("sentics") || elementTag.equals("moodtags") ||
                    elementTag.equals("polarity")) {
                addParseType(element);
                // if it is sentics tag, add resource and remove data type
                if(elementTag.equals("sentics")){
                    addResourcesAndRemoveDataType(element);
                }
                if(elementTag.equals("polarity")){
                    modifyIntensity(element);
                }
            }
        }

        return document;
    }

    private static void addParseType(Element element){
        element.addAttribute("rdf:parseType","Resource");
    }

    private static void addResourcesAndRemoveDataType(Element sentics){
        Iterator<Element> senticsIterator = sentics.elementIterator();
        while(senticsIterator.hasNext()){
            Element element = senticsIterator.next();
            if(element.getName().equals("pleasantness")){
                element.addAttribute("rdf:resource", URIEnum.Pleasantness_URI.getUri()+element.getText());
            }else if(element.getName().equals("attention")){
                element.addAttribute("rdf:resource", URIEnum.Attention_URI.getUri()+element.getText());
            }else if(element.getName().equals("sensitivity")){
                element.addAttribute("rdf:resource", URIEnum.Sensitivity_URI.getUri()+element.getText());
            }else if (element.getName().equals("aptitude")){
                element.addAttribute("rdf:resource", URIEnum.Aptitude_URI.getUri()+element.getText());
            }
            // remove the rdf:datatype attribute, if present (looked up by local name)
            Attribute datatype = element.attribute("datatype");
            if(datatype != null) element.remove(datatype);
            // and clear the literal text
            element.setText("");
        }
    }

    private static void modifyIntensity(Element polarity){
        Iterator<Element> polarityIterator = polarity.elementIterator();
        while(polarityIterator.hasNext()){
            Element element = polarityIterator.next();
            // compare strings with equals(), not ==
            if("intensity".equals(element.getName())){
                element.addAttribute("rdf:resource", URIEnum.Intensity_URI.getUri()+element.getText());
                element.setText("");
                // remove the rdf:datatype attribute, if present (looked up by local name)
                Attribute datatype = element.attribute("datatype");
                if(datatype != null) element.remove(datatype);
            }
        }
    }
}
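The class above handles only the first rdf:Description under the root. As a sketch of how the same idea extends to a whole file, the following visits every description and adds rdf:parseType="Resource" where it is missing. To keep it runnable without extra dependencies it uses the JDK's own DOM API rather than dom4j, and the names AllDescriptions, parse, and addParseTypeEverywhere are hypothetical.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Adds rdf:parseType="Resource" to the container elements of every rdf:Description. */
final class AllDescriptions {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final Set<String> CONTAINERS =
            new HashSet<>(Arrays.asList("semantics", "sentics", "moodtags", "polarity"));

    /** Parses an RDF/XML string into a namespace-aware DOM document. */
    static Document parse(String rdfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(rdfXml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse RDF/XML", e);
        }
    }

    /** Returns how many container elements received a new rdf:parseType attribute. */
    static int addParseTypeEverywhere(Document doc) {
        int changed = 0;
        NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
        for (int i = 0; i < descriptions.getLength(); i++) {
            Node description = descriptions.item(i);
            for (Node n = description.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (!(n instanceof Element) || !CONTAINERS.contains(n.getLocalName())) {
                    continue; // text nodes and non-container properties are left alone
                }
                Element container = (Element) n;
                if (!container.hasAttributeNS(RDF_NS, "parseType")) {
                    container.setAttributeNS(RDF_NS, "rdf:parseType", "Resource");
                    changed++;
                }
            }
        }
        return changed;
    }
}
```

The same loop structure could be applied with dom4j by iterating over root.elementIterator() instead of taking only its first element.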

Main function

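import org.dom4j.Document;
import org.dom4j.DocumentException;
import java.io.IOException;

public class Main {
    public static void main(String[] args) throws DocumentException, IOException {
        // read the original RDF/XML file
        Document document = RdfUtil.readDoc("file_name.xml");
        // add the parseType attributes and turn the float literals into resources
        Document editDocument = traverseRdf.traverseAndModify(document);
        RdfUtil.writeDoc("output_file_name.xml", editDocument);
    }
}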

The automatically generated XML for another concept, “a little hungry”, is as follows. It has the same structure as the manually edited “a little” example.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="http://sentic.net/api/en/concept/a_little_hungry">
        <rdf:type rdf:resource="http://sentic.net/api/concept"/>
        <text xmlns="http://sentic.net">a little hungry</text>
        <semantics xmlns="http://sentic.net" rdf:parseType="Resource">
            <concept rdf:resource="http://sentic.net/api/en/concept/get_full"></concept>
            <concept rdf:resource="http://sentic.net/api/en/concept/hunger_go_away"></concept>
            <concept rdf:resource="http://sentic.net/api/en/concept/feel_full"></concept>
            <concept rdf:resource="http://sentic.net/api/en/concept/hunger"></concept>
            <concept rdf:resource="http://sentic.net/api/en/concept/full"></concept>
        </semantics>
        <sentics xmlns="http://sentic.net" rdf:parseType="Resource">
            <pleasantness rdf:resource="http://sentic.net/pleasantness/0.757"></pleasantness>
            <attention rdf:resource="http://sentic.net/attention/0"></attention>
            <sensitivity rdf:resource="http://sentic.net/sensitivity/0"></sensitivity>
            <aptitude rdf:resource="http://sentic.net/aptitude/0"></aptitude>
        </sentics>
        <moodtags xmlns="http://sentic.net" rdf:parseType="Resource">
            <concept rdf:resource="http://sentic.net/api/en/concept/joy"></concept>
            <concept rdf:resource="http://sentic.net/api/en/concept/joy"></concept>
        </moodtags>
        <polarity xmlns="http://sentic.net" rdf:parseType="Resource">
            <value>positive</value>
            <intensity rdf:resource="http://sentic.net/intensity/0.757"></intensity>
        </polarity>
    </rdf:Description>
</rdf:RDF>

After importing the RDF for this new word into the Neo4j database, we can see that the two words have the same sensitivity and attention values, both equal to zero. Therefore, they are connected to the same resources, with the URIs http://sentic.net/attention/0 and http://sentic.net/sensitivity/0.


The remaining work is to import the whole large-scale RDF file into the Neo4j database.


Author: Liang Tan
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source, Liang Tan.