Loading dbPedia data into a local Virtuoso installation


My next step within a semantic web project is to bring dbPedia data of the types Person, Organization and Place into a local Virtuoso installation.

dbPedia offers its data for download in many different sets and formats. In my case I’m working with version 3.7.

I started with the Person type, where I plan to make it possible for data entered by journalists and reporters to be “automatically” linked to dbPedia data. For example, when the user says a news article talks about “Lula”, I will execute a SPARQL query such as…

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?s ?label
WHERE {
 ?s rdf:type <http://dbpedia.org/ontology/Person> .
 ?s rdfs:label ?label .
 FILTER (REGEX(STR(?label), "lula", "i"))
}
LIMIT 100

… in my local Virtuoso installation. The result of such a query would be presented to the user, who then decides which “Lula” the news article is talking about. The result of the previous query is…

[Screenshot: result of the “lula” SPARQL query]

This way, the person responsible for writing the article makes the decision about whom the article refers to. After that, I will create an owl:sameAs link between my local data and the dbPedia data.
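
As a rough sketch of what that link could look like, the statement below inserts an owl:sameAs triple from isql (the SPARQL keyword lets you run SPARQL from Virtuoso’s SQL prompt). The local graph and resource IRIs here are made-up placeholders, and the exact dbPedia IRI should be taken from the query result the user picked:

-- Sketch only: <http://example.org/news> and the local resource IRI are
-- hypothetical; replace them with the application's own namespace, and use
-- the dbPedia IRI returned by the disambiguation query.
SPARQL
PREFIX owl: <http://www.w3.org/2002/07/owl#>
INSERT INTO GRAPH <http://example.org/news> {
  <http://example.org/resource/Lula>
    owl:sameAs <http://dbpedia.org/resource/Luiz_In%C3%A1cio_Lula_da_Silva> .
};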

Well, before doing this I discovered that loading dbPedia‘s Person data into the Virtuoso installation on my 4GB RAM notebook would be a challenge. As stated in Setting up a local DBpedia mirror with Virtuoso, loading all of dbPedia on an 8-core machine with 32GB of RAM takes about 7 hours!

Leaving aside the question of how long it would take to load the Person data, I had another challenge: how to get the dbPedia data into Virtuoso in the first place. The problem is that the Quad Store Upload in Virtuoso‘s Conductor does not seem able to deal with files containing more than about 51MB of triples. So… how to import the 531MB of triples in the persondata_en.nt file?

First of all I had to split the persondata_en.nt file into chunk files of 100,000 lines each. Since I couldn’t do it with either Notepad++ or Replace Pioneer, I had to resort to Linux’s built-in split program. The command split -l 100000 persondata_en.nt solved my first problem.

The second problem was how to load each 12MB chunk file into Virtuoso. I chose Virtuoso’s Bulk Data Loader. There are two very important things to pay attention to when following the instructions in that documentation.

The first is that there seems to be an error in the load_grdf procedure of the loader script: I had to change the while condition from while (line <> 0) to while (line <> ''). The second is that it was difficult to figure out which folder the chunk files should be placed in. After executing the SQL select server_root (), virtuoso_ini_path (); I discovered that C:\virtuoso-opensource\database was my server root folder, and that was where the chunk files should be placed.
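
For reference, here is a minimal sketch of the load sequence run from isql, assuming the chunk files produced by split (named xaa, xab, …) sit in that server root folder, that the folder is listed under DirsAllowed in virtuoso.ini, and that the target graph IRI is http://dbpedia.org (an assumption; use whatever graph you prefer):

-- register every chunk file in the server root for loading
-- ('x*' matches split's default output names; the graph IRI is an assumption)
ld_dir('C:\virtuoso-opensource\database', 'x*', 'http://dbpedia.org');

-- optionally check what was queued and its state
SELECT ll_file, ll_state FROM DB.DBA.load_list;

-- load all registered files, then persist the result to disk
rdf_loader_run();
checkpoint;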

I started the rdf_loader_run(); command at 7:41PM.

It’s 9:47PM now and there are 8 (out of 41) files remaining. I won’t wait another hour to write more in this post. See you in the next one!

3 comments on “Loading dbPedia data into a local Virtuoso installation”

  1. Marcelo, just be careful when splitting files into chunks. Make sure that no bnodes are split between two different chunks. In other words, if there are bnodes inside the file, the best approach is to first separate the triples with bnodes into one file and all other triples into another file. Then the second file can be split further.
    The problem is that if you put two triples with the same bnode into two different files, Virtuoso will treat them as two different bnodes. Cheers, Simon

    • Hi Simon,

      Would Virtuoso not import 2 triples, each one in its own file, if they refer to the same bnode? Why?

      Thanks,

      Marcelo
