joelit edited this page Nov 26, 2019 · 97 revisions

Notice - Update in progress for the incoming Skosmos 2.2 release - some details may still be outdated/inaccurate

In this tutorial we will install Skosmos on a fresh Ubuntu 18.04 Server machine. The goal is to get Skosmos 2.2 running together with an Apache Jena Fuseki 3.13.1 triple store on the same machine and serving two example vocabularies that are available as SKOS files: STW Thesaurus and UNESCO Thesaurus.

A fresh Ubuntu 18.04.3 Server (amd64) virtual machine with 16GB RAM and up-to-date system packages was created for this purpose. None of the package sets (LAMP server etc.) were installed during the OS install; instead we will install the necessary packages in the steps below.

SSH access to virtual host

You can skip this if you're not using a VirtualBox (or similar) VM environment. For convenience (e.g. copy/paste working in a terminal window) the SSH server package was installed on the virtual machine:

$ sudo apt-get install openssh-server

Since the VM uses NAT networking by default, a port forwarding rule needs to be added to the VirtualBox settings (Network -> Port Forwarding...). The rule is: Name="ssh", Protocol="TCP", Host Port="2222", Guest Port="22", other fields can be left blank. Confirm with OK. This can be done also while the VM is running.

After these operations I can ssh in to the virtual machine from my host using ssh -p 2222 localhost.

Install Apache Jena Fuseki

Install Java 8

First we will need a Java 8 environment (JRE is enough for Fuseki but you can also use a JDK).

$ sudo apt-get update
$ sudo apt-get install openjdk-8-jre-headless

This will install a number of packages and take a while. Verify that Java is installed by running java -version. It should return information about the Java environment. Check that it's Java 8, i.e. version 1.8. If you get another version such as 1.7.0, a different Java version is installed and set as the default. Either remove the other Java packages, or set Java 8 as the default version:

# show the available Java versions
$ sudo update-java-alternatives -l
# set the version to use
$ sudo update-java-alternatives --set java-1.8.0-openjdk-amd64
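
The version check can also be scripted. The following is a minimal sketch, not part of the original setup: the helper names java_major and installed_java_version are made up for this example, and it assumes that java -version prints a quoted version string such as "1.8.0_222" on stderr.

```shell
#!/usr/bin/env bash
# java_major VERSION: print the major Java version for a version string such
# as "1.8.0_222" (legacy numbering, prints 8) or "11.0.4" (prints 11).
java_major() {
    local version=$1
    local first=${version%%.*}        # text before the first dot
    if [ "$first" = "1" ]; then
        local rest=${version#1.}      # drop the leading "1."
        echo "${rest%%.*}"            # the second component is the major version
    else
        echo "$first"
    fi
}

# installed_java_version: extract the quoted version string from the output
# of `java -version`, which is written to stderr.
installed_java_version() {
    java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}
```

For example, [ "$(java_major "$(installed_java_version)")" = "8" ] && echo "Java 8 is the default" prints the message only when Java 8 is the active version.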

Install Fuseki

Fuseki is distributed in a tar.gz archive containing everything. We will download it from https://jena.apache.org/download/ to the user home directory and unpack it under /opt. We will also create a symbolic link (simply called /opt/fuseki) to the current version, which will make it easier to upgrade Fuseki in the future by simply changing the symlink to point to the new version.

$ cd
$ wget http://archive.apache.org/dist/jena/binaries/apache-jena-fuseki-3.13.1.tar.gz
$ cd /opt
$ sudo tar xzf ~/apache-jena-fuseki-3.13.1.tar.gz
$ sudo ln -s apache-jena-fuseki-3.13.1 fuseki

Now check that Fuseki starts up:

$ cd /opt/fuseki/
$ ./fuseki-server --help
$ ./fuseki-server --version

If everything works right, these commands should give information about supported command line options and version information.

Create a Fuseki system user

We want to run Fuseki as a non-root user for better security, so we create a system user called fuseki.

$ sudo adduser --system --home /opt/fuseki --no-create-home fuseki

Create directories for Fuseki configuration and databases

The default Fuseki file system layout is mainly aimed at standalone installs. However, for a server install, following the Filesystem Hierarchy Standard (FHS) layout makes sense as it makes e.g. system backups easier. So we will split the Fuseki files into separate system directories so that we get a layout that at least mostly resembles FHS:

  • Fuseki code (the server distribution) goes into /opt/fuseki, as above (actually a symlink)
  • databases go under /var/lib/fuseki
  • log files go under /var/log/fuseki
  • configuration files go under /etc/fuseki

This needs a bit of manual setting up (unfortunately there's no Debian/Ubuntu package that would do this for us) but it's worth the effort in the long run.

# create the database directories
$ cd /var/lib
$ sudo mkdir -p fuseki/{backups,databases,system,system_files}
$ sudo chown -R fuseki fuseki

# create the log directories
$ cd /var/log
$ sudo mkdir fuseki
$ sudo chown fuseki fuseki

# create the configuration directories
$ cd /etc
$ sudo mkdir fuseki
$ sudo chown fuseki fuseki

# finally create symlinks for databases and logs within the configuration directory
$ cd /etc/fuseki
$ sudo ln -s /var/lib/fuseki/* .
$ sudo ln -s /var/log/fuseki logs
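
To double-check the resulting layout, something like the following sketch could be used. The function name check_fuseki_layout is hypothetical; it only tests that the five entries created above are reachable as directories (directly or through symlinks) under the given configuration directory.

```shell
#!/usr/bin/env bash
# check_fuseki_layout ETC_DIR: report entries missing from the Fuseki
# configuration directory; returns non-zero if anything is missing.
check_fuseki_layout() {
    local base=$1 missing=0 name
    for name in backups databases system system_files logs; do
        if [ ! -d "$base/$name" ]; then   # -d follows symlinks
            echo "missing: $base/$name"
            missing=1
        fi
    done
    return "$missing"
}
```

Running check_fuseki_layout /etc/fuseki prints nothing and succeeds when all symlinks resolve to existing directories.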

Make Fuseki start automatically at boot

We want to have Fuseki always running. To do so, we will need to create and configure a systemd script.

Create the systemd script

To make Fuseki use the above directories, we will create a file /etc/systemd/system/fuseki.service with this content:

[Unit]
Description=Fuseki
[Service]
Environment=FUSEKI_HOME=/opt/fuseki
Environment=FUSEKI_BASE=/etc/fuseki
Environment=JVM_ARGS=-Xmx6G
User=fuseki
ExecStart=/opt/fuseki/fuseki-server
Restart=on-failure
RestartSec=15
[Install]
WantedBy=multi-user.target

The JVM_ARGS line with the -Xmx parameter sets the maximum heap size of the Java Virtual Machine, and thus how much memory Fuseki can use. The default is often too low. The right value depends on the amount of data you have, how you load it, and what else the server is doing, but giving Fuseki around half the available RAM is generally a good starting point. Here we have set it to 6GB. STW Thesaurus and UNESCO Thesaurus are quite small, so in this case the default amount would also suffice.
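
As a rough illustration of the "half the available RAM" rule of thumb, the sketch below derives a heap option from a memory size given in kB, the unit used by /proc/meminfo. The function name suggest_xmx is invented for this example, and the result is only a starting point, not a tuned value.

```shell
#!/usr/bin/env bash
# suggest_xmx MEM_TOTAL_KB: print a JVM heap option equal to half of the
# given amount of memory (in kB), expressed in megabytes.
suggest_xmx() {
    local total_kb=$1
    echo "-Xmx$(( total_kb / 1024 / 2 ))m"
}
```

On the server itself this could be called as suggest_xmx "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)" and the output used in the JVM_ARGS line.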

Check that Fuseki starts up using the systemd script

Now we test that we can start Fuseki using the above systemd script and configuration.

$ sudo systemctl start fuseki

If everything worked fine, the command sudo systemctl status fuseki should show that Fuseki is active and running. If there are problems, check the log file /var/log/fuseki/stderrout.log or run sudo journalctl -xe for more details.

Add Fuseki as a system service

With the systemd script working, we can enable running Fuseki as a system service using the following command:

$ sudo systemctl enable fuseki

This hooks up the necessary symlinks. To make sure, you can verify that it works by rebooting the machine and checking that the Fuseki process exists after booting, for example using the command ps ax|grep fuseki which should list the Java process of Fuseki.
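
Fuseki can take a few seconds to start answering requests after the service starts. A small retry helper, sketched below, can be used to wait for it. The name wait_until is made up for this example. The usage shown after the block assumes the Fuseki /$/ping endpoint on the default port 3030.

```shell
#!/usr/bin/env bash
# wait_until ATTEMPTS DELAY COMMAND...: run COMMAND until it succeeds,
# sleeping DELAY seconds after each failed try; fail after ATTEMPTS tries.
wait_until() {
    local attempts=$1 delay=$2 i
    shift 2
    for i in $(seq 1 "$attempts"); do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}
```

For example, wait_until 10 3 curl -fsS 'http://localhost:3030/$/ping' && echo "Fuseki is up" tries up to ten times, three seconds apart.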

Create and load vocabularies database

Create and configure a database and text index

There are two ways of creating the Fuseki database: using the web interface, or from the command line.

A. Creating the database using the Fuseki web interface

We can open a browser on http://localhost:3030/ to access the Fuseki web user interface.

Note that if you are running Fuseki within a VirtualBox VM and want to use the browser from the host machine, a port forwarding rule needs to be added to the VirtualBox settings (Network -> Port Forwarding...). The rule is: Name="fuseki", Protocol="TCP", Host Port="3030", Guest Port="3030", other fields can be left blank. Confirm with OK. This can also be done while the VM is running. You will also need to tell Fuseki to allow management operations for non-localhost access by uncommenting the line /$/** = anon in the security configuration /etc/fuseki/shiro.ini, so that it applies instead of the localhost-only rule, and restarting Fuseki. Note that this is potentially dangerous if you open up Fuseki URLs to the world, since anyone will then be able to manage your datasets.

Use the user interface to create a dataset with these options:

  • name: skosmos
  • type: persistent (TDB2)

This creates Jena TDB2 database under the directory /var/lib/fuseki/databases/skosmos and its configuration file /etc/fuseki/configuration/skosmos.ttl.

B. Creating the database from the command line

Fuseki2 has an administrative protocol that we can use to create the dataset using e.g. the curl command line tool:

curl --data "dbName=skosmos&dbType=tdb2" 'http://localhost:3030/$/datasets'

If you get no error, the operation was successful. To verify, you can check that the directory /var/lib/fuseki/databases/skosmos/ exists.

Creating a text index

The newly created dataset doesn't have a text index. Before we load any data, we should create a text index.

First we need to shut down Fuseki temporarily:

$ sudo service fuseki stop

Then we edit the database configuration file /etc/fuseki/configuration/skosmos.ttl to look like this:

@prefix :      <http://base/#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix tdb2:  <http://jena.apache.org/2016/tdb#> .
@prefix ja:    <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix text:  <http://jena.apache.org/text#> .
@prefix skos:  <http://www.w3.org/2004/02/skos/core#> .

ja:DatasetTxnMem  rdfs:subClassOf  ja:RDFDataset .
ja:MemoryDataset  rdfs:subClassOf  ja:RDFDataset .
ja:RDFDatasetOne  rdfs:subClassOf  ja:RDFDataset .
ja:RDFDatasetSink  rdfs:subClassOf  ja:RDFDataset .
ja:RDFDatasetZero  rdfs:subClassOf  ja:RDFDataset .

tdb2:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb2:DatasetTDB2  rdfs:subClassOf  ja:RDFDataset .

tdb2:GraphTDB  rdfs:subClassOf  ja:Model .
tdb2:GraphTDB2  rdfs:subClassOf  ja:Model .

<http://jena.hpl.hp.com/2008/tdb#DatasetTDB>
    rdfs:subClassOf  ja:RDFDataset .

<http://jena.hpl.hp.com/2008/tdb#GraphTDB>
    rdfs:subClassOf  ja:Model .

text:TextDataset
    rdfs:subClassOf  ja:RDFDataset .

:service_tdb_all  a               fuseki:Service ;
    rdfs:label                    "TDB2+text skosmos" ;
    fuseki:dataset                :text_dataset ;
    fuseki:name                   "skosmos" ;
    fuseki:serviceQuery           "query" , "" , "sparql" ;
    fuseki:serviceReadGraphStore  "get" ;
    fuseki:serviceReadQuads       "" ;
    fuseki:serviceReadWriteGraphStore "data" ;
    fuseki:serviceReadWriteQuads  "" ;
    fuseki:serviceUpdate          "" , "update" ;
    fuseki:serviceUpload          "upload" .

:text_dataset a text:TextDataset ;
    text:dataset :tdb_dataset_readwrite ;
    text:index :index_lucene . 

:tdb_dataset_readwrite
    a tdb2:DatasetTDB2 ;
    # tdb2:unionDefaultGraph true ;
    tdb2:location  "/etc/fuseki/databases/skosmos" .

:index_lucene a text:TextIndexLucene ;
    text:directory <file:/etc/fuseki/databases/skosmos/text> ;
    text:entityMap :entity_map ;
    text:storeValues true .

# Text index configuration for Skosmos
:entity_map a text:EntityMap ;
    text:entityField      "uri" ;
    text:graphField       "graph" ;
    text:defaultField     "pref" ;
    text:uidField         "uid" ;
    text:langField        "lang" ;
    text:map (
         # skos:prefLabel
         [ text:field "pref" ;
           text:predicate skos:prefLabel ;
           text:analyzer [ a text:LowerCaseKeywordAnalyzer ]
         ]
         # skos:altLabel
         [ text:field "alt" ;
           text:predicate skos:altLabel ;
           text:analyzer [ a text:LowerCaseKeywordAnalyzer ]
         ]
         # skos:hiddenLabel
         [ text:field "hidden" ;
           text:predicate skos:hiddenLabel ;
           text:analyzer [ a text:LowerCaseKeywordAnalyzer ]
         ]
     ) . 

Now start Fuseki again:

$ sudo service fuseki start

If everything went well this will create a jena-text Lucene index under /var/lib/fuseki/databases/skosmos/text i.e. as a subdirectory of the TDB database to which it is linked.

Load data

With the database and text index now ready, we can load the vocabulary data. Again this can be done either using the Fuseki web interface, or via the command line.

First we need to download the example datasets, i.e. STW Thesaurus and UNESCO Thesaurus, as Turtle files (Fuseki also accepts other RDF syntaxes). The STW Thesaurus download additionally needs to be uncompressed with unzip stw.ttl.zip; you may need to install the unzip tool first using the command sudo apt-get install unzip.

A. Loading data using the Fuseki web interface

Go to the Fuseki web interface again, open the "Dataset" tab and click on "upload files".

  • For STW Thesaurus, enter the graph name http://zbw.eu/stw/, select the file stw.ttl and click on "upload now".
  • For UNESCO Thesaurus, enter the graph name http://skos.um.es/unescothes/, select the file unescothes.ttl and click on "upload now".

The graph names may be arbitrary URIs (here we use the URI namespaces as graph names) but they must match the Skosmos configuration later on.

To be sure that the uploads went well, you can open the "info" tab and click on "count triples in all graphs". It should show that the default graph is empty (0 triples) and the two other graphs should have around 109,000 and 75,000 triples, respectively.

B. Loading data using the command line

Instead of the web interface, we can use the command line tool s-put that comes with Fuseki to load data. However, this tool requires a Ruby interpreter, so you may need to install it first:

sudo apt-get install ruby

Then you can use s-put to load data like this:

/opt/fuseki/bin/s-put http://localhost:3030/skosmos/data http://zbw.eu/stw/ stw.ttl
/opt/fuseki/bin/s-put http://localhost:3030/skosmos/data http://skos.um.es/unescothes/ unescothes.ttl

If you get no error message, the operations were successful. You can verify by checking the size of the database: the command du -sh /var/lib/fuseki/databases/skosmos/ should show that the database is about 250 MB.
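
The triple counts per graph can also be checked from the command line with a SPARQL query. The following is a sketch under the assumption that the dataset is reachable at the query endpoint used in this tutorial. The function names graph_count_query and count_triples are invented for the example.

```shell
#!/usr/bin/env bash
# graph_count_query: print a SPARQL query that counts triples per named graph.
graph_count_query() {
    cat <<'EOF'
SELECT ?g (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g
ORDER BY ?g
EOF
}

# count_triples ENDPOINT: send the query to a SPARQL endpoint as a
# form-encoded POST and print the result as CSV.
count_triples() {
    curl -sS "$1" \
         -H 'Accept: text/csv' \
         --data-urlencode "query=$(graph_count_query)"
}
```

Calling count_triples http://localhost:3030/skosmos/sparql should list one row per loaded graph with its triple count.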

Congratulations, now your database is ready!

Install Skosmos and its requirements

Install Apache and PHP

Start by installing Apache and PHP7.

$ sudo apt-get install apache2 libapache2-mod-php7.2
$ sudo apt-get install php7.2 php7.2-curl php7.2-xsl php7.2-intl php7.2-mbstring 

After this you should verify that Apache is running by pointing your web browser at http://localhost/. It should show the Apache default page. If not, you should be able to start it with sudo service apache2 start and to set it to start at boot with sudo systemctl enable apache2. Additionally, before continuing, set the timezone declaration for PHP if it is not already set: open /etc/php/7.2/cli/conf.d/timezone.ini and add a line like date.timezone=$YOUR_TIMEZONE, e.g. date.timezone="Europe/Helsinki". Note that the cli directory only affects command-line PHP; for PHP running under Apache, the corresponding directory is /etc/php/7.2/apache2/conf.d/. Remember to save the file, then restart the Apache server for this setting to take effect.

Note that if you are running Apache within a VirtualBox VM and want to use the browser from the host machine, a port forwarding rule needs to be added to the VirtualBox settings (Network -> Port Forwarding...). The rule is: Name="apache", Protocol="TCP", Host Port="8000", Guest Port="80", other fields can be left blank. Confirm with OK. This can also be done while the VM is running. Then you can open http://localhost:8000/ from the browser on the host machine.

Configure Apache for Skosmos

Then you'll need to allow setting options in directory-specific .htaccess files by editing the apache configuration file in '/etc/apache2/sites-enabled/000-default.conf'. Inside that file, you will find the <VirtualHost *:80> block on line 1. Inside that block, add the following block:

        <Directory /var/www/html>
                Options Indexes FollowSymLinks MultiViews
                AllowOverride All
                Require all granted
        </Directory>

You should also enable the Apache modules mod_rewrite and mod_expires since Skosmos requires those to work.

$ sudo a2enmod rewrite
$ sudo a2enmod expires

After these changes you can restart Apache and the installation should be ready for running Skosmos.

$ sudo service apache2 restart
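
To confirm that both modules are actually loaded after the restart, the module list printed by apache2ctl -M can be checked. The helper below is only a sketch. The name missing_modules is made up, and it assumes the usual output format with lines such as " rewrite_module (shared)".

```shell
#!/usr/bin/env bash
# missing_modules LIST MODULE...: print each MODULE that does not appear in
# LIST, where LIST is the text printed by `apache2ctl -M`.
missing_modules() {
    local list=$1 mod
    shift
    for mod in "$@"; do
        printf '%s\n' "$list" | grep -q "^ *${mod}_module " || echo "$mod"
    done
}
```

For example, missing_modules "$(sudo apache2ctl -M 2>/dev/null)" rewrite expires prints nothing when both modules are enabled.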

Install Skosmos

Start by cloning the Skosmos repository to your Apache DocumentRoot. To be able to do this we'll need to install the git client.

$ sudo apt-get install git

Then we can clone the Skosmos 2.2 code from GitHub:

$ cd /var/www/html/
$ sudo git clone -b v2.2-maintenance https://github.com/NatLibFi/Skosmos.git

After git has finished cloning the repository enter it and download and install Composer for managing the library dependencies.

$ cd /var/www/html/Skosmos/
$ curl -sS https://getcomposer.org/installer | sudo php

After you have downloaded and installed Composer you can install the dependencies required to run Skosmos. If you wish to do some software development with your Skosmos installation you should omit the "--no-dev" part. Then you'll be able to run the unit tests and update the gettext translations.

$ sudo php composer.phar install --no-dev

Configure Skosmos

After installing the dependencies you need to configure the Skosmos installation. You can start by copying the default configuration files and using those as a basis for building your own configuration file.

$ sudo cp config.ttl.dist config.ttl

Let's start by enabling the fuseki text index we created earlier.

$ sudo nano config.ttl

We'll make the following changes to the configuration:

  1. Set the default SPARQL endpoint to the local Fuseki and the skosmos dataset
  2. Set the default SPARQL dialect to "JenaText" to use the jena-text index
  3. Add German translation to the UI languages

Please note that when the Turtle shorthand syntax for predicate lists is used, each triple must end with ; instead of . unless it is the last one for the common subject, as per the Turtle specification.

Add triple
:config skosmos:sparqlEndpoint <http://localhost:3030/skosmos/sparql> .
and comment out the other skosmos:sparqlEndpoint declarations for :config.

Change the triple
:config skosmos:sparqlDialect "Generic" .
into
:config skosmos:sparqlDialect "JenaText" .

Set the interface languages to English and German:
# interface languages available, and the corresponding system locales (you may remove Finnish and Swedish)
:config skosmos:languages (
    [ rdfs:label "en" ; rdf:value "en_GB.utf8" ]
    [ rdfs:label "de" ; rdf:value "de_DE.utf8" ]
  ) .

Your machine may not have English and/or German locales installed, which are necessary for the Skosmos UI translations to work. To generate the locales as well as to ensure that the preliminaries exist, run these commands:

sudo apt-get install gettext
sudo apt-get install php-gettext
sudo locale-gen en_GB.utf8
sudo locale-gen de_DE.utf8

Restart Apache for these changes to take effect.

Next we will add vocabulary definitions and configurations for STW and UNESCO Thesaurus so that Skosmos knows to look for the vocabularies from our Fuseki SPARQL endpoint. Add these blocks of code after the #Skosmos vocabularies line in the config.ttl file.

:unesco a skosmos:Vocabulary, void:Dataset ;
    dc:title "UNESCO Thesaurus"@en ;
    skosmos:shortName "UNESCO";
    dc:subject :cat_general ;
    void:uriSpace "http://skos.um.es/unescothes/";
    skosmos:language "en", "es", "fr", "ru";
    skosmos:defaultLanguage "en";
    skosmos:showTopConcepts true ;
    skosmos:groupClass isothes:ConceptGroup ;
    void:sparqlEndpoint <http://localhost:3030/skosmos/sparql> ;
    skosmos:sparqlGraph <http://skos.um.es/unescothes/> .
 
:stw a skosmos:Vocabulary, void:Dataset ;
    dc:title "STW Thesaurus for Economics"@en ;
    skosmos:shortName "STW";
    dc:subject :cat_general ;
    void:uriSpace "http://zbw.eu/stw/";
    skosmos:language "en", "de";
    skosmos:defaultLanguage "de";
    void:sparqlEndpoint <http://localhost:3030/skosmos/sparql> ;
    skosmos:sparqlGraph <http://zbw.eu/stw/> .

You can remove the ysa and yso example vocabulary definitions, even though they should point to a separate SPARQL endpoint and work out-of-the-box.

Now you should be able to see the STW and UNESCO thesauri on the Skosmos front page. Point your browser to http://localhost/Skosmos/ (or http://localhost:8000/Skosmos/ from the host machine) and verify that you can see and open the vocabulary front pages. Replace localhost with your server's IP address if you're not doing this locally.

Optimizing performance

Now that basic Skosmos functionality is working, we can try to make it faster. But first we need to benchmark how well it performs so that we know that we are making progress.

Measure response time

To measure response time, we will use the simple Apache benchmark tool ab, which needs to be installed first:

sudo apt-get install apache2-utils

Best practice would be to run the benchmarking tool from another machine, but since we are only interested in relative performance, and ab is very lightweight, we can also just run it from the same machine.

For simplicity's sake we will just measure two operations: 1) how long it takes to generate a web page for a single concept - we'll pick the concept "Culture" from the UNESCO Thesaurus - and 2) how long it takes to generate the front page of the STW Thesaurus with the alphabetical index. These commands will load those pages 100 times:

ab -n 100 http://localhost/Skosmos/unesco/en/page/C00926
ab -n 100 http://localhost/Skosmos/stw/en/index

ab will report many figures, but let's just concentrate on the "Requests per second" value. On my example virtual machine, after running this several times, the reported numbers stabilize around 12 and 2.5, respectively. Not too bad, but could be improved!
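
When repeating the measurement several times, it helps to pull out just that one figure. The following sketch filters ab output with awk. The function name requests_per_second is made up for this example, and it assumes the usual ab output line of the form "Requests per second:    12.34 [#/sec] (mean)".

```shell
#!/usr/bin/env bash
# requests_per_second: read `ab` output on stdin and print the mean
# "Requests per second" figure.
requests_per_second() {
    awk '/^Requests per second:/ { print $4 }'
}
```

For example: ab -n 100 http://localhost/Skosmos/stw/en/index | requests_per_second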

Install APC

The first optimization step is to install the APC cache for PHP. Skosmos uses APC for caching the vocabulary configuration file since the Turtle parsing can be quite slow when you have many vocabularies in your configuration file. APC is also used for caching queries made to resources other than the Fuseki instance. This alone can considerably speed up your Skosmos page load times.

$ sudo apt-get install php-apcu
$ sudo service apache2 restart

Now we can measure the performance again using ab. On my machine, the requests per second increased to about 17 and 2.8, i.e. roughly 40% and 12% faster, respectively. Not bad for just installing an additional package. With a larger number of vocabularies (and thus a larger config.ttl file), the improvement would have been even larger.
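
The relative improvement between two measurements is (after - before) / before. A small helper for this arithmetic could look like the sketch below. The name speedup_pct is invented for the example, and the result is rounded to a whole percent.

```shell
#!/usr/bin/env bash
# speedup_pct BEFORE AFTER: print the relative improvement in percent,
# rounded to a whole number.
speedup_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.0f\n", (after - before) / before * 100 }'
}
```

It can be applied to the figures reported by ab, e.g. speedup_pct 12 17 for the concept page and speedup_pct 2.5 2.8 for the index page.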

Install Varnish

Another way to speed up Skosmos is to add a HTTP proxy cache in front of Fuseki. The cache will store answers to recurring SPARQL queries and answer them much faster than Fuseki could. Many of the SPARQL queries that Skosmos performs will be repeated many times, so this will speed up Skosmos. However, it doesn't improve worst case response times.

We will first install the Varnish package. This will install Varnish 5.2.1-1:

sudo apt-get install varnish

By default in Ubuntu 18.04, Varnish will listen on TCP port 6081. It will use a non-persistent in-memory cache of 256MB to store HTTP responses. This is fine for the purposes of this example. Note that on a systemd-based release such as Ubuntu 18.04 these settings come from the ExecStart line of the varnish.service unit (which can be overridden with sudo systemctl edit --full varnish) rather than from the /etc/default/varnish file. In particular, the amount of memory allocated to the cache could be increased to improve the cache hit rate if you have lots of vocabulary data.

The Varnish back-end configuration needs to be changed. It must be told to access Fuseki instead of some other web server. Additionally we will ask Varnish to store responses for up to one week (instead of the default 2 minutes) in a compressed form, which will allow many more responses to be stored in the cache, at the cost of some CPU time for compressing and uncompressing. Edit the /etc/varnish/default.vcl to look like this:

vcl 4.0;

backend default {
    .host = "127.0.0.1";
    .port = "3030";
}

sub vcl_backend_response {
    # store for a long time (1 week)
    set beresp.ttl = 1w;
    # always gzip before storing, to save space in the cache
    set beresp.do_gzip = true;
}

Then restart Varnish:

sudo service varnish restart

Note that since the cache is non-persistent, you can always clear the cache simply by restarting Varnish, for example if you update your vocabulary data.
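
To see whether a given response came from the cache, the X-Varnish response header can be inspected: Varnish puts two transaction IDs in it on a cache hit and one otherwise. The helper below is a sketch based on that convention, and the name varnish_cache_status is made up for this example.

```shell
#!/usr/bin/env bash
# varnish_cache_status: read HTTP response headers on stdin and print "hit"
# when the X-Varnish header carries two transaction IDs, otherwise "miss".
varnish_cache_status() {
    tr -d '\r' | awk '
        tolower($1) == "x-varnish:" { print (NF >= 3 ? "hit" : "miss"); found = 1 }
        END { if (!found) print "no X-Varnish header" }'
}
```

For example, running curl -s -o /dev/null -D - -G http://localhost:6081/skosmos/sparql --data-urlencode 'query=ASK { ?s ?p ?o }' | varnish_cache_status twice would be expected to report a miss first and then a hit, provided the response is cacheable.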

Now we need to tell Skosmos to access the Fuseki SPARQL endpoint via Varnish instead of directly. To do this, we will change references to 3030 (the Fuseki port) to 6081 (the Varnish port).

In config.ttl:

For # Skosmos main configuration:

:config a skosmos:Configuration ;
    skosmos:sparqlEndpoint <http://localhost:6081/skosmos/sparql> ;

For both stw and unesco:

    void:sparqlEndpoint <http://localhost:6081/skosmos/sparql> ;

Now we can measure performance again using ab. This time the result is about 19 requests per second for the concept page and 16 requests per second for the STW index page. For concepts the speedup is about 10% over just using APC (the reason it's not more is that the main SPARQL query for concept information is a POST request, which cannot be cached). The performance of the alphabetical index improved more than fivefold!

Conclusion

In this tutorial we walked through installing Fuseki and Skosmos on an Ubuntu 18.04 server and also optimized its performance. After having set up a basic Skosmos installation this way, we could add more vocabularies (see https://github.com/NatLibFi/Skosmos/wiki/Vocabularies), configure the text index to fine tune search behavior, or configure Skosmos to behave differently or look different.


Apache Jena and associated module names are trademarks of the Apache Software Foundation.