InstallTutorial
In this tutorial we will install Skosmos on an Ubuntu or Rocky Linux Server machine. The goal is to get Skosmos 2.18 running together with an Apache Jena Fuseki 4.6.1 triple store on the same machine and serving two example vocabularies that are available as SKOS files: STW Thesaurus and UNESCO Thesaurus.
You can skip this if you're not using a VirtualBox (or similar) VM environment. For convenience (e.g. so that copy/paste works in a terminal window), install the SSH server package on the virtual machine:
$ sudo apt install openssh-server
Since the VM uses NAT networking by default, a port forwarding rule needs to be added to the VirtualBox settings (Network -> Port Forwarding...). The rule is: Name="ssh", Protocol="TCP", Host Port="2222", Guest Port="22"; other fields can be left blank. Confirm with OK. This can also be done while the VM is running.
After these operations I can SSH into the virtual machine from my host using ssh -p 2222 localhost.
The following installation is also bundled in an unofficial Debian package.
First we will need a Java 11 environment (JRE is enough for Fuseki but you can also use a JDK).
$ sudo apt update
$ sudo apt install default-jre-headless
This will install a bunch of packages and take a while. Verify that Java is installed by running java -version. It should return information about the Java environment. Check that it's Java 11, i.e. version 11.0.something.
If you get another version such as 1.8.0, it means you still have an older Java installed. Either remove the older Java packages, or set Java 11 as the default version:
# show the available Java versions
$ sudo update-java-alternatives -l
# set the version to use in Ubuntu
$ sudo update-java-alternatives --set java-1.11.0-openjdk-amd64
# set the version to use in Rocky Linux
$ sudo update-java-alternatives --set jre-11-openjdk
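As a rough sketch of the version check described above, the snippet below reduces a Java version string to its major version, so that both the legacy 1.8.0 scheme and the 11.0.x scheme can be compared against 11. The helper name java_major is hypothetical, and the awk pattern assumes that java -version prints the version in double quotes on its first matching line.

```shell
# Hypothetical helper: print the major version for a Java version string.
# "1.8.0_352" -> 8 (legacy scheme), "11.0.17" -> 11 (current scheme).
java_major() {
  case "$1" in
    1.*) v=${1#1.}; echo "${v%%.*}" ;;   # legacy: drop the leading "1."
    *)   echo "${1%%[.+-]*}" ;;          # current: keep text before first . + or -
  esac
}

# Only query the installed Java if one is available on the PATH.
if command -v java >/dev/null 2>&1; then
  ver=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
  echo "Java major version: $(java_major "$ver")"
fi
```

If the reported major version is not 11, switch the default as shown in the commands above.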
Fuseki is distributed in a tar.gz archive containing everything. We will download it from apache.org to the user home directory and unpack it under /opt. We will also create a symbolic link (simply called /opt/fuseki) to the current version, which will make it easier to upgrade Fuseki in the future by simply changing the symlink to point to the new version.
$ cd ~
$ wget https://archive.apache.org/dist/jena/binaries/apache-jena-fuseki-4.6.1.tar.gz
$ cd /opt
$ sudo tar xzf ~/apache-jena-fuseki-4.6.1.tar.gz
$ sudo ln -s apache-jena-fuseki-4.6.1 fuseki
Now check that Fuseki starts up:
$ cd /opt/fuseki/
$ ./fuseki-server --help
$ ./fuseki-server --version
If everything works right, these commands should give information about supported command line options and version information.
We want to run Fuseki as a non-root user for better security, so we create a system user called fuseki.
$ sudo adduser --system --home /opt/fuseki --no-create-home fuseki
The default Fuseki file system layout is mainly aimed at standalone installs. However, for a server install, following the Filesystem Hierarchy Standard (FHS) layout makes sense as it makes e.g. system backups easier. So we will split the Fuseki files into separate system directories so that we get a layout that at least mostly resembles FHS:
- Fuseki code (the server distribution) goes into /opt/fuseki, as above (actually a symlink)
- databases go under /var/lib/fuseki
- log files go under /var/log/fuseki
- configuration files go under /etc/fuseki
This needs a bit of manual setting up but it's worth the effort in the long run.
# create the database directories
$ cd /var/lib
$ sudo mkdir -p fuseki/{backups,databases,system,system_files}
$ sudo chown -R fuseki fuseki
# create the log directories
$ cd /var/log
$ sudo mkdir fuseki
$ sudo chown fuseki fuseki
# create the configuration directories
$ cd /etc
$ sudo mkdir fuseki
$ sudo chown fuseki fuseki
# finally create symlinks for databases and logs within the configuration directory
$ cd /etc/fuseki
$ sudo ln -s /var/lib/fuseki/* .
$ sudo ln -s /var/log/fuseki logs
We want to have Fuseki always running. To do so, we will need to create and configure a systemd script.
To make Fuseki use the above directories, we will create a file /etc/systemd/system/fuseki.service with this content:
[Unit]
Description=Fuseki
[Service]
Environment=FUSEKI_HOME=/opt/fuseki
Environment=FUSEKI_BASE=/etc/fuseki
Environment=JVM_ARGS=-Xmx4G
User=fuseki
ExecStart=/opt/fuseki/fuseki-server
Restart=on-failure
RestartSec=15
[Install]
WantedBy=multi-user.target
The JVM_ARGS line with the -Xmx parameter sets the maximum heap size of the Java Virtual Machine, and thus the memory available to Fuseki. The default is often too low. The right value depends on the amount of data you have, how you load it, and what else the server is doing, but giving Fuseki around half the available RAM is generally a good starting point. Here we have set it to 4 GB. STW Thesaurus and UNESCO Thesaurus are quite small, so the default amount would suffice in this case.
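The "half the available RAM" rule of thumb can be turned into a concrete value with a few lines of shell. This is only a sketch: half_mem_mb is a hypothetical helper, and it assumes a Linux system where /proc/meminfo reports MemTotal in kB.

```shell
# Hypothetical helper: given total memory in kB, print half of it in MB.
half_mem_mb() {
  echo $(( $1 / 1024 / 2 ))
}

# On Linux, read the total memory and print a matching unit-file line.
if [ -r /proc/meminfo ]; then
  total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  echo "Environment=JVM_ARGS=-Xmx$(half_mem_mb "$total_kb")m"
fi
```

The printed line can replace the JVM_ARGS line in fuseki.service; treat the result as a starting point rather than a tuned value.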
Now we test that we can start Fuseki using the above systemd script and configuration.
$ sudo systemctl start fuseki
If everything worked, the command sudo systemctl status fuseki will show that Fuseki was started and is running. If there are problems, check the log file /var/log/fuseki/stderrout.log or run sudo journalctl -xe for more details.
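Because the JVM takes a few seconds to start, a script that starts Fuseki and immediately talks to it may fail. The sketch below shows one way to poll until a command succeeds; wait_for is a hypothetical helper, and the example assumes Fuseki's ping endpoint is reachable on the default port 3030.

```shell
# Hypothetical helper: run a command up to N times, pausing between
# attempts, and succeed as soon as the command succeeds.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (requires a running Fuseki):
#   wait_for 10 2 curl -sf -o /dev/null 'http://localhost:3030/$/ping'
```

The URL is single-quoted so that the shell never tries to interpret the $ character.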
With the systemd script working, we can enable running Fuseki as a system service using the following command:
$ sudo systemctl enable fuseki
This hooks up the necessary symlinks. You can verify that it works by rebooting the machine and checking that the Fuseki process exists after booting, for example using the command ps ax|grep fuseki, which should list the Java process of Fuseki.
There are two ways of creating the Fuseki database: using the web interface, or from the command line.
We can open a browser on http://localhost:3030/ to access the Fuseki web user interface.
Note that if you are running Fuseki within a VirtualBox VM and want to use the browser from the host machine, a port forwarding rule needs to be added to the VirtualBox settings (Network -> Port Forwarding...). The rule is: Name="fuseki", Protocol="TCP", Host Port="3030", Guest Port="3030"; other fields can be left blank. Confirm with OK. This can also be done while the VM is running. You will also need to tell Fuseki to allow management operations for non-localhost access by commenting out the line /$/** = localhostFilter in the security configuration /etc/fuseki/shiro.ini and restarting Fuseki. Note that this is potentially dangerous if you open up Fuseki URLs to the world, since anyone will then be able to manage your datasets.
Use the user interface to create a dataset with these options:
- name:
skosmos
- type:
persistent (TDB2)
This creates a Jena TDB2 database under the directory /var/lib/fuseki/databases/skosmos and its configuration file /etc/fuseki/configuration/skosmos.ttl.
Fuseki2 has an administration protocol that we can use to create the dataset using e.g. the curl command line tool:
curl --data "dbName=skosmos&dbType=tdb2" 'http://localhost:3030/$/datasets'
If you get no error, the operation was successful. To verify, you can check that the directory /var/lib/fuseki/databases/skosmos/ exists.
The newly created dataset doesn't have a text index. Before we load any data, we should create a text index.
First we need to shut down Fuseki temporarily:
$ sudo service fuseki stop
Then we edit the database configuration file /etc/fuseki/configuration/skosmos.ttl to look like this:
@prefix : <http://base/#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix tdb2: <http://jena.apache.org/2016/tdb#> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix text: <http://jena.apache.org/text#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
ja:DatasetTxnMem rdfs:subClassOf ja:RDFDataset .
ja:MemoryDataset rdfs:subClassOf ja:RDFDataset .
ja:RDFDatasetOne rdfs:subClassOf ja:RDFDataset .
ja:RDFDatasetSink rdfs:subClassOf ja:RDFDataset .
ja:RDFDatasetZero rdfs:subClassOf ja:RDFDataset .
tdb2:DatasetTDB rdfs:subClassOf ja:RDFDataset .
tdb2:DatasetTDB2 rdfs:subClassOf ja:RDFDataset .
tdb2:GraphTDB rdfs:subClassOf ja:Model .
tdb2:GraphTDB2 rdfs:subClassOf ja:Model .
<http://jena.hpl.hp.com/2008/tdb#DatasetTDB>
rdfs:subClassOf ja:RDFDataset .
<http://jena.hpl.hp.com/2008/tdb#GraphTDB>
rdfs:subClassOf ja:Model .
text:TextDataset
rdfs:subClassOf ja:RDFDataset .
:service_tdb_all a fuseki:Service ;
rdfs:label "TDB2+text skosmos" ;
fuseki:dataset :text_dataset ;
fuseki:name "skosmos" ;
fuseki:serviceQuery "query" , "" , "sparql" ;
fuseki:serviceReadGraphStore "get" ;
fuseki:serviceReadQuads "" ;
fuseki:serviceReadWriteGraphStore "data" ;
fuseki:serviceReadWriteQuads "" ;
fuseki:serviceUpdate "" , "update" ;
fuseki:serviceUpload "upload" .
:text_dataset a text:TextDataset ;
text:dataset :tdb_dataset_readwrite ;
text:index :index_lucene .
:tdb_dataset_readwrite
a tdb2:DatasetTDB2 ;
# tdb2:unionDefaultGraph true ;
tdb2:location "/etc/fuseki/databases/skosmos" .
:index_lucene a text:TextIndexLucene ;
text:directory <file:/etc/fuseki/databases/skosmos/text> ;
text:entityMap :entity_map ;
text:storeValues true .
# Text index configuration for Skosmos
:entity_map a text:EntityMap ;
text:entityField "uri" ;
text:graphField "graph" ;
text:defaultField "pref" ;
text:uidField "uid" ;
text:langField "lang" ;
text:map (
# skos:prefLabel
[ text:field "pref" ;
text:predicate skos:prefLabel ;
text:analyzer [ a text:LowerCaseKeywordAnalyzer ]
]
# skos:altLabel
[ text:field "alt" ;
text:predicate skos:altLabel ;
text:analyzer [ a text:LowerCaseKeywordAnalyzer ]
]
# skos:hiddenLabel
[ text:field "hidden" ;
text:predicate skos:hiddenLabel ;
text:analyzer [ a text:LowerCaseKeywordAnalyzer ]
]
# skos:notation
[ text:field "notation" ;
text:predicate skos:notation ;
text:analyzer [ a text:LowerCaseKeywordAnalyzer ]
]
) .
Now start Fuseki again:
$ sudo service fuseki start
If everything went well, this will create a jena-text Lucene index under /var/lib/fuseki/databases/skosmos/text, i.e. as a subdirectory of the TDB database to which it is linked.
With the database and text index now ready, we can load the vocabulary data. Again this can be done either using the Fuseki web interface, or via the command line.
First we need to download the example datasets, i.e. STW Thesaurus and UNESCO Thesaurus (these links are to Turtle downloads, though Fuseki also accepts other RDF syntaxes). The STW Thesaurus additionally needs to be uncompressed with unzip stw.ttl.zip; you may need to install the unzip tool first using the command sudo apt install unzip.
Go to the Fuseki web interface again, open the "Dataset" tab and click on "upload files".
- For STW Thesaurus, enter the graph name http://zbw.eu/stw/, select the file stw.ttl and click on "upload now".
- For UNESCO Thesaurus, enter the graph name http://skos.um.es/unescothes/, select the file unescothes.ttl and click on "upload now".
The graph names may be arbitrary URIs (here we use the URI namespaces as graph names) but they must match the Skosmos configuration later on.
To be sure that the uploads went well, you can open the "info" tab and click on "count triples in all graphs". It should show that the default graph is empty (0 triples) and the two other graphs should have around 109,000 and 75,000 triples, respectively.
Instead of the web interface, we can use the command line tool s-put that comes with Fuseki to load data. However, this tool requires a Ruby interpreter, so you may need to install it first:
sudo apt install ruby
Then you can use s-put to load data like this:
/opt/fuseki/bin/s-put http://localhost:3030/skosmos/data http://zbw.eu/stw/ stw.ttl
/opt/fuseki/bin/s-put http://localhost:3030/skosmos/data http://skos.um.es/unescothes/ unescothes.ttl
If you get no error message, the operations were successful. You can verify by checking the size of the database: the command du -sh /var/lib/fuseki/databases/skosmos/ should show that the database is about 250 MB.
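The per-graph triple counts can also be checked from the command line with a SPARQL query instead of the web interface. The following is a minimal sketch: graph_count_query and count_triples are hypothetical helper names, and the endpoint URL assumes the skosmos dataset created earlier.

```shell
# Print a SPARQL query that counts the triples in each named graph.
graph_count_query() {
  printf '%s\n' 'SELECT ?g (COUNT(*) AS ?triples) WHERE { GRAPH ?g { ?s ?p ?o } } GROUP BY ?g'
}

# Hypothetical helper: POST the query to a SPARQL endpoint, print CSV results.
count_triples() {
  curl -s -H 'Accept: text/csv' \
       --data-urlencode "query=$(graph_count_query)" "$1"
}

# Example (requires a running Fuseki with the data loaded):
#   count_triples http://localhost:3030/skosmos/sparql
```

The output should list one row per graph, which can be compared against the approximate counts mentioned for the web interface.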
Congratulations, now your database is ready!
Start by installing Apache and PHP7.
$ sudo apt install apache2 libapache2-mod-php7.4 php7.4 php7.4-xsl php7.4-intl php7.4-mbstring php7.4-curl
After this you should verify that Apache is running by pointing your web browser at http://localhost/. It should show the Apache default page. If not, you should be able to start it with sudo service apache2 start and set it to start at boot with sudo systemctl enable apache2. In case you are using Windows Subsystem for Linux (WSL), you may get a Protocol not available: AH00076: Failed to enable APR_TCP_DEFER_ACCEPT warning that may render http://localhost/ non-functional. To fix this warning, add the line AcceptFilter http none to the beginning of /etc/apache2/apache2.conf and restart Apache.
Additionally, before continuing, set (if not already set) the timezone declaration for PHP:
Open /etc/php/7.4/apache2/conf.d/timezone.ini and add a line like date.timezone=$YOUR_TIMEZONE, e.g. date.timezone="Europe/Helsinki", to the file. Remember to save the file. You will then have to restart the Apache server for this setting to take effect.
Note that if you are running Apache within a VirtualBox VM and want to use the browser from the host machine, a port forwarding rule needs to be added to the VirtualBox settings (Network -> Port Forwarding...). The rule is: Name="apache", Protocol="TCP", Host Port="8000", Guest Port="80", other fields can be left blank. Confirm with OK. This can be done also while the VM is running. Then you can open http://localhost:8000/ from the browser on the host machine.
Then you'll need to allow setting options in directory-specific .htaccess files by editing the Apache configuration file /etc/apache2/sites-enabled/000-default.conf. Inside that file, you will find the <VirtualHost *:80> block on line 1. Inside that block, add the following block:
<Directory /var/www/html>
Options Indexes FollowSymLinks MultiViews
AllowOverride All
Require all granted
</Directory>
You should also enable the Apache modules mod_rewrite and mod_expires since Skosmos requires those to work.
$ sudo a2enmod rewrite
$ sudo a2enmod expires
After these changes you can restart Apache and the installation should be ready for running Skosmos.
$ sudo service apache2 restart
Start by cloning the Skosmos repository to a directory on the machine. We will create the directory /srv/Skosmos for this purpose, owned by a regular (non-root) user; in the command below, whoami is used so that the directory will end up in the ownership of the user performing the operation.
$ cd /srv
$ sudo mkdir Skosmos
$ sudo chown `whoami` Skosmos
To be able to clone Skosmos we'll also need to install the git client:
$ sudo apt install git
Then we can clone the Skosmos 2.18 code from GitHub into /srv/Skosmos:
$ git clone -b v2.18-maintenance https://github.com/NatLibFi/Skosmos.git /srv/Skosmos
After git has finished cloning the repository, enter it, then download and install Composer for managing the library dependencies.
$ cd /srv/Skosmos/
$ curl -sS https://getcomposer.org/installer | php
After you have downloaded and installed Composer you can simply install the dependencies required to run Skosmos. If you wish to do some software development with your Skosmos installation you should omit the --no-dev part. Then you'll be able to run the unit tests and update the gettext translations. Please note that running composer.phar with root/superuser privileges is not recommended.
$ php composer.phar install --no-dev
To make Skosmos accessible via Apache, we will add a symlink under /var/www/html pointing to the directory /srv/Skosmos where it was installed:
$ sudo ln -s /srv/Skosmos /var/www/html/Skosmos
After installing the dependencies you need to configure the Skosmos installation. You can start by copying the default configuration files and using those as a basis for building your own configuration file.
$ cp config.ttl.dist config.ttl
Let's start by enabling the Fuseki text index we created earlier.
$ nano config.ttl
We'll make the following changes to the configuration:
- Set the default SPARQL endpoint to the local Fuseki and the skosmos dataset
- Set the default SPARQL dialect to "JenaText" to use the jena-text index
- Add German translation to the UI languages
Please note that the Turtle notation requires using ; instead of . whenever the shorthand syntax for predicate lists is used, as per the Turtle specification (provided that the triple is not the last one for the common subject).
Add the triple
:config skosmos:sparqlEndpoint <http://localhost:3030/skosmos/sparql> .
and comment out the other skosmos:sparqlEndpoint declarations for :config.
Switch the following triple
:config skosmos:sparqlDialect "Generic" .
into
:config skosmos:sparqlDialect "JenaText" .
# interface languages available, and the corresponding system locales (you may remove Finnish and Swedish)
:config skosmos:languages (
[ rdfs:label "en" ; rdf:value "en_GB.utf8" ]
[ rdfs:label "de" ; rdf:value "de_DE.utf8" ]
) .
Your machine may not have English and/or German locales installed, which are necessary for the Skosmos UI translations to work. To generate the locales as well as to ensure that the preliminaries exist, run these commands:
sudo apt install gettext
sudo locale-gen en_GB.utf8
sudo locale-gen de_DE.utf8
Restart Apache for these changes to take effect.
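To confirm that the locales were actually generated, the list printed by locale -a can be searched. The sketch below is an illustration only: has_locale is a hypothetical helper, and it normalises case and hyphens because the same locale may be reported as en_GB.utf8 or en_GB.UTF-8.

```shell
# Hypothetical helper: read a locale list on stdin and succeed if the
# requested locale is present, ignoring case and hyphens in the name.
has_locale() {
  want=$(printf '%s' "$1" | tr 'A-Z' 'a-z' | tr -d '-')
  tr 'A-Z' 'a-z' | tr -d '-' | grep -qxF "$want"
}

# Report the status of the two locales used in this tutorial.
if command -v locale >/dev/null 2>&1; then
  for l in en_GB.utf8 de_DE.utf8; do
    if locale -a 2>/dev/null | has_locale "$l"; then
      echo "$l: installed"
    else
      echo "$l: missing"
    fi
  done
fi
```

A locale reported as missing can be generated with the locale-gen commands above.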
Next we will add vocabulary definitions and configurations for STW and UNESCO Thesaurus so that Skosmos knows to look for the vocabularies from our Fuseki SPARQL endpoint. Add these blocks of code after the #Skosmos vocabularies line in the config.ttl file.
:unesco a skosmos:Vocabulary, void:Dataset ;
dc:title "UNESCO Thesaurus"@en ;
skosmos:shortName "UNESCO";
dc:subject :cat_general ;
void:uriSpace "http://skos.um.es/unescothes/";
skosmos:language "en", "es", "fr", "ru";
skosmos:defaultLanguage "en";
skosmos:showTopConcepts true ;
skosmos:fullAlphabeticalIndex true ;
skosmos:groupClass isothes:ConceptGroup ;
void:sparqlEndpoint <http://localhost:3030/skosmos/sparql> ;
skosmos:sparqlGraph <http://skos.um.es/unescothes/> .
:stw a skosmos:Vocabulary, void:Dataset ;
dc:title "STW Thesaurus for Economics"@en ;
skosmos:shortName "STW";
dc:subject :cat_general ;
void:uriSpace "http://zbw.eu/stw/";
skosmos:language "en", "de";
skosmos:defaultLanguage "de";
void:sparqlEndpoint <http://localhost:3030/skosmos/sparql> ;
skosmos:sparqlGraph <http://zbw.eu/stw/> .
You can remove the ysa and yso example vocabulary definitions, even though they should point to a separate SPARQL endpoint and work out-of-the-box.
Now you should be able to see the STW and UNESCO thesauri on the Skosmos front page. Point your browser to http://localhost/Skosmos/ (or http://localhost:8000/Skosmos/ from the host machine) and verify that you can see and open the vocabulary front pages. Replace localhost with your server's IP address if you're not doing this locally.
Now that basic Skosmos functionality is working, we can try to make it faster. But first we need to benchmark how well it performs so that we know that we are making progress.
To measure response time, we will use the simple Apache benchmark tool ab, which needs to be installed first:
$ sudo apt install apache2-utils
Best practice would be to run the benchmarking tool from another machine, but since we are only interested in relative performance, and ab is very lightweight, we can also just run it from the same machine.
For simplicity's sake we will just measure two operations: 1) how long it takes to generate a web page for a single concept - we'll pick the concept "Culture" from the UNESCO Thesaurus - and 2) how long it takes to generate the front page of the STW Thesaurus with the alphabetical index. These commands will load those pages 100 times:
$ ab -n 100 http://localhost/Skosmos/unesco/en/page/C00926
$ ab -n 100 http://localhost/Skosmos/stw/en/index
ab will report many figures, but let's just concentrate on the "Requests per second" value. On my example virtual machine, after running this several times, the reported numbers stabilize around 16 and 4, respectively. Not too bad, but could be improved!
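When the benchmark is repeated several times, it helps to pull out just the "Requests per second" figure. The snippet below is a small sketch that filters ab output for that line; ab_rps is a hypothetical helper name, and it assumes the usual "Requests per second: N [#/sec] (mean)" line in the report.

```shell
# Hypothetical helper: read ab output on stdin and print only the
# numeric "Requests per second" value.
ab_rps() {
  awk '/^Requests per second:/ {print $4}'
}

# Example (requires a running Skosmos):
#   ab -n 100 http://localhost/Skosmos/unesco/en/page/C00926 | ab_rps
```

Running the pipeline a few times in a row makes it easy to see when the numbers have stabilized.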
The first optimization step is to install the APC cache for PHP. Skosmos uses APC for caching the vocabulary configuration file since the Turtle parsing can be quite slow when you have many vocabularies in your configuration file. APC is also used for caching queries made to external resources other than the Fuseki instance. This alone can considerably speed up your Skosmos page load times.
$ sudo apt install php-apcu
$ sudo service apache2 restart
Now we can measure the performance again using ab. On my machine, the requests per second increased to about 23 and 4.6, i.e. about 15-40% faster, not bad for just installing an additional package. With a larger number of vocabularies (and thus a larger config.ttl file), the improvement would have been even larger.
Another way to speed up Skosmos is to add an HTTP proxy cache in front of Fuseki. The cache will store answers to recurring SPARQL queries and answer them much faster than Fuseki could. Many of the SPARQL queries that Skosmos performs will be repeated many times, so this will speed up Skosmos. However, it doesn't improve worst case response times.
We will first install the Varnish package. This will install Varnish 6.2:
$ sudo apt install varnish
By default in Ubuntu 20.04, Varnish will listen on TCP port 6081. It will use a non-persistent in-memory cache of 256MB to store HTTP responses. This is fine for the purposes of this example, but can be changed by editing the systemd configuration for Varnish. In particular, the amount of memory allocated to the cache could be increased to improve the cache hit rate if you have lots of vocabulary data.
Here is an example of how to increase the memory allocation to 1GB. You will first need to create a varnish.service.d directory to store configuration overrides:
$ sudo mkdir /etc/systemd/system/varnish.service.d/
Then create a file called varnish-commandline.conf in that directory, with this content:
# Override the Varnish command line
[Service]
# Clear existing ExecStart= (required)
ExecStart=
# Set a new ExecStart=
ExecStart=/usr/sbin/varnishd -j unix,user=vcache -F -a :6081 -T localhost:6082 -f /etc/varnish/default.vcl -S /etc/varnish/secret -s malloc,1g
Then activate the new systemd configuration with:
$ sudo systemctl daemon-reload
$ sudo service varnish restart
The Varnish back-end configuration needs to be changed so that Varnish accesses Fuseki instead of some other web server. Additionally we will ask Varnish to store responses for up to one week (instead of the default 2 minutes) in a compressed form, which will allow many more responses to be stored in the cache, at the cost of some CPU time for compressing and uncompressing. Edit /etc/varnish/default.vcl to look like this:
vcl 4.0;
backend default {
.host = "127.0.0.1";
.port = "3030";
}
sub vcl_backend_response {
# store for a long time (1 week)
set beresp.ttl = 1w;
# always gzip before storing, to save space in the cache
set beresp.do_gzip = true;
}
Then restart Varnish:
sudo service varnish restart
Note that since the cache is non-persistent, you can always clear the cache simply by restarting Varnish, for example if you update your vocabulary data.
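One way to check whether a response was served from the cache is to inspect the response headers. The sketch below assumes Varnish's usual behaviour of listing two transaction IDs in the X-Varnish header on a cache hit and only one on a miss; is_varnish_hit is a hypothetical helper name.

```shell
# Hypothetical helper: read HTTP response headers on stdin and succeed
# if X-Varnish carries two transaction IDs (assumed to indicate a hit).
is_varnish_hit() {
  awk 'tolower($1) == "x-varnish:" && NF >= 3 { hit = 1 } END { exit !hit }'
}

# Example (requires Varnish and Fuseki to be running); request the same
# URL twice and inspect the headers of the second response:
#   url='http://localhost:6081/skosmos/sparql?query=ASK%20%7B%7D'
#   curl -s -o /dev/null "$url"
#   curl -s -o /dev/null -D - "$url" | is_varnish_hit && echo "cache hit"
```

After restarting Varnish, the first request for any URL should be reported as a miss again.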
Now we need to tell Skosmos to access the Fuseki SPARQL endpoint via Varnish instead of directly. To do this, we will change references to 3030 (the Fuseki port) to 6081 (the Varnish port).
In config.ttl:
For # Skosmos main configuration:
:config a skosmos:Configuration ;
skosmos:sparqlEndpoint <http://localhost:6081/skosmos/sparql> ;
For both stw and unesco:
void:sparqlEndpoint <http://localhost:6081/skosmos/sparql> ;
Now we can measure performance again using ab. This time the result is about 24 requests per second for the concept page and 4.6 requests per second for the STW index page. For the concept page, the speedup is about 4% over just using APC, while the index page stays at about 4.6. Compared with the initial baseline of 16 and 4, the total improvement in requests per second was 50% for the concept page and 15% for the index page. Not bad!
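The relative improvements can be recomputed from the requests-per-second figures quoted in this tutorial (16 and 4 at baseline, 24 and 4.6 for the concept page and the index page after both optimizations). The helper below is a sketch with a hypothetical name, speedup_pct, that prints the percentage increase between two throughput values.

```shell
# Hypothetical helper: percentage increase from a baseline value ($1)
# to a new value ($2), rounded to the nearest whole percent.
speedup_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (b - a) / a * 100 }'
}

speedup_pct 16 24    # concept page: baseline vs. APCu + Varnish
speedup_pct 4 4.6    # index page: baseline vs. APCu + Varnish
```

The same helper can be applied to your own ab measurements to compare each optimization step.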
In this tutorial we walked through installing Fuseki and Skosmos on an Ubuntu 20.04 server and also optimized its performance. After having set up a basic Skosmos installation this way, we could add more vocabularies, configure the text index to fine tune search behavior, or configure Skosmos to behave differently or look different.
Apache Jena and associated module names are trademarks of the Apache Software Foundation.