FusekiTuning
If you use Fuseki (with Skosmos or otherwise) in a production system, it can be useful to tune it for better performance and resilience. This page contains tips and advice for Fuseki tuning.
By default, the Fuseki startup script sets the JVM option -Xmx1200M, i.e. it allows the Fuseki process up to 1.2GB of heap memory. This can be a bit low: Fuseki may run out of memory, or at least start spending inordinate amounts of CPU time on garbage collection. The amount of memory required also depends on the number of requests processed in parallel, so if you are low on memory, it makes sense to aggressively limit the number of Jetty threads (see below).
You can adjust the -Xmx setting either by editing the startup script or by setting the JVM_ARGS environment variable.
Finto.fi uses -Xmx8G on a machine with 16GB of memory, with the Jetty thread pool set to 4-6 threads.
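As a minimal sketch of the environment variable approach, the following sets the heap limit to the 8GB figure mentioned above and wraps the launch in a function. The script path ./fuseki-server and the configuration file name config.ttl are assumptions; adjust them to your installation.

```shell
#!/bin/sh
# Raise the Fuseki heap limit through the JVM_ARGS environment variable.
# Assumption: ./fuseki-server is the startup script and config.ttl is a
# hypothetical configuration file name; both depend on your installation.
JVM_ARGS="-Xmx8G"
export JVM_ARGS

start_fuseki() {
    # The startup script reads JVM_ARGS from the environment.
    ./fuseki-server --config=config.ttl "$@"
}
```

Calling start_fuseki then launches the server with the larger heap. Leave enough memory outside the heap for the operating system's disk cache.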
By default, Fuseki has no timeouts. Thus a complex SPARQL query may cause it to consume lots of CPU for a long time, which then badly affects other queries. It is often better to abort such futile queries. You can specify ARQ query timeouts in the dataset definition of the Fuseki configuration file. The value is given in milliseconds, so 30000 means 30 seconds:
<#dataset> rdf:type tdb:DatasetTDB ;
    # abort queries that run longer than 30000 ms
    ja:context [ ja:cxtName "arq:queryTimeout" ; ja:cxtValue "30000" ] ;
Note: the following advice on the Jetty thread pool was written for Fuseki1. If you use Fuseki2, you probably don't need to do this kind of tuning.
If there are lots of parallel requests, Fuseki easily gets overwhelmed: it eats too much memory and the JVM garbage collector starts thrashing. By default, there is no limit on the number of parallel requests in Jetty (the servlet container for Fuseki). Also, the queue for incoming requests has no upper limit, so even when the situation starts clearing up, there may be a long backlog of requests to process. See the Jetty thread pool tuning documentation for more information.
Fuseki can use a custom Jetty configuration (using the --jetty-config=jetty.xml parameter) where limits can be set on the thread count and the queue size for waiting requests. A Jetty thread count close to the number of CPU cores in the system makes the most sense - queries are generally CPU bound as the TDB database usually fits in disk cache. A thread count significantly above the number of CPU cores will generally just increase Fuseki memory consumption with no improvement in performance.
This custom jetty.xml configuration sets the thread count to between 4 and 6 and the size of the request queue to 100:
<?xml version="1.0"?>
<!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN"
  "http://www.eclipse.org/jetty/configure.dtd">
<!--
  Reference: http://wiki.eclipse.org/Jetty/Reference/jetty.xml_syntax
             http://wiki.eclipse.org/Jetty/Reference/jetty.xml
-->
<Configure id="Fuseki" class="org.eclipse.jetty.server.Server">
  <Call name="addConnector">
    <Arg>
      <!-- org.eclipse.jetty.server.nio.BlockingChannelConnector -->
      <!-- org.eclipse.jetty.server.nio.SelectChannelConnector -->
      <New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
        <!-- BlockingChannelConnector specific:
          <Set name="useDirectBuffer">false</Set>
        -->
        <!-- Only listen to interface ...
          <Set name="host">localhost</Set>
        -->
        <Set name="port">3030</Set>
        <Set name="maxIdleTime">0</Set>
        <!-- All connectors -->
        <Set name="requestHeaderSize">65536</Set> <!-- 64*1024 -->
        <Set name="requestBufferSize">5242880</Set> <!-- 5*1024*1024 -->
        <Set name="responseBufferSize">5242880</Set> <!-- 5*1024*1024 -->
      </New>
    </Arg>
  </Call>
  <Set name="ThreadPool">
    <New class="org.eclipse.jetty.util.thread.QueuedThreadPool">
      <!-- specify a bounded queue -->
      <Arg>
        <New class="java.util.concurrent.ArrayBlockingQueue">
          <Arg type="int">100</Arg>
        </New>
      </Arg>
      <Set name="minThreads">4</Set>
      <Set name="maxThreads">6</Set>
      <Set name="detailedDump">false</Set>
    </New>
  </Set>
</Configure>
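To choose minThreads and maxThreads following the rule of thumb above, you can start from the number of CPU cores. The sketch below derives illustrative bounds and wraps the Fuseki launch with the custom Jetty configuration. The headroom of two extra threads, the script path ./fuseki-server and the file name config.ttl are assumptions, not recommendations from the Fuseki project.

```shell
#!/bin/sh
# Suggest Jetty thread pool bounds from the number of online CPU cores.
# Assumption: getconf _NPROCESSORS_ONLN is available (Linux, macOS, BSDs);
# fall back to 1 if it is not.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
min_threads=$cores
max_threads=$((cores + 2))   # small headroom; an illustrative choice
echo "minThreads=$min_threads maxThreads=$max_threads"

start_fuseki_with_jetty_config() {
    # Assumption: jetty.xml and config.ttl are in the current directory.
    ./fuseki-server --jetty-config=jetty.xml --config=config.ttl "$@"
}
```

The printed values still have to be copied into jetty.xml by hand; the script only helps to pick them.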
Using a caching reverse proxy (e.g. Varnish or nginx) is recommended, either in front of Skosmos, in front of the SPARQL endpoint, or both. Most SPARQL queries performed by Skosmos are HTTP GET requests, which are easily cached.
The simplest possible setup is using Varnish in front of Fuseki. You can use a VCL configuration such as this one. It is written in Varnish 3 syntax; Varnish 4 and later rename vcl_fetch to vcl_backend_response and expect a "vcl 4.0;" version line at the top of the file:
backend default {
    .host = "127.0.0.1";
    .port = "3030";
}
sub vcl_fetch {
    # store for a long time (1 week)
    set beresp.ttl = 1w;
    # always gzip before storing, to save space in the cache
    set beresp.do_gzip = true;
}
Note that you need to set the TTL explicitly, as Fuseki will not send any HTTP headers that would allow Varnish to infer how long the responses can be stored. If you do this, you need to flush the Varnish cache each time the data stored in Fuseki changes; otherwise Varnish will serve stale content.
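One way to flush the cache is to issue a ban that matches every URL through the Varnish management interface. This is a minimal sketch, assuming varnishadm is installed and allowed to reach the local Varnish instance. The VARNISHADM override variable and the function name are hypothetical conveniences.

```shell
#!/bin/sh
# Invalidate every cached object after the data in Fuseki has changed.
# Assumption: varnishadm can reach the local Varnish management interface.
# VARNISHADM is a hypothetical override, e.g. for a non-standard path.
VARNISHADM=${VARNISHADM:-varnishadm}

flush_varnish_cache() {
    # Ban all objects whose request URL matches the regex "." (any URL).
    "$VARNISHADM" "ban req.url ~ ."
}
```

Calling flush_varnish_cache at the end of each data loading script keeps the cache consistent with the contents of Fuseki.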
Once you have set up Varnish like this, make sure to change the SPARQL endpoint definitions in vocabularies.ttl and config.inc to point to the Varnish host and port (typically on localhost:6081) instead of using Fuseki directly (typically on localhost:3030).
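To check that requests really go through the cache, you can send a trivial query to the Varnish port and inspect the response headers. The sketch below assumes a dataset named ds with its query service at /ds/sparql, which is a hypothetical name; substitute your own endpoint path.

```shell
#!/bin/sh
# Send a trivial SPARQL query through Varnish and print the response headers.
# Assumption: the dataset is called "ds" and served at /ds/sparql.
ENDPOINT=${ENDPOINT:-http://localhost:6081/ds/sparql}

check_endpoint() {
    # -G puts the URL-encoded query into the query string (HTTP GET),
    # -D - prints the response headers, -o /dev/null discards the body.
    curl -sS -G -D - -o /dev/null --data-urlencode 'query=ASK {}' "$ENDPOINT"
}
```

Run check_endpoint twice: on the second run, a cached response typically shows a non-zero Age header and two request IDs in the X-Varnish header.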