diff --git a/docs/dev/solr.md b/docs/dev/solr.md new file mode 100644 index 0000000..aeb10ac --- /dev/null +++ b/docs/dev/solr.md @@ -0,0 +1,258 @@ +# Solr and YaCy Integration + +Hint: If you are not a developer, you can skip this topic. Solr +is already built into YaCy; no action is needed. + +YaCy uses Solr (and other data structures) to store the local search +index (while the remote search index is an RWI data structure). The Solr +index is deeply/programmatically embedded into YaCy, but it is also +possible to use an external Solr index which can then be assigned to +YaCy as external storage. This can be activated with a single click if +you have a running Solr, configured for YaCy. + +The remote index scheme is similar (but extended) to SolrCell; see + We added some +more generic fields and a second solr core, and therefore we need to +use the solr.xml and schema.xml from a YaCy installation. + +## Use the deep-embedded Solr in YaCy and an external Solr concurrently + +This is the default setting. Assigning a remote Solr and +switching off the embedded Solr are both done in the servlet +/IndexFederated_p.html. The embedded Solr is active if the flag +"Use deep-embedded local Solr" is switched on. + +## Use an external Solr or Solr Shards to have a distributed Solr-backend for a single YaCy + +In the "Solr URL(s)" field of the + servlet, you can enter +several Solr addresses. If there is more than one Solr assigned, these +are accessed as a 'Shard'. Each document is then hashed +using the document id and stored in only one of the shards. If a query +to Solr is made, then all shards are queried concurrently and the +results are merged. + +## Concurrent usage of the embedded Solr and an external Solr or Solr Shard + +It is possible to leave the "Use deep-embedded local Solr" flag switched +on while using an external Solr. Then each document is stored in the +local and the remote Solr. 
If a document is searched, this is done +concurrently in the local and remote Solr as if they were a Solr Shard. + +# How to deploy an external Solr for YaCy + +The deployment needs two steps: (1) embed Solr into a servlet +environment, (2) configure Solr for YaCy. Both are described in each of +the following three options: you can choose between Jetty and Tomcat as +servlet container. Do only one of the following three: + +## Use the example-deployment in a Solr package + +This is probably the easiest and fastest way to test a YaCy-Solr +connection. Don't do this for a production environment; one of the next +two options is better for this. The following steps use Solr 4.1.0; you +can use the most recent version as well. + + - Download solr-4.1.0.tgz from + - Decompress solr-4.1.0.tgz (with 'tar xfz solr-4.1.0.tgz') and put + solr-4.1.0 into ~/ + - We must define two cores for Solr: the default collection1 and an + additional 'webgraph' core. This is done by copying the YaCy solr.xml + + + + cp ~/yacy/defaults/solr/solr.xml ~/solr-4.1.0/example/solr/collection1/conf/ + + - The webgraph core is basically a copy of the default collection1 + core. Create a configuration for the webgraph as a clone of + collection1: + + + + mkdir ~/solr-4.1.0/example/solr/webgraph + cp -R ~/solr-4.1.0/example/solr/collection1/conf ~/solr-4.1.0/example/solr/webgraph/conf + + - A YaCy schema configuration must be copied to each core. 
To do this, + we have two options: either copy a generic version of the schema.xml + as used by YaCy + + + + cp ~/yacy/defaults/solr/schema.xml ~/solr-4.1.0/example/solr/collection1/conf/ + cp ~/yacy/defaults/solr/schema.xml ~/solr-4.1.0/example/solr/webgraph/conf/ + +or, using the explicit schema definition which can be extracted from the +YaCy API; start YaCy (if not already running) and execute the following +commands: + + ~/yacy/bin/apicat.sh /api/schema.xml?core=collection1 > ~/solr-4.1.0/example/solr/collection1/conf/schema.xml + ~/yacy/bin/apicat.sh /api/schema.xml?core=webgraph > ~/solr-4.1.0/example/solr/webgraph/conf/schema.xml + + - Finally, start the external Solr with: + + + + cd ~/solr-4.1.0/example/ && java -jar start.jar + +Solr is then running at + + - Start YaCy (if not already running) and open + + - in the "Solr URL(s)" field, enter: (or + a remote address, if you want to run solr on a different server) + - uncheck the "Use deep-embedded local Solr" flag and check the "Use + remote Solr server(s)" flag + +## Deploy Solr in Tomcat + +First you must download and decompress tomcat 6. In this example you +install tomcat to your home directory at `~/tomcat/` + + cd ~ + wget http://archive.apache.org/dist/tomcat/tomcat-6/v6.0.37/bin/apache-tomcat-6.0.37.tar.gz + tar xfz apache-tomcat-6.0.37.tar.gz + mv apache-tomcat-6.0.37 tomcat + +To deploy a solr container, download a solr package and copy the +relevant files to the correct tomcat subdirectory: + + cd ~/tomcat + wget http://www.eu.apache.org/dist/lucene/solr/4.5.1/solr-4.5.1.tgz + tar xfz solr-4.5.1.tgz + cp solr-4.5.1/dist/solr-4.5.1.war . + cp -R solr-4.5.1/example/solr yacyindex + +We need to copy the YaCy schema and the definition of the second core +'webgraph'. 
Assuming that a YaCy peer is installed at ~/yacy, +you can simply copy the generic schema file for collection1 to +solr: + + cp ~/yacy/defaults/solr/schema.xml ~/tomcat/yacyindex/collection1/conf/ + +Clone the collection1 to get the webgraph core: + + cp -R ~/tomcat/yacyindex/collection1 ~/tomcat/yacyindex/webgraph + +Patch the file +`~/tomcat/yacyindex/webgraph/core.properties`: replace the line +`name=collection1` with `name=webgraph`. Then copy the solr.xml +definition for two cores: + + cp ~/yacy/defaults/solr/solr.xml ~/tomcat/yacyindex/ + +Copy the solr logging libraries to the tomcat library folder because +Solr expects the logging libraries that its bundled jetty provides, and they are not part of the solr war. In the +~/tomcat directory, do + + cp solr-4.5.1/example/lib/ext/* ~/tomcat/lib/ + +To deploy Solr with the YaCy configuration you must create a Tomcat +Context fragment. This is a file within a conf subdirectory which is +created once tomcat has been started. Therefore we start tomcat now: + + ~/tomcat/bin/startup.sh + +Look at the path `~/tomcat/conf/Catalina/localhost` which has now been +created. That's the place where we create the Tomcat Context fragment. +You need the absolute path to the tomcat installation directory, which we +assume to be `/home/administrator/tomcat` in this example. Create the file +`/home/administrator/tomcat/conf/Catalina/localhost/solr4yacy.xml` with +the following content: + + + + + + +Restart tomcat to activate this configuration: + + ~/tomcat/bin/shutdown.sh && ~/tomcat/bin/startup.sh + +Finished! You can now access Solr with the url + This is the url which you can set in +the "Use remote Solr server(s)" field of the /IndexFederated_p.html +servlet in YaCy to attach the solr-in-tomcat to YaCy as remote storage +server. When doing this you may want to uncheck the flag "Use +deep-embedded local Solr" so this remote solr becomes the single storage +point for the YaCy search index. 
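The `core.properties` patch for the cloned webgraph core can also be scripted instead of edited by hand. The following is only a minimal sketch, assuming a POSIX shell with `sed`; `rename_core` is a hypothetical helper name and is not part of YaCy, Solr or Tomcat:

```shell
# Sketch: rename a cloned Solr core by rewriting the name= line in core.properties.
# rename_core is a hypothetical helper, not something shipped with YaCy or Solr.
rename_core() {
    props_file=$1   # path to the core.properties of the cloned core
    old_name=$2     # core name to replace, e.g. collection1
    new_name=$3     # new core name, e.g. webgraph
    tmp_file="${props_file}.tmp"
    # Rewrite only the exact line "name=<old_name>"; all other lines pass through unchanged.
    sed "s/^name=${old_name}\$/name=${new_name}/" "$props_file" > "$tmp_file" &&
        mv "$tmp_file" "$props_file"
}
```

With the paths used in this section, the call would be `rename_core ~/tomcat/yacyindex/webgraph/core.properties collection1 webgraph`. Writing to a temporary file and moving it into place avoids relying on the non-portable `sed -i` option.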
+ +### User Administration and Search Index Access Protection + +Tomcat can add password protection to web pages. There is, for example, a +default manager web application available at + which cannot be accessed without +setting a role and a user name for it. We will activate the manager to +test the password protection: write the new role "manager" to the file +`~/tomcat/conf/tomcat-users.xml` and set a password, e.g. + + + + + + + +Restart tomcat, then open and +manage solr applications there. Log in with the user name 'admin' and +the password 'tomcat'. We will now use the same mechanism to protect our YaCy search +index in Solr. To do this, we need access rules defined in the web.xml +configuration file that declare which role may access the protected paths. We will call this +role 'user' and protect all paths within tomcat. Open the file +`~/tomcat/conf/web.xml` and add the following lines at the end before the +closing tag \: + + + + /* + + + user + + + + BASIC + tomcat + + + user + + +To use the new role 'user', we add an account in the file +`~/tomcat/conf/tomcat-users.xml`. Add the following lines to +\: + +``` + + +``` + +and restart tomcat. You can now access Solr with the url + This is the url +which you can set in the "Use remote Solr server(s)" field of the +/IndexFederated_p.html servlet in YaCy. The account:password encoding +in the url is used by YaCy to access the solr index within tomcat. + +# Copy the deeply-embedded Solr Index to an external Solr + +This is easy: just copy the Solr directory in +DATA/INDEX/\/SEGMENTS/solr\_40 into the solr data directory of +your remote Solr installation. You can also do this using a script +at runtime. Call + + ~/yacy/bin/indexdump.sh + +which makes YaCy create a tar.gz file of the `solr_40` directory +at runtime; the indexdump.sh script returns the file path to this +tar.gz file. This file can then be processed further with your own +copy-and-deploy script to fill a remote Solr. 
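The unpacking half of such a copy-and-deploy script could look like the following. This is only a sketch under assumptions: `unpack_dump` is a hypothetical helper, the target directory is a placeholder for the data directory of your own Solr installation, and the layout inside the archive is taken as whatever indexdump.sh produced without being checked:

```shell
# Sketch: unpack a YaCy index dump into a Solr data directory.
# unpack_dump and its arguments are hypothetical; nothing here is shipped with YaCy.
unpack_dump() {
    dump_file=$1    # path to the tar.gz file reported by indexdump.sh
    target_dir=$2   # Solr data directory that should receive the index
    # Refuse to continue if the dump file does not exist.
    [ -f "$dump_file" ] || { echo "dump not found: $dump_file" >&2; return 1; }
    # Create the target directory if needed and extract the archive into it.
    mkdir -p "$target_dir" &&
        tar xzf "$dump_file" -C "$target_dir"
}
```

Assuming that indexdump.sh prints only the file path on standard output, a call could be `unpack_dump "$(~/yacy/bin/indexdump.sh)" /path/to/solr/data`. For a Solr on another host, the dump file would first have to be transferred there, for example with `scp`.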
+ + + +For cluster solr usage, see [Solr Cloud instructions](./solrcloud.md) + + +_Converted from +, may be outdated_ + + + + diff --git a/docs/dev/solrcloud.md b/docs/dev/solrcloud.md new file mode 100644 index 0000000..c4d1fb8 --- /dev/null +++ b/docs/dev/solrcloud.md @@ -0,0 +1,286 @@ +# YaCy and Solr Cloud + + +This is an advanced [Solr for YaCy](./solr.md) +installation which uses the SolrCloud architecture. If you want to read +and understand this, you should be (at least a little bit) familiar with +debian, Solr and tomcat. + +In this example, we install a shard of 4 Solr instances within the same +server. + + + +## Software Installation + +We install tomcat, zookeeper and YaCy as standard debian packages and +Solr as web app for tomcat. + +### Tomcat Installation + +We will install tomcat as a standard debian system service using apt: + + apt-get install tomcat6 tomcat6-examples tomcat6-admin tomcat6-docs + +The tomcat web service on port 8080 will start automatically and you can +open the default page at The optional packages +tomcat6-examples tomcat6-admin tomcat6-docs are great to develop and +test applications, but it is also possible to omit them. If you +installed the optional packages, then you can test them: + + - is the online-documentation + - links to a set of example tomcat + applications + - and + are tomcat management + applications but their access is restricted. To use them you must + set a password in /etc/tomcat6/tomcat-users.xml, like + + + + + + + + + + + + +After setting this, you must restart tomcat with + + /etc/init.d/tomcat6 restart + +and then you can log in the +[manager](http://localhost:8080/manager/html) and +[host-manager](http://localhost:8080/host-manager/html) servlet with +the user 'admin' and the password 'tomcat'. Please replace the default +password 'tomcat' with your own. 
+ +The relevant paths for the result of this installation are: + + tomcat users: /etc/tomcat6/tomcat-users.xml + CATALINA_HOME: /usr/share/tomcat6 + CATALINA_BASE: /var/lib/tomcat6 + default web page: /var/lib/tomcat6/webapps/ROOT/index.html + +### Zookeeper Installation + +The SolrCloud peers need a common configuration system which is provided +by zookeeper. Zookeeper can be installed with + + apt-get install zookeeper zookeeperd + +This will create a new user named 'zookeeper'. The relevant paths are: + + Zookeeper config: /etc/zookeeper/conf (linked to /etc/zookeeper/conf_example) + Zookeeper data: /var/lib/zookeeper/ + Zookeeper binary: /usr/share/zookeeper/ + +To check if Zookeeper is running, start the Zookeeper shell: + + /usr/share/zookeeper/bin/zkCli.sh + +and run shell commands like + + ls / + ls /zookeeper + +Because solr is started within tomcat and needs to know the host address +of zookeeper, we must pass this to tomcat as a jvm option. Open the +file /usr/share/tomcat6/bin/catalina.sh and add the following lines at +the beginning of the file (right after the comments): + + # added zookeeper host information used by tomcat to find Solr shards for the SolrCloud + CATALINA_OPTS="$CATALINA_OPTS -DzkHost=localhost:2181" + +and restart tomcat: + + /etc/init.d/tomcat6 restart + +### Solr Installation + +Download a solr release from (Solr +4.5.1 worked while Solr 4.6.0 did not work!) e.g. 
+ + cd /opt + wget http://apache.mirrors.spacedump.net/lucene/solr/4.5.1/solr-4.5.1.tgz + tar xfz solr-4.5.1.tgz + ln -s solr-4.5.1 solr + ln -s solr-4.5.1/dist/solr-4.5.1.war solr.war + +Because Solr expects the logging libraries that its bundled jetty provides and tomcat does not, +we must add slf4j adapters to the tomcat library folder: + + cd /usr/share/tomcat6/lib/ + wget http://www.slf4j.org/dist/slf4j-1.6.6.zip + apt-get install unzip + unzip slf4j-1.6.6.zip + cp slf4j-1.6.6/{jcl-over-slf4j-1.6.6.jar,log4j-over-slf4j-1.6.6.jar,slf4j-api-1.6.6.jar,slf4j-jdk14-1.6.6.jar} . + +and restart tomcat: + + /etc/init.d/tomcat6 restart + +### YaCy Installation + +Follow the [YaCy for Debian installation +instructions](../installation/debianinstall.md) +and select 'webportal' as network to join (we assume that you do +this to create a standalone YaCy, not a peer-to-peer participant; you +can of course use this for a 'freeworld' peer as well). The +relevant paths are: + + YaCy data: /var/lib/yacy + YaCy log: /var/log/yacy + YaCy binary: /usr/share/yacy/ + Solr conf for YaCy: /usr/share/yacy/defaults/solr + +## Software Configuration + +The SolrCloud needs a common configuration of the index cores used by +YaCy. YaCy uses two cores, 'collection1' and 'webgraph'. Both are +defined with a generic index schema and they are exact clones of each +other. It may also be possible to define these cores with non-generic, +exactly defined schema.xml files, but we will not do that right now +because it makes things much more complex. + +### Zookeeper Client for Solr + +First, we need a Zookeeper client for Solr because Solr provides its +own client app to upload the relevant configuration files. We must +build this client using the libraries inside the Solr war-file and +additional libraries for logging. 
We use the already installed war file; +you must adapt the paths here if you used a more recent version of Solr: + + unzip -q /opt/solr.war -d /tmp/solr-war/ + mkdir /usr/share/zookeeper/solr-cli-lib + cp /tmp/solr-war/WEB-INF/lib/* /usr/share/zookeeper/solr-cli-lib/ # solr libs + cp /opt/solr/example/lib/ext/* /usr/share/zookeeper/solr-cli-lib/ # logger libs + rm -Rf /tmp/solr-war + +Now we can take advantage of the SolrCloud ZooKeeper CLI commands. + +### Create Solr Configuration of Solr Cores for YaCy Inside Zookeeper + +For a detailed description of the set-up of Solr Clusters and a +SolrCloud configuration, see the [SolrCloud Wiki of +apache.org](http://wiki.apache.org/solr/SolrCloud), the [SolrCloud +Installation in Tomcat](http://wiki.apache.org/solr/SolrCloudTomcat), +a [Guide to SolrCloud +Configuration](http://systemsarchitect.net/painless-guide-to-solr-cloud-configuration/) +and a [SolrCloud Cluster (Single Collection) +Deployment](http://myjeeva.com/solrcloud-cluster-single-collection-deployment.html). +To upload the solr configuration to Zookeeper, we build a config +directory using the solr example config and the YaCy generic schema file +schema.xml: + + cp -R /opt/solr/example/solr/collection1/conf /opt/yacyconf + cp /usr/share/yacy/defaults/solr/schema.xml /opt/yacyconf/ + +We can then use that to upload the configuration to zookeeper: + + java -classpath ".:/usr/share/zookeeper/solr-cli-lib/*" org.apache.solr.cloud.ZkCLI -zkhost localhost:2181 -cmd upconfig -confdir /opt/yacyconf -confname yacygeneric + +That configuration is good for both collections, 'collection1' and +'webgraph'. 
We can therefore link this configuration to both +collections: + + java -classpath ".:/usr/share/zookeeper/solr-cli-lib/*" org.apache.solr.cloud.ZkCLI -zkhost localhost:2181 -cmd linkconfig -collection collection1 -confname yacygeneric + java -classpath ".:/usr/share/zookeeper/solr-cli-lib/*" org.apache.solr.cloud.ZkCLI -zkhost localhost:2181 -cmd linkconfig -collection webgraph -confname yacygeneric + +Let's see what's inside zookeeper now, e.g. how collection1 is +linked against the generic schema: + + /usr/share/zookeeper/bin/zkCli.sh get /collections/collection1 + +### Create Tomcat Configuration of Solr Web Services + +We want to use four Solr servers as a SolrCloud, each with two cores +('collection1' and 'webgraph'). We create subdirectories for the servers +inside of /var/opt/solrcloud/: + + mkdir /var/opt/solrcloud/ + mkdir /var/opt/solrcloud/solr0 + mkdir /var/opt/solrcloud/solr1 + mkdir /var/opt/solrcloud/solr2 + mkdir /var/opt/solrcloud/solr3 + +In each of these directories, put a file named solr.xml. Most descriptions +on the web of how to create that file are outdated, since there is +a new [xml structure for solr.xml for Solr 4.4 and +beyond](http://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond), +especially for [Core Discovery with +SolrCloud](http://wiki.apache.org/solr/Core%20Discovery%20%284.4%20and%20beyond%29). +Put the following content into `/var/opt/solrcloud/solr0/solr.xml`: + + + + 4 + + localhost + 8080 + solr0 + localhost:2181 + ${solr.zkclienttimeout:30000} + ${shareSchema:false} + ${genericCoreNodeNames:true} + + + ${socketTimeout:0} + ${connTimeout:0} + + + +Finally, make the path `/var/opt/solrcloud/` writable for tomcat6: + + chown -R tomcat6 /var/opt/solrcloud/ + chgrp -R tomcat6 /var/opt/solrcloud/ + +To deploy Solr with the YaCy configuration you must create a Tomcat +Context fragment for each Solr instance. A Tomcat Context Fragment is a +file in `/var/lib/tomcat6/conf/Catalina/localhost`. 
Therefore, we must +create four files, one for each Solr server, in this directory: write a +file to `/var/lib/tomcat6/conf/Catalina/localhost/solr0.xml` with the +following content: + + + + + + +and copy this to `solr1.xml .. solr3.xml` and patch the solr/home +attribute to `solr1 .. solr3`. If you patch these files using emacs, make +sure that you delete all files ending with '~' because they will cause +an error. Finally, restart tomcat: + + /etc/init.d/tomcat6 restart + +### Create the SolrCloud + +We can now open the Solr web service at +Open this web page to check if the service is up and running. Then we +can use that web service to instantiate the SolrCloud: + + curl 'http://localhost:8080/solr0/admin/collections?action=CREATE&name=collection1&numShards=4&replicationFactor=1' + curl 'http://localhost:8080/solr0/admin/collections?action=CREATE&name=webgraph&numShards=4&replicationFactor=1' + +### Assign the SolrCloud to YaCy + +When the SolrCloud is ready and running, it can be assigned to YaCy as +storage server. Open the servlet at + and select the flag "Use +remote Solr server(s)". As server address, enter one of the Solr +servers, like Finally, uncheck the flag +"Use deep-embedded local Solr". + + + + + +_Converted from +, may be outdated_ + + + + diff --git a/docs/docs.md b/docs/docs.md index fdad371..436ba32 100644 --- a/docs/docs.md +++ b/docs/docs.md @@ -48,6 +48,10 @@ may be outdated, you can help the community by checking and [improving](contribu * [YaCy and Tor](operation/yacy-tor.md) * [Network Definition](operation/network-definition.md) +### Developers +* [Solr and YaCy integration](dev/solr.md) +* [YaCy and Solr Cloud](dev/solrcloud.md) + ## Old and obsolete The original YaCy wiki is closed now (no new registration or editing) and will be abandoned in future, but still contains valuable information. You