diff --git a/docs/docs.md b/docs/docs.md index e5ec02a..56d8077 100644 --- a/docs/docs.md +++ b/docs/docs.md @@ -5,11 +5,12 @@ * [FAQ](faq.md) - Frequently asked questions * [Readme](https://github.com/yacy/yacy_search_server/blob/master/README.md) - at github.com - -## Operation +## Installation * [Headless - YaCy on a Remote Server](installation/headless.md) * [Shrink Debian by removing all graphical features to turn it into a headless server](installation/shrink.md) * [Set a static IP to a debian server](installation/staticip.md) + +## Operation * [Setting the ranking rules](operation/ranking.md) * [YaCy config settings](operation/yacy_conf.md) @@ -22,6 +23,10 @@ ## Converted from old-wiki may be outdated, you can help the community by checking and [improving](contribute.md) the pages +### Basics +* [Use cases](use_cases.md) +* [Features](features.md) + ### Installation * [System Requirements](installation/requirements.md) * [Arch Install Guide](installation/archinstall.md) @@ -34,12 +39,13 @@ may be outdated, you can help the community by checking and [improving](contribu * [Set up Raspberry Pi with YaCy](installation/raspberry_pi.md) ### Operation -* [YaCy and Tor](operation/yacy-tor.md) +* [Index Creation - Crawl Start](operation/crawlstart_p.md) +* [Autoupdate](operation/autoupdate.md) * [Portforwarding](operation/portforwarding.md) * [Using the YaCy Front-End over HTTPS](operation/yacyoverhttps.md) - - - +* [Performance Tuning](operation/performance.md) +* [YaCy and Tor](operation/yacy-tor.md) +* [Network Definition](operation/network-definition.md) ## Old and obsolete The original YaCy wiki is closed now (no new registration or editing) and diff --git a/docs/features.md b/docs/features.md new file mode 100644 index 0000000..e549494 --- /dev/null +++ b/docs/features.md @@ -0,0 +1,118 @@ +# Features + +Main technical features : + + - P2P network architecture + - Cross-platform : can work on any operating system which can run a + 
[JVM](https://en.wikipedia.org/wiki/Java_virtual_machine) + - Parsing capabilities : + - TXT clear text + - CSV tabular data + - RTF rich text + - XML structured data + - HTML web documents + - RSS, RDF, Atom Newsfeeds + - MS Office Excel, Word, PowerPoint + - MS Visio diagrams + - ODF OpenDocument + - PDF Portable Document + - PS PostScript + - SWF Adobe Flash + - VCard Electronic Business Card + - Archives : 7zip, zip, bz2, tbz, tbz2, tar.gz, rpm, jar, apk + - Images : png, jpg, gif, svg, ico, bmp, tif, psd + - CAD Drawings : dwg + - MM FreeMind mind maps + - Audio : mp3, ogg, oga, m4a, m4p, flac, wma, sid + - Torrent torrent metadata + - OpenSearch interface + - Load balancing + - Automated index redundancy distribution + - Direct import of external databases (Surrogate Harvester API) + - Automatic indexing through a proxy filter + - Embedded web server + - Internal domain names ending in .yacy + - P2P bootstrap from central seed lists + - Spell check + - Filter expressions, e.g. filetype:pdf + - UTF-8 encoding + +## Technologies + +Below are the main technologies used in the project : + + - Java 11 + - [XHTML](https://www.w3.org/TR/2002/REC-xhtml1-20020801/), + [CSS](https://www.w3.org/Style/CSS/), + [JavaScript](https://en.wikipedia.org/wiki/JavaScript) + - [JSON](http://json.org/) + - [Dublin Core](http://dublincore.org/) + - External components : + - [Apache Commons](https://commons.apache.org/) Toolkit + - [Apache HttpComponents](https://hc.apache.org/) + - [Apache Jakarta Oro](http://jakarta.apache.org/oro/) RegExp + - [Apache POI](http://poi.apache.org/) API for Microsoft Documents + - [Apache James Mime4j](https://james.apache.org/mime4j/) + - [Apache Lucene](https://lucene.apache.org/) + - [Apache Solr](http://lucene.apache.org/solr/) + - [Apache PDFBox and FontBox](http://pdfbox.apache.org/) + - [Apache Xerces](http://xerces.apache.org/xerces-j/) XML Parser + - [Apache XML APIs](http://xml.apache.org/commons/) + - [Bouncy 
Castle](http://www.bouncycastle.org/java.html) Crypto + APIs : Provider, Mail + - [GlassFish](https://glassfish.dev.java.net) Servlets + - [Guava](https://github.com/google/guava) + - [ICU](http://site.icu-project.org/) International Components for + Unicode + - [J7Zip](http://p7zip.sourceforge.net/) + - [Java CIFS](http://jcifs.samba.org) Client Library + - [Jazzy](https://sourceforge.net/projects/jazzy/) Spelling API + - [Jaudiotagger](http://www.jthink.net/jaudiotagger/) + - [JSch](http://www.jcraft.com/jsch/) Java Secure Channel + - [JDBC](http://www.oracle.com/technetwork/java/javase/jdbc/index.html) + - [Jetty](http://www.eclipse.org/jetty/) Web server + - [jQuery](https://jquery.com/) JavaScript library + - [JSONIC](http://jsonic.osdn.jp/) json encoder/decoder + - [json-simple](https://github.com/fangyidong/json-simple) toolkit + - [jsoup](http://jsoup.org/) Java HTML Parser + - [language-detection](https://github.com/shuyo/language-detection) + library + - [metadata-extractor + API](http://www.drewnoakes.com/drewnoakes.com/code/exif/) + - [Mozilla charset + detector](http://sourceforge.net/projects/jchardet/) + - [Noggit](https://github.com/yonik/noggit) JSON parser + - [Restlet + Framework](https://restlet.com/projects/restlet-framework/) + - [SLF4J](http://www.slf4j.org/) Simple Logging Facade for Java + - [Spatial4j](https://www.locationtech.org/projects/technology.spatial4j) + spatial/geospatial library + - [Stax2 API](https://github.com/FasterXML/stax2-api) + - [TwelveMonkeys ImageIO + plugins](https://github.com/haraldk/TwelveMonkeys) : BMP, TIFF + - [Giant Java Tree](http://www.gjt.org/) TAR Package + - [WebCat](http://webcat.sourceforge.net) SWF Package + - [Weupnp](http://bitletorg.github.io/weupnp/) tiny UPnP client + - [Woodstox](http://wiki.fasterxml.com/WoodstoxHome) XML processor + - [XMP](http://www.adobe.com/devnet/xmp.html) Adobe's Extensible + Metadata Platform + - Build and tests utils + - [Apache Ant](http://ant.apache.org/) Building 
Environment + - [JRPM](http://jrpm.sourceforge.net/) + - [JUnit](http://jrpm.sourceforge.net/) testing framework + +# Issues + + - Stability + - Performance + - Languages support + - Very simple stemming + + + +_Converted from +, may be outdated_ + + + + diff --git a/docs/installation/requirements.md b/docs/installation/requirements.md index e108370..b9e98c9 100644 --- a/docs/installation/requirements.md +++ b/docs/installation/requirements.md @@ -23,16 +23,16 @@ environment on your system yet, you must install it before installing YaCy. GNU/Linux distributions may include, e.g. a free one called [OpenJDK](http://openjdk.java.net/install/). Otherwise, Java is available from the [Sun website](http://java.com/en/download/index.jsp). -The minimum Java version you need for YaCy is Java 7 — which might +The minimum Java version you need for YaCy is Java 11 — which might change in the future. Note that Apache Solr beeing a YaCy core component, it is a good idea to follow also Solr recommendations. For example, YaCy 1.82 includes Solr -4.10.3 : Java 1.7.0\_u55 or later is recommanded (see [Solr System +4.10.3 : Java 1.7.0_u55 or later is recommended (see [Solr System Requirements](https://lucene.apache.org/solr/4_10_3/SYSTEM_REQUIREMENTS.html)) Because of this — and the fact that the newer Java version is more -powerful than the old one — you should chose Java 7 right from the +powerful than the old one — you should chose Java 11 right from the start. If the only thing you want to do is just run Java programs you can make do with the [JRE (Java Runtime Environment)](https://en.wikipedia.org/wiki/Java_Runtime_Environment). 
If you want to develop diff --git a/docs/operation/autoupdate.md b/docs/operation/autoupdate.md new file mode 100644 index 0000000..4543e6a --- /dev/null +++ b/docs/operation/autoupdate.md @@ -0,0 +1,82 @@ +# Autoupdate + +## Using Autoupdate + +Just go to this page: + +You can either manually choose a release from the update locations and +update or enable the autoupdate feature. The autoupdate regularly +(configure the interval) checks the update locations. You can choose +between main (stable snapshot-releases ~ every 2 months) or the current +dev releases. The dev releases ending on 123 are often more experimental +than the others, so they are blacklisted by default. + +## Setup your own update location + +An updatelocation is just a HTML-Page with links to yacy-tarballs +conforming to the version-scheme. The updatelocation is configured per +network in the network-definition (for example +defaults/yacy.network.freeworld.unit): + + network.unit.update.location0 = http://your.domain.tld/yacyreleases/ + +## Setup signatures + +To be sure, YaCy really downloads the tarballs, the updatelocation admin +uploaded, you should provide signatures for your releases. DNS-spoofing, +man-in-in-the-middle-attacks or attacks on the webserver are not that +difficult. + +### Generate a private and a public key + +Configure the location where to put the private key with the property +*privateKeyFile* in *build.properties* . Then run + + ant genkey + +The private key will be created at the specified location. The public +key has the extension *.pub*. 
+ + +**Note:** The default value for privateKeyFile is "private.key", and the +private and public keys are created directly in the yacy folder. + + + +### Put public key in network-definition + +The network definition file should look like this: + + network.unit.update.location0 = http://your.domain.tld/yacyreleases/ + network.unit.update.location0.key = MIIBuDCCASwGQeEwx7V...(very long)...bukaPtQxr2p9y1QNZFauihmu4ak4AyT + +Now, YaCy will try to download the signature and check it. So provide +*.sig*-files\! + + + +**Note:** The default network definition file is located in +`defaults/yacy.network.freeworld.unit`. + +But the values can be changed during runtime as well here: + + + + + +### Generate a distribution tarball with signature + +Just run: + + ant clean dist sign + +You'll find the tarball and a .sig file in the RELEASE directory. Put +them on the update location. + + +_Converted from +, may be outdated_ + + + + diff --git a/docs/operation/crawlstart_p.md b/docs/operation/crawlstart_p.md new file mode 100644 index 0000000..c48a6a5 --- /dev/null +++ b/docs/operation/crawlstart_p.md @@ -0,0 +1,222 @@ +# Index Creation - Crawl Start + + + + + +[![](../images/thumb/yacy_crawlstart_p_svn6915_0.69_en.png/300px-yacy_crawlstart_p_svn6915_0.69_en.png)](./datei:yacy_crawlstart_p_svn6915_0.69_en.png.html) + + + + + + + +This page is available in the Index Control Index Creation via the address + and allows you to start new web +crawls by creating a new crawling profile. In case an administrator +password was set up, you have to log in as the administrator first. + +## Crawl Properties + +### Starting Point + +The "Starting Point" at "From URL" defines the page where the crawl is +started from (e.g. www.server.com). + + - **From URL:** Enter your website that should be crawled in the form + or or + maybe . This will then + get indexed with the specified crawl depth. 
+ + + + - **From Sitemap:** If the domain in "From URL:" lists a sitemap in + the robots.txt file, this sitemap will be shown here and can be used + as a crawl starting point. + + + + - **From File:** Use an HTML document generated from a script or editor + of your choice. All text links that are found in this HTML file will be automatically + crawled with the crawl depth "1" and indexed. On Linux you can split + a file by lines with `split`: `split -l 10000 Links.db Links.db` creates + files of 10000 lines each, named `Links.dbaa`, `Links.dbab`, and so on. + + + +### Create Bookmark + +This option works only if a "From URL:" was used as a starting point and +automatically adds the URL as a bookmark. + + - **Title:** Use a specific title for the bookmark + + + + - **Folder:** To crawl this URL regularly, use one of the default folders + defined: + - /autoReCrawl/hourly + - /autoReCrawl/daily + - /autoReCrawl/weekly + - /autoReCrawl/monthly + +Attention: The recrawl settings are folder-dependent. You can change +those settings in the file `/DATA/SETTINGS/autoReCrawl.conf`. + + + +### Crawling Depth + +Set the depth for the crawl to define how deep the crawl should be. + + - If you just want to index the page defined at the starting point, use + the crawl depth 0. + + + + - If you just want to index the page defined at the starting point and + all directly linked pages, use the crawl depth 1. + + + + - If you just want to index the page defined at the starting point and + all directly and indirectly linked pages, use the crawl depth 2. + + + + - ... + +A minimum of 0 is recommended and means that the page you enter under +"Starting Point" will be added to the index, but no linked content is +indexed. 2-4 is good for normal indexing. Be careful with the depth. +Consider an average branching factor of 20: a prefetch depth of 8 would +index 20^8 = 25,600,000,000 pages, which may be the whole WWW. 
+ + + +### Must-Match Filter + +With these three options you can select how the crawler accepts +URLs: + + - **Use filter** The filter is a regular expression that must match + the URLs that are to be crawled; the default setting + is 'catch all' (.\*). + +Example: to allow only URLs that contain the word 'science', set the +filter to '.\*science.\*'. + + - **Restrict to start domain** With this setting the crawler will only + accept URLs with the domain of the start URL. It is recommended to + increase the crawling depth here to fully crawl a single domain. + +Example: A crawl From URL: +would also crawl and index all linked pages with the active filter +.\*www.server.com.\* + + - **Restrict to sub-path** With this setting the crawler will only + accept URLs with the current sub-folder of the start URL. + +Example: A crawl From URL: +would also crawl and index all linked pages with the active filter + + + + +### Must-Not-Match Filter + +This filter must not match for the page to be accepted for +crawling. The empty string is a never-match filter which should do well +for most cases. If you don't know what this means, please leave this +field empty. + +### Re-crawl known URLs + +If you use this option, web pages that already exist in your +database are crawled and indexed again. Whether this is done depends on the age of the +last crawl: if the last crawl is older than the +given date, the page is crawled again, otherwise it is treated as +'double' and not loaded or indexed again. + + + +### Auto-Dom-Filter + +This option will automatically create a domain filter which limits the +crawl to domains the crawler finds at the given depth. You can use +this option, e.g., to crawl a page with bookmarks while restricting the +crawl to only those domains that appear on the bookmark page. The +adequate depth for this example would be 1. The default value 0 gives no +restrictions. 
+ +### Maximum Pages per Domain + +You can limit the maximum number of pages that are fetched and indexed +from a single domain with this option. You can combine this limitation +with the 'Auto-Dom-Filter', so that the limit is applied to all the +domains within the given depth. Domains outside the given depth are then +sorted out anyway. + + + +### Accept URLs with '?' / dynamic URLs + +A question mark is usually a hint for a dynamic page. URLs pointing to +dynamic content should usually not be crawled. However, there are +sometimes web pages with static content that is accessed with URLs +containing question marks. If you are unsure, do not check this to avoid +crawl loops. + +Example: *www.domain.com/index.php?page=start* + + + +### Store to Web Cache + +Use YaCy as a proxy cache at the same time. Crawled and indexed pages +are then saved in the proxy cache to speed up the access in case of another +request for the same page. This option is used by default for proxy +prefetch, but is not needed for explicit crawling. + + + +### Do Local Indexing + +This enables indexing of the web pages the crawler will download. This +should be switched on by default, unless you want to crawl only to fill +the Document Cache without indexing. The setting allows separate +options for *text* and other *media* formats. + +### Do Remote Indexing + +Here you can select who gets this "crawl job". If checked, the crawler +will contact other peers and use them as remote indexers for your crawl. +If you need your crawling results locally, you should switch this off. +Only senior and principal peers can initiate or receive remote crawls. A +YaCyNews message will be created to inform all peers about a global +crawl, so they can omit starting a crawl with the same start point. + + + +### Exclude static Stop-Words + +This can be useful to prevent extremely common words from being added +to the database, e.g. "the", "he", "she", "it"... 
To exclude all words +given in the file yacy.stopwords from indexing, check this box. + + +Now just click on "Start New Crawl" and your YaCy server starts indexing +this page. The progress of the crawl can be seen if you select Index +Control - Crawler Monitor, or Index Control - Crawl Results from the menu +panel on the left side. + + + +_Converted from +, may be outdated_ + + + + diff --git a/docs/operation/network-definition.md b/docs/operation/network-definition.md new file mode 100644 index 0000000..de73e41 --- /dev/null +++ b/docs/operation/network-definition.md @@ -0,0 +1,274 @@ +# Network definition + +The YaCy peer-to-peer network is completely decentralized and does +not require a single central server to hold the network together. + + + + +## Network-Bootstrapping + +Of course, a 'new' peer must know how to contact the other peers; for +that there is a so-called seed list. This list, generated by a peer in the +network, is used for the +network [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_node), +and any participant of the network can generate a seed list. A peer that +creates such a file may call itself a 'principal peer' and there can be +several of them. The network is therefore defined by very +specific peers, but these cannot be viewed as 'centers' because, in +principle, every peer can be principal. + +Now you may still ask yourself where a new peer finds the principal peers of +the network: these are stored in the network definition file of the network +that you want to join. A new YaCy peer can basically connect to any +network, but that doesn't mean you must set something complicated: every +installation includes a network definition file of the default network called +'freeworld'. It is located in `defaults/yacy.network.freeworld.unit` and +is set by the attribute `network.unit.definition` in `defaults/yacy.init`. 
+You don't have to adjust or change anything in that configuration file; +everything necessary for the "freeworld" network is already set. + +However, every YaCy user can define their own network, and this article +covers in detail how that works. + + +The processes that happen during bootstrapping are: + + - look up the `network.unit.definition` attribute in + `defaults/yacy.init`. The value of this attribute is either a path to + a file or the URL of a network definition. In the case of freeworld, the values are stored in + `defaults/yacy.network.freeworld.unit` + - read `defaults/yacy.network.freeworld.unit`; there are + attributes `network.unit.bootstrap.seedlist0`, + `network.unit.bootstrap.seedlist1` etc., which contain the URLs of + the seed lists + - the files from the values of `network.unit.bootstrap.seedlist0` (..) + are read. They contain so-called peer seeds. These are + brief pieces of information about the peers, stating which IP they have and + a lot more, like name, index size and so-called peer news + - some of the seeds gathered in this way are used to contact the peers in question + and send them a so-called 'peer ping'. In this ping + one's own peer propagates its own seed information, which + is stored by the pinged peer and sent to other peers. + - in response to the ping, the pinged peer sends the latest + information about other peers by sending their seeds. + Other peers then find out about one's own peer in the same way, because + the peers pinged in this way pass on the new seed. + +After the seed lists have been loaded once, a peer can even +find its way back to the network without loading the seed lists again. + + +## Definition of the YaCy network with yacy.network.unit + +In peer-to-peer mode, YaCy creates a network cluster of YaCy peers defined +for a specific domain of the web index. By default, this search network is the +public YaCy network, whose domain is the public Internet. 
The network is +defined through the bootstrapping, and all peers within the network must +have the same bootstrapping information with the properties of the network. +This information is stored in `yacy.network.unit`. + +The `yacy.network.unit` file is included in every YaCy instance from the +default settings (yacy.init) with the property `network.unit.definition`, +and defines the following properties, among others: + + + network.unit.name = freeworld + network.unit.description = Public YaCy Community + network.unit.domain = global + network.unit.dhtredundancy.junior = 1 + network.unit.dhtredundancy.senior = 3 + network.unit.bootstrap.seedlist0 = http://www.yacy.net/yacy/seed.txt + network.unit.update.location0 = https://download.yacy.net/ + + + +At YaCy startup, the network is set up as follows: + +1. `yacy.init` is loaded: the property `network.unit.definition` + in yacy.init denotes `yacy.network.unit` as the network definition. +2. `yacy.network.unit` is loaded: the property + `network.unit.bootstrap.seedlist0` in yacy.network.unit is set to + the URL of a list of seeds from the network +3. the seed list at that URL is loaded and the seeds in it + are loaded into the seed DB. +4. The seeds contain information about the last known peer addresses. + +The peers of this network all use the name mentioned in yacy.network.unit +(`network.unit.name`) to identify themselves as participants in the same network. +The property `network.unit.description` is just a freely +definable text that is displayed in the network graphic, for example. + +A very important piece of information is the web domain which the network +indexes. The associated property is called `network.unit.domain`. +The domain can take the following values: + + - `global`: only URLs that are freely accessible are accepted + into the index. + - `local`: only URLs that are accessible on an intranet will be + accepted. This is useful, for example, when indexing an + intranet. + - `any`: both local and global addresses are accepted. 
+ +Another network-related setting is the "redundancy factor", the number +indicating how many copies of the index are distributed within the DHT. In a +public network, the availability of a peer cannot be ensured and therefore +the redundancy factor is `3` (`network.unit.dhtredundancy.senior`). In a +network with high availability, this factor can be set to `1`. + +If all peers in a network are administered by a single person, you may want +an automatic update of all the peers to take place. To do this, a download +location must be defined, and a network operator can specify their own location to +be able to control the version for the automatic update. To do this, the +property `network.unit.update.location0` can contain the URL of a page +that contains links to releases. See [autoupdate](./autoupdate.md) for details. + +You can provide additional alternative addresses both for +`network.unit.bootstrap.seedlist0` and `network.unit.update.location0`; you +can simply add additional properties with the same name and an increased +sequence number. + + +## Creation of your own YaCy network + +The network definition must be the same for all participants in a network, +and this is achieved - in the standard case - by including it in the +release. + +A YaCy network operator may be interested in changing the network definition +after the network has been set up for all peers, for example for advanced +security settings for the network. Hence setting +network.unit.definition in yacy.init is also possible via URL, which we use +in the following example. + +The construction of a new network consists of two major steps: first the +definition of the first peer of the new network, and then the deployment +of the other peers assigned to the first peer. 
+ + +### Configuration of the first peer of a new network + +The steps are: + + - Edit the yacy.network.unit, for example + to index both local and global websites, but with the update + address of new releases from the global network: + + + + network.unit.name = mynet + network.unit.description = My first very own YaCy network + network.unit.domain = any + network.unit.dhtredundancy.junior = 1 + network.unit.dhtredundancy.senior = 1 + network.unit.bootstrap.seedlist0 = http://www.meindomain.invalid/yacy.myseedlist + network.unit.update.location0 = https://download.yacy.net/ + + - Upload this file, for example to: + + - Configure this definition for each peer before the initial + start-up by opening the `yacy.init` file and setting the value + + + + + network.unit.definition = http://www.meindomain.invalid/yacy.mynetdef + + +This is only necessary for the initial installation; across further updates of the +peers the value must remain constant and cannot be set again. + + - Now, the first peer of the network can be started. For the second peer + to find it, it must know its IP. YaCy usually uses the peer ping to + have another peer report its public IP. For + the first peer of a new network, this is not possible because there + is no other peer yet that could respond to a peer ping. Instead, your + own public IP must be assigned via the menu item + http:///Settings_p.html?page=ServerAccess , + by configuring the IP in the StaticIP setting. + + - The first peer must operate as a principal peer, i.e. it must + be able to create a seed list so that the newly + started peers can find the first peer + - Under + http:///Settings\_p.html?page=seed + the upload address for the seed list can be defined. This + was already entered in yacy.mynetdef and + was + - check whether the peer reaches principal status, i.e. + it was able to create a seed list and complete the upload. 
+ + +### Configuration of the participating peers of the new network + +Once the first peer is running, additional peers can be added. They load +the seed list and contact the principal peer, which provides the new +information again through a seed list upload to the newly connected peer. + +To ensure that new peers can automatically access the new network, you can +make a special YaCy release with the settings of the new network. A +new peer that is correctly configured to participate in the new network can +also be set up from a normal, unchanged release and updated without +losing network membership. The steps to define the special release are: + + - Configuring the network definition in yacy.init: + + + + + network.unit.definition = http://www.meindomain.invalid/yacy.mynetdef + + - It also makes sense to set up the automatic updates of the peers: + + + + update.process = auto + + - To avoid the necessity of entering the passwords via the setup menu during mass + installations, it is recommended to set a default password in + yacy.init, according to the following example: + + + + + adminAccount=admin:myS3cr3tPa55w0rd + + - Once the yacy.init is fully configured, you can create your own + bootstrap release for your own network, simply by packing the + yacy directory: + + + + tar cf yacy_mynet.tar yacy + +For a mass deployment, all you have to do to install YaCy on the network's +computers is to distribute and unpack the file `yacy_mynet.tar`. + +To ensure permanent availability of the YaCy installation, it's +recommended to define a cron job that regularly restarts the installed peer. 
+For example, use the following entries in the `/etc/crontab` file: + + + 0 0 * * * yacyuser /home/yacyuser/yacy/stopYACY.sh + 2 0 * * * yacyuser /home/yacyuser/yacy/killYACY.sh + 4 0 * * * yacyuser /home/yacyuser/yacy/startYACY.sh -l + +or, simply: + + 0 3 * * * yacyuser /home/yacyuser/yacy/restartYACY.sh + +provided that YaCy runs under the user 'yacyuser' and that the +YaCy directory is located in the yacyuser home directory. + + +For more network settings see also: +[yacy.network.readme](https://github.com/yacy/yacy_search_server/blob/master/defaults/yacy.network.readme). + + + +_Converted and translated from German from +. May be outdated._ + + + + diff --git a/docs/operation/performance.md b/docs/operation/performance.md new file mode 100644 index 0000000..59d1e78 --- /dev/null +++ b/docs/operation/performance.md @@ -0,0 +1,370 @@ +# Performance Tuning + + +**YaCy's crawling and indexing performance can be dramatically +enhanced.** The default settings in a standard YaCy release are not +preset for maximum performance, because the software is meant to run on +personal computers that are mainly used for other purposes. Too high +performance settings would eat up all CPU time, memory and IO bandwidth. +But YaCy can be specialized for a high-performance web-search production +system. + +Depending on your computing environment, you can use one or all of the +following recommendations to modify YaCy. Please be aware that all +changes can also have some unwanted side-effects. + + + +## Increase Memory Usage + +This means that at start-up YaCy takes more memory from the OS. + +#### How-To + +Open the Performance Page, select the 'Memory Settings for Database +Caches' Submenu. Under 'Memory Settings' increase 'Maximum used memory'; +click 'Set'. Then restart YaCy. + +#### Effect + +This is a prerequisite for the following performance settings. 
It can also +speed up YaCy if memory is low and there are frequent Garbage +Collections. + +#### Side-Effects + +You decrease the available memory for other applications on your system. + +#### Why is this not done by default? + +YaCy wants to be nice to the average computer user and their systems. +Modern computers have 512MB RAM or more. We believe that 96MB for YaCy +as default is a good tradeoff between performance and resource +allocation. + + + +## Increase Indexing Cache + +Indexing is the process of creating a Reverse Word Index (RWI) +data structure from a given set of text documents. It means that a +document-words relation is reversed to a word-documents relation. This +can be enhanced using a word-documents relation write cache. There are +currently two write caches of that kind: one for RWIs that are supposed +to be transmitted to other peers (DHT-Out) and one for RWIs that shall +be stored on the own peer (DHT-In). But unfortunately the DHT-Out cache +fills up faster than it is possible to send the RWIs away to other peers, so +they are (temporarily) stored to the own RWI index file(s). Flushing to +the file is IO-expensive, and the greater the cache the fewer IO events +happen. + +#### How-To + +Open the Performance Page. Within the 'Cache Settings' table, you can +see some input fields. The 'Maximum number of words in cache' value can +be increased (e.g. 90000 if you have assigned 1GB RAM in the previous +step). You can do this for DHT-Out and DHT-In. Normally more words are +stored in DHT-Out, because only a fraction of the words that you index +are stored on your own peer. Be aware that this value is decreased +automatically if a low-memory event occurs, so that words are flushed and +memory is freed again. This value is then automatically reset to +'Initial space of words in cache', so please increase this value also. +The next two values 'word flush divisor' are used to determine how many +words shall be flushed to disk after each document is indexed. 
There are +two values, one for busy cycles and one for idle cycles. That means you +can decide that the cache is flushed faster if the peer is busy. For example: +if you set the busy divisor to 10000, then 5 words are flushed after +indexing a page when your word cache has 50000 words in it. + +#### Effect + +Indexing time decreases, PPM (pages per minute) increases. + +#### Side-Effect + +This needs a lot of memory. If you set the values too high, this may cause +frequent Garbage Collections (GC) and that may slow down overall speed +dramatically. If you increase cache space, frequently visit the +performance page and check if the complete memory is taken (at 'Memory +Settings for Database Caches'). + +#### Why is this not done by default? + +It needs higher memory assignment by default. Please see 'Increase +Memory Usage' above. + + + +## Decrease Waiting Time Between Scheduled Tasks + +YaCy has a thread organisation for the processing of queues. Each queue +contains entries for special tasks, e.g. there is a queue with URLs +that wait to be fetched, there is a queue with documents that wait +to be indexed and so on. Between each job of every task there is a pause +to give other processes on the owner's computer more CPU and IO time. +This must be done with pauses in YaCy, because most operating systems only handle +CPU priority and time-slicing, but not IO-usage balancing between +processes. + +#### How-To + +Open the Performance Page. At the 'Scheduled tasks overview and waiting +time settings' you can see some input fields for delay values. Look at +the 'Delay between busy loops' column: these are the delay values in +milliseconds that are used to pause between every queue processing. + + - if you want to speed up crawling, decrease the 'Local Crawl' value. + **PLEASE** do not set this to zero, because that may cause too + heavy load on the target HTTP server. + - if you want to speed up indexing, decrease the 'Parsing/Indexing' + value. 
+ +#### Effect + +Queues are worked-off faster. If the delay values are well-balanced, +then this may cause better indexing speed. + +#### Side-Effect + +If you fetch pages too fast, this may cause denial-of-service +effects on target web servers. There is a built-in load-balancing between +target domains, but that may not help if you are crawling only a single +domain. Please try to avoid this case. For all other values: no pauses +between loops may leave your system unusable for tasks other +than YaCy, because YaCy then eats up all IO bandwidth and CPU time. + +#### Why is this not done by default? + +To protect the user from doing DoS-by-mistake and to implement an +'IO-nice' so that the user's computer is not blocked. + + + +## Switch to Robinson Mode + +If you want to use the indexing result only on your own private search +portal, you can switch off index distribution, index receive and remote +indexing. We call that the Robinson mode. Because index distribution is +synchronized with indexing tasks, the indexing is slower when index +distribution is switched on. There is no circumvention of +synchronization by implementation of a separate DHT transmission thread, +because both processes would access the same databases at the same time +and conflicting IO would cause less performance. + +#### How-To + +Open the 'Basic Configuration' Page and click on the 'Network' sub-menu. +Check the 'Robinson Mode' button. You can then select which kind of +Robinson mode you want to activate: - if you want complete separation +and invisibility to other peers, choose 'Private Peer' - if you want +content-separation, but visibility to other peers (they are allowed to +search your peer), choose 'Public Peer' - if you want a cluster of +public peers, choose 'Public Cluster'. You can define the cluster by +simply naming the other cluster peers in a comma-separated list.
The +Form of the names are .yacy + +#### Effect + +Because DHT transmissions are synchronized with the indexing within the +'Parsing/Indexing' queue (see above), indexing is sped up if there is +no DHT transmission. Furthermore, your web index is not mixed with +indexes from other peers. + +#### Side-Effect + +When index distribution or index receive is switched off (or both), then +YaCy does not permit a global search. If a web search is started, only +indexes from the own peer are used. This functional limitation was set +to ensure that the peer-to-peer principle of give-and-take is preserved. +In other words: if you switch to Robinson Mode you can use YaCy only as +your own indexing/search portal. + +#### Why is this not done by default? + +Without index distribution there would not be a global search engine. + +## Increase Number of Crawl Threads + +If your web-crawl is well-balanced (many domains) and crawling is still +too slow (indexing queue is empty and cannot be filled fast enough by +the crawler), then it is recommended to increase the maximum number of +active crawl threads. + +#### How-To + +Open the Performance Page. At the 'Thread Pool Settings' table you see +input fields for maximum active crawl threads. Increase this number, but +limit it to a number that is not too big for your (cheap) router. + +#### Effect + +The number of concurrent http-fetch requests to target web servers +increases. This can speed up crawling. + +#### Side-Effect + +Your router may not be able to handle so many concurrent requests. + +#### Why is this not done by default? + +To be compliant with minimum requirements of cheap network equipment, +and to protect target servers from being accessed with too many +requests at the same time. + + + +## Do Not Monitor the Crawler + +After a web indexing is started, you see the Crawler Monitor page. This +page uses Ajax technology to load several xml files from the built-in +web server, which are constructed by doing database lookups.
This creates a +constant IO usage which conflicts with the IO needs during crawling. + +#### How-To + +After you started a crawl, do not leave the Crawler Monitor page open. +You can monitor the PPM number also at the Status page and at the +Network page. + +#### Effect + +No additional IO is created that conflicts with indexing. Indexing gets +faster. + +#### Side-Effect + +You cannot see the Crawling Monitor page. + +#### But why is there this feature if it decreases speed? + +That would mean that we should not have something like the Crawler +Monitor page. But that's such a strong nice-to-have (as heard many times) +that we recently implemented that. + +## Switch Off File Sharing + +Other applications that create strong IO or IP load cause YaCy to work +more slowly. File sharing software creates both strong IO and IP load. +There is no need to shut down file sharing, but doing so will increase the speed +of YaCy. + +## Re-Boot your Router + +Cheap routers cannot handle many open network connections properly. +In case that network connections get lost, they may even turn into +zombie threads. When doing a web crawl it typically occurs that many +unresolved links are accessed, which may cause this problem. If +your internet connection gets constantly slower, then the most probable +cause is not heavy load from YaCy, but too many zombie threads in your +router. A re-boot of the router solves that problem and increases +internet speed again. + +## Start Several Crawls + +It may appear strange, but starting several crawl jobs can increase +crawling speed because that may help to balance the http-fetch over +different domains. If the servers at the different domains are slow, +then many jobs will cause a balancing over these domains which can +increase crawling speed. + +## Move DATA to a RAID + +This was never tested, but storage of the RWI on a RAID can speed up +indexing because indexing is such a heavy IO job.
+ +## Put Parts of the Index to Other Disc + +This would be a nice alternative to the RAID idea: set symbolic links +for paths of the index storage to another IO device. Doing so, you +divide the IO over several devices which can give more overall IO speed. +A path that is appropriate for separation to another disc could be +DATA/INDEX/*foo*/SEGMENTS/*bar*/; this is the directory where the RWIs +are stored, *foo* and *bar* are freeworld/default for default settings. + +## What the heck is going on? + +Now, the following isn't even remotely relevant to YaCy, and yet it is. +The `sysstat` tools are a neat package of utils available for variants +of the [GNU Operating System](http://www.gnu.org/) such as +[Linux](http://www.kernel.org/). + +Running the following...: + + while true; do clear; tail -n 26 DATA/LOG/yacy00.log; vmstat; iostat; sleep 20s; done + +or just + + tail -f DATA/LOG/yacy00.log + +*(just press ctrl-c when you feel overwhelmed by the immensity of this +information)* + +..may give you information which could be used to make good decisions as +to what the effects of your adjustments to YaCy settings are on your +system. + +You may prefer to just tail the log in one +*[screen](https://www.gnu.org/software/screen/manual/html_node/index.html)* window and run `iostat +10` or `vmstat 10` in another (the parameter is the delay between +updates, but make no mistake: These tools have many more useful +options). + +### vmstat + +**IO** (In/Out) is data being read and written to the disk. The rest of +the system has to wait for this, so the system will seem utterly slow +when there is very high IO activity. + +**Swap** is memory being swapped in from and out to your swap space. If you have +such IO activity then you should *reduce* the amount of memory made +available to YaCy: There is too much in memory and it is being +swapped out - and this is very bad. *cache* is disk cache and in reality +free memory; it's dropped once the memory is needed by a program.
+ +IO shows you blocks in and blocks out. High numbers are very sad and +depressing: They mean that there is a huge amount of disk activity. + +The CPU field shows you (us)er processes, (sy)stem processes, (id)le +activity and your worst nightmare: (wa)iting for IO. An example of this +is: + + procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- + r b swpd free buff cache si so bi bo in cs us sy id wa + 0 0 24600 3584 4624 383804 0 1 203 185 6 4 8 2 75 15 + 0 0 24600 7284 4576 380064 0 0 625 60 2016 297 11 9 68 12 + 0 1 24600 2472 4636 384516 0 0 749 49 1976 300 5 2 79 14 + 0 1 24600 4072 4656 382756 0 0 1365 730 2048 347 6 1 65 28 + 0 1 24600 2716 4672 384032 0 0 1373 398 2105 301 29 19 38 15 + 0 1 24600 3712 4652 382844 0 0 1223 74 2085 325 15 16 60 9 + 0 0 24600 4076 4600 383520 0 0 1959 639 2058 353 5 2 53 41 + +This system appears to be doing fine, since it's not (wa)iting that much +for IO. + +In contrast: The following box..: + + procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- + r b swpd free buff cache si so bi bo in cs us sy id wa + 0 2 59104 2404 2528 143560 0 0 1247 110 2593 493 28 2 9 61 + 0 0 59104 2292 2380 143520 0 0 1393 33 2679 442 9 8 5 77 + 0 1 59104 1160 2180 145600 0 0 1403 3 2689 380 6 6 16 73 + 0 2 59104 1380 2092 145716 0 0 1601 140 2503 431 4 2 6 87 + 0 3 59100 1032 1892 145864 30 0 1580 179 2748 489 19 4 7 69 + 0 2 59100 1996 1760 145136 0 0 1485 59 2770 447 12 2 0 86 + 0 2 59100 2200 1768 144840 0 0 1124 10 2663 466 3 2 16 78 + +...is mostly (wa)iting for bytes being read from (and occasionally written +to) the storage device. Almost no CPU cycles do useful work, since the CPU is +busy (wa)iting for bytes to be read. This is very bad and means that you +may want to take configuration steps in order to reduce IO activity.
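Reading the `wa` column by eye gets tedious over longer runs. The following is a minimal sketch, not part of YaCy, that averages a column of captured `vmstat` output. The function names `parse_vmstat` and `average` are hypothetical helpers introduced here for illustration, and the sample rows are copied from the second table above.

```python
# Sketch only: parse_vmstat() and average() are hypothetical helpers,
# not part of YaCy or the sysstat/procps tools.

SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  2  59104   2404   2528 143560    0    0  1247   110 2593  493 28  2  9 61
 0  0  59104   2292   2380 143520    0    0  1393    33 2679  442  9  8  5 77
 0  1  59104   1160   2180 145600    0    0  1403     3 2689  380  6  6 16 73
"""


def parse_vmstat(text):
    """Turn vmstat output into a list of {column name: int} rows."""
    columns = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The header row names the columns: r b swpd free ... id wa
        if fields[0] == "r" and "wa" in fields:
            columns = fields
            continue
        # Data rows are purely numeric and match the header width;
        # the "procs ---memory---" banner line is skipped by this check.
        if columns and len(fields) == len(columns) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(columns, map(int, fields))))
    return rows


def average(rows, column):
    """Arithmetic mean of one column over all parsed rows."""
    return sum(row[column] for row in rows) / len(rows)


rows = parse_vmstat(SAMPLE)
print(f"average wa: {average(rows, 'wa'):.1f}%")
print(f"average id: {average(rows, 'id'):.1f}%")
```

To use it on live data, capture a few samples first (for example `vmstat 10 6 > vmstat.txt`) and pass the file contents to `parse_vmstat`. Note that the first row `vmstat` prints reports averages since boot, so you may want to drop it before averaging.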
+ + + + + +_Converted from +, may be outdated_ + + + + diff --git a/docs/operation/yacy-tor.md b/docs/operation/yacy-tor.md index 0b6b3f4..c14fd53 100644 --- a/docs/operation/yacy-tor.md +++ b/docs/operation/yacy-tor.md @@ -24,8 +24,7 @@ should look like this: `*.onion/.*` Thread about Whitelisting feature: - - - + - @@ -39,7 +38,7 @@ index both Tor hidden-services and normal Web sites. ## Help Should you have questions or need help, go to the [English YaCy -forum](http://www.huzzaar.com/yacy-forum/) +forum](https://community.searchlab.eu/) ## Part 1 - Configuring Tor and Privoxy @@ -369,7 +368,7 @@ located in there, e.g. `chown -R yacy: ./` | Thread about Whitelisting feature: - - + - ~~YaCy only supports a blacklist by default, therefore you have to download @@ -379,7 +378,7 @@ filter is available.~~ Sorry, but this Whitelist can't be used at this moment: - - + - Now we just have to make an entry to only index *.onion* sites: @@ -398,7 +397,7 @@ accept only \*.onion domains. #### Defining the YaCy-Tor-network By now, YaCy is able to build and define separated networks: -[Netzdefinition](./de:netzdefinition.html "De:Netzdefinition") +[Network Definition](./network-definition.md) The current definitions can be downloaded from [\[2\]](http://byi4akelnxrz5tab.onion:8081/yacy.network.unit.tor) and diff --git a/docs/operation/yacyoverhttps.md b/docs/operation/yacyoverhttps.md index b307a81..1e19491 100644 --- a/docs/operation/yacyoverhttps.md +++ b/docs/operation/yacyoverhttps.md @@ -1,6 +1,6 @@ # Using the YaCy Front-End over HTTPS -It is possible to put a SSL encoding in front of YaCy to get the YaCy +It is possible to put a SSL encryption in front of YaCy to get the YaCy interface accessible using https. This can easily be done using stunnel and openssl. 
@@ -53,6 +53,3 @@ Now the YaCy search page can be opened at _Converted from , may be outdated_ - - - diff --git a/docs/use_cases.md b/docs/use_cases.md new file mode 100644 index 0000000..9ab441c --- /dev/null +++ b/docs/use_cases.md @@ -0,0 +1,158 @@ +# Use cases + +This list describes [use cases](http://en.wikipedia.org/wiki/Use_case) +and the respective sequences of events that supply you with the desired +results using YaCy. + +## Search Functions + +### Alternative Web Search + +You are looking for www documents, but - for different reasons +(availability, censorship, other search portals' ranking and sorting) - +you want to find websites that have been indexed by a community of +independent or topic-oriented (see: different networks; freeworld; TOR; +sciencenet) YaCy peers. + + - [Install](download_installation.md) + a YaCy standard release without any modifications. + - Open and use the YaCy search page. + +### Portal Search + +You own a web portal and need a search function for your pages. Users of +your website shall be able to find documents within your site or its +sub-domains. Users of the public YaCy network (freeworld) shall get +results containing links to your site. + + - [Install](download_installation.md) + YaCy on your (v)server. + - Switch to Robinson mode (unilateral index separation from other + users). + - Start a [web crawl](operation/crawlstart_p.md) + that is restricted to your domain. + - Integrate the YaCy search page into your web portal. + +### Topic Search + +You own a web portal on a special-interest topic and want to change it +into a search portal which offers a web search especially on this +subject. Your users shall be able to get search results from your site +and the sites you recommended alike from a single search dialogue. + + - [Install](download_installation.md) YaCy on your (v)server. + - Switch to Robinson mode (unilateral index separation from other + users). 
+ - Start several [web crawls](operation/crawlstart_p.md) + that are restricted to your domain and the ones you want to include. + - Integrate the YaCy search page into your web portal. + +### Intranet Search + +An intranet you administer needs a search function. Users of this search +shall be able to reach all pages of your intranet from a single search +page. (Use similar to a 'search appliance'; see: GSA) + + - [Install](download_installation.md) + YaCy on a server inside your intranet. + - Reconfigure the standard network affiliation to 'intranet'. + - Start an unrestricted [web + crawl](operation/crawlstart_p.md) + with a page on your intranet (if interlinking exists) or to several + pages, one for each intranet server. + - Integrate the YaCy search page into your intranet portal. + +### Collaborative Desktop Indexing + +You and your colleagues need a common search function for documents +which are stored on your private computers and not on shared drives. +Each member of the group can restrict the shared use to certain +documents. All documents are to be published via a web service. A search +shall include all shared documents. Other persons' frequently-used +documents are to be accessed by means of a bookmark function. + + - All group members + [install](download_installation.md) + YaCy on their computers. + - Reconfigure the standard network affiliation to 'intranet'. + - All users publish their documents to be shared in their + `/DATA/HTDOCS/repository/` directory. + - Everybody starts a [web + crawl](operation/crawlstart_p.md) + at the address `http://:8090/repository/`. + - All users can find all their own documents and those of the other + team members through their own YaCy search + page + - Found documents will be served automatically through YaCy's built-in + web server. + - Found documents can be bookmarked by means of YaCy's built-in + function. + - Users can publish their bookmarks. 
+ - Other users' published bookmarks can be imported as private ones. + + + +## Personal Web Assistance + +### Personal Bookmark Server + +You have a huge number of bookmarks on your private and on your company +computer. You want to use all of them on both computers without +publishing them. While you are on a journey, you also want to be able to +access your personal bookmarks from any computer. + + - [Install](download_installation.md) + YaCy on your home computer and enable public access via dyndns or + install YaCy on a (v)server on the internet. + - Export your bookmarks to YaCy. They can be accessed by tags. + - For each entry, you can select if it is to be public or private. + - On your company computer, you can access your bookmarks, for + instance, through the + [YaCyBar](https://github.com/yacy/YaCyBar) + (Firefox plug-in) or the YaCy web front-end. + - While on a journey, you can access your bookmarks through the YaCy + web front-end. + + + +## Web Analysis + +### Detection of All Dead Links on Your Web Portal + +You administer a web portal and want to remove faulty links from your +site. + + - [Install](download_installation.md) + YaCy. + - Switch to Robinson mode (unilateral index separation from other + users). + - Start a [web + crawl](operation/crawlstart_p.md) + that is restricted to the domain to be checked. + - Examine the crawl log and filter out all messages about unreachable + pages. + - Use the unix/linux strings command to extract a comprehensive list + of all dead links. + +### Generation of a Domain List + +You need a list of domain names which are accessible on-line and provide +contents. + + - [Install](download_installation.md) + YaCy. + - Start a [web + crawl](operation/crawlstart_p.md) + A good starting point is a highly-interlinked portal page. + - Use the built-in export function for URL and domain lists. You can + refine the exports using filters. + + + + +_Converted from +, may be outdated_ + + + +