Change robots.txt #1709

We had an old robots.txt for the wmflabs domain, where Scholia was served
under a subdirectory. Since the Toolforge Scholia domain has changed to
scholia.toolforge.org, that robots.txt no longer had a correct path and
was thus ineffective.
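For illustration (a sketch; the old rule is taken from the diff below, and
the URLs reflect the usual wmflabs/Toolforge layouts):

    # Under tools.wmflabs.org/scholia/ this rule matched the tool's paths:
    User-agent: *
    Disallow: /scholia/

    # Under scholia.toolforge.org there is no /scholia/ prefix, so the
    # same rule matches nothing and crawlers see no restriction.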

Search engines index dynamic content on Scholia pages differently:
Bing and Qwant seem to index the content, but DuckDuckGo and Google
apparently do not; see #1709.

With this change, not only is the path changed, but bots are now allowed.
If this results in too much load on the Toolforge infrastructure, it
should be changed to 'Disallow: /'.

Note that the 'robots' HTML meta tag on each Scholia page has a nofollow
to avoid crawling.
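In the rendered pages this presumably corresponds to a tag along these
lines (a sketch based on the 'noindex, nofollow' content named in the
docstring below, not a verbatim copy from the templates):

    <meta name="robots" content="noindex, nofollow">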
fnielsen committed Nov 30, 2021
1 parent 1c66181 commit ec82fe7
Showing 1 changed file with 22 additions and 2 deletions.
24 changes: 22 additions & 2 deletions scholia/app/views.py
@@ -1984,14 +1984,34 @@ def show_publisher_empty():
 def show_robots_txt():
     """Return robots.txt file.
 
+    A robots.txt file is returned that allows bots to index Scholia.
+
     Returns
     -------
     response : flask.Response
-        Rendered HTML for publisher index page.
+        Rendered plain text with robots.txt content.
 
+    Notes
+    -----
+    The default robots.txt for Toolforge hosted tools is
+
+        User-agent: *
+        Disallow: /
+
+    Scholia's function returns a robots.txt with 'Allow' for all. We would like
+    bots to index, but not crawl Scholia. Crawling is also controlled by the
+    HTML meta tag 'robots' that is set to the content: noindex, nofollow on all
+    pages. So Scholia's robots.txt is:
+
+        User-agent: *
+        Allow: /
+
+    If this results in too much crawling or load on the Toolforge
+    infrastructure then it should be changed.
+
     """
     ROBOTS_TXT = ('User-agent: *\n'
-                  'Disallow: /scholia/\n')
+                  'Allow: /\n')
     return Response(ROBOTS_TXT, mimetype="text/plain")
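A minimal sketch of how the new behaviour could be checked with Flask's
test client, assuming the app factory scholia.app.create_app and that the
view is routed at /robots.txt (both are assumptions; adjust to the actual
app setup):

    from scholia.app import create_app  # assumed factory name

    def test_robots_txt_allows_bots():
        # Assumes show_robots_txt is routed at /robots.txt.
        client = create_app().test_client()
        response = client.get('/robots.txt')
        assert response.mimetype == 'text/plain'
        assert b'User-agent: *' in response.data
        assert b'Allow: /' in response.data
        # Combined with the previous check, this rules out 'Disallow: /'.
        assert b'Disallow' not in response.data

On the live service the same check is a plain GET of
https://scholia.toolforge.org/robots.txt.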


