Hello, I'm still not too familiar with the project. I did the design enhancements for the landing page, and now I want to do the same for the search results page. So I'm trying to add some sites so that something shows up on that page - hopefully enough to fill multiple pages, so I can see the pagination. I've tried to run src/indexing/bulkimport/wikipedia/import.sh, which contains a comment explaining how to run it from the container. Like I mentioned, I'm no Docker guru - I just know how to bring containers up, build them, look at logs and such. The instruction in that comment, in my case, needs to be run with …
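(The general shape of running one of these scripts inside a container is something like the sketch below. The container name and in-container path are assumptions - check `docker ps` and the script's own comment for the real ones.)

```sh
# Find the name of the running container that hosts the indexing code
docker ps

# Open a shell inside it (container name here is an assumption -
# substitute whatever `docker ps` reports)
docker exec -it searchmysite_indexing_1 bash

# Then, inside the container, run the import script from wherever the
# source tree is mounted (path is illustrative):
# ./bulkimport/wikipedia/import.sh
```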
Great question.

First a quick clarification - the src/indexing/bulkimport/wikipedia/import.sh you ran is for the bulk load of wikipedia. It hasn't been maintained since wikipedia indexing was stopped, and now fails pretty early on. The idea was that src/indexing/bulkimport/ would contain scripts to bulk load content into the search engine directly, while src/db/bulkimport/ would contain scripts to bulk load site details into the database for the indexer to pick up and index as normal. Apologies for the confusion - I've updated the README accordingly. Fortunately, the wikipedia import now fails before downloading the whole of wikipedia :-)

Regarding the indexing of test sites on local dev - there are a couple of options I mention at https://github.com/searchmysite/searchmysite.net/blob/main/README.md .

There is also a 3rd option I don't mention in the README - a backup of the production search collection, in a format that can easily be imported into local dev via localhost:8983/solr/content/replication?command=restore&location=/var/solr/data/userfiles . At the moment, with approx 1,500 sites, the backup is around 640MB, so it isn't too difficult to work with (for comparison, the compressed wikipedia download was 32.5GB). Not sure of the best place to host it though. It is too big for GitHub. I guess I could copy it from the search container to the web container and expose it via the web server, like wikipedia do for their exports, although I'd need to make sure it doesn't end up increasing storage costs.
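To try the restore, the request can be issued with curl. A minimal sketch, assuming Solr is exposed on localhost:8983 as in local dev and the backup has already been placed in /var/solr/data/userfiles inside the search container (restorestatus is the standard Solr replication-handler command for checking progress):

```sh
# Trigger a restore of the "content" core from the backup location
curl "http://localhost:8983/solr/content/replication?command=restore&location=/var/solr/data/userfiles"

# Poll until the restore reports success
curl "http://localhost:8983/solr/content/replication?command=restorestatus"
```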
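And if the backup does end up being exposed via the web container, the copy step could go via the host with docker cp. A sketch, where the container names, snapshot directory name, and document root are all assumptions:

```sh
# Copy the backup out of the search container to the host...
docker cp search:/var/solr/data/userfiles/snapshot.content ./snapshot.content

# ...and into the web container's document root for download
docker cp ./snapshot.content web:/var/www/html/downloads/
```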