bookworm build all #91

Closed
tpmccallum opened this issue Jan 5, 2016 · 11 comments

Comments

@tpmccallum

What follows is a log of my latest build. Most of the issues were resolved along the way, and I am recording the sequence of events in case it helps others.

One specific issue is still unresolved; it is described at the very end of this page.

I get the following error when running bookworm build all:

make -f /usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/etc/bookworm_Makefile .bookworm/metadata/jsoncatalog_derived.txt
make[1]: Entering directory `/mnt/data/single_file_bookworm/BookwormDB'
cat .bookworm/metadata/jsoncatalog.txt | parallel --pipe bookworm -l WARNING -d mccallum prep catalog_metadata > .bookworm/metadata/jsoncatalog_derived.txt
cat: .bookworm/metadata/jsoncatalog.txt: No such file or directory
make[1]: Leaving directory `/mnt/data/single_file_bookworm/BookwormDB'
make -f /usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/etc/bookworm_Makefile .bookworm/texts/textids.dbm
make[1]: Entering directory `/mnt/data/single_file_bookworm/BookwormDB'
cat .bookworm/metadata/jsoncatalog.txt | parallel --pipe bookworm -l WARNING -d mccallum prep catalog_metadata > .bookworm/metadata/jsoncatalog_derived.txt
cat: .bookworm/metadata/jsoncatalog.txt: No such file or directory
bookworm -l WARNING -d mccallum prep preDatabaseMetadata
Traceback (most recent call last):
  File "/usr/local/bin/bookworm", line 9, in <module>
    load_entry_point('bookwormDB==0.4.0', 'console_scripts', 'bookworm')()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 556, in run_arguments
    getattr(my_bookworm,args.action)(args)
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 234, in prep
    getattr(self,args.goal)()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 267, in preDatabaseMetadata
    Bookworm = bookwormDB.CreateDatabase.BookwormSQLDatabase()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/CreateDatabase.py", line 144, in __init__
    self.setVariables(originFile=variableFile)
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/CreateDatabase.py", line 161, in setVariables
    self.variableSet = variableSet(originFile=originFile, anchorField=anchorField, jsonDefinition=jsonDefinition,db=self.db)
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/variableSet.py", line 476, in __init__
    self.jsonDefinition = json.loads(open(jsonDefinition,"r").read())
IOError: [Errno 2] No such file or directory: '.bookworm/metadata/field_descriptions_derived.json'
make[1]: *** [.bookworm/metadata/catalog.txt] Error 1
make[1]: Leaving directory `/mnt/data/single_file_bookworm/BookwormDB'
make: *** [.bookworm/targets/encoded] Error 2

@tpmccallum
Author

I had the following layout set up, as per the documentation
< https://github.com/Bookworm-project/BookwormDB/blob/master/README.md >:

folder/
    | field_descriptions.json
    | jsoncatalog.txt
    | input.txt

But I was getting errors like this:

IOError: [Errno 2] No such file or directory: '.bookworm/metadata/field_descriptions.json'

It also looked like the build was trying to cat a file that did not yet exist:

cat .bookworm/metadata/jsoncatalog.txt

My solution was to move the files the build was complaining about into the directory where it was looking for them (in this case .bookworm/metadata):

mv mccallum/field_descriptions.json .bookworm/metadata/
mv mccallum/jsoncatalog.txt .bookworm/metadata/

I left the input.txt file in the top-level directory, as per the documentation:

folder/
    | input.txt

This really helped and the build ran for a while.

@tpmccallum
Author

The build ran for quite some time and then appeared to fail while looking for the input.txt file:

Traceback (most recent call last):
  File "/usr/local/bin/bookworm", line 9, in <module>
    load_entry_point('bookwormDB==0.4.0', 'console_scripts', 'bookworm')()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 556, in run_arguments
    getattr(my_bookworm,args.action)(args)
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 97, in tokenize
    raise IOError("Unable to find an input.txt or input.sh file in a default location")
IOError: Unable to find an input.txt or input.sh file in a default location
touch .bookworm/targets/encoded
bookworm -l WARNING -d mccallum prep database_wordcounts
ERROR:root:Query failed: 
DROP TABLE IF EXISTS words

Traceback (most recent call last):
  File "/usr/local/bin/bookworm", line 9, in <module>
    load_entry_point('bookwormDB==0.4.0', 'console_scripts', 'bookworm')()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 556, in run_arguments
    getattr(my_bookworm,args.action)(args)
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 234, in prep
    getattr(self,args.goal)()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/manager.py", line 377, in database_wordcounts
    Bookworm.load_word_list()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/CreateDatabase.py", line 204, in load_word_list
    db.query("""DROP TABLE IF EXISTS words""")
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/CreateDatabase.py", line 103, in query
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/CreateDatabase.py", line 75, in connect
    cursor.execute("CREATE DATABASE IF NOT EXISTS %s" % self.dbname)
  File "/usr/local/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 205, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.InternalError: (13, "Can't get stat of './mccallum' (Errcode: 13)")
make: *** [.bookworm/targets/database_wordcounts] Error 1

@tpmccallum
Author

if args.process=="text_stream":
            if args.file is None:
                for file in ["input.txt",".bookworm/texts/input.txt","../input.txt",".bookworm/texts/raw","input.sh"]:
                    if os.path.exists(file):
                        args.file = file
                        break
                if args.file is None:
                    # One of those should have worked.
                    raise IOError("Unable to find an input.txt or input.sh file in a default location")

The manager.py file < https://github.com/Bookworm-project/BookwormDB/blob/master/bookwormDB/manager.py > seems to look in several places for the input.txt file. I moved mine to .bookworm/texts/input.txt, so my final folder and file arrangement was:

BookwormDB/
    .bookworm/
        metadata/
            | jsoncatalog.txt
            | field_descriptions.json
        texts/
            | input.txt
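
A pre-flight check along these lines could catch a misplaced file before a long build starts. This is only a sketch: the three paths are the ones from the arrangement above, while the helper name missing_build_inputs is hypothetical and not part of BookwormDB.

```python
import os

# Paths relative to the BookwormDB working directory, copied from the
# arrangement that worked in this thread.
EXPECTED_FILES = [
    ".bookworm/metadata/jsoncatalog.txt",
    ".bookworm/metadata/field_descriptions.json",
    ".bookworm/texts/input.txt",
]


def missing_build_inputs(root="."):
    """Return the expected input files that do not exist under root.

    Hypothetical helper for illustration; BookwormDB does not ship it.
    """
    return [
        path for path in EXPECTED_FILES
        if not os.path.isfile(os.path.join(root, path))
    ]


if __name__ == "__main__":
    missing = missing_build_inputs()
    if missing:
        print("Missing before build:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected input files are in place.")
```

Run from the BookwormDB directory, it lists any of the three files that are not where this layout puts them.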

@tpmccallum
Author

I then hit an error about the tmp table being full. This has already been addressed in < #83 >. I followed the advice there (from Ben) about increasing values in the MySQL configuration, and the build then continued without problems.

@tpmccallum
Author

The build completed successfully with the following output

make -f /usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/etc/bookworm_Makefile .bookworm/metadata/jsoncatalog_derived.txt
make[1]: Entering directory `/mnt/data/single_file_bookworm/BookwormDB'
cat .bookworm/metadata/jsoncatalog.txt | parallel --pipe bookworm -l WARNING -d mccallum prep catalog_metadata > .bookworm/metadata/jsoncatalog_derived.txt
make[1]: Leaving directory `/mnt/data/single_file_bookworm/BookwormDB'
make -f /usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/etc/bookworm_Makefile .bookworm/texts/textids.dbm
make[1]: Entering directory `/mnt/data/single_file_bookworm/BookwormDB'
bookworm -l WARNING -d mccallum prep preDatabaseMetadata
bookworm -l WARNING -d mccallum prep text_id_database
make[1]: Leaving directory `/mnt/data/single_file_bookworm/BookwormDB'
make -f /usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/etc/bookworm_Makefile .bookworm/metadata/catalog.txt
make[1]: Entering directory `/mnt/data/single_file_bookworm/BookwormDB'
make[1]: `.bookworm/metadata/catalog.txt' is up to date.
make[1]: Leaving directory `/mnt/data/single_file_bookworm/BookwormDB'
bookworm -l WARNING -d mccallum tokenize text_stream | parallel --block-size 100M -u --pipe bookworm -l WARNING -d mccallum tokenize encode
touch .bookworm/targets/encoded
bookworm -l WARNING -d mccallum prep database_wordcounts
touch .bookworm/targets/database_wordcounts
bookworm -l WARNING -d mccallum prep database_metadata
/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/CreateDatabase.py:100: Warning: Data truncated for column 'searchstring' at row 7576939
  cursor.execute(sql)
touch .bookworm/targets/database_metadata
touch .bookworm/targets/database

The database is somewhat populated

mysql> show tables;
+---------------------+
| Tables_in_mccallum  |
+---------------------+
| API_settings        |
| catalog             |
| fastcat             |
| masterTableTable    |
| masterVariableTable |
| master_bigrams      |
| master_bookcounts   |
| nwords              |
| words               |
| wordsheap           |
+---------------------+
10 rows in set (0.00 sec)

For example, catalog is full of entries, but the words table is effectively empty (it holds a single blank row). Any suggestions?

mysql> select * from words;
+--------+------+-------+----------+------+
| wordid | word | count | casesens | stem |
+--------+------+-------+----------+------+
|      1 |      |     0 |          | NULL |
+--------+------+-------+----------+------+

@tpmccallum
Author

Here are a few more counts from the database

mysql> select count(*) from nwords;
+----------+
| count(*) |
+----------+
|        6 |
+----------+
1 row in set (0.00 sec)

mysql> select * from nwords;
+--------+--------+
| bookid | nwords |
+--------+--------+
|      1 |      6 |
|      2 |      8 |
|      4 |      5 |
|      6 |      5 |
|      7 |      1 |
|      9 |      1 |
+--------+--------+
6 rows in set (0.00 sec)

mysql> select count(*) from fastcat;
+----------+
| count(*) |
+----------+
|  8605524 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from nwords;
+----------+
| count(*) |
+----------+
|        6 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from catalog;
+----------+
| count(*) |
+----------+
|  8605524 |
+----------+
1 row in set (0.00 sec)

Any help would be appreciated.

@tpmccallum
Author

I re-ran the build with a smaller set of data and found that the output was the same, except that the output for the larger dataset included these lines:

/usr/local/lib/python2.7/dist-packages/bookwormDB-0.4.0-py2.7.egg/bookwormDB/CreateDatabase.py:100: Warning: Data truncated for column 'searchstring' at row 7576939
  cursor.execute(sql)

Interestingly it prints out the line

cursor.execute(sql)

which is in the CreateDatabase.py file at line 100

def query(self, sql):
        """
        Billy defined a separate query method here so that the common case of a connection being
        timed out doesn't cause the whole shebang to fall apart: instead, it just reboots
        the connection and starts up nicely again.
        """
        logging.debug(" -- Preparing to execute SQL code -- " + sql)
        try:
            cursor = self.conn.cursor()
            cursor.execute(sql)  # line 100, the line echoed in the warning
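
The echoed cursor.execute(sql) line is most likely not a second message. Python's warnings module prints the source line that triggered a warning underneath the warning text. The sketch below reproduces that layout with the standard library alone; the built-in Warning class stands in for MySQLdb's own Warning class, which is an assumption made so the example runs without MySQLdb installed.

```python
import warnings

# Values copied from the warning in this thread. The built-in Warning class
# is a stand-in for the MySQLdb warning category (an assumption).
text = warnings.formatwarning(
    message="Data truncated for column 'searchstring' at row 7576939",
    category=Warning,
    filename="CreateDatabase.py",
    lineno=100,
    line="cursor.execute(sql)",
)

# First line: file, line number, category and message.
# Second line: the indented source line that raised the warning.
print(text)
```

The second printed line is the source of the statement at line 100, so the warning itself is only about the truncated searchstring value.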

@tpmccallum
Author

I think this may have been an issue with running the build over a network. I ran everything again using nohup and it worked perfectly.

@bmschmidt
Member

Sorry to not get back to you while this was going on, but glad it worked out.

It may be worth putting some of these SELECT COUNT(*) FROM [...] commands into the code somewhere, because they do help trace what's failing.
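
One way such a check might look is sketched below. It is not BookwormDB code: the helper names table_row_counts and sparse_tables are hypothetical, and the demo runs against an in-memory SQLite database so that it works without a MySQL server. The same functions should accept any DB-API 2.0 cursor, including one from MySQLdb.

```python
import sqlite3


def table_row_counts(cursor, tables):
    """Return {table name: row count} using any DB-API 2.0 cursor."""
    counts = {}
    for table in tables:
        # Table names cannot be bound as query parameters, so only pass
        # trusted, hard-coded names here.
        cursor.execute("SELECT COUNT(*) FROM " + table)
        counts[table] = cursor.fetchone()[0]
    return counts


def sparse_tables(counts, minimum):
    """Names of tables holding fewer than `minimum` rows, sorted."""
    return sorted(name for name, n in counts.items() if n < minimum)


# Demo data mimicking the state reported in this thread: a populated
# catalog and a words table that holds only one blank row.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE catalog (bookid INTEGER)")
cur.execute("CREATE TABLE words (wordid INTEGER, word TEXT)")
cur.executemany("INSERT INTO catalog VALUES (?)", [(1,), (2,), (3,)])
cur.execute("INSERT INTO words VALUES (1, '')")

counts = table_row_counts(cur, ["catalog", "words"])
print(counts)
print(sparse_tables(counts, minimum=2))
```

A threshold is used rather than a test for zero rows because the words table in this thread was not strictly empty.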

A disconnect in the middle of a command could definitely cause problems. It seems like there must have been traces of a partial build keeping words from getting loaded in. It's very helpful to have this documentation in there. Two notes to add to the record:

  1. As you found, using nohup, tmux or screen to run the processes on a remote server is a good idea. The last two also preserve error output, which is useful until we add an option to write errors to a file.
  2. Sometimes partial builds will lead to an incomplete wordcount file. Executing bookworm build pristine can be very useful in these cases; it just nukes all database and local files so you can start over.
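
Until such an option exists, a small wrapper can approximate both notes by starting the build in its own session and appending its output to a log file. This is a sketch under assumptions: run_logged and the log file name are hypothetical, start_new_session applies on POSIX systems only, and the demo launches a harmless Python one-liner rather than the real build.

```python
import subprocess
import sys


def run_logged(cmd, log_path):
    """Start cmd in its own session, appending stdout and stderr to log_path.

    Hypothetical helper for illustration; not part of BookwormDB.
    """
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,  # keep errors in the same log file
            start_new_session=True,    # POSIX: detach from the terminal session
        )
    # The child holds its own copy of the log file descriptor, so closing
    # our handle here does not interrupt its output.
    return proc


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; for a real build,
    # substitute ["bookworm", "build", "all"].
    demo = [sys.executable, "-c", "print('build output')"]
    proc = run_logged(demo, "bookworm_build.log")
    print("started pid", proc.pid)
```

Because the child runs in a new session, a dropped SSH connection should not deliver a hangup signal to it, which is the same effect nohup is being used for above.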

@tpmccallum
Author

Thanks Ben,
Good advice; I am also really glad this all worked.
I saw the pristine function but have not used it yet. It sounds very useful.
Is there some way to get in contact with you by private email? I am working on a privately funded project and a doctoral-level qualification, both using Bookworm, and I have a very exciting set of data (16 million entries) to show you.
Tim

@tpmccallum
Author

You can email me at [email protected]
