#Guidelines for Using Networked Corpus
The instructions below assume you are working on a Windows machine. Guidelines for the Mac will be added soon.
##Preparing to Use Networked Corpus
The GitHub repo of Networked Corpus appears to make assumptions about your environment and file locations that make it difficult to run out of the box. As a result, a zipped version is provided here with some modifications to make it easier to run. Download and extract the zipped version. Then do the following:
Open `gen-networked-corpus.py` in a text editor and scroll to line 303. Change `datadir = "/path/to/data-folder"` as appropriate to provide the location of your data.
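On Windows, backslashes in an ordinary Python string can be read as escape sequences, so the path is safer written as a raw string or with forward slashes. The lines below are a minimal sketch of how the edited setting might look, assuming a hypothetical data location of `C:\mallet\data-folder` and assuming the script passes `datadir` to Python's standard file functions. Substitute your own folder.

```python
from pathlib import PureWindowsPath

# Hypothetical location; replace it with the folder that holds your data.
# The r prefix keeps Python from treating backslashes as escape characters.
datadir = r"C:\mallet\data-folder"

# Forward slashes are another way to spell the same Windows path.
datadir_alt = "C:/mallet/data-folder"

# Both spellings refer to the same folder.
same_folder = PureWindowsPath(datadir) == PureWindowsPath(datadir_alt)
```

Only the `datadir = ...` line belongs in `gen-networked-corpus.py`; the comparison is there to show that the two spellings are interchangeable.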
##Using Networked Corpus
Using Networked Corpus involves two steps:
1. Run Mallet
2. Run Networked Corpus
###Step 1
Networked Corpus requires that data be imported into Mallet with the `--token-regex` flag. Assuming that you are running Mallet on a Windows machine, that your data is in a folder called `data-folder`, that your Mallet output is stored in a folder called `output-folder`, and that both folders are inside your `mallet` folder, the Mallet import command looks like this:
```
bin\mallet import-dir --input data-folder --output output-folder/corpus.mallet --keep-sequence --remove-stopwords --token-regex "[\p{L}\p{M}]+"
```
**Important: In Windows, the regex pattern must be enclosed in double quotes.**
Once you have imported your data, you can run Mallet's `train-topics` command as normal. However, the output files must follow the naming conventions described in Step 2.
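One way to avoid renaming files later is to give Mallet the expected names at training time. The sketch below only builds the command text and does not run Mallet. It assumes Mallet's standard `--output-doc-topics`, `--output-topic-keys`, and `--output-state` options; the topic count of 20 is an arbitrary placeholder, and the helper name `build_train_topics_command` is hypothetical.

```python
from pathlib import PureWindowsPath

def build_train_topics_command(output_folder="output-folder", num_topics=20):
    """Return a Mallet train-topics command whose outputs use the
    file names Networked Corpus expects."""
    out = PureWindowsPath(output_folder)
    return [
        r"bin\mallet", "train-topics",
        "--input", str(out / "corpus.mallet"),
        "--num-topics", str(num_topics),  # placeholder value, not a recommendation
        "--output-doc-topics", str(out / "doc_topics.txt"),
        "--output-topic-keys", str(out / "topic_keys.txt"),
        "--output-state", str(out / "topic_state.gz"),
    ]

command = build_train_topics_command()
print(" ".join(command))
```

The printed line can be pasted into a command prompt opened in your `mallet` folder. Adjust the folder name and number of topics to suit your project.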
###Step 2
To run Networked Corpus, open a command prompt or terminal and `cd` to the folder containing your Mallet output. In order for Networked Corpus to read the Mallet data, you need to make sure that the data has the following file names:
1. doc_topics.txt (this is the file sometimes called "composition")
2. topic_keys.txt
3. topic_state.gz
Change the file names if necessary. Type `cd ..` to go up a level and type `mkdir networkedcorpus`. This will make a folder for your Networked Corpus output. Go back into your output folder by typing `cd output-folder`.
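Before running Networked Corpus, it can help to confirm that the three files are present under the expected names. The following is a minimal sketch, not part of Networked Corpus itself; the helper name `missing_mallet_files` is hypothetical, and the folder name `output-folder` follows the example used in this guide.

```python
from pathlib import Path

# File names Networked Corpus expects, as listed in this step.
EXPECTED_FILES = ("doc_topics.txt", "topic_keys.txt", "topic_state.gz")

def missing_mallet_files(output_folder):
    """Return the expected file names that are absent from output_folder."""
    folder = Path(output_folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]

if __name__ == "__main__":
    # Run this from the folder that contains output-folder.
    missing = missing_mallet_files("output-folder")
    if missing:
        print("Rename or regenerate these files:", ", ".join(missing))
    else:
        print("All expected Mallet files are present.")
```

The check only looks at file names. It does not verify that the contents are valid Mallet output.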
You are now ready to run Networked Corpus. The following command assumes that `gen-networked-corpus.py` is located in a `networkedcorpus` folder inside a Windows user's `Documents` folder (replace `Scott` with your own user name):
```
python C:\Users\Scott\Documents\networkedcorpus\gen-networked-corpus.py --input-dir output-folder --output-dir networkedcorpus
```
This will populate your `networkedcorpus` folder with the necessary files. When the process is finished, navigate to that folder and look for `index.html`. Launch it in a browser to view the results.