# 2. Crawling and Preparing the Training Set

## Crawling the training set from the relevant and irrelevant URL lists

The following collects the raw pages into a list:

```python
from main import crawler

pages = []
crawler.crawl("data/relevant.txt", pages)  # fills `pages` with the raw HTML of each URL
```

Once the raw HTML pages are in the `pages` list, parse them like this:

```python
training = []
test = []
label = True  # True for pages crawled from the relevant list, False for the irrelevant one
crawler.parse_pages(pages, training, test, label)
```
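Putting the two steps together, a minimal end-to-end sketch (assuming `crawler.crawl` and `crawler.parse_pages` behave as shown above; the `data/irrelevant.txt` path is an assumption mirroring `data/relevant.txt`):

```python
from main import crawler

training, test = [], []

# Relevant pages: crawl, then parse with label=True.
relevant_pages = []
crawler.crawl("data/relevant.txt", relevant_pages)
crawler.parse_pages(relevant_pages, training, test, True)

# Irrelevant pages: crawl, then parse with label=False.
# NOTE: "data/irrelevant.txt" is an assumed path, not confirmed by the docs.
irrelevant_pages = []
crawler.crawl("data/irrelevant.txt", irrelevant_pages)
crawler.parse_pages(irrelevant_pages, training, test, False)

print(len(training), "training docs,", len(test), "test docs")
```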

The command-line equivalent for building the test set:

```bash
cd classification/
python main/crawler.py --class_irr data/urls/irrelevant.txt --class_rel data/urls/relevant.txt --output_dir data/test
```

If you want to add new documents:

```bash
cd classification/
python main/crawler.py --class_irr data/urls/development_irr.txt --class_rel data/urls/development_rel.txt --output_dir data/test_24_may
```

Clean the documents:

```bash
# Check the raw documents first, then write the cleaned versions to parsed/
python main/clean.py --check_dir data/test_24_may/raw
python main/clean.py --raw_dir data/test_24_may/raw --parsed_dir data/test_24_may/parsed
```
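After cleaning, it can be worth verifying that every raw document produced a parsed counterpart. A small sanity-check sketch (it assumes both directories are flat and share file names):

```python
from pathlib import Path

raw = {p.name for p in Path("data/test_24_may/raw").iterdir() if p.is_file()}
parsed = {p.name for p in Path("data/test_24_may/parsed").iterdir() if p.is_file()}

# Documents that were crawled but did not survive cleaning
missing = sorted(raw - parsed)
print(f"{len(missing)} missing:", missing[:10])
```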

See the Weka documentation!

## Extracting the training set from the brat folder

First, normalize the file names in the brat folder (zero-pad the numeric IDs to `dNNNN` and collapse doubled extensions such as `.txt.ann`):

```bash
cd ~/Downloads/Duru05/full_main/

# Zero-pad numeric file names: 123.ann -> d0123.ann, 123.txt -> d0123.txt
for a in [1-5]*.ann; do echo "$a"; mv "$a" "$(printf 'd%04d.%s' "${a%.*}" "${a##*.}")"; done
for a in [1-5]*.txt; do echo "$a"; mv "$a" "$(printf 'd%04d.%s' "${a%.*}" "${a##*.}")"; done

# Collapse doubled extensions: d0123.txt.ann -> d0123.ann, d0123.txt.txt -> d0123.txt
for a in d[0-5]*.txt.ann; do echo "$a"; mv "$a" "$(printf '%s.%s' "${a%.*.*}" "${a##*.}")"; done
for a in d[0-5]*.txt.txt; do echo "$a"; mv "$a" "$(printf '%s.%s' "${a%.*.*}" "${a##*.}")"; done
```
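For reference, the same renaming as a Python sketch (assumed to be run from inside the brat folder):

```python
from pathlib import Path

# Zero-pad numeric stems: 123.ann -> d0123.ann (same for .txt)
for p in sorted(Path(".").iterdir()):
    if p.suffix in {".ann", ".txt"} and p.stem.isdigit():
        p.rename(p.with_name(f"d{int(p.stem):04d}{p.suffix}"))

# Collapse doubled extensions: d0123.txt.ann -> d0123.ann, d0123.txt.txt -> d0123.txt
for p in sorted(Path(".").glob("d*.txt.*")):
    p.rename(p.with_name(p.name.replace(".txt.", ".", 1)))
```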

Filter the .txt files and copy them to the data folder:

```bash
mkdir -p ~/work/portuguese-nlp/classification/data/v6/class_rel
cp -r ~/work/portuguese-nlp/classification/data/v1/parsed/v4/class_irr ~/work/portuguese-nlp/classification/data/v6/
find ~/Downloads/Duru06/full_main/ -name "*.txt" -exec cp {} ~/work/portuguese-nlp/classification/data/v6/class_rel/ \;
ln -s ~/work/portuguese-nlp/classification/data/v6 ~/work/portuguese-nlp/classification/data/latest
scp -r ~/work/portuguese-nlp/classification/data/v6 shark:portuguese-nlp/classification/data/
```

Then pull the data onto the server:

```bash
# [on TerraNova]
cd ~/brazil/portuguese-nlp
scp -r shark:portuguese-nlp/classification/data/v6 classification/data/
ln -s classification/data/v6 classification/data/latest
```

The training set is already cleaned, so there is no need to clean it again.

```bash
mv ~/Downloads/extraction_fields_Duru\ -\ training\ merged.tsv classification/data/extraction_fields.tsv
```
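To confirm the merged annotation file is readable as TSV, a quick check (the column layout is not documented here, so it only reports counts):

```python
import csv

with open("classification/data/extraction_fields.tsv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="\t"))

print(len(rows), "rows;", len(rows[0]) if rows else 0, "columns in the first row")
```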

The last step is downloading the annotations. Do not forget to push them into the repository.

Previous: 1. Preprocessing on dataset

Next: 3. Classification using GraphLab

Back to Main Page