The following snippet crawls the raw pages into a list:
from main import crawler

pages = []
crawler.crawl("data/relevant.txt", pages)
Once the raw HTML pages are in the pages list, parse them into training and test sets:
training = []
test = []
label = True
crawler.parse_pages(pages, training, test, label)
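Putting the two snippets together, a full in-Python run over both classes might look like the sketch below. The data/irrelevant.txt path and the use of label=False for the irrelevant class are assumptions; only the relevant case is shown above.

from main import crawler

# Crawl the raw HTML for each class (data/irrelevant.txt is an assumed path).
relevant_pages, irrelevant_pages = [], []
crawler.crawl("data/relevant.txt", relevant_pages)
crawler.crawl("data/irrelevant.txt", irrelevant_pages)

# Parse into training/test splits; the label flags the class of the pages.
training, test = [], []
crawler.parse_pages(relevant_pages, training, test, True)
crawler.parse_pages(irrelevant_pages, training, test, False)  # assumed: False = irrelevant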
This is the command-line invocation for testing:
cd classification/
python main/crawler.py --class_irr data/urls/irrelevant.txt --class_rel data/urls/relevant.txt --output_dir data/test
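The internals of main/crawler.py are not shown on this page; the sketch below is a hypothetical driver showing how the flags above could map onto the library calls from the previous section. Writing the results under --output_dir is presumably handled by the real script and is omitted here.

import argparse
from main import crawler

# Hypothetical flag wiring; the actual main/crawler.py may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--class_irr", help="file listing irrelevant URLs")
parser.add_argument("--class_rel", help="file listing relevant URLs")
parser.add_argument("--output_dir", help="where the documents are written")
args = parser.parse_args()

training, test = [], []
for url_file, label in ((args.class_rel, True), (args.class_irr, False)):
    pages = []
    crawler.crawl(url_file, pages)
    crawler.parse_pages(pages, training, test, label)
# Saving training/test under args.output_dir is left to the real script.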
If you want to add new documents:
cd classification/
python main/crawler.py --class_irr data/urls/development_irr.txt --class_rel data/urls/development_rel.txt --output_dir data/test_24_may
Check, then clean the documents:
python main/clean.py --check_dir data/test_24_may/raw
python main/clean.py --raw_dir data/test_24_may/raw --parsed_dir data/test_24_may/parsed
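clean.py's implementation is not shown here; the sketch below only illustrates the kind of HTML-to-text extraction such a step typically performs, using the standard library. The raw/parsed layout matches the flags above.

from html.parser import HTMLParser
from pathlib import Path

class TextExtractor(HTMLParser):
    """Collect visible text nodes, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.chunks.append(text)

raw_dir = Path("data/test_24_may/raw")
parsed_dir = Path("data/test_24_may/parsed")
parsed_dir.mkdir(parents=True, exist_ok=True)
for page in raw_dir.iterdir():
    if page.is_file():
        extractor = TextExtractor()
        extractor.feed(page.read_text(encoding="utf-8", errors="ignore"))
        (parsed_dir / page.name).write_text("\n".join(extractor.chunks), encoding="utf-8")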
First, fix the file names in the brat folder (zero-pad the numeric IDs, then collapse the doubled extensions):
cd ~/Downloads/Duru05/full_main/
for a in [1-5]*.ann; do echo "$a"; mv "$a" "$(printf 'd%04d.%s' "${a%.*}" "${a##*.}")"; done
for a in [1-5]*.txt; do echo "$a"; mv "$a" "$(printf 'd%04d.%s' "${a%.*}" "${a##*.}")"; done
for a in d[0-5]*.txt.ann; do echo "$a"; mv "$a" "$(printf '%s.%s' "${a%.*.*}" "${a##*.}")"; done
for a in d[0-5]*.txt.txt; do echo "$a"; mv "$a" "$(printf '%s.%s' "${a%.*.*}" "${a##*.}")"; done
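For reference, a Python equivalent of the four rename loops above (run from the same folder; assumes well-formed names such as 123.ann and d0123.txt.ann):

from pathlib import Path

# Pass 1: prefix and zero-pad numeric IDs, e.g. 123.ann -> d0123.ann.
for path in list(Path(".").glob("[1-5]*.ann")) + list(Path(".").glob("[1-5]*.txt")):
    stem, _, ext = path.name.rpartition(".")
    if stem.isdigit():
        path.rename(f"d{int(stem):04d}.{ext}")

# Pass 2: drop the doubled extension, e.g. d0123.txt.ann -> d0123.ann.
for path in list(Path(".").glob("d[0-5]*.txt.ann")) + list(Path(".").glob("d[0-5]*.txt.txt")):
    base = path.name.split(".", 1)[0]
    ext = path.name.rpartition(".")[2]
    path.rename(f"{base}.{ext}")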
Filter the .txt files and copy them to the data folder:
mkdir -p ~/work/portuguese-nlp/classification/data/v6/class_rel
cp -r ~/work/portuguese-nlp/classification/data/v1/parsed/v4/class_irr ~/work/portuguese-nlp/classification/data/v6/
find ~/Downloads/Duru06/full_main/ -name "*.txt" -exec cp {} ~/work/portuguese-nlp/classification/data/v6/class_rel/ \;
ln -s ~/work/portuguese-nlp/classification/data/v6 ~/work/portuguese-nlp/classification/data/latest
scp -r ~/work/portuguese-nlp/classification/data/v6 shark:portuguese-nlp/classification/data/
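Before moving on, it can be worth checking that both classes ended up with documents. A hypothetical sanity check:

from pathlib import Path

# Hypothetical check: count .txt documents per class in v6.
data_dir = Path.home() / "work/portuguese-nlp/classification/data/v6"
for class_dir in ("class_rel", "class_irr"):
    count = sum(1 for _ in (data_dir / class_dir).glob("*.txt"))
    print(class_dir, count)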
Then pull the data onto the server:
#[on TerraNova]
cd ~/brazil/portuguese-nlp
scp -r shark:portuguese-nlp/classification/data/v6 classification/data/
ln -s v6 classification/data/latest
The training set is already cleaned, so there is no need to clean it again.
mv ~/Downloads/extraction_fields_Duru\ -\ training\ merged.tsv classification/data/extraction_fields.tsv
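The column layout of extraction_fields.tsv is not documented here; a quick, hypothetical way to inspect it before committing:

import csv

with open("classification/data/extraction_fields.tsv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="\t"))
if rows:
    print(f"{len(rows)} rows, {len(rows[0])} columns; first row: {rows[0]}")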
The last step is downloading the annotations. Do not forget to push them into the repository.
Previous: 1. Preprocessing on dataset
Next: 3. Classification using Graphlab
Back to Main Page