Tokenization and Parsing

We used LX-Parser, a constituency parser for Portuguese based on the Stanford Parser, and LX-Tokenizer to tokenize the input prior to parsing.

A Bash loop that runs LX-Tokenizer on every input file:

  mkdir -p data/tokenized
  # Tokenize each input file (-printf requires GNU find; file names must not contain whitespace)
  for file in $(find data/input/ -type f -printf "%f\n"); do
    Tokenizer/Tokenizer/run-Tokenizer.sh < "data/input/$file" > "data/tokenized/$file"
  done

A Bash loop that runs the Java parser on every tokenized file:

  mkdir -p data/parsed
  # Parse each tokenized file; parser messages on stderr are appended to data/log_parse.txt
  for file in $(find data/tokenized/ -type f -printf "%f\n"); do
    java -Xmx1000m -cp stanford-parser-2010-11-30/stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -tokenized -sentences newline -outputFormat oneline -uwModel edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel cintil.ser.gz "data/tokenized/$file" > "data/parsed/$file" 2>>data/log_parse.txt
    echo "Completed $file"
  done
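
Because the parser's messages go to a log file rather than the terminal, a failed parse can pass unnoticed. The following is a minimal sketch of a sanity check, not part of the original pipeline: the function name `missing_outputs` is hypothetical, and it assumes only that every file in the source directory should have a non-empty file of the same name in the destination directory.

```shell
# Hypothetical helper (not part of LX-Parser or the Stanford Parser).
# Prints the name of every regular file in directory $1 that has no
# non-empty file of the same name in directory $2.
missing_outputs() {
  src_dir=$1
  dst_dir=$2
  for path in "$src_dir"/*; do
    # Skip subdirectories and the unexpanded glob of an empty directory.
    [ -f "$path" ] || continue
    name=${path##*/}
    # -s is true only if the file exists and is larger than zero bytes.
    [ -s "$dst_dir/$name" ] || printf '%s\n' "$name"
  done
}
```

Running `missing_outputs data/tokenized data/parsed` after the loop lists the files that may need to be parsed again; `missing_outputs data/input data/tokenized` does the same for the tokenization step.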

Example parser output (Tagset):

(ROOT
  (S
    (S
      (S
        (NP
          (ART A)
          (N' (N greve) (PP (P de_) (NP (ART os) (N' (N vigilantes) (PP (P de_) (NP (ART o) (N Rio)))))))
        )
        (VP
          (VP (V está) (VP (V suspensa) (PP (P de) (ADV hoje))))
          (PP (P até) (NP (N' (N segunda-feira) (A .*/))))
        )
      )
      (S
        (NP (ART A) (N decisão))
        (VP (V foi) (VP (V tomada) (AP (ADV ontem) (A ,*/))))
      )
    )
    ...
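
To inspect the tags assigned to individual tokens, the leaves of the bracketed trees can be pulled out with standard text tools. This is a rough sketch rather than a real tree reader: the function name `tree_leaves` is hypothetical, and it assumes that every leaf has the form `(TAG token)` with no spaces or parentheses inside the tag or the token, as in the example above.

```shell
# Hypothetical helper: reads bracketed trees on stdin and prints one
# "TAG<TAB>token" line per leaf.
tree_leaves() {
  # grep -o emits each innermost "(TAG token)" pair on its own line;
  # the tr calls then drop the parentheses and turn the space into a tab.
  grep -oE '\([^() ]+ [^() ]+\)' | tr -d '()' | tr ' ' '\t'
}
```

For example, `tree_leaves < data/parsed/somefile` (the file name is a placeholder) lists the tagged tokens in sentence order; phrase labels such as `NP` or `N'` are skipped because they are followed by another bracket, not a token.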