We parsed the input with LX Parser, a constituency parser for Portuguese based on the Stanford Parser, and used LX-Tokenizer to tokenize the input prior to parsing.
The following bash loop runs LX-Tokenizer on every file in data/input/:
# Tokenize each input file into data/tokenized/ under the same file name.
mkdir -p data/tokenized
for file in $(find data/input/ -type f -printf "%f\n");
do
  Tokenizer/Tokenizer/run-Tokenizer.sh < "data/input/$file" > "data/tokenized/$file"
done
The following bash loop runs the parser (a Java program) on every tokenized file:
# Parse each tokenized file; parser messages are appended to data/log_parse.txt.
mkdir -p data/parsed
for file in $(find data/tokenized/ -type f -printf "%f\n");
do
  java -Xmx1000m -cp stanford-parser-2010-11-30/stanford-parser.jar \
    edu.stanford.nlp.parser.lexparser.LexicalizedParser \
    -tokenized -sentences newline -outputFormat oneline \
    -uwModel edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel \
    cintil.ser.gz "data/tokenized/$file" > "data/parsed/$file" 2>>data/log_parse.txt
  echo "Completed $file"
done
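After a long batch run it can help to check that every tokenized file actually produced output. The following is a minimal sketch, not part of the original pipeline: check_counts is a hypothetical helper that compares the number of non-empty lines in each tokenized file with the number of non-empty lines in its parsed counterpart. It assumes one sentence per input line (as implied by -sentences newline) and one tree per output line (as implied by -outputFormat oneline); a mismatch only suggests that data/log_parse.txt deserves a look, not that a particular sentence failed.

```shell
# Hypothetical helper (not part of LX Parser or the Stanford Parser):
# report files whose parsed output is missing or has a different number
# of non-empty lines than the tokenized input.
check_counts() {
    tok_dir=$1
    parsed_dir=$2
    for tok in "$tok_dir"/*; do
        # Skip directories and the unexpanded pattern of an empty directory.
        [ -f "$tok" ] || continue
        name=$(basename "$tok")
        parsed="$parsed_dir/$name"
        if [ ! -f "$parsed" ]; then
            echo "MISSING $name"
            continue
        fi
        # grep -c . counts lines that contain at least one character.
        n_in=$(grep -c . "$tok")
        n_out=$(grep -c . "$parsed")
        if [ "$n_in" -ne "$n_out" ]; then
            echo "MISMATCH $name: $n_in sentences, $n_out trees"
        fi
    done
}

# Usage with the directories from the loops above:
#   check_counts data/tokenized data/parsed
```

No output means that every tokenized file has a parsed file with the same number of non-empty lines.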
Example parser output (see Tagset):
(ROOT
(S
(S
(S
(NP
(ART A)
(N' (N greve) (PP (P de_) (NP (ART os) (N' (N vigilantes) (PP (P de_) (NP (ART o) (N Rio)))))))
)
(VP
(VP (V está) (VP (V suspensa) (PP (P de) (ADV hoje))))
(PP (P até) (NP (N' (N segunda-feira) (A .*/))))
)
)
(S
(NP (ART A) (N decisão))
(VP (V foi) (VP (V tomada) (AP (ADV ontem) (A ,*/))))
)
)
...
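To work with output like this from the shell, the tagged tokens can be pulled out of the bracketed trees. The following is a minimal sketch under stated assumptions: tree_leaves is a hypothetical helper, it treats every innermost pair of parentheses holding exactly a tag and a token as a leaf, and it relies on grep -o, which GNU and BSD grep provide but POSIX does not require. It also assumes that tokens never contain spaces or literal parentheses.

```shell
# Hypothetical helper: read bracketed trees on stdin and print one
# "TAG<TAB>token" line per leaf, e.g. "(N greve)" becomes "N<TAB>greve".
tree_leaves() {
    # A leaf is "(" + tag + single space + token + ")" with no nested parentheses.
    grep -oE '\([^() ]+ [^() ]+\)' |
        awk '{ gsub(/[()]/, ""); print $1 "\t" $2 }'
}

# Usage (the file name is a placeholder for any file in data/parsed/):
#   tree_leaves < data/parsed/example.txt | cut -f1 | sort | uniq -c | sort -rn
```

The usage line counts how often each part-of-speech tag occurs, which gives a quick overview of what the parser assigned.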