#!/bin/bash
#
# NOTES
#
# - Must expand templates to avoid a large loss of content.
# - Text will not (redundantly) contain the title string.
# - Keep sections. Section title will be marked by "Section::::".
# - Keep lists. List bullets will be marked by "BULLET::::".
# - Keep tables. They're mostly garbage but can be removed later (remove "^!*").
# - Remove disambiguation pages. Right now there is no use for them.
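INPUT="$1"      # Path to the Wikipedia XML dump
PROCESSES="$2"  # Number of extractor worker processes
TEMPLATES="$3"  # Templates file (used if present, otherwise created)
OUTPUT="$4"     # Output directory for the extracted files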
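# Hedged sketch, not part of the original script: validate_args is a
# hypothetical helper that checks the four positional arguments before the
# extractor is launched. The function name, messages, and exit codes are
# assumptions chosen for illustration.
validate_args() {
    # Exactly four arguments are expected: INPUT PROCESSES TEMPLATES OUTPUT.
    if [ "$#" -ne 4 ]; then
        echo "usage: extract.sh INPUT PROCESSES TEMPLATES OUTPUT" >&2
        return 64
    fi
    # INPUT must be a readable dump file.
    if [ ! -r "$1" ]; then
        echo "error: cannot read input dump: $1" >&2
        return 66
    fi
    # PROCESSES must consist of digits only.
    case "$2" in
        ''|*[!0-9]*)
            echo "error: PROCESSES must be a whole number: $2" >&2
            return 65
            ;;
    esac
    return 0
}
# To enable the check, call it before the extractor runs:
#   validate_args "$@" || exit $?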
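# Run the extractor as a module. "-m" takes a module name, not a file name,
# so the ".py" suffix is dropped. Arguments are quoted so that paths
# containing spaces survive word splitting.
python -m wikiextractor.WikiExtractor "$INPUT" \
    --json \
    --processes "$PROCESSES" \
    --templates "$TEMPLATES" \
    --output "$OUTPUT" \
    --bytes 1M \
    --compress \
    --links \
    --sections \
    --lists \
    --keep_tables \
    --min_text_length 0 \
    --filter_disambig_pages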