Commit

update readme | all tests pass | add sentence tokenizer | rearrange
khannatanmai committed May 10, 2021
1 parent 95038fd commit 584d4fb
Showing 15 changed files with 52 additions and 287 deletions.
15 changes: 12 additions & 3 deletions README.md
@@ -2,15 +2,16 @@

## How to Use
- Install dependencies using `pip install -r requirements.txt`
- Download spacy model using `python -m spacy download en_core_web_sm`
- `python3 src/preprocess.py [rule_file.ppr] [input_file.txt]`
- Test using `./tests/test.sh`

## External tools used
- spacy POS tagger
- Download model using `python -m spacy download en_core_web_sm`
Note: This assumes your input is already sentence tokenised. If it's not, you can use the `spacy` sentence tokeniser first.

## Rule formalism (File extension .ppr)

## **Sample rule file: `tests/rulesets/eng-hin.ppr`**

### Source side rules
- `[...]` : POS Tags
- `[..@1]` : Variables named `0-9`, `a-z`, etc., to be used on the target side
@@ -32,3 +33,11 @@ For example, if you want a rule that matches "the" followed by an Adjective, whi
- Anything not in `[...]` is matched directly
- Rules are put in a list and applied on the input sentence one after the other.
- Only lines with `->` in the rule-set are counted as rules.
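
The last two points (rules are kept in a list, and only lines containing `->` count as rules) can be sketched roughly as follows. This is a minimal illustration, not the code in `src/preprocess.py`: the helper name `load_rules`, the space-separated tokenisation of the source side, and the sample rule text are all assumptions made for the example.

```python
from typing import List, Tuple

# A rule is a pair of token lists: (source side, target side).
Rule = Tuple[List[str], List[str]]


def load_rules(lines: List[str]) -> List[Rule]:
    """Hypothetical helper: keep only lines containing '->' and split each
    one into (source tokens, target tokens)."""
    rules: List[Rule] = []
    for line in lines:
        if "->" not in line:
            continue  # headings, comments and blank lines are not rules
        source, target = line.split("->", 1)
        # Assumption: tokens on both sides are separated by single spaces.
        rules.append((source.strip().split(" "), target.strip().split(" ")))
    return rules


# Hypothetical rule-file content, for illustration only.
sample = [
    "Some heading line without an arrow",
    "",
    "kick the bucket -> die",
]

rules = load_rules(sample)
print(rules)
```

Because the rules are stored in an ordered list, a later rule would see the output of an earlier one, so the order of lines in the rule file matters.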

## Testing
- Run tests using `tests/test.sh`

## Miscellaneous Information
This project is part of my Master's thesis in Computational Linguistics titled: **Rule-based pre-processing of idioms and non-compositional constructions to simplify them and improve black-box machine translation**

You can open an issue on this repo to report bugs or to ask questions.
File renamed without changes.
13 changes: 0 additions & 13 deletions src/SvayamMT_AccessToken.txt

This file was deleted.

76 changes: 0 additions & 76 deletions src/nmt_api.py

This file was deleted.

19 changes: 0 additions & 19 deletions src/output_test.txt

This file was deleted.

12 changes: 6 additions & 6 deletions src/preprocess.py
@@ -83,7 +83,7 @@ def check(x, y): #Comparison with multiple options

patterns_and_replacements.append((detection_pattern, rule[1].strip().split(" ")))

-nlp = spacy.load("en_core_web_sm", disable=["parser", "ner", "attribute_ruler"])
+nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

input_lines = open(input_file_path).readlines()

@@ -281,11 +281,11 @@ def check(x, y): #Comparison with multiple options
text = "".join(output_parts)
construction_detected_in_line = True

-if(construction_detected_in_line):
-    print("Construct Detected\t" + text)
-else:
-    print("Not Detected\t" + text)
+#if(construction_detected_in_line):
+# print("Construct Detected\t" + text)
+#else:
+# print("Not Detected\t" + text)

#Output after applying all rules
-#print(text)
+print(text)

3 changes: 0 additions & 3 deletions src/preprocessing_testing.txt

This file was deleted.

17 changes: 1 addition & 16 deletions src/requirements.txt
@@ -1,16 +1 @@
-certifi==2020.12.5
-chardet==3.0.4
-googletrans==3.1.0a0
-h11==0.9.0
-h2==3.2.0
-hpack==3.0.0
-hstspreload==2020.12.22
-httpcore==0.9.1
-httpx==0.13.3
-hyperframe==5.2.0
-idna==2.10
-requests==2.25.1
-rfc3986==1.4.0
-sniffio==1.2.0
-urllib3==1.26.3
-spacy==2.2.4
+spacy==3.0.6
13 changes: 13 additions & 0 deletions src/sentence_tokenizer.py
@@ -0,0 +1,13 @@
+import spacy
+import sys
+
+nlp = spacy.load('en_core_web_sm') # Load the English Model
+
+file_name = sys.argv[1]
+f = open(file_name).readlines()
+
+for line in f:
+    doc = nlp(line)
+    for sent in doc.sents:
+        print(str(sent).strip())

17 changes: 0 additions & 17 deletions src/swayam_api_python.py

This file was deleted.

17 changes: 0 additions & 17 deletions src/testing.py

This file was deleted.

70 changes: 0 additions & 70 deletions src/testing.txt

This file was deleted.

File renamed without changes.
27 changes: 0 additions & 27 deletions tests/rulesets/rule-set.ppr

This file was deleted.
