task #3
base: master
Changes from all commits
@@ -1,13 +1,105 @@
 #!/usr/bin/env python
+import json
+import pandas as pd
+from nltk import word_tokenize
+from nltk.corpus import stopwords
+from stop_words import get_stop_words
+from nltk.wsd import lesk
+import re
 
 # Your classes or methods go here:
 
-class Tool:
-    def my_method(self):
-        print("my_method called!")
+class Writer:
+    """Writer data to json file. Read txt file"""
+
+    @staticmethod
+    def write_to_json(data_text):
+        with open('task.json', 'w') as json_w:
> **Review comment:** Remember, you shouldn't hardcode such things as a file name - this should be configurable. You may provide it as a default value though.
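A minimal sketch of what the reviewer suggests: the output path becomes a parameter, with the old hardcoded name surviving only as a default. The extra `filename` parameter is my illustration, not code from the PR.

```python
import json

class Writer:
    """Write data to a JSON file."""

    @staticmethod
    def write_to_json(data_text, filename='task.json'):
        # the path is now configurable; 'task.json' is just the default
        with open(filename, 'w') as json_w:
            json.dump(data_text, json_w, indent=4)
```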
+            json.dump(data_text, json_w, indent=4)
+
+
+class Cleaner:
+
+    @staticmethod
+    def clear_data(table, col_name, pattern):
+        table[col_name].replace(pattern, ' ', regex=True)
> **Review comment:** Hm, why replacing with `' '`?
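There is also a bug the review circles around: `Series.replace` is not an in-place operation - it returns a new Series, and the line above discards that return value, so the column is never actually cleaned. A small sketch of the difference (illustrative data, not from the PR):

```python
import re
import pandas as pd

pattern = re.compile(r'[#]\w+')
col = pd.Series(['check #tag here'])

# return value discarded: col itself is unchanged
col.replace(pattern, ' ', regex=True)

# the result must be assigned (or written back to the column)
cleaned = col.replace(pattern, ' ', regex=True)
```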
+        return table[col_name]
+
+
+class Tokenizer:
+
+    @staticmethod
+    def delete_stop_words(table, col_name):
> **Review comment:** An important thing. Actually, your method doesn't even have to know about any kind of "table" or "column" to exist. It takes a string of text and returns a new string of text, without the stopwords. So, let's get rid of these `table` and `col_name` parameters.
+        nltk_words = stopwords.words('english')
+        stop_words = get_stop_words('en')
+        stop_words.extend(nltk_words)
+        table[col_name].apply(lambda without_stopwords: ' '.join(
> **Review comment:** This approach is used in all of the processing methods: modify the column in place and return it. It could be:
>
> ```python
> def delete_stop_words(text):
>     nltk_words = stopwords.words('english')
>     stop_words = get_stop_words('en')
>     stop_words.extend(nltk_words)
>     ...
>
> # calling
> for token in tokens:
>     token_without_stop_words = delete_stop_words(token)
> # any other transformations
> # merging the resulting data into a single object before writing it to the output file
> data["metadata"] = ...
> data["orphan_tokens"] = ...
> ```
>
> Use this comment for the better understanding of what I mean. Again, don't modify your inputs in-place (like in ...).
+            [word for word in without_stopwords.split() if word not in (stop_words)]))
+        return table[col_name]
+
+    @staticmethod
+    def delete_stop_words(table, col_name):
+        nltk_words = stopwords.words('english')
+        stop_words = get_stop_words('en')
+        stop_words.extend(nltk_words)
+        ex_stopwords = lambda ex_stopwords: ''.join([word for word in ex_stopwords.split() if word not in (stop_words)])
> **Review comment:** This lambda definition is still pretty hard to read.
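One way to address both this comment and the earlier string-in/string-out advice: replace the lambda with a named helper that works on a plain string. The short stop-word set here is a stand-in for the merged NLTK + `stop_words` lists in the PR.

```python
STOP_WORDS = {'a', 'an', 'the', 'is', 'in'}  # stand-in for the real stop-word lists

def delete_stop_words(text):
    """Return `text` with stop words removed, preserving word order."""
    kept = [word for word in text.split() if word not in STOP_WORDS]
    return ' '.join(kept)
```

Note it joins with `' '`, not `''` as the lambda above does - joining on the empty string glues the remaining words together.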
+
+        table[col_name].apply(ex_stopwords)
+        return table[col_name]
+
+    @staticmethod
+    def token(table, col_name):
> **Review comment:** Remember, the method name should be a verb (e.g. `tokenize`).
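A sketch of the verb-named, string-in/list-out version the review keeps pointing toward (`tokenize` is my assumed name, following the comment):

```python
def tokenize(text):
    """Split a text into whitespace-delimited tokens."""
    return text.split()
```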
+
+        table['tokens'] = table[col_name].str.split()
+        return table['tokens']
+
+
+class FinderOrphan:
+
+    @staticmethod
+    def find_orphan(table):
> **Review comment:** This name is much better, but it should have a plural suffix (`find_orphans`).
+
+        orphan_tokens = []
+        for row in range(len(table.cleared_text)):
+            text = str(table.cleared_text[row])
+            for word in word_tokenize(text):
+                if lesk(text, word) is None:
+                    orphan_tokens.append(word)
+        return orphan_tokens
+
+
 if __name__ == "__main__":
-    t = Tool()
-    t.my_method()
+    # open txt
+    filename = 'input.txt'
+    f = open(filename)
> **Review comment:** Reading the file could be also extracted into a method or class (as it's done for the output).
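A possible shape for that extraction, mirroring the existing `Writer` (the `Reader` name and `read_lines` method are my assumptions, not code from the PR). Using a `with` block also removes the need for the explicit `f.close()` below:

```python
class Reader:
    """Read lines of text from a file."""

    @staticmethod
    def read_lines(filename='input.txt'):
        # 'with' closes the file even if reading raises an exception
        with open(filename) as f:
            return f.readlines()
```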
+    text = f.readlines()
+    f.close()
+    # data to DataFrame
+    data = pd.DataFrame(text)
+    data.columns = ['original_text']
> **Review comment:** The "original_text" column should have been named "body" (see the task description).
+    data.original_text = data.original_text.str.lower()
+
+    cleaner = Cleaner()
+    dollar_symbol = re.compile('(\?*[$].+?[ ])')
+    URL_symbol = re.compile('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+|[@]\w+')
+    lattice_symbol = re.compile('[#]\w+')
> **Review comment:** The correct naming here would be not "lattice" (this is a specific type of a "grid") but "hash".
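With the rename the reviewer suggests, behavior is unchanged (`hash_symbol` is my guess at the intended name). The pattern matches a `#` followed by word characters, i.e. a hashtag:

```python
import re

# same pattern as lattice_symbol above, renamed: '#' is a hash sign
hash_symbol = re.compile(r'[#]\w+')
```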
+
+    other_symbol = re.compile('[^\w\s]')
+
+    data['cleared_text'] = cleaner.clear_data(data, 'original_text', dollar_symbol)
+    data['cleared_text'] = cleaner.clear_data(data, 'cleared_text', URL_symbol)
+    data['cleared_text'] = cleaner.clear_data(data, 'cleared_text', lattice_symbol)
+    data['cleared_text'] = cleaner.clear_data(data, 'cleared_text', other_symbol)
+
+    # find tags and metadata
+    data['tags'] = data['original_text'].str.findall(lattice_symbol)
+    data['metadata'] = data['original_text'].str.findall(URL_symbol)
+
+    # tokenize
+    tokens = Tokenizer()
> **Review comment:** In general, your helper classes (`Cleaner`, `Tokenizer`, `FinderOrphan`) contain only static methods, so there is no need to create their instances to call those methods.
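Since every method in these helpers is a `@staticmethod`, they can be called on the class itself; the `Tokenizer()` instance above is unnecessary. A minimal illustration with a toy method body (not the PR's implementation):

```python
class Tokenizer:
    @staticmethod
    def tokenize(text):
        return text.split()

# no instance needed: call the static method on the class directly
tokens = Tokenizer.tokenize('lazy brown dog')
```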
+
+    tokens.delete_stop_words(data, 'cleared_text')
+    tokens.token(data, 'cleared_text')
+
+    # find orphan_tokens
+    orphan_tokens = FinderOrphan()
> **Review comment:** The names of these local variables should be defined almost vice versa:
>
> ```python
> orphan_finder = FinderOrphan()
> orphan_tokens = orphan_finder.find_orphan(data)
> # or just
> orphan_tokens = FinderOrphan().find_orphan(data)
> ```
+
+    orphan = orphan_tokens.find_orphan(data)
+
+    # data to json
+    for_json = data.to_dict()
+    write = Writer()
+    write.write_to_json(for_json)
> **Review comment:** A good example of the Single Responsibility Principle. Good job!