task #3
base: master
Conversation
"""Writer data to json file. Read txt file""" | ||
|
||
@staticmethod | ||
def write_to_json(data_text): |
A good example of the Single Responsibility Principle. Good job!
main.py (outdated)
@staticmethod
def clear_data(table, col_name):
    table[col_name] = table['original_text'].replace('(\?*[$].+?[ ])', '', regex=True)
These checks could be distributed in several methods - think SRP.
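A possible split along those lines (a sketch only - the method names and the simplified regexes below are illustrative, not taken from the PR):

```python
import re

class Cleaner:
    """Each pattern gets its own small, single-purpose method."""

    @staticmethod
    def remove_dollar_tags(text):
        # strip "$..." fragments such as cashtags (simplified pattern)
        return re.sub(r'\$\S+\s?', '', text)

    @staticmethod
    def remove_urls(text):
        return re.sub(r'https?://\S+', '', text)

    @staticmethod
    def clear_text(text):
        # compose the small steps instead of one method doing everything
        for step in (Cleaner.remove_dollar_tags, Cleaner.remove_urls):
            text = step(text)
        return text

print(Cleaner.clear_text('price $AAPL rose'))  # -> price rose
```

Each small method is now testable on its own, and adding a new cleanup rule is one new method plus one entry in the composition list.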
main.py (outdated)
class Cleaner:

    @staticmethod
    def clear_data(table, col_name):
Good method name.
main.py (outdated)
nltk_words = stopwords.words('english')
stop_words = get_stop_words('en')
stop_words.extend(nltk_words)
table[col_name] = table[col_name].apply(lambda without_stopwords: ' '.join(
This lambda is too crowded with code - it could be defined above this line and then passed to the apply call. Try to use single-expression lambdas when you pass them to methods as a param.
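One way to keep the apply call short is a named helper defined above it (a sketch - stop_words here is a tiny hardcoded stand-in for the nltk/stop_words union built in the PR):

```python
stop_words = {'the', 'a', 'is', 'in'}  # stand-in for stopwords.words('english') + get_stop_words('en')

def remove_stop_words(text):
    """Single-purpose helper: easy to name, test, and reuse."""
    return ' '.join(word for word in text.split() if word not in stop_words)

# the apply call then stays one short, readable line:
# table[col_name] = table[col_name].apply(remove_stop_words)
print(remove_stop_words('the cat is in the box'))  # -> cat box
```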
main.py (outdated)
@staticmethod
def token(table, col_name):
    data['tokens'] = table[col_name].str.split()
- Hmm, what is this "data" dictionary? Where does it come from?
- Why do you need this var?
@staticmethod
def write_to_json(data_text):
    with open('task.json', 'w') as json_w:
Remember, you shouldn't hardcode such things as a file name - this should be configurable. You may provide it as a default value though.
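A sketch of the default-value approach (the Writer class name and json.dump are assumptions here, standing in for whatever serialization the PR actually does):

```python
import json

class Writer:
    @staticmethod
    def write_to_json(data_text, filename='task.json'):
        # 'task.json' is only a default; any caller may point somewhere else
        with open(filename, 'w') as json_w:
            json.dump(data_text, json_w)
```

Existing call sites keep working unchanged, while tests or other scripts can pass their own path: `Writer.write_to_json(result, filename='out/run1.json')`.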
f.close()
# data to DataFrame
data = pd.DataFrame(text)
data.columns = ['original_text']
The "original_text" column should have been named "body" (see the task description).
t.my_method()
# open txt
filename = 'input.txt'
f = open(filename)
Reading the file could be also extracted into a method or class (as it's done for the output).
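A possible counterpart to the JSON writer (the Reader class name is hypothetical):

```python
class Reader:
    """Input handling lives in one place, mirroring the output side."""

    @staticmethod
    def read_text(filename='input.txt'):
        # the context manager closes the file for us - no explicit f.close()
        with open(filename) as f:
            return f.read()
```

With this in place, the top-level script shrinks to a single call like `text = Reader.read_text('input.txt')`.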
@staticmethod
def clear_data(table, col_name, pattern):
    table[col_name].replace(pattern, ' ', regex=True)
Hm, why replacing with " " and not just with an empty string?
nltk_words = stopwords.words('english')
stop_words = get_stop_words('en')
stop_words.extend(nltk_words)
ex_stopwords = lambda ex_stopwords: ''.join([word for word in ex_stopwords.split() if word not in (stop_words)])
This lambda definition is still pretty hard to read.
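A more readable version might look like this (a sketch; note that the original also shadows its own name with its parameter, and joins with '' where ' ' is almost certainly intended):

```python
stop_words = {'the', 'a', 'is'}  # stand-in for the nltk/stop_words union above

def exclude_stop_words(text):
    # distinct names for the function, its argument, and the loop variable
    kept = [word for word in text.split() if word not in stop_words]
    return ' '.join(kept)  # ' ' rather than '' so words stay separated

print(exclude_stop_words('this is a test'))  # -> this test
```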
nltk_words = stopwords.words('english')
stop_words = get_stop_words('en')
stop_words.extend(nltk_words)
table[col_name].apply(lambda without_stopwords: ' '.join(
This approach is used in all of the processing methods: modify the table[col_name] contents in-place and then return it. This should be avoided. You should attempt as much as possible to make your method a pure function. That is, it's just a pipe: data goes in, gets transformed or augmented, and then goes out.
It could be:

def delete_stop_words(text):
    nltk_words = stopwords.words('english')
    stop_words = get_stop_words('en')
    stop_words.extend(nltk_words)
    return ' '.join(word for word in text.split() if word not in stop_words)

# calling
for token in tokens:
    token_without_stop_words = delete_stop_words(token)
# any other transformations
# merging the resulting data into a single object before writing it to the output file
data["metadata"] = ...
data["orphan_tokens"] = ...

Use this example for a better understanding of what I mean. Again, don't modify your inputs in-place (like in table[col_name].apply(...)).
class Tokenizer:

    @staticmethod
    def delete_stop_words(table, col_name):
An important thing. Actually, your method doesn't even have to know about any kind of "table" or "column" to exist. It takes a string of text and returns a new string of text, without the stopwords. So, let's get rid of these table and column parameters in all of our processing methods.
cleaner = Cleaner()
dollar_symbol = re.compile('(\?*[$].+?[ ])')
URL_symbol = re.compile('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+|[@]\w+')
lattice_symbol = re.compile('[#]\w+')
The correct naming here would be not "lattice" (this is a specific type of a "grid") but "hash".
data['metadata'] = data['original_text'].str.findall(URL_symbol)

# tokenize
tokens = Tokenizer()
In general, your helper classes (Tokenizer, FinderOrphan and others) modify the data in your data frame "data". Please avoid this. Modify the data in a single place - the top level of your script. Your classes or methods should know nothing about the data frame.
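A minimal sketch of that shape, with made-up helper names (not the ones from the PR):

```python
import pandas as pd

# processing helpers are pure functions: a string goes in, a new value comes out;
# they know nothing about any data frame
def clear_text(text):
    return text.replace('#', '')

def tokenize(text):
    return text.split()

# the data frame is touched in exactly one place: the top level of the script
data = pd.DataFrame({'body': ['hello #world']})
data['cleared_text'] = data['body'].apply(clear_text)
data['tokens'] = data['cleared_text'].apply(tokenize)
print(data['tokens'][0])  # -> ['hello', 'world']
```

With this layout, every helper can be unit-tested with plain strings, and swapping pandas for another container would only touch the top-level block.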
return table[col_name]

@staticmethod
def token(table, col_name):
Remember, the method name should be a verb (tokenize) and not a noun.
class FinderOrphan:

    @staticmethod
    def find_orphan(table):
This name is much better, but it should have a plural suffix (find_orphans).
tokens.token(data, 'cleared_text')

# find orphan_tokens
orphan_tokens = FinderOrphan()
The names of these local variables should be pretty much the other way around:
orphan_finder = FinderOrphan()
orphan_tokens = orphan_finder.find_orphan(data)
# or just
orphan_tokens = FinderOrphan().find_orphan(data)