task #3

Open

wants to merge 3 commits into base: master
Changes from all commits
main.py: 106 changes (99 additions, 7 deletions)
@@ -1,13 +1,105 @@
#!/usr/bin/env python
import json
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from stop_words import get_stop_words
from nltk.wsd import lesk
import re

# Your classes or methods go here:

-class Tool:
-    def my_method(self):
-        print("my_method called!")

class Writer:
    """Write data to a json file. Read a txt file."""

    @staticmethod
    def write_to_json(data_text):
Contributor:

A good example of the Single Responsibility Principle. Good job!

        with open('task.json', 'w') as json_w:
Contributor:

Remember, you shouldn't hardcode things like a file name - this should be configurable. You may provide it as a default value, though.

            json.dump(data_text, json_w, indent=4)
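
A minimal sketch of the configurable version the reviewer suggests (the "filename" parameter name is illustrative, not from the PR):

    @staticmethod
    def write_to_json(data_text, filename='task.json'):
        # 'task.json' stays as the default, but callers can override it
        with open(filename, 'w') as json_w:
            json.dump(data_text, json_w, indent=4)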


class Cleaner:

    @staticmethod
    def clear_data(table, col_name, pattern):
        table[col_name].replace(pattern, ' ', regex=True)
Contributor:

Hm, why replace with " " and not just with an empty string?

        return table[col_name]
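
For reference, clearing with an empty string might look like this (a sketch; note that pandas' Series.replace returns a new Series, so the result has to be kept rather than discarded):

    @staticmethod
    def clear_data(table, col_name, pattern):
        # replace() is not in-place: keep and return the new Series
        return table[col_name].replace(pattern, '', regex=True)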


class Tokenizer:

    @staticmethod
    def delete_stop_words(table, col_name):
Contributor:

An important thing: your method doesn't even need to know that any kind of "table" or "column" exists. It takes a string of text and returns a new string of text, without the stopwords. So, let's get rid of these table and column parameters in all of our processing methods.

        nltk_words = stopwords.words('english')
        stop_words = get_stop_words('en')
        stop_words.extend(nltk_words)
        table[col_name].apply(lambda without_stopwords: ' '.join(
vittorius (Contributor) commented on Dec 19, 2018:

This approach is used in all of the processing methods: modify the table[col_name] contents in-place and then return it. This should be avoided. You should try as much as possible to make your method a pure function. That is, it's just a pipe: data goes in, gets transformed or augmented, and then goes out.

It could be:

def delete_stop_words(text):
    nltk_words = stopwords.words('english')
    stop_words = get_stop_words('en')
    stop_words.extend(nltk_words)
    return ' '.join(word for word in text.split() if word not in stop_words)

# calling
for token in tokens:
    token_without_stop_words = delete_stop_words(token)
    # any other transformations

# merging the resulting data into a single object before writing it to the output file
data["metadata"] = ...
data["orphan_tokens"] = ...

Use this comment for a better understanding of what I mean.

Again, don't modify your inputs in-place (like in table[col_name].apply(...)).

            [word for word in without_stopwords.split() if word not in (stop_words)]))
        return table[col_name]

    @staticmethod
    def delete_stop_words(table, col_name):
        nltk_words = stopwords.words('english')
        stop_words = get_stop_words('en')
        stop_words.extend(nltk_words)
        ex_stopwords = lambda ex_stopwords:''.join([word for word in ex_stopwords.split() if word not in (stop_words)])
Contributor:

This lambda definition is still pretty hard to read.

        table[col_name].apply(ex_stopwords)
        return table[col_name]
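
A more readable shape for this (a sketch with illustrative names; a named function working on plain text, as the earlier comment suggests, instead of a one-line lambda):

def remove_stop_words(text, stop_words):
    # a named multi-line function is easier to read and test than a lambda
    kept_words = [word for word in text.split() if word not in stop_words]
    return ' '.join(kept_words)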

    @staticmethod
    def token(table, col_name):
Contributor:

Remember, the method name should be a verb (tokenize) and not a noun.

        table['tokens'] = table[col_name].str.split()
        return table['tokens']


class FinderOrphan:

    @staticmethod
    def find_orphan(table):
Contributor:

This name is much better, but it should have a plural suffix (find_orphans).

        orphan_tokens = []
        for row in range(len(table.cleared_text)):
            text = str(table.cleared_text[row])
            for word in word_tokenize(text):
                if lesk(text, word) is None:
                    orphan_tokens.append(word)
        return orphan_tokens


if __name__ == "__main__":
-    t = Tool()
-    t.my_method()
    # open txt
    filename = 'input.txt'
    f = open(filename)
Contributor:

Reading the file could also be extracted into a method or class (as it's done for the output).

    text = f.readlines()
    f.close()
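
A minimal sketch of such an extraction, mirroring the existing Writer (the class and method names are illustrative):

class Reader:
    """Read text lines from a txt file."""

    @staticmethod
    def read_lines(filename='input.txt'):
        # the with-block also closes the file automatically
        with open(filename) as txt_r:
            return txt_r.readlines()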
    # data to DataFrame
    data = pd.DataFrame(text)
    data.columns = ['original_text']
vittorius (Contributor) commented on Dec 19, 2018:

The "original_text" column should have been named "body" (see the task description).

    data.original_text = data.original_text.str.lower()

    cleaner = Cleaner()
    dollar_symbol = re.compile('(\?*[$].+?[ ])')
    URL_symbol = re.compile('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+|[@]\w+')
    lattice_symbol = re.compile('[#]\w+')
Contributor:

The correct naming here would be not "lattice" (that's a specific type of grid) but "hash".

    other_symbol = re.compile('[^\w\s]')

    data['cleared_text'] = cleaner.clear_data(data, 'original_text', dollar_symbol)
    data['cleared_text'] = cleaner.clear_data(data, 'cleared_text', URL_symbol)
    data['cleared_text'] = cleaner.clear_data(data, 'cleared_text', lattice_symbol)
    data['cleared_text'] = cleaner.clear_data(data, 'cleared_text', other_symbol)

    # find tags and metadata
    data['tags'] = data['original_text'].str.findall(lattice_symbol)
    data['metadata'] = data['original_text'].str.findall(URL_symbol)

    # tokenize
    tokens = Tokenizer()
Contributor:

In general, your helper classes (Tokenizer, FinderOrphan and others) modify the data in your data frame, data. Please avoid this. Modify the data in a single place - the top level of your script. Your classes or methods should know nothing about the data frame.

    tokens.delete_stop_words(data, 'cleared_text')
    tokens.token(data, 'cleared_text')
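
A sketch of the shape the reviewer describes, assuming the text-level delete_stop_words from the earlier suggestion (strings in, strings out):

# helpers stay pure: they receive and return plain text
data['cleared_text'] = data['cleared_text'].apply(delete_stop_words)

# the data frame is only touched here, at the top level of the script
data['tokens'] = data['cleared_text'].str.split()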

    # find orphan_tokens
    orphan_tokens = FinderOrphan()
Contributor:

The names of these local variables should be defined almost vice versa:

orphan_finder = FinderOrphan()
orphan_tokens = orphan_finder.find_orphan(data)

# or just

orphan_tokens = FinderOrphan().find_orphan(data)

    orphan = orphan_tokens.find_orphan(data)

    # data to json
    for_json = data.to_dict()
    write = Writer()
    write.write_to_json(for_json)