
task #3

Open · wants to merge 3 commits into base: master
Conversation

KondratiukYuliia

No description provided.

"""Writer data to json file. Read txt file"""

@staticmethod
def write_to_json(data_text):

A good example of the Single Responsibility Principle. Good job!

main.py Outdated

@staticmethod
def clear_data(table, col_name):
    table[col_name] = table['original_text'].replace('(\?*[$].+?[ ])', '', regex=True)

These checks could be split across several methods - think SRP.
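A sketch of what that split might look like (the function names and patterns here are illustrative, not taken from the PR):

```python
import re

# Each function handles exactly one kind of cleanup (SRP).
def remove_cashtags(text):
    # e.g. "$AAPL up today" -> "up today"
    return re.sub(r'\$\S+\s', '', text)

def remove_urls(text):
    return re.sub(r'https?://\S+', '', text)

def clean(text):
    # Compose the single-purpose steps into one pipeline.
    return remove_urls(remove_cashtags(text))
```

Each step can then be tested and reused on its own.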

main.py Outdated
class Cleaner:

@staticmethod
def clear_data(table, col_name):

Good method name.

main.py Outdated
nltk_words = stopwords.words('english')
stop_words = get_stop_words('en')
stop_words.extend(nltk_words)
table[col_name] = table[col_name].apply(lambda without_stopwords: ' '.join(

This lambda is too crowded with code - it could be defined above this line and then passed to the apply call. Try to keep to single-expression lambdas when you pass them to methods as a parameter.
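For instance (a sketch; the tiny stop_words stand-in replaces the nltk/stop_words lists built in the PR):

```python
# Stand-in for stop_words built from nltk + get_stop_words in the PR.
stop_words = {'the', 'a', 'is'}

def remove_stop_words(text):
    # A named function instead of a multi-line lambda.
    return ' '.join(word for word in text.split() if word not in stop_words)

# The apply call then stays on one readable line:
# table[col_name] = table[col_name].apply(remove_stop_words)
```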

main.py Outdated

@staticmethod
def token(table, col_name):
    data['tokens'] = table[col_name].str.split()

  1. Hmm, what is this data dictionary? Where does it come from?
  2. Why do you need this var?


@staticmethod
def write_to_json(data_text):
    with open('task.json', 'w') as json_w:

Remember, you shouldn't hardcode such things as a file name - this should be configurable. You may provide it as a default value, though.
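A sketch of the suggestion, keeping 'task.json' as the default (the json.dump call assumes data_text is serializable):

```python
import json

def write_to_json(data_text, filename='task.json'):
    # The file name is a parameter with a sensible default
    # instead of a hardcoded literal inside the method body.
    with open(filename, 'w') as json_w:
        json.dump(data_text, json_w)
```

Callers can now redirect the output without touching the method.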

f.close()
# data to DataFrame
data = pd.DataFrame(text)
data.columns = ['original_text']
@vittorius vittorius Dec 19, 2018

The "original_text" column should have been named "body" (see the task description).

t.my_method()
# open txt
filename = 'input.txt'
f = open(filename)

Reading the file could also be extracted into a method or class (as is done for the output).
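One possible shape for that extraction (the Reader class and read_lines name are hypothetical, mirroring the Writer on the output side):

```python
class Reader:
    """Mirror of the output side: reading is encapsulated too."""

    @staticmethod
    def read_lines(filename='input.txt'):
        # The context manager closes the file automatically,
        # so no explicit f.close() is needed.
        with open(filename) as f:
            return f.read().splitlines()
```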


@staticmethod
def clear_data(table, col_name, pattern):
    table[col_name].replace(pattern, ' ', regex=True)

Hm, why replace with " " and not just with an empty string?
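To illustrate the difference (a toy example, not code from the PR):

```python
import re

text = 'price $BTC rose'
# Replacing the match with ' ' leaves extra whitespace behind:
re.sub(r'\$\w+', ' ', text)    # 'price   rose'
# Letting the pattern consume the trailing space and replacing
# with '' keeps the spacing clean:
re.sub(r'\$\w+ ', '', text)    # 'price rose'
```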

nltk_words = stopwords.words('english')
stop_words = get_stop_words('en')
stop_words.extend(nltk_words)
ex_stopwords = lambda ex_stopwords:''.join([word for word in ex_stopwords.split() if word not in (stop_words)])

This lambda definition is still pretty hard to read.

nltk_words = stopwords.words('english')
stop_words = get_stop_words('en')
stop_words.extend(nltk_words)
table[col_name].apply(lambda without_stopwords: ' '.join(
@vittorius vittorius Dec 19, 2018

This approach is used in all of the processing methods: modify the table[col_name] contents in-place and then return it. This should be avoided. You should attempt as much as possible to make your method a pure function. That is, it's just a pipe: data goes in, gets transformed or augmented, and then goes out.

It could be:

def delete_stop_words(text):
    nltk_words = stopwords.words('english')
    stop_words = get_stop_words('en')
    stop_words.extend(nltk_words)
    return ' '.join(
        word for word in text.split() if word not in stop_words)

# calling
for token in tokens:
    token_without_stop_words = delete_stop_words(token)
    # any other transformations

# merging the resulting data into a single object before writing it to the output file

data["metadata"] = ...
data["orphan_tokens"] = ...

Use this comment to better understand what I mean.

Again, don't modify your inputs in-place (like in table[col_name].apply(...)).

class Tokenizer:

@staticmethod
def delete_stop_words(table, col_name):

An important thing. Actually, your method doesn't even have to know about any kind of "table" or "column" to exist. It takes a string of text and returns a new string of text, without the stopwords. So, let's get rid of these table and column parameters in all of our processing methods.
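A sketch of the suggested signature (stop_words is passed in explicitly here; the PR builds it from the nltk and stop_words packages):

```python
class Tokenizer:

    @staticmethod
    def delete_stop_words(text, stop_words):
        # Pure function: a string goes in, a new string comes out.
        # No knowledge of tables or columns.
        return ' '.join(w for w in text.split() if w not in stop_words)

# At the top level, the caller decides how to apply it to the frame:
# data['cleared_text'] = data['cleared_text'].apply(
#     lambda t: Tokenizer.delete_stop_words(t, stop_words))
```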

cleaner = Cleaner()
dollar_symbol = re.compile('(\?*[$].+?[ ])')
URL_symbol = re.compile('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+|[@]\w+')
lattice_symbol = re.compile('[#]\w+')

The correct name here would be not "lattice" (that is a specific kind of grid) but "hash".

data['metadata'] = data['original_text'].str.findall(URL_symbol)

# tokenize
tokens = Tokenizer()

In general, your helper classes (Tokenizer, FinderOrphan and others) modify the contents of your data frame data. Please avoid this. Modify the data in a single place - the top level of your script. Your classes or methods should know nothing about the data frame.
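One way to picture that separation (clean and tokenize are trivial stand-ins for the real processing steps; only the top-level process function knows about the collection being processed):

```python
def clean(text):
    # Stand-in for the real cleaning steps.
    return text.lower()

def tokenize(text):
    # Stand-in for the real tokenizer.
    return text.split()

def process(rows):
    # The only place that knows about the data structure;
    # the helpers above stay pure string functions.
    return [tokenize(clean(row)) for row in rows]
```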

return table[col_name]

@staticmethod
def token(table, col_name):

Remember, the method name should be a verb (tokenize) and not a noun.
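For example (a minimal sketch of the rename):

```python
class Tokenizer:

    @staticmethod
    def tokenize(text):
        # The verb name says what the method does; it returns a new
        # list instead of writing into a shared data frame.
        return text.split()
```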

class FinderOrphan:

@staticmethod
def find_orphan(table):

This name is much better, but it should have a plural suffix (find_orphans).

tokens.token(data, 'cleared_text')

# find orphan_tokens
orphan_tokens = FinderOrphan()

These local variables should be named almost the other way around:

orphan_finder = FinderOrphan()
orphan_tokens = orphan_finder.find_orphan(data)

# or just

orphan_tokens = FinderOrphan().find_orphan(data)
