
task #3

Open · wants to merge 3 commits into base: master
Conversation

KondratiukYuliia

No description provided.

"""Writer data to json file. Read txt file"""

@staticmethod
def write_to_json(data_text):

A good example of the Single Responsibility Principle. Good job!

main.py Outdated

@staticmethod
def clear_data(table, col_name):
    table[col_name] = table['original_text'].replace('(\?*[$].+?[ ])', '', regex=True)

These checks could be split across several methods - think SRP.
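A sketch of what that split might look like (the function names and patterns here are illustrative, not taken from the PR):

```python
import re

# Each function handles exactly one kind of cleanup (SRP).
def remove_cashtags(text):
    # e.g. "$AAPL up today" -> "up today"
    return re.sub(r'\$\S+\s', '', text)

def remove_urls(text):
    return re.sub(r'https?://\S+', '', text)

def clean(text):
    # Compose the single-purpose steps into one pipeline.
    return remove_urls(remove_cashtags(text))
```

Each step can then be tested and reused on its own.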

main.py Outdated
class Cleaner:

@staticmethod
def clear_data(table, col_name):

Good method name.

main.py Outdated
nltk_words = stopwords.words('english')
stop_words = get_stop_words('en')
stop_words.extend(nltk_words)
table[col_name] = table[col_name].apply(lambda without_stopwords: ' '.join(

This lambda is too crowded with code - it could be defined above this line and then passed to the apply call. Try to keep to single-expression lambdas when you pass them to methods as a parameter.
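For instance (a sketch; the tiny stop_words stand-in replaces the nltk/stop_words lists built in the PR):

```python
# Stand-in for stop_words built from nltk + get_stop_words in the PR.
stop_words = {'the', 'a', 'is'}

def remove_stop_words(text):
    # A named function instead of a multi-line lambda.
    return ' '.join(word for word in text.split() if word not in stop_words)

# The apply call then stays on one readable line:
# table[col_name] = table[col_name].apply(remove_stop_words)
```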

main.py Outdated

@staticmethod
def token(table, col_name):
    data['tokens'] = table[col_name].str.split()

  1. Hmm, what is this data dictionary? Where does it come from?
  2. Why do you need this var?


@staticmethod
def write_to_json(data_text):
    with open('task.json', 'w') as json_w:

Remember, you shouldn't hardcode such things as a file name - this should be configurable. You may provide it as a default value, though.
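A sketch of the suggestion, keeping 'task.json' as the default (the json.dump call assumes data_text is serializable):

```python
import json

def write_to_json(data_text, filename='task.json'):
    # The file name is a parameter with a sensible default
    # instead of a hardcoded literal inside the method body.
    with open(filename, 'w') as json_w:
        json.dump(data_text, json_w)
```

Callers can now redirect the output without touching the method.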

f.close()
# data to DataFrame
data = pd.DataFrame(text)
data.columns = ['original_text']
@vittorius vittorius Dec 19, 2018

The "original_text" column should have been named "body" (see the task description).

t.my_method()
# open txt
filename = 'input.txt'
f = open(filename)

Reading the file could also be extracted into a method or class (as is done for the output).
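One possible shape for that extraction (the Reader class and read_lines name are hypothetical, mirroring the Writer on the output side):

```python
class Reader:
    """Mirror of the output side: reading is encapsulated too."""

    @staticmethod
    def read_lines(filename='input.txt'):
        # The context manager closes the file automatically,
        # so no explicit f.close() is needed.
        with open(filename) as f:
            return f.read().splitlines()
```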


@staticmethod
def clear_data(table, col_name, pattern):
    table[col_name].replace(pattern, ' ', regex=True)

Hm, why replace with " " and not just with an empty string?
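To illustrate the difference (a toy example, not code from the PR):

```python
import re

text = 'price $BTC rose'
# Replacing the match with ' ' leaves extra whitespace behind:
re.sub(r'\$\w+', ' ', text)    # 'price   rose'
# Letting the pattern consume the trailing space and replacing
# with '' keeps the spacing clean:
re.sub(r'\$\w+ ', '', text)    # 'price rose'
```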

nltk_words = stopwords.words('english')
stop_words = get_stop_words('en')
stop_words.extend(nltk_words)
ex_stopwords = lambda ex_stopwords:''.join([word for word in ex_stopwords.split() if word not in (stop_words)])

This lambda definition is still pretty hard to read.

nltk_words = stopwords.words('english')
stop_words = get_stop_words('en')
stop_words.extend(nltk_words)
table[col_name].apply(lambda without_stopwords: ' '.join(
@vittorius vittorius Dec 19, 2018

This approach is used in all of the processing methods: modify the table[col_name] contents in-place and then return it. This should be avoided. You should attempt as much as possible to make your method a pure function. That is, it's just a pipe: data goes in, gets transformed or augmented, and then goes out.

It could be:

def delete_stop_words(text):
    nltk_words = stopwords.words('english')
    stop_words = get_stop_words('en')
    stop_words.extend(nltk_words)
    return ' '.join(
        word for word in text.split() if word not in stop_words)

# calling
for token in tokens:
    token_without_stop_words = delete_stop_words(token)
    # any other transformations

# merging the resulting data into a single object before writing it to the output file

data["metadata"] = ...
data["orphan_tokens"] = ...

Use this comment to better understand what I mean.

Again, don't modify your inputs in-place (like in table[col_name].apply(...)).

class Tokenizer:

@staticmethod
def delete_stop_words(table, col_name):

An important thing. Actually, your method doesn't even have to know about any kind of "table" or "column" to exist. It takes a string of text and returns a new string of text, without the stopwords. So, let's get rid of these table and column parameters in all of our processing methods.
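A sketch of the suggested signature (stop_words is passed in explicitly here; the PR builds it from the nltk and stop_words packages):

```python
class Tokenizer:

    @staticmethod
    def delete_stop_words(text, stop_words):
        # Pure function: a string goes in, a new string comes out.
        # No knowledge of tables or columns.
        return ' '.join(w for w in text.split() if w not in stop_words)

# At the top level, the caller decides how to apply it to the frame:
# data['cleared_text'] = data['cleared_text'].apply(
#     lambda t: Tokenizer.delete_stop_words(t, stop_words))
```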

cleaner = Cleaner()
dollar_symbol = re.compile('(\?*[$].+?[ ])')
URL_symbol = re.compile('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+|[@]\w+')
lattice_symbol = re.compile('[#]\w+')

The correct name here would be not "lattice" (that is a specific kind of grid) but "hash".

data['metadata'] = data['original_text'].str.findall(URL_symbol)

# tokenize
tokens = Tokenizer()

In general, your helper classes (Tokenizer, FinderOrphan and others) modify the contents of your data frame data. Please avoid this. Modify the data in a single place - the top level of your script. Your classes or methods should know nothing about the data frame.
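One way to picture that separation (clean and tokenize are trivial stand-ins for the real processing steps; only the top-level process function knows about the collection being processed):

```python
def clean(text):
    # Stand-in for the real cleaning steps.
    return text.lower()

def tokenize(text):
    # Stand-in for the real tokenizer.
    return text.split()

def process(rows):
    # The only place that knows about the data structure;
    # the helpers above stay pure string functions.
    return [tokenize(clean(row)) for row in rows]
```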

return table[col_name]

@staticmethod
def token(table, col_name):

Remember, the method name should be a verb (tokenize) and not a noun.
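For example (a minimal sketch of the rename):

```python
class Tokenizer:

    @staticmethod
    def tokenize(text):
        # The verb name says what the method does; it returns a new
        # list instead of writing into a shared data frame.
        return text.split()
```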

class FinderOrphan:

@staticmethod
def find_orphan(table):

This name is much better, but it should have a plural suffix (find_orphans).

tokens.token(data, 'cleared_text')

# find orphan_tokens
orphan_tokens = FinderOrphan()

These local variables should be named almost the other way around:

orphan_finder = FinderOrphan()
orphan_tokens = orphan_finder.find_orphan(data)

# or just

orphan_tokens = FinderOrphan().find_orphan(data)
