-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KN Picard Metrics Aggregation Script #364
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@krithikanathamuni Thank you for putting this together. I made a few general comments on code structure and formatting, as well as more specific suggestions on code refactoring. Let me know if you have any questions!
@@ -0,0 +1,276 @@ | |||
def read_in(file_name, label): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can be more descriptive with the function name here. You could also consider adding a docstring with a description of the function and potentially inputs/outputs, where you can describe the format of the file_name
file, and what does the label
represent.
Versions of python > 3.5 also support typing (https://docs.python.org/3/library/typing.html) - you can specify types of the inputs, and function output. This really helps with readability and especially if you are reading someone else's code.
How detailed you want to be depends on whether this script will be used or extended again by you or other people in the future.
intable = line.startswith(label) | ||
if intable == True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can replace these two lines with
intable = line.startswith(label) | |
if intable == True: | |
if line.startswith(label): |
newlist = [] | ||
wgsfile = file_name | ||
|
||
with tfio.gfile.GFile(file_name, "r") as inp: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason for using TensorFlow's GFile here? Can this be replace with Python's built-in open
method?
def read_in(file_name, label): | ||
started = False | ||
newlist = [] | ||
wgsfile = file_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable is not used.
newlist.append(line.rstrip('\n')) | ||
if started and line == '\n': | ||
return newlist | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This break statement is not reachable.
|
||
def sequencing_artifact_metrics(table): | ||
sadfone = pd.DataFrame(table.loc[1]).T.reset_index(drop = True) | ||
sadf1 = sadfone.add_suffix("_1").drop(['SAMPLE_ALIAS_1', 'LIBRARY_1', 'WORST_CXT_1', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can also be refactored to get rid of the duplicate code, but also to store the strings SAMPLE_ALIAS
, SAMPLE_ALIAS_1
strings into variables. This way if the name of the Picard attribute changes you only need to change one variable - but also there is less chance of having a typo somewhere that will break the code.
return SAMresult | ||
|
||
def windows(table, window): | ||
if len(table.columns) == 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine to use in the script, but in general having the output of the function be conditioned on the hardcoded variable like this is a sign that the code should be refactored.
One way to refactor it is to rewrite this code in an object-oriented manner. Again, I don't think it is necessary here but for code that is more than few hundred lines long it's a good idea, and it will provide more readability and robustness.
table_name = "sample" | ||
samples = pd.read_csv(io.StringIO(fiss.fapi.get_entities_tsv(project, workspace, 'sample').text), sep='\t') | ||
samples.rename(columns = {'entity:sample_id':'sample'}, inplace = True) | ||
# specificcolumns = samples[['alignment_summary_metrics', 'base_distribution_by_cycle_table', 'gc_bias_summary_metrics', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove old code here.
# dropemptycolumns = dropemptyrows.dropna(axis = 1) | ||
# files = ! ls | ||
|
||
Dict = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make the variable name more descriptive and lower case. dict
is also name for a built-in type in Python and typing library has a class Dict
so it's best to avoid using either.
result = pd.concat([result, newrow], ignore_index = True) | ||
return result | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra spaces here.
No description provided.