Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Move NA values to end of table #124

Merged
merged 7 commits into from
Oct 18, 2024
Merged

fix: Move NA values to end of table #124

merged 7 commits into from
Oct 18, 2024

Conversation

fxwiegand
Copy link
Contributor

@fxwiegand fxwiegand commented Oct 16, 2024

Summary by CodeRabbit

  • New Features

    • Enhanced data processing and visualization capabilities for comparing datasets.
    • Added interactive exploration features in the visual output.
    • Improved handling of null values in sorting operations.
    • Standardized logging format across multiple rules, replacing notebook logs with log files.
  • Bug Fixes

    • Ensured clarity in output by renaming columns to reflect effects associated with specified labels.
    • Enhanced filtering and joining logic for datasets.

Copy link

coderabbitai bot commented Oct 16, 2024

Walkthrough

The changes to the compare_diffexp.py script involve enhancements to data processing and visualization logic. The prepare function now selects specific columns, focusing on the first "b_" column. Datasets are joined on target_id and ext_gene, and minimum values for qval and pval are computed. The results are filtered based on a qval_min threshold, and columns are renamed accordingly. Additionally, null values in pi_value are handled during sorting, and an Altair chart is constructed for interactive data visualization.

Changes

File Path Change Summary
workflow/scripts/compare_diffexp.py Enhanced data processing by updating the prepare function to select specific columns and compute minimum values for qval and pval. Implemented null value handling in sorting and constructed an Altair chart for visualization.
workflow/rules/meta_comparisons.smk Updated logging mechanism in rules meta_compare_diffexp, meta_compare_enrichment, and meta_compare_pathways to replace Jupyter notebook logs with standard log files.

Poem

In the meadow where data flows,
A script now dances, as knowledge grows.
With colors bright and points that gleam,
Nulls at the end, a coder's dream.
Join the datasets, let insights play,
In the world of numbers, hop away! 🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 5f8dc17 and 58d7cd5.

📒 Files selected for processing (1)
  • workflow/scripts/compare_diffexp.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • workflow/scripts/compare_diffexp.py

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@fxwiegand fxwiegand added the WIP label Oct 16, 2024
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
workflow/scripts/compare_diffexp.py (1)

49-62: Improvements in data handling and sorting

The changes in this segment effectively address the PR objective of moving NA values to the end of the table. Here are the key points:

  1. The added print statement at line 49 may be useful for debugging. Consider removing it or wrapping it in a debug condition for production use.
  2. Handling of NaN values for 'pi_value' and 'difference' columns (lines 55-56) improves data quality and prevents potential issues in downstream operations.
  3. The sorting modification with nulls_last=True (line 61) ensures that NA values are moved to the end of the table, aligning with the PR objective.

These changes enhance the robustness of the data processing pipeline.

Consider wrapping the print statement in a debug condition or removing it for production use.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between f594b15 and dca9161.

📒 Files selected for processing (1)
  • workflow/scripts/compare_diffexp.py (2 hunks)
🧰 Additional context used
🪛 Ruff
workflow/scripts/compare_diffexp.py

76-76: SyntaxError: Expected a statement


76-77: SyntaxError: Expected a statement

🔇 Additional comments (1)
workflow/scripts/compare_diffexp.py (1)

Line range hint 1-150: Overall improvements and addition of data visualization

The changes in this file effectively address the PR objective of moving NA values to the end of the table. The modifications to the data processing pipeline, particularly the handling of NaN values and the sorting with nulls_last=True, achieve this goal.

Additionally, the implementation of an Altair chart significantly enhances the data visualization capabilities of the script. This is a valuable addition, although it wasn't explicitly mentioned in the PR objectives.

The changes meet the stated PR objective and provide additional functionality through data visualization.

To ensure that the Altair chart implementation doesn't introduce any new dependencies or conflicts, please run the following script:

This script will help verify that Altair is properly imported and listed as a dependency in the project.

✅ Verification successful

To ensure that Altair is properly imported and listed as a dependency, please run the following updated script:

This script corrects the previous command and expands the search to additional dependency files to verify that Altair is properly included in the project dependencies.


To further verify the Altair import and dependency management, please run the following updated script:

This script adjusts the rg command to search for Python files based on their .py extension and performs a broader search for any references to Altair across all files in the repository. Additionally, it lists all Python files to help understand the project's structure and dependency management approach.


Dependency Verification Successful

The verification confirms that Altair is correctly imported in multiple scripts and is listed in the project's environment YAML files (QC.yaml, pystats.yaml). This ensures that Altair dependencies are properly managed.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify Altair-related imports and check for any new dependencies

# Test: Check for Altair import in other files
rg --type python "import altair as alt" --files-without-match workflow/scripts/compare_diffexp.py

# Test: Check for any new dependencies in the project
if [ -f "requirements.txt" ]; then
    grep "altair" requirements.txt
elif [ -f "environment.yml" ]; then
    grep "altair" environment.yml
else
    echo "No requirements.txt or environment.yml found. Please ensure Altair is properly listed as a dependency."
fi

Length of output: 404


Script:

#!/bin/bash
# Description: Verify Altair-related imports and check for any new dependencies

# Corrected Test: Check for Altair import in other Python files
rg -t python "import altair as alt" --files-without-match workflow/scripts/compare_diffexp.py

# Test: Check for Altair in common dependency files
DEPENDENCY_FILES=("requirements.txt" "environment.yml" "setup.py" "pyproject.toml" "Pipfile")

found_dependency=0

for file in "${DEPENDENCY_FILES[@]}"; do
    if [ -f "$file" ]; then
        grep "altair" "$file"
        if [ $? -eq 0 ]; then
            found_dependency=1
            break
        fi
    fi
done

if [ $found_dependency -eq 0 ]; then
    echo "Altair is not listed in any recognized dependency files. Please ensure it is added appropriately."
fi

Length of output: 791


Script:

#!/bin/bash
# Description: Verify Altair-related imports and check for any new dependencies

# Corrected Test: Search for Altair import in all Python files by extension
rg "import altair as alt" --glob "*.py" --files-with-matches

# Comprehensive Test: Search for Altair in all files to identify dependency declarations
grep -r "altair" .

# Optionally, list all Python files to understand the repository structure
fd -e py

Length of output: 1478

Comment on lines 76 to 77
)
print(combined)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Syntax error and additional print statement

There are two issues in this segment:

  1. There's an extra closing parenthesis at line 76, which is causing a syntax error. This needs to be removed.
  2. Another print statement is added at line 77, which might be useful for debugging but could clutter output in production.

To fix the syntax error and address the print statement, apply the following changes:

-)
-print(combined)
+print(combined)  # Consider removing or wrapping in a debug condition

This will resolve the syntax error and maintain the print statement. However, consider removing the print statement or wrapping it in a debug condition for production use.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
)
print(combined)
print(combined) # Consider removing or wrapping in a debug condition
🧰 Tools
🪛 Ruff

76-76: SyntaxError: Expected a statement


76-77: SyntaxError: Expected a statement

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
workflow/rules/meta_comparisons.smk (2)

70-70: LGTM! Completes consistent logging update.

This change finalizes the transition from .ipynb to .log format for all three rules in this file. The consistency in logging format across meta_compare_diffexp, meta_compare_enrichment, and meta_compare_pathways rules is commendable.

Consider updating any documentation or README files to reflect this change in logging format. This will help maintain clear and up-to-date information for other developers or users of the workflow.


Line range hint 14-70: Overall changes improve logging, but direct connection to PR objective is unclear.

The consistent change from .ipynb to .log format across all three rules (meta_compare_diffexp, meta_compare_enrichment, and meta_compare_pathways) improves the logging mechanism and potentially simplifies log parsing. While this could indirectly support the PR objective of moving NA values to the end of tables by making log analysis easier, the direct connection is not immediately apparent from these changes alone.

Consider adding a comment in the file or updating the PR description to clarify how these logging changes contribute to the overall objective of moving NA values to the end of the table. This would help reviewers and future maintainers understand the full context and impact of these modifications.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between dca9161 and 5f8dc17.

📒 Files selected for processing (2)
  • workflow/rules/meta_comparisons.smk (3 hunks)
  • workflow/scripts/compare_diffexp.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • workflow/scripts/compare_diffexp.py
🧰 Additional context used
🔇 Additional comments (2)
workflow/rules/meta_comparisons.smk (2)

45-45: LGTM! Consistent change across rules.

The modification from .ipynb to .log format for logging is consistent with the change in the meta_compare_diffexp rule. This uniformity in logging format across rules is a good practice.


14-14: LGTM! Consider verifying downstream log processing.

The change from .ipynb to .log format for logging is appropriate and aligns with standard practices. This modification should make log parsing easier, potentially supporting the PR's objective of reorganizing table data.

To ensure this change doesn't break any downstream processes, please run the following script:

✅ Verification successful

Verified: No remaining references to .ipynb log format found.

The change from .ipynb to .log format has been thoroughly verified. No references to the old .ipynb log format exist in the codebase, ensuring that downstream processes remain unaffected.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any references to the old .ipynb log format in the codebase

# Search for references to the old .ipynb log format
rg --type python --type yaml --type snakemake "logs/meta_compare_diffexp/.*\.ipynb"

# If the above command returns any results, those files may need to be updated to use the new .log format

Length of output: 116


Script:

#!/bin/bash
# Description: Check for any references to the old .ipynb log format in the codebase

# Search for references to the old .ipynb log format
rg --type py --type yaml --type snakemake 'logs/meta_compare_diffexp/.*\.ipynb'

# If the above command returns any results, those files may need to be updated to use the new .log format

Length of output: 115


Script:

#!/bin/bash
# Description: Check for any references to the old .ipynb log format in the codebase

# Define snakemake file type
rg --type-add 'snakemake:*.smk' --type snakemake 'logs/meta_compare_diffexp/.*\.ipynb'

Length of output: 88

Comment on lines +53 to +56
.with_columns(
pl.col("pi_value").fill_nan(None),
pl.col("difference").fill_nan(None),
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems convoluted, but at least it's well documented:
https://docs.pola.rs/user-guide/expressions/missing-data/#notanumber-or-nan-values

Copy link
Collaborator

@dlaehnemann dlaehnemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Not sure though, why the CodeRabbit summary is so verbose...

@dlaehnemann dlaehnemann merged commit df807a5 into main Oct 18, 2024
5 of 7 checks passed
@dlaehnemann dlaehnemann deleted the fxwiegand-patch-3 branch October 18, 2024 08:56
dlaehnemann pushed a commit that referenced this pull request Oct 18, 2024
🤖 I have created a release *beep* *boop*
---


##
[2.8.2](v2.8.1...v2.8.2)
(2024-10-18)


### Bug Fixes

* Convert pval and qval to heatmap
([#123](#123))
([478c759](478c759))
* Move NA values to end of table
([#124](#124))
([df807a5](df807a5))
* Update max in memory rows for diffexp report
([#122](#122))
([830e678](830e678))


### Performance Improvements

* Update to latest datavzrd wrapper
([#125](#125))
([3a4c020](3a4c020))
* Update to latest datavzrd wrapper version
([#120](#120))
([7b2767c](7b2767c))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants