192 Improve get_code_dependency (Code Parser) #201

Merged
merged 85 commits into from
Dec 5, 2023

Conversation

m7pr
Contributor

@m7pr m7pr commented Nov 23, 2023

Closes #192

Overview

The Code Parser feature accepts code in text form and outputs all necessary code to recreate any object from the original input.

If the code contains side effects that do not explicitly name the objects they affect, users can add a comment with the tag # @linksto object_name1 object_name2 to identify those objects. This ensures that the side effects are included when generating the code needed to create the objects. This is also why the input needs to be character rather than expression or language: comments are preserved in character input but are dropped from expression and language objects (unless a srcref attribute is kept).
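To illustrate why the input has to be character, here is a minimal sketch using only base R. The sample code string and the variable names are made up for this example and are not part of the package.

```r
# Illustrative input: a side effect tagged with @linksto, then an assignment.
code <- "
options(digits = 3) # @linksto x
x <- 1
"

# Parsing text with keep.source = TRUE keeps comments in the parse data.
parsed <- parse(text = code, keep.source = TRUE)
pd <- utils::getParseData(parsed)
comments <- pd$text[pd$token == "COMMENT"]

# Deparsing the individual calls (language objects) loses the comment.
deparsed <- vapply(
  parsed,
  function(call) paste(deparse(call), collapse = "\n"),
  character(1)
)

print(comments)
print(deparsed)
```

The tag survives only in the parse data built from the original text, which is what the Code Parser reads.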

Historical Background

  • The CodeDepends package was evaluated to check if it provided the features required to accomplish our goals (#238).
    • However, CodeDepends was found to have excessive dependencies and was deemed more than what we needed.
    • Consequently, we developed the current solution, named Code Parser.
      • It is based on the graph dependency concept in CodeDepends and utilizes utils::getParseData.
      • utils::getParseData relies on the srcref attribute of parsed code and returns a data.frame describing it.
      • srcref is created by calling attr(parse(text = code, keep.source = TRUE), 'srcref').
      • That's why code is limited to character or an expression with a srcref attribute.
  • The Code Parser was initially introduced in teal.code (PR #146), designed to take the qenv class as input.
  • Subsequently, we adopted a more general approach where the input is assumed to be character. This allows passing teal_data with code in the teal_data@code slot, which led to migrating this functionality to teal.data (PR #194).
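A small sketch of the srcref requirement mentioned in the list above, using base R only. The two-line input and the variable names are illustrative.

```r
code <- "a <- 1\nb <- a + 1"

with_src <- parse(text = code, keep.source = TRUE)
without_src <- parse(text = code, keep.source = FALSE)

# One srcref entry per top-level call.
srcref <- attr(with_src, "srcref")

# Parse data is available only when source references were kept.
pd <- utils::getParseData(with_src)
no_pd <- utils::getParseData(without_src)

print(length(srcref))
print(head(pd[, c("id", "parent", "token", "text")]))
print(is.null(no_pd))
```

Without source references there is nothing for utils::getParseData to return, which is why a plain expression is not enough.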

Implementation Plan

Code Parser consists of:

  • Code Graph - The code_dependency / code_graph function constructs the structure of dependencies between objects and their occurrences in specific calls derived from the code.
  • Graph Parser - The get_code_dependency / get_object_code function takes a code graph and the names of objects existing in the code as input. It returns the code, including all necessary dependencies, to recreate the specified objects and their influencers.

Pseudo code / algorithm

Code Graph
  • Take code as an input (a character or an expression with a srcref attribute).
  • Put the code in utils::getParseData to extract information about the parsed code with built-in functions.
  • utils::getParseData creates a data.frame structure (pd) enumerating each call, and enumerating each object/symbol within calls. Each object/symbol has token metadata specifying how R treats it (e.g., SYMBOL, LEFT_ASSIGN, SYMBOL_FUNCTION_CALL, SYMBOL_FORMALS).
  • Thanks to pd, we are able to bind all elements of all calls of the input code into a list. The list has a length equal to the number of calls (calls_pd).
  • Then, within each call, we extract objects by their metadata (the token column): we look for "SYMBOL" and "SYMBOL_FUNCTION_CALL" tokens and grep for ASSIGN operators to understand which object is influenced by which other objects in that call.
  • We also check for COMMENT tokens that contain @linksto tag to understand whether some calls should be assigned as influencers of other objects.
  • With the above information, we need to build a structure that in some way presents:
    • which objects exist in which calls
    • which objects influence other objects and in which calls
    • which side-effects influence which objects in which calls

⚠️ the above structure and its creation is a part of this PR which simplifies the current implementation
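The Code Graph steps above can be sketched roughly as follows. This is not the code from this PR: descendants and sketch_code_graph are hypothetical names, and the target detection (first or last SYMBOL of the call, depending on the direction of the assignment) is a deliberate simplification that ignores @linksto comments.

```r
# Hypothetical helper: id of a node plus the ids of all its descendants.
descendants <- function(pd, id) {
  children <- pd$id[pd$parent == id]
  c(id, unlist(lapply(children, function(child) descendants(pd, child))))
}

# Hypothetical sketch: one entry per top-level call, holding the assigned
# object (if any) and the other symbols occurring in that call.
sketch_code_graph <- function(code) {
  pd <- utils::getParseData(parse(text = code, keep.source = TRUE))
  assign_tokens <- c("LEFT_ASSIGN", "EQ_ASSIGN", "RIGHT_ASSIGN")
  top_ids <- pd$id[pd$parent == 0 & pd$token != "COMMENT"]
  lapply(top_ids, function(top_id) {
    # Terminal tokens of this call, in source order.
    call_pd <- pd[pd$id %in% descendants(pd, top_id) & pd$terminal, ]
    is_symbol <- call_pd$token %in% c("SYMBOL", "SYMBOL_FUNCTION_CALL")
    symbols <- call_pd$text[is_symbol]
    # Assignment operator directly under the top-level node, if present.
    operator <- pd$token[pd$parent == top_id & pd$token %in% assign_tokens]
    plain <- which(call_pd$token[is_symbol] == "SYMBOL")
    if (length(operator) == 0L || length(plain) == 0L) {
      return(list(target = NA_character_, depends_on = unique(symbols)))
    }
    # Simplification: the target is the first SYMBOL for `<-` and `=`,
    # and the last SYMBOL for `->`.
    pos <- if (operator[1] == "RIGHT_ASSIGN") max(plain) else min(plain)
    list(target = symbols[pos], depends_on = unique(symbols[-pos]))
  })
}

graph <- sketch_code_graph("a <- 1\nb <- a + 1\nb + a -> d\nprint(d)")
str(graph)
```

Each list element answers the first two questions from the list above: which objects occur in the call, and which object the call assigns.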

Graph Parser

Having the Code Graph, we take an object name (or multiple names) and we

  • (1) seek for a call in which this object was created (let's call it call X)
  • (2) limit the input calls_pd to the calls up to call X (let's call it calls_pd_x)
  • (3) we identify all influencers and side effects of object in the calls_pd_x
  • we repeat (1-3) for influencers from (3) and new calls_pd_x until all considered objects no longer have influencers

⚠️ simplifying the above process is also a part of this PR, since side_effects are detected by Code Parser and could be detected by Code Graph
⚠️ the above process will also be simplified in this PR, as it is based on both object names and call indices but could be merged into a process that uses only one of the two
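Steps (1) to (3) can be sketched as a small recursion over call indices. The graph below is written by hand, its target / depends_on layout is an assumption made for this example, and sketch_graph_parser is a hypothetical name. Side effects are left out, and only a single object name is resolved.

```r
# Hypothetical sketch: indices of the calls needed to recreate `name`,
# looking only at calls 1..upto of the graph.
sketch_graph_parser <- function(graph, name, upto = length(graph)) {
  # (1) find the last call, at or before `upto`, that assigns `name` (call X)
  created <- Filter(
    function(i) identical(graph[[i]]$target, name),
    seq_len(upto)
  )
  if (length(created) == 0L) {
    return(integer(0))
  }
  x <- max(created)
  # (2) + (3) resolve the influencers of call X among the earlier calls only
  earlier <- unlist(lapply(
    graph[[x]]$depends_on,
    function(dep) sketch_graph_parser(graph, dep, upto = x - 1L)
  ))
  sort(unique(c(earlier, x)))
}

# Hand-written graph for illustration.
graph <- list(
  list(target = "a", depends_on = character(0)), # a <- 1
  list(target = "b", depends_on = "a"),          # b <- a + 1
  list(target = "z", depends_on = character(0)), # z <- 5
  list(target = "d", depends_on = c("b", "a"))   # d <- b + a
)

calls_for_d <- sketch_graph_parser(graph, "d")
calls_for_z <- sketch_graph_parser(graph, "z")
print(calls_for_d)
print(calls_for_z)
```

The unrelated call that creates z is skipped when resolving d, which is the behaviour described in the steps above.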

Notes

  • The relationship between objects is assumed to be conveyed through <-, =, or -> assignment operators. No other object creation methods (such as assign, <<-, or any non-standard-evaluation method) are supported. This is addressed by using the # @linksto tag.
  • We do not assume any non-standard operations or evaluations in data processing code; however, if someone needs to create objects in a way other than with the assignment operators, the # @linksto tag is meant for it.
  • Any specific side effect that should be returned with a specific object should be tagged with the # @linksto tag at the end of the line where the side effect is created.
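A possible way to read the tag from the parse data, sketched with base R. linked_objects is a hypothetical helper, and matching the comment to a call by line number is an assumption of this example rather than the approach taken in the PR.

```r
code <- "
options(digits = 3) # @linksto x y
x <- 1.23456
y <- x * 2
"

pd <- utils::getParseData(parse(text = code, keep.source = TRUE))
is_tag <- pd$token == "COMMENT" & grepl("@linksto", pd$text, fixed = TRUE)
tagged <- pd[is_tag, ]

# Hypothetical helper: object names listed after the tag.
linked_objects <- function(comment) {
  after_tag <- sub("^.*@linksto", "", comment)
  parts <- strsplit(trimws(after_tag), "[[:space:]]+")[[1]]
  parts[nzchar(parts)]
}

# Linked object names, keyed by the line on which the comment appears.
links <- lapply(tagged$text, linked_objects)
names(links) <- tagged$line1

# Top-level call that spans the line of the (single) tagged comment.
top <- pd[pd$parent == 0 & pd$token != "COMMENT", ]
side_effect_call <- which(
  top$line1 <= tagged$line1 & tagged$line1 <= top$line2
)

print(links)
print(side_effect_call)
```

Here the options() call is marked as a side effect that has to accompany both x and y.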

@m7pr m7pr added the core label Nov 23, 2023
@m7pr
Contributor Author

m7pr commented Nov 23, 2023

Pushed a small alternative 3224a48 but this is not finished yet

@m7pr m7pr changed the base branch from main to refactor November 24, 2023 11:12
@m7pr m7pr marked this pull request as ready for review November 27, 2023 14:47
@m7pr
Contributor Author

m7pr commented Nov 27, 2023

Hey @insightsengineering/nest-core-dev I prepared a curated version of Code Parser. I would appreciate your review!
For now there are a few utils functions, like assert_classes, assert_code, assert_names or is_empty, that are here just to make the code review and code readability easier from the high-level perspective. Those should be incorporated in the main get_code_dependencies function. I also divided code_graph into 2 smaller functions, extract_occurence and extract_side_effects, so that we can track pieces of code responsible for a single purpose. There are also a couple of smaller functions that make the code review easier and allow naming pieces of the code. This was easier for moving things around in the prototyping phase and I thought it would still be helpful on the review side.

If anybody is willing to dive deeper into the code I think the biggest help that I need is simplifying the extract_occurence function and writing more edge-case tests.

Lastly, I decided not to export the get_code_dependencies function, as it is limited to our cases of simple data preparation and I don't think it will have wider applications for more sophisticated R code.

@gogonzo gogonzo self-assigned this Nov 28, 2023
@m7pr m7pr requested a review from chlebowa December 1, 2023 15:26
@gogonzo gogonzo dismissed chlebowa’s stale review December 5, 2023 14:44

comments addressed

@gogonzo
Contributor

gogonzo commented Dec 5, 2023

Good job @m7pr @chlebowa and Me :D

@gogonzo gogonzo merged commit f00cffe into refactor Dec 5, 2023
@gogonzo gogonzo deleted the 192_improve_code_parser@main branch December 5, 2023 14:45
@m7pr
Contributor Author

m7pr commented Dec 11, 2023

Hey @chlebowa thanks for the final review and a huge documentation cleanup! You are da man! Thanks @gogonzo for all the feedback related to implementation. It looks like code parser is way simpler than what we had in the first attempt.


Successfully merging this pull request may close these issues.

Potential simplification of get_code_dependency
3 participants