Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert Location to a namedtuple, and associated cleanup #205

Merged
merged 6 commits into from
Oct 17, 2024

Conversation

reventlov
Copy link
Collaborator

This change makes the Location dataclass, which does not change
frequently, into a new SourceLocation namedtuple, and changes the
SourceLocation serialization. As a result, with this change:

  • embossc runs about 25% faster on a large (7kLOC) input; python3 -OO embossc runs about 19% faster on the same input.
  • Serialized IR is about 45% smaller.

Details:

  • Replace the ir_data.Location dataclass with a new
    parser_types.SourceLocation namedtuple. The rename helps clarify
    the difference between a location within source code
    (SourceLocation) and a location within a structure
    (FieldLocation).
  • Similarly, replace ir_data.Position with
    parser_types.SourcePosition.
  • Update any place that edits a SourceLocation with an appropriate
    assignment; e.g., x.source_location.end = y becomes
    x.source_location = x.source_location._replace(end=y). In most
    cases, several fields were updated consecutively; those updates are
    been merged.
  • Update the JSON serialization to use the compact format.
  • Replace format_location() and format_position() with
    __str__() methods on SourceLocation and SourcePosition,
    respectively.
  • Replace parse_location() and parse_position() with from_str()
    class methods on SourceLocation and SourcePosition,
    respectively.
  • Move the make_location() functionality into
    SourceLocation.__new__().
  • Update _to_dict and _from_dict in IrDataSerializer to
    stringify and destringify SourceLocation. It is tempting to
    try to do this during the JSON serialization step (with a default=
    parameter to json.dumps and an object_hook= parameter to
    json.loads), but it is tricky to get the object_hook to know
    when to convert.
  • Centralize the logic for merging source_locations into
    merge_source_locations().

This change makes the `Location` dataclass, which does not change
frequently, into a new `SourceLocation` namedtuple, and changes the
`SourceLocation` serialization.  As a result, with this change:

*   `embossc` runs about 25% faster on a large (7kLOC) input; `python3
    -OO emboss` runs about 19% faster on the same input.
*   Serialized IR is about 45% smaller.

Details:

*   Replace the `ir_data.Location` dataclass with a new
    `parser_types.SourceLocation` namedtuple.  The rename helps clarify
    the difference between a location within source code
    (`SourceLocation`) and a location within a structure
    (`FieldLocation`).
*   Similarly, replace `ir_data.Position` with
    `parser_types.SourcePosition`.
*   Update any place that edits a `SourceLocation` with an appropriate
    assignment; e.g., `x.source_location.end = y` becomes
    `x.source_location = x.source_location._replace(end=y)`.  In most
    cases, several fields were updated consecutively; those updates are
    been merged.
*   Update the JSON serialization to use the compact format.
*   Replace `format_location()` and `format_position()` with
    `__str__()` methods on `SourceLocation` and `SourcePosition`,
    respectively.
*   Replace `parse_location()` and `parse_position()` with `from_str()`
    class methods on `SourceLocation` and `SourcePosition`,
    respectively.
*   Move the `make_location()` functionality into
    `SourceLocation.__new__()`.
*   Update `_to_dict` and `_from_dict` in `IrDataSerializer` to
    stringify and destringify `SourceLocation`.  It is tempting to
    try to do this during the JSON serialization step (with a `default=`
    parameter to `json.dumps` and an `object_hook=` parameter to
    `json.loads`), but it is tricky to get the `object_hook` to know
    when to convert.
@reventlov
Copy link
Collaborator Author

The +/- on this PR shows a lot of lines removed, but they're mostly in IR fragments -- the actual code changes are roughly a wash in LOC.

Copy link
Collaborator

@EricRahm EricRahm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks for doing this. A few questions around handing of default values

else:
result.source_location = ir_data_utils.copy(parse_tree.source_location)
result.source_location = parse_tree.source_location
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard for me to tell if we need a copy here. I guess b/c it's a tuple the data is frozen so we don't? Overall I wonder if modifying ir_data_utils.update and ir_data_utils.copy to special case namedtuple would make sense. The overhead might not be worth it though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, more or less the point of this change is to make it so that it is never necessary to deep copy SourceLocation -- it's frozen, so it doesn't matter if multiple nodes point to the same SourceLocation instance. About half of the time in _copy() was just copying Location.

It would be (relatively) cheap to special case update and copy to check if their arguments are tuple (specifically checking for anything created by namedtuple() is harder), but I don't really see the point? Right now, they just raise an exception if you give them something that isn't an IR dataclass, so any mistakes will be caught quickly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think it's fine as-is, it just took me a second to grok why we weren't using the standard methods for this anymore.

compiler/front_end/module_ir_test.py Show resolved Hide resolved
result.append("{}.end missing".format(path))
else:
end = source_location.end

if start and end:
if start.HasField("line") and end.HasField("line"):
if start.line and end.line:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is checking that start.line != (0,0) and end.line != (0,0), is that actually what we want?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tightened up the code in here, mostly to take advantage of the tuple ordering. I think that the altered version addresses all of your comments.

This function is used to check the source_location invariants after module_ir.build_ir() is done -- basically, every node with a source_location field must have a source location, the source location must be inside the parent's source location, and both line and column must be >= 1.

I also added a number of tests to parser_types_test to cover the internal invariants in SourcePosition and SourceLocation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, so for a source location to be valid it must not be: (0,0), (0,1), (1,0) ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least in here, yes. (0, 0) is "valid" in the sense that you might have a node that didn't "come from" something in the source text, but module_ir shouldn't be creating any of those.

I just updated SourcePosition.__new__ to assert-fail on (0, x) and (x, 0), and SourceLocation.__new__ to assert-fail if start and not end or end and not start, and removed the checks here for only line or column being 0.

and end.HasField("column")
and start.column > end.column
):
if start.column and end.column and start.column > end.column:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we shouldn't have the start.column and end.column portion here, they'll always be present, this is just checking that they're not zero now right?

result.append("{}.start missing".format(path))
else:
start = source_location.start
if not source_location.HasField("end"):
if not source_location.end:
result.append("{}.end missing".format(path))
else:
end = source_location.end

if start and end:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is checking what we want now? Maybe is not None is more corect.

@@ -4057,12 +3984,12 @@ def _check_source_location(source_location, path, min_start, max_end):
for name, field in (("start", start), ("end", end)):
if not field:
continue
if field.HasField("line"):
if field.line:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and again

if field.line <= 0:
result.append("{}.{}.line <= 0 ({})".format(path, name, field.line))
else:
result.append("{}.{}.line missing".format(path, name))
if field.HasField("column"):
if field.column:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and again

compiler/util/parser_types.py Show resolved Hide resolved
compiler/util/parser_types.py Outdated Show resolved Hide resolved
compiler/util/parser_types.py Show resolved Hide resolved
Copy link
Collaborator

@EricRahm EricRahm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm as long as (0,0) is always invalid things make sense to me.

compiler/front_end/module_ir_test.py Show resolved Hide resolved
compiler/front_end/module_ir_test.py Outdated Show resolved Hide resolved
compiler/util/parser_types.py Outdated Show resolved Hide resolved
@reventlov
Copy link
Collaborator Author

Gotcha, so if (0,0) is always invalid I agree the bool operator is fine, I think I just didn't get that

It's invalid in the sense that it does not refer to an actual location (using 1-based numbering, not 0-based, because that's how lines and columns are customarily numbered). I added some explanation to the docstring.

Copy link
Collaborator

@EricRahm EricRahm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

compiler/util/parser_types.py Show resolved Hide resolved
compiler/front_end/module_ir_test.py Show resolved Hide resolved
compiler/util/parser_types.py Show resolved Hide resolved
@reventlov reventlov merged commit 73cbd98 into google:master Oct 17, 2024
4 checks passed
@reventlov reventlov deleted the source_location_tuple branch October 17, 2024 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants