Convert Location to a namedtuple, and associated cleanup #205

reventlov · 2024-10-14T15:17:09Z

This change makes the `Location` dataclass, which does not change
frequently, into a new `SourceLocation` namedtuple, and changes the
`SourceLocation` serialization. As a result, with this change:

* `embossc` runs about 25% faster on a large (7kLOC) input; `python3 -OO embossc` runs about 19% faster on the same input.
* Serialized IR is about 45% smaller.

Details:

* Replace the `ir_data.Location` dataclass with a new
  `parser_types.SourceLocation` namedtuple. The rename helps clarify
  the difference between a location within source code
  (`SourceLocation`) and a location within a structure
  (`FieldLocation`).
* Similarly, replace `ir_data.Position` with
  `parser_types.SourcePosition`.
* Update any place that edits a `SourceLocation` with an appropriate
  assignment; e.g., `x.source_location.end = y` becomes
  `x.source_location = x.source_location._replace(end=y)`. In most
  cases, several fields were updated consecutively; those updates
  have been merged.
* Update the JSON serialization to use the compact format.
  * Replace `format_location()` and `format_position()` with
    `__str__()` methods on `SourceLocation` and `SourcePosition`,
    respectively.
  * Replace `parse_location()` and `parse_position()` with `from_str()`
    class methods on `SourceLocation` and `SourcePosition`,
    respectively.
  * Move the `make_location()` functionality into
    `SourceLocation.__new__()`.
  * Update `_to_dict` and `_from_dict` in `IrDataSerializer` to
    stringify and destringify `SourceLocation`. It is tempting to
    try to do this during the JSON serialization step (with a `default=`
    parameter to `json.dumps` and an `object_hook=` parameter to
    `json.loads`), but it is tricky to get the `object_hook` to know
    when to convert.
* Centralize the logic for merging `source_location`s into
  `merge_source_locations()`.
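
For readers unfamiliar with the pattern, here is a minimal sketch of the `_replace()` and stringify/destringify ideas described above. It is not the code in this PR: the `line:column-line:column` string format and the method bodies are assumptions made for illustration, and only the names `SourcePosition`, `SourceLocation`, `__str__()`, and `from_str()` come from the description.

```python
import collections
import json


class SourcePosition(collections.namedtuple("SourcePosition", ["line", "column"])):
    __slots__ = ()

    def __str__(self):
        # Assumed compact form "line:column"; the real format may differ.
        return f"{self.line}:{self.column}"

    @classmethod
    def from_str(cls, value):
        line, _, column = value.partition(":")
        return cls(int(line), int(column))


class SourceLocation(collections.namedtuple("SourceLocation", ["start", "end"])):
    __slots__ = ()

    def __str__(self):
        # Assumed compact form "start-end".
        return f"{self.start}-{self.end}"

    @classmethod
    def from_str(cls, value):
        start, _, end = value.partition("-")
        return cls(SourcePosition.from_str(start), SourcePosition.from_str(end))


loc = SourceLocation(SourcePosition(3, 1), SourcePosition(3, 9))

# Two consecutive field updates merged into a single _replace() call,
# which returns a new tuple instead of mutating the old one.
loc = loc._replace(start=SourcePosition(2, 5), end=SourcePosition(4, 1))

# Stringify before handing the value to json, destringify after loading.
encoded = json.dumps({"source_location": str(loc)})
decoded = SourceLocation.from_str(json.loads(encoded)["source_location"])
```

Converting to and from a string outside of `json.dumps`/`json.loads` keeps the decision about which strings are locations with the caller, which is the difficulty with `object_hook` that the description mentions.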

reventlov · 2024-10-14T15:18:40Z

The +/- on this PR shows a lot of lines removed, but they're mostly in IR fragments -- the actual code changes are roughly a wash in LOC.

EricRahm

This is great, thanks for doing this. A few questions around handling of default values.

EricRahm · 2024-10-14T17:29:35Z

compiler/front_end/module_ir.py

 else:
- result.source_location = ir_data_utils.copy(parse_tree.source_location)
+ result.source_location = parse_tree.source_location


It's hard for me to tell if we need a copy here. I guess b/c it's a tuple the data is frozen so we don't? Overall I wonder if modifying ir_data_utils.update and ir_data_utils.copy to special case namedtuple would make sense. The overhead might not be worth it though.

Yeah, more or less the point of this change is to make it so that it is never necessary to deep copy SourceLocation -- it's frozen, so it doesn't matter if multiple nodes point to the same SourceLocation instance. About half of the time in _copy() was just copying Location.

It would be (relatively) cheap to special case update and copy to check if their arguments are tuple (specifically checking for anything created by namedtuple() is harder), but I don't really see the point? Right now, they just raise an exception if you give them something that isn't an IR dataclass, so any mistakes will be caught quickly.

Yeah I think it's fine as-is, it just took me a second to grok why we weren't using the standard methods for this anymore.
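
A small sketch of the point made in this thread, using a stand-in namedtuple rather than the real `parser_types.SourceLocation`: sharing one instance between nodes is safe because it cannot be mutated in place. The helper `looks_like_namedtuple` is hypothetical and only illustrates why detecting "created by `namedtuple()`" takes more than an `isinstance(..., tuple)` check.

```python
import collections

# Stand-in for the real class; plain tuples are used for start and end.
SourceLocation = collections.namedtuple("SourceLocation", ["start", "end"])

parse_tree_location = SourceLocation(start=(1, 1), end=(1, 10))
result_location = parse_tree_location  # shared reference, no copy

# In-place edits are rejected, so the shared instance cannot change
# underneath another node.
try:
    result_location.end = (2, 1)
except AttributeError:
    mutated = False
else:
    mutated = True

# "Editing" builds a new tuple and leaves the original untouched.
updated = result_location._replace(end=(2, 1))


def looks_like_namedtuple(value):
    """Heuristic check: a tuple whose type carries a _fields attribute."""
    return isinstance(value, tuple) and hasattr(type(value), "_fields")
```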

EricRahm · 2024-10-14T18:03:11Z

compiler/front_end/module_ir_test.py

 result.append("{}.end missing".format(path))
 else:
 end = source_location.end

 if start and end:
- if start.HasField("line") and end.HasField("line"):
+ if start.line and end.line:


I guess this is checking that start.line != (0,0) and end.line != (0,0), is that actually what we want?

I tightened up the code in here, mostly to take advantage of the tuple ordering. I think that the altered version addresses all of your comments.

This function is used to check the source_location invariants after module_ir.build_ir() is done -- basically, every node with a source_location field must have a source location, the source location must be inside the parent's source location, and both line and column must be >= 1.

I also added a number of tests to parser_types_test to cover the internal invariants in SourcePosition and SourceLocation.
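
As a rough illustration of the invariants listed above and of what "tuple ordering" buys, here is a hypothetical checker. It is not the `_check_source_location` from this PR; the function name, signature, and messages are made up, and the namedtuples are simplified stand-ins.

```python
import collections

SourcePosition = collections.namedtuple("SourcePosition", ["line", "column"])
SourceLocation = collections.namedtuple("SourceLocation", ["start", "end"])


def check_location(location, parent):
    """Returns a list of invariant violations for location within parent."""
    errors = []
    for name, position in (("start", location.start), ("end", location.end)):
        if position.line < 1 or position.column < 1:
            errors.append(f"{name} must be at or after 1:1")
    # Tuples compare element by element, so (line, column) pairs order the
    # same way positions do in the source text.
    if location.start > location.end:
        errors.append("start is after end")
    if location.start < parent.start or location.end > parent.end:
        errors.append("not inside parent")
    return errors


parent = SourceLocation(SourcePosition(1, 1), SourcePosition(10, 1))
inside = SourceLocation(SourcePosition(2, 5), SourcePosition(2, 9))
overhang = SourceLocation(SourcePosition(9, 1), SourcePosition(11, 1))
backwards = SourceLocation(SourcePosition(3, 7), SourcePosition(3, 2))
```

Comparing whole positions replaces separate line and column comparisons, which is presumably what allowed the checks to be tightened.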

Ah okay, so for a source location to be valid it must not be: (0,0), (0,1), (1,0) ?

At least in here, yes. (0, 0) is "valid" in the sense that you might have a node that didn't "come from" something in the source text, but module_ir shouldn't be creating any of those.

I just updated SourcePosition.__new__ to assert-fail on (0, x) and (x, 0), and SourceLocation.__new__ to assert-fail if start and not end or end and not start, and removed the checks here for only line or column being 0.
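
The constructor checks described in this comment might look roughly like the following. This is a sketch under assumptions, not the code that was committed: the default values and assertion messages are invented, and only the two conditions named above are checked.

```python
import collections


class SourcePosition(collections.namedtuple("SourcePosition", ["line", "column"])):
    __slots__ = ()

    def __new__(cls, line=0, column=0):
        # (0, 0) means "no position"; (0, x) and (x, 0) are rejected.
        assert (line == 0) == (column == 0), (
            f"line and column must both be zero or both be set: {line}:{column}"
        )
        return super().__new__(cls, line, column)


class SourceLocation(collections.namedtuple("SourceLocation", ["start", "end"])):
    __slots__ = ()

    def __new__(cls, start=SourcePosition(), end=SourcePosition()):
        unset = SourcePosition()
        # Either both ends are set or neither is.
        assert (start == unset) == (end == unset), (
            "start and end must be set together"
        )
        return super().__new__(cls, start, end)
```

Because these are `assert` statements, they disappear under `python3 -O` and `-OO`, so they act as development-time checks rather than runtime validation.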

EricRahm · 2024-10-14T19:24:25Z

compiler/front_end/module_ir_test.py

- and end.HasField("column")
- and start.column > end.column
- ):
+ if start.column and end.column and start.column > end.column:


nit: we shouldn't have the start.column and end.column portion here, they'll always be present, this is just checking that they're not zero now right?

EricRahm · 2024-10-14T19:27:54Z

compiler/front_end/module_ir_test.py

 result.append("{}.start missing".format(path))
 else:
 start = source_location.start
- if not source_location.HasField("end"):
+ if not source_location.end:
 result.append("{}.end missing".format(path))
 else:
 end = source_location.end

 if start and end:


I'm not sure this is checking what we want now? Maybe `is not None` is more correct.

EricRahm · 2024-10-14T19:31:12Z

compiler/front_end/module_ir_test.py

@@ -4057,12 +3984,12 @@ def _check_source_location(source_location, path, min_start, max_end):
 for name, field in (("start", start), ("end", end)):
 if not field:
 continue
- if field.HasField("line"):
+ if field.line:


...and again

EricRahm · 2024-10-14T19:31:20Z

compiler/front_end/module_ir_test.py

 if field.line <= 0:
 result.append("{}.{}.line <= 0 ({})".format(path, name, field.line))
 else:
 result.append("{}.{}.line missing".format(path, name))
- if field.HasField("column"):
+ if field.column:


... and again

EricRahm

lgtm as long as (0,0) is always invalid things make sense to me.

reventlov · 2024-10-16T20:23:42Z

Gotcha, so if (0,0) is always invalid I agree the bool operator is fine, I think I just didn't get that

It's invalid in the sense that it does not refer to an actual location (using 1-based numbering, not 0-based, because that's how lines and columns are customarily numbered). I added some explanation to the docstring.
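
One detail behind "the bool operator is fine": a plain namedtuple is truthy whenever it has fields, even if every field is zero, so a truthiness test only works if the class overrides `__bool__`. The sketch below assumes `SourcePosition` does so; the thread implies this but does not show it, and `PlainPosition` is a hypothetical class added for contrast.

```python
import collections

# A plain namedtuple: truthiness follows tuple rules (non-empty is True).
PlainPosition = collections.namedtuple("PlainPosition", ["line", "column"])


class SourcePosition(collections.namedtuple("SourcePosition", ["line", "column"])):
    __slots__ = ()

    def __bool__(self):
        # Assumed behavior: (0, 0) is "no position" under 1-based numbering.
        return bool(self.line and self.column)


plain_is_truthy = bool(PlainPosition(0, 0))
unset_is_truthy = bool(SourcePosition(0, 0))
real_is_truthy = bool(SourcePosition(1, 1))
```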

EricRahm

lgtm!

reventlov added 2 commits October 13, 2024 04:32

Restore incorrectly-removed test.

2b55d69

reventlov requested a review from EricRahm October 14, 2024 15:17

EricRahm reviewed Oct 14, 2024

Revamp _check_source_location and add more parser_types tests.

f653880

reventlov requested a review from EricRahm October 16, 2024 17:49

EricRahm approved these changes Oct 16, 2024

reventlov added 3 commits October 16, 2024 19:44

Move more assertions into SourceLocation and SourcePosition.

e096739

Add explanatory comment.

dcceb48

Reformat with Black.

d16c2a9

reventlov requested a review from EricRahm October 16, 2024 21:41

EricRahm approved these changes Oct 16, 2024

reventlov merged commit 73cbd98 into google:master Oct 17, 2024
4 checks passed

reventlov deleted the source_location_tuple branch October 17, 2024 20:32
