aw-transform: Add union_events_split #179

johan-bjareholt · 2020-10-18T12:27:36Z

One step in fixing the "Merge web watcher events into window events in the query? (to allow for classifying by url/domain)" ActivityWatch/aw-webui#151. We already have it in the same query, but the data is not merged.
Hopefully this will be a good transform to merge window and browser data in that manner.

codecov · 2020-10-18T12:36:02Z

Codecov Report

Attention: Patch coverage is 5.06329% with 75 lines in your changes missing coverage. Please review.

Project coverage is 59.97%. Comparing base (7d55fca) to head (77eef3d).
Report is 275 commits behind head on master.

Files	Patch %	Lines
aw-transform/src/union.rs	0.00%	65 Missing ⚠️
aw-query/src/functions.rs	28.57%	10 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           master     #179       +/-   ##
===========================================
+ Coverage   46.30%   59.97%   +13.67%     
===========================================
  Files          51       44        -7     
  Lines        6148     4765     -1383     
  Branches     1454        0     -1454     
===========================================
+ Hits         2847     2858       +11     
+ Misses       2442     1907      -535     
+ Partials      859        0      -859


ErikBjare

Only taken a quick look but looks alright.

However, as we discussed AFK, I'm not convinced this is the best approach to merging the web events compared to simply replacing all browser-window events when an active browser event exists, but that has its own issues (like having to inject the browser appname, slightly changing the title by removing the appended - Mozilla Firefox, and maybe more).

ErikBjare · 2020-10-26T12:04:48Z

aw-transform/src/union.rs

+
+    'event1: for mut event1 in events1 {
+        let event1_endtime = event1.calculate_endtime();
+        'event2: for event2 in events2 {


Instead of a nested loop one might want to step through each list under certain conditions (similarly to how we do it in aw-server-python for some transforms: e1_i++ and e2_i++).

The borrow checker would probably hate that though, and I guess the timestamp checks are pretty fast despite the worst-case O(N^2).

I don't think the borrow checker would have any issues with that, you could just use iterators which should work I think.
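For illustration, the index-stepping idea could look roughly like the sketch below. It is not the code from this PR: `Span`, `span` and `intersect_sorted` are hypothetical names, timestamps are plain integers instead of chrono values, and both lists are assumed to be sorted by start time with no overlaps inside a list. Indexing into slices avoids any borrow-checker friction, and the walk is O(N + M) instead of O(N * M).

```rust
/// Simplified stand-in for an event (assumption: integer seconds and a
/// string label instead of chrono timestamps and a JSON data map).
#[derive(Debug, Clone, PartialEq)]
struct Span {
    start: i64,
    end: i64,
    label: String,
}

/// Small constructor helper to keep the examples short.
fn span(start: i64, end: i64, label: &str) -> Span {
    Span { start, end, label: label.to_string() }
}

/// Steps through both sorted lists once and returns the overlapping parts.
fn intersect_sorted(a: &[Span], b: &[Span]) -> Vec<Span> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        let start = a[i].start.max(b[j].start);
        let end = a[i].end.min(b[j].end);
        if start < end {
            out.push(Span {
                start,
                end,
                label: format!("{}+{}", a[i].label, b[j].label),
            });
        }
        // Advance whichever event ends first; the other one may still
        // overlap the next event of the opposite list.
        if a[i].end <= b[j].end {
            i += 1;
        } else {
            j += 1;
        }
    }
    out
}
```

The same cursor pattern could be extended to also emit the non-overlapping parts, which is what the split transform needs.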

johan-bjareholt · 2020-10-29T20:12:39Z

However, as we discussed AFK, I'm not convinced this is the best approach to merging the web events compared to simply replacing all browser-window events when an active browser event exists, but that has its own issues (like having to inject the browser appname, slightly changing the title by removing the appended - Mozilla Firefox, and maybe more).

So like the opposite of filter_period_intersect and then concatenating window and browser events?

ErikBjare · 2020-10-30T14:03:02Z

@johan-bjareholt I think something like:

window_events = query(...)   # query and do AFK filtering

for browser in browsers:  # this for-loop would need to be expressed in the function constructing the query
    web_events = query("aw-watcher-web-$browser")
    window_events_browser = filter_by(window_events, {app: $browserAppname})  # can't remember what transform is used to do this
    web_events_active = filter_period_intersect(web_events, window_events_browser)  # filter web events by browser being the active window
    web_events_active = amend_data(web_events_active, {app: $browserAppname})  # optional, amends the missing {app: $browserAppname}
    window_events = union(web_events_active, window_events)  # picks events from web_events_active first, then fills rest with window_events

Not sure if we have something like union (in the way I mean it) already, but I mean something that does:

"bucket"          | events
---------------------------------------------------------------
web_events_active |           [github  ]           [youtube   ]
window            |[spotify  ][browser ][terminal ][browser   ]
---------------------------------------------------------------
result            |[spotify  ][github  ][terminal ][youtube   ]

A complete end-to-end example would be:

bucket  | events
---------------------------------------------
window  | [terminal  ][firefox  ][terminal  ]
browser | [youtube   ][google               ]
---------------------------------------------
result  | [terminal  ][google   ][terminal  ]
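A minimal sketch of the "pick from the first list, fill the rest from the second" behaviour shown in the first table, under stated assumptions: `overlay` is a hypothetical name (no such transform is confirmed to exist), events are reduced to integer start/end times plus a label, and the priority list is sorted and free of internal overlaps.

```rust
/// Simplified stand-in for an event (assumption: integer seconds and a
/// string label instead of chrono timestamps and a JSON data map).
#[derive(Debug, Clone, PartialEq)]
struct Span {
    start: i64,
    end: i64,
    label: String,
}

fn span(start: i64, end: i64, label: &str) -> Span {
    Span { start, end, label: label.to_string() }
}

/// Keeps every `priority` event as-is and only keeps the parts of `fill`
/// events that no `priority` event covers.
fn overlay(priority: &[Span], fill: &[Span]) -> Vec<Span> {
    let mut out: Vec<Span> = priority.to_vec();
    for f in fill {
        // `cursor` marks how much of `f` has been handled so far.
        let mut cursor = f.start;
        for p in priority {
            if p.end <= cursor || p.start >= f.end {
                continue; // no overlap with the unhandled part of `f`
            }
            if p.start > cursor {
                // Uncovered gap before this priority event.
                out.push(Span { start: cursor, end: p.start, label: f.label.clone() });
            }
            cursor = cursor.max(p.end);
        }
        if cursor < f.end {
            // Uncovered tail after the last overlapping priority event.
            out.push(Span { start: cursor, end: f.end, label: f.label.clone() });
        }
    }
    out.sort_by_key(|s| s.start);
    out
}
```

With the first table's data, `overlay(web_events_active, window)` would yield spotify, github, terminal, youtube in that order.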

johan-bjareholt · 2020-10-30T16:06:10Z

web_events_active = amend_data(web_events_active, {app: $browserAppname})  # optional, amends the missing {app: $browserAppname}

I think this would be a strange transform, inserting data into an event which is hardcoded in the query and does not come from a bucket feels like something we should try to avoid. One reason is that it encourages people to write dynamically generated queries, as we do in the web-ui, which is something we don't want.

window_events = union(web_events_active, window_events) # picks events from web_events_active first, then fills rest with window_events

To call that "union" is very misleading in my opinion. A union should be a common ground, not exclude anything.

johan-bjareholt · 2020-10-30T16:07:56Z

However, as we discussed AFK, I'm not convinced this is the best approach to merging the web events compared to simply replacing all browser-window events when an active browser event exists, but that has its own issues (like having to inject the browser appname, slightly changing the title by removing the appended - Mozilla Firefox, and maybe more).

What this transform does is essentially this, but it also injects the browser's appname. See the example in the code:

///     |---------|--------------------|
///     | events1 |[a     ][b     ]    |
///     | events2 |    [c     ]    [d ]|
///     | result  |[a ][ac][bc][b ]    |
///     |---------|--------------------|

So if we put that into perspective of window events and browser events already filtered with the browser window:

///     |---------|--------------------------|
///     | window  |[firefox   ][terminal    ]|
///     | browser |    [google    ]          |
///     | result  |[f ][f+g   ][terminal    ]|
///     |---------|--------------------------|

Which is pretty much exactly what you just wrote?
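To make the two diagrams concrete, here is a small sketch of the splitting behaviour they describe. It is not the implementation in `aw-transform/src/union.rs`: `split_sketch` is a hypothetical name, events carry integer times and a label rather than a JSON data map, and `events2` is assumed to be sorted without internal overlaps. Parts of `events1` that overlap an `events2` event get a combined label, the rest of `events1` is kept unchanged, and `events2` time outside `events1` is dropped.

```rust
/// Simplified stand-in for an event (assumption: integer seconds and a
/// string label instead of chrono timestamps and a JSON data map).
#[derive(Debug, Clone, PartialEq)]
struct Span {
    start: i64,
    end: i64,
    label: String,
}

fn span(start: i64, end: i64, label: &str) -> Span {
    Span { start, end, label: label.to_string() }
}

/// Splits each `events1` event at the boundaries of overlapping `events2`
/// events and merges the labels of the overlapping parts.
fn split_sketch(events1: &[Span], events2: &[Span]) -> Vec<Span> {
    let mut out = Vec::new();
    for e1 in events1 {
        // `cursor` marks how much of `e1` has already been emitted.
        let mut cursor = e1.start;
        for e2 in events2 {
            if e2.end <= cursor || e2.start >= e1.end {
                continue; // no overlap with the remaining part of `e1`
            }
            let start = cursor.max(e2.start);
            let end = e1.end.min(e2.end);
            if start > cursor {
                // Part of `e1` before the overlap stays as-is.
                out.push(Span { start: cursor, end: start, label: e1.label.clone() });
            }
            // Overlapping part carries data from both events.
            out.push(Span { start, end, label: format!("{}{}", e1.label, e2.label) });
            cursor = end;
        }
        if cursor < e1.end {
            out.push(Span { start: cursor, end: e1.end, label: e1.label.clone() });
        }
    }
    out
}
```

Feeding it the first diagram (a and b as `events1`, c and d as `events2`) gives the four pieces a, ac, bc, b, and d is dropped because nothing in `events1` overlaps it.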

ErikBjare · 2020-11-01T09:35:41Z

To call that "union" is very misleading in my opinion. A union should be a common ground, not exclude anything.

For sure, I was just unsure what to call it.

I think this would be a strange transform, inserting data into an event which is hardcoded in the query and does not come from a bucket feels like something we should try to avoid.

I agree in general, but in this case the data does come from a bucket (and doesn't actually introduce new hardcoding, since we already have our list of $browserAppname). We're reinserting exactly what we know we lost before (with filter_by and filter_period_intersect).

But on second thought I realize it'll still be messy, due to there being multiple possible appnames for each browser.

So if we put that into perspective of window events and browser events already filtered with the browser window: <example>

I thought it would become:

///     |---------|---------------------------|
///     | window  |[firefox   ][terminal     ]|
///     | browser |    [google     ]          |
///     | result  |[f ][f+g   ][t+g][terminal]|
///     |---------|---------------------------|

Basically, what I'm trying to get rid of is splitting events into two.

Edit: Ah nvm, your example would indeed become as you wrote after filtering the browser events.

But it would still lead to things like this, no?:

/// Abbreviations:
///  - ff: Firefox
///  - t1: Tab 1
///  - ff(t1): Firefox window event polled from when `t1` was active
///  - ff(t1)+t2: Firefox event with title from `t1`, but URL from `t2` (after `merge_map`)
///
///     |---------|------------------------------------|
///     | window  |[ff(t1)                ][ff(t2)    ]|
///     | browser |[t1        ][t2                    ]|
///     | result  |[ff(t1)+t1 ][ff(t1)+t2 ][ff(t2)+t2 ]|
///     |---------|------------------------------------|

So if I'm not mistaken, this would result in 'middle-events' where the title (which is gotten from the window event, if I understood merge_map correctly) and URLs (from the web event) are misaligned. This will happen anytime two events don't perfectly overlap (which is always), leading to a lot of small misaligned events like this.

Edit 2: I'm not sure, but maybe this could be resolved by using non-flooded window-events for the union_events_split and flood after?

Maybe that would lead to:

///     |---------|-------------------------------|
///     | window  |[ff(t1)    ]      [ff(t2)     ]|
///     | browser |[t1           ][t2            ]|
///     | result  |[ff(t1)+t1 ]      [ff(t2)+t2  ]|
///     |---------|-------------------------------|

Which after flooding would become the correct:

///     |---------|-------------------------------|
///     | window  |[ff(t1)    ]      [ff(t2)     ]|
///     | browser |[t1           ][t2            ]|
///     | result  |[ff(t1)+t1    ][ff(t2)+t2     ]|
///     |---------|-------------------------------|

(or something similar, depending on flooding strategy)
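As a rough illustration of the "flood after" step, the sketch below closes gaps by extending each event forward to the start of the next one. This is only one possible strategy and an assumption on my part: `flood_forward` is a hypothetical name, the real flood transform may split gaps differently or take a pulsetime, and events are again reduced to integer times plus a label.

```rust
/// Simplified stand-in for an event (assumption: integer seconds and a
/// string label instead of chrono timestamps and a JSON data map).
#[derive(Debug, Clone, PartialEq)]
struct Span {
    start: i64,
    end: i64,
    label: String,
}

fn span(start: i64, end: i64, label: &str) -> Span {
    Span { start, end, label: label.to_string() }
}

/// Extends every event up to the start of the following event.
/// Assumes `events` is sorted by start time.
fn flood_forward(events: &[Span]) -> Vec<Span> {
    let mut out = events.to_vec();
    for i in 0..out.len().saturating_sub(1) {
        let next_start = out[i + 1].start;
        if out[i].end < next_start {
            out[i].end = next_start; // close the gap
        }
    }
    out
}
```

Applied to the unflooded result row above, the `ff(t1)+t1` event would grow until it meets `ff(t2)+t2`, so no `ff(t1)+t2` middle event is ever produced.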

This might require a lot of extra memory though, since we'd need both window_events (not flooded, for filtering the browser events that go into union_events_split) and window_events_flooded (flooded, for filtering the browser events that go into filter_period_intersect).

Edit 3: Regardless, I'd be happy to merge this if there were more comprehensive tests (for example, checking that these 'middle-events' get created as expected).

Unless we can come up with a neat solution to the problem (which I'm no longer sure there is), I think it's better to just merge this in the meantime and work on perfecting the transforms/queries later (as this will probably work well enough).

ErikBjare

Second review, now that I understand things better.

ErikBjare · 2020-11-01T09:52:21Z

aw-transform/src/union.rs

+
+        /* test non-object conflict, prefer map1 value */
+        // TODO: This does not work yet!
+        //       It should be a pretty rare use-case anyway


This wouldn't be rare? Both window-events and web-events have a title?

ErikBjare · 2020-11-01T09:54:10Z

aw-transform/src/union.rs

+            duration: Duration::seconds(3),
+            data: json_map! {"test": json!(1)},
+        };
+        let mut e2 = e1.clone();


I'd like to have more realistic data here, for example:

For e1: {app: "firefox", title: "google - firefox"}
For e2: {title: "google", url: "google.com"}

I'd also like event series that are at least two events long (like the examples I recently commented about). Would also help with ensuring behavior stays consistent when we eventually remove the nested loops.
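With the suggested data, the title conflict is easy to see. The sketch below is not `merge_map` from this PR: `merge_prefer_first` is a hypothetical helper over flat string maps (the real code works on JSON maps), and it simply applies the "prefer map1 value" rule mentioned in the TODO comment of the diff.

```rust
use std::collections::BTreeMap;

/// Flat string map standing in for an event's JSON data (assumption).
type Data = BTreeMap<String, String>;

/// Merges two data maps; on a key conflict the value from `map1` wins.
fn merge_prefer_first(map1: &Data, map2: &Data) -> Data {
    // Start from map2, then overwrite with everything in map1.
    let mut merged = map2.clone();
    for (key, value) in map1 {
        merged.insert(key.clone(), value.clone());
    }
    merged
}

/// Builds a `Data` map from string pairs, to keep test data readable.
fn data(pairs: &[(&str, &str)]) -> Data {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

Merging `{app: "firefox", title: "google - firefox"}` with `{title: "google", url: "google.com"}` under this rule keeps the window title and adds the URL, which is exactly the combination that can end up misaligned in the middle events discussed earlier.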

johan-bjareholt · 2020-11-01T20:26:00Z

So if I'm not mistaken, this would result in 'middle-events' where the title (which is gotten from the window event, if I understood merge_map correctly) and URLs (from the web event) are misaligned. This will happen anytime two events don't perfectly overlap (which is always), leading to a lot of small misaligned events like this.

That's a very good point which I thought of before but forgot.

Edit 2: I'm not sure, but maybe this could be resolved by using non-flooded window-events for the union_events_split and flood after?

That sounds like a very clever way of solving it, will try that out.

This might require a lot of extra memory though, since we'd need both window_events (not flooded, for filtering the browser events that go into union_events_split) and window_events_flooded (flooded, for filtering the browser events that go into filter_period_intersect).

I took a deep dive into the query code and it seems like you are correct here. Here's an example of how we would have to change our transforms:

- events = flood(query_bucket("bucketname"));
+ events_unflooded = query_bucket("bucketname");
+ events_flooded = flood(events_unflooded)

The query language is very inefficient with its assignments: once a value is assigned to a variable, it gets cloned every time the variable is referenced, because we do not know whether the variable will be used again later and we need to guarantee that its value won't change. Previously, when we just called flood(query_bucket("bucketname")), we could re-use the intermediate events directly. With an assignment we have to keep the value in the variable, clone it on each reference, and not release it until the whole query is complete. This can be improved in the future by #119.

But I think that the impact of just one more clone would be minimal compared to the whole issue we have today with #119; there are many more unnecessary clones than just this one.

johan-bjareholt · 2020-11-16T20:34:31Z

Here's a query that works with this transform. The web-ui queries.ts code is really messy though, so I'm having a hard time adapting it, especially now that "canonicalQuery" should probably include browser events.

{
  "timeperiods": [
    "2020-11-01T01:00:00Z/2020-11-30T01:00:00Z"
  ],
  "query": [
    "window_events = query_bucket(\"aw-watcher-window_johan-laptop2\");",
    "browser_events = query_bucket(\"aw-watcher-web-firefox\");",
    "browser_events = split_url_events(browser_events);",
    "firefox_events = filter_keyvals(window_events, \"app\", [\"firefox\"]);",
    "events = union_events_split(firefox_events, browser_events);",
    "events = merge_events_by_keys(events, [\"$domain\"]);",
    "events = categorize(events, [
        [[\"docs.rs\"], {\"type\": \"regex\", \"regex\": \"^docs.rs$\"}],
        [[\"reddit\"], {\"type\": \"regex\", \"regex\": \"^reddit.com$\"}],
        [[\"crates.io\"], {\"type\": \"regex\", \"regex\": \"^crates.io$\"}]
    ]);",
    "events = sort_by_duration(events);",
    "RETURN = events;"
  ]
}

AllanChain · 2022-03-14T10:15:37Z

Hi! Seems that all major problems have been solved in above discussions. What's currently blocking this PR?

johan-bjareholt requested a review from ErikBjare October 18, 2020 12:27

johan-bjareholt force-pushed the dev/union-events-split branch from d409bbb to 2e5e4fd Compare October 18, 2020 13:08

ErikBjare reviewed Oct 26, 2020

johan-bjareholt mentioned this pull request Oct 29, 2020

Add explicit cloning to query2 #119

Open

ErikBjare reviewed Nov 1, 2020

johan-bjareholt mentioned this pull request Nov 1, 2020

List of possible query2 optimizations #148

Open

aw-transform: Add union_events_split

77eef3d

johan-bjareholt force-pushed the dev/union-events-split branch from 2e5e4fd to 77eef3d Compare January 6, 2021 08:16

ErikBjare mentioned this pull request Feb 22, 2021

Implemented audible-as-active ActivityWatch/aw-webui#85

Closed

ErikBjare mentioned this pull request Apr 25, 2021

Can't categorize activities by URLs ActivityWatch/activitywatch#352

Open

ErikBjare force-pushed the master branch 2 times, most recently from 9d93857 to 096da6d Compare April 28, 2022 07:25

ErikBjare force-pushed the master branch from 98adff1 to 1219744 Compare November 22, 2023 11:08
