v2: follow search notifications #26

Open · 3 tasks done
josh-chamberlain opened this issue Aug 31, 2022 · 13 comments
Labels: api, v2 (for v2 release)

josh-chamberlain commented Aug 31, 2022

This is how people get notified about updates. The idea is that if new data becomes available, a FOIA request comes through, or a request is completed, they can come back for the data they were interested in.

Front end

  • a button for following a search on v2: search results #250
    • for non-logged-in users, we should tell them they need to log in or sign up to get notified
    • upon success, we could show a toast: "You are now following this search. See all followed searches here" with a link to the user profile

Back end

maxachis commented Aug 4, 2024

This poses some interesting implementation questions: most prominently, how do we know when a search's results get updated?

Off the top of my head, I can think of a few options:

  1. We rerun all followed searches on a regular basis and compare their new results with previous results. Comparatively simple, but the time to rerun these searches will scale linearly with the number of searches being saved and/or followed. We'll need to think about how often we rerun the searches, and whether that limits the number of searches a user can follow. This is a known unknown (KU), in that it'd be relatively straightforward to implement.
  2. We work backwards from newly added data to find the searches it would impact. This would at least mean we don't have to rerun a bunch of searches, although it frontloads work onto the act of adding data, and it's an unknown unknown (UU) whether and how we can do this. It may require careful structuring of searches so that they can be properly followed.

Then we also have the question of how to manage this in the database:

  1. We'll need a saved_searches table, as well as a user_to_saved_searches table to associate users with saved searches.
  2. For the saved_searches table, we'll need to figure out how to structure the table itself. A column for each possible parameter? That means it's coupled with the search endpoint, so if the endpoint changes, the database structure will have to change as well.
  3. We'll need to figure out when (or if) to remove saved searches if users unfollow the saved search (for example, do we delete a saved search if all users who previously followed it unfollowed it?)
  4. Some people might make similar saved searches, so we'll need to figure out how to consolidate them. Requiring each saved search to be unique is one way; case insensitivity might be important.

We'll probably need one or two endpoints: one to follow a search and one to unfollow (these might collapse into a single endpoint). Then we'll need to implement saved searches as well, so #21 may be a blocker.

maxachis commented Aug 4, 2024

@josh-chamberlain Given that you indicated that #21 is relatively low priority, how is this prioritized in relation?

josh-chamberlain commented Aug 5, 2024

@maxachis this might be another semantic thing, so bear with me: we don't need "saved" searches for this to work. Instead, we'd keep track of people's "followed" searches. Basically, we realized "following" and "saving" a search were not functionally different enough to justify building both. Who knows when or if we will ever make advanced searches?

The anticipated most common use of this function is to "follow" a geography, occasionally with a record type modifier.

Therefore, I think we can have a followed_searches table. We have a geographic object associated with the search, and a coarse record type. From those, you can recreate the search. We don't have plans to make search more complicated any time soon.

If you want, it would be acceptable for v2 to only allow following a geography. Instead of following a search, people could follow a geographic area and maybe get some notifications about data they don't want. That'd be simpler, and OK.

Then every once in a while (as often as the digest email) we would check all the follows for new sources and requests!

maxachis commented

@josh-chamberlain Here is my implementation proposal. Because this is complex, it's a lengthy one, with a lot of headers to demarcate everything:

Implementation Proposal

Overview

We track specific location combinations in the database, queue an update for each location combination ("geo") as relevant data sources and requests change, and periodically send the queued updates out to users.

If someone is subscribed to a geo or to any broader geo containing it, they would get updates.

Suppose we have state S, county C, locality L.

  • A user U can follow a search for the geo of SCL, SC, or S
  • If one of several notification events occurs for data requests or sources associated with a given geo, U will be notified if:
    • U follows SCL
    • U follows SC
    • U follows S

Notifying Events

These are the events which could trigger a notification:

  • A data source with that geo is created
  • A data source with that geo is updated
  • A request for that geo is created
  • A request is associated with a source with that geo
  • A request for that geo is closed
  • Unclear: A request status with that geo is otherwise updated

Database Design

Three tables (a schema sketch follows this list):

  • followed_geos: takes a combination of the foreign keys to a state, county, and locality (the latter two can be null)
    • When a user follows a particular search and the SCL does not yet exist, an entry is created here
  • link_users_followed_geos: associates a user with a followed geo. When a user follows a search, an entry is added here.
  • queued_updates: takes
    • A foreign key to the first table
    • A foreign key to a data source
    • An optional foreign key to a request
    • An update_status enum, which can be
      • Data source created
      • Data source updated
      • Request opened for source
      • Request closed for source
      • Request status updated? (to encapsulate possible non-closing/opening updates)
    • Open question: what do we do if a particular combination has multiple things occur to it during the polling window?
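
As a concrete sketch, here's what those three tables might look like in minimal SQLite-flavored DDL. Column names, nullability, and the update_status values are assumptions drawn from the bullets above, not the production schema:

```python
import sqlite3

# In-memory sketch of the proposed schema; names are assumptions from the
# proposal above. Note that SQL treats NULLs as distinct in UNIQUE
# constraints, so uniqueness of state-only or county-only follows would
# need extra handling in practice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE followed_geos (
    id          INTEGER PRIMARY KEY,
    state_id    INTEGER NOT NULL,  -- FK to states
    county_id   INTEGER,           -- FK to counties; NULL = whole state
    locality_id INTEGER,           -- FK to localities; NULL = whole county
    UNIQUE (state_id, county_id, locality_id)
);

CREATE TABLE link_users_followed_geos (
    user_id INTEGER NOT NULL,      -- FK to users
    fg_id   INTEGER NOT NULL REFERENCES followed_geos(id),
    PRIMARY KEY (user_id, fg_id)
);

CREATE TABLE queued_updates (
    id             INTEGER PRIMARY KEY,
    fg_id          INTEGER NOT NULL REFERENCES followed_geos(id),
    data_source_id INTEGER NOT NULL,
    request_id     INTEGER,        -- optional FK to requests
    update_status  TEXT NOT NULL CHECK (update_status IN (
        'source_created', 'source_updated', 'request_opened',
        'request_closed', 'request_updated'))
);
""")
```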

Backend Implementation

When a notifying event occurs, we get the SCL of the Geo and check if:

  • An entry exists in followed_geos matching SCL
  • An entry exists in followed_geos matching SC
  • An entry exists in followed_geos matching S

For any of the three valid cases above, we get the followed_geos id for that combination, and add a new entry to queued_updates with the fg_id, the data source id, the request id (if involved), and an update status.
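
A rough sketch of that matching logic, assuming the schema sketched above (the function name is hypothetical; `IS ?` is SQLite's null-safe equality, which Postgres would express as `IS NOT DISTINCT FROM`):

```python
def queue_update(conn, state_id, county_id, locality_id,
                 data_source_id, update_status, request_id=None):
    """Queue an update for every followed geo matching the event's
    SCL, SC, or S combination."""
    candidates = [
        (state_id, county_id, locality_id),  # locality-level follow
        (state_id, county_id, None),         # county-level follow
        (state_id, None, None),              # state-level follow
    ]
    for s, c, l in candidates:
        row = conn.execute(
            "SELECT id FROM followed_geos "
            "WHERE state_id = ? AND county_id IS ? AND locality_id IS ?",
            (s, c, l),
        ).fetchone()
        if row:
            conn.execute(
                "INSERT INTO queued_updates "
                "(fg_id, data_source_id, request_id, update_status) "
                "VALUES (?, ?, ?, ?)",
                (row[0], data_source_id, request_id, update_status),
            )
```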

Then, during some periodic interval, we collect all queued updates, send them to the users who are following the requisite followed_geos, and wipe the queued_updates table.

  • We could provide all updates, or simply prioritize the most pressing update for a given data source / request (D-R) combination
    • Or maybe group multiple D-R updates under a single D update, if it's a data source with a lot of information associated with it.

How often will notifications occur?

  • Monthly.
  • If we want to be fancy, we could add adjustable update times, but that might not be a v2 thing.

Edge cases

  • We'll need to account for a user unsubscribing from a search follow; we'll need an endpoint to remove it (i.e., a DELETE to complement the POST of /followed-searches)
    • We'll then need to remove their entry in link_users_followed_geos. If they're the only user following a geo, we'll want to remove followed_geos and any associated queued_updates

Endpoint design

For all of these, assume that you must be either the user being queried or an admin to have the requisite permissions to perform these actions. (A minimal route sketch follows the list.)

  • GET /followed-searches/{user_id} will return all followed searches for that user.
  • POST /followed-searches/{user_id} will add a followed search for that user and return a followed_search_id. Inputs could be provided in JSON or in query args; unsure at this moment.
  • GET /followed-searches/{user_id}/{followed_search_id} will get more detailed information for that specific followed search.
  • DELETE /followed-searches/{user_id}/{followed_search_id} will unfollow that search for that user.
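
To make the shape of these routes concrete, here's a minimal Flask sketch with an in-memory store standing in for the tables above. Handler names and the auth assumption are placeholders, not the actual app's code:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
FOLLOWS = {}  # user_id -> {followed_search_id: search params}; DB stand-in

@app.get("/followed-searches/<int:user_id>")
def list_follows(user_id):
    # Assumes middleware has verified the caller is user_id or an admin.
    return jsonify(FOLLOWS.get(user_id, {}))

@app.post("/followed-searches/<int:user_id>")
def add_follow(user_id):
    params = request.get_json()  # e.g. {"state": ..., "county": ..., "locality": ...}
    follows = FOLLOWS.setdefault(user_id, {})
    fs_id = max(follows, default=0) + 1
    follows[fs_id] = params
    return jsonify({"followed_search_id": fs_id}), 201

@app.get("/followed-searches/<int:user_id>/<int:fs_id>")
def get_follow(user_id, fs_id):
    return jsonify(FOLLOWS.get(user_id, {}).get(fs_id))

@app.delete("/followed-searches/<int:user_id>/<int:fs_id>")
def remove_follow(user_id, fs_id):
    FOLLOWS.get(user_id, {}).pop(fs_id, None)
    return "", 204
```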

Issues

  • We don't have a unique table for localities, just the agencies we derive the localities from, even though searches and other features treat a locality as a valid parameter. That makes for some convoluted logic, and it'll be harder to build followed_geos unless localities become their own table. We can populate that table as we add agencies with the relevant localities; we don't need to add every locality in the United States.
  • Minor point: We should really rename state_names to just states or us_states — it's a little confusing.
  • Do we also notify if a data source is removed?
  • Any status change, or only some?
  • What do I do if a particular combination has multiple things occur to it prior to the notification time?
    • What if a data source is added and a data request is associated with it?
    • What if the above, but also the data request is closed?
    • Less plausible scenario, but: What if the request goes through multiple status changes in a short span of time?
  • Is it only data_sources that are being returned, and hence the only thing to track?

josh-chamberlain commented Sep 13, 2024

@maxachis thank you for thinking through all these details.

Overview

This makes sense. Originally I'd thought of the record category as part of the search, but that's too granular, and this is a purer idea. People can follow a geographic object, and it's that simple!

Notifying events

I think this is simpler to communicate and execute:

  • when a followed geo contains approved data sources that were not there before
  • when new requests are ready to start
  • when requests are completed

In the notification we'd list each followed search, its sources, and its new/completed requests.

Database design

This is making sense to me. I'll emphasize: these geographic locations ("SCLs") are discrete objects with foreign keys, as you say. I like that, the way you have it designed, it would be pretty easy to answer the question "how many people are following this search, anyway?"

  • that said, for each followed_geo, could we just have an array of user_ids which have followed that geo, rather than a link table?

For queued_updates:

  • update_status feels like a bit of a misnomer; the way you wrote it seems like update_reason or something like that?
  • would this be simpler if we simply always did updates on the first of the month? I'm thinking it could obviate the need to spell out in the queue what each update will be in advance, and just iterate through them at update time and use the logic in the code to determine what the notification should look like on the fly.
  • this might also help cases where many updates stack up on one search and need to be compressed into a sensible notification; we determine the structure of the notification just-in-time (see the sketch below).
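
A sketch of that just-in-time grouping, reusing the queued_updates table sketched earlier (purely illustrative; the rendering step is left out):

```python
from collections import defaultdict

def build_digest(conn):
    """Collapse whatever has accumulated in queued_updates into one bucket
    per followed geo and update status, deciding the notification's
    structure only at send time."""
    sections = defaultdict(lambda: defaultdict(set))
    for fg_id, source_id, request_id, status in conn.execute(
        "SELECT fg_id, data_source_id, request_id, update_status "
        "FROM queued_updates"
    ):
        sections[fg_id][status].add((source_id, request_id))
    return sections  # caller renders each geo's buckets into an email
```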

Edge cases

Thanks for remembering about un-following. I added a note to #274 for when we get there.

Endpoint design

I defer to your wisdom.

Issues

  • We don't have a unique table for localities, just the agencies we derive the localities from, even though searches and other features treat a locality as a valid parameter. That makes for some convoluted logic, and it'll be harder to build followed_geos unless localities become their own table. We can populate that table as we add agencies with the relevant localities; we don't need to add every locality in the United States.
    • It's true that we don't need to, but the number of localities is roughly the same as the number of agencies. It's a little odd that we allow people to search based on geographic objects, when under the hood they're just agencies, but...I can't find a practical issue with it.
  • Minor point: We should really rename state_names to just states or us_states — it's a little confusing.
    • Makes sense to me. In Airtable, we only have state_iso and never even use the full name.
  • Do we also notify if a data source is removed?
    • I don't think so.
  • Any status change, or only some?
    • I believe I clarified this above; only when they hit "completed" or are newly created with at least the "ready to start" status. This is where status enumeration #425 might be useful, because we could just say things like "status > 10".
  • What do I do if a particular combination has multiple things occur to it prior to the notification time?
    • This is likely—the people subscribing to notifications are often going to be the ones in places with high activity. I alluded to it above somewhat, but I think we should just make a big list. If the lists get ludicrously long, that would be an OK problem to have. Is the example below helpful? I'm not sure I'm understanding your question quite right.
  • What if a data source is added and a data request is associated with it?
    • I don't think we should worry about the association thing.
  • What if the above, but also the data request is closed?
    • Same
  • Less plausible scenario, but: What if the request goes through multiple status changes in a short span of time?
    • If a request is both opened and completed, just show it in the completed list
  • Is it only data_sources that are being returned, and hence the only thing to track?
    • I think I answered this with my other comments, but LMK if that is not the case!

Example notification

There have been updates to places you follow on PDAP:

# Allegheny County, PA
## New data sources:
- hyperlinked data source name
- etc

## New requests:
- hyperlinked request name

## Completed requests:
- hyperlinked request name

# Albuquerque, NM
## New data sources:
- etc

maxachis commented Sep 13, 2024

@josh-chamberlain Here are my follow-ups! Assume anything I don't respond to is "Yes, I align with your thinking on this and have no concerns!"

Array of User IDs or Link Table?

that said, for each followed_geo, could we just have an array of user_ids which have followed that geo, rather than a link table?

Yes! But there are a few implementation concerns that point me away from that:

  1. With arrays, the entire array has to be rewritten every time it's modified, which could become prohibitive as more users follow a geo.
  2. If multiple users follow something at the same time, we could run into write conflicts; with a link table, multiple users can be added concurrently with much lower risk of conflict.
  3. If we want to provide users with a list of what geos they follow, it'll be easier with a link table (see the query sketch below).

So I think the link table makes more sense.
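
For point 3, the link table makes "what does this user follow?" a simple join. A sketch against the schema from earlier in the thread (function name hypothetical):

```python
def geos_followed_by(conn, user_id):
    """Return the (state_id, county_id, locality_id) combinations a user
    follows, via the link table."""
    return conn.execute(
        "SELECT fg.state_id, fg.county_id, fg.locality_id "
        "FROM link_users_followed_geos AS l "
        "JOIN followed_geos AS fg ON fg.id = l.fg_id "
        "WHERE l.user_id = ?",
        (user_id,),
    ).fetchall()
```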

Localities as a separate table?

It's true that we don't need to, but the number of localities is roughly the same as the number of agencies. It's a little odd that we allow people to search based on geographic objects, when under the hood they're just agencies, but...I can't find a practical issue with it.

I ran the numbers on how many distinct rows we have in agencies versus how many distinct county_fips/state_iso/municipality combinations, and came to:

  • 23,448 agencies
  • 13,656 distinct county_fips, state_iso, municipality combinations

So I think there is an argument for breaking it out into a separate table from the standpoint of normalization. I'd also argue that the queries I have to perform to get the full unique geo would be easier; as is, I'd have to first get all the distinct municipality combinations and then match on them.
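
For reference, a query along these lines would reproduce those counts (column names are assumptions based on the discussion):

```python
def count_distinct_geos(conn):
    """Compare total agencies against distinct geo combinations."""
    agencies = conn.execute("SELECT COUNT(*) FROM agencies").fetchone()[0]
    geos = conn.execute(
        "SELECT COUNT(*) FROM (SELECT DISTINCT county_fips, state_iso, "
        "municipality FROM agencies)"
    ).fetchone()[0]
    return agencies, geos
```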

Always doing updates on the first of the month?

would this be simpler if we simply always did updates on the first of the month? I'm thinking it could obviate the need to spell out in the queue what each update will be in advance, and just iterate through them at update time and use the logic in the code to determine what the notification should look like on the fly.

From a certain standpoint, it could indeed be simpler. But there are caveats:

  1. We now either have to keep updates we plan to do in some sort of queue, which will require its own set of infrastructure. Or we have to go through and examine everything which was updated in the time frame and compile it all together.
  2. If we're holding off on updating our data until we reach some specific point in time, that means we now have a delay in the updates to the web app so as to simplify the notification logic, which doesn't seem quite kosher.
  3. It limits our ability to modify the frequency at which people get notifications (especially if we want people to be able to modify that for themselves).
  4. If we later want to move to the form of my current proposal, there would not be a lot of infrastructure overlap.
  5. Rather than dispersing the computational overhead more evenly as everything is being added, we instead have a larger burst at one point in time, where we are performing the updates, organizing the notifications, and sending them out.
  6. We don't know what user expectations are for how timely they want this data to be updated, and the frontload option limits our ability to adjust.
  7. Handling errors becomes more difficult, because all of the logic occurs more infrequently. If something goes wrong with the original approach, we would have time to identify and address more of those possible issues before things get sent out; by contrast, if we frontload the notifications all at once, the surface area for bugs is compressed into a considerably narrower time frame. And if we have to restart, we have to repeat all of that logic at once. That could all impose more substantial delays on end users.

So I think it could be simpler, but I do think the downsides outweigh the upsides.

josh-chamberlain commented

@maxachis the first two categories seem aligned to me. And now we're just left with questions about timing and queues.

  1. We now either have to keep updates we plan to do in some sort of queue, which will require its own set of infrastructure. Or we have to go through and examine everything which was updated in the time frame and compile it all together.
    • I'm confused—I thought queued_updates was basically this concept. In any case, I was envisioning that we would examine and check for notification-worthy events monthly. Stacking them into a queue as they occur is probably workable.
  2. If we're holding off on updating our data until we reach some specific point in time, that means we now have a delay in the updates to the web app so as to simplify the notification logic, which doesn't seem quite kosher.
    • I think I'm confused about updates vs notifications, which I thought were the same. I agree that we shouldn't make our database slower to update.
  3. It limits our ability to modify the frequency at which people get notifications (especially if we want people to be able to modify that for themselves).
    • Let's say we go check our database monthly for notification-worthy events. Checking it more or less often for some searches doesn't seem like an issue to me, though that's not a planned feature.
  4. If we later want to move to the form of my current proposal, there would not be a lot of infrastructure overlap.
  5. Rather than dispersing the computational overhead more evenly as everything is being added, we instead have a larger burst at one point in time, where we are performing the updates, organizing the notifications, and sending them out.
    • I don't get the distinction between performing the updates and organizing the notifications. I think that notifications for followed searches should be doable with read-only access to the database, so I'm not sure what needs to be updated.
  6. We don't know what user expectations are for how timely they want this data to be updated, and the frontload option limits our ability to adjust.
    • again, I think I'm confused about data to be updated vs notification.
  7. Handling errors becomes more difficult, because all of the logic occurs more infrequently. If something goes wrong with the original approach, we would have time to identify and address more of those possible issues before things get sent out; by contrast, if we frontload the notifications all at once, the surface area for bugs is compressed into a considerably narrower time frame. And if we have to restart, we have to repeat all of that logic at once. That could all impose more substantial delays on end users.

I think there's some basic/central premise or terminology gap that we should resolve synchronously! When you get this, let's coordinate something in Discord. It shouldn't take too long, I don't think.

maxachis commented Sep 26, 2024

So I've managed to resolve the above wordy diatribes with @josh-chamberlain to my satisfaction, but I would like to get @joshuagraber's input on three remaining components:

Endpoint Nomenclature

For designing the endpoints, does it make more sense for their structure to be:

  1. /followed-searches/{user-id}
    OR
  2. /user/followed-searches, where the user id is derived from the authentication header.
    OR
  3. A combination of both
    OR
  4. Some other option

/user/followed-searches aligns more with our existing idea of /user/data-requests which we've discussed in #446. But in that situation, that's meant only as a way to GET results, rather than add new ones or update existing ones for a particular user. And if we do the "user implicit in header" option, that might make admin modifications of these associations more difficult.

My thinking is that for getting information relevant to a particular user, we have /user/followed-searches. For getting followed searches in general, as well as creating, updating, or removing followed searches, we use /followed-searches/.

Include a PUT request?

The way followed searches are currently meant to work, each is simply a location. So from my standpoint, it makes sense to only add or remove follows, rather than edit them; there's simply no independent information to modify.

Inputs in POST request

What should the inputs of the POST request be? In the database, we now have ids for every distinct location, which means we could:

  1. Create a locations endpoint which GETs the id based on certain queries; that location id can then be provided in a POST request, or referenced in a DELETE request (as opposed to a distinct followed-search id being referenced in a DELETE request).
  2. Do what we do with /search, where we input the State, the County Name, and the Locality Name, and internally find and associate a location by its id, but don't expose that id to the user.

I favor 2, simply because it aligns with what we're already doing with /search, and I suspect @joshuagraber will as well, but I want to put forth both options.
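
Under option 2, the request body might look like this (field names assumed to mirror /search's inputs):

```python
# Hypothetical POST body under option 2, mirroring /search's parameters.
# The server resolves these to an internal location id, which is never
# exposed to the client.
payload = {
    "state": "Pennsylvania",
    "county": "Allegheny",
    "locality": "Pittsburgh",  # optional; omit to follow the whole county
}
```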

joshuagraber commented

  1. Endpoint nomenclature: I would prefer 2.
  2. Include a PUT req? Not necessary from my perspective, and we can always add later if it becomes necessary for whatever reason.
  3. Inputs in POST req: Indeed, I prefer 2.

maxachis commented

@josh-chamberlain A few questions on notifications for data requests:

Should data requests get a specific location as well?

Currently, in the data_requests table, we have a location_described_submitted column. This would be hard to search on and associate with specific locational searches. Do we want to have it so that data requests have a location id as well, and users submitting a data request have to pick a location via the typeahead suggestion?

Should searches return data requests as well?

Currently, searches only return data sources (and relevant agencies); they do not return data requests. If we want our followed search notifications to return data requests as well, should the search functionality be updated in turn?

josh-chamberlain commented

@maxachis

  • specific location: yes, I think this makes sense.
  • in searches: TBD. I think we can still make it two endpoint calls on the front end, so that we can return sources first as a priority, then requests lazily. Requests didn't make it into the new wireframe for search results, so I am still talking to Vikram about a home for them. Either way, I think a separate endpoint is OK.

josh-chamberlain commented Oct 27, 2024

@maxachis I updated the issue to include #463 if you think that makes sense. Feel free to edit or reorg, but I think that's a key piece.

joshuagraber commented

Just chiming in here to say that the basic work to follow and un-follow searches from the client web app is completed ✅
