This repository has been archived by the owner on Feb 7, 2019. It is now read-only.

[WIP] Parse URLs when filtering #660

Closed

Conversation

@jimporter (Contributor) commented Mar 14, 2018

This is still very WIP, but it should give an idea of how I plan to pass the data around. The general idea is that when the popup opens, we fill in the protocol, host, and port (if the port is default, it's implicitly defined by the protocol), which then gets passed to the filter code.

The filter code rebuilds a URL object and tries to match it to URL objects for the origins of each item. Origins are allowed to omit their protocol, in which case we treat them as HTTPS. If the search query omits its protocol (this will only happen when the user is typing in the filter themselves), we just do a string comparison. This way, you only have to type part of a URL (e.g. "goo" for "https://www.google.com") to filter the results down when you're searching manually.
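
Roughly, here's a sketch of that matching logic (simplified for illustration; the real code in this PR differs in the details):

```js
// Simplified sketch only; the actual PR code differs in the details.
function tryMakeURL(string, defaultProtocol = null) {
  const candidates = [string];
  if (defaultProtocol) {
    candidates.push(`${defaultProtocol}//${string}`);
  }
  for (const candidate of candidates) {
    try {
      return new URL(candidate);
    } catch (e) {
      // not parseable as-is; try the next candidate
    }
  }
  return null;
}

function matchesOrigin(query, itemOrigin) {
  const queryURL = tryMakeURL(query);
  if (!queryURL) {
    // The query isn't a parseable URL (e.g. no protocol), so the user is
    // typing it by hand: fall back to a substring match
    // ("goo" matches "https://www.google.com").
    return itemOrigin.includes(query);
  }
  // Saved origins may omit their protocol; treat those as HTTPS.
  const originURL = tryMakeURL(itemOrigin, "https:");
  return originURL !== null &&
         queryURL.protocol === originURL.protocol &&
         queryURL.host === originURL.host;
}
```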

Up next, I plan to make the matching on valid URLs work closer to our plan.

@jimporter requested a review from linuxwolf March 14, 2018 19:31
@ghost assigned jimporter Mar 14, 2018
@ghost added the "in progress (We are actively working on it.)" label Mar 14, 2018

@linuxwolf (Contributor) left a comment

As this progresses, I hope the comments below are useful.

Also, we should consider using psl or another maintained and tested module. I don't think it's worth our time to write our own public suffix implementation; the (meta) rules have subtle gotchas.
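
For example, psl's README shows calls along these lines (worth double-checking against the current version before relying on it):

```js
// psl knows the Public Suffix List, so registrable-domain extraction
// doesn't have to be hand-rolled.
import psl from "psl";

psl.get("www.google.com");    // "google.com"
psl.get("a.b.example.co.uk"); // "example.co.uk" (multi-label public suffix)
psl.parse("www.google.com");  // { ..., domain: "google.com", subdomain: "www", ... }
psl.isValid("google.com");    // true
```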

  return filter.split(/\s+/);
}

function tryMakeURL(string, defaultProtocol = null) {

Contributor:

For what the user types, or what the extension grabs from the activeTab (let's call this target), using exactly what is provided is appropriate. For what we get from a saved entry in Lockbox (even if the entry had "http:") (let's call this allowed), I am far less sure.

Matching on "http:" carries a lot of risk. It's ok for "http://localhost" (or "http://127.0.0.1", or "http://[::1]"); this is considered a secure origin (at least for now). For other hosts/ports, I worry we are potentially exposing users to exfiltration. Also, matching on other schemes carries risk, too (e.g., "javascript:").

This is probably a long way of saying the data massaging here needs to be more robust; my personal preference is sooner rather than later, but it'll need to happen.

[I do see the valid-protocol constraints, but they live far from the rest of this constraint logic.]
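
As a rough sketch of the kind of constraint I have in mind (illustrative, not a concrete proposal):

```js
// Sketch: allow https: everywhere, http: only for loopback hosts (which
// browsers treat as secure contexts), and nothing else (javascript:, data:, ...).
function isAcceptableOrigin(url) {
  if (url.protocol === "https:") {
    return true;
  }
  if (url.protocol === "http:") {
    return ["localhost", "127.0.0.1", "[::1]"].includes(url.hostname);
  }
  return false;
}
```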

Contributor Author:

Where would we want this validation to happen? If we want to be strict about what's allowed in the datastore, should we just throw an exception when trying to save an entry if the user specifies a "bad" origin? At the moment, this code is assuming that the origin could be literally anything, since we have no constraints on what a user enters into the datastore. However, we could mandate that origins take a particular form, and possibly even normalize them when saving them (e.g. www.google.com is normalized to https://www.google.com). I'd actually prefer this to what I have now, since it's one less place where I have to make guesses about what the user meant.
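
For example, save-time normalization could look something like this (purely illustrative; the exact rules and error handling would need discussion):

```js
// Hypothetical save-time normalization (not in this PR): default bare hosts
// to HTTPS and reject anything that isn't http(s).
function normalizeOrigin(input) {
  // Prepend https:// only when no scheme is present at all.
  // (Inputs like "host:port" with no scheme would need extra handling.)
  const candidate = /^[a-z][a-z0-9+.-]*:/i.test(input) ? input : `https://${input}`;
  const url = new URL(candidate); // throws if it still isn't a valid URL
  if (!["http:", "https:"].includes(url.protocol)) {
    throw new Error(`unsupported protocol: ${url.protocol}`);
  }
  return `${url.protocol}//${url.host}`;
}

normalizeOrigin("www.google.com");          // "https://www.google.com"
normalizeOrigin("http://example.com:8080"); // "http://example.com:8080"
```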

For HTTP, how do we balance the fact that some sites just don't use HTTPS (and users would like to log into them) with the potential for some kind of MITM attack? Note that the code here only applies to the doorhanger, which is in chrome, not content. When we have a matcher used by (semi-) autofill, we may well choose to be even more restrictive. For example, I'd be ok with a plan where we say, "Autofill only works on HTTPS, but you can get pre-filtered results in the doorhanger even on HTTP".

I also think it would make sense to provide more restrictions here (e.g. constraining protocols), but we'd probably also have the constraints where they are now (or in a Redux action creator) since I don't think we should even fill in the search box on URLs with "bad" protocols. That said, a lot of this stuff will probably move to a common file somewhere once it's fleshed out a bit better.

Contributor:

> Where would we want this validation to happen? If we want to be strict about what's allowed in the datastore, should we just throw an exception when trying to save an entry if the user specifies a "bad" origin? At the moment, this code is assuming that the origin could be literally anything, since we have no constraints on what a user enters into the datastore. However, we could mandate that origins take a particular form, and possibly even normalize them when saving them (e.g. www.google.com is normalized to https://www.google.com). I'd actually prefer this to what I have now, since it's one less place where I have to make guesses about what the user meant.

The datastore should have been more restrictive, and we can (should) make it more restrictive now (ish). We still have to deal with existing data, though.

> For HTTP, how do we balance the fact that some sites just don't use HTTPS (and users would like to log into them) with the potential for some kind of MITM attack? Note that the code here only applies to the doorhanger, which is in chrome, not content. When we have a matcher used by (semi-) autofill, we may well choose to be even more restrictive. For example, I'd be ok with a plan where we say, "Autofill only works on HTTPS, but you can get pre-filtered results in the doorhanger even on HTTP".

Since this code only applies to the doorhanger's filter, I suppose for now it's ok to match on http: also. However, we MUST NOT extend that to our upcoming fill work unless security review finds it acceptable.

  for (let i of values) {
-    if (match(filterElement, i)) {
+    if (matcher(filterElement, i)) {
      return true;
    }
  }
  return false;
}

export function filterItem(filter, item) {

Contributor:

Looking at these changes, my sense is that filtering needs to become more differentiated -- instead of relying on a string that is (potentially mis)interpreted in a few ways, the filter could be an object that better describes the filter's intent.

Contributor Author:

I'm not sure what this means. What part of filtering needs to be converted from a string to an object? The visible representation of a filter should be a string, since it's what goes into the text field, and one of the main reasons for having filters in the first place is as something a user could type.

If you're suggesting that the automatically-generated filter should only find results in the origin field of the item, I agree and that's #627. I'd like it to be something like Google's mini-format for search queries, e.g. url:https://www.example.com (there'd be other prefixes too, like title:, username:, and maybe one day, tag:).
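
Something along these lines, purely as a sketch (the real prefixes and semantics would be decided in #627):

```js
// Illustrative only; the real prefixes and semantics belong to #627.
function parseFilter(filter) {
  const parsed = { terms: [] };
  for (const token of filter.trim().split(/\s+/)) {
    if (!token) {
      continue;
    }
    const m = /^(url|title|username):(.+)$/.exec(token);
    if (m) {
      parsed[m[1]] = m[2];
    } else {
      parsed.terms.push(token);
    }
  }
  return parsed;
}

parseFilter("url:https://www.example.com username:alice");
// => { terms: [], url: "https://www.example.com", username: "alice" }
```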

@@ -26,7 +26,7 @@ browser.tabs.query({ active: true, currentWindow: true }).then((result) => {
  const validProtocols = ["http:", "https:"];
  const url = new URL(result[0].url);
  if (validProtocols.includes(url.protocol)) {
-    store.dispatch(filterItems(url.hostname));
+    store.dispatch(filterItems(url.protocol + "//" + url.host));

Contributor:

I would have expected url to be passed in directly here, rather than breaking it up. This feels like the constraints are spread around the codebase; that will make it harder for us to understand the implications and limitations.

@jimporter (Contributor Author), Mar 15, 2018:

We can't (well, we shouldn't) pass the URL object, since that goes right into the Redux state, and Redux is supposed to be just JSON (as I understand it, this is very helpful for Redux devtools). We could make a new Redux action creator, but it'd do pretty much exactly what this does before getting stored in Redux.
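
i.e. the action creator would just do the serialization up front; something like this (name invented for illustration):

```js
// Hypothetical wrapper; only a plain string ever reaches the Redux store.
import { filterItems } from "./actions"; // assumed import path

export function filterItemsFromURL(url) {
  return filterItems(`${url.protocol}//${url.host}`);
}
```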

Contributor:

That is a reasonable argument against passing the URL directly.

However, this still needs to be hostname and not host. It needs to keep the optional port, or it risks false positives. For the vast majority of people for the vast majority of their entries, hostname and host are identical, so I think false negatives by real users will be rare.

I can see an argument for treating the lack of an explicit port as matching any port, but I'd like to get sign off from security-review.

@jimporter (Contributor Author) commented Mar 15, 2018

@linuxwolf Good call on using psl. I figured there had to be something like that, but I just wanted to sketch out a simple implementation of the URL checking, focusing on how the information passes from the index on page load, through the Redux store, and to the filtered output in our React UI. Here's the stuff I've tried to address in this WIP:

  1. When opening the doorhanger, we initialize the contents of the search box to be something that a) has all the info we need, and b) makes sense to the user (i.e. it should be possible for a user to perform a manual search exactly like the automatic search). To me, this means "initialize the search field to protocol://host[:port]", since that's something a user could enter (see the short example after this list). In particular, I'd like to avoid any internal "magic" (e.g. a hidden flag that tells us the filter is autogenerated and should be handled differently), since I think that has a lot of potential for bugs and/or user confusion.
  2. If the search field doesn't have a protocol, we know that it must be user-entered, so we do a regular ol' substring match. This is so that as the user types, the list of displayed entries gradually gets smaller. If we did a strict URL match, the user would see 0 results until they typed the whole URL. That'd suck, since you wouldn't be able to just type "goo" to match "https://www.google.com".
  3. Since our Redux representation of the filter is just a string (that's what we'd get if a user typed it in), we need to parse it into some kind of fancy object just before performing the filtering. That lets us perform a richer comparison than just string equality. For now, that just means using the URL constructor, but later, it could mean using psl or a combination of the two.
  4. Since we have no validation on what gets stored in the origin field in lockbox-datastore, I made the temporary decision that we'd just do our best to parse it as a URL (defaulting to HTTPS if there's no protocol). However, we could add some requirements about what can be stored in the origin field, which would eliminate the need for this guesswork.
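
To make (1) concrete, the initialized filter string is just the active tab's URL reduced to protocol://host[:port] (this is standard URL behavior):

```js
// What the initial filter string looks like for a couple of tab URLs.
const a = new URL("https://accounts.example.com:8443/login?next=/");
`${a.protocol}//${a.host}`;   // "https://accounts.example.com:8443"

const b = new URL("http://example.com/some/page");
`${b.protocol}//${b.host}`;   // "http://example.com" (default port stays implicit)
```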

Next up, I'd like to break some of these bits out into a generic "URL Utilities" file, since much of this can be used by autofill as well. Both src/webextension/list/filter.js and src/webextension/list/popup/index.js could then use this, so there's less URL magic distributed randomly in the tree. Then I can start on a more-complete implementation of strictly matching origins (i.e. implement the Gist you wrote up).

    return match(filterElement.string, value.string);
  }
  return (filterElement.url.protocol === value.url.protocol &&
          filterElement.url.host === value.url.host);

Contributor:

looking at this more, I'm less and less comfortable with the missing port.

@jimporter (Contributor Author), Mar 16, 2018:

URL.host includes the port if it's not the default port for that protocol. So:

  • For https://www.google.com, URL.host is www.google.com
  • For https://www.google.com:8000, URL.host is www.google.com:8000

I plan to make this a lot more explicit when I build some common functions to use here (i.e. I'll write documentation).
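
Spelled out with the related URL properties (standard URL behavior):

```js
const a = new URL("https://www.google.com");
a.host;     // "www.google.com"
a.hostname; // "www.google.com"
a.port;     // ""

const b = new URL("https://www.google.com:8000");
b.host;     // "www.google.com:8000"
b.hostname; // "www.google.com"
b.port;     // "8000"
```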

Contributor:

I apologize for that; I keep having them swapped in my head.

Contributor Author:

No worries; this is why we need documentation (and instinctively, hostname is longer than host, so the value seems like it should be longer too!).

@devinreams closed this Apr 12, 2018
@ghost removed the "in progress (We are actively working on it.)" label Apr 12, 2018