Draft: Specification for the Community History Problem (MVP) #162

Open
wants to merge 2 commits into base: master

0x-r4bbit
Member

No description provided.

Member Author

@0x-r4bbit 0x-r4bbit left a comment

@staheri14 @oskarth @iurimatias @John-44

Hey everyone,

here's a first draft of the specification for the community history problem.
Please review this and leave feedback on whether this goes in the correct direction.

I've also added some inline comments for clarification and questions.

3. A special type of channel for distributing magnet links ([Magnet URI scheme](https://en.wikipedia.org/wiki/Magnet_URI_scheme), [Extensions for Peers to Send Metadata Files](https://www.bittorrent.org/beps/bep_0009.html)) is created
4. Community owner invites members and creates additional channels
5. Community owner node receives messages and stores them into local database
6. After 7 days, the community owner node exports and compresses last 7 days worth of messages from database and creates a magnet link from that data via torrent client
Member Author

@John-44 I think your initial proposal said the owner would do this after 14 days initially and then every 7 days after that. I wasn't sure why that was, so I went with 7 days right away. Happy to change this.

@PascalPrecht I meant 7 days initially, sorry for not being clear on this. So 7 days right away is correct :-)

@John-44 John-44 Dec 13, 2021

@PascalPrecht isn't the magnet link created in step 8? If so, shouldn't this sentence be updated to say:

"6. After 7 days, the community owner node exports and compresses last 7 days worth of messages from database into a binary blob"?

Then in step 7 this binary blob is prepended to the previous binary blob (if any), and then the magnet link is created in step 8?

Member Author

Generally I haven't explicitly said "blob" everywhere, because I think we can safely assume that computers send blobs at the end of the day (we also don't mention blobs in any other spec when talking about data).

isn't the magnet link created in step 8?

As explicitly stated later in the document, there are actually multiple magnet links being created:

  1. One for each archive (either every 7 days or for whatever time range messages were missed)
  2. One for every time a new message archive index is published (which includes the magnet link to the previously created archive plus magnet links for all archives prior to that).

See the "Bundling history archives into archive indices" section for a more detailed explanation.

Also, happy to add visuals to make this more clear.

I'd introduce BitTorrent client somewhere maybe in the terminology part


## Storing live messages

Community owner nodes MUST store live messages as [14/WAKU2-MESSAGE](https://rfc.vac.dev/spec/14/). This is required to provide confidentiality, authenticity, and integrity of message data distributed via the BitTorrent layer, and later validated by Status nodes when they unpack message history archives.
Member Author

I've specified 14/WAKU2-MESSAGE here, but this is probably not accurate, as Status currently uses Waku V1. However, it seems Status can talk to V2 store nodes, so it apparently also understands V2 messages.

Let me know if this needs to change.

@PascalPrecht is this how Status-Go stores messages at the moment? What metadata does Status-go store for each message in its local database today?

Member Author

Status-go only stores ApplicationMetadataMessage (as specified in 6/PAYLOAD), which is also one of the reasons why it's important for the community owner node to store the full WakuMessage in addition to that (otherwise we lose message integrity).

More on that in the "Storing live messages" section.

Contributor

Thanks for making this explicit. We should use 14/WAKU2-MESSAGE if possible for reasons mentioned.

Depending on how timestamp is set, it may also be possible to reconstruct from 7/WAKU-DATA https://rfc.vac.dev/spec/7/ but this can get iffy, also with hashes etc... I defer to @staheri14 on this

(edit I see this is elaborated on below)

1. The community owner node attempts to create an archive periodically for the past seven days (including the current day). In this case, the `timestamp` has to lie between the day the last archive was created and the current day.
2. The community owner node has been offline and attempts to create an archive for all the live messages it has missed since it went offline. In this case, the `timestamp` has to lie between the day the latest message was received and the current day.
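The two cases above differ only in where the time range starts. A minimal sketch, assuming timestamps are Unix seconds and that "day" means a UTC calendar day; the function names are hypothetical and not part of the spec:

```python
from typing import Tuple

DAY = 24 * 60 * 60  # seconds per day


def start_of_day(ts: int) -> int:
    """Round a Unix timestamp (seconds) down to 00:00 UTC of its day."""
    return ts - (ts % DAY)


def archive_window(range_start: int, now: int) -> Tuple[int, int]:
    """Return the inclusive [from, to] window for message timestamps.

    range_start is the time the last archive was created (case 1) or the
    time the latest message was received (case 2). Which of the two the
    node has recorded locally is an assumption of this sketch.
    """
    return start_of_day(range_start), start_of_day(now) + DAY - 1


def belongs_in_archive(timestamp: int, window: Tuple[int, int]) -> bool:
    """True if a message timestamp falls inside the archive window."""
    return window[0] <= timestamp <= window[1]
```

Aligning both ends to day boundaries keeps the current day included, as the first case requires.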

Exported messages MUST be restored as [14/WAKU2-MESSAGE](https://rfc.vac.dev/spec/14/) for bundling. Waku messages that have been exported for bundling can now be removed from the community owner node's database (community owner nodes still maintain a database of application messages).
Member Author

Same here.

ahh, from this comment are you saying that today Status Desktop already stores all messages received twice, once in status-go's local message database and it also keeps the received messages in a separate 'node' database?

We definitely do not want to delete messages from the owner node's status-go local message database

Member Author

Re deleting messages, please see this previous comment of mine
https://github.com/status-im/specs/pull/162/files#r770793021

Member Author

Added a comment that no WakuMessage younger than 30 days will be removed


For every created `WakuMessageArchive`, there MUST be a `WakuMessageArchiveMetadata` entry in the index map.

The community owner node MUST derive a magnet link from the newly created `WakuMessageArchiveIndex` so it can be distributed to community member nodes.
Member Author

This magnet link is then later sent to the "special" channel so members can fetch the index and figure out which archives to download.

Theoretically, it's not necessarily required to distribute the index as a magnet link first. We might as well send the index as message directly to the status network.

This would save one roundtrip for member nodes to get the index from the torrent network, but would put a bit more bandwidth pressure on the status/waku network.

Thoughts welcome!

ahh, I think you've just seen the same thing I have - that we don't need to worry about a WakuMessageArchiveIndex, and just need to send the most recent magnet link. Unless there is something I'm missing of course! ;-)

Following our last convo, I think it would be good to persist the WakuMessageArchiveIndex in the long-term storage layer i.e., BitTorrent, otherwise, there is a possibility of losing the WakuMessageArchiveIndex if not properly persisted by Status nodes locally


```
{community_id}-archives
```
Member Author

I don't know yet what exactly topics look like, so this might not make a lot of sense.

Feel free to add suggestions on what those topics should look like!

To be more specific, content Topics follow this format https://rfc.vac.dev/spec/23/#content-topics
/{application-name}/{version-of-the-application}/{content-topic-name}/{encoding}
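To make the format above concrete, here is a small sketch that assembles a content topic for the archives channel. The values `status`, `1` and `proto` are placeholders chosen for illustration, and both function names are hypothetical; only the four-part layout comes from the linked spec.

```python
def content_topic(app: str, version: str, name: str, encoding: str) -> str:
    """Build /{application-name}/{version}/{content-topic-name}/{encoding}."""
    parts = (app, version, name, encoding)
    if any(not p or "/" in p for p in parts):
        raise ValueError("parts must be non-empty and must not contain '/'")
    return "/" + "/".join(parts)


def archives_topic(community_id: str) -> str:
    """Content topic for a community's archive channel.

    "status", "1" and "proto" are assumed values, not taken from the spec.
    """
    return content_topic("status", "1", f"{community_id}-archives", "proto")
```

With these placeholder values, `archives_topic("0xabc")` yields `/status/1/0xabc-archives/proto`.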

Generally, fetching message archives is a three step process:

1. Receive message archive index signal, download index, then determine which message archives to download
3. Download individual archives
Member Author

That's 2 steps. Needs update.

I don't think we want individual archives. I think we want all of a community's history to be in a single binary blob, with a separator string inserted into the blob at the point where each new blob is prepended to the preexisting blob. The separator enables easy splitting of the blob, which allows partial download of a community's history and/or keeping the storage used by a community within a set bound. That is for phase 2 of this project; partial downloads and fixed storage use per community aren't needed for the MVP.

Member Author

How do we prevent nodes from redownloading all the data they have downloaded already, if they don't know at this point what that blob looks like? (They need to download it first, and only then can they perform that splitting.)

Using the archive index, I've tried to come up with a solution to account for that. It also provides flexibility for nodes to selectively decide for what date range they want to fetch archives (something the blob-only solution won't do, unless I'm missing something here).

When message archives are fetched, community member nodes MUST unwrap the resulting `WakuMessage` instances into `ApplicationMetadataMessage` instances and store them in their local database.
Community member nodes SHOULD NOT store the wrapped `WakuMessage` messages.

Already stored messages with the same `id` or `clock` value MUST be replaced with messages extracted from archives, if both of these values are equal.
Member Author

@staheri14 I know we talked about this but I still went for "only replace what needs replacement" for now. This can cause inconsistency compared to the canonical history though, so if we really want to replace everything from T1 - T2, no matter what, let me know and I'll update this.

/cc @John-44

@PascalPrecht my vote is that we prioritize consistency of the community's message history over the risk that the owner node might not download a particular message. I think it's better for everybody in the community to have the exact same message history, even if this exact same message history can be missing a message if the owner node for some reason didn't download a message prior to creating the message archive for that 7ish day period

I also agree with consistency, i.e. replacing everything from T1-T2. We can later design a synchronization protocol across store nodes to make sure they all have a consistent message history.
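For reference, the "replace everything from T1-T2" option could look roughly like the sketch below. It assumes the `clock` value is what T1 and T2 are compared against, and the `Message` type and `replace_range` name are hypothetical stand-ins for whatever the local database layer provides.

```python
from typing import List, NamedTuple


class Message(NamedTuple):
    id: str
    clock: int
    text: str


def replace_range(local: List[Message], archived: List[Message],
                  t1: int, t2: int) -> List[Message]:
    """Drop every local message with t1 <= clock <= t2 and insert the
    archived messages for that range, so all members end up with the
    owner's history for the period."""
    kept = [m for m in local if not (t1 <= m.clock <= t2)]
    incoming = [m for m in archived if t1 <= m.clock <= t2]
    return sorted(kept + incoming, key=lambda m: (m.clock, m.id))
```

Note the trade-off discussed above: a message that only the member node received inside the range is dropped, in exchange for every member holding the same history.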


Not only will multiple owners multiply the amount of archive index messages being distributed to the network, they might also contain different sets of magnet links and their corresponding hashes.

Even if just a single message is missing in one of the histories, the hashes presented in archive indices will look completely different, causing the community member node to download the corresponding archive (which might be identical to an archive that was already downloaded, except for that one message).
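A small illustration of the point above. SHA-256 over length-prefixed messages is used here purely as a stand-in for however the archive hash is actually derived, and `archive_hash` is a hypothetical helper:

```python
import hashlib
from typing import List


def archive_hash(messages: List[bytes]) -> str:
    """Hash an archive's messages (stand-in for the real archive hash)."""
    h = hashlib.sha256()
    for m in messages:
        # A length prefix keeps message boundaries unambiguous.
        h.update(len(m).to_bytes(4, "big"))
        h.update(m)
    return h.hexdigest()


owner_a = [b"msg-1", b"msg-2", b"msg-3"]
owner_b = [b"msg-1", b"msg-3"]  # identical history, one message missing

# The two digests share nothing recognisable, so a member node cannot
# tell from the hashes alone that the archives are nearly identical.
different = archive_hash(owner_a) != archive_hash(owner_b)
```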
Member Author

@John-44 very important consideration that I think we haven't talked about yet.

@PascalPrecht yes it is, well spotted!

As soon as we have tokenised community ownership it won't be possible for two people (aka ethereum accounts) to own a single community, but there is nothing to stop a single person signing into two Status Desktop instances with the same profile and therefore their same account (that owns a community) would be running on two nodes.

Could we detect if this occurs, and automatically select only one node out of however many nodes the owner has spun up to be the node that produces history? We could also expose this setting to the community owner, to let the community owner select a different node to be the node that produces history perhaps?

What do you think?

Contributor

single message is missing in one of the histories, the hashes presented in archive indices

Agree, and if we can make design robust to this it'd be useful. I suppose this is related to the whole archival index discussion? (I haven't kept up here in detail, just noticed a lot of back and forth).

Member Author

We could also expose this setting to the community owner, to let the community owner select a different node to be the node that produces history perhaps?

Yes, I think what we can do is set a "main owner". So even if there are multiple ppl with private keys, only one main owner could be set. Obviously, with multiple owners having the private key and write privileges, each of them can change that value as they like.

This could still be problematic if they serve different archives. Then the question is: will member nodes simply ignore all the older archives in a given time range (because they might look completely different), or will they also download all of it and keep replacing all of it?

In other words: If member nodes detect that the history has changed, will they replace that entire history, or will they stick to only downloading the latest #n archives?

message CommunityMessageArchiveIndex {
  uint64 clock = 1;
  string magnet_uri = 2;
}
Member Author

If we decide to not distribute the index via magnet link, but as-is, then this payload needs to change.

@0x-r4bbit
Member Author

I'm gonna fix the typos once the review process/iteration is done

Community owner nodes go through the following (high level) process to provide community members with message histories (assumes community owner node is available 24/7):

1. Community owner creates a Status community
2. Community owner enables community history archive support

I think community history archive support should be on by default when a community owner creates a new community

Member Author

We will allow community members to turn this feature off, so I assume there's gonna be some UI switch that they can use when creating/editing communities.

I don't think we want to force community owners to have this enabled. Which means, owner could switch it off during creation. Because I assume that, I made it explicit that it has been enabled.

Yes, Community Owners should definitely be able to switch the community history archive service off, we will have a toggle in the community admin settings to let a community owner do this. I was proposing that this service should be switched on by default when creating a new community.

Community owner enables community history archive support

What does this mean in terms of specifications? Is it like a flag that should be set when running a status node? if there is a specification, please link it here

Member Author

Good point, there's no such specification yet. I assume this is something that should be stored in the logged-in user's Settings. @richard-ramos what do you think?


### Serving community history archives

Community owner nodes go through the following (high level) process to provide community members with message histories (assumes community owner node is available 24/7):

Ideally a community owner node is on 24/7, but we also support a community owner node coming online for, say, an hour only once every three days at a minimum. This minimum community owner node liveness assumption shouldn't break anything in this proposal. I had this in mind when I wrote the rough sketch of how this could work.

Member Author

Yes, this part of the overview covers the (ideal) scenario that the community owner node is online 24/7 and therefore generates archives every 7 days.

In reality, it might go offline, which is covered in the "Serving archives for missed messages" part.
Can make this more explicit if this isn't clear.

don't worry, I think this is clear, I just started commenting from the top as I read through the doc, once I reached that section it was clear

4. Community owner invites members and creates additional channels
5. Community owner node receives messages and stores them into local database
6. After 7 days, the community owner node exports and compresses last 7 days worth of messages from database and creates a magnet link from that data via torrent client
7. Community owner node creates message archive index and bundles the previously generated magnet link into it
@John-44 John-44 Dec 13, 2021

I think this should be "Community owner node prepends the newly created binary blob to the previously generated binary blob (if any - the first time this operation is performed there will be no previous binary blob to prepend the new binary blob to)"

Member Author

Yes, as mentioned here https://github.com/status-im/specs/pull/162/files#r767952892, this is technically not the case.

We're not actually bundling the raw data with the new raw data. The reason for that is, we can't guarantee the size in which pieces will be created by torrent, so we can't guarantee that the hashes for each piece will be the same every time (different length == different hash).

So instead, we're distributing an index that looks like:

index: {
  "0x123": {
    from: ...
    to: ...
    magnet_uri: magnet_link_for_archive_1
  },
  "0x456": {
    from: ...
    to: ...
    magnet_uri: magnet_link_to_archive_2
  }
  ...
}

Then, based on the hashes (which are derived from the magnet_uri which is unique), member nodes can decide which ones they need (all, some, none).

See "Fetching message archives" section for a detailed explanation
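A rough sketch of that selection step, assuming the index is decoded into a dictionary shaped like the example above (hash mapped to `from`, `to` and `magnet_uri`). The function name and the optional date-range parameters are hypothetical:

```python
from typing import Dict, List, Optional, Set


def archives_to_download(index: Dict[str, dict], known_hashes: Set[str],
                         wanted_from: int = 0,
                         wanted_to: Optional[int] = None) -> List[str]:
    """Return magnet URIs for archives the member node still needs."""
    uris = []
    for archive_hash, meta in index.items():
        if archive_hash in known_hashes:
            continue  # already downloaded, skip
        if meta["to"] < wanted_from:
            continue  # archive ends before the wanted range
        if wanted_to is not None and meta["from"] > wanted_to:
            continue  # archive starts after the wanted range
        uris.append(meta["magnet_uri"])
    return uris
```

A node that already holds every hash in the index gets an empty list back and downloads nothing.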

Member Author

The reason for that is, we can't guarantee the size in which pieces will be created by torrent, so we can't guarantee that the hashes for each piece will be the same every time

Quick update here, this should say "I assume we can't guarantee [...]".

Torrent clients seem to offer UIs to let users configure the size and/or number of pieces, so it might be possible to go a different route that doesn't use indices.

Will discuss with @staheri14 and update spec accordingly!

@PascalPrecht thanks for the call the other evening on this topic. Yes, we definitely should be able to set the torrent piece size to a fixed size, irrespective of the size of the torrent. It's probably better to go for a larger piece size, as this will work better when a torrent grows larger. 4MB seems to be a common torrent piece size.
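To reason about what a fixed piece size buys us, here is a sketch of per-piece hashing in the style of BitTorrent v1 (SHA-1 over consecutive fixed-size chunks). The tiny piece size is only for illustration, and `piece_hashes` is a hypothetical helper rather than a torrent client API:

```python
import hashlib
from typing import List


def piece_hashes(data: bytes, piece_size: int) -> List[str]:
    """SHA-1 of each consecutive piece_size chunk (last one may be short)."""
    return [hashlib.sha1(data[i:i + piece_size]).hexdigest()
            for i in range(0, len(data), piece_size)]


PIECE = 8  # illustration only; the comment above suggests something like 4MB
old = b"a" * 8 + b"b" * 8
appended = old + b"c" * 5
prepended = b"c" * 5 + old

# Appending keeps the hashes of the earlier full pieces intact.
append_stable = piece_hashes(appended, PIECE)[:2] == piece_hashes(old, PIECE)
# Prepending shifts every piece boundary unless the new data is an exact
# multiple of the piece size, so no earlier hash survives here.
prepend_overlap = set(piece_hashes(prepended, PIECE)) & set(piece_hashes(old, PIECE))
```

If this holds for real torrents as well, a fixed piece size alone would not keep piece hashes stable when data is prepended; the new data would also need to be padded to a multiple of the piece size, or be appended instead.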

Community owner node creates message archive index and bundles the previously generated magnet link into it

I'd add a link to the section where "message archive index" is explained

Suggested change
7. Community owner node creates message archive index and bundles the previously generated magnet link into it
7. Community owner node creates message archive index and bundles the magnet link generated in step 6 into it

Member Author

I'd add a link to the section where "message archive index" is explained

Will do!

5. Community owner node receives messages and stores them into local database
6. After 7 days, the community owner node exports and compresses last 7 days worth of messages from database and creates a magnet link from that data via torrent client
7. Community owner node creates message archive index and bundles the previously generated magnet link into it
8. Community owner node creates magnet link from index and distributes it to community members via special channel created in step 2) through the Waku network

How about replacing "from index" with "from binary blob" to make it clearer what we are talking about?

Member Author

@0x-r4bbit 0x-r4bbit Dec 13, 2021

This should say "from archive index". Sorry there's a missing word...

Apart from that, as mentioned here https://github.com/status-im/specs/pull/162/files#r767957001, this would be incorrect.

We are creating a magnet link from the archive index.

Although as mentioned here https://github.com/status-im/specs/pull/162/files#r767717175, creating a magnet link for the index may not be necessary, as we can distribute it via waku messages as is.

The relation between the special channel and the waku network needs to be clarified

Suggested change
8. Community owner node creates magnet link from index and distributes it to community members via special channel created in step 2) through the Waku network
8. Community owner node creates magnet link from index and distributes it to community members via special channel created in step 3) through the Waku network

Member Author

The relation between the special channel and the waku network needs to be clarified

Can you elaborate what you mean by that? From waku's perspective, it's just another channel.
I don't go into the specifics of what that channel looks like and how other nodes can recognize it because this is still just the high-level overview part

If the community owner node goes offline, it MUST go through the following process:

1. Community owner node restarts
2. Community owner node requests messages from store nodes for the missed time range

to be more explicit, perhaps add a mention of 'all channels in their community' to the end of the sentence?

"2. Community owner node requests messages from store nodes for the missed time range for all channels in their community"

Member Author

Sure can add that!

Is there any specification on how to measure the missed time range in waku v1? in waku v2 FT-store, we measure it based on the waku message timestamp of the last message stored in the db, up to the time the node goes back online
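The measurement described above could be sketched as follows, assuming both values are Unix timestamps in the same unit. `missed_range` is a hypothetical name, not an existing Waku or status-go function:

```python
from typing import Optional, Tuple


def missed_range(last_stored_ts: Optional[int],
                 back_online_ts: int) -> Optional[Tuple[int, int]]:
    """Range to request from store nodes after coming back online.

    last_stored_ts: timestamp of the last message stored in the local
    database, or None if the database is empty.
    """
    if last_stored_ts is None or back_online_ts <= last_stored_ts:
        return None  # nothing to base a range on, or nothing was missed
    return (last_stored_ts, back_online_ts)
```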

Contributor

Will a community owner always know the full list of channels in a community as soon as one is created?

Member Author

Is there any specification on how to measure the missed time range in waku v1?

@staheri14 I don't know if there is even such a mechanism in Waku V1. @richard-ramos I believe there's no FT-Store equivalent in Waku V1, is that correct?

I thought, given that the owner node is essentially a Status app, we'd implement it similar to (or even exactly) how it's done in Waku V2 FT-Store.

Will a community owner always know the full list of channels in a community as soon as one is created?

I assume only community owner/admins are able to create channels in communities, so they should always know the full list of channels in a community.


1. Community owner node restarts
2. Community owner node requests messages from store nodes for the missed time range
3. Missed messages are stored into local database

perhaps change:

  1. Missed messages are stored into local database

to

  1. All missed messages are added to the community owner node's local message database

I'm suggesting adding "All" at the beginning of the sentence to make it explicit that the next step shouldn't happen until fetching all of the missing messages is complete.

Member Author

Happy to update this to be more explicit.

Just pointing out though that this:

to make it explicit that the next step shouldn't happen until fetching all of the missing messages is complete.

we can't actually guarantee.

I've added a comment about that in the "Downloading message archive indices" section.

1. Community owner node restarts
2. Community owner node requests messages from store nodes for the missed time range
3. Missed messages are stored into local database
4. Community owner node creates message archive and magnet link for missed messages

Shouldn't this be:

  1. If 7 or more days have elapsed since the last message history torrent was created, then the community owner node exports and compresses the last 7 days' worth of messages from the database into a binary blob

  2. Community owner node prepends the newly created binary blob to the previously generated binary blob

  3. Community owner node creates magnet link from index and distributes it to community members via special channel

(note: I'm also suggesting updating steps 5 and 6 directly below this comment)

Member Author

If 7 or more days have elapsed since the last message history torrent was created, then the community owner node exports and compresses the last 7 days' worth of messages from the database into a binary blob

Does this still apply in this case? This section covers the scenario where the community owner node was offline for #n days and simply wants to restore the message gap.

Once that is done it'll go back to "every 7 days" routine.

@PascalPrecht say an owner node goes offline for 3 days immediately after creating the last torrent. In this scenario when the owner node comes back online, only 3 days have elapsed since the last torrent was created, so a new torrent doesn't need to be created for another 4 days. If we don't do this, if an Owner Node goes on and offline frequently, that community will end up with torrents being created far more frequently than needed
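The rule described above amounts to a simple check on restart. A minimal sketch, assuming Unix-second timestamps and the 7 day interval from the overview; the function name is hypothetical:

```python
DAY = 24 * 60 * 60
ARCHIVE_INTERVAL = 7 * DAY  # seconds between torrents, per the overview


def seconds_until_next_archive(last_archive_ts: int, now_ts: int,
                               interval: int = ARCHIVE_INTERVAL) -> int:
    """0 means a new torrent is due now; otherwise how long to wait."""
    return max(0, last_archive_ts + interval - now_ts)


# Owner node comes back 3 days after the last torrent: wait 4 more days.
wait = seconds_until_next_archive(last_archive_ts=0, now_ts=3 * DAY)
```

An owner node that goes on and offline frequently therefore still produces at most one torrent per interval.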

Member Author

Aah, got it now. Yes, absolutely correct. Will update.


1. User joins community and becomes community member
2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
3. Member node requests message history (last 30 days) of community channels from store nodes

If the Member node can fetch a magnet link to a binary blob that contains the community's previous history, and the history included in this binary blob starts (for example) at 7 days in the past and then goes backwards, then the member node only needs to fetch message history up until the point when the community history archive service takes over, and this might often be less than 30 days.

Member Author

The 30 days here are the up to 30 days that store nodes save. When a community member joins a community, that member has 0 days of message history. So what happens is that it will probably fetch last 30 days of message history from store nodes (this is how this works today).

Then, now that there's a notion of message archive index signals, member nodes can recognise those and start requesting archives older than 30 days.

@PascalPrecht I think what currently happens is that when a member joins a community, only the messages in the 'message auto-fetch window' are automatically downloaded. I can't remember what this currently defaults to on Desktop, but I think it might be 7 days? If a user wants to fetch messages beyond the 'message auto-fetch window', the user needs to go into the channel they want to fetch messages for and manually trigger a message fetch. I would love it if we could extend this 'message auto-fetch window' to fetch the full 30 days of message history stored on mailservers, however I remember we ran into a bunch of issues when we tried to do this earlier in the year, and that's why we landed on a shorter default 'message auto-fetch window'.

Now for this community history archive service, it's very important that for the Owner Node this 'message auto-fetch window' is lengthened to always try to fetch all messages from a community that have been sent since the Owner Node was last online, even if the Owner Node was (for example) last online 28 days ago. This should probably be written down in the spec somewhere.

For community members, I would love it if the 'message auto-fetch window' default setting could be lengthened to 30 days. However, if the issues that prevented us from doing this earlier in the year are still present, we might need to settle for a shorter default message auto-fetch window, say 8 days?

Member Author

Ah, that's interesting. I wasn't aware of that. Should've verified. My understanding was that all the available live message history (30 days) will be loaded.

I don't think it makes such a big difference if we fetch the first 7 or 8 days, but what's important is that, when a member joins, that member needs to fetch enough messages to receive the last published magnet link. If that one wasn't published within that "message auto-fetch window", we'll have to keep fetching older messages periodically until we receive a magnet link message either way.

@richard-ramos do you know if it's possible to extend the 'message auto-fetch' for individual channels? So that community members will fetch the last 30 days (not 7) for that special magnet link channel?
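
As a rough sketch of the "keep fetching older messages until a magnet link message shows up" idea from this comment: the helper below walks backwards in fixed-size windows up to a maximum age. The `fetch` callable, its `(start_day, end_day)` signature, and the default window sizes are assumptions made for this example; they do not describe an existing Status or Waku API.

```python
from typing import Callable, Iterable, Optional


def find_latest_magnet_link(
    fetch: Callable[[int, int], Iterable[str]],
    window_days: int = 7,
    max_days: int = 30,
) -> Optional[str]:
    """Walk backwards in window_days steps until a magnet link message is found.

    fetch(start, end) is a hypothetical callable that returns the messages of the
    magnet link channel aged between `start` and `end` days, newest first.
    """
    start = 0
    while start < max_days:
        end = min(start + window_days, max_days)
        for message in fetch(start, end):
            if message.startswith("magnet:?"):
                return message
        start = end
    # Nothing was published within the range store nodes can serve.
    return None
```

The first window corresponds to the default auto-fetch window; later iterations only run when no magnet link was found in it.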

1. User joins community and becomes community member
2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
3. Member node requests message history (last 30 days) of community channels from store nodes
4. Member node receives magnet link message from store nodes

I think:

"4. Member node receives magnet link message from store nodes"

should actually be

"4. Member node receives magnet link message from the special hidden channel"

Member Author

Okay can change that!

Suggested change
4. Member node receives magnet link message from store nodes
4. Member node receives the waku message that contains the message archival index magnet link from the special hidden channel

It is just a suggestion, feel free to edit it as you see fit.

2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
3. Member node requests message history (last 30 days) of community channels from store nodes
4. Member node receives magnet link message from store nodes
5. Member node extracts magnet link from message and passes it to torrent client

The message is just a magnet link and nothing else, so it doesn't really need to be extracted in any way other than just the whole message being passed to the torrent client

Member Author

The "extracting" part here refers to the fact that messages are distributed as WakuMessage and need to be unwrapped to actually see what's inside.

If this seems confusing I can probably remove that detail here.

I think this is fine, up to you :-)

I think if we can explain what is inside the waku message distributed in the hidden channel, then it would be clear to the reader what it means to extract the message.

Suggested change
5. Member node extracts magnet link from message and passes it to torrent client
5. Member node extracts magnet link from the waku message and passes it to torrent client

Contributor

Do we have a protobuf or JSON or similar for how this data will be represented? Might've missed this in spec.

Glad to see this is WakuMessages, because it'll make future compatibility infinitely easier! Magnet link could be its own field as well. By keeping this an open kv map (protobuf or json), we can extend it as there's a need, e.g. with any type of compression/time period or whatever we may want to communicate.

What happens, e.g., if someone else posts a magnet link to this channel and it doesn't belong to that community? Do people just start seeding random content then? Seems like an attack vector...

Member Author

> Do we have a protobuf or JSON or similar for how this data will be represented? Might've missed this in spec.

@oskarth this is covered here: https://github.com/status-im/specs/pull/162/files#diff-1d7b2a048dd5e11a0620aaa98e258f12170764802e703696ee623392214cbd95R233

Essentially we expect messages in the special channel to be ApplicationMetadataMessages (just like most other messages sent by Status) and then introduce a new payload type by which we know that the message is not a normal chat message, but indeed a special message that contains a magnet link.

> What happens, e.g., if someone else posts a magnet link to this channel and it doesn't belong to that community? Do people just start seeding random content then? Seems like an attack vector...

Very good point. A Status node needs to verify that the magnet link message that came in through the special channel is signed by the community owner; only then is it treated as trusted.
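
A minimal sketch of that check, under stated assumptions: `SignedMessage`, `trusted_magnet_payload` and the injected `recover_signer` callable are hypothetical names for illustration. The real signature recovery would come from whatever signing scheme Status messages already use, which is deliberately not modelled here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SignedMessage:
    payload: bytes    # e.g. the serialized magnet link message
    signature: bytes  # signature attached by the sender


def trusted_magnet_payload(
    message: SignedMessage,
    owner_public_key: bytes,
    recover_signer: Callable[[bytes, bytes], bytes],
) -> Optional[bytes]:
    """Return the payload only if it was signed by the community owner.

    recover_signer(payload, signature) is a hypothetical hook that returns the
    public key of whoever produced the signature.
    """
    signer = recover_signer(message.payload, message.signature)
    if signer != owner_public_key:
        # Not signed by the owner: ignore, never hand it to the torrent client.
        return None
    return message.payload
```

Messages for which this returns `None` would simply be dropped, so a member posting an unrelated magnet link to the channel would not cause other nodes to start seeding it.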

3. Member node requests message history (last 30 days) of community channels from store nodes
4. Member node receives magnet link message from store nodes
5. Member node extracts magnet link from message and passes it to torrent client
6. Torrent client downloads latest message archive index via magnet link

  1. Torrent client downloads latest message archive index via magnet link

perhaps change to

  1. Torrent client downloads latest message archive binary blob via magnet link

(I think "binary blob" is more descriptive of what we are talking about here than "index")

Member Author

> (I think "binary blob" is more descriptive of what we are talking about here than "index")

As mentioned in a bunch of other comments: everything is a blob at the end of the day. What matters is how it's encoded. In this case it happens to be a WakuMessageArchiveIndex.

^ This information is more important for this spec to work, than that everything is a blob.

ahh, just remembered, we need to add something here about the torrent client performing a 'force recheck' function using the new magnet link on top of the previously downloaded binary. This is very important to stop clients needing to download the same data multiple times

4. Member node receives magnet link message from store nodes
5. Member node extracts magnet link from message and passes it to torrent client
6. Torrent client downloads latest message archive index via magnet link
7. Member node fetches missing archives via torrent

Perhaps change from

"7. Member node fetches missing archives via torrent"

to

"7. Member node fetches the binary blob that contains the message history archive for the community via torrent"

5. Member node extracts magnet link from message and passes it to torrent client
6. Torrent client downloads latest message archive index via magnet link
7. Member node fetches missing archives via torrent
8. Member node unpacks and decompresses message archive data to then hydrate its local database

Perhaps add ", deleting any messages for that community that the database previously stored in the same timerange as covered by the message history archive binary blob"

e.g.

"8. Member node unpacks and decompresses message archive data to then hydrate its local database, deleting any messages for that community that the database previously stored in the same timerange as covered by the message history archive binary blob"
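
To make the proposed step 8 concrete, here is a small sketch using SQLite. The `messages` table, its columns, and the half-open time range `[archive_from, archive_to)` are assumptions for this example only; the actual Status database schema is not described in this thread.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id TEXT PRIMARY KEY,
    community_id TEXT NOT NULL,
    timestamp INTEGER NOT NULL,
    body TEXT NOT NULL
)
"""


def hydrate_from_archive(db, community_id, archive_from, archive_to, archive_messages):
    """Replace the community's local messages in [archive_from, archive_to)
    with the messages contained in the archive.

    archive_messages is an iterable of (id, timestamp, body) tuples.
    """
    with db:  # single transaction: either both statements apply or neither does
        db.execute(
            "DELETE FROM messages "
            "WHERE community_id = ? AND timestamp >= ? AND timestamp < ?",
            (community_id, archive_from, archive_to),
        )
        db.executemany(
            "INSERT INTO messages (id, community_id, timestamp, body) "
            "VALUES (?, ?, ?, ?)",
            [(mid, community_id, ts, body) for mid, ts, body in archive_messages],
        )
```

Running the delete and insert in one transaction avoids a window in which the member's database holds neither the old nor the archived messages for that time range.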

Member Author

Okay cool, will change this

This should not happen considering that the network is reliable (and messages are received by all the live nodes), or otherwise we should update the assumptions set out at the beginning

Contributor

I'm not sure I follow this conversation.

  1. The assumption currently is definitely that the network is reliable
  2. In a distributed system the network isn't actually reliable, but we are assuming it is to simplify things for now (e.g. with cluster operation etc)

How does this connect with the replacing or not of local db? Maybe more generally: what are the scenarios where the two file systems might be out of sync, and are we 100% confident that our reconciliation algorithm is correct here? From a development POV, at a minimum this should be printed out to debug and perhaps some form of backup should be used. Some scenarios I can imagine:

  1. Client didn't request all store messages so a lot of new data comes in from archive, all good (easy to sync)
  2. Client actually has more data than archive, which means there's an inconsistent view. My understanding is that from a product POV we prefer to then force users to have the same view based on community owner POV, outside of network/logical issues (who gets what information). This also means community owners have some control to censor if they so wish (they can in other ways too so eh).
  3. ...probably a few more

I think whatever we think makes sense from a product POV that simplifies things is fine here, as long as we are explicit about what assumptions around reconciliation we are making, as well as what type of attacks that makes it vulnerable to.

Sorry maybe I'm misunderstanding here.

Member Author

What @oskarth has described here is correct. We're forcing the community owner's history onto all members, even when they have received more messages than the owner (which causes the member to delete messages that don't exist in the owner's state).

> This should not happen considering that the network is reliable (and messages are received by all the live nodes), or otherwise we should update the assumptions set out at the beginning

@staheri14 can you maybe elaborate how this is related to updating the member's database?


Community owner nodes MUST store live messages as [14/WAKU2-MESSAGE](https://rfc.vac.dev/spec/14/). This is required to provide confidentiality, authenticity, and integrity of message data distributed via the BitTorrent layer, and later validated by Status nodes when they unpack message history archives.

Community owner nodes SHOULD remove those messages from their local databases after they have been turned into archives and distributed to the BitTorrent network.

I think this is incorrect. Community owner nodes SHOULD NOT remove those messages from their local databases after they have been turned into archives and distributed to the BitTorrent network.

Because if a community owner node did this, then anybody directly using the owner node to browse the community wouldn't be able to search their own community, which would be weird.

Member Author

Probably not as clear in this part (I'll make this more specific), but as mentioned here: https://github.com/status-im/specs/pull/162/files#r767997713

This is referring to the WakuMessages, which are no longer needed after they have been migrated to long-term storage. ApplicationMetadataMessages will stay around. Those are the ones used for rendering, searching, etc.

@John-44 John-44 Dec 14, 2021

thanks for the explanation, this makes sense to me now

I'd say no message should be deleted unless it is older than 30 days. Also, the store protocol db should not be updated; we are just using it to get our input to BitTorrent, i.e. messages in the waku message format. We have not yet thought about how to update the store protocol db.

Member Author

> Also, the store protocol db should not be updated; we are just using it to get our input to BitTorrent, i.e. messages in the waku message format. We have not yet thought about how to update the store protocol db.

Not sure I follow this part. Which protocol are you referring to? And why should it be updated?


The `dn` parameter ("display name") in the resulting magnet link MAY be optional.

The resulting magnet link MUST be bundled into a `WakuMessageArchiveIndex`, which is then later distributed to other Status nodes.

@PascalPrecht I don't understand this step, why must the magnet link be bundled into WakuMessageArchiveIndex? I would have thought it just needs to be sent to all other community members via the hidden channel, but perhaps I'm missing something?

Member Author

I've tried to explain this here: https://github.com/status-im/specs/pull/162/files#r767957001

Then "bundling" in this spec merely refers to:

"Let's create an index of magnet links including all magnet links that have been created in the past + the new one that was just created"

^ That thing (also known as WakuMessageArchiveIndex) is then sent to community members via the special channel.

We can decide if we want to send this as-is, or, if we should create a magnet link for that index as well. So, sending the index directly vs sending the index as magnet link.

Mentioned it here: #162 (comment)


## Bundling history archives into archive indices

Community owner nodes MUST provide message archives for the entire community history. However, each individual archive only contains a subset of the complete history, that is, either data for a time range of seven days or data for a time range in which the node was offline. Therefore, message history archives need to be bundled into a `WakuMessageArchiveIndex`, which is later distributed via the Waku network and allows receiving nodes to fetch archives for individual time ranges.

The entire history of each community (including all channels in that community) should be contained in a single binary blob. Every time a new chunk of history is generated, it should be prepended to the preexisting binary blob, and a new magnet link created from this newly enlarged binary blob. As such, members will only ever need to fetch the most recent magnet link from the hidden channel to access the entire history of a community.

Member Author

This is the tricky part. If we prepend this, community owners have to redownload the entire torrent every time as there's no easy way to recognise that some data of that torrent to be downloaded has already been downloaded (assuming that we can't control the size of pieces, which affects this).

Tried to touch on this here: #162 (comment)

@PascalPrecht given that we can control the size of torrent pieces, as per our hangout discussion the other evening, this approach should be doable


The community owner node MUST create a `WakuMessageArchiveIndex` every time it creates a new `WakuMessageArchive`.

For every created `WakuMessageArchive`, there MUST be a `WakuMessageArchiveMetadata` entry in the index map.
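
The relationship between archives and index entries can be sketched as follows. The field names (`from_ts`, `to_ts`, `magnet_uri`), the string `archive_id` key, and the plain `dict` used as the index map are assumptions for illustration; the spec's actual protobuf definitions are not reproduced in this excerpt.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ArchiveMetadata:
    # Hypothetical stand-in for WakuMessageArchiveMetadata.
    from_ts: int      # start of the covered time range (unix seconds)
    to_ts: int        # end of the covered time range (unix seconds)
    magnet_uri: str   # magnet link of the individual archive


def add_archive(
    index: Dict[str, ArchiveMetadata], archive_id: str, metadata: ArchiveMetadata
) -> Dict[str, ArchiveMetadata]:
    """Return a new index containing all previous entries plus the new archive."""
    if archive_id in index:
        raise ValueError(f"archive {archive_id!r} is already indexed")
    new_index = dict(index)
    new_index[archive_id] = metadata
    return new_index
```

Each call models "a new archive was created, so a new index is created": the previous index is left untouched and the returned one has exactly one more metadata entry.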

I don't understand why all of this is needed, and why we don't just send the latest magnet link over the hidden channel whenever it's created. The latest magnet link will always be the single thing the user needs to download the entire history of the community.

## Message archive distribution

Message archives are available via the BitTorrent network as soon as magnet links for them have been created.
Other community member nodes will download the message archives from the BitTorrent network once they receive a magnet link that contains a message archive index.

I don't think we need a 'message archive index'.


All messages sent with this topic MUST be instances of `ApplicationMetadataMessage` ([6/PAYLOADS](/specs/6-payloads)) with a `payload` of `CommunityMessageArchiveIndex`.

Only the community owner has permission to send messages with this topic.

To say the same thing using the Status Communities terminology:

Only the community owner has permissions to post to the hidden channel.

Contributor

How is this enforced? It isn't at a transport level. So how are clients verifying this?

Perhaps this can be phrased as:

"Only the community owner MAY post to the hidden channel. Other messages on this specified channel MUST be ignored by clients."

Member Author

Updated.

All messages sent with this topic MUST be instances of `ApplicationMetadataMessage` ([6/PAYLOADS](/specs/6-payloads)) with a `payload` of `CommunityMessageArchiveIndex`.

Only the community owner has permission to send messages with this topic.
Community members MUST NOT have permission to send messages with this topic.

To say the same thing using the Status Communities terminology:

Community members have permissions to read (but not to post to) the hidden channel.


## Canonical message histories

Only community owners are allowed to distribute messages with magnet links via the magnet link channel. Community members MUST NOT be allowed to distribute magnet links. Since the magnet links are created from the community owner node's database (and previously distributed archives), the message history provided by the community owner becomes the canonical message history and single source of truth for the community.

Instead of saying "Community members MUST NOT be allowed to distribute magnet links." I would say "Community members MUST NOT be allowed to post any messages to the hidden channel".


Generally, fetching message archives is a three step process:

1. Receive message archive index signal, download index, then determine which message archives to download

I think 1. is simpler e.g.

  1. receive a torrent link in the hidden channel, pass torrent link to torrent client so torrent client can start downloading the binary blob

2. The member node requests messages for a time range of up to 30 days from store nodes (this is the case when a new community member joins a community)

### Downloading message archive indices
When member nodes receive a message with a `CommunityMessageArchiveIndex` ([6/PAYLOADS](/specs/6-payloads)) from the aforementioned channel, they MUST extract the `magnet_uri` and pass it to their underlying BitTorrent client so they can fetch the latest message archive index.

again I don't understand why the CommunityMessageArchiveIndex exists, and why we don't just directly send the magnet links in the hidden channel instead

Therefore, member nodes MUST wait for 20 seconds after receiving the last `CommunityMessageArchiveIndex` before they start extracting the magnet link to fetch the latest archive index.
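
One way to read the 20 second rule is as a debounce: every newly received index message restarts the wait, and only the most recent one is acted upon. The sketch below assumes that reading; the class name `IndexDebouncer` and the explicit `now` arguments (instead of a real timer) are choices made for this example.

```python
from typing import Optional


class IndexDebouncer:
    """Hold back the latest index message until no newer one arrived for quiet_seconds."""

    def __init__(self, quiet_seconds: float = 20.0) -> None:
        self.quiet_seconds = quiet_seconds
        self._latest: Optional[str] = None
        self._received_at: float = 0.0

    def receive(self, index_message: str, now: float) -> None:
        # A newer message replaces the pending one and restarts the wait.
        self._latest = index_message
        self._received_at = now

    def ready(self, now: float) -> Optional[str]:
        """Return the pending message once the quiet period has passed, else None."""
        if self._latest is None or now - self._received_at < self.quiet_seconds:
            return None
        message, self._latest = self._latest, None
        return message
```

A node would call `ready()` periodically and extract the magnet link only from the message it returns, so a burst of index messages fetched from store nodes results in a single torrent download.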

### Downloading individual archives
Once a message archive index is downloaded, community member nodes use a local lookup table to determine which of the listed archives are missing. For this lookup to work, member nodes MUST store the KECCAK-256 hashes of the magnet links for archives they've downloaded.
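
A sketch of such a lookup, with one loudly stated substitution: Python's standard library has no Keccak-256, so `hashlib.sha3_256` is used here purely as a stand-in. SHA3-256 and Keccak-256 produce different digests, so a real implementation would need an actual Keccak-256 function. The function names are hypothetical.

```python
import hashlib
from typing import Iterable, List, Set


def magnet_hash(magnet_uri: str) -> str:
    # Stand-in only: SHA3-256 is NOT the same digest as KECCAK-256.
    return hashlib.sha3_256(magnet_uri.encode("utf-8")).hexdigest()


def missing_archives(index_magnet_uris: Iterable[str], downloaded_hashes: Set[str]) -> List[str]:
    """Return the magnet links from the index whose hashes are not in the local lookup table."""
    return [uri for uri in index_magnet_uris if magnet_hash(uri) not in downloaded_hashes]
```

After an archive has been downloaded and imported, its hash would be added to `downloaded_hashes` so the next index only yields archives that are still missing.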

Can we get rid of the 'message archive index' and then we can also get rid of the whole local lookup table, this should simplify things.

A member only ever needs the latest magnet link to be able to download all a community's history.

1. **Download all archives** - Extract each magnet link in the index and pass them to the underlying BitTorrent client (this is the case for new community member nodes that haven't downloaded any archives yet)
2. **Download only the latest archive** - Extract only the newest magnet link and pass it to the BitTorrent client (this is the case for any member node that already has downloaded all previous history and is now interested in only the latest archive)
3. **Download specific archives** - Look into `from` and `to` fields of every `WakuMessageArchiveMetadata` and only extract magnet links for archives of a specific time range (can be the case for member nodes that have recently joined the network and are only interested in a subset of the complete history)
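
The three options above can be sketched as selection functions over the index entries. `ArchiveMetadata` and its fields `from_ts`, `to_ts` and `magnet_uri` are hypothetical stand-ins for `WakuMessageArchiveMetadata`, and treating time ranges as half-open intervals is an assumption of this example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ArchiveMetadata:
    from_ts: int     # stand-in for the `from` field
    to_ts: int       # stand-in for the `to` field
    magnet_uri: str


def select_all(entries: Sequence[ArchiveMetadata]) -> List[str]:
    """Option 1: every archive listed in the index."""
    return [e.magnet_uri for e in entries]


def select_latest(entries: Sequence[ArchiveMetadata]) -> List[str]:
    """Option 2: only the archive with the most recent end time."""
    if not entries:
        return []
    return [max(entries, key=lambda e: e.to_ts).magnet_uri]


def select_range(entries: Sequence[ArchiveMetadata], start: int, end: int) -> List[str]:
    """Option 3: archives whose [from_ts, to_ts) overlaps [start, end)."""
    return [e.magnet_uri for e in entries if e.from_ts < end and e.to_ts > start]
```

Whichever function is used, the returned magnet links are what would be handed to the BitTorrent client.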

I think we can probably get rid of points 1, 2 and 3 above, because a single magnet link is all that is ever needed.


### Bandwidth consumption

Community member nodes will download the latest archive they've received from the archive index, which includes messages from the last seven days. Assuming that community members nodes were online for that time range, they have already downloaded that message data and will now download an archive that contains the same.

To avoid clients downloading the same community history archive data two (or more!) times, it's very important that a torrent 'Force recheck' function is performed on top of the previously downloaded binary, so that the previously downloaded data is not downloaded again


Community member nodes will download the latest archive they've received from the archive index, which includes messages from the last seven days. Assuming that community members nodes were online for that time range, they have already downloaded that message data and will now download an archive that contains the same.

This means there's a possibility member nodes will download the same data at least twice.

Member nodes should never need to download the same data from the community history archive service twice, see comment about the need to perform a 'force recheck' function with any new magnet link on top of any previously downloaded binary before commencing download of the latest magnet link.

Of course, a member's client does first receive live messages via waku, and then receives the messages a second time via torrent, so in this way a client is downloading every message exactly twice.

Member Author

That's exactly what this consideration is pointing out.

We do have the index file now in the latest version of this spec which gives us metadata about available archives. One thing member nodes could do is check whether they have been online in the time range of the latest available archive and then decide to not download the data and just consider it "downloaded"

But that will conflict with the idea that the community owner node is the canonical history. If the member node has received more or different messages than the community owner in that time range, the histories won't be identical.

So I guess for now we just need to accept that there's a possibility that data is being downloaded twice (live messages + archive via torrent).


### Multiple community owners

It is possible for community owners to export the private key of their owned community and pass it to other users so they become community owners as well. This means, it's possible for multiple owners to exist.

This is a short term problem that we can probably ignore for now. The reason we can ignore this is that as soon as possible we will tokenize community ownership with an NFT, and this will ensure that there can only ever be one owner. Now that one owner could run two nodes, but we can detect if this is happening and warn them that they shouldn't do this. Or if we detected two owner nodes, we could randomly assign only one of the nodes to be able to produce history torrents, and expose a setting for the community owner to select a specific node to hold this responsibility?

Contributor

While we can ignore it, it is still a possible attack vector that at minimum should be mentioned.

It isn't obvious to me that separating the two is going to always be straightforward. Example attack: community owner key gets compromised, and they start posting two different databases that are off-by-one that keeps churning local user db. Not a huge concern and perhaps not very likely, and if community is compromised a user can always leave etc etc.

Member Author

> Or if we detected two owner nodes, we could randomly assign only one of the nodes to be able to produce history torrents, and expose a setting for the community owner to select a specific node to hold this responsibility?

I believe there's no way for us to guarantee/enforce this. Theoretically, community owners could run a version of a node that bypasses all of that and still just publishes magnet links on the special channel.

So I guess the easiest thing we can do to account for that is to set and store a "main owner" that other nodes will then use to verify that the magnet link message was signed by that main owner.

All other magnet link messages will be ignored.
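
A minimal sketch of that filtering step, assuming each magnet link message already exposes the public key recovered from its signature. The `MagnetLinkMessage` type and its `signer` field are hypothetical names used for illustration only, not part of any spec:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MagnetLinkMessage:
    signer: str      # hex public key recovered from the signature (assumed field)
    magnet_uri: str  # magnet link carried by the message


def accepted_magnet_links(messages, main_owner):
    """Return only the magnet links whose message was signed by the main owner."""
    return [m.magnet_uri for m in messages if m.signer == main_owner]
```

Anything not signed by the stored main owner is dropped before it ever reaches the torrent client.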

@staheri14 staheri14 left a comment

Thanks a lot @PascalPrecht for preparing the specs! in general looks good to me!
I have left some comments for the first half of the specs, I will leave further comments for the rest as I go through


## Abstract

Messages are stored permanently by store nodes ([11/WAKU-MAILSERVER](/spec/11), or [13/WAKU2-STORE](https://rfc.vac.dev/spec/13/)) for up to 30 days. Messages older than that are no longer provided by store nodes, making it impossible for other nodes to request historical messages older than that. This is especially problematic in the case of Status communities, where recently joined members of a community aren't able to request complete message histories of the community channels.

There is no such limit of 30 days persistence in the wakuv2 store protocol

Member Author

Correct, there's no limit but a max. number of days that messages are stored, which is configurable. It looked like currently 30 days is what's being used (and the default?), so I described it as such.

Will update this paragraph accordingly


| Name | References |
| -------------------- | --- |
| Waku node | An Ethereum node with Waku V1 enabled, or a [10/WAKU2](https://rfc.vac.dev/spec/10/) node that implements [11/WAKU2-RELAY](https://rfc.vac.dev/spec/11/)|

wondering why an Ethereum node?

Member Author

This was honestly taken from 10/WAKU-USAGE. Probably out of date. Will change this.

| Waku node | An Ethereum node with Waku V1 enabled, or a [10/WAKU2](https://rfc.vac.dev/spec/10/) node that implements [11/WAKU2-RELAY](https://rfc.vac.dev/spec/11/)|
| Store node | A Waku node that implements [11/WAKU-MAILSERVER](/spec/11) or [13/WAKU2-STORE](https://rfc.vac.dev/spec/13/) respectively |
| Waku network | A group of Waku nodes connected through the internet connection and forming a graph |
| Community owner | A Status user that owns a Status community |

We may need to be specific about "ownership", also what we mean by "Status user" and "community", let's discuss them in our call

Contributor

Agree, it'd be useful to refer to what keys etc. they have access to. If "owner" is a well-defined concept in the community spec, we can refer to that.

Member Author

Great points, will update.
It looks like linking to an existing concept isn't possible at the moment, because it turns out the original spec for communities has never landed: #151

| Waku network | A group of Waku nodes connected through the internet connection and forming a graph |
| Community owner | A Status user that owns a Status community |
| Community member | A Status user that is part of a Status community |
| Community owner node | A Status node with message archive capabilities enabled, run by a community owner |

Are "Status node" and "Status user" different?

Member Author

Hm.. wondering what you're getting at with this question.. A Status user is a Status account, a Status node is an application that runs a Status node (which a Status account can log into).

I'll add an entry for Status node as well.

| -------------------- | --- |
| Waku node | An Ethereum node with Waku V1 enabled, or a [10/WAKU2](https://rfc.vac.dev/spec/10/) node that implements [11/WAKU2-RELAY](https://rfc.vac.dev/spec/11/)|
| Store node | A Waku node that implements [11/WAKU-MAILSERVER](/spec/11) or [13/WAKU2-STORE](https://rfc.vac.dev/spec/13/) respectively |
| Waku network | A group of Waku nodes connected through the internet connection and forming a graph |

I'd suggest being specific about the protocol through which waku nodes are connected i.e., wakuv2 relay (and its equivalent in waku v1)

2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
3. Member node requests message history (last 30 days) of community channels from store nodes
4. Member node receives magnet link message from store nodes
5. Member node extracts magnet link from message and passes it to torrent client

I think if we explain what is inside the waku message distributed in the hidden channel, it will be clear to the reader what it means to extract the magnet link from it
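
As a rough illustration of the "extract" step, here is a minimal sketch that pulls the BitTorrent info hash out of a magnet URI string, assuming the `magnet_uri` string has already been read from the message payload. The function name is hypothetical; only the `xt=urn:btih:` parameter from the Magnet URI scheme is relied on:

```python
from urllib.parse import parse_qs, urlparse


def extract_info_hash(magnet_uri: str) -> str:
    """Return the BitTorrent info hash found in the `xt` parameter of a magnet URI."""
    parsed = urlparse(magnet_uri)
    if parsed.scheme != "magnet":
        raise ValueError("not a magnet URI")
    # A magnet URI may carry several `xt` (exact topic) parameters.
    for topic in parse_qs(parsed.query).get("xt", []):
        if topic.startswith("urn:btih:"):
            return topic[len("urn:btih:"):]
    raise ValueError("magnet URI has no BitTorrent info hash")
```

A real client would hand the whole URI to the torrent library; the point here is only that the payload must contain a well-formed magnet URI for extraction to mean anything.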

2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
3. Member node requests message history (last 30 days) of community channels from store nodes
4. Member node receives magnet link message from store nodes
5. Member node extracts magnet link from message and passes it to torrent client

Suggested change
5. Member node extracts magnet link from message and passes it to torrent client
5. Member node extracts magnet link from the waku message and passes it to torrent client

1. User joins community and becomes community member
2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
3. Member node requests message history (last 30 days) of community channels from store nodes
4. Member node receives magnet link message from store nodes

Suggested change
4. Member node receives magnet link message from store nodes
4. Member node receives the waku message that contains the message archival index magnet link from the special hidden channel

It is just a suggestion, feel free to edit as you think fit better

5. Member node extracts magnet link from message and passes it to torrent client
6. Torrent client downloads latest message archive index via magnet link
7. Member node fetches missing archives via torrent
8. Member node unpacks and decompresses message archive data to then hydrate its local database

This should not happen considering that the network is reliable (and messages are received by all the live nodes), or otherwise we should update the assumptions set out at the beginning


Community owner nodes MUST store live messages as [14/WAKU2-MESSAGE](https://rfc.vac.dev/spec/14/). This is required to provide confidentiality, authenticity, and integrity of message data distributed via the BitTorrent layer, and later validated by Status nodes when they unpack message history archives.

Community owner nodes SHOULD remove those messages from their local databases after they have been turned into archives and distributed to the BitTorrent network.

I'd say no message should be deleted unless they have got older than 30 days. Also, the store protocol db should not be updated, we are just using it to get our input to the BitTorrent i.e. messages in the waku message format, we have not yet thought about how to update store protocol db.

@staheri14 staheri14 left a comment

I have reviewed the second half of the specs and left some comments @PascalPrecht.

1. The community owner node attempts to create an archive periodically for the past seven days (including the current day). In this case, the `timestamp` has to lie within the day the last time an archive was created and the current day.
2. The community owner node has been offline and attempts to create an archive for all the live messages it has missed since it went offline. In this case, the `timestamp` has to lie within the day the latest message was received and the current day.

Exported messages MUST be restored as [14/WAKU2-MESSAGE](https://rfc.vac.dev/spec/14/) for bundling. Waku messages that have been exported for bundling can now be removed from the community owner node's database (community owner nodes still maintain a database of application messages).

Re deleting messages, please see this previous comment of mine
https://github.com/status-im/specs/pull/162/files#r770793021

The range for the `timestamp` depends on the context in which the community owner node attempts to create a history archive. This can be one of the following:

1. The community owner node attempts to create an archive periodically for the past seven days (including the current day). In this case, the `timestamp` has to lie within the day the last time an archive was created and the current day.
2. The community owner node has been offline and attempts to create an archive for all the live messages it has missed since it went offline. In this case, the `timestamp` has to lie within the day the latest message was received and the current day.

Please see this comment about this https://github.com/status-im/specs/pull/162/files#r768666978
I also think bundling messages should be always based on 7 days interval (decoupled from nodes restart)

Member Author

> I also think bundling messages should be always based on 7 days interval (decoupled from nodes restart)

You mean that, when it missed 30 days of messages, it should still create 4 archives for that (4x 7 days), while the last 2 days of messages go into the next archive?

Makes sense!
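
A minimal sketch of that splitting, assuming timestamps are Unix epoch seconds and archive windows are half-open `[from, to)` ranges. The function name and the half-open convention are assumptions for illustration:

```python
SECONDS_PER_DAY = 86_400
ARCHIVE_DAYS = 7


def archive_windows(start, end, days=ARCHIVE_DAYS):
    """Split [start, end) into complete `days`-long windows.

    Returns the list of complete windows and the timestamp where the
    leftover (incomplete) period begins; that leftover waits for the
    next archive.
    """
    step = days * SECONDS_PER_DAY
    windows = []
    lower = start
    while lower + step <= end:
        windows.append((lower, lower + step))
        lower += step
    return windows, lower
```

With 30 missed days this yields four 7-day windows, and the remaining 2 days are left for the next archive.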


The `to` field SHOULD contain a timestamp of the time range's upper bound.

The `contentTopic` field MUST contain the same `contentTopic` that the archive's `messages` have.

I think we agreed that contentTopic is better to be repeated, to include all the possible content topics within the community

Member Author

Updated!


### WakuMessageHistoryArchive

The `from` field SHOULD contain a timestamp of the time range's lower bound.

We may want to be specific about the semantics of time here: in Waku v2, timestamps are doubles containing Unix epoch time in seconds https://rfc.vac.dev/spec/14/#wakumessage (maybe there's no need for these details in the current state of the specs, but once we decide on the implementation details we should update the specs)

message WakuMessageArchive {
uint64 from = 1
uint64 to = 2
string contentTopic = 3

Suggested change
string contentTopic = 3
repeated string contentTopic = 3


For every created `WakuMessageArchive`, there MUST be a `WakuMessageArchiveMetadata` entry in the index map.

The community owner node MUST derive a magnet link from the newly created `WakuMessageArchiveIndex` so it can be distributed to community member nodes.

Following our last convo, I think it would be good to persist the WakuMessageArchiveIndex in the long-term storage layer i.e., BitTorrent, otherwise, there is a possibility of losing the WakuMessageArchiveIndex if not properly persisted by Status nodes locally
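
To make the role of the index concrete, here is a minimal sketch of how a member node might use it to work out which archives it still needs, assuming the index is a map from archive hash to a metadata entry with `from` and `to` bounds. That shape is an assumption for illustration, not the spec's wire format:

```python
def missing_archives(index, downloaded_hashes):
    """Return hashes listed in the index that are not downloaded yet, oldest first.

    `index` maps an archive hash to a metadata dict with "from" and "to"
    timestamps; `downloaded_hashes` is the set of hashes already held locally.
    """
    missing = [h for h in index if h not in downloaded_hashes]
    return sorted(missing, key=lambda h: index[h]["from"])
```

If the index itself is lost, this lookup is impossible, which is why persisting it alongside the archives matters.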


```
{community_id}-archives
```

To be more specific, content Topics follow this format https://rfc.vac.dev/spec/23/#content-topics
/{application-name}/{version-of-the-application}/{content-topic-name}/{encoding}
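
A minimal sketch of building such a content topic for the archive channel. The application name `status`, version `1`, and encoding `proto` are placeholder assumptions, not values fixed by this spec:

```python
def archive_content_topic(community_id, app="status", version="1", encoding="proto"):
    """Build a content topic of the form /{app}/{version}/{name}/{encoding}."""
    return f"/{app}/{version}/{community_id}-archives/{encoding}"
```

For a community with the (made-up) id `0xabc` this gives `/status/1/0xabc-archives/proto`.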

2. The member node requests messages for a time range of up to 30 days from store nodes (this is the case when a new community member joins a community)

### Downloading message archive indices
When member nodes receive a message with a `CommunityMessageArchiveIndex` ([6/PAYLOADS](/specs/6-payloads)) from the aforementioned channel, they MUST extract the `magnet_uri` and pass it to their underlying BitTorrent client so they can fetch the latest message archive index.

I thought member nodes receive a waku message whose payload is a ApplicationMetadataMessage which embodies a CommunityMessageArchiveIndex as its payload.

Member Author

I've renamed this to CommunityMessageArchive.

But yes, that's the payload. And it has the magnet_uri that needs to be passed to the bittorrent client.


Due to the nature of distributed systems, there's no guarantee that a received message is the "last" message. This is especially true when member nodes request historical messages from store nodes.

Therefore, member nodes MUST wait for 20 seconds after receiving the last `CommunityMessageArchiveIndex` before they start extracting the magnet link to fetch the latest archive index.

One approach could be to add a sequence number to the CommunityMessageArchiveIndex, and member nodes can immediately decide if they should proceed with downloading or not.

Also, I am not sure why the 20-second waiting time is needed. Archives are published every 7 days, so why would two successive archives be sent within a 20-second interval?
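
A minimal sketch of the sequence-number idea, assuming a monotonically increasing integer is added to each index message. The field and the `ArchiveIndexTracker` class are hypothetical:

```python
class ArchiveIndexTracker:
    """Remember the highest sequence number seen and reject anything older."""

    def __init__(self):
        self.latest_seq = -1  # nothing seen yet

    def should_download(self, seq: int) -> bool:
        """Return True only for an index newer than every one seen so far."""
        if seq <= self.latest_seq:
            return False
        self.latest_seq = seq
        return True
```

This lets a node discard stale or repeated index messages immediately, though a node that happens to receive an older index first would still start on it before the newer one arrives.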

Member Author

So all of this is for the case that messages are requested and the node receives a magnet link message but doesn't actually know whether it's the latest one (this can be the case for a new member that doesn't have any history at all yet). Maybe there's another such message arriving in the near future. So I thought we need some threshold before we start processing the magnet link.

When message archives are fetched, community member nodes MUST unwrap the resulting `WakuMessage` instances into `ApplicationMetadataMessage` instances and store them in their local database.
Community member nodes SHOULD NOT store the wrapped `WakuMessage` messages.

Already stored messages with the same `id` or `clock` value MUST be replaced with messages extracted from archives, if both of these values are equal.

I also agree with the consistency, i.e., replacing everything from T1-T2. We can later design a synchronization protocol across store nodes to make sure they all have a consistent message history
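
A minimal sketch of "replace everything from T1-T2", assuming messages are dicts with `id` and `timestamp` keys and that the archive covers the half-open range `[T1, T2)`. Both the record shape and the bound convention are assumptions:

```python
def apply_archive(local, archive_from, archive_to, archived):
    """Drop local messages inside [archive_from, archive_to) and insert the archived ones.

    Messages outside the archive's time range are left untouched.
    """
    kept = [m for m in local if not (archive_from <= m["timestamp"] < archive_to)]
    return sorted(kept + list(archived), key=lambda m: m["timestamp"])
```

The archive wins wholesale inside its range, so a member ends up with exactly the owner's view for that period regardless of what it had received live.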

Contributor

@oskarth oskarth left a comment

Nice work!

I see there's a lot of discussion regarding archival index etc. I haven't had the bandwidth to dig into this in any detail, but seems like you all have thought about it from a bunch of different POVs and discussed about it in more detail so I assume we are on our way to a reasonable solution here :P

If it is still an open question by beginning of January, perhaps summarizing the current envisioned approaches and trade-offs would be useful?

EDIT: I see there's this https://hackmd.io/@YoQpkPmuRJ-48bA5PRoaRg/HyBDfl59Y which already does this, then there's a bunch of new comments in the chat. Um... this is too involved for me to personally get into the weeds of right now, can have a closer look beginning of next year.

Possibly naive question: is it possible to get best of both worlds? with one torrent and using message index etc. For example, a lot of torrents have multiple archives within them and a user can choose which ones they want to download, e.g. individual media files in some collection, separated by day (say).

- Community owner nodes provide archives with historical messages **at least** every 30 days
- Community owner nodes receive all community messages
- Community owner nodes are honest

Contributor

Might be worth adding a sentence or two on that some of the assumptions are less than ideal, and will be enhanced in future work (potentially linking to https://forum.vac.dev/t/status-communities-protocol-and-product-point-of-view/114/2 or some other GH issue, or leave links out if it feels more in line with general spec).

Member Author

Will do!

If the community owner node goes offline, it MUST go through the following process:

1. Community owner node restarts
2. Community owner node requests messages from store nodes for the missed time range

Contributor

Will a community owner always know the full list of channels in a community as soon as one is created?


Community member nodes go through the following (high level) process to fetch and restore community message histories:

1. User joins community and becomes community member

Contributor

Agree, the "community spec" is different from this spec, but it acts as a form of requirement. It'd be very useful to have a clear community spec here to refer to these things unambiguously (owner, channels, members, etc etc)

Community member nodes go through the following (high level) process to fetch and restore community message histories:

1. User joins community and becomes community member
2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community

Contributor

Agree.

In Waku v2 terms, it could be under a special content topic namespaced under a community (say), also indicating what data format is used (compressed magnet link or whatever), see https://rfc.vac.dev/spec/23/#content-topics

Since this spec is written to work for Waku v1, just any unique topic seems useful to start with, and this can be improved later on.

2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
3. Member node requests message history (last 30 days) of community channels from store nodes
4. Member node receives magnet link message from store nodes
5. Member node extracts magnet link from message and passes it to torrent client

Contributor

Do we have a protobuf or JSON or similar for how this data will be represented? Might've missed this in spec.

Glad to see this is WakuMessages, because it'll make future compatibility infinitely easier! Magnet link could be its own field as well. By keeping this an open kv map (protobuf or json), we can extend it as there's a need, e.g. with any type of compression/time period or whatever we may want to communicate.

What happens e.g. if someone else posts a magnet link to this channel and it doesn't belong to that community? Do people just start seeding random content then? Seems like an attack vector...

uint64 from = 1
uint64 to = 2
string contentTopic = 3
repeated WakuMessage messages = 4 // `WakuMessage` is provided by 14/WAKU2-MESSAGE

Contributor

It should be possible to do this with a simple function that just passes the payload and maybe maps content topic to content topic or so.


All messages sent with this topic MUST be instances of `ApplicationMetadataMessage` ([6/PAYLOADS](/specs/6-payloads)) with a `payload` of `CommunityMessageArchiveIndex`.

Only the community owner has permission to send messages with this topic.

Contributor

How is this enforced? It isn't at a transport level. So how are clients verifying this?

Perhaps this can be phrased as:

"Only the community owner MAY post to the hidden channel. Other messages on this specified channel MUST be ignored by clients."


## Canonical message histories

Only community owners are allowed to distribute messages with magnet links via the magnet link channel. Community members MUST NOT be allowed to distribute magnet links. Since the magnet links are created from the community owner node's database (and previously distributed archives), the message history provided by the community owner becomes the canonical message history and single source of truth for the community.

Contributor

I think this should be rephrased for clarity; it is important for a spec reader to understand that anyone CAN post to this topic. There's no protocol-level validation in terms of relaying messages or whatever.

The semantics we are pointing to here is that any messages from a bad source MUST NOT be accepted. This points to a validation process, that each client has to perform.

Contributor

Otherwise what could easily happen is that some implementation, say js-waku, just starts seeding a magnet link on the assumption that the channel is "safe", and this could be god knows what that some troll decided to upload.


### Multiple community owners

It is possible for community owners to export the private key of their owned community and pass it to other users so they become community owners as well. This means, it's possible for multiple owners to exist.

Contributor

While we can ignore it, it is still a possible attack vector that at minimum should be mentioned.

It isn't obvious to me that separating the two is going to always be straightforward. Example attack: community owner key gets compromised, and they start posting two different databases that are off-by-one that keeps churning local user db. Not a huge concern and perhaps not very likely, and if community is compromised a user can always leave etc etc.


Not only will multiple owners multiply the amount of archive index messages being distributed to the network, they might also contain different sets of magnet links and their corresponding hashes.

Even if just a single message is missing in one of the histories, the hashes presented in archive indices will look completely different, causing the community member node to download the corresponding archive (which might be identical to an archive that was already downloaded, except for that one message).

Contributor

> single message is missing in one of the histories, the hashes presented in archive indices

Agree, and if we can make design robust to this it'd be useful. I suppose this is related to the whole archival index discussion? (I haven't kept up here in detail, just noticed a lot of back and forth).
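
A minimal sketch of why one missing message changes the hash completely. SHA-256 over a length-prefixed concatenation is a stand-in chosen for illustration, not how the archives or BitTorrent pieces are actually hashed:

```python
import hashlib


def archive_hash(messages):
    """Hash a list of serialized messages (bytes) as one length-prefixed stream."""
    h = hashlib.sha256()
    for m in messages:
        h.update(len(m).to_bytes(4, "big"))  # length prefix keeps message boundaries unambiguous
        h.update(m)
    return h.hexdigest()
```

Two owners whose databases differ by a single message will therefore advertise unrelated hashes for what is almost the same archive, and member nodes cannot tell the two apart without downloading both.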

Member Author

@0x-r4bbit 0x-r4bbit left a comment

Hey everyone!

I've updated the draft. The changes are in a separate commit so it's a bit easier to review.


## Abstract

Messages are stored permanently by store nodes ([11/WAKU-MAILSERVER](/spec/11), or [13/WAKU2-STORE](https://rfc.vac.dev/spec/13/)) for up to 30 days. Messages older than that are no longer provided by store nodes, making it impossible for other nodes to request historical messages older than that. This is especially problematic in the case of Status communities, where recently joined members of a community aren't able to request complete message histories of the community channels.

Member Author

Correct, there's no limit but a max. number of days that messages are stored, which is configurable. It looked like currently 30 days is what's being used (and the default?), so I described it as such.

Will update this paragraph accordingly


| Name | References |
| -------------------- | --- |
| Waku node | An Ethereum node with Waku V1 enabled, or a [10/WAKU2](https://rfc.vac.dev/spec/10/) node that implements [11/WAKU2-RELAY](https://rfc.vac.dev/spec/11/)|

Member Author

This was honestly taken from 10/WAKU-USAGE. Probably out of date. Will change this.

| Waku node | An Ethereum node with Waku V1 enabled, or a [10/WAKU2](https://rfc.vac.dev/spec/10/) node that implements [11/WAKU2-RELAY](https://rfc.vac.dev/spec/11/)|
| Store node | A Waku node that implements [11/WAKU-MAILSERVER](/spec/11) or [13/WAKU2-STORE](https://rfc.vac.dev/spec/13/) respectively |
| Waku network | A group of Waku nodes connected through the internet connection and forming a graph |
| Community owner | A Status user that owns a Status community |

Member Author

Great points, will update.
It looks like linking to an existing concept isn't possible at the moment, because it turns out the original spec for communities has never landed: #151

| Waku network | A group of Waku nodes connected through the internet connection and forming a graph |
| Community owner | A Status user that owns a Status community |
| Community member | A Status user that is part of a Status community |
| Community owner node | A Status node with message archive capabilities enabled, run by a community owner |

Member Author

Hm.. wondering what you're getting at with this question.. A Status user is a Status account, a Status node is an application that runs a Status node (which a Status account can log into).

I'll add an entry for Status node as well.

- Community owner nodes provide archives with historical messages **at least** every 30 days
- Community owner nodes receive all community messages
- Community owner nodes are honest

Member Author

Will do!


All messages sent with this topic MUST be instances of `ApplicationMetadataMessage` ([6/PAYLOADS](/specs/6-payloads)) with a `payload` of `CommunityMessageArchiveIndex`.

Only the community owner has permission to send messages with this topic.

Member Author

Updated.

2. The member node requests messages for a time range of up to 30 days from store nodes (this is the case when a new community member joins a community)

### Downloading message archive indices
When member nodes receive a message with a `CommunityMessageArchiveIndex` ([6/PAYLOADS](/specs/6-payloads)) from the aforementioned channel, they MUST extract the `magnet_uri` and pass it to their underlying BitTorrent client so they can fetch the latest message archive index.

Member Author

I've renamed this to CommunityMessageArchive.

But yes, that's the payload. And it has the magnet_uri that needs to be passed to the bittorrent client.


Community member nodes will download the latest archive they've received from the archive index, which includes messages from the last seven days. Assuming that community members nodes were online for that time range, they have already downloaded that message data and will now download an archive that contains the same.

This means there's a possibility member nodes will download the same data at least twice.

Member Author

That's exactly what this consideration is pointing out.

We do have the index file now in the latest version of this spec which gives us metadata about available archives. One thing member nodes could do is check whether they have been online in the time range of the latest available archive and then decide to not download the data and just consider it "downloaded"

But that will conflict with the idea that the community owner node is the canonical history. If the member node has received more or different messages than the community owner in that time range, the histories won't be identical.

So I guess for now we just need to accept that there's a possibility that data is being downloaded twice (live messages + archive via torrent)


### Multiple community owners

It is possible for community owners to export the private key of their owned community and pass it to other users so they become community owners as well. This means, it's possible for multiple owners to exist.
Member Author

> Or if we detected two owner nodes, we could randomly assign only one of the nodes to be able to produce history torrents, and expose a setting for the community owner to select a specific node to hold this responsibility?

I believe there's no way for us to guarantee/enforce this. Theoretically, community owners could run a version of a node that bypasses all of that and still just publishes magnet links on the special channel.

So I guess the easiest thing we can do to account for that is to set and store a "main owner" that other nodes will then use to verify that the magnet link message was signed by that main owner.

All other magnet link messages will be ignored.
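
A rough sketch of that filtering rule, under stated assumptions: `SignedMagnetMessage` and `accept_from_main_owner` are hypothetical names, and `signer_public_key` is assumed to have been recovered from a signature that the messaging layer already verified. How the "main owner" value is stored and distributed is left open here, as it is in the discussion.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SignedMagnetMessage:
    # Assumed to come from an already verified signature (not shown here).
    signer_public_key: str
    magnet_uri: str


def accept_from_main_owner(
    messages: Iterable[SignedMagnetMessage], main_owner_key: str
) -> List[SignedMagnetMessage]:
    """Keep only magnet link messages signed by the stored main owner."""
    return [m for m in messages if m.signer_public_key == main_owner_key]
```

Messages from any other key holder are simply dropped, which matches the "ignored" behaviour described above but does nothing to stop those nodes from publishing.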


Not only will multiple owners multiply the number of archive index messages being distributed to the network, those messages might also contain different sets of magnet links and their corresponding hashes.

Even if just a single message is missing in one of the histories, the hashes presented in archive indices will look completely different, causing the community member node to download the corresponding archive (which might be identical to an archive that was already downloaded, except for that one message).
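
The effect can be illustrated with a small sketch. It assumes nothing about the actual archive format: messages are plain strings, the archive is serialized as JSON, and SHA-256 stands in for whatever hash the archive index ends up carrying. `archive_hash` is a hypothetical helper.

```python
import hashlib
import json
from typing import List


def archive_hash(messages: List[str]) -> str:
    """Hash a serialized archive; SHA-256 and JSON are stand-ins, not the spec."""
    encoded = json.dumps(messages, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


# Two owners archive the same week, but owner B never received "msg-2".
owner_a = ["msg-1", "msg-2", "msg-3"]
owner_b = ["msg-1", "msg-3"]

hash_a = archive_hash(owner_a)
hash_b = archive_hash(owner_b)

# The hashes share nothing recognizable even though most messages overlap.
shared_messages = set(owner_a) & set(owner_b)
```

Because a member node only sees the hashes, it cannot tell that `owner_b`'s archive is almost entirely redundant without downloading it.
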
Member Author

> We could also expose this setting to the community owner, to let the community owner select a different node to be the node that produces history perhaps?

Yes, I think what we can do is set a "main owner". So even if there are multiple people with private keys, only one main owner could be set. Obviously, with multiple owners having the private key and write privileges, each of them can change that value as they like.

This could still be problematic if they serve different archives. Then the question is: will member nodes simply ignore all the older archives in a given time range (because they might look completely different), or will they also download all of them and keep replacing what they have?

In other words: If member nodes detect that the history has changed, will they replace that entire history, or will they stick to only downloading the latest #n archives?
