Draft: Specification for the Community History Problem (MVP) #162
base: master
Conversation
@staheri14 @oskarth @iurimatias @John-44
Hey everyone,
here's a first draft of the specification for the community history problem.
Please review this and leave feedback on whether this goes in the correct direction.
I've also added some inline comments for clarification and questions.
3. A special type of channel for distributing magnet links ([Magnet URI scheme](https://en.wikipedia.org/wiki/Magnet_URI_scheme), [Extensions for Peers to Send Metadata Files](https://www.bittorrent.org/beps/bep_0009.html)) is created
4. Community owner invites members and creates additional channels
5. Community owner node receives messages and stores them into local database
6. After 7 days, the community owner node exports and compresses last 7 days worth of messages from database and creates a magnet link from that data via torrent client
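To make steps 5 and 6 concrete, here is a rough sketch (in Python, purely for illustration; the function names and the shape of the message store are hypothetical, not status-go's actual API) of selecting and compressing the last 7 days' worth of messages:

```python
import time
import zlib

SECONDS_PER_DAY = 24 * 60 * 60
ARCHIVE_INTERVAL_DAYS = 7

def export_archive(messages, now=None):
    """Select the last 7 days' worth of messages and compress them into a
    single blob that a torrent client could seed.

    `messages` is a list of (timestamp, payload_bytes) tuples standing in
    for rows in the owner node's local database."""
    now = now if now is not None else time.time()
    cutoff = now - ARCHIVE_INTERVAL_DAYS * SECONDS_PER_DAY
    recent = [payload for ts, payload in messages if ts >= cutoff]
    return zlib.compress(b"".join(recent))

# Example: one message outside the 7-day window, two inside it
now = 1_000_000_000
msgs = [
    (now - 10 * SECONDS_PER_DAY, b"old"),
    (now - 2 * SECONDS_PER_DAY, b"recent-1"),
    (now - 1 * SECONDS_PER_DAY, b"recent-2"),
]
archive = export_archive(msgs, now=now)
assert zlib.decompress(archive) == b"recent-1recent-2"
```

The compressed blob would then be handed to the torrent client to seed and to derive a magnet link from.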
@John-44 I think your initial proposal said the owner is going to do this after 14 days initially and then subsequently every 7 days. I wasn't sure why that was, so I went with 7 days right away. Happy to change this.
@PascalPrecht I meant 7 days initially, sorry for not being clear on this. So 7 days right away is correct :-)
@PascalPrecht isn't the magnet link created in step 8? If so, shouldn't this sentence be updated to say:
"6. After 7 days, the community owner node exports and compresses last 7 days worth of messages from database into a binary blob"?
Then in step 7 this binary blob is prepended to the previous binary blob (if any), and then the magnet link is created in step 8?
Generally I haven't explicitly said "blob" everywhere, because I think we can safely assume that computers send blobs at the end of the day (we also don't mention blobs in any other spec when talking about data).
isn't the magnet link created in step 8?
As explicitly stated later in the document, there are actually multiple magnet links being created:
- One for each archive (either every 7 days or for whatever time range messages were missed)
- One for every time a new message archive index is published (which includes the magnet link to the previously created archive, plus magnet links for all archives prior to that)
See the "Bundling history archives into archive indices" section for a more detailed explanation.
Also, happy to add visuals to make this more clear.
I'd introduce the BitTorrent client somewhere, maybe in the terminology part
## Storing live messages

Community owner nodes MUST store live messages as [14/WAKU2-MESSAGE](https://rfc.vac.dev/spec/14/). This is required to provide confidentiality, authenticity, and integrity of message data distributed via the BitTorrent layer, and later validated by Status nodes when they unpack message history archives.
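A minimal sketch of what this could look like (Python for illustration; the storage layout is an assumption, not status-go's actual schema). Storing the full serialized `WakuMessage`, keyed by its hash, is what lets member nodes later validate unpacked archives byte-for-byte:

```python
import hashlib

def store_live_message(db, envelope_bytes):
    """Keep the full serialized WakuMessage, keyed by its hash."""
    message_hash = hashlib.sha256(envelope_bytes).hexdigest()
    db[message_hash] = envelope_bytes
    return message_hash

def verify_archived_message(db, message_hash):
    """A member node unpacking an archive recomputes the hash and checks
    it against the advertised one, preserving integrity of archived data."""
    envelope = db.get(message_hash)
    return envelope is not None and \
        hashlib.sha256(envelope).hexdigest() == message_hash

db = {}
h = store_live_message(db, b"serialized-waku-message")
assert verify_archived_message(db, h)
```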
I've specified 14/WAKU2-MESSAGE here, but this is probably not true as Status currently uses Waku V1. However, it seems Status can talk to V2 store nodes, so it apparently also understands V2 messages.
Let me know if this needs to change.
@PascalPrecht is this how Status-Go stores messages at the moment? What metadata does status-go store for each message in its local database today?
Status-go only stores ApplicationMetadataMessage (as specified in 6/PAYLOAD), which is also one of the reasons why it's important for the community owner node to store the full WakuMessage in addition to that (because otherwise we lose message integrity).
More on that in the "Storing live messages" section.
Thanks for making this explicit. We should use 14/WAKU2-MESSAGE if possible for reasons mentioned.
Depending on how timestamp is set, it may also be possible to reconstruct from 7/WAKU-DATA https://rfc.vac.dev/spec/7/ but this can get iffy, also with hashes etc... I defer to @staheri14 on this
(edit I see this is elaborated on below)
1. The community owner node attempts to create an archive periodically for the past seven days (including the current day). In this case, the `timestamp` has to lie between the day the last archive was created and the current day.
2. The community owner node has been offline and attempts to create an archive for all the live messages it has missed since it went offline. In this case, the `timestamp` has to lie between the day the latest message was received and the current day.

Exported messages MUST be restored as [14/WAKU2-MESSAGE](https://rfc.vac.dev/spec/14/) for bundling. Waku messages that have been exported for bundling can now be removed from the community owner node's database (community owner nodes still maintain a database of application messages).
Same here.
ahh, from this comment, are you saying that today Status Desktop already stores all received messages twice: once in status-go's local message database, and also in a separate 'node' database?
We definitely do not want to delete messages from the owner node's status-go local message database
Please see this comment: https://github.com/status-im/specs/pull/162/files#r767998941
Re deleting messages, please see this previous comment of mine
https://github.com/status-im/specs/pull/162/files#r770793021
Added a comment that no WakuMessage younger than 30 days will be removed
For every created `WakuMessageArchive`, there MUST be a `WakuMessageArchiveMetadata` entry in the index map.

The community owner node MUST derive a magnet link from the newly created `WakuMessageArchiveIndex` so it can be distributed to community member nodes.
This magnet link is then later sent to the "special" channel so members can fetch the index and figure out which archives to download.
Theoretically, it's not necessarily required to distribute the index as a magnet link first. We might as well send the index as a message directly to the Status network.
This would save one roundtrip for member nodes to get the index from the torrent network, but would put a bit more bandwidth pressure on the status/waku network.
Thoughts welcome!
ahh, I think you've just seen the same thing I have - that we don't need to worry about a WakuMessageArchiveIndex, and just need to send the most recent magnet link. Unless there is something I'm missing of course! ;-)
Following our last convo, I think it would be good to persist the WakuMessageArchiveIndex in the long-term storage layer, i.e., BitTorrent; otherwise, there is a possibility of losing the WakuMessageArchiveIndex if it's not properly persisted by Status nodes locally.
```
{community_id}-archives
```
I don't know yet what exactly topics look like, so this might not make a lot of sense.
Feel free to add suggestions on what those topics should look like!
To be more specific, content Topics follow this format https://rfc.vac.dev/spec/23/#content-topics
/{application-name}/{version-of-the-application}/{content-topic-name}/{encoding}
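Assuming the archive channel follows that format, a topic builder could look like this (Python sketch; the application name, version, and encoding defaults below are placeholders, not settled values):

```python
def community_archive_topic(community_id,
                            application_name="status-communities",
                            version=1,
                            encoding="proto"):
    """Build a 23/WAKU2-TOPICS style content topic for the archive channel.
    All parameters other than community_id are hypothetical defaults."""
    return f"/{application_name}/{version}/{community_id}-archives/{encoding}"

assert community_archive_topic("0xabc") == "/status-communities/1/0xabc-archives/proto"
```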
Generally, fetching message archives is a three step process:

1. Receive message archive index signal, download index, then determine which message archives to download
3. Download individual archives
That's 2 steps. Needs update.
I don't think we want individual archives. I think we want all of a community's history to be in a single binary blob, with a separator string inserted into the blob at the point where each new blob is prepended to the pre-existing blob. That enables easy splitting of the blob, for the purposes of partial downloads of a community's history and/or keeping the storage used by a community within a set bound. (Enabling partial downloads and fixed storage use per community is phase 2 of this project and isn't needed for the MVP.)
How do we prevent nodes from redownloading all the data they have downloaded already, if they don't know at this point what that blob looks like (they need to download it first before they can perform that splitting)?
Using the archive index, I've tried to come up with a solution to account for that. It also provides flexibility for nodes to selectively decide for what date range they want to fetch archives (something the blob-only solution won't do, unless I'm missing something here).
When message archives are fetched, community member nodes MUST unwrap the resulting `WakuMessage` instances into `ApplicationMetadataMessage` instances and store them in their local database.
Community member nodes SHOULD NOT store the wrapped `WakuMessage` messages.

Already stored messages with the same `id` or `clock` value MUST be replaced with messages extracted from archives, if both of these values are equal.
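The replacement rule could be sketched like this (Python for illustration; the in-memory layout is hypothetical): a stored message is replaced by the archived copy only when both its `id` and its `clock` match.

```python
def merge_archived_message(local, archived):
    """Apply the replacement rule: replace an already-stored message with
    the archived copy only when both `id` and `clock` are equal.

    `local` maps message id -> {"id": ..., "clock": ..., "payload": ...}."""
    existing = local.get(archived["id"])
    if existing is None:
        local[archived["id"]] = archived      # unseen message: insert
    elif existing["clock"] == archived["clock"]:
        local[archived["id"]] = archived      # same id and clock: replace
    # same id but different clock: keep the local copy untouched
    return local

local = {"m1": {"id": "m1", "clock": 5, "payload": "local"}}
merge_archived_message(local, {"id": "m1", "clock": 5, "payload": "archived"})
assert local["m1"]["payload"] == "archived"
merge_archived_message(local, {"id": "m1", "clock": 6, "payload": "newer"})
assert local["m1"]["payload"] == "archived"  # clocks differ: not replaced
```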
@staheri14 I know we talked about this but I still went for "only replace what needs replacement" for now. This can cause inconsistency compared to the canonical history though, so if we really want to replace everything from T1 - T2, no matter what, let me know and I'll update this.
/cc @John-44
@PascalPrecht my vote is that we prioritize consistency of the community's message history over the risk that the owner node might not download a particular message. I think it's better for everybody in the community to have the exact same message history, even if that history is missing a message because the owner node for some reason didn't download it prior to creating the message archive for that 7-ish day period.
I also agree with prioritizing consistency, i.e., replacing everything from T1-T2. We can later design a synchronization protocol across store nodes to make sure they all have a consistent message history.
Not only will multiple owners multiply the amount of archive index messages being distributed to the network, they might also contain different sets of magnet links and their corresponding hashes.

Even if just a single message is missing in one of the histories, the hashes presented in archive indices will look completely different, causing the community member node to download the corresponding archive (which might be identical to an archive that was already downloaded, except for that one message).
@John-44 very important consideration that I think we haven't talked about yet.
@PascalPrecht yes it is, well spotted!
As soon as we have tokenised community ownership it won't be possible for two people (aka Ethereum accounts) to own a single community, but there is nothing to stop a single person signing into two Status Desktop instances with the same profile, so the same account (that owns a community) would be running on two nodes.
Could we detect if this occurs, and automatically select only one node, out of however many nodes the owner has spun up, to be the node that produces history? We could also expose this setting to the community owner, to let them select a different node to be the node that produces history, perhaps?
What do you think?
single message is missing in one of the histories, the hashes presented in archive indices
Agree, and if we can make the design robust to this it'd be useful. I suppose this is related to the whole archival index discussion? (I haven't kept up here in detail, just noticed a lot of back and forth.)
We could also expose this setting to the community owner, to let the community owner select a different node to be the node that produces history perhaps?
Yes, I think what we can do is set a "main owner". So even if there are multiple people with private keys, only one main owner could be set. Obviously, with multiple owners having the private key and write privileges, each of them can change that value as they like.
This could still be problematic if they serve different archives. Then the question is: will member nodes simply ignore all the older archives in a given time range (because they might look completely different), or will they also download all of it and keep replacing all of it?
In other words: if member nodes detect that the history has changed, will they replace that entire history, or will they stick to only downloading the latest #n archives?
    message CommunityMessageArchiveIndex {
      uint64 clock = 1;
      string magnet_uri = 2;
    }
If we decide to not distribute the index via magnet link, but as-is, then this payload needs to change.
I'm gonna fix the typos once the review process/iteration is done
Community owner nodes go through the following (high level) process to provide community members with message histories (assumes community owner node is available 24/7):

1. Community owner creates a Status community
2. Community owner enables community history archive support
I think community history archive support should be on by default when a community owner creates a new community
We will allow community members to turn this feature off, so I assume there's gonna be some UI switch that they can use when creating/editing communities.
I don't think we want to force community owners to have this enabled. Which means, owner could switch it off during creation. Because I assume that, I made it explicit that it has been enabled.
Yes, Community Owners should definitely be able to switch the community history archive service off, we will have a toggle in the community admin settings to let a community owner do this. I was proposing that this service should be switched on by default when creating a new community.
Community owner enables community history archive support
What does this mean in terms of specifications? Is it like a flag that should be set when running a Status node? If there is a specification, please link it here.
Good point, there's no such specification yet. I assume this is something that should be stored in the logged-in user's Settings. @richard-ramos what do you think?
### Serving community history archives

Community owner nodes go through the following (high level) process to provide community members with message histories (assumes community owner node is available 24/7):
Ideally a community owner node is on 24/7, but we also support a community owner node coming online for, say, an hour only once every three days at a minimum. This minimum community owner node liveness assumption shouldn't break anything in this proposal; I had it in mind when I wrote the rough sketch of how this could work.
Yes, this part of the overview covers the (ideal) scenario that the community owner node is online 24/7 and therefore generates archives every 7 days.
In reality, it might go offline, which is covered in the "Serving archives for missed messages" part.
Can make this more explicit if this isn't clear.
don't worry, I think this is clear, I just started commenting from the top as I read through the doc, once I reached that section it was clear
4. Community owner invites members and creates additional channels
5. Community owner node receives messages and stores them into local database
6. After 7 days, the community owner node exports and compresses last 7 days worth of messages from database and creates a magnet link from that data via torrent client
7. Community owner node creates message archive index and bundles the previously generated magnet link into it
I think this should be "Community owner node prepends the newly created binary blob to the previously generated binary blob (if any - the first time this operation is performed there will be no previous binary blob to prepend the new binary blob to)"
Yes, as mentioned here https://github.com/status-im/specs/pull/162/files#r767952892, this is technically not the case.
We're not actually bundling the raw data with the new raw data. The reason for that is that we can't guarantee the size in which pieces will be created by torrent, so we can't guarantee that the hashes for each piece will be the same every time (different length == different hash).
So instead, we're distributing an index that looks like:

    index: {
      "0x123": {
        from: ...
        to: ...
        magnet_uri: magnet_link_for_archive_1
      },
      "0x456": {
        from: ...
        to: ...
        magnet_uri: magnet_link_to_archive_2
      }
      ...
    }

Then, based on the hashes (which are derived from the magnet_uri, which is unique), member nodes can decide which ones they need (all, some, none).
See the "Fetching message archives" section for a detailed explanation.
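The selection step described here could be sketched as follows (Python for illustration; the index is modeled as a plain dict mirroring the example above):

```python
def archives_to_download(index, downloaded_hashes):
    """Given an archive index (hash -> metadata including a magnet URI)
    and the set of archive hashes already fetched, return the magnet URIs
    that still need to be downloaded."""
    return [meta["magnet_uri"]
            for archive_hash, meta in index.items()
            if archive_hash not in downloaded_hashes]

index = {
    "0x123": {"from": 1, "to": 7, "magnet_uri": "magnet:?xt=archive-1"},
    "0x456": {"from": 8, "to": 14, "magnet_uri": "magnet:?xt=archive-2"},
}
# A node that already holds archive 0x123 only fetches the new one:
assert archives_to_download(index, {"0x123"}) == ["magnet:?xt=archive-2"]
```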
The reason for that is, we can't guarantee the size in which pieces will be created by torrent, so we can't guarantee that the hashes for each piece will be the same every time
Quick update here, this should say "I assume we can't guarantee [...]".
Torrent clients seem to offer UIs to let users configure the size and/or number of pieces, so it might be possible to go a different route that doesn't use indices.
Will discuss with @staheri14 and update spec accordingly!
@PascalPrecht thanks for the call the other evening on this topic. Yes, we definitely should be able to set the torrent piece size to a fixed size, irrespective of the size of the torrent. Probably better to go for a larger piece size, as this will work better when a torrent grows larger. 4MB seems to be a common torrent piece size.
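The effect of a fixed piece size can be illustrated with a small sketch (Python; SHA-1 per piece, as BitTorrent uses): as long as data is added in whole pieces, the hashes of the existing pieces are preserved, whether the new blob is appended or prepended.

```python
import hashlib

PIECE_SIZE = 4 * 1024 * 1024  # 4 MB, the common piece size mentioned above

def piece_hashes(blob, piece_size=PIECE_SIZE):
    """Split a blob into fixed-size pieces and SHA-1 hash each piece,
    the way BitTorrent computes its per-piece hashes."""
    return [hashlib.sha1(blob[i:i + piece_size]).hexdigest()
            for i in range(0, len(blob), piece_size)]

week_1 = b"a" * PIECE_SIZE  # exactly one piece
week_2 = b"b" * PIECE_SIZE

# Appending keeps old piece hashes at the same indices...
assert piece_hashes(week_1 + week_2)[0] == piece_hashes(week_1)[0]
# ...while prepending shifts them to later indices but leaves them intact.
assert piece_hashes(week_2 + week_1)[1] == piece_hashes(week_1)[0]
```

Note this only holds when the existing blob is a whole number of pieces; unaligned growth would still change every subsequent piece hash.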
Community owner node creates message archive index and bundles the previously generated magnet link into it
I'd add a link to the section where "message archive index" is explained
Suggested change:
- 7. Community owner node creates message archive index and bundles the previously generated magnet link into it
+ 7. Community owner node creates message archive index and bundles the magnet link generated in step 6 into it
I'd add a link to the section where "message archive index" is explained
Will do!
5. Community owner node receives messages and stores them into local database
6. After 7 days, the community owner node exports and compresses last 7 days worth of messages from database and creates a magnet link from that data via torrent client
7. Community owner node creates message archive index and bundles the previously generated magnet link into it
8. Community owner node creates magnet link from index and distributes it to community members via special channel created in step 2) through the Waku network
How about replacing "from index" with "from binary blob" to make it clearer what we are talking about?
This should say "from archive index". Sorry there's a missing word...
Apart from that, as mentioned here https://github.com/status-im/specs/pull/162/files#r767957001, this would be incorrect.
We are creating a magnet link from the archive index.
Although as mentioned here https://github.com/status-im/specs/pull/162/files#r767717175, creating a magnet link for the index may not be necessary, as we can distribute it via waku messages as is.
The relation between the special channel and the waku network needs to be clarified
Suggested change:
- 8. Community owner node creates magnet link from index and distributes it to community members via special channel created in step 2) through the Waku network
+ 8. Community owner node creates magnet link from index and distributes it to community members via special channel created in step 3) through the Waku network
The relation between the special channel and the waku network needs to be clarified
Can you elaborate what you mean by that? From Waku's perspective, it's just another channel.
I don't go into the specifics of what that channel looks like and how other nodes can recognize it, because this is still just the high-level overview part.
If the community owner node goes offline, it MUST go through the following process:

1. Community owner node restarts
2. Community owner node requests messages from store nodes for the missed time range
to be more explicit, perhaps add a mention of 'all channels in their community' to the end of the sentence?
"2. Community owner node requests messages from store nodes for the missed time range for all channels in their community"
Sure can add that!
Is there any specification on how to measure the missed time range in Waku v1? In Waku v2 FT-Store, we measure it based on the Waku message timestamp of the last message stored in the db, up to the time the node goes back online.
Will a community owner always know the full list of channels in a community as soon as one is created?
Is there any specification on how to measure the missed time range in waku v1?
@staheri14 I don't know if there is even such a mechanism in Waku V1. @richard-ramos I believe there's no FT-Store equivalent in Wakuv1, is that correct?
I thought, given that the owner node is essentially a Status app, we'd implement it similar to (or even exactly) how it's done in Waku2 Ft-Store
Will a community owner always know the full list of channels in a community as soon as one is created?
I assume only community owner/admins are able to create channels in communities, so they should always know the full list of channels in a community.
1. Community owner node restarts
2. Community owner node requests messages from store nodes for the missed time range
3. Missed messages are stored into local database
perhaps change:
- Missed messages are stored into local database
to
- All missed messages are added into the community owner node's local message database
I'm suggesting adding "All" at the beginning of the sentence to make it explicit that the next step shouldn't happen until fetching all of the missing messages is complete.
Happy to update this to be more explicit.
Just pointing out though that this:
to make it explicit that the next step shouldn't happen until fetching all of the missing messages is complete.
we can't actually guarantee.
I've added a comment about that in the "Downloading message archive indices" section.
1. Community owner node restarts
2. Community owner node requests messages from store nodes for the missed time range
3. Missed messages are stored into local database
4. Community owner node creates message archive and magnet link for missed messages
Shouldn't this be:
- If 7 or more days have elapsed since the last message history torrent was created, then the community owner node exports and compresses the last 7 days' worth of messages from the database into a binary blob
- Community owner node prepends the newly created binary blob to the previously generated binary blob
- Community owner node creates magnet link from index and distributes it to community members via special channel
(note: I'm also suggesting updating steps 5 and 6 directly below this comment)
> If 7 or more days have elapsed since the last message history torrent was created, then the community owner node exports and compresses the last 7 days worth of messages from the database into a binary blob

Does this still apply in this case here? Because this section discusses the "community owner node was offline for #n days" scenario and simply wants to restore the message gap.
Once that is done it'll go back to the "every 7 days" routine.
@PascalPrecht say an owner node goes offline for 3 days immediately after creating the last torrent. In this scenario, when the owner node comes back online, only 3 days have elapsed since the last torrent was created, so a new torrent doesn't need to be created for another 4 days. If we don't do this, and an Owner Node goes online and offline frequently, that community will end up with torrents being created far more frequently than needed.
Aah, got it now. Yes, absolutely correct. Will update.
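The check agreed on above could be sketched like this (hypothetical Python; the function name and timestamp bookkeeping are illustrative):

```python
# The spec's archive cadence is 7 days.
ARCHIVE_INTERVAL_SECONDS = 7 * 24 * 3600

def should_create_archive(last_archive_created_at, now):
    """True only when at least 7 days have elapsed since the last
    archive/torrent was created, regardless of restarts in between."""
    return now - last_archive_created_at >= ARCHIVE_INTERVAL_SECONDS
```

So an owner node that was offline for 3 days right after creating a torrent would not create a new one until 4 more days have passed.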
1. User joins community and becomes community member
2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
3. Member node requests message history (last 30 days) of community channels from store nodes
If the Member node can fetch a magnet link to a binary blob that contains the community's previous history, and the history included in this binary blob starts (for example) at 7 days in the past and then goes backwards, then the member node only needs to fetch message history up until the point when the community history archive service takes over, and this might often be less than 30 days.
The 30 days here are the up to 30 days that store nodes save. When a community member joins a community, that member has 0 days of message history. So what happens is that it will probably fetch last 30 days of message history from store nodes (this is how this works today).
Then, now that there's a notion of message archive index signals, member nodes can recognise those and start requesting archives older than 30 days.
@PascalPrecht I think what currently happens is that when a member joins a community, only the messages in the 'message auto-fetch window' are automatically downloaded. I can't remember what this currently defaults to on Desktop but I think it might be 7 days? If a user wants to fetch messages beyond the 'message auto-fetch window', the user needs to go into the channel they want to fetch messages for and manually trigger a message fetch. I would love it if we could extend this 'message auto-fetch window' to fetch the full 30 days of message history stored on mailservers, however I remember we ran into a bunch of issues when we tried to do this earlier in the year and that's why we landed on a shorter default 'message auto-fetch window'.
Now for this community history archive service, it's very important that for the Owner Node this 'message auto-fetch window' is lengthened to always try to fetch all messages from a community that have been sent since the Owner Node was last online, even if the Owner Node was (for example) last online 28 days ago. This should probably be written down in the spec somewhere.
For community members, I would love it if the 'message auto-fetch window' default setting could be lengthened to 30 days, however if the issues that prevented us from doing this earlier in the year are still present, we might need to settle on a shorter default message auto-fetch window, say 8 days??
Ah, that's interesting. I wasn't aware of that. Should've verified. My understanding was that all the available live message history (30 days) would be loaded.
I don't think it makes such a big difference whether we fetch the first 7 or 8 days, but what's important is that, when a member joins, that member needs to fetch enough messages to receive the last published magnet link. If that one wasn't published within that "message auto-fetch window", we'll have to keep fetching older messages periodically until we receive a magnet link message either way.
@richard-ramos do you know if it's possible to extend the 'message auto-fetch' window for individual channels? So that community members will fetch the last 30 days (not 7) for that special magnet link channel?
1. User joins community and becomes community member
2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
3. Member node requests message history (last 30 days) of community channels from store nodes
4. Member node receives magnet link message from store nodes
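Steps 3-5 of this flow could be sketched like this (hypothetical Python; the message shape and `payload_type` tag are illustrative assumptions, not the actual wire format):

```python
# Hypothetical payload-type tag; the real spec uses an
# ApplicationMetadataMessage payload type for this.
MAGNET_PAYLOAD_TYPE = "CommunityMessageArchiveIndex"

def extract_magnet_links(history_messages):
    """From a batch of historical messages (step 3), pick out the special
    magnet-link messages (step 4) and return their magnet URIs in order."""
    return [
        m["magnet_uri"]
        for m in history_messages
        if m.get("payload_type") == MAGNET_PAYLOAD_TYPE
    ]
```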
I think:
"4. Member node receives magnet link message from store nodes"
should actually be
"4. Member node receives magnet link message from the special hidden channel"
Okay can change that!
Suggested change:
- 4. Member node receives magnet link message from store nodes
+ 4. Member node receives the waku message that contains the message archival index magnet link from the special hidden channel

It is just a suggestion, feel free to edit as you think fits better.
2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
3. Member node requests message history (last 30 days) of community channels from store nodes
4. Member node receives magnet link message from store nodes
5. Member node extracts magnet link from message and passes it to torrent client
The message is just a magnet link and nothing else, so it doesn't really need to be extracted in any way other than just the whole message being passed to the torrent client
The "extracting" part here refers to the fact that messages are distributed as `WakuMessage`s and need to be unwrapped to actually see what's inside.
If this seems confusing I can probably remove that detail here.
I think this is fine, up to you :-)
I think if we can explain what is inside the waku message distributed in the hidden channel, then it would be clear for the reader what does it mean to extract the message
Suggested change:
- 5. Member node extracts magnet link from message and passes it to torrent client
+ 5. Member node extracts magnet link from the waku message and passes it to torrent client
Do we have a protobuf or JSON or similar for how this data will be represented? Might've missed this in spec.
Glad to see this is WakuMessages, because it'll make future compatibility infinitely easier! Magnet link could be its own field as well. By keeping this an open kv map (protobuf or json), we can extend it as there's a need, e.g. with any type of compression/time period or whatever we may want to communicate.
What happens, e.g., if someone else posts a magnet link to this channel and it doesn't belong to that community? Do people just start seeding random content then? Seems like an attack vector...
> Do we have a protobuf or JSON or similar for how this data will be represented? Might've missed this in spec.

@oskarth this is covered here: https://github.com/status-im/specs/pull/162/files#diff-1d7b2a048dd5e11a0620aaa98e258f12170764802e703696ee623392214cbd95R233
Essentially we expect messages in the special channel to be `ApplicationMetadataMessage`s (just like most other messages sent by Status) and then introduce a new payload type by which we know that the message is not a normal chat message, but indeed a special message that contains a magnet link.
> What happens, e.g., if someone else posts a magnet link to this channel and it doesn't belong to that community? Do people just start seeding random content then? Seems like an attack vector...

Very good point. Status nodes need to verify that the magnet link message that came in through the special channel is signed by the community owner; only then is it assumed to be trusted.
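That owner-signature check could look roughly like this (hypothetical Python; `verify_sig` and the message shape are placeholders, not the real Status APIs):

```python
def accept_magnet_link_message(msg, owner_pubkey, verify_sig):
    """Return the magnet URI only if the message is signed by the community
    owner; otherwise ignore it. `verify_sig(payload, signature, pubkey)` is
    a stand-in for the real signature-verification primitive."""
    if not verify_sig(msg["payload"], msg["signature"], owner_pubkey):
        return None  # not signed by the owner: ignore (possible attack)
    return msg["payload"]["magnet_uri"]
```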
3. Member node requests message history (last 30 days) of community channels from store nodes
4. Member node receives magnet link message from store nodes
5. Member node extracts magnet link from message and passes it to torrent client
6. Torrent client downloads latest message archive index via magnet link
- Torrent client downloads latest message archive index via magnet link

perhaps change to

- Torrent client downloads latest message archive binary blob via magnet link

(I think "binary blob" is more descriptive of what we are talking about here than "index")
> (I think "binary blob" is more descriptive of what we are talking about here than "index")

As mentioned in a bunch of other comments: everything is a blob at the end of the day. What matters is how it's encoded. In this case it happens to be a `WakuMessageArchiveIndex`. That information is more important for this spec to work than the fact that everything is a blob.
ahh, just remembered, we need to add something here about the torrent client performing a 'force recheck' function using the new magnet link on top of the previously downloaded binary. This is very important to stop clients needing to download the same data multiple times
4. Member node receives magnet link message from store nodes
5. Member node extracts magnet link from message and passes it to torrent client
6. Torrent client downloads latest message archive index via magnet link
7. Member node fetches missing archives via torrent
Perhaps change from
"7. Member node fetches missing archives via torrent"
to
"7. Member node fetches the binary blob that contains the message history archive for the community via torrent"
5. Member node extracts magnet link from message and passes it to torrent client
6. Torrent client downloads latest message archive index via magnet link
7. Member node fetches missing archives via torrent
8. Member node unpacks and decompresses message archive data to then hydrate its local database
Perhaps add ", deleting any messages for that community that the database previously stored in the same timerange as covered by the message history archive binary blob"
e.g.
"8. Member node unpacks and decompresses message archive data to then hydrate its local database, deleting any messages for that community that the database previously stored in the same timerange as covered by the message history archive binary blob"
Okay cool, will change this
This should not happen considering that the network is reliable (and messages are received by all the live nodes), or otherwise we should update the assumptions set out at the beginning
I'm not sure I follow this conversation.
- The assumption currently is definitely that the network is reliable
- In a distributed system the network isn't actually reliable, but we are assuming it is to simplify things for now (e.g. with cluster operation etc)

How does this connect with replacing (or not) the local db? Maybe more generally: what are the scenarios where the two file systems might be out of sync, and are we 100% confident that our reconciliation algorithm is correct here? From a development POV, at a minimum this should be printed out to debug and perhaps some form of backup should be used. Some scenarios I can imagine:
- Client didn't request all store messages so a lot of new data comes in from archive, all good (easy to sync)
- Client actually has more data than archive, which means there's an inconsistent view. My understanding is that from a product POV we prefer to then force users to have the same view based on the community owner's POV. Outside of network/logical issues (who gets what information when). This also means community owners have some control to censor if they so wish (they can in other ways too so eh).
- ...probably a few more

I think whatever we think makes sense from a product POV that simplifies things is fine here, as long as we are explicit about what assumptions around reconciliation we are making, as well as what type of attacks that makes us vulnerable to.
Sorry, maybe I'm misunderstanding here.
What @oskarth has described here is correct. We're forcing the community owner's history onto all members, even when they have received more messages than the owner (thereby causing the member to delete messages that exist locally but are missing from the owner's state).

> This should not happen considering that the network is reliable (and messages are received by all the live nodes), or otherwise we should update the assumptions set out at the beginning

@staheri14 can you maybe elaborate how this is related to updating the member's database?
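The reconciliation being discussed (the owner's history wins within the archive's time range) could be sketched as follows (hypothetical Python; the message shape is illustrative):

```python
def reconcile(local_messages, archive_messages, archive_from, archive_to):
    """Make the owner's archive canonical for its time range: drop local
    messages inside [archive_from, archive_to] and take the archive's
    messages instead."""
    kept = [
        m for m in local_messages
        if not (archive_from <= m["timestamp"] <= archive_to)
    ]
    return sorted(kept + list(archive_messages), key=lambda m: m["timestamp"])
```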
Community owner nodes MUST store live messages as [14/WAKU2-MESSAGE](https://rfc.vac.dev/spec/14/). This is required to provide confidentiality, authenticity, and integrity of message data distributed via the BitTorrent layer, and later validated by Status nodes when they unpack message history archives.

Community owner nodes SHOULD remove those messages from their local databases after they have been turned into archives and distributed to the BitTorrent network.
I think this is incorrect. Community owner nodes SHOULD NOT remove those messages from their local databases after they have been turned into archives and distributed to the BitTorrent network.
Because if a community owner node did this, then anybody directly using the owner node to browse the community wouldn't be able to search their own community, which would be weird.
Probably not as clear in this part (I'll make this more specific), but as mentioned here: https://github.com/status-im/specs/pull/162/files#r767997713
This is referring to the `WakuMessage`s, which are no longer needed after they have been migrated to long-term storage. `ApplicationMetadataMessage`s will stay around. Those are the ones used for rendering/searching etc.
thanks for the explanation, this makes sense to me now
I'd say no message should be deleted unless it has become older than 30 days. Also, the store protocol db should not be updated; we are just using it to get our input to BitTorrent, i.e. messages in the waku message format. We have not yet thought about how to update the store protocol db.
Also, the store protocol db should not be updated, we are just using it to get our input to the BitTorrent i.e. messages in the waku message format, we have not yet thought about how to update store protocol db.
Not sure I follow this part. Which protocol are you referring to? And why should it be updated?
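The cleanup step being discussed (drop the archived `WakuMessage`s, keep the `ApplicationMetadataMessage`s used for rendering/search) could be sketched as follows (hypothetical Python; the dict-based `db` is a stand-in for the real database):

```python
def prune_after_archiving(db, archive_from, archive_to):
    """Drop the raw WakuMessage wrappers that were just exported into an
    archive, but keep the ApplicationMetadataMessage rows used for
    rendering and search."""
    db["waku_messages"] = [
        m for m in db["waku_messages"]
        if not (archive_from <= m["timestamp"] <= archive_to)
    ]
    # application_metadata_messages are intentionally left untouched
    return db
```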
The `dn` parameter ("display name") in the resulting magnet link MAY be optional.

The resulting magnet link MUST be bundled into a `WakuMessageArchiveIndex`, which is then later distributed to other Status nodes.
@PascalPrecht I don't understand this step, why must the magnet link be bundled into WakuMessageArchiveIndex? I would have thought it just needs to be sent to all other community members via the hidden channel, but perhaps I'm missing something?
I've tried to explain this here: https://github.com/status-im/specs/pull/162/files#r767957001
The "bundling" in this spec merely refers to: "Let's create an index of magnet links including all magnet links that have been created in the past + the new one that was just created".
That thing (also known as `WakuMessageArchiveIndex`) is then sent to community members via the special channel.
We can decide if we want to send this as-is, or if we should create a magnet link for that index as well. So, sending the index directly vs sending the index as a magnet link.
Mentioned it here: #162 (comment)
## Bundling history archives into archive indices

Community owner nodes MUST provide message archives for the entire community history. However, each individual archive only contains a subset of the complete history, that is, either data for a time range of seven days, or a time range in which the node was offline. Therefore, message history archives need to be bundled into a `WakuMessageArchiveIndex`, which is later distributed via the Waku network and allows receiving nodes to fetch archives for individual time ranges.
The entire history of each community (including all channels in that community) should be contained in a single binary blob. Every time a new chunk of history is generated, it should be prepended to the preexisting binary blob, and a new magnet link created from this newly enlarged binary blob. As such, members will only ever need to fetch the most recent magnet link from the hidden channel to access the entire history of a community.
This is the tricky part. If we prepend this, community members have to redownload the entire torrent every time, as there's no easy way to recognise that some data of that torrent has already been downloaded (assuming that we can't control the size of `pieces`, which affects this).
Tried to touch on this here: #162 (comment)
@PascalPrecht given that we can control the size of torrent pieces, as per our hangout discussion the other evening, this approach should be doable
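Under the assumption that new archive data is appended (rather than prepended) and that the piece size is controllable, aligning each archive to a piece boundary keeps previously downloaded pieces valid after a 'force recheck'. A minimal sketch (Python; `PIECE_LENGTH` and the zero-padding are illustrative choices, not part of the spec):

```python
# Illustrative piece size; the actual value would be chosen when
# creating the torrent.
PIECE_LENGTH = 256 * 1024

def pad_to_piece_boundary(archive_bytes):
    """Pad an archive blob so it ends exactly on a piece boundary. If every
    appended archive is piece-aligned, previously downloaded pieces keep
    their hashes, so a 'force recheck' only fetches the new pieces."""
    remainder = len(archive_bytes) % PIECE_LENGTH
    if remainder:
        archive_bytes += b"\x00" * (PIECE_LENGTH - remainder)
    return archive_bytes
```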
The community owner node MUST create a `WakuMessageArchiveIndex` every time it creates a new `WakuMessageArchive`.

For every created `WakuMessageArchive`, there MUST be a `WakuMessageArchiveMetadata` entry in the index map.
I don't understand why all of this is needed, and why we don't just send the latest magnet link over the hidden channel whenever it's created. The latest magnet link will always be the single thing the user needs to download the entire history of the community.
## Message archive distribution

Message archives are available via the BitTorrent network as soon as magnet links for them have been created.
Other community member nodes will download the message archives from the BitTorrent network once they receive a magnet link that contains a message archive index.
I don't think we need a 'message archive index'.
All messages sent with this topic MUST be instances of `ApplicationMetadataMessage` ([6/PAYLOADS](/specs/6-payloads)) with a `payload` of `CommunityMessageArchiveIndex`.

Only the community owner has permission to send messages with this topic.
To say the same thing using the Status Communities terminology:
Only the community owner has permission to post to the hidden channel.
How is this enforced? It isn't at a transport level. So how are clients verifying this?
Perhaps this can be phrased as:
"Only the community owner MAY post to the hidden channel. Other messages on this specified channel MUST be ignored by clients."
Updated.
All messages sent with this topic MUST be instances of `ApplicationMetadataMessage` ([6/PAYLOADS](/specs/6-payloads)) with a `payload` of `CommunityMessageArchiveIndex`.

Only the community owner has permission to send messages with this topic.
Community members MUST NOT have permission to send messages with this topic.
To say the same thing using the Status Communities terminology:
Community members have permissions to read (but not to post to) the hidden channel.
## Canonical message histories

Only community owners are allowed to distribute messages with magnet links via the magnet link channel. Community members MUST NOT be allowed to distribute magnet links. Since the magnet links are created from the community owner node's database (and previously distributed archives), the message history provided by the community owner becomes the canonical message history and single source of truth for the community.
Instead of saying "Community members MUST NOT be allowed to distribute magnet links." I would say "Community members MUST NOT be allowed to post any messages to the hidden channel".
Generally, fetching message archives is a three-step process:

1. Receive message archive index signal, download index, then determine which message archives to download
I think 1. can be simpler, e.g.
- receive a torrent link in the hidden channel, pass the torrent link to the torrent client so the torrent client can start downloading the binary blob
2. The member node requests messages for a time range of up to 30 days from store nodes (this is the case when a new community member joins a community)

### Downloading message archive indices

When member nodes receive a message with a `CommunityMessageArchiveIndex` ([6/PAYLOADS](/specs/6-payloads)) from the aforementioned channel, they MUST extract the `magnet_uri` and pass it to their underlying BitTorrent client so they can fetch the latest message archive index.
again I don't understand why the CommunityMessageArchiveIndex exists, and why we don't just directly send the magnet links in the hidden channel instead
Therefore, member nodes MUST wait for 20 seconds after receiving the last `CommunityMessageArchiveIndex` before they start extracting the magnet link to fetch the latest archive index.

### Downloading individual archives

Once a message archive index is downloaded, community member nodes use a local lookup table to determine which of the listed archives are missing. For this lookup to work, member nodes MUST store the KECCAK-256 hashes of the magnet links for archives they've downloaded.
Can we get rid of the 'message archive index'? Then we can also get rid of the whole local lookup table, which should simplify things.
A member only ever needs the latest magnet link to be able to download all of a community's history.
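The lookup-table mechanism described in the quoted hunk could be sketched like this (hypothetical Python; note `sha3_256` is only a stdlib stand-in for the KECCAK-256 digest the spec actually requires, which is not in the Python standard library):

```python
import hashlib

def magnet_link_digest(magnet_uri):
    """Digest of a magnet link for the local lookup table (sha3_256 used
    here purely as a placeholder for KECCAK-256)."""
    return hashlib.sha3_256(magnet_uri.encode("utf-8")).hexdigest()

def missing_archives(index_magnet_links, downloaded_digests):
    """Given the magnet links listed in a downloaded archive index, return
    only those not yet present in the local lookup table."""
    return [
        link for link in index_magnet_links
        if magnet_link_digest(link) not in downloaded_digests
    ]
```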
1. **Download all archives** - Extract each magnet link in the index and pass them to the underlying BitTorrent client (this is the case for new community member nodes that haven't downloaded any archives yet)
2. **Download only the latest archive** - Extract only the newest magnet link and pass it to the BitTorrent client (this is the case for any member node that has already downloaded all previous history and is now only interested in the latest archive)
3. **Download specific archives** - Look into `from` and `to` fields of every `WakuMessageArchiveMetadata` and only extract magnet links for archives of a specific time range (can be the case for member nodes that have recently joined the network and are only interested in a subset of the complete history)
I think we can probably get rid of points 1, 2 and 3 above, because a single magnet link is all that is ever needed.
### Bandwidth consumption

Community member nodes will download the latest archive they've received from the archive index, which includes messages from the last seven days. Assuming that community member nodes were online for that time range, they have already downloaded that message data and will now download an archive that contains the same.
To avoid clients downloading the same community history archive data two (or more!) times, it's very important that a torrent 'Force recheck' function is performed on top of the previously downloaded binary, so that the previously downloaded data is not downloaded again
Community member nodes will download the latest archive they've received from the archive index, which includes messages from the last seven days. Assuming that community member nodes were online for that time range, they have already downloaded that message data and will now download an archive that contains the same.

This means there's a possibility member nodes will download the same data at least twice.
Member nodes should never need to download the same data from the community history archive service twice; see the comment about the need to perform a 'force recheck' with any new magnet link on top of any previously downloaded binary before commencing download via the latest magnet link.
Of course, a client does first receive live messages via waku and then receives the same messages a second time via torrent, so in that sense a client downloads every message exactly twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's exactly what this consideration is pointing out.
We do have the index
file now in the latest version of this spec which gives us metadata about available archives. One thing member nodes could do is check whether they have been online in the time range of the latest available archive and then decide to not download the data and just consider it "downloaded"
But that will conflict with the idea that the community owner node is the canonical history. If the member node has received more or different messages than the community owner in that time range, the histories won't be identical.
So I guess for now we just need to accept that there's a possibility that data is being downloaded twice (live messages + archive via torrent)
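A minimal sketch of that trade-off at the application layer: given the hashes listed in an archive index, fetch only the archives not already stored locally. The data shapes here are hypothetical, not taken from the spec:

```python
# Hypothetical helper: model the "avoid downloading twice" decision.
# `index_hashes` are archive hashes from a WakuMessageArchiveIndex,
# `local_hashes` are hashes of archives already on disk (assumed shapes).

def archives_to_fetch(index_hashes, local_hashes):
    """Return the archive hashes that still need to be downloaded,
    preserving the order in which the index lists them."""
    have = set(local_hashes)
    return [h for h in index_hashes if h not in have]
```

At the BitTorrent layer, the 'force recheck' achieves a similar outcome by re-verifying already-downloaded pieces instead of fetching them again.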
### Multiple community owners

It is possible for community owners to export the private key of their owned community and pass it to other users so they become community owners as well. This means it's possible for multiple owners to exist.
This is a short term problem that we can probably ignore for now. The reason we can ignore this is that as soon as possible we will tokenize community ownership with an NFT, and this will ensure that there can only ever be one owner. Now that one owner could run two nodes, but we can detect if this is happening and warn them that they shouldn't do this. Or if we detected two owner nodes, we could randomly assign only one of the nodes to be able to produce history torrents, and expose a setting for the community owner to select a specific node to hold this responsibility?
While we can ignore it, it is still a possible attack vector that at minimum should be mentioned.
It isn't obvious to me that separating the two is going to always be straightforward. Example attack: community owner key gets compromised, and they start posting two different databases that are off-by-one that keeps churning local user db. Not a huge concern and perhaps not very likely, and if community is compromised a user can always leave etc etc.
Or if we detected two owner nodes, we could randomly assign only one of the nodes to be able to produce history torrents, and expose a setting for the community owner to select a specific node to hold this responsibility?
I believe there's no way for us to guarantee/enforce this. Theoretically, community owners could run a version of a node that bypasses all of that and still just publishes magnet links on the special channel.
So I guess the easiest thing we can do to account for that is to set and store a "main owner" that other nodes will then use to verify that the magnet link message was signed by that main owner.
All other magnet link messages will be ignored.
Thanks a lot @PascalPrecht for preparing the specs! In general it looks good to me!
I have left some comments for the first half of the specs, I will leave further comments for the rest as I go through
## Abstract

Messages are stored permanently by store nodes ([11/WAKU-MAILSERVER](/spec/11), or [13/WAKU2-STORE](https://rfc.vac.dev/spec/13/)) for up to 30 days. Messages older than that are no longer provided by store nodes, making it impossible for other nodes to request historical messages older than that. This is especially problematic in the case of Status communities, where recently joined members of a community aren't able to request complete message histories of the community channels.
There is no such limit of 30 days persistence in the wakuv2 store protocol
Correct, there's no limit but a max. number of days that messages are stored, which is configurable. It looked like currently 30 days is what's being used (and the default?), so I described it as such.
Will update this paragraph accordingly
| Name | References |
| -------------------- | --- |
| Waku node | An Ethereum node with Waku V1 enabled, or a [10/WAKU2](https://rfc.vac.dev/spec/10/) node that implements [11/WAKU2-RELAY](https://rfc.vac.dev/spec/11/)|
wondering why an Ethereum node?
This was honestly taken from 10/WAKU-USAGE. Probably out of date. Will change this.
| Waku node | An Ethereum node with Waku V1 enabled, or a [10/WAKU2](https://rfc.vac.dev/spec/10/) node that implements [11/WAKU2-RELAY](https://rfc.vac.dev/spec/11/)|
| Store node | A Waku node that implements [11/WAKU-MAILSERVER](/spec/11) or [13/WAKU2-STORE](https://rfc.vac.dev/spec/13/) respectively |
| Waku network | A group of Waku nodes connected through the internet connection and forming a graph |
| Community owner | A Status user that owns a Status community |
We may need to be specific about "ownership", also what we mean by "Status user" and "community", let's discuss them in our call
Agree, it'd be useful to refer to what keys etc. they have access to. If "owner" is a well-defined concept in the community spec, we can refer to this.
Great points, will update.
It looks like linking to an existing concept isn't possible at the moment, because it turns out the original spec for communities has never landed: #151
| Waku network | A group of Waku nodes connected through the internet connection and forming a graph |
| Community owner | A Status user that owns a Status community |
| Community member | A Status user that is part of a Status community |
| Community owner node | A Status node with message archive capabilities enabled, run by a community owner |
Are "Status node" and "Status user" different?
Hm.. wondering what you're getting at with this question. A Status user is a Status account; a Status node is an application that runs a node (which a Status account can log into).
I'll add an entry for Status node as well.
| -------------------- | --- |
| Waku node | An Ethereum node with Waku V1 enabled, or a [10/WAKU2](https://rfc.vac.dev/spec/10/) node that implements [11/WAKU2-RELAY](https://rfc.vac.dev/spec/11/)|
| Store node | A Waku node that implements [11/WAKU-MAILSERVER](/spec/11) or [13/WAKU2-STORE](https://rfc.vac.dev/spec/13/) respectively |
| Waku network | A group of Waku nodes connected through the internet connection and forming a graph |
I'd suggest being specific about the protocol through which waku nodes are connected i.e., wakuv2 relay (and its equivalent in waku v1)
2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
3. Member node requests message history (last 30 days) of community channels from store nodes
4. Member node receives magnet link message from store nodes
5. Member node extracts magnet link from message and passes it to torrent client
I think if we can explain what is inside the waku message distributed in the hidden channel, then it would be clear for the reader what it means to extract the message
2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
3. Member node requests message history (last 30 days) of community channels from store nodes
4. Member node receives magnet link message from store nodes
5. Member node extracts magnet link from message and passes it to torrent client
5. Member node extracts magnet link from message and passes it to torrent client
5. Member node extracts magnet link from the waku message and passes it to torrent client |
1. User joins community and becomes community member
2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
3. Member node requests message history (last 30 days) of community channels from store nodes
4. Member node receives magnet link message from store nodes
4. Member node receives magnet link message from store nodes
4. Member node receives the waku message that contains the message archival index magnet link from the special hidden channel |
It is just a suggestion, feel free to edit it as you see fit
5. Member node extracts magnet link from message and passes it to torrent client
6. Torrent client downloads latest message archive index via magnet link
7. Member node fetches missing archives via torrent
8. Member node unpacks and decompresses message archive data to then hydrate its local database
This should not happen considering that the network is reliable (and messages are received by all the live nodes), or otherwise we should update the assumptions set out at the beginning
Community owner nodes MUST store live messages as [14/WAKU2-MESSAGE](https://rfc.vac.dev/spec/14/). This is required to provide confidentiality, authenticity, and integrity of message data distributed via the BitTorrent layer, and later validated by Status nodes when they unpack message history archives.

Community owner nodes SHOULD remove those messages from their local databases after they have been turned into archives and distributed to the BitTorrent network.
I'd say no message should be deleted unless it is older than 30 days. Also, the store protocol db should not be updated; we are just using it to get our input to BitTorrent, i.e. messages in the waku message format. We have not yet thought about how to update the store protocol db.
I have reviewed the second half of the specs and left some comments @PascalPrecht.
1. The community owner node attempts to create an archive periodically for the past seven days (including the current day). In this case, the `timestamp` has to lie within the day the last time an archive was created and the current day.
2. The community owner node has been offline and attempts to create an archive for all the live messages it has missed since it went offline. In this case, the `timestamp` has to lie within the day the latest message was received and the current day.

Exported messages MUST be restored as [14/WAKU2-MESSAGE](https://rfc.vac.dev/spec/14/) for bundling. Waku messages that have been exported for bundling can now be removed from the community owner node's database (community owner nodes still maintain a database of application messages).
Re deleting messages, please see this previous comment of mine
https://github.com/status-im/specs/pull/162/files#r770793021
The range for the `timestamp` depends on the context in which the community owner node attempts to create a history archive. This can be one of the following:

1. The community owner node attempts to create an archive periodically for the past seven days (including the current day). In this case, the `timestamp` has to lie within the day the last time an archive was created and the current day.
2. The community owner node has been offline and attempts to create an archive for all the live messages it has missed since it went offline. In this case, the `timestamp` has to lie within the day the latest message was received and the current day.
Please see this comment about this https://github.com/status-im/specs/pull/162/files#r768666978
I also think bundling messages should be always based on 7 days interval (decoupled from nodes restart)
I also think bundling messages should be always based on 7 days interval (decoupled from nodes restart)
You mean that, when it missed 30 days of messages, it should still create 4 archives for that (4x 7 days), while the last 2 days of messages go into the next archive?
Makes sense!
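That interpretation (fixed 7-day bundles, decoupled from restarts) can be sketched like this; the function and its time handling are illustrative assumptions, not normative spec text:

```python
from datetime import datetime, timedelta

SEVEN_DAYS = timedelta(days=7)

def archive_windows(last_archived, now):
    """Split the time since the last archive boundary into full 7-day
    [from, to) windows; any partial remainder waits for the next run.
    E.g. 30 missed days yield four archives, with ~2 days left over."""
    windows = []
    start = last_archived
    while start + SEVEN_DAYS <= now:
        windows.append((start, start + SEVEN_DAYS))
        start += SEVEN_DAYS
    return windows
```

Aligning windows to the previous archive boundary (rather than to node restarts) keeps archive contents deterministic, so two honest owner nodes would produce the same partitioning.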
The `to` field SHOULD contain a timestamp of the time range's higher bound.

The `contentTopic` field MUST contain the same `contentTopic` that the archive's `messages` have.
I think we agreed that `contentTopic` is better to be `repeated`, to include all the possible content topics within the community
Updated!
### WakuMessageHistoryArchive

The `from` field SHOULD contain a timestamp of the time range's lower bound.
We may want to be specific about the semantics of time here: in waku v2, timestamps are `double` and contain Unix epoch time in seconds https://rfc.vac.dev/spec/14/#wakumessage (maybe no need for these details in the current state of the specs, but once we decide on the implementation details we shall update the specs)
message WakuMessageArchive {
  uint64 from = 1
  uint64 to = 2
  string contentTopic = 3
string contentTopic = 3
repeated string contentTopic = 3 |
For every created `WakuMessageArchive`, there MUST be a `WakuMessageArchiveMetadata` entry in the index map.

The community owner node MUST derive a magnet link from the newly created `WakuMessageArchiveIndex` so it can be distributed to community member nodes.
Following our last convo, I think it would be good to persist the `WakuMessageArchiveIndex` in the long-term storage layer i.e., BitTorrent; otherwise, there is a possibility of losing the `WakuMessageArchiveIndex` if not properly persisted by Status nodes locally
```
{community_id}-archives
```
To be more specific, content topics follow this format https://rfc.vac.dev/spec/23/#content-topics: `/{application-name}/{version-of-the-application}/{content-topic-name}/{encoding}`
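Combining the `{community_id}-archives` naming with the 23/WAKU2-TOPICS shape could look roughly like this; the concrete application name, version, and encoding segments are illustrative assumptions:

```python
def archive_channel_topic(app_name, version, topic_name, encoding):
    """Assemble a content topic of the form
    /{application-name}/{version-of-the-application}/{content-topic-name}/{encoding},
    per 23/WAKU2-TOPICS."""
    return f"/{app_name}/{version}/{topic_name}/{encoding}"
```

For example, a community's magnet link channel might use `archive_channel_topic("status", "1", "0xdeadbeef-archives", "proto")`, where `0xdeadbeef` stands in for a hypothetical community id.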
2. The member node requests messages for a time range of up to 30 days from store nodes (this is the case when a new community member joins a community)

### Downloading message archive indices
When member nodes receive a message with a `CommunityMessageArchiveIndex` ([6/PAYLOADS](/specs/6-payloads)) from the aforementioned channel, they MUST extract the `magnet_uri` and pass it to their underlying BitTorrent client so they can fetch the latest message archive index.
I thought member nodes receive a waku message whose payload is an `ApplicationMetadataMessage` which embodies a `CommunityMessageArchiveIndex` as its payload.
I've renamed this to `CommunityMessageArchive`.
But yes, that's the payload. And it has the `magnet_uri` that needs to be passed to the BitTorrent client.
Due to the nature of distributed systems, there's no guarantee that a received message is the "last" message. This is especially true when member nodes request historical messages from store nodes.

Therefore, member nodes MUST wait for 20 seconds after receiving the last `CommunityMessageArchiveIndex` before they start extracting the magnet link to fetch the latest archive index.
One approach could be to add a sequence number to the `CommunityMessageArchiveIndex`, and member nodes can immediately decide if they should proceed with downloading or not.
Also, I am not sure why the 20 second waiting time is needed? Archives are published every 7 days, so why should two successive archives be sent within a 20 second interval?
So all of this is for the case where messages are requested and the node receives a magnet link message but doesn't actually know whether it's the latest one (this can be the case for a new member that doesn't have any history at all yet). Maybe there's another such message arriving in the near future, so I thought we need some threshold before we start processing the magnet link.
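The 20-second threshold can be implemented as a simple quiet-period check: reset a timestamp on every incoming index message and only process the newest magnet link once nothing further has arrived for 20 seconds. A minimal sketch; the names are assumptions:

```python
QUIET_PERIOD_SECONDS = 20.0

def ready_to_process(last_index_received_at, now):
    """True once no new CommunityMessageArchiveIndex message has arrived
    for QUIET_PERIOD_SECONDS; the caller resets `last_index_received_at`
    whenever another index message comes in, which restarts the wait."""
    return (now - last_index_received_at) >= QUIET_PERIOD_SECONDS
```

A sequence number on the index, as suggested above, would remove the need for this heuristic entirely.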
When message archives are fetched, community member nodes MUST unwrap the resulting `WakuMessage` instances into `ApplicationMetadataMessage` instances and store them in their local database.
Community member nodes SHOULD NOT store the wrapped `WakuMessage` messages.

Already stored messages with the same `id` or `clock` value MUST be replaced with messages extracted from archives, if both of these values are equal.
I also agree with the consistency i.e., replacing everything from T1-T2; we can later design a synchronization protocol across store nodes to make sure they all have consistent message history
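The replacement rule quoted above (replace only when both `id` and `clock` match) can be sketched as follows; the dict-based "database" is a hypothetical stand-in for the local store:

```python
def merge_archived_message(db, msg):
    """Apply an archived message to the local store: insert it if unseen,
    replace the stored copy when both `id` and `clock` are equal, and
    keep the existing row otherwise."""
    existing = db.get(msg["id"])
    if existing is None or existing["clock"] == msg["clock"]:
        db[msg["id"]] = msg
```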
Nice work!
I see there's a lot of discussion regarding archival index etc. I haven't had the bandwidth to dig into this in any detail, but seems like you all have thought about it from a bunch of different POVs and discussed about it in more detail so I assume we are on our way to a reasonable solution here :P
If it is still an open question by beginning of January, perhaps summarizing the current envisioned approaches and trade-offs would be useful?
EDIT: I see there's this https://hackmd.io/@YoQpkPmuRJ-48bA5PRoaRg/HyBDfl59Y which already does this, then there's a bunch of new comments in the chat. Um... this is too involved for me to personally get into the weeds of right now, can have a closer look beginning of next year.
Possibly naive question: is it possible to get the best of both worlds, with one torrent and using a message index etc.? For example, a lot of torrents have multiple archives within them and a user can choose which ones they want to download, e.g. individual media files in some collection, separated by day (say).
- Community owner nodes provide archives with historical messages **at least** every 30 days
- Community owner nodes receive all community messages
- Community owner nodes are honest
Might be worth adding a sentence or two on that some of the assumptions are less than ideal, and will be enhanced in future work (potentially linking to https://forum.vac.dev/t/status-communities-protocol-and-product-point-of-view/114/2 or some other GH issue, or leave links out if it feels more in line with general spec).
Will do!
If the community owner node goes offline, it MUST go through the following process:

1. Community owner node restarts
2. Community owner node requests messages from store nodes for the missed time range
Will a community owner always know the full list of channels in a community as soon as one is created?
Community member nodes go through the following (high level) process to fetch and restore community message histories:

1. User joins community and becomes community member
Agree, the "community spec" is different from this spec, but it acts as a form of requirement. It'd be very useful to have a clear community spec here to refer to these things unambiguously (owner, channels, members, etc etc)
Community member nodes go through the following (high level) process to fetch and restore community message histories:

1. User joins community and becomes community member
2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
Agree.
In Waku v2 terms, it could be under a special content topic namespaced under a community (say), also indicating what data format is used (compressed magnet link or whatever), see https://rfc.vac.dev/spec/23/#content-topics
Since this spec is written to work for Waku v1, just any unique topic seems useful to start with, and this can be improved later on.
2. By joining a community, member nodes automatically subscribe to special magnet link channel provided by the community
3. Member node requests message history (last 30 days) of community channels from store nodes
4. Member node receives magnet link message from store nodes
5. Member node extracts magnet link from message and passes it to torrent client
Do we have a protobuf or JSON or similar for how this data will be represented? Might've missed this in spec.
Glad to see this is WakuMessages, because it'll make future compatibility infinitely easier! Magnet link could be its own field as well. By keeping this an open kv map (protobuf or json), we can extend it as there's a need, e.g. with any type of compression/time period or whatever we may want to communicate.
What happens e.g.if someone else posts a magnet link to this channel and it doesn't belong to that community? Do people just start seeding random content then? Seems like an attack vector...
uint64 from = 1
uint64 to = 2
string contentTopic = 3
repeated WakuMessage messages = 4 // `WakuMessage` is provided by 14/WAKU2-MESSAGE
It should be possible to do this with a simple function that just passes the payload and maybe maps content topic to content topic or so.
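That pass-through function could look roughly like this; the dict shape is a simplification of the 14/WAKU2-MESSAGE protobuf, and the field names are assumptions:

```python
def wrap_for_archive(app_message_bytes, content_topic):
    """Wrap serialized ApplicationMetadataMessage bytes as a
    WakuMessage-like record: the payload passes through unchanged and
    the content topic maps one-to-one."""
    return {"payload": app_message_bytes, "contentTopic": content_topic, "version": 0}
```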
All messages sent with this topic MUST be instances of `ApplicationMetadataMessage` ([6/PAYLOADS](/specs/6-payloads)) with a `payload` of `CommunityMessageArchiveIndex`.

Only the community owner has permission to send messages with this topic.
How is this enforced? It isn't at a transport level. So how are clients verifying this?
Perhaps this can be phrased as:
"Only the community owner MAY post to the hidden channel. Other messages on this specified channel MUST be ignored by clients."
## Canonical message histories

Only community owners are allowed to distribute messages with magnet links via the magnet link channel. Community members MUST NOT be allowed to distribute magnet links. Since the magnet links are created from the community owner node's database (and previously distributed archives), the message history provided by the community owner becomes the canonical message history and single source of truth for the community.
I think this should be rephrased for clarity; it is important for a spec reader to understand that anyone CAN post to this topic. There's no protocol level validation in terms of relaying messages or whatever.
The semantics we are pointing to here is that any messages from a bad source MUST NOT be accepted. This points to a validation process, that each client has to perform.
Otherwise what could easily happen is that some implementation, say js-waku, just starts seeding a magnet link on the assumption that the channel is "safe", and this could be god knows what that some troll decided to upload.
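Since the relay layer can't enforce posting rights, the check has to happen in each client. A hedged sketch of that validation, with `verify_signature` standing in for the real signature primitive (the actual scheme, e.g. secp256k1, is an assumption here):

```python
def accept_archive_index_message(msg, owner_pubkey, verify_signature):
    """Accept a magnet link message only if it claims to come from the
    known community owner AND its signature actually verifies; everything
    else on the special channel is ignored, so clients never seed
    magnet links posted by arbitrary peers."""
    return msg.get("signer") == owner_pubkey and bool(verify_signature(msg))
```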
Not only will multiple owners multiply the amount of archive index messages being distributed to the network, they might also contain different sets of magnet links and their corresponding hashes.

Even if just a single message is missing in one of the histories, the hashes presented in archive indices will look completely different, resulting in the community member node downloading the corresponding archive (which might be identical to an archive that was already downloaded, except for that one message).
single message is missing in one of the histories, the hashes presented in archive indices
Agree, and if we can make design robust to this it'd be useful. I suppose this is related to the whole archival index discussion? (I haven't kept up here in detail, just noticed a lot of back and forth).
…to latest findings
Hey everyone!
I've updated the draft. The changes are in a separate commit so it's a bit easier to review.
|
||
## Abstract | ||
|
||
Messages are stored permanently by store nodes ([11/WAKU-MAILSERVER](/spec/11), or [13/WAKU2-STORE](https://rfc.vac.dev/spec/13/)) for up to 30 days. Messages older than that are no longer provided by store nodes, making it impossible for other nodes to request historical messages older than that. This is especially problematic in the case of Status communities, where recently joined members of a community aren't able to request complete message histories of the community channels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, there's no limit but a max. number of days that messages are stored, which is configurable. It looked like currently 30 days is what's being used (and the default?), so I described it as such.
Will update this paragraph accordingly
|
||
| Name | References | | ||
| -------------------- | --- | | ||
| Waku node | An Ethereum node with Waku V1 enabled, or a [10/WAKU2](https://rfc.vac.dev/spec/10/) node that implements [11/WAKU2-RELAY](https://rfc.vac.dev/spec/11/)| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was honestly taken from 10/WAKU-USAGE
. Probably out of date. Will change this.
> | Waku node | An Ethereum node with Waku V1 enabled, or a [10/WAKU2](https://rfc.vac.dev/spec/10/) node that implements [11/WAKU2-RELAY](https://rfc.vac.dev/spec/11/) |
> | Store node | A Waku node that implements [11/WAKU-MAILSERVER](/spec/11) or [13/WAKU2-STORE](https://rfc.vac.dev/spec/13/) respectively |
> | Waku network | A group of Waku nodes connected through the internet connection and forming a graph |
> | Community owner | A Status user that owns a Status community |
Great points, will update.

It looks like linking to an existing concept isn't possible at the moment, because it turns out the original spec for communities has never landed: #151
> | Waku network | A group of Waku nodes connected through the internet connection and forming a graph |
> | Community owner | A Status user that owns a Status community |
> | Community member | A Status user that is part of a Status community |
> | Community owner node | A Status node with message archive capabilities enabled, run by a community owner |
Hm, I'm wondering what you're getting at with this question. A Status user is a Status account; a Status node is an application that runs Status (which a Status account can log into).

I'll add an entry for `Status node` as well.
> - Community owner nodes provide archives with historical messages **at least** every 30 days
> - Community owner nodes receive all community messages
> - Community owner nodes are honest
Will do!
> All messages sent with this topic MUST be instances of `ApplicationMetadataMessage` ([6/PAYLOADS](/specs/6-payloads)) with a `payload` of `CommunityMessageArchiveIndex`.
>
> Only the community owner has permission to send messages with this topic.
Updated.
> 2. The member node requests messages for a time range of up to 30 days from store nodes (this is the case when a new community member joins a community)
>
> ### Downloading message archive indices
>
> When member nodes receive a message with a `CommunityMessageArchiveIndex` ([6/PAYLOADS](/specs/6-payloads)) from the aforementioned channel, they MUST extract the `magnet_uri` and pass it to their underlying BitTorrent client so they can fetch the latest message archive index.
I've renamed this to `CommunityMessageArchive`.

But yes, that's the payload, and it has the `magnet_uri` that needs to be passed to the BitTorrent client.
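For illustration, extracting the info hash from such a `magnet_uri` before handing it to a torrent client could look like the sketch below. The keys of the returned dict are made up for this example; the spec doesn't mandate any particular parsing or data shape.

```python
from urllib.parse import urlparse, parse_qs

def parse_magnet_uri(magnet_uri: str) -> dict:
    """Extract the BitTorrent info hash, display name, and trackers
    from a magnet URI (see BEP 9 for the magnet extension)."""
    parsed = urlparse(magnet_uri)
    if parsed.scheme != "magnet":
        raise ValueError("not a magnet URI")
    params = parse_qs(parsed.query)
    # xt ("exact topic") carries the info hash as urn:btih:<hex digest>
    xt = params["xt"][0]
    prefix = "urn:btih:"
    if not xt.startswith(prefix):
        raise ValueError("unsupported exact-topic scheme")
    return {
        "info_hash": xt[len(prefix):],
        "display_name": params.get("dn", [""])[0],
        "trackers": params.get("tr", []),
    }

uri = ("magnet:?xt=urn:btih:c12fe1c06bba254a9dc9f519b335aa7c1367a88a"
       "&dn=community-archive-index")
info = parse_magnet_uri(uri)
```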
> Community member nodes will download the latest archive they've received from the archive index, which includes messages from the last seven days. Assuming that community member nodes were online for that time range, they have already downloaded that message data and will now download an archive that contains the same.
>
> This means there's a possibility member nodes will download the same data at least twice.
That's exactly what this consideration is pointing out.

We do have the `index` file now in the latest version of this spec, which gives us metadata about available archives. One thing member nodes could do is check whether they have been online in the time range of the latest available archive and then decide to not download the data and just consider it "downloaded".

But that will conflict with the idea that the community owner node is the canonical history. If the member node has received more or different messages than the community owner in that time range, the histories won't be identical.

So I guess for now we just need to accept that there's a possibility that data is being downloaded twice (live messages + archive via torrent).
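The "check whether we were online" idea mentioned above could be sketched roughly as follows. The `online_ranges` bookkeeping is hypothetical, not something the spec defines:

```python
def should_download_archive(archive_from, archive_to, online_ranges):
    """Return False if the node was online for the archive's entire
    time range (it already holds those live messages), True otherwise.

    Timestamps are Unix seconds; online_ranges is a sorted list of
    non-overlapping (start, end) intervals the node was connected.
    """
    covered = archive_from
    for start, end in online_ranges:
        if start > covered:
            return True  # gap in coverage -> fetch the archive
        covered = max(covered, end)
        if covered >= archive_to:
            return False  # fully covered -> skip the download
    return True
```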
> ### Multiple community owners
>
> It is possible for community owners to export the private key of their owned community and pass it to other users so they become community owners as well. This means it's possible for multiple owners to exist.
> Or if we detected two owner nodes, we could randomly assign only one of the nodes to be able to produce history torrents, and expose a setting for the community owner to select a specific node to hold this responsibility?

I believe there's no way for us to guarantee/enforce this. Theoretically, community owners could run a version of a node that bypasses all of that and still just publishes magnet links on the special channel.

So I guess the easiest thing we can do to account for that is to set and store a "main owner" that other nodes will then use to verify that the magnet link message was signed by that main owner. All other magnet link messages will be ignored.
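A rough sketch of that "main owner" filtering rule: accept a magnet-link message only if its signature verifies against the stored main-owner key. The HMAC below is a toy stand-in for the real public-key signatures Status messages carry; only the filtering logic is the point.

```python
import hashlib
import hmac

def sign(payload: bytes, key: bytes) -> bytes:
    # Toy stand-in for the real message signature scheme.
    return hmac.new(key, payload, hashlib.sha256).digest()

def valid_archive_messages(messages, main_owner_key: bytes):
    """Keep only magnet-link messages whose signature verifies against
    the stored "main owner" key; messages from other owners are ignored."""
    return [payload for payload, signature in messages
            if hmac.compare_digest(signature, sign(payload, main_owner_key))]

main_key = b"main-owner-key"
from_main = (b"magnet:?xt=urn:btih:aa", sign(b"magnet:?xt=urn:btih:aa", main_key))
from_other = (b"magnet:?xt=urn:btih:bb", sign(b"magnet:?xt=urn:btih:bb", b"other"))
accepted = valid_archive_messages([from_main, from_other], main_key)
```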
> Not only will multiple owners multiply the amount of archive index messages being distributed to the network, they might also contain different sets of magnet links and their corresponding hashes.
>
> Even if just a single message is missing in one of the histories, the hashes presented in archive indices will look completely different, resulting in the community member node downloading the corresponding archive (which might be identical to an archive that was already downloaded, except for that one message).
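To illustrate the point in the quoted paragraph: any reasonable archive digest changes completely when a single message is added or missing, so nodes can't tell from the hashes that two archives are near-identical. The sketch below uses SHA-256 over a made-up message encoding; the spec's real encoding differs.

```python
import hashlib

def archive_hash(messages):
    """Digest an ordered list of archived messages (illustrative only)."""
    h = hashlib.sha256()
    for m in messages:
        h.update(hashlib.sha256(m).digest())
    return h.hexdigest()

week = [b"msg-1", b"msg-2", b"msg-3"]
h1 = archive_hash(week)
h2 = archive_hash(week + [b"msg-4"])  # same archive plus one extra message
# h1 and h2 share no structure, even though the archives differ by one message.
```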
> We could also expose this setting to the community owner, to let the community owner select a different node to be the node that produces history perhaps?

Yes, I think what we can do is set a "main owner". So even if there are multiple people with private keys, only one main owner could be set. Obviously, with multiple owners having the private key and write privileges, each of them can change that value as they like.

This could still be problematic if they serve different archives. Then the question is: will member nodes simply ignore all the older archives in a given time range (because they might look completely different), or will they also download all of it and keep replacing all of it?

In other words: if member nodes detect that the history has changed, will they replace that entire history, or will they stick to only downloading the latest #n archives?
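If member nodes keep the last index they processed, one option is to fetch only archives that are new or whose hash changed. The `(from, to) → hash` index shape below is assumed purely for illustration:

```python
def archives_to_fetch(old_index, new_index):
    """Given two archive indices mapping a time range (from, to) to an
    archive hash, return the ranges whose archives are new or changed."""
    return sorted(rng for rng, digest in new_index.items()
                  if old_index.get(rng) != digest)

old = {(0, 7): "aaa", (7, 14): "bbb"}
new = {(0, 7): "aaa", (7, 14): "ccc", (14, 21): "ddd"}
changed = archives_to_fetch(old, new)  # unchanged (0, 7) is skipped
```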