How will Freenet integrate semantic metadata and AI models? #937
-
We can schedule a talk about such topics.
-
Oh! My level in English and computer skills is far too low for that. I prefer to discuss here on this forum. This reflection is inspired by what I understood about Freenet, but it is possible that I understood nothing. It is also inspired by the Open Document standard (LibreOffice, etc.). Any "document" should consist of at least three files: a content file, a metadata file, and a manifest (e.g. manifest.xml) that links them together.
Of course a document can be much more complex and include many files: in this case, it could be useful to split the manifest into a root manifest and several child manifests: one for text files, another for images, a third for metadata... Each file in the document can have its own metadata: for texts, images, videos... A metadata file for each file, or for each file section, if a file is a compilation of multiple entities (for example, an anthology).

From what I understand, the next version of Freenet will contain not only a store, but also applications that will work P2P on the network. I call an "agent" any program, any person, or any network of people or programs interacting together on the network. There could be different types of agents: indexing agents, selection agents, and application agents.

Here is what the path of a document on the network could be:

- The author injects all the files corresponding to his document onto the network (at a minimum, manifest.xml) and submits the URI of the manifest.xml file to a primary indexing agent, that is, an agent that indexes everything.
- The indexing agent checks that the manifest.xml file exists, checks that the files listed in the manifest exist, saves the metadata and performs some basic indexing tasks. It creates a new manifest including the URIs of the files that actually exist, signs it, then makes the URI of this manifest available to other agents in its index.
- Selection agents download the manifest from its URI recorded in a primary index, then the files corresponding to their selection functions (selection agents do not necessarily operate on all the files of a document). They carry out their selection operations on the data or metadata: anti-spam filter, anti-pornography, anti-advertising, selection of files or documents according to their metadata... They sign the manifests which have passed their filters and save them in secondary indexes: it is these secondary indexes that will be used by most applications, or by other selection agents.

It is important that the indexing and selection agents be distinct and that there be several specialized indexing and selection agents. Not all agents will have the same expectations in terms of selection and indexing, and it is desirable that these agents be organized in a modular way, so that selection and indexing operations can be chained, as with UNIX filters.

Finally, application agents retrieve the indexed and selected documents, along with their metadata, for their own use. Metadata is extremely useful for applications. For example, for a forum application, it is enough to add the predicate "uri:xxx is a response to uri:yyy" to the metadata of your document: the forum is thus automatically classified using metadata.

Furthermore, selection agents are not necessarily programs; they can be human networks: a web of trust. Thus a group can collectively decide which documents they wish to share, or with whom they wish to exchange them, since all documents are signed and have metadata. Just publish (or secretly exchange) the URI and parameters of the selection agents.

Finally, no real censorship is possible on this type of network. No selection agent is obligatory, whether the agent is a program or a web of trust: we select our own selection agents, or we construct new ones. Certain documents may thus be accessible to one agent but not to another, depending on the selection agents chosen; however, no agent can block access to a document from other agents, since it only controls its own selection.
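To make that chain a little more concrete, here is a rough sketch in Python. None of this is the actual Freenet API; every name (`uri_of`, `indexing_agent`, `no_spam`, the `dc:` keys) is invented for illustration, signatures are elided, and `uri:yyy` is just a placeholder. The point is only the shape of the pipeline: publish, verify, re-sign, filter, add to a secondary index.

```python
import hashlib
import json

def uri_of(data: bytes) -> str:
    """Content-addressed URI: in this sketch, simply a SHA-256 hash."""
    return "uri:" + hashlib.sha256(data).hexdigest()

# The author publishes the content, its metadata, and a manifest listing both.
content = b"<p>Hello Freenet</p>"
metadata = json.dumps({
    "dc:creator": "alice",
    "dc:subject": "freenet",
    # Forum-style predicate: this document is a response to another one.
    "is_response_to": "uri:yyy",          # "uri:yyy" is a placeholder
}).encode()

manifest = {"files": {"content.html": uri_of(content),
                      "metadata.json": uri_of(metadata)}}

store = {uri_of(content): content, uri_of(metadata): metadata}

def indexing_agent(manifest: dict, store: dict) -> dict:
    """Primary indexing agent: keep only the listed files that actually exist,
    then sign the resulting manifest (signature elided here)."""
    existing = {name: uri for name, uri in manifest["files"].items() if uri in store}
    return {"files": existing, "signed_by": ["primary-indexing-agent"]}

def no_spam(manifest: dict, store: dict) -> bool:
    """One selection agent's filter; several can be chained like UNIX filters."""
    meta = json.loads(store[manifest["files"]["metadata.json"]])
    return meta.get("dc:creator") != "known-spammer"

indexed = indexing_agent(manifest, store)
secondary_index = []
if no_spam(indexed, store):
    indexed["signed_by"].append("anti-spam-agent")
    secondary_index.append(indexed)       # used by applications and other agents

print(secondary_index)
```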
In principle, the selection and indexing agents should be free software, in order to allow control of their behavior and to fork them if they ever take a direction that does not suit us.

Indexing, selection or application agents can also be AIs. These AIs can produce small data models by serializing the document's data and/or metadata into vectors, then creating a new manifest where links to the model and its metadata are added. This small data model can then be used by other AIs to create and/or use larger data models. The AIs would be connected to each other through a web of trust, because of course there is no guarantee that an AI agent won't create fake small data models to corrupt larger models. Nevertheless, the idea would be to distribute at least part of the vectorization work by creating intermediate vectorization files for each document; these small files would be easier to distribute and to verify, which sustains the AIs' web of trust. Freenet, with a uniform network of URIs, data, metadata and distributed applications, could become a privileged ground for a network of AIs cooperating with each other on a web of trust.
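A toy sketch of the "small data model per document" idea. The embedding below is just a hashed bag-of-words standing in for a real model, and all names (`embed`, `vectorization_agent`, `uri:xxx`) are hypothetical:

```python
import hashlib
import json

def uri_of(data: bytes) -> str:
    return "uri:" + hashlib.sha256(data).hexdigest()

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding: hash each token into a fixed-size count vector.
    A real vectorization agent would run an actual embedding model here."""
    vector = [0.0] * dim
    for token in text.lower().split():
        vector[int(hashlib.sha256(token.encode()).hexdigest(), 16) % dim] += 1.0
    return vector

def vectorization_agent(document_text: str, source_manifest_uri: str) -> dict:
    """Produce a per-document 'small data model' and a new manifest linking to it,
    so that other (trusted) AIs can aggregate it into larger models."""
    model = {"vector": embed(document_text), "derived_from": source_manifest_uri}
    model_bytes = json.dumps(model).encode()
    return {
        "files": {"model.json": uri_of(model_bytes)},
        "derived_from": source_manifest_uri,
        "signed_by": "vectorization-agent",   # signature elided
    }

print(vectorization_agent("freenet metadata ai freenet", "uri:xxx"))
```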
-
Since a picture is worth a thousand words, here is a simplified diagram of what I imagine:

```mermaid
flowchart TB
  id0{author} --> id1[/Markdown file with YAML metadata/]
  id1 --> id2[Markdown processor]
  id2 --> id3[/sitemap.xml/] & id4[/"metadata.rdf (description)"/] & id5[/content.html/]
  id3 & id4 & id5 --> id5a[("Store (unsafe)")]
  id5a --> id7[Web of Trust]
  id7 --> id5b[("Store (safe)")]
  id9{reader} --> id10[/"metadata.rdf (selection)"/]
  id10 --> id11[("Store (safe)")]
  id11 --> id12[Web of Trust]
  id12 --> id9
```
The tests can be of any nature, ranging from spam filtering to consistency tests between data and metadata, or data verification. In practice, a network of agents produces a new sitemap.xml, a new metadata.rdf and possibly a new content.html if the data needs to be brought into compliance. Each of these new resources is derived from the old ones and provides information in the metadata about the validated tests and the transformations carried out. The new files are signed by the network of agents. If the user trusts this network of agents, he will use the data and metadata produced by the tests of the network of agents he trusts, and not the original data and metadata, which are untrusted. This allows him to select the appropriate data and metadata for his application. We can extend the syntax of sitemap.xml to indicate the type of resource designated by each URI, or use another XML language, in order to facilitate the testing work carried out by the agents on the web of trust.

Principle of separation of resources: in order to facilitate the analysis and storage of resources, we should avoid container files that bundle all the resources together, and instead separate the resources into distinct files, linked together by a sitemap.xml file or equivalent. This allows for easy updating of resources, by editing the sitemap.xml file to add or modify URIs pointing to new or updated resources. It also allows resource sharing between documents: if several documents use the same resource, this resource will have the same hash, and therefore the same URI, and will therefore be shared between all these documents, thanks to the sitemap. If resources were encapsulated in container files, sharing would be impossible, and storage and bandwidth would be wasted.
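A tiny sketch of why the separation principle saves storage, assuming only that URIs are content-addressed (here reduced to a plain SHA-256 hash, which is just an assumption): identical resources referenced from different sitemaps collapse to the same URI.

```python
import hashlib

def uri_of(data: bytes) -> str:
    """Content-addressed URI: identical bytes always yield the same URI."""
    return "freenet:" + hashlib.sha256(data).hexdigest()

logo = b"...the same PNG bytes used by both documents..."

# Two different documents, each with its own sitemap referencing the shared image.
sitemap_a = {"content.html": uri_of(b"<p>document A</p>"), "logo.png": uri_of(logo)}
sitemap_b = {"content.html": uri_of(b"<p>document B</p>"), "logo.png": uri_of(logo)}

# The shared resource resolves to one and the same URI, so it is stored only once;
# packed inside two container files, it would have been stored twice.
assert sitemap_a["logo.png"] == sitemap_b["logo.png"]
print(sitemap_a["logo.png"])
```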
-
This doesn't directly concern your post (how to implement such a model), but an ethical concern I feel is not getting much attention in the AI space right now. My belief is that any publicly shared data is going to be ingested into machine learning models, whether we want it to be or not. If it needs to be processed and aggregated into an intermediate form, people will do that. My concern is that AI is continuing the trend of centralized power and gatekeepers. In order to counter this asymmetry of power, I think it's important that we also apply AI in private, in a sandbox that the user controls, and in a way the user trusts is being done with his/her best interests in mind.

I would encourage others to consider that lens as well: the permeable barrier between public and private, and how important it is for individuals to be able to choose when they enter and leave the public space. If you can't enter the public space, that's censoring and deplatforming. If you can't leave the public space, that's the absence of privacy. And increasingly this has other social-credit-score-like ramifications.
-
@sulivanShu Do you use Obsidian? The use case you're talking about looks a lot like something that a person would set up to publish from Obsidian. But I guess the static site publishers (Hugo et al) also use markdown with frontmatter. What you're describing should work in principle. As far as I understand the FreeNet architecture, validating data before inclusion and then further processing it into derivative forms are some of the core behaviors of contracts. The only part that's unclear to me is the access controls - under what conditions would apps be able to read each other's data? And under what conditions could one user read another's data?
Yes, you and I are on the same page here. Many platform stewards are pretending like they must choose between censorship and chaos, and that's simply not true if you embrace subjectivity and put the power in the users' hands. This mirrors how biological organisms are structured - we view reality through a subjective lens. You could have a lifecycle that looks like this:

```mermaid
graph LR
  a[Publish] --> b[Assertions/Tagging] --> c[Query] --> d[Reference] --> a
```
The Assertions/Tagging step, I think, is equivalent to what you're calling indexing... it's an ecosystem whose job is to make certain qualifying statements about the content, cryptographically sign those statements and put them on a public ledger. Querying is equivalent to "Selection". There's an analogous software deployment cycle which could be used to qualify and select software components in a decentralized way:

```mermaid
graph LR
  a[Develop] --> b[QA] --> c[Query] --> d[Deploy] --> a
```
Yes. In the above software deployment cycle, if the Develop and QA steps produce artifacts on public ledgers, and if the Query algorithms are also stored on a public ledger, it would enable a decentralized, censorship-resistant app store, while still providing for the quality metrics that matter through the subjective lens of the individual. The individual can decide through the web of trust which organizations and approaches curate the kind of results they are looking for. In theory I think it should be possible to develop these kinds of patterns on FreeNet, or on anything that resembles a cryptographically signed public ledger.
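As a rough sketch of that cycle (nothing Freenet-specific; signatures are reduced to a `signer` field and every name is invented), querying the "ledger" through one's own web of trust could look like this:

```python
# A public ledger of assertions about content, each signed by some agent.
# Signatures are elided; "signer" stands in for a verified public key.
assertions = [
    {"about": "uri:app1", "claim": "passes-QA",  "signer": "qa-team-alpha"},
    {"about": "uri:app2", "claim": "passes-QA",  "signer": "unknown-bot"},
    {"about": "uri:app1", "claim": "no-malware", "signer": "security-guild"},
]

# Each user brings their own web of trust: the signers they choose to believe.
my_trust = {"qa-team-alpha", "security-guild"}

def query(required_claims: set[str], trust: set[str]) -> set[str]:
    """Return content URIs for which every required claim is asserted
    by at least one trusted signer (a purely subjective selection)."""
    results = set()
    for uri in {a["about"] for a in assertions}:
        claims = {a["claim"] for a in assertions
                  if a["about"] == uri and a["signer"] in trust}
        if required_claims <= claims:
            results.add(uri)
    return results

print(query({"passes-QA", "no-malware"}, my_trust))   # {'uri:app1'}
```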
-
"The only part that's unclear to me is the access controls - under what conditions would apps be able to read each other's data? And under what conditions could one user read another's data?" I talk about public data, so there is no access controls problem. When a secret is shared among a large number of people, protecting the secret is essentially illusory. |
-
What semantic metadata and AI models have in common is that they represent data as vectors.
In the future, we can expect all data to have two forms of representation: a raw form as ordinary files, and a vector form integrated into AI models.
The synthetic representation of data in the form of vectors is useful not only for AI applications, but also for more mundane applications such as search engines or subscriptions. For example, if we add Dublin Core or BIBFRAME metadata to a file, it becomes possible to find it using its metadata, or to subscribe to streams of new files matching certain metadata criteria: author, subject, category, organization...
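(As a toy illustration of such a subscription, assuming nothing about the real API and using plain dictionaries in place of real Dublin Core records:)

```python
# Hypothetical subscription: match incoming documents' metadata against
# a subscriber's criteria. Field names mimic Dublin Core but are illustrative.
subscription = {"dc:subject": "distributed-systems", "dc:creator": "alice"}

incoming = [
    {"uri": "freenet:aaa", "dc:creator": "alice", "dc:subject": "distributed-systems"},
    {"uri": "freenet:bbb", "dc:creator": "bob",   "dc:subject": "gardening"},
]

def matches(meta: dict, criteria: dict) -> bool:
    """A document matches if every criterion equals the corresponding field."""
    return all(meta.get(key) == value for key, value in criteria.items())

feed = [doc["uri"] for doc in incoming if matches(doc, subscription)]
print(feed)   # ['freenet:aaa']
```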
This raises the question of continuous integration of metadata, and its control by human or AI agents operating on webs of trust in order to avoid spam.
This also raises the question of the continuous integration of vector representations of public data on Freenet in general, and their manipulation by AI for extremely diverse activities (syntheses, research, etc.). If all Freenet data is identified by unique URNs (freenet:HASH) and associated with trusted metadata, this could make AI on Freenet easier to program and more powerful.
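A minimal sketch of that dual representation, assuming only that the URN is a hash of the raw bytes (freenet:HASH as above) and using a toy byte-frequency vector in place of a real model:

```python
import hashlib

def urn_of(data: bytes) -> str:
    """URN of the raw form: freenet:HASH, as described above (sketch only)."""
    return "freenet:" + hashlib.sha256(data).hexdigest()

raw = b"Semantic metadata and AI models both represent data as vectors."

# Toy vector form: byte-frequency features standing in for a real embedding model.
vector = [raw.count(b) / len(raw) for b in range(0, 256, 32)]

# The two representations of the same data, linked by the same URN,
# with a slot for the trusted metadata mentioned above.
record = {
    "urn": urn_of(raw),
    "raw": raw,
    "vector": vector,
    "metadata": {"dc:title": "dual representation example"},
}
print(record["urn"], record["vector"])
```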
I am launching this discussion to find out whether the planned architecture of Freenet will allow this dual representation of data, in raw form and in vector form (as explicit metadata or AI models), whether it will manage the links between these different representations, and whether there will be programs to use them.