Merge pull request modelscope#15 from ZiTao-Li/zitao/dev_copilot

improve rag module
FredericW · May 16, 2024 · 06ce25d · 06ce25d
2 parents 123cd06 + 0c25ada
commit 06ce25d
Showing 15 changed files with 403 additions and 693 deletions.
diff --git a/docs/sphinx_doc/en/source/tutorial/209-rag.md b/docs/sphinx_doc/en/source/tutorial/209-rag.md
@@ -0,0 +1,126 @@
+(209-rag-en)=
+
+# A Quick Introduction to RAG in AgentScope
+
+This tutorial introduces three RAG-related concepts in AgentScope: Knowledge, Knowledge Bank, and RAG agent.
+
+### Knowledge
+The Knowledge modules (currently only `LlamaIndexKnowledge`; support for LangChain is planned) are responsible for handling all RAG-related operations.
+
+Here, we use `LlamaIndexKnowledge` as an example to illustrate the operations within the `Knowledge` module.
+When a `LlamaIndexKnowledge` object is initialized, `LlamaIndexKnowledge.__init__` goes through the following steps:
+  *  It processes the documents and builds an index for retrieval in `LlamaIndexKnowledge._data_to_index(...)`, which includes
+      * loading the documents with `LlamaIndexKnowledge._data_to_docs(...)`;
+      * preprocessing the documents into nodes with the preprocessing methods and the embedding model in `LlamaIndexKnowledge._docs_to_nodes(...)`;
+      * generating the index from the processed nodes.
+  * If the index already exists, `LlamaIndexKnowledge._load_index(...)` is invoked instead to load it and avoid repeated embedding calls.
+
+  A RAG module can be created from a JSON configuration that specifies 1) the data path, 2) the data loader, 3) the data preprocessing methods, and 4) the embedding model (model config name).
+  See the following for a detailed example:
+  <details>
+  <summary> A detailed example of RAG module configuration </summary>
+
+  ```json
+  [
+  {
+    "knowledge_id": "{your_knowledge_id}",
+    "emb_model_config_name": "{your_embed_model_config_name}",
+    "data_processing": [
+      {
+        "load_data": {
+          "loader": {
+            "create_object": true,
+            "module": "llama_index.core",
+            "class": "SimpleDirectoryReader",
+            "init_args": {
+              "input_dir": "{path_to_your_data_dir_1}",
+              "required_exts": [".md"]
+            }
+          }
+        }
+      },
+      {
+        "load_data": {
+          "loader": {
+            "create_object": true,
+            "module": "llama_index.core",
+            "class": "SimpleDirectoryReader",
+            "init_args": {
+              "input_dir": "{path_to_your_python_code_data_dir}",
+              "recursive": true,
+              "required_exts": [".py"]
+            }
+          }
+        },
+        "store_and_index": {
+          "transformations": [
+            {
+              "create_object": true,
+              "module": "llama_index.core.node_parser",
+              "class": "CodeSplitter",
+              "init_args": {
+                "language": "python",
+                "chunk_lines": 100
+              }
+            }
+          ]
+        }
+      }
+    ]
+  }
+  ]
+  ```
+
+  </details>
+
+If users want to avoid the detailed configuration, `KnowledgeBank` also provides a quicker way (see below).
+  <br/>
+
+### Knowledge Bank
+The knowledge bank maintains a collection of Knowledge objects (e.g., built on different datasets). Thus,
+different agents can reuse the RAG modules without unnecessary re-initialization.
+Because configuring a RAG module may be too complicated for most users, the knowledge bank also provides an easy function call to create RAG modules.
+* `KnowledgeBank.add_data_as_knowledge`: creates a RAG module. The easy way only requires `knowledge_id`, `emb_model_name` and `data_dirs_and_types`:
+  ```python
+  knowledge_bank.add_data_as_knowledge(
+      knowledge_id="agentscope_tutorial_rag",
+      emb_model_name="qwen_emb_config",
+      data_dirs_and_types={
+          "../../docs/sphinx_doc/en/source/tutorial": [".md"],
+      },
+  )
+  ```
+  For more advanced initialization, users can pass a knowledge config as the `knowledge_config` parameter:
+  ```python
+  # knowledge_config: a dict holding one knowledge configuration (an entry of the JSON list above)
+  knowledge_bank.add_data_as_knowledge(
+      knowledge_id=knowledge_config["knowledge_id"],
+      emb_model_name=knowledge_config["emb_model_config_name"],
+      knowledge_config=knowledge_config,
+  )
+  ```
+* `KnowledgeBank.get_knowledge`: accepts two parameters, `knowledge_id` and `duplicate`.
+  It returns the knowledge object with the given `knowledge_id`; if `duplicate` is true, the returned object is a deep copy.
+* `KnowledgeBank.equip`: accepts two parameters, `agent` and `duplicate`.
+  The function first checks whether the agent has a `rag_config`; if so, it provides the knowledge according to the
+  `knowledge_id` in the `rag_config` and initializes the retriever(s) for the agent. `duplicate` likewise decides whether the knowledge is deep copied.
+
+
+
+### RAG agent
+A RAG agent is an agent that can generate answers based on the retrieved knowledge.
+  * Agent using RAG: a RAG agent requires a `rag_config` in its configuration, which contains a list of `knowledge_id`.
+  * The agent can load specific knowledge from a `KnowledgeBank` by being passed to the `KnowledgeBank.equip` function.
+  * The agent can use the retrievers in its `reply` function to retrieve from the `Knowledge` and compose its prompt to the LLM.
+
+
+
+**Building a RAG agent yourself.** As long as your agent's configuration has a `rag_config` attribute that is a dict containing a list of `knowledge_id`, you can pass the agent to `KnowledgeBank.equip`.
+Your agent will then be equipped with the knowledge listed in `knowledge_id` and the corresponding retrievers.
+You can decide how to use the retrievers, and even update and refresh the index, in your agent's `reply` function.
+
+
+[[Back to the top]](#209-rag-en)
+
+
+
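The initialization flow described in the tutorial above (build the index from the documents, or load a persisted index to avoid repeated embedding calls) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not AgentScope's implementation: `ToyKnowledge`, `CountingEmbedder`, the character-based chunking, and the JSON persistence format are all hypothetical stand-ins for `LlamaIndexKnowledge`, the embedding model, and the LlamaIndex index storage.

```python
import json
import os
import tempfile


class CountingEmbedder:
    """Hypothetical stand-in for an embedding model; counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        # Toy 2-d "embedding"; a real model would return a dense vector.
        return [float(len(text)), float(sum(map(ord, text)) % 97)]


class ToyKnowledge:
    """Hypothetical sketch of the build-or-load flow; not the AgentScope class."""

    def __init__(self, knowledge_id, docs, persist_dir, embed_fn):
        self.embed_fn = embed_fn
        self.index_path = os.path.join(persist_dir, f"{knowledge_id}.json")
        if os.path.exists(self.index_path):
            # Index already persisted: load it, no embedding calls needed.
            self.index = self._load_index()
        else:
            self.index = self._data_to_index(docs)

    def _docs_to_nodes(self, docs, chunk_size=20):
        # Split each document into fixed-size chunks and embed every chunk.
        nodes = []
        for doc in docs:
            for start in range(0, len(doc), chunk_size):
                chunk = doc[start:start + chunk_size]
                nodes.append({"text": chunk, "embedding": self.embed_fn(chunk)})
        return nodes

    def _data_to_index(self, docs):
        # Build the "index" (here just the node list) and persist it.
        nodes = self._docs_to_nodes(docs)
        with open(self.index_path, "w", encoding="utf-8") as f:
            json.dump(nodes, f)
        return nodes

    def _load_index(self):
        with open(self.index_path, encoding="utf-8") as f:
            return json.load(f)


def demo():
    embedder = CountingEmbedder()
    docs = ["AgentScope is a multi-agent platform.", "RAG retrieves context."]
    with tempfile.TemporaryDirectory() as persist_dir:
        first = ToyKnowledge("toy_rag", docs, persist_dir, embedder)
        calls_after_build = embedder.calls
        second = ToyKnowledge("toy_rag", docs, persist_dir, embedder)
        calls_after_load = embedder.calls
    return first.index == second.index, calls_after_build, calls_after_load
```

Calling `demo()` builds the index once and then constructs a second object over the same directory; the embedder's call count does not grow on the second construction because the persisted index is loaded instead of rebuilt.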
diff --git a/docs/sphinx_doc/zh_CN/source/tutorial/209-rag.md b/docs/sphinx_doc/zh_CN/source/tutorial/209-rag.md
@@ -0,0 +1,121 @@
+(209-rag-zh)=
+
+# A Quick Introduction to RAG in AgentScope
+
+This tutorial introduces three RAG-related concepts in AgentScope: Knowledge, Knowledge Bank, and RAG agent.
+
+### Knowledge
+The Knowledge modules (currently only `LlamaIndexKnowledge`; support for LangChain is coming soon) are responsible for handling all RAG-related operations.
+
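+Here, we use `LlamaIndexKnowledge` as an example to illustrate the operations within the `Knowledge` module.
+When a `LlamaIndexKnowledge` object is initialized, `LlamaIndexKnowledge.__init__` performs the following steps:
+  *  It processes the documents and builds a retrieval index (done in `LlamaIndexKnowledge._data_to_index(...)`), which includes
+      * loading the documents with `LlamaIndexKnowledge._data_to_docs(...)`;
+      * preprocessing the documents into nodes with the preprocessing methods and the embedding model in `LlamaIndexKnowledge._docs_to_nodes(...)`;
+      * generating the index from the processed nodes.
+  * If the index already exists, `LlamaIndexKnowledge._load_index(...)` is called to load it and avoid repeated embedding calls.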
+
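+Users can create a RAG module from a JSON configuration that specifies 1) the data path, 2) the data loader, 3) the data preprocessing methods, and 4) the embedding model (model config name).
+See the following for a detailed example: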
+  <details>
+  <summary> A detailed configuration example </summary>
+
+  ```json
+  [
+  {
+    "knowledge_id": "{your_knowledge_id}",
+    "emb_model_config_name": "{your_embed_model_config_name}",
+    "data_processing": [
+      {
+        "load_data": {
+          "loader": {
+            "create_object": true,
+            "module": "llama_index.core",
+            "class": "SimpleDirectoryReader",
+            "init_args": {
+              "input_dir": "{path_to_your_data_dir_1}",
+              "required_exts": [".md"]
+            }
+          }
+        }
+      },
+      {
+        "load_data": {
+          "loader": {
+            "create_object": true,
+            "module": "llama_index.core",
+            "class": "SimpleDirectoryReader",
+            "init_args": {
+              "input_dir": "{path_to_your_python_code_data_dir}",
+              "recursive": true,
+              "required_exts": [".py"]
+            }
+          }
+        },
+        "store_and_index": {
+          "transformations": [
+            {
+              "create_object": true,
+              "module": "llama_index.core.node_parser",
+              "class": "CodeSplitter",
+              "init_args": {
+                "language": "python",
+                "chunk_lines": 100
+              }
+            }
+          ]
+        }
+      }
+    ]
+  }
+  ]
+  ```
+
+  </details>
+
+If users want to avoid the detailed configuration, `KnowledgeBank` also provides a quicker way (see below).
+  <br/>
+
+### Knowledge Bank
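+The knowledge bank maintains a collection of Knowledge modules (e.g., knowledge built from different datasets). Thus, different agents can reuse the knowledge modules without unnecessary re-initialization. Because configuring a RAG module may be too complicated for most users, the knowledge bank also provides a simple function call to create RAG modules.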
+
+* `KnowledgeBank.add_data_as_knowledge`: creates a RAG module. The simple way only requires `knowledge_id`, `emb_model_name` and `data_dirs_and_types`.
+  ```python
+  knowledge_bank.add_data_as_knowledge(
+      knowledge_id="agentscope_tutorial_rag",
+      emb_model_name="qwen_emb_config",
+      data_dirs_and_types={
+          "../../docs/sphinx_doc/en/source/tutorial": [".md"],
+      },
+  )
+  ```
+  For more advanced initialization, users can pass a knowledge module configuration as the `knowledge_config` parameter:
+  ```python
+  # load knowledge_config as dict
+  knowledge_bank.add_data_as_knowledge(
+      knowledge_id=knowledge_config["knowledge_id"],
+      emb_model_name=knowledge_config["emb_model_config_name"],
+      knowledge_config=knowledge_config,
+  )
+  ```
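+* `KnowledgeBank.get_knowledge`: accepts two parameters, `knowledge_id` and `duplicate`.
+  It returns the knowledge object with the given `knowledge_id`; if `duplicate` is true, the returned object is a deep copy.
+* `KnowledgeBank.equip`: accepts two parameters, `agent` and `duplicate`.
+  The function first checks whether the agent has a `rag_config`; if so, it provides the corresponding knowledge according to the `knowledge_id` in the `rag_config` and initializes the retriever(s) for the agent.
+  `duplicate` likewise decides whether the knowledge is deep copied.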
+
+
+### RAG agent
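+A RAG agent is an agent that can generate answers based on the retrieved knowledge.
+  * Agent using RAG: a RAG agent requires a `rag_config` in its configuration, which contains a list of `knowledge_id`.
+  * The agent can load specific knowledge from a `KnowledgeBank` by being passed to the `KnowledgeBank.equip` function.
+  * The agent can use the retrievers in its `reply` function to retrieve from the `Knowledge` and compose its prompt to the LLM.
+
+**Building RAG agent yourself.** As long as your agent's configuration has a `rag_config` attribute that is a dict containing a list of `knowledge_id`, you can pass the agent to `KnowledgeBank.equip`,
+which equips it with the knowledge listed in `knowledge_id` and the corresponding retrievers.
+You can decide how to use the retrievers, and even update and refresh the index, in the `reply` function.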
+
+[[Back to the top]](#209-rag-zh)
+
+
+
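The `duplicate` flag of `KnowledgeBank.get_knowledge` described above decides whether the caller receives the stored object itself or a deep copy. A minimal sketch of that behavior follows, assuming a plain dictionary registry; `ToyKnowledgeBank`, its `add_knowledge` method, and the default `duplicate=False` are hypothetical and only mirror the semantics stated in the tutorial.

```python
import copy


class ToyKnowledgeBank:
    """Hypothetical registry; only mirrors the described `duplicate` semantics."""

    def __init__(self):
        self.stored_knowledge = {}

    def add_knowledge(self, knowledge_id, knowledge):
        # Refuse to silently overwrite an existing entry.
        if knowledge_id in self.stored_knowledge:
            raise ValueError(f"knowledge_id {knowledge_id!r} already exists")
        self.stored_knowledge[knowledge_id] = knowledge

    def get_knowledge(self, knowledge_id, duplicate=False):
        knowledge = self.stored_knowledge[knowledge_id]
        # duplicate=True: the caller gets an independent deep copy.
        # duplicate=False: the caller shares the stored object.
        return copy.deepcopy(knowledge) if duplicate else knowledge


bank = ToyKnowledgeBank()
bank.add_knowledge("toy_rag", {"nodes": ["chunk-1"]})

shared = bank.get_knowledge("toy_rag")
private = bank.get_knowledge("toy_rag", duplicate=True)

private["nodes"].append("chunk-2")  # changes only the private copy
shared["nodes"].append("chunk-3")   # changes the object held by the bank
```

Sharing avoids re-initialization and extra memory, while a deep copy lets an agent update or refresh its own index without affecting other agents that use the same knowledge.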
diff --git a/examples/conversation_with_RAG_agents/README.md b/examples/conversation_with_RAG_agents/README.md
@@ -7,7 +7,6 @@ you will obtain three different agents who can help you answer different questio
 * **What is this example for?** This example shows how agents with retrieval augmented generation (RAG)
 capability can be built easily.
 
-**Notice:** This example is a Beta version of the AgentScope RAG agent. A formal version will soon be added to `src/agentscope/agents`, but it may be subject to changes.
 
 ## Prerequisites
 * **Cloning repo:** This example requires cloning the whole AgentScope repo to local.
@@ -23,35 +22,28 @@ capability can be used to build easily.
 **Note:** This example has been tested with `dashscope_chat` and `dashscope_text_embedding` model wrapper, with `qwen-max` and `text-embedding-v2` models.
 However, you are welcome to replace the Dashscope language and embedding model wrappers or models with other models you like to test.
 
-## Start AgentScope Consultants
+## Start AgentScope Copilots
 * **Terminal:** The simplest way to run the AgentScope Consultants is from the terminal.
   ```bash
   python ./rag_example.py
   ```
-  Setting `log_retrieval` to `false` in `agent_config.json` can hide the retrieved information and provide only answers of agents.
+
 
 * **AS studio:** If you want a more organized, cleaner UI, you can also run the example with our `as_studio`.
   ```bash
   as_studio ./rag_example.py
   ```
 
-### Customize AgentScope Consultants to other consultants
+### Agents in the example
+Customize AgentScope Consultants to other consultants
 After you run the example, you may notice that it consists of the following agents:
-* `AgentScope Tutorial Assistant`: responsible for answering questions based on AgentScope tutorials (markdown files).
-* `AgentScope Framework Code Assistant`: responsible for answering questions based on AgentScope code base (python files).
-* `Summarize Assistant`: responsible for summarize the questions from the above two agents.
-
-These agents can be configured to answering questions based on other GitHub repo, by simply modifying the `input_dir` fields in the `agent_config.json`.
-
-For more advanced customization, we may need to learn a little bit from the following.
+* `Tutorial-Assistant`: responsible for answering questions based on AgentScope tutorials (markdown files).
+* `Code-Search-Assistant`: responsible for answering questions based on AgentScope code base (python files).
+* `API-Assistant`: responsible for answering questions based on AgentScope API documents (html files, generated by `sphinx`).
+* `Searching-Assistant`: responsible for general search in the tutorial and code base (markdown files and code files).
+* `Agent-Guiding-Assistant`: responsible for referring users to the correct agent(s) among the above ones.
 
-**RAG modules:** In AgentScope, RAG modules are abstract to provide three basic functions: `load_data`, `store_and_index` and `retrieve`. Refer to `src/agentscope/rag` for more details.
+Except for the last one, `Agent-Guiding-Assistant`, all agents can be configured to answer questions based on other GitHub repos by replacing the `knowledge`.
 
-**RAG configs:** In the example configuration (the `rag_config` field), all parameters are optional. But if you want to customize them, you may want to learn the following:
-*  `load_data`: contains all parameters for the the `rag.load_data` function.
-Since the `load_data` accepts a dataloader object `loader`, the `loader` in the config need to have `"create_object": true` to let a internal parse create a LlamaIndex data loader object.
-The loader object is an instance of `class` in module `module`, with initialization parameters in `init_args`.
+For more details about how to use the RAG module in AgentScope, please refer to the tutorial.
 
-* `store_and_index`: contains all parameters for the the `rag.store_and_index` function.
-For example, you can pass `vector_store` and `retriever` configurations in a similar way as the `loader` mentioned above.
-For the `transformations` parameter, you can pass a list of dicts, each of which corresponds to building a `NodeParser`-kind of preprocessor in Llamaindex.
diff --git a/examples/conversation_with_RAG_agents/configs/agent_config.json b/examples/conversation_with_RAG_agents/configs/agent_config.json
@@ -41,7 +41,7 @@
       "emb_model_config_name": "qwen_emb_config",
       "rag_config": {
           "knowledge_id": ["agentscope_api_rag"],
-          "similarity_top_k": 3,
+          "similarity_top_k": 2,
           "log_retrieval": true,
           "recent_n_mem": 1
       }
@@ -68,7 +68,6 @@
     "class": "DialogAgent",
     "args": {
       "name": "Agent-Guiding-Assistant",
-      "description": "Agent-Guiding-Assistant is an agent that decide which agent should provide the answer next. It can answer questions about specific functions and classes in AgentScope.",
       "sys_prompt": "You're an assistant guiding the user to specific agent for help. The answer is in a cheerful styled language. The output starts with appreciation for the question. Next, rephrase the question in a simple declarative Sentence for example, 'I think you are asking...'. Last, if the question is about detailed code or example in AgentScope Framework, output '@ Code-Search-Assistant you might be suitable for answering the question'; if the question is about API or function calls (Example: 'Is there function related...' or 'how can I initialize ...' ) in AgentScope, output '@ API-Assistant, I think you are more suitable for the question, please tell us more about it'; if question is about where to find some context (Example:'where can I find...'), output '@ Searching-Assistant, we need your help', otherwise, output '@ Tutorial-Assistant, I think you are more suitable for the question, can you tell us more about it?'. The answer is expected to be only one sentence",
       "model_config_name": "qwen_config",
       "use_memory": false
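The `similarity_top_k` value changed in the `rag_config` above controls how many retrieved chunks are handed to the agent. As a rough illustration of what such a setting means, here is a sketch of top-k retrieval by cosine similarity over toy vectors; the functions `cosine_similarity` and `retrieve_top_k` and the sample data are hypothetical, and the actual retrieval in this example is delegated to the configured retriever rather than to code like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, nodes, similarity_top_k=2):
    """Return the texts of the `similarity_top_k` nodes most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in nodes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:similarity_top_k]]


# Toy (text, embedding) pairs; real embeddings come from the embedding model.
nodes = [
    ("chunk about agents", [1.0, 0.0]),
    ("chunk about pipelines", [0.0, 1.0]),
    ("chunk about agent memory", [0.8, 0.6]),
]
top_chunks = retrieve_top_k([1.0, 0.1], nodes, similarity_top_k=2)
```

A smaller `similarity_top_k` keeps the prompt shorter at the cost of possibly missing relevant context; a larger one does the opposite.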

diff --git a/examples/conversation_with_RAG_agents/configs/knowledge_config.json b/examples/conversation_with_RAG_agents/configs/knowledge_config.json
@@ -39,7 +39,7 @@
   {
     "knowledge_id": "agentscope_api_rag",
     "emb_model_config_name": "qwen_emb_config",
-    "chunk_size": 2048,
+    "chunk_size": 1024,
     "chunk_overlap": 40,
     "data_processing": [
       {