Support Pydantic based validation within the MarkdownJsonDictParser (#…
DavdGao authored Jun 5, 2024
1 parent a0064ea commit 21c8826
Showing 6 changed files with 214 additions and 6 deletions.
46 changes: 46 additions & 0 deletions docs/sphinx_doc/en/source/tutorial/203-parser.md
@@ -15,6 +15,7 @@
- [Dictionary Type](#dictionary-type)
- [MarkdownJsonDictParser](#markdownjsondictparser)
- [Initialization & Format Instruction Template](#initialization--format-instruction-template)
- [Validation](#validation)
- [MultiTaggedContentParser](#multitaggedcontentparser)
- [Initialization & Format Instruction Template](#initialization--format-instruction-template-1)
- [Parse Function](#parse-function-1)
@@ -77,6 +78,8 @@ AgentScope provides multiple built-in parsers, and developers can choose accordi
> In contrast, `MultiTaggedContentParser` guides the LLM to generate each key-value pair separately in individual tags and then combines them into a dictionary, thus reducing the difficulty.

> **NOTE**: The built-in strategies for constructing format instructions are only examples. In AgentScope, developers have complete control over prompt construction, so they can skip the format instruction provided by a parser and instead write their own by hand or implement a new parser class.
In the following sections, we will introduce the usage of these parsers based on different target formats.

### String Type
@@ -300,6 +303,49 @@ This parameter can be a string or a dictionary. For dictionary, it will be autom
```
````
##### Validation
The `content_hint` parameter of `MarkdownJsonDictParser` also supports type validation based on Pydantic. When initializing the parser, you can set `content_hint` to a Pydantic model class; AgentScope then builds the `format_instruction` attribute from this class and uses Pydantic to validate the dictionary returned by the LLM during parsing.
A simple example follows. In it, `...` marks each field as required; specific validation rules can be passed to `Field`, as described in the [Pydantic](https://docs.pydantic.dev/latest/) documentation.
```python
from pydantic import BaseModel, Field
from agentscope.parsers import MarkdownJsonDictParser


class Schema(BaseModel):
    speak: str = Field(..., description="what you speak")
    thought: str = Field(..., description="what you thought")
    end_discussion: bool = Field(..., description="whether the discussion reached an agreement or not")


parser = MarkdownJsonDictParser(content_hint=Schema)
```
- The corresponding `format_instruction` attribute
````
Respond a JSON dictionary in a markdown's fenced code block as follows:
```json
{a_JSON_dictionary}
```
The generated JSON dictionary MUST follow this schema:
{'properties': {'speak': {'description': 'what you speak', 'title': 'Speak', 'type': 'string'}, 'thought': {'description': 'what you thought', 'title': 'Thought', 'type': 'string'}, 'end_discussion': {'description': 'whether the discussion reached an agreement or not', 'title': 'End Discussion', 'type': 'boolean'}}, 'required': ['speak', 'thought', 'end_discussion'], 'title': 'Schema', 'type': 'object'}
````
- During parsing, Pydantic validates the dictionary, and an exception is raised if validation fails. Pydantic also provides some fault tolerance, such as converting the string `"true"` to Python's `True`:
````
from agentscope.models import ModelResponse

# `parse` takes a ModelResponse, not a raw string
parser.parse(ModelResponse(text="""
```json
{
    "thought": "The others didn't realize I was a werewolf. I should end the discussion soon.",
    "speak": "I agree with you.",
    "end_discussion": "true"
}
```
"""))
````
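
The coercion and failure behaviour described above can be tried with Pydantic alone, without an LLM or AgentScope. The following is a minimal sketch, assuming Pydantic is installed; the helper `validate` is a hypothetical name introduced here for illustration, and its `dict(Schema(**raw))` call mirrors the validation step added in this commit.

```python
from pydantic import BaseModel, Field, ValidationError


class Schema(BaseModel):
    speak: str = Field(..., description="what you speak")
    thought: str = Field(..., description="what you thought")
    end_discussion: bool = Field(
        ...,
        description="whether the discussion reached an agreement or not",
    )


def validate(raw: dict) -> dict:
    """Hypothetical helper: validate a parsed JSON dict, return coerced values."""
    return dict(Schema(**raw))


# The string "true" is accepted and coerced to the boolean True.
ok = validate(
    {"speak": "I agree with you.", "thought": "xxx", "end_discussion": "true"},
)

# A value that cannot be interpreted as a boolean is rejected.
try:
    validate(
        {"speak": "I agree with you.", "thought": "xxx", "end_discussion": "maybe"},
    )
    failed = False
except ValidationError:
    failed = True
```

In this sketch the second call fails with Pydantic's own `ValidationError`; according to the diff, the parser wraps such failures in `JsonParsingError` together with the raw response text.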
#### MultiTaggedContentParser
`MultiTaggedContentParser` asks LLM to generate specific content within multiple tag pairs. The content from different tag pairs will be parsed into a single Python dictionary. Its usage is similar to `MarkdownJsonDictParser`, but the initialization method is different, and it is more suitable for weak LLMs or complex return content.
47 changes: 47 additions & 0 deletions docs/sphinx_doc/zh_CN/source/tutorial/203-parser.md
@@ -15,6 +15,7 @@
- [Dictionary Type](#字典dict类型)
- [MarkdownJsonDictParser](#markdownjsondictparser)
- [Initialization & Format Instruction Template](#初始化--响应格式模版)
- [Type Validation](#类型校验)
- [MultiTaggedContentParser](#multitaggedcontentparser)
- [Initialization & Format Instruction Template](#初始化--响应格式模版-1)
- [Parse Function](#解析函数-1)
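@@ -75,6 +76,8 @@ AgentScope provides a variety of parsers, and developers can, according to their own needs

> **NOTE**: Compared with `MarkdownJsonDictParser`, `MultiTaggedContentParser` is better suited to less capable models and to cases where the content the LLM must return is overly complex. For example, when the LLM returns Python code directly inside a dictionary, it has to take care of escaping special characters (\t, \n, ...) and of the distinction between double and single quotes when the result is read by `json.loads`. `MultiTaggedContentParser` instead has the model return each key-value pair in its own tag and then assembles them into a dictionary, which makes the response easier for the LLM to produce.
> **NOTE**: The format instructions built into AgentScope are not necessarily the best choice. In AgentScope, developers have full control over prompt construction, so not using the format instruction built into a parser, and instead writing a custom one or implementing a new parser class, are all feasible approaches.
In the following sections, we introduce the usage of these parsers according to the target format.

### String (`str`) Type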
@@ -297,6 +300,50 @@ In AgentScope, by calling the `to_content`, `to_memory`, and `to_metadata` methods
```
````
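##### Type Validation
The `content_hint` parameter of `MarkdownJsonDictParser` also supports type validation based on Pydantic. When initializing the parser, you can set `content_hint` to a Pydantic model class; AgentScope then builds the `format_instruction` attribute from this class and uses Pydantic to validate the dictionary returned by the LLM during parsing.
This feature requires the LLM to understand prompts in JSON schema format, so it is suited to more capable models.
A simple example follows. In it, `...` marks each field as required; specific validation rules can be passed to `Field`, as described in the [Pydantic](https://docs.pydantic.dev/latest/) documentation.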
```python
from pydantic import BaseModel, Field
from agentscope.parsers import MarkdownJsonDictParser


class Schema(BaseModel):
    speak: str = Field(..., description="what you speak")
    thought: str = Field(..., description="what you thought")
    end_discussion: bool = Field(..., description="whether the discussion reached an agreement or not")


parser = MarkdownJsonDictParser(content_hint=Schema)
```
- The corresponding `format_instruction` attribute
````
Respond a JSON dictionary in a markdown's fenced code block as follows:
```json
{a_JSON_dictionary}
```
The generated JSON dictionary MUST follow this schema:
{'properties': {'speak': {'description': 'what you speak', 'title': 'Speak', 'type': 'string'}, 'thought': {'description': 'what you thought', 'title': 'Thought', 'type': 'string'}, 'end_discussion': {'description': 'whether the discussion reached an agreement or not', 'title': 'End Discussion', 'type': 'boolean'}}, 'required': ['speak', 'thought', 'end_discussion'], 'title': 'Schema', 'type': 'object'}
````
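- During parsing, Pydantic is also used for type validation, and a validation error raises an exception. Pydantic additionally provides some fault tolerance, such as converting the string `"true"` to Python's `True`:
````
from agentscope.models import ModelResponse

# `parse` takes a ModelResponse, not a raw string
parser.parse(ModelResponse(text="""
```json
{
    "thought": "The others didn't realize I was a werewolf. I should end the discussion soon.",
    "speak": "I agree with you.",
    "end_discussion": "true"
}
```
"""))
````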
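
The `format_instruction` text shown above is the result of plain `str.format` substitution. The sketch below reproduces that step with the standard library only. The template is copied from `_format_instruction_with_schema` in this commit; the `schema` dictionary is a hand-written, shortened stand-in for what `model_json_schema()` would return, not real output.

```python
# Template copied from `_format_instruction_with_schema` in this commit.
FORMAT_INSTRUCTION_WITH_SCHEMA = (
    "Respond a JSON dictionary in a markdown's fenced code block as "
    "follows:\n"
    "```json\n"
    "{content_hint}\n"
    "```\n"
    "The generated JSON dictionary MUST follow this schema: \n"
    "{schema}"
)

# Assumption: a shortened, hand-written stand-in for a Pydantic JSON schema.
schema = {
    "properties": {
        "speak": {
            "description": "what you speak",
            "title": "Speak",
            "type": "string",
        },
    },
    "required": ["speak"],
    "title": "Schema",
    "type": "object",
}

# Braces inside the substituted values are not re-interpreted by `format`,
# so the literal placeholder "{a_JSON_dictionary}" survives unchanged.
instruction = FORMAT_INSTRUCTION_WITH_SCHEMA.format(
    content_hint="{a_JSON_dictionary}",
    schema=schema,
)

print(instruction)
```

Because the dictionary is interpolated with `str()`, the schema appears with Python-style single quotes rather than as strict JSON, which matches the instruction text shown above.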
#### MultiTaggedContentParser
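`MultiTaggedContentParser` asks the LLM to generate specified content within multiple specified tag pairs; the contents of the different tags are parsed together into a single Python dictionary. Its usage is similar to `MarkdownJsonDictParser`, but the initialization differs, and it is better suited to less capable LLMs or more complex return content.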
1 change: 1 addition & 0 deletions setup.py
@@ -54,6 +54,7 @@
# released requires
minimal_requires = [
    "docstring_parser",
    "pydantic",
    "loguru==0.6.0",
    "tiktoken",
    "Pillow",
4 changes: 4 additions & 0 deletions src/agentscope/exception.py
@@ -24,6 +24,10 @@ class JsonParsingError(ResponseParsingError):
    """The exception class for JSON parsing error."""


class JsonDictValidationError(ResponseParsingError):
    """The exception class for JSON dict validation error."""


class JsonTypeError(ResponseParsingError):
    """The exception class for JSON type error."""

64 changes: 59 additions & 5 deletions src/agentscope/parsers/json_object_parser.py
@@ -1,10 +1,12 @@
# -*- coding: utf-8 -*-
"""The parser for JSON object in the model response."""
import inspect
import json
from copy import deepcopy
from typing import Optional, Any, List, Sequence, Union

from loguru import logger
from pydantic import BaseModel

from agentscope.exception import (
    TagNotFoundError,
@@ -139,11 +141,22 @@ class MarkdownJsonDictParser(MarkdownJsonObjectParser, DictFilterMixin):
    """Closing end for a code block."""

    _format_instruction = (
-       "You should respond a json object in a json fenced code block as "
+       "Respond a JSON dictionary in a markdown's fenced code block as "
        "follows:\n```json\n{content_hint}\n```"
    )
    """The instruction for the format of the json object."""

    _format_instruction_with_schema = (
        "Respond a JSON dictionary in a markdown's fenced code block as "
        "follows:\n"
        "```json\n"
        "{content_hint}\n"
        "```\n"
        "The generated JSON dictionary MUST follow this schema: \n"
        "{schema}"
    )
    """The schema instruction for the format of the json object."""

    required_keys: List[str]
    """A list of required keys in the JSON dictionary object. If the response
    misses any of the required keys, it will raise a
@@ -164,7 +177,8 @@ def __init__(
                The hint used to remind LLM what should be fill between the
                tags. If it is a string, it will be used as the content hint
                directly. If it is a dict, it will be converted to a json
-               string and used as the content hint.
+               string and used as the content hint. If it's a Pydantic model,
+               the schema will be displayed in the instruction.
            required_keys (`List[str]`, defaults to `[]`):
                A list of required keys in the JSON dictionary object. If the
                response misses any of the required keys, it will raise a
@@ -177,7 +191,7 @@ def __init__(
                - `str`, the corresponding value will be returned
                - `List[str]`, a filtered dictionary will be returned
                - `True`, the whole dictionary will be returned
-           keys_to_content (`Optional[Union[str, bool, Sequence[str]]`,
+           keys_to_content (`Optional[Union[str, bool, Sequence[str]]]`,
            defaults to `True`):
                The key or keys to be filtered in `to_content` method. If
                it's
@@ -195,8 +209,23 @@ def __init__(
                - `True`, the whole dictionary will be returned
        """
        # Initialize the markdown json object parser
-       MarkdownJsonObjectParser.__init__(self, content_hint)
+       self.pydantic_class = None

        # Initialize the content_hint according to the type of content_hint
        if inspect.isclass(content_hint) and issubclass(
            content_hint,
            BaseModel,
        ):
            self.pydantic_class = content_hint
            self.content_hint = "{a_JSON_dictionary}"
        elif content_hint is not None:
            if isinstance(content_hint, str):
                self.content_hint = content_hint
            else:
                self.content_hint = json.dumps(
                    content_hint,
                    ensure_ascii=False,
                )

        # Initialize the mixin class to allow filtering the parsed response
        DictFilterMixin.__init__(
@@ -208,6 +237,21 @@ def __init__(

        self.required_keys = required_keys or []

    @property
    def format_instruction(self) -> str:
        """Get the format instruction for the json object, if the
        format_example is provided, it will be used as the example.
        """
        if self.pydantic_class is None:
            return self._format_instruction.format(
                content_hint=self.content_hint,
            )
        else:
            return self._format_instruction_with_schema.format(
                content_hint=self.content_hint,
                schema=self.pydantic_class.model_json_schema(),
            )

    def parse(self, response: ModelResponse) -> ModelResponse:
        """Parse the text field of the response to a JSON dictionary object,
        store it in the parsed field of the response object, and check if the
@@ -224,6 +268,16 @@ def parse(self, response: ModelResponse) -> ModelResponse:
                response.text,
            )

        # Requirement checking by Pydantic
        if self.pydantic_class is not None:
            try:
                response.parsed = dict(self.pydantic_class(**response.parsed))
            except Exception as e:
                raise JsonParsingError(
                    message=str(e),
                    raw_response=response.text,
                ) from None

        # Check if the required keys exist
        keys_missing = []
        for key in self.required_keys:
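
The validation branch above follows a common pattern: run a validator over the parsed dictionary and re-raise any failure as a parsing error that carries the raw response. A stdlib-only sketch of that pattern follows. All names here (`ParsingError`, `parse_and_validate`, `require_bool_flag`) are hypothetical stand-ins, and the regular expression is an assumption about how a fenced block might be located; none of this is AgentScope's actual implementation.

```python
import json
import re
from typing import Any, Callable, Dict


class ParsingError(Exception):
    """Hypothetical stand-in for a parsing exception that keeps the raw text."""

    def __init__(self, message: str, raw_response: str) -> None:
        super().__init__(message)
        self.raw_response = raw_response


def parse_and_validate(
    text: str,
    validator: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Extract the first json fenced block, load it, then validate it."""
    match = re.search(r"```json\s*(.*?)\s*```", text, re.DOTALL)
    if match is None:
        raise ParsingError("no json fenced code block found", text)
    try:
        return dict(validator(json.loads(match.group(1))))
    except Exception as e:
        # Same shape as the commit: drop the original traceback, keep the text.
        raise ParsingError(str(e), text) from None


def require_bool_flag(data: Dict[str, Any]) -> Dict[str, Any]:
    """Toy strict validator: `end_discussion` must be a real boolean."""
    if not isinstance(data.get("end_discussion"), bool):
        raise ValueError("end_discussion must be a boolean")
    return data


FENCE = "`" * 3  # built this way only to keep the example easy to embed
good = f'{FENCE}json\n{{"speak": "hi", "end_discussion": true}}\n{FENCE}'
bad = f'{FENCE}json\n{{"speak": "hi", "end_discussion": "maybe"}}\n{FENCE}'

parsed = parse_and_validate(good, require_bool_flag)

try:
    parse_and_validate(bad, require_bool_flag)
    error = None
except ParsingError as e:
    error = e
```

Unlike Pydantic, the toy validator here does no coercion; it only illustrates how a validation failure can be surfaced to the caller as a single parsing exception with the original response attached.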
58 changes: 57 additions & 1 deletion tests/parser_test.py
@@ -2,6 +2,8 @@
"""Unit test for model response parser."""
import unittest

from pydantic import BaseModel, Field

from agentscope.models import ModelResponse
from agentscope.parsers import (
    MarkdownJsonDictParser,
@@ -27,7 +29,7 @@ def setUp(self) -> None:
            ),
        )
        self.instruction_dict_1 = (
-           "You should respond a json object in a json fenced code block "
+           "Respond a JSON dictionary in a markdown's fenced code block "
            "as follows:\n"
            "```json\n"
            '{"speak": "what you speak", '
@@ -59,6 +61,22 @@ def setUp(self) -> None:
            '"end_discussion": true/false}'
        )

        self.instruction_dict_3 = (
            "Respond a JSON dictionary in a markdown's fenced code block as "
            "follows:\n"
            "```json\n"
            "{a_JSON_dictionary}\n"
            "```\n"
            "The generated JSON dictionary MUST follow this schema: \n"
            "{'properties': {'speak': {'description': 'what you speak', "
            "'title': 'Speak', 'type': 'string'}, 'thought': {'description': "
            "'what you thought', 'title': 'Thought', 'type': 'string'}, "
            "'end_discussion': {'description': 'whether the discussion "
            "reached an agreement or not', 'title': 'End Discussion', "
            "'type': 'boolean'}}, 'required': ['speak', 'thought', "
            "'end_discussion'], 'title': 'Schema', 'type': 'object'}"
        )

        self.gt_to_memory = {"speak": "Hello, world!", "thought": "xxx"}
        self.gt_to_content = "Hello, world!"
        self.gt_to_metadata = {"end_discussion": True}
@@ -104,6 +122,44 @@ def setUp(self) -> None:
        )
        self.gt_code = """\nprint("Hello, world!")\n"""

    def test_markdownjsondictparser_with_schema(self) -> None:
        """Test for MarkdownJsonDictParser with schema"""

        class Schema(BaseModel):  # pylint: disable=missing-class-docstring
            speak: str = Field(description="what you speak")
            thought: str = Field(description="what you thought")
            end_discussion: bool = Field(
                description="whether the discussion reached an agreement or "
                "not",
            )

        parser = MarkdownJsonDictParser(
            content_hint=Schema,
            keys_to_memory=["speak", "thought"],
            keys_to_content="speak",
            keys_to_metadata=["end_discussion"],
        )

        self.assertEqual(parser.format_instruction, self.instruction_dict_3)

        res = parser.parse(self.res_dict_1)

        self.assertDictEqual(res.parsed, self.gt_dict)

        res = parser.parse(
            ModelResponse(
                text="""```json
{
"speak" : "Hello, world!",
"thought" : "xxx",
"end_discussion" : "true"
}
```""",
            ),
        )

        self.assertDictEqual(res.parsed, self.gt_dict)

    def test_markdownjsondictparser(self) -> None:
        """Test for MarkdownJsonDictParser"""
        parser = MarkdownJsonDictParser(
