StructuredOutputParser can't handle nested triple backticks #6734

Closed
5 tasks done
hnykda opened this issue Sep 11, 2024 · 6 comments · Fixed by #7199
Labels
auto:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@hnykda

hnykda commented Sep 11, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain.js documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain.js rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import { StructuredOutputParser } from 'langchain/output_parsers';
import { z } from 'zod';

async function main() {
  const c =
    '{  "revisedOutput": "long markdown text with example ```javascript\\nfunction(){}``` here"}';

  const parser = StructuredOutputParser.fromZodSchema(
    z.object({
      revisedOutput: z.string(),
    }),
  );
  // the following line will throw an error
  const response = await parser.parse(c);
  console.log(response);
}

main();

Error Message and Stack Trace (if applicable)

/Users/dan/work/varuna/code/varuna/node_modules/langchain/dist/output_parsers/structured.cjs:83
            throw new output_parsers_1.OutputParserException(`Failed to parse. Text: "${text}". Error: ${e}`, text);
                  ^

OutputParserException [Error]: Failed to parse. Text: "{  "revisedOutput": "long markdown text with example ```javascript\nfunction(){}``` here"}". Error: SyntaxError: Unexpected token 'j', "javascript"... is not valid JSON
    at StructuredOutputParser.parse (/Users/dan/work/varuna/code/varuna/node_modules/langchain/dist/output_parsers/structured.cjs:83:19)
    at main (/Users/dan/work/varuna/code/varuna/app/dev/parse.tsx:15:33)
    at <anonymous> (/Users/dan/work/varuna/code/varuna/app/dev/parse.tsx:19:1)
    at Object.<anonymous> (/Users/dan/work/varuna/code/varuna/app/dev/parse.tsx:19:6)
    at Module._compile (node:internal/modules/cjs/loader:1546:14)
    at Object.transformer (/Users/dan/work/varuna/code/varuna/node_modules/tsx/dist/register-C1urN2EO.cjs:2:1122)
    at Module.load (node:internal/modules/cjs/loader:1317:32)
    at Module._load (node:internal/modules/cjs/loader:1127:12)
    at TracingChannel.traceSync (node:diagnostics_channel:315:14)
    at wrapModuleLoad (node:internal/modules/cjs/loader:217:24) {
  llmOutput: '{  "revisedOutput": "long markdown text with example ```javascript\\nfunction(){}``` here"}',
  observation: undefined,
  sendToLLM: false
}

Description

  1. I am parsing a JSON response whose string value is simple Markdown, and that Markdown might contain triple backticks.
  2. StructuredOutputParser fails in that scenario, because it treats the inner triple backticks as a code fence and splits the string in the middle.

System Info

$ npm info langchain

langchain@0.2.18 | MIT | deps: 14 | versions: 291
Typescript bindings for langchain
https://github.com/langchain-ai/langchainjs/tree/main/langchain/

keywords: llm, ai, gpt3, chain, prompt, prompt engineering, chatgpt, machine learning, ml, openai, embeddings, vectorstores

dist
.tarball: https://registry.npmjs.org/langchain/-/langchain-0.2.18.tgz
.shasum: f43e9aa8588048bfee38bf2d39e508c0410c4295
.integrity: sha512-7+5Y2FsdjlUBKJf1N+MyHn1PrGZQ9m87l3du7IeH4RiQp4j8Bz7mGV+Bjhr8FEXFjo85zLrbnpsN87pZCstubA==
.unpackedSize: 3.5 MB

dependencies:
@langchain/core: >=0.2.21 <0.3.0  langsmith: ~0.1.40                
@langchain/openai: >=0.1.0 <0.3.0 openapi-types: ^12.1.3            
@langchain/textsplitters: ~0.0.0  p-retry: 4                        
binary-extensions: ^2.2.0         uuid: ^10.0.0                     
js-tiktoken: ^1.0.12              yaml: ^2.2.1                      
js-yaml: ^4.1.0                   zod-to-json-schema: ^3.22.3       
jsonpointer: ^5.0.1               zod: ^3.22.4                      

maintainers:
- nfcampos <[email protected]>
- jacoblee93 <[email protected]>
- andrewnguonly <[email protected]>
- davidduong <[email protected]>
- hwchase17 <[email protected]>
- basproul <[email protected]>

dist-tags:
latest: 0.2.18    next: 0.3.0-rc.0  

published 6 days ago by jacoblee93 <[email protected]>

dosubot bot commented Sep 11, 2024

Hey @hnykda! I'm here to help you with your LangChain.js issue. I'm a bot designed to assist with bug fixes, answer questions, and support contributors. Let's tackle this bug together! 🤖

The StructuredOutputParser in LangChain.js cannot handle nested triple backticks because the parse method splits the input text by triple backticks to extract the JSON content. This approach fails when there are nested triple backticks, as it incorrectly splits the text, leading to parsing errors.

Here is the relevant code snippet from the parse method:

async parse(text: string): Promise<z.infer<T>> {
  try {
    const json = text.includes("```")
      ? text.trim().split(/```(?:json)?/)[1]
      : text.trim();
    const escapedJson = json
      .replace(/"([^"\\]*(\\.[^"\\]*)*)"/g, (_match, capturedGroup) => {
        const escapedInsideQuotes = capturedGroup.replace(/\n/g, "\\n");
        return `"${escapedInsideQuotes}"`;
      })
      .replace(/\n/g, "");

    return await this.schema.parseAsync(JSON.parse(escapedJson));
  } catch (e) {
    throw new OutputParserException(
      `Failed to parse. Text: "${text}". Error: ${e}`,
      text
    );
  }
}

The split(/```(?:json)?/) line is where the issue arises, as it does not account for nested triple backticks [1].
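To make the failure concrete, here is a minimal sketch that runs only the splitting step on the string from the issue, without importing LangChain.js. The variable names are illustrative and not taken from the library.

```typescript
// Reproduces the fence-splitting step in isolation (no LangChain import needed).
const rawOutput: string =
  '{  "revisedOutput": "long markdown text with example ```javascript\\nfunction(){}``` here"}';

// Same expression as in parse(): split on ``` optionally followed by "json".
const pieces: string[] = rawOutput.trim().split(/```(?:json)?/);

// pieces[0] is the JSON up to the first inner fence,
// pieces[1] is the embedded code sample, pieces[2] is the JSON tail.
const candidate: string = pieces[1] ?? "";

let parseError: unknown = null;
try {
  JSON.parse(candidate);
} catch (e) {
  parseError = e;
}

console.log(pieces.length);
console.log(candidate);
console.log(parseError instanceof SyntaxError);
```

The second piece starts with "javascript", which is why the reported SyntaxError complains about the token 'j'.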

For handling JSON content that includes nested triple backticks, you can use the parseJsonMarkdown utility function provided by LangChain.js. This function is designed to handle such cases:

export function parseJsonMarkdown(s: string, parser = parsePartialJson) {
  s = s.trim();
  const match = /```(json)?(.*)```/s.exec(s);
  if (!match) {
    return parser(s);
  } else {
    return parser(match[2]);
  }
}

This function uses a regular expression to detect and extract JSON content enclosed in triple backticks and then parses it using the parsePartialJson function [2].

You might want to consider using this utility function or modifying the StructuredOutputParser to handle nested triple backticks more effectively.


@dosubot dosubot bot added the auto:bug Related to a bug, vulnerability, unexpected error with an existing feature label Sep 11, 2024
@hnykda
Author

hnykda commented Sep 11, 2024

Well, I know, bot; that's what I linked.

I am sharing this because I think this is unexpected behavior, and it took me an hour to figure out what was happening, because I didn't expect LangChain to just split the string when there are backticks somewhere in the middle. I am not sure how to solve this, but maybe the error message could at least be improved, or the parser could first try to parse the text "as is" and only then remove the backticks.
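The "parse as is first" idea could look roughly like the sketch below. `parseJsonLenient` is a hypothetical helper, not part of LangChain.js, and it assumes that a code fence, when present, wraps the entire payload.

```typescript
// Hypothetical helper (not a LangChain.js API): try the raw text first,
// and only strip a Markdown fence if plain JSON parsing fails.
function parseJsonLenient(text: string): unknown {
  const trimmed = text.trim();
  try {
    // Valid JSON is returned untouched, even if its strings contain ```.
    return JSON.parse(trimmed);
  } catch {
    // Not plain JSON; fall through to the fence handling below.
  }
  // Assumption: the opening fence is at the very start and the closing
  // fence at the very end, so inner ``` sequences are left alone.
  const fenced = /^```(?:json)?\s*([\s\S]*?)\s*```$/.exec(trimmed);
  if (fenced === null) {
    throw new SyntaxError(
      `Not valid JSON and no enclosing code fence found: ${trimmed.slice(0, 40)}`
    );
  }
  return JSON.parse(fenced[1] ?? "");
}

const fromIssue = parseJsonLenient(
  '{  "revisedOutput": "long markdown text with example ```javascript\\nfunction(){}``` here"}'
);
console.log(fromIssue);
console.log(parseJsonLenient('```json\n{"x": 1}\n```'));
```

The result would still need to go through the Zod schema; this sketch only covers the step that turns the text into a JSON value.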

@CarlosSLoureiro

I'm having the same issue. I'm trying to build an Extraction Chain, but sometimes, even with the same prompt, it throws the parse error.

A not-so-good workaround I'm using until this is fixed is retry logic. For example:

const tryExtract = async (prompt: string, retries = 0) => {
  try {
    return await extract(prompt); // Here is the Extraction Chain logic
  } catch (error) {
    if (retries <= 5) {
      return tryExtract(prompt, retries + 1);
    }
    throw error;
  }
};

tryExtract("prompt here").then(console.log);



@tlandart

tlandart commented Oct 8, 2024

Hi, my group and I are taking a look at this (as part of a University of Toronto course project). We're currently investigating whether the proposed solution works.

@johnguirgis
Contributor

Hello, I'm working with @tlandart on this.

It seems @hnykda's idea could work, though we think it is probably cleaner to handle this case explicitly in the try block by splitting only when the triple backticks are outside the curly brackets. It could look something like this:

...
try {
        const pattern = /```\s*\{/;
        const json = pattern.test(text)
            ? text.trim().split(/```(?:json)?/)[1]
            : text.trim();
        return await this.schema.parseAsync(JSON.parse(json));
    }
...

We have tested this approach with a few possible values and it seems to behave as expected, but if there is anything we have overlooked, please let us know.

We're first-time contributors, so any comments on this would be appreciated!
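One way to check the proposed guard is a small standalone script like the sketch below. The sample inputs and the `taggedGuard` variant are assumptions made for illustration; neither is taken from the merged fix.

```typescript
// Guard as proposed in the comment above.
const proposedGuard = /```\s*\{/;
// Hypothetical variant (an assumption, not the merged fix): also allow a
// "json" language tag between the opening fence and the brace.
const taggedGuard = /```(?:json)?\s*\{/;

// Mirrors the proposed try block: split only when the guard matches.
function extractJson(text: string, guard: RegExp): string {
  const trimmed = text.trim();
  return guard.test(trimmed)
    ? trimmed.split(/```(?:json)?/)[1] ?? ""
    : trimmed;
}

// Reports whether the extracted text is valid JSON.
function parses(text: string, guard: RegExp): boolean {
  try {
    JSON.parse(extractJson(text, guard));
    return true;
  } catch {
    return false;
  }
}

// Illustrative inputs: bare JSON with inner backticks, an untagged fence,
// and a fence tagged "json".
const samples = {
  bare: '{"revisedOutput": "see ```js\\nf()``` here"}',
  fenced: '```\n{"revisedOutput": "plain"}\n```',
  tagged: '```json\n{"revisedOutput": "plain"}\n```',
};

for (const [label, text] of Object.entries(samples)) {
  console.log(label, parses(text, proposedGuard), parses(text, taggedGuard));
}
```

With these inputs, the `tagged` sample does not match the guard as written, because "json" sits between the fence and the brace, so it may be worth covering that case in the tests. Fenced output whose string values themselves contain triple backticks would still be split mid-string by either guard.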

@hnykda
Author

hnykda commented Oct 11, 2024

Looks valid to me. Maybe include some tests in your PR; that would help. 🤷
