StructuredOutputParser can't handle nested triple backticks #6734

hnykda · 2024-09-11T13:59:03Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain.js documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain.js rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import { StructuredOutputParser } from 'langchain/output_parsers';
import { z } from 'zod';

async function main() {
  const c =
    '{  "revisedOutput": "long markdown text with example ```javascript\\nfunction(){}``` here"}';

  const parser = StructuredOutputParser.fromZodSchema(
    z.object({
      revisedOutput: z.string(),
    }),
  );
  // the following line will throw an error
  const response = await parser.parse(c);
  console.log(response);
}

main();

Error Message and Stack Trace (if applicable)

/Users/dan/work/varuna/code/varuna/node_modules/langchain/dist/output_parsers/structured.cjs:83
            throw new output_parsers_1.OutputParserException(`Failed to parse. Text: "${text}". Error: ${e}`, text);
                  ^

OutputParserException [Error]: Failed to parse. Text: "{  "revisedOutput": "long markdown text with example ```javascript\nfunction(){}``` here"}". Error: SyntaxError: Unexpected token 'j', "javascript"... is not valid JSON
    at StructuredOutputParser.parse (/Users/dan/work/varuna/code/varuna/node_modules/langchain/dist/output_parsers/structured.cjs:83:19)
    at main (/Users/dan/work/varuna/code/varuna/app/dev/parse.tsx:15:33)
    at <anonymous> (/Users/dan/work/varuna/code/varuna/app/dev/parse.tsx:19:1)
    at Object.<anonymous> (/Users/dan/work/varuna/code/varuna/app/dev/parse.tsx:19:6)
    at Module._compile (node:internal/modules/cjs/loader:1546:14)
    at Object.transformer (/Users/dan/work/varuna/code/varuna/node_modules/tsx/dist/register-C1urN2EO.cjs:2:1122)
    at Module.load (node:internal/modules/cjs/loader:1317:32)
    at Module._load (node:internal/modules/cjs/loader:1127:12)
    at TracingChannel.traceSync (node:diagnostics_channel:315:14)
    at wrapModuleLoad (node:internal/modules/cjs/loader:217:24) {
  llmOutput: '{  "revisedOutput": "long markdown text with example ```javascript\\nfunction(){}``` here"}',
  observation: undefined,
  sendToLLM: false
}

Description

I am parsing simple Markdown text that may contain triple backticks.
StructuredOutputParser fails in that scenario because it splits the string on the backticks, even when they sit in the middle of a JSON string value.

System Info

$ npm info langchain

langchain@0.2.18 | MIT | deps: 14 | versions: 291
Typescript bindings for langchain
https://github.com/langchain-ai/langchainjs/tree/main/langchain/

keywords: llm, ai, gpt3, chain, prompt, prompt engineering, chatgpt, machine learning, ml, openai, embeddings, vectorstores

dist
.tarball: https://registry.npmjs.org/langchain/-/langchain-0.2.18.tgz
.shasum: f43e9aa8588048bfee38bf2d39e508c0410c4295
.integrity: sha512-7+5Y2FsdjlUBKJf1N+MyHn1PrGZQ9m87l3du7IeH4RiQp4j8Bz7mGV+Bjhr8FEXFjo85zLrbnpsN87pZCstubA==
.unpackedSize: 3.5 MB

dependencies:
@langchain/core: >=0.2.21 <0.3.0  langsmith: ~0.1.40                
@langchain/openai: >=0.1.0 <0.3.0 openapi-types: ^12.1.3            
@langchain/textsplitters: ~0.0.0  p-retry: 4                        
binary-extensions: ^2.2.0         uuid: ^10.0.0                     
js-tiktoken: ^1.0.12              yaml: ^2.2.1                      
js-yaml: ^4.1.0                   zod-to-json-schema: ^3.22.3       
jsonpointer: ^5.0.1               zod: ^3.22.4                      

maintainers:
- nfcampos
- jacoblee93
- andrewnguonly
- davidduong
- hwchase17
- basproul

dist-tags:
latest: 0.2.18    next: 0.3.0-rc.0  

published 6 days ago by jacoblee93

dosubot · 2024-09-11T14:01:08Z

Hey @hnykda! I'm here to help you with your LangChain.js issue. I'm a bot designed to assist with bug fixes, answer questions, and support contributors. Let's tackle this bug together! 🤖

The StructuredOutputParser in LangChain.js cannot handle nested triple backticks because the parse method splits the input text by triple backticks to extract the JSON content. This approach fails when there are nested triple backticks, as it incorrectly splits the text, leading to parsing errors.

Here is the relevant code snippet from the parse method:

async parse(text: string): Promise<z.infer<T>> {
  try {
    const json = text.includes("```")
      ? text.trim().split(/```(?:json)?/)[1]
      : text.trim();
    const escapedJson = json
      .replace(/"([^"\\]*(\\.[^"\\]*)*)"/g, (_match, capturedGroup) => {
        const escapedInsideQuotes = capturedGroup.replace(/\n/g, "\\n");
        return `"${escapedInsideQuotes}"`;
      })
      .replace(/\n/g, "");

    return await this.schema.parseAsync(JSON.parse(escapedJson));
  } catch (e) {
    throw new OutputParserException(
      `Failed to parse. Text: "${text}". Error: ${e}`,
      text
    );
  }
}

The split(/```(?:json)?/) line is where the issue arises, as it does not account for nested triple backticks ^[1].
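
To see why the split misfires on the input from the issue, the same expression can be run in isolation. This is a minimal sketch that copies only the split regex quoted above; the variable names are illustrative and nothing here calls LangChain.js itself.

```typescript
// The unfenced JSON string from the issue report; the only backticks
// are inside the "revisedOutput" string value.
const text =
  '{  "revisedOutput": "long markdown text with example ```javascript\\nfunction(){}``` here"}';

// Same split expression as in the parse method quoted above.
const pieces = text.trim().split(/```(?:json)?/);

// pieces[1] is the text between the two backtick runs, i.e. the embedded
// code snippet rather than the JSON document.
const candidate = pieces[1] ?? "";

let parseError: unknown = null;
try {
  JSON.parse(candidate);
} catch (e) {
  parseError = e;
}

console.log(pieces.length); // 3
console.log(candidate.startsWith("javascript")); // true
console.log(parseError instanceof SyntaxError); // true
```

The SyntaxError here corresponds to the "Unexpected token 'j'" message in the stack trace: JSON.parse is handed the code snippet instead of the surrounding object.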

For JSON content that is itself wrapped in a fence and contains nested triple backticks, you can use the parseJsonMarkdown utility function provided by LangChain.js:

export function parseJsonMarkdown(s: string, parser = parsePartialJson) {
  s = s.trim();
  const match = /```(json)?(.*)```/s.exec(s);
  if (!match) {
    return parser(s);
  } else {
    return parser(match[2]);
  }
}

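This function uses a greedy regular expression to extract everything between the first and the last triple backticks and then parses it using the parsePartialJson function ^[2]. Note that this only helps when the JSON itself is fenced; for unfenced input such as the example in the issue, the match captures the embedded code snippet instead of the JSON.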
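
The extraction step can be checked without the library. The sketch below uses a local copy of the regex quoted above under a hypothetical helper name (extractFenced); it does not call parseJsonMarkdown or parsePartialJson, and plain JSON.parse stands in for the parser.

```typescript
// Local copy of the regex from the snippet above. `extractFenced` is a
// hypothetical name used only for this illustration.
function extractFenced(s: string): string | null {
  const match = /```(json)?(.*)```/s.exec(s.trim());
  return match ? match[2] : null;
}

// Case 1: the whole JSON document is fenced and contains a nested fence.
// The greedy (.*) runs to the LAST closing fence, so the JSON stays intact.
const fenced = '```json\n{"a": "x ```js\\ncode``` y"}\n```';
const inner = extractFenced(fenced);
const parsed = inner === null ? null : (JSON.parse(inner) as { a: string });

// Case 2: the unfenced input from the issue. The only backticks are the
// ones inside the string value, so the regex captures the code snippet.
const unfenced =
  '{  "revisedOutput": "long markdown text with example ```javascript\\nfunction(){}``` here"}';
const captured = extractFenced(unfenced);

console.log(parsed?.a.includes("```js")); // true
console.log(captured); // javascript\nfunction(){}  (backslash-n, not a newline)
```

In other words, a greedy match protects nested fences inside a fenced document, but it cannot tell an outer fence from backticks that only occur inside a string value.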

You might want to consider using this utility function or modifying the StructuredOutputParser to handle nested triple backticks more effectively.

hnykda · 2024-09-11T14:20:11Z

Well, I know, bot; that's what I linked.

I am sharing this because I think this is unexpected behavior, and it took me an hour to figure out what was happening: I didn't expect LangChain to split the string just because there are backticks somewhere inside it. I am not sure how to solve this, but at the very least the error message could be improved, or the parser could first try to parse the text as is and only then strip the backticks.
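
The "parse as is first" idea could look roughly like the following. This is only a sketch under stated assumptions: parseJsonPreferRaw is a hypothetical helper, not part of LangChain.js, and it leaves out the schema validation and newline escaping that the real parse method performs.

```typescript
// Hypothetical helper, not part of LangChain.js: try the text as plain JSON
// first, and only look for a Markdown fence if that fails.
function parseJsonPreferRaw(text: string): unknown {
  const trimmed = text.trim();
  try {
    return JSON.parse(trimmed);
  } catch {
    // Fall back to the content between the first and the last fence.
    const match = /```(?:json)?(.*)```/s.exec(trimmed);
    if (!match) {
      throw new SyntaxError(`No JSON found in: ${trimmed}`);
    }
    return JSON.parse(match[1]);
  }
}

// The unfenced input from the issue parses directly; backticks inside the
// string value are never inspected.
const fromIssue = parseJsonPreferRaw(
  '{  "revisedOutput": "long markdown text with example ```javascript\\nfunction(){}``` here"}',
) as { revisedOutput: string };

// A fenced document still works through the fallback path.
const fromFence = parseJsonPreferRaw('```json\n{"ok": true}\n```') as {
  ok: boolean;
};

console.log(fromIssue.revisedOutput.includes("```javascript")); // true
console.log(fromFence.ok); // true
```

Trying the raw text first keeps the common case (the model returns bare JSON) independent of whatever the string values contain.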

CarlosSLoureiro · 2024-09-13T13:53:59Z

I'm having the same issue. I'm trying to build an extraction chain, but sometimes, even with the same prompt, it throws the parse error.

A less-than-ideal workaround I'm using until this is fixed is retry logic. For example:

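// `extract` stands for the extraction chain logic (not shown here).
// The explicit return type is needed because the function calls itself.
const tryExtract = async (
  prompt: string,
  retries = 0,
): Promise<Awaited<ReturnType<typeof extract>>> => {
  try {
    return await extract(prompt);
  } catch (error) {
    // Retry while the counter is at most 5, then rethrow the last error.
    if (retries <= 5) {
      return tryExtract(prompt, retries + 1);
    }
    throw error;
  }
};

tryExtract("prompt here").then(console.log);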

tlandart · 2024-10-08T20:50:24Z

Hi, my group and I are taking a look at this (as part of a University of Toronto course project). We're currently investigating if the proposed solution works.

johnguirgis · 2024-10-11T01:59:47Z

Hello, I'm working with @tlandart on this.

It seems @hnykda's idea could work, though we think it is probably cleaner to handle this case explicitly in the try block by only splitting when the triple backticks are outside of the curly brackets. It could look something like this:

...
try {
  // Only split when a fence is directly followed by the opening brace.
  const pattern = /```\s*\{/;
  const json = pattern.test(text)
    ? text.trim().split(/```(?:json)?/)[1]
    : text.trim();
  return await this.schema.parseAsync(JSON.parse(json));
}
...

We have tested this approach with some possible values and it seems to behave as expected, but, if there is anything we have overlooked, please let us know.
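
A standalone way to try such a guard on a few inputs is sketched below. It is not the merged fix: selectJson and FENCE_BEFORE_OBJECT are hypothetical names, and the guard here additionally accepts an optional `json` language tag before the opening brace, which is an assumption made so that tagged fences are still detected.

```typescript
// Hypothetical standalone version of the guard; not the merged fix.
// Assumption: `(?:json)?` is included so that "```json" fences also match.
const FENCE_BEFORE_OBJECT = /```(?:json)?\s*\{/;

function selectJson(text: string): string {
  const trimmed = text.trim();
  // Split only when a fence directly precedes the opening curly bracket.
  return FENCE_BEFORE_OBJECT.test(trimmed)
    ? trimmed.split(/```(?:json)?/)[1]
    : trimmed;
}

// Unfenced input from the issue: the guard does not fire, so the text is
// parsed unchanged.
const unfencedResult = JSON.parse(
  selectJson(
    '{  "revisedOutput": "long markdown text with example ```javascript\\nfunction(){}``` here"}',
  ),
) as { revisedOutput: string };

// Tagged and untagged fences around the whole document: the guard fires.
const taggedResult = JSON.parse(selectJson('```json\n{"a": 1}\n```')) as { a: number };
const untaggedResult = JSON.parse(selectJson('```\n{"a": 1}\n```')) as { a: number };

console.log(unfencedResult.revisedOutput.includes("```javascript")); // true
console.log(taggedResult.a, untaggedResult.a); // 1 1
```

One case this sketch does not cover is a fenced document whose string values contain further backticks: the split would still cut the JSON at the first inner fence.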

We're first-time contributors, so any comments on this would be appreciated!

hnykda · 2024-10-11T06:03:31Z

Looks valid to me. Include some tests in your PR; that will help. 🤷

dosubot bot added the auto:bug label (Related to a bug, vulnerability, unexpected error with an existing feature) Sep 11, 2024

anadi45 mentioned this issue Nov 3, 2024

fix: structured output parser #7144

Open

johnguirgis mentioned this issue Nov 13, 2024

fix(langchain): Fix structured parser with triple backticks, adds tests #7199

Merged

jacoblee93 closed this as completed in #7199 Nov 17, 2024
