feat(fetch): add fetching of raw text, pagination and keeping links in the markdown #130

Merged
jackadamson merged 8 commits into main from jadamson/fetch-use-readabilityjs
Nov 29, 2024

jackadamson
Contributor

Description

Add:

  • Fetching of raw (non-markdownified) text
  • Pagination of long page responses
  • Use of the JavaScript Readability implementation if Node.js is present

Server Details

  • Server: fetch
  • Changes to: tools

Motivation and Context

The current version cannot handle pages that aren't HTML, meaning that you can't fetch e.g. markdown files or raw source code files. Raw text is therefore added as a fallback, along with an argument that forces this behavior (e.g. if the model needs access to the raw HTML).
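A minimal sketch of the fallback decision described above. The function name `should_return_raw`, the `force_raw` parameter, and the HTML detection heuristic (content type plus a peek at the start of the body) are assumptions for illustration, not necessarily what the server does:

```python
def should_return_raw(content_type: str, body: str, force_raw: bool = False) -> bool:
    """Decide whether to return the page text as-is instead of markdown.

    Hypothetical helper: the detection heuristic here is an assumption.
    """
    if force_raw:
        # The caller explicitly asked for the unprocessed response.
        return True
    looks_like_html = (
        "text/html" in content_type.lower()
        or "<html" in body[:100].lower()
    )
    # Anything that is not HTML cannot be simplified, so return it raw.
    return not looks_like_html
```

With this shape, a markdown or source file falls through to raw text automatically, while an HTML page is only returned raw when the argument is set.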

Supporting raw reads can quickly fill up the conversation, so this PR sets a maximum length (which the model can override) and allows specifying a start offset. When testing by asking Claude questions about a long documentation page, it paginated through the document until it found the answer and didn't read any further.
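The pagination idea can be sketched roughly as follows. The names `paginate`, `start_index`, and `max_length`, and the default of 5000 characters, are assumptions chosen for illustration:

```python
from typing import Optional, Tuple


def paginate(content: str, start_index: int = 0, max_length: int = 5000) -> Tuple[str, Optional[int]]:
    """Return one chunk of content and the offset of the next chunk.

    The second element is None once the end of the content is reached.
    Hypothetical helper: names and the default length are assumptions.
    """
    if start_index < 0 or max_length <= 0:
        raise ValueError("start_index must be >= 0 and max_length must be > 0")
    chunk = content[start_index:start_index + max_length]
    next_index = start_index + len(chunk)
    return chunk, (next_index if next_index < len(content) else None)


# Simulate a client that keeps requesting the next chunk until the page ends.
page = "x" * 12000
offset: Optional[int] = 0
requests_made = 0
while offset is not None:
    _, offset = paginate(page, start_index=offset)
    requests_made += 1
```

Returning the next offset alongside the chunk lets the caller stop as soon as it has enough information, which matches the behavior observed in testing.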

The library readabilipy, which is used to simplify the HTML before converting it to markdown, is based on the more mature JavaScript implementation. With use_readability enabled, it uses the Node.js implementation if node is installed and otherwise falls back to the Python implementation. The main benefit is that links are preserved in the markdown, so if an answer isn't found on the page but the page links to where it can be found, the model can make a request for the new URL.
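A rough sketch of enabling this with readabilipy. The wrapper name `simplify_html` is hypothetical, and the import is done lazily only so the snippet can be loaded without the third-party package installed; the fallback to the Python implementation is handled by readabilipy itself, as described above:

```python
def simplify_html(html: str) -> str:
    """Return simplified article HTML for a page (hypothetical wrapper)."""
    # Third-party dependency: pip install readabilipy
    from readabilipy import simple_json_from_html_string

    # use_readability=True requests the Node.js Readability implementation.
    article = simple_json_from_html_string(html, use_readability=True)
    # "content" holds the simplified HTML; it may be empty for trivial pages.
    return article["content"] or ""
```

The simplified HTML would then be passed on to the markdown conversion step.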

How Has This Been Tested?

Tested in Claude Desktop, including cases that:

  • Raw fetching:
    • Claude still typically uses the non-raw fetch for most cases to get the markdown
    • Providing a URL for a non-HTML page implicitly provides the raw text
    • When asked about page metadata, Claude correctly uses the raw argument to get the raw HTML
  • Readability link viewing:
    • Works as before if Node.js is not installed
    • If Node.js is installed, links are kept in the markdown
    • If you provide Claude a page with a question and the page recommends a different page for answering the question, it will request it instead
  • Pagination:
    • If Claude is asked to answer a question based on documentation, it will continue to fetch more of the document (if it's too long) until it has sufficient information, unless the page provides a link to where to look, in which case Claude will swap to the new page

Breaking Changes

No, there are no changes to the config.

Types of changes

  • New MCP Server
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Protocol Documentation
  • My server follows MCP security best practices
  • I have updated the server's README accordingly
  • I have tested this with an LLM client
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have documented all environment variables and configuration options

@jackadamson jackadamson merged commit e0234c7 into main Nov 29, 2024
21 of 23 checks passed
@jackadamson jackadamson deleted the jadamson/fetch-use-readabilityjs branch November 29, 2024 15:09