Singlefile

Author: kwinsch

Description

An MCP server for intelligent web content extraction from JavaScript-heavy sites using single-file and trafilatura. It enables AI agents to fetch, render, and paginate through clean article content and metadata.

README

A Model Context Protocol (MCP) server that extracts web content using single-file and trafilatura. It is aimed at AI agents that need to read and analyze pages from JavaScript-heavy sites.

GitHub Repository: https://github.com/kwinsch/singlefile-mcp

Features

🌐 Universal Web Content Access

JavaScript Support: Handles modern SPA/React/Vue apps that require browser rendering
Clean Content Extraction: Uses trafilatura to isolate the main article text from page boilerplate
Rich Metadata: Extracts title, author, date, description, and more
Multiple Output Formats: Raw HTML or clean markdown-like content

📄 Smart Pagination & Token Management

Flexible Pagination: Offset/limit system like file reading tools
Token Limits: Configurable max tokens (up to 25,000)
Smart Truncation: Summary mode shows beginning + end, truncate mode cuts cleanly
Navigation Hints: Clear guidance on how to continue reading large documents
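
The two truncation modes can be pictured with a minimal sketch. It is not taken from the server's source: the function names, the omission marker, and the ratio of 4 characters per token are all assumptions made for illustration.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate; the 4-chars-per-token ratio is an assumption."""
    return (len(text) + chars_per_token - 1) // chars_per_token


def fit_to_budget(text: str, max_tokens: int = 20000,
                  method: str = "truncate", chars_per_token: int = 4) -> str:
    """Shrink text to roughly max_tokens using 'truncate' or 'summary' mode."""
    budget = max_tokens * chars_per_token
    if len(text) <= budget:
        return text
    if method == "truncate":
        cut = text[:budget]
        # Prefer to end on a whitespace boundary so no word is split.
        space = cut.rfind(" ")
        return cut[:space] if space > 0 else cut
    if method == "summary":
        # Keep the beginning and the end, joined by a hypothetical marker.
        marker = "\n[... content omitted ...]\n"
        half = max((budget - len(marker)) // 2, 0)
        return text[:half] + marker + (text[-half:] if half else "")
    raise ValueError(f"unknown truncate method: {method!r}")
```

For example, `fit_to_budget(page_text, max_tokens=1000, method="summary")` would keep roughly the first and last 2,000 characters of a long page, while `estimate_tokens` gives a quick check of whether any shrinking is needed at all.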

⚡ Performance & Control

Selective Loading: Block images/scripts for faster processing
Content Compression: Optional HTML compression
Timeout Protection: Configurable timeouts prevent hanging
Error Handling: Graceful degradation when extraction fails
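
Timeout protection and graceful degradation might look like the sketch below, which wraps an external command and reports failures as messages instead of raising. The helper name and the message wording are hypothetical; only the standard library `subprocess` module is used.

```python
import subprocess
import sys
from typing import List, Tuple


def run_with_timeout(command: List[str], timeout_seconds: float) -> Tuple[bool, str]:
    """Run a command; return (ok, output_or_error_message) instead of raising."""
    try:
        done = subprocess.run(command, capture_output=True, text=True,
                              timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return False, f"Timed out after {timeout_seconds} seconds"
    except FileNotFoundError:
        return False, f"Command not found: {command[0]}"
    if done.returncode != 0:
        return False, f"Exited with code {done.returncode}: {done.stderr.strip()}"
    return True, done.stdout


if __name__ == "__main__":
    # A deliberately slow child process to trigger the timeout branch.
    print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5))
```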

Installation

Prerequisites

Python 3.8+
single-file CLI - Web page capture tool
Node.js 16+ (for single-file)
A supported browser (Chromium, Chrome, Edge, Firefox, etc.)

Install single-file CLI

The single-file CLI is essential for this MCP server to work. It uses a real browser engine to accurately capture JavaScript-rendered content.

npm install -g single-file-cli

Usage with Claude Code

Quick Install (from PyPI)

claude mcp add singlefile-mcp -s user -- uvx singlefile-mcp

This installs and runs the package from PyPI through uvx, much as the Brave Search MCP server is run through npx.

Development Install (from local directory)

claude mcp add singlefile-mcp -s user -- uvx --from /path/to/single-file_mcp singlefile-mcp

Remove old server (if upgrading)

claude mcp remove single-file-fetcher --scope user

Optional: Add Brave Search MCP

claude mcp add brave-search -s user -- env BRAVE_API_KEY=YOUR_KEY npx -y @modelcontextprotocol/server-brave-search

API Reference

fetch_webpage

Fetch and process web content with intelligent extraction.

Parameters

Parameter	Type	Default	Description
`url`	string	required	URL of the webpage to fetch
`output_content`	boolean	`true`	Whether to return content in response
`extract_content`	boolean	`false`	Extract clean text content (recommended)
`include_metadata`	boolean	`true`	Include page metadata (title, author, etc.)
`block_images`	boolean	`false`	Block image downloads for faster processing
`block_scripts`	boolean	`true`	Block JavaScript execution
`compress_html`	boolean	`true`	Compress HTML output
`max_tokens`	number	`20000`	Maximum tokens in response (max: 25000)
`truncate_method`	string	`"truncate"`	How to handle large content: `"truncate"` or `"summary"`
`offset`	number	`0`	Character offset to start reading from
`limit`	number	`null`	Maximum characters to return
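
As a reading aid, the table can be mirrored in a small dataclass. The defaults are copied from the table above, but the class itself, its name, and its validation rules are hypothetical. In particular, clamping `max_tokens` to 25000 is an assumption, since the README does not say how the server treats larger values.

```python
from dataclasses import dataclass
from typing import Optional

MAX_TOKENS_CEILING = 25000  # documented upper bound for max_tokens


@dataclass
class FetchOptions:
    """Defaults copied from the parameter table; the class is hypothetical."""
    url: str
    output_content: bool = True
    extract_content: bool = False
    include_metadata: bool = True
    block_images: bool = False
    block_scripts: bool = True
    compress_html: bool = True
    max_tokens: int = 20000
    truncate_method: str = "truncate"
    offset: int = 0
    limit: Optional[int] = None

    def __post_init__(self) -> None:
        if self.truncate_method not in ("truncate", "summary"):
            raise ValueError("truncate_method must be 'truncate' or 'summary'")
        if self.offset < 0:
            raise ValueError("offset must be >= 0")
        if self.limit is not None and self.limit < 0:
            raise ValueError("limit must be >= 0 or None")
        # Assumption: clamp instead of rejecting values above the ceiling.
        self.max_tokens = min(self.max_tokens, MAX_TOKENS_CEILING)
```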

Examples

Basic content extraction:

fetch_webpage(
    url="https://example.com/article",
    extract_content=True,
    include_metadata=True
)

Paginated reading of large documents:

# Get overview
fetch_webpage(
    url="https://docs.example.com/guide",
    extract_content=True,
    limit=5000
)

# Continue reading from offset
fetch_webpage(
    url="https://docs.example.com/guide", 
    extract_content=True,
    offset=5000,
    limit=5000
)
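
The two calls above generalize to a loop that keeps advancing the offset. The sketch below assumes a hypothetical adapter, `fetch_chunk`, that wraps the `fetch_webpage` tool call and returns the chunk text together with the total character count. How such an adapter talks to the MCP server is left open.

```python
from typing import Callable, Iterator, Tuple

# Hypothetical adapter signature: (url, offset, limit) -> (chunk_text, total_chars).
FetchChunk = Callable[[str, int, int], Tuple[str, int]]


def iter_chunks(fetch_chunk: FetchChunk, url: str, limit: int = 5000) -> Iterator[str]:
    """Yield successive chunks until the reported total has been read."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    offset = 0
    while True:
        chunk, total = fetch_chunk(url, offset, limit)
        if not chunk:
            break  # nothing returned, so stop instead of looping forever
        yield chunk
        offset += len(chunk)
        if offset >= total:
            break


if __name__ == "__main__":
    document = "x" * 12000

    def fake_fetch(url: str, offset: int, limit: int) -> Tuple[str, int]:
        # Stand-in for the real tool call, used only to exercise the loop.
        return document[offset:offset + limit], len(document)

    print([len(c) for c in iter_chunks(fake_fetch, "https://docs.example.com/guide")])
```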

Raw HTML for complex parsing:

fetch_webpage(
    url="https://app.example.com/dashboard",
    extract_content=False,
    block_scripts=False,
    max_tokens=15000
)

Practical Example: Research Workflow

Here's a real-world example combining Brave Search and Single-File MCP:

Step 1: Search for information

# Using Brave Search MCP
brave_web_search(
    query="artificial intelligence history timeline",
    count=5
)

Step 2: Fetch and analyze Wikipedia article

# Using Single-File MCP to extract content
fetch_webpage(
    url="https://en.wikipedia.org/wiki/History_of_artificial_intelligence",
    extract_content=True,
    include_metadata=True,
    limit=5000  # Get first 5000 chars
)

Result:

Successfully fetched webpage: https://en.wikipedia.org/wiki/History_of_artificial_intelligence

## Metadata
**Title:** History of artificial intelligence - Wikipedia
**Description:** The history of artificial intelligence (AI) began in antiquity...
**Site:** wikipedia.org

## Extracted Content (chars 0-5000 of 45000)
*Note: More content available. Use offset=5000 to continue.*

# History of artificial intelligence

The history of artificial intelligence (AI) began in antiquity, with myths, 
stories and rumors of artificial beings endowed with intelligence...

[Clean, readable article content follows...]

Step 3: Continue reading with pagination

# Get next section
fetch_webpage(
    url="https://en.wikipedia.org/wiki/History_of_artificial_intelligence",
    extract_content=True,
    offset=5000,
    limit=5000
)

This workflow enables AI agents to:

Search for current information beyond their training data
Extract clean, structured content from web pages
Process JavaScript-heavy sites that tools without browser rendering cannot handle
Paginate through long documents intelligently

Output Format

With Content Extraction

Successfully fetched webpage: https://example.com

## Metadata
**Title:** Example Article
**Author:** John Doe
**Date:** 2024-01-15
**Description:** An informative article about...
**Site:** example.com

## Extracted Content (chars 0-5000 of 12000)
*Note: More content available. Use offset=5000 to continue.*

# Article Title

This is the clean, readable content extracted from the webpage...

Pagination Info

When using offset/limit, responses include:

Current position: chars 1000-6000 of 12000
Navigation hint: Use offset=6000 to continue
Total size information
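
A client can read the position line to decide whether another request is needed. The sketch below assumes the header always follows the `chars START-END of TOTAL` pattern shown in the sample output. The regular expression and function name are illustrative and not part of the server.

```python
import re
from typing import Optional

# Matches the position text from the sample output, e.g. "chars 1000-6000 of 12000".
POSITION_RE = re.compile(r"chars (\d+)-(\d+) of (\d+)")


def next_offset(response_text: str) -> Optional[int]:
    """Return the offset for the next request, or None when nothing remains."""
    match = POSITION_RE.search(response_text)
    if match is None:
        return None
    _start, end, total = (int(group) for group in match.groups())
    return end if end < total else None
```

With the header `## Extracted Content (chars 1000-6000 of 12000)`, `next_offset` returns 6000, matching the navigation hint. Once the end position reaches the total it returns `None`.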

Use Cases

📚 Documentation Analysis

Perfect for reading large technical docs, API references, and guides that span multiple pages.

📰 News & Article Processing

Extract clean article content from news sites, blogs, and publications for analysis.

🔍 Research & Data Gathering

Gather structured data from websites, including metadata and clean text content.

🤖 AI Agent Integration

Enable AI agents to browse and understand web content, even from JavaScript-heavy applications.

⚖️ Legal Document Processing

Handle complex legal documents and government sites that require JavaScript rendering.

Technical Details

Content Extraction Pipeline

single-file: Renders JavaScript and saves complete webpage
trafilatura: Extracts the main content and discards navigation, ads, and other boilerplate
Pagination: Applies offset/limit for manageable chunks
Token Management: Ensures responses fit within LLM context limits
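
The first three stages can be sketched as follows. This is an outline under stated assumptions, not the server's actual code: it uses the basic `single-file <url> <output>` invocation with no extra flags, calls `trafilatura.extract` with default settings, and leaves out the token management stage. The function names are hypothetical, and the stages are injectable so the flow can be exercised without a browser.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Optional


def render_with_single_file(url: str, timeout_seconds: float = 60.0) -> str:
    """Save the rendered page with the single-file CLI and return its HTML."""
    with tempfile.TemporaryDirectory() as workdir:
        output_path = Path(workdir) / "page.html"
        # Basic invocation only; any additional CLI flags are omitted here.
        subprocess.run(["single-file", url, str(output_path)],
                       check=True, timeout=timeout_seconds)
        return output_path.read_text(encoding="utf-8")


def extract_with_trafilatura(html: str) -> str:
    """Extract the main text; imported lazily so the sketch loads without it."""
    import trafilatura
    return trafilatura.extract(html) or ""


def fetch_page(url: str, offset: int = 0, limit: Optional[int] = None,
               render: Callable[[str], str] = render_with_single_file,
               extract: Callable[[str], str] = extract_with_trafilatura) -> str:
    """Render, extract, then apply offset/limit to the extracted text."""
    html = render(url)
    text = extract(html)
    end = len(text) if limit is None else offset + limit
    return text[offset:end]
```

Passing the stages as arguments keeps the pagination logic testable with stub functions, and it makes the order of operations explicit: pagination applies to the extracted text, not to the raw HTML.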

Browser Engine

Uses a browser via single-file for full JavaScript support:

Works with any supported browser installed on your system
Waits for network idle before capture
Removes hidden elements and unused styles
Handles dynamic content loading

Metadata Extraction

Automatically extracts:

Page title and description
Author and publication date
Site name and language
Categories and tags (when available)
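
Fields that a page does not provide are simply absent from the output. A minimal sketch of rendering the metadata block in the format shown under Output Format, assuming the metadata arrives as a dictionary (the key names here are assumptions):

```python
from typing import Mapping, Optional

# Field order follows the sample output; the dictionary keys are assumptions.
FIELDS = (
    ("title", "Title"),
    ("author", "Author"),
    ("date", "Date"),
    ("description", "Description"),
    ("sitename", "Site"),
)


def format_metadata(metadata: Mapping[str, Optional[str]]) -> str:
    """Render available fields as a '## Metadata' block, skipping empty ones."""
    lines = ["## Metadata"]
    for key, label in FIELDS:
        value = metadata.get(key)
        if value:
            lines.append(f"**{label}:** {value}")
    return "\n".join(lines)
```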

Error Handling

Network Issues: Graceful timeout with informative errors
JavaScript Errors: Continues processing even if some scripts fail
Large Content: Automatic truncation with clear indicators
Invalid URLs: Clear validation error messages
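
URL validation with specific messages could be done before any browser is launched. The sketch below uses `urllib.parse` from the standard library. The accepted schemes and the message wording are assumptions, not the server's documented behavior.

```python
from urllib.parse import urlparse


def validate_url(url: str) -> str:
    """Return the stripped URL or raise ValueError with a specific message."""
    candidate = url.strip()
    if not candidate:
        raise ValueError("URL is empty")
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(
            f"Unsupported URL scheme {parsed.scheme!r}; use http or https")
    if not parsed.netloc:
        raise ValueError(f"URL has no host: {candidate!r}")
    return candidate
```

A bare `example.com` is rejected with the scheme message because it parses with an empty scheme, which tells the caller to add `https://`.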

Development Setup

Clone the repository:

git clone https://github.com/kwinsch/singlefile-mcp.git
cd singlefile-mcp

Install dependencies:

pip install -r requirements.txt

Install in development mode:

pip install -e .

Test locally with Claude Code:

claude mcp add singlefile-mcp -s user -- uvx --from . singlefile-mcp

License

MIT License - see LICENSE file for details.

Dependencies

single-file - Core web page capture tool that handles JavaScript rendering
trafilatura - Main-content and metadata extraction from HTML
mcp - Model Context Protocol for AI integration

Acknowledgments

single-file by Gildas Lormeau - Excellent web page capture tool
trafilatura - Robust content extraction library
Model Context Protocol - Standardized AI integration protocol

from github.com/kwinsch/singlefile-mcp

Installing Singlefile

This server has no published package; it is built from source. Open the repository and follow the instructions in the README.

▸ github.com/kwinsch/singlefile-mcp

FAQ

Is Singlefile MCP free?

Yes, Singlefile MCP is free: it installs in a couple of clicks through Unyly at no cost.

Does Singlefile need an API key?

No, Singlefile works without API keys or environment variables.

Is Singlefile hosted or self-hosted?

Self-hosted: the server runs locally on your machine using the command from the installation section.

How do I install Singlefile in Claude Desktop, Claude Code, or Cursor?

Open Singlefile on unyly.org, choose the tab for your client (Claude Desktop, Claude Code, or Cursor), and click Install. The config is generated automatically, with no JSON editing.