Get Started with Firecrawl in 5 Minutes
Contents
- What is Firecrawl?
- Local deployment
- Verification
- MCP installation
- Playground
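What is Firecrawl?
In one sentence:
An open-source web crawler combined with a content-cleaning engine.
• Automatically converts any web page into structured Markdown / JSON
• Supports recursive whole-site crawling, JS rendering, PDF parsing, and automatic image alt-text generation
• Provides a REST API, with official LangChain / LlamaIndex integrations
The official website includes a playground where you can try it out.
Click "Get Code" there to get a code template for the call: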
# Install with: pip install firecrawl-py
import asyncio
from firecrawl import AsyncFirecrawlApp


async def main():
    # Replace the placeholder with your own API key; never publish a real key.
    app = AsyncFirecrawlApp(api_key='fc-YOUR_API_KEY')
    response = await app.scrape_url(
        url='https://blog.csdn.net/u012399690/article/details/149668148',
        formats=['markdown'],
        only_main_content=True,
        parse_pdf=True,
        max_age=14400000,
    )
    print(response)


asyncio.run(main())
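Local deployment
The hosted service comes with 500 free credits. If you use it frequently or have strict privacy requirements, you can deploy it locally instead.
Step 1: clone the repository
git clone https://github.com/mendableai/firecrawl.git
cd firecrawl
Step 2: edit the configuration
cp apps/api/.env.example .env
Adjust the values as needed. To keep things simple, you can turn off authentication.
Minimal configuration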
NUM_WORKERS_PER_QUEUE=4
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html
USE_DB_AUTHENTICATION=false
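Before starting the containers, it can help to check that the `.env` file actually contains the settings listed above. The following is a minimal sketch and not part of Firecrawl: `parse_env` and `missing_keys` are hypothetical helper names, and the list of required keys is simply the seven settings from the minimal configuration.

```python
# Illustrative helper, not part of Firecrawl.
# Assumption: the "required" keys are the seven settings from the minimal configuration.
REQUIRED_KEYS = (
    "NUM_WORKERS_PER_QUEUE",
    "PORT",
    "HOST",
    "REDIS_URL",
    "REDIS_RATE_LIMIT_URL",
    "PLAYWRIGHT_MICROSERVICE_URL",
    "USE_DB_AUTHENTICATION",
)


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        values[key.strip()] = value.strip()
    return values


def missing_keys(env):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

To use it, read the file with `pathlib.Path(".env").read_text()`, pass the text to `parse_env`, and print the result of `missing_keys`; an empty list means every setting from the minimal configuration is present.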
🐳 Start
docker compose build   # build the images (needed on the first run)
docker compose up -d   # run in the background
Access:
- API:
http://localhost:3002
- Queue dashboard:
http://localhost:3002/admin/@/queues
Verification
Use the following cURL command to verify the service quickly from a terminal:
curl -X POST http://localhost:3002/v0/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://www.ithome.com/0/871/372.htm",
    "formats": ["markdown"],
    "onlyMainContent": true,
    "parsePDF": true,
    "maxAge": 14400000
  }'
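The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the cURL call above; `build_scrape_request` and `scrape` are hypothetical helper names, and the base URL is assumed to be the local deployment at http://localhost:3002.

```python
import json
import urllib.request


def build_scrape_request(base_url, page_url):
    """Build the POST request that the cURL command above sends."""
    payload = {
        "url": page_url,
        "formats": ["markdown"],
        "onlyMainContent": True,
        "parsePDF": True,
        "maxAge": 14400000,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v0/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(base_url, page_url, timeout=60.0):
    """Send the request and return the decoded JSON body (needs a running service)."""
    request = build_scrape_request(base_url, page_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the containers running, `scrape("http://localhost:3002", "https://www.ithome.com/0/871/372.htm")` should return the parsed response as a dictionary. Building the request in a separate function keeps the payload easy to inspect without touching the network.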
Sample response:
{
  "success": true,
  "data": {
    "content": "xxx",
    "markdown": "xxx",
    "linksOnPage": [
      "https://www.ithome.com/0/871/372.htm#",
      "https://m.ithome.com/"
    ],
    "metadata": {
      "ogImage": "https://img.ithome.com/m/images/logo.png",
      "language": "zh",
      "viewport": "width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no",
      "description": "智谱发布新一代旗舰模型GLM-4.5,专为智能体应用打造,综合能力达到开源SOTA,实测国内最佳。采用混合专家架构,提供两种模式,高速低成本。API已上线开放平台BigModel.cn,也可在智谱清言和z.ai免费体验。#AI大模型# #智谱GLM4.5#",
      "og:image": "https://img.ithome.com/m/images/logo.png",
      "format-detection": "telephone=no",
      "keywords": "智谱,GLM4.5,智能时代,人工智能",
      "apple-itunes-app": "app-id=570610859, app-argument=ithome://news?id=871372&type=news",
      "title": "智谱发布新一代旗舰开源模型 GLM-4.5,专为智能体应用打造 - IT之家",
      "apple-mobile-web-app-status-bar-style": "white",
      "apple-mobile-web-app-capable": "yes",
      "theme-color": "#fff",
      "favicon": "https://m.ithome.com/favicon.ico",
      "scrapeId": "07988df7-f880-4d8e-85ee-c434a2a931c3",
      "sourceURL": "https://www.ithome.com/0/871/372.htm",
      "url": "https://www.ithome.com/0/871/372.htm",
      "contentType": "text/html; charset=utf-8",
      "proxyUsed": "basic",
      "pageStatusCode": 200
    }
  },
  "returnCode": 200
}
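Most callers only need a few fields from this response. The sketch below pulls them out of an already-decoded response dictionary; `summarize_scrape` is a hypothetical helper, and the field names it reads (`success`, `data.markdown`, `data.linksOnPage`, `metadata.title`, `metadata.sourceURL`, `metadata.pageStatusCode`) are taken from the sample response, so other API versions may differ.

```python
def summarize_scrape(response):
    """Extract the commonly used fields from a decoded scrape response."""
    if not response.get("success"):
        raise ValueError("scrape did not succeed: %r" % (response,))
    data = response.get("data") or {}
    metadata = data.get("metadata") or {}
    return {
        "title": metadata.get("title"),
        "source_url": metadata.get("sourceURL"),
        "status_code": metadata.get("pageStatusCode"),
        "markdown": data.get("markdown", ""),
        "link_count": len(data.get("linksOnPage") or []),
    }
```

Raising on `success: false` keeps a failed scrape from being mistaken for an empty page further down the pipeline.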
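MCP installation
With an MCP client, you can let an AI assistant work with Firecrawl directly. Cherry Studio is used as the example here.
Copy the configuration below, or set it up in an MCP marketplace such as ModelScope (魔搭) and sync it with one click. The main thing to change is the API key.
{
  "mcpServers": {
    "mcp-server-firecrawl": {
      "command": "npx",
      "args": ["-y", "firecrawl-mcp"],
      "env": {
        "FIRECRAWL_API_KEY": "YOUR_API_KEY_HERE"
      }
    }
  }
}
To point the MCP server at a self-hosted service instead:
{
  "mcpServers": {
    "mcp-server-firecrawl": {
      "command": "npx",
      "args": ["-y", "firecrawl-mcp"],
      "env": {
        "FIRECRAWL_API_URL": "http://localhost:3002",
        "FIRECRAWL_API_KEY": "optional-if-you-enable-auth"
      }
    }
  }
}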
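The two configurations differ only in their `env` block, so they can be generated from one function. This is a minimal sketch; `firecrawl_mcp_config` is a hypothetical helper, and the server name, command, and environment variable names are copied from the configurations above.

```python
import json


def firecrawl_mcp_config(api_key="", api_url=""):
    """Build the MCP client configuration for the hosted or a self-hosted service."""
    env = {}
    if api_url:  # only needed when pointing at a self-hosted service
        env["FIRECRAWL_API_URL"] = api_url
    if api_key:  # optional for self-hosted services without authentication
        env["FIRECRAWL_API_KEY"] = api_key
    return {
        "mcpServers": {
            "mcp-server-firecrawl": {
                "command": "npx",
                "args": ["-y", "firecrawl-mcp"],
                "env": env,
            }
        }
    }


def to_json(config):
    """Serialize the configuration for pasting into an MCP client."""
    return json.dumps(config, indent=2)
```

For example, `to_json(firecrawl_mcp_config(api_url="http://localhost:3002"))` produces the self-hosted variant without a key.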
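Calling it from Cherry Studio
Playground
The open-source version does not ship with a playground, so it can only be used through the API or MCP. Here is a simple HTML page that fills the gap.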
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Firecrawl Self-Hosted UI</title>
  <meta name="viewport" content="width=device-width,initial-scale=1" />
  <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/css/bootstrap.min.css" rel="stylesheet" />
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.11.3/font/bootstrap-icons.css" rel="stylesheet" />
  <style>
    body { padding-top: 70px; background: #f8f9fa; }
    .card { box-shadow: 0 0.125rem 0.25rem rgba(0, 0, 0, 0.075); }
    .result-area {
      max-height: 400px;
      overflow-y: auto;
      font-family: SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace;
      font-size: 0.8rem;
    }
    .config-panel { transition: all 0.3s ease; }
    .collapse:not(.show) { display: none; }
  </style>
</head>
<body>
  <nav class="navbar navbar-expand navbar-dark bg-primary fixed-top">
    <div class="container-fluid">
      <a class="navbar-brand fw-bold" href="#"><i class="bi bi-fire"></i> Firecrawl UI</a>
      <button class="btn btn-outline-light btn-sm" data-bs-toggle="modal" data-bs-target="#configModal">
        <i class="bi bi-gear"></i> Settings
      </button>
    </div>
  </nav>
  <div class="container">
    <!-- Actions -->
    <div class="card mb-3">
      <div class="card-header">
        <ul class="nav nav-tabs card-header-tabs" id="mainTabs" role="tablist">
          <li class="nav-item" role="presentation">
            <button class="nav-link active" id="scrape-tab" data-bs-toggle="tab" data-bs-target="#scrape-pane" type="button" role="tab" aria-controls="scrape-pane" aria-selected="true">📥 Scrape a page</button>
          </li>
          <li class="nav-item" role="presentation">
            <button class="nav-link" id="crawl-tab" data-bs-toggle="tab" data-bs-target="#crawl-pane" type="button" role="tab" aria-controls="crawl-pane" aria-selected="false">🕸️ Crawl a site</button>
          </li>
        </ul>
      </div>
      <div class="card-body">
        <div class="tab-content" id="mainTabContent">
          <!-- Single-page scrape pane -->
          <div class="tab-pane fade show active" id="scrape-pane" role="tabpanel" aria-labelledby="scrape-tab">
            <div class="mb-3">
              <label for="scrapeUrl" class="form-label">Page URL</label>
              <input type="url" class="form-control" id="scrapeUrl" placeholder="https://docs.firecrawl.dev" />
              <div class="form-text">Enter the URL of the single page to scrape</div>
            </div>
            <button class="btn btn-primary" id="scrapeBtn" onclick="handleScrape()"><i class="bi bi-download"></i> Scrape now</button>
          </div>
          <!-- Whole-site crawl pane -->
          <div class="tab-pane fade" id="crawl-pane" role="tabpanel" aria-labelledby="crawl-tab">
            <div class="mb-3">
              <label for="crawlUrl" class="form-label">Site URL</label>
              <input type="url" class="form-control" id="crawlUrl" placeholder="https://docs.firecrawl.dev" />
              <div class="form-text">Enter the root URL of the site to crawl</div>
            </div>
            <div class="mb-3">
              <label for="maxPages" class="form-label">Max pages</label>
              <input type="number" class="form-control" id="maxPages" placeholder="10" min="1" max="100" value="10" />
              <div class="form-text">Maximum number of pages to crawl (1-100)</div>
            </div>
            <button class="btn btn-warning" id="crawlBtn" onclick="handleCrawl()"><i class="bi bi-globe"></i> Start crawl</button>
          </div>
        </div>
      </div>
    </div>
    <!-- Results -->
    <div class="card mb-3">
      <div class="card-header d-flex justify-content-between align-items-center">
        <span>📝 Result preview</span>
        <button class="btn btn-sm btn-outline-secondary d-none" id="copyBtn" onclick="copyResult()"><i class="bi bi-clipboard"></i> Copy</button>
      </div>
      <div class="card-body">
        <pre class="result-area border p-2 bg-light" id="result">Waiting for results...</pre>
      </div>
    </div>
  </div>
  <!-- Settings dialog -->
  <div class="modal fade" id="configModal" tabindex="-1" aria-labelledby="configModalLabel" aria-hidden="true">
    <div class="modal-dialog">
      <div class="modal-content">
        <div class="modal-header">
          <h5 class="modal-title" id="configModalLabel"><i class="bi bi-gear"></i> Service settings</h5>
          <button type="button" class="btn-close" data-bs-dismiss="modal" aria-label="Close"></button>
        </div>
        <div class="modal-body">
          <div class="mb-3">
            <label for="baseUrl" class="form-label">Base URL</label>
            <input type="url" class="form-control" id="baseUrl" placeholder="http://localhost:3002" value="http://localhost:3002" />
            <div class="form-text">Base address of the Firecrawl service</div>
          </div>
          <div class="mb-3">
            <label for="apiKey" class="form-label">API Key</label>
            <input type="password" class="form-control" id="apiKey" placeholder="Optional; leave empty if auth is disabled" />
            <div class="form-text">Enter an API key if the service requires authentication</div>
          </div>
        </div>
        <div class="modal-footer">
          <button type="button" class="btn btn-secondary" data-bs-dismiss="modal">Cancel</button>
          <button type="button" class="btn btn-primary" onclick="saveConfig()" data-bs-dismiss="modal">Save</button>
        </div>
      </div>
    </div>
  </div>
  <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/js/bootstrap.bundle.min.js"></script>
  <script>
    const $ = (id) => document.getElementById(id);
    const base = () => $("baseUrl").value.replace(/\/$/, "");
    const key = () => $("apiKey").value;

    // Load the saved settings once the page is ready
    document.addEventListener("DOMContentLoaded", function () {
      loadConfig();
    });

    function loadConfig() {
      const savedBaseUrl = localStorage.getItem("firecrawl_baseUrl");
      const savedApiKey = localStorage.getItem("firecrawl_apiKey");
      if (savedBaseUrl) $("baseUrl").value = savedBaseUrl;
      if (savedApiKey) $("apiKey").value = savedApiKey;
    }

    // Show a green toast in the top-right corner and remove it after 3 seconds
    function showToast(message) {
      const toast = document.createElement("div");
      toast.className = "toast align-items-center text-white bg-success border-0 position-fixed top-0 end-0 m-3";
      toast.style.zIndex = "9999";
      toast.innerHTML = `<div class="d-flex"><div class="toast-body"><i class="bi bi-check-circle"></i> ${message}</div><button type="button" class="btn-close btn-close-white me-2 m-auto" data-bs-dismiss="toast"></button></div>`;
      document.body.appendChild(toast);
      new bootstrap.Toast(toast).show();
      setTimeout(() => toast.remove(), 3000);
    }

    function saveConfig() {
      localStorage.setItem("firecrawl_baseUrl", $("baseUrl").value);
      localStorage.setItem("firecrawl_apiKey", $("apiKey").value);
      showToast("Settings saved");
    }

    // Build request headers; the Authorization header is added only when a key is set
    function buildHeaders() {
      const headers = { "Content-Type": "application/json" };
      if (key()) headers["Authorization"] = `Bearer ${key()}`;
      return headers;
    }

    async function request(path, body) {
      return fetch(`${base()}${path}`, {
        method: "POST",
        headers: buildHeaders(),
        body: JSON.stringify(body),
      }).then((r) => r.json());
    }

    async function handleScrape() {
      const url = $("scrapeUrl").value;
      if (!url) return alert("Please enter a URL");
      const scrapeBtn = $("scrapeBtn");
      // Disable the button while the request is in flight
      scrapeBtn.disabled = true;
      $("result").textContent = "Scraping...";
      $("copyBtn").classList.add("d-none");
      try {
        const res = await request("/v0/scrape", {
          url,
          pageOptions: { onlyMainContent: true },
        });
        $("result").textContent = res.data?.markdown || JSON.stringify(res, null, 2);
        $("copyBtn").classList.remove("d-none");
        window.lastResult = res;
      } catch (error) {
        $("result").textContent = `Scrape failed: ${error.message}`;
      } finally {
        // Re-enable the button
        scrapeBtn.disabled = false;
      }
    }

    async function handleCrawl() {
      const url = $("crawlUrl").value;
      const limit = parseInt($("maxPages").value) || 10;
      if (!url) return alert("Please enter a URL");
      const crawlBtn = $("crawlBtn");
      // Disable the button until the crawl finishes or fails
      crawlBtn.disabled = true;
      $("result").textContent = "Crawling the site, please wait...";
      $("copyBtn").classList.add("d-none");
      try {
        const job = await request("/v0/crawl", { url, limit });
        if (!job.jobId) {
          $("result").textContent = JSON.stringify(job, null, 2);
          crawlBtn.disabled = false;
          return;
        }
        // Poll the job status every 2 seconds
        const poll = setInterval(async () => {
          const response = await fetch(`${base()}/v0/crawl/status/${job.jobId}`, {
            method: "GET",
            headers: buildHeaders(),
          });
          const status = await response.json();
          $("result").textContent = JSON.stringify(status, null, 2);
          if (status.status === "completed") {
            clearInterval(poll);
            window.lastResult = status;
            $("copyBtn").classList.remove("d-none");
            crawlBtn.disabled = false;
          }
          if (status.status === "failed") {
            clearInterval(poll);
            crawlBtn.disabled = false;
          }
        }, 2000);
      } catch (error) {
        $("result").textContent = `Crawl failed: ${error.message}`;
        crawlBtn.disabled = false;
      }
    }

    async function copyResult() {
      try {
        const dataStr = JSON.stringify(window.lastResult, null, 2);
        await navigator.clipboard.writeText(dataStr);
        showToast("Result copied to clipboard");
      } catch (error) {
        // Fall back to a temporary textarea when the Clipboard API is unavailable
        const textArea = document.createElement("textarea");
        textArea.value = JSON.stringify(window.lastResult, null, 2);
        document.body.appendChild(textArea);
        textArea.select();
        document.execCommand("copy");
        document.body.removeChild(textArea);
        alert("Result copied to clipboard");
      }
    }
  </script>
</body>
</html>
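The crawl tab in the page above starts a job and then polls its status every 2 seconds until the status is `completed` or `failed`. The same loop can be sketched in Python for use in scripts. This is an illustration under assumptions: `wait_for_crawl` is a hypothetical helper, `fetch_status` is any callable that returns the decoded status JSON (for example a GET to `/v0/crawl/status/{jobId}`), and the `max_polls` cap is an addition, since the page itself polls without a limit.

```python
import time


def wait_for_crawl(fetch_status, interval=2.0, max_polls=150, sleep=time.sleep):
    """Poll until the crawl job reports 'completed' or 'failed'.

    fetch_status: callable returning the decoded status dictionary.
    sleep: injectable so the loop can be exercised without real waiting.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        sleep(interval)
    raise TimeoutError("crawl did not finish after %d polls" % max_polls)
```

Passing the fetch function in as an argument keeps the loop independent of any particular HTTP client, and the cap prevents a stuck job from blocking a script forever.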