Collect structured data

Web scraping and data extraction skills

Find skills for crawling websites, extracting structured data, monitoring pages, and turning messy web content into agent-ready inputs.

Try this task

I need my agent to scrape websites and extract structured data from pages.

Matched: Strong trust · Install ready · Auto allowed · 500+ stars

Agent should be able to

+ Crawl target URLs
+ Extract tables and metadata
+ Normalize messy page content
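
As a rough illustration of the extraction and normalization capabilities listed above, here is a minimal sketch that uses only Python's standard-library html.parser. The names PageExtractor and extract and the sample HTML are hypothetical and not part of any skill on this page. A real crawler skill will handle far more cases, such as malformed markup, nested tables, and JavaScript-rendered pages.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the <title>, <meta name=...> tags, and table rows from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.rows = []
        self._row = None        # cells of the <tr> currently being read
        self._cell = None       # text chunks of the <td>/<th> currently being read
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"]] = attrs["content"]
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("td", "th") and self._cell is not None:
            # Normalize: collapse runs of whitespace inside each cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._cell is not None:
            self._cell.append(data)


def extract(html):
    """Return the title, named meta tags, and table rows found in html."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title.split()),
        "meta": parser.meta,
        "rows": parser.rows,
    }


# Hypothetical sample page, used only to exercise the sketch.
SAMPLE = """
<html><head><title>  Price   list </title>
<meta name="description" content="Demo page"></head>
<body><table>
<tr><th>Item</th><th>Price</th></tr>
<tr><td> Widget   A </td><td>9.50</td></tr>
</table></body></html>
"""

result = extract(SAMPLE)
print(result)
```

For the sample page, the title comes back as "Price list" and the second table row as ["Widget A", "9.50"], with the stray whitespace collapsed.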

Resolve

Let the agent pick

Returns the best skill, alternatives, install handoff, risk summary, and safety gate.

Text plan

LLM-readable output

Plain text version for Codex, Claude Code, Cursor, and custom agent runtimes.

Browse

Human shortlist

Open the filtered registry view for this workflow and compare candidates manually.

Recommended stack

Turn this use case into a workflow

Scrape, clean, and reuse web data

Web data pipeline stack

A practical stack for agents that crawl public pages, extract clean content, normalize data, and hand it to downstream research or RAG workflows.

Workflow map

What to build with these skills

Extract product data from websites

Monitor competitor pages

Turn HTML into clean markdown

Feed crawled content into RAG
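
To make the "Turn HTML into clean markdown" step concrete, the following is a deliberately small sketch built on Python's standard-library html.parser. It is not how any listed skill works. The names MarkdownConverter and to_markdown, the supported tag subset (h1 to h3, p, li, a), and the sample HTML are all assumptions chosen for illustration. Text outside those tags, such as navigation chrome, is simply dropped.

```python
from html.parser import HTMLParser

# Assumption: only this small tag subset is converted; everything else is ignored.
PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}


class MarkdownConverter(HTMLParser):
    """Converts headings, paragraphs, list items, and links to Markdown blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished Markdown blocks, in document order
        self._buf = None      # text pieces of the block being built
        self._prefix = ""
        self._link = None     # text pieces of the <a> being read, if any
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in PREFIXES:
            self._buf, self._prefix = [], PREFIXES[tag]
        elif tag == "a" and self._buf is not None:
            self._href = dict(attrs).get("href")
            self._link = []

    def handle_endtag(self, tag):
        if tag == "a" and self._link is not None and self._buf is not None:
            text = "".join(self._link)
            self._buf.append(f"[{text}]({self._href})" if self._href else text)
            self._link = None
        elif tag in PREFIXES and self._buf is not None:
            # Collapse whitespace so the Markdown block sits on one clean line.
            text = " ".join("".join(self._buf).split())
            if text:
                self.blocks.append(self._prefix + text)
            self._buf = None
            self._link = None

    def handle_data(self, data):
        if self._link is not None:
            self._link.append(data)
        elif self._buf is not None:
            self._buf.append(data)


def to_markdown(html):
    """Return Markdown text for the supported subset of html."""
    conv = MarkdownConverter()
    conv.feed(html)
    conv.close()
    out = []
    for block in conv.blocks:
        # Keep consecutive list items together; separate other blocks with a blank line.
        if out and not (block.startswith("- ") and out[-1].startswith("- ")):
            out.append("")
        out.append(block)
    return "\n".join(out)


# Hypothetical sample page, used only to exercise the sketch.
SAMPLE = """
<html><body>
<nav><div>Menu Login</div></nav>
<h1>Crawler   notes</h1>
<p>Read the <a href="https://example.com/docs">docs</a> first.</p>
<ul><li>Fetch pages</li><li>Extract   data</li></ul>
</body></html>
"""

markdown = to_markdown(SAMPLE)
print(markdown)
```

The navigation text is discarded and the heading, paragraph, and list survive, which is the general shape of output a downstream RAG step would consume.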

Best first installs

Start with high-signal skills

18 matched skills

Crawlee

A web scraping and browser automation library for Node.js for building reliable crawlers in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Supports both headful and headless modes, with proxy rotation.

24K stars · 92 trust · 94 audit · VERIFIED

Install: Agent install candidate
Risk: Safe to try
Agent fit: Claude Code + OpenAI Agents
Updated: Jun 24, 2026

$ npx skills add apify/crawlee

Crawlee Python

VERIFIED

A web scraping and browser automation library for Python for building reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Parsel, BeautifulSoup, Playwright, and raw HTTP. Supports both headful and headless modes, with proxy rotation.

9.2K stars · 90 trust · 94 audit · VERIFIED

Install: Agent install candidate
Risk: Safe to try
Agent fit: Claude Code + OpenAI Agents
Updated: Jun 19, 2026

$ npx skills add apify/crawlee-python

Firecrawl

VERIFIED

The API to search, scrape, and interact with the web at scale. 🔥

139K stars · 91 trust · 94 audit · VERIFIED

Install: Agent install candidate
Risk: Safe to try
Agent fit: Claude Code + CLI
Updated: Jun 26, 2026

$ npx skills add firecrawl/firecrawl

Skill shortlist

More options for this use case

Scrapegraph AI

web-automation · Research

Python scraper based on AI

Review before production · Review the audit page, then allow agent install in a sandboxed workflow.

27K stars · 90 trust · 94 audit · 74 safety

Maxun

web-automation · Research

🔥 The open-source no-code platform for web scraping, crawling, search and AI data extraction • Turn websites into structured APIs in minutes 🔥

Low metadata risk · Allow agent install in a sandbox or low-risk workspace, then promote after one successful narrow task.

16K stars · 93 trust · 95 audit · 87 safety

Crawl4AI

web-automation · Research

Open-source LLM-friendly web crawler and scraper

Low metadata risk · Review the audit page, then allow agent install in a sandboxed workflow.

71K stars · 93 trust · 95 audit · 79 safety

WaterCrawl

web-automation · Coding

Transform Web Content into LLM-Ready Data

Review before production · Require human approval before installing into a real workspace.

1.8K stars · 82 trust · 86 audit · 74 safety

Scrapling

web-automation · Coding

Adaptive web scraping for agent data collection

Review before production · Require human approval before installing into a real workspace.

68K stars · 85 trust · 90 audit · 74 safety

Colly

web-automation · Coding

Elegant Scraper and Crawler Framework for Golang

Low metadata risk · Review the audit page, then allow agent install in a sandboxed workflow.

25K stars · 92 trust · 95 audit · 83 safety

Lux

web-automation · Coding

👾 Fast and simple video download library and CLI tool written in Go

Low metadata risk · Require human approval before installing into a real workspace.

31K stars · 86 trust · 89 audit · 65 safety

Newspaper

web-automation · Coding

newspaper3k is a library for news, full-text, and article metadata extraction in Python 3.

Low metadata risk · Allow agent install in a sandbox or low-risk workspace, then promote after one successful narrow task.

15K stars · 93 trust · 94 audit · 90 safety

AnyCrawl

data · Data

AnyCrawl 🚀: A Node.js/TypeScript crawler that turns websites into LLM-ready data and extracts structured SERP results from Google/Bing/Baidu/etc. Native multi-threading for bulk processing.

Low metadata risk · Allow agent install in a sandbox or low-risk workspace, then promote after one successful narrow task.

3.2K stars · 89 trust · 93 audit · 85 safety

Amazon Scraper

web-automation · Coding

Free-trial Amazon Scraper API for extracting search, product, offer listing, review, question-and-answer, best seller, and seller data.

Review before production · Review the audit page, then allow agent install in a sandboxed workflow.

3.0K stars · 86 trust · 91 audit · 71 safety

How To Scrape Amazon Product Data

web-automation · Coding

The process of extracting product data from Amazon using Python, including titles, ratings, prices, images, and descriptions.

Review before production · Review the audit page, then allow agent install in a sandboxed workflow.

2.9K stars · 86 trust · 91 audit · 75 safety

Google Play Scraper

web-automation · Coding

Node.js scraper to get data from Google Play

Review before production · Review the audit page, then allow agent install in a sandboxed workflow.

2.9K stars · 84 trust · 90 audit · 78 safety

QueryList

web-automation · Research

🕷 The elegant, progressive PHP crawler framework.

Review before production · Review the audit page, then allow agent install in a sandboxed workflow.

2.7K stars · 86 trust · 91 audit · 75 safety

EasySpider

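web-automation · Coding

A visual no-code web crawler/spider. EasySpider (易采集) is visual software for browser automation testing, data collection, and web crawling that lets you design and run crawler tasks graphically, without writing code. Also known as ServiceWrapper, an intelligent service-wrapping system for web applications.

Low metadata risk · Allow agent install in a sandbox or low-risk workspace, then promote after one successful narrow task.

44K stars · 93 trust · 94 audit · 90 safety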

Ferret

web-automation · Coding

Declarative web scraping

Review before production · Require human approval before installing into a real workspace.

6.0K stars · 85 trust · 92 audit · 68 safety

FAQ

How to choose skills for this workflow

These answers are written for both human builders and agents consuming the Registry API.

What are the best AI agent skills for web scraping?

Start by comparing Crawlee, Crawlee Python, and Firecrawl. OpenAgentSkill ranks them by workflow fit, GitHub adoption, trust score, safety gate, and install readiness.

Can an AI agent use this page directly?

Yes. Use the linked Registry API prompt to query /api/skills/search with the task "I need my agent to scrape websites and extract structured data from pages.", then retrieve the install handoff links for the top results.
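
As a hedged sketch of what such a query might look like, the snippet below only builds a request URL and does not send it. The path /api/skills/search comes from this page. The host registry.example, the use of GET, and the query parameter name q are assumptions and are not documented here, so check the Registry API prompt for the real values before relying on them.

```python
from urllib.parse import urlencode, urljoin

# Assumption: placeholder host. This page only gives the path below.
BASE_URL = "https://registry.example"
SEARCH_PATH = "/api/skills/search"

TASK = "I need my agent to scrape websites and extract structured data from pages."


def build_search_url(task, base_url=BASE_URL, param="q"):
    """Return a GET URL for the search endpoint.

    The parameter name "q" is a guess and is not documented on this page.
    """
    return urljoin(base_url, SEARCH_PATH) + "?" + urlencode({param: task})


url = build_search_url(TASK)
print(url)
```

Building the URL separately from sending it keeps the task text correctly percent-encoded and makes it easy to swap in the real host and parameter name once they are known.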

Should I install every recommended skill?

No. Start with the highest-fit skill, test it in a sandbox workflow, and add companion skills only when the task needs extra coverage.