Is web scraping legal?

Scraping publicly available data is legal in most jurisdictions. In the US, the Ninth Circuit's 2022 ruling in hiQ Labs v. LinkedIn held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act. Scraping behind login walls, mishandling personal data covered by GDPR, or overwhelming servers with requests creates real legal exposure. Respect rate limits and use data responsibly.

Should I learn Python or JavaScript for scraping?

Python dominates — BeautifulSoup, Scrapy, and Playwright all have excellent Python support and a larger community. JavaScript with Crawlee and Puppeteer is a strong second, and is the better pick if you build on Apify since Crawlee is the recommended Actor framework. For most beginners, Python's simpler syntax wins.

Do I actually need to write code to scrape a website?

Not for most use cases. No-code platforms like Apify host 30,000-plus pre-built Actors that handle infrastructure, anti-bot evasion, and proxy rotation — you paste a URL and get JSON. Low-code tools like Octoparse cover custom sites. Code-based frameworks are only required when no pre-built tool fits your target.

How much does web scraping cost in 2026?

Pre-built tools on Apify range from free to a few cents per page — its free tier includes 5 dollars per month in credits. Custom scrapers cost developer time plus residential proxies at 5 to 15 dollars per GB. For most business use cases, a 50 to 100 dollar monthly budget covers significant volume.

Why does so much scraping now need a real browser?

Roughly 70 percent of scraping targets in 2026 require JavaScript rendering, up from about a third five years ago. Social feeds, Google Maps, and React or Vue single-page apps load content after JavaScript runs, so HTTP-only tools see an empty shell. Browser-based tools like Playwright execute the JavaScript first.

When should I use an API instead of scraping?

Always prefer an official API when one exists — APIs return structured data in a documented format and are sanctioned. Scraping fills the gap when no API exists or the API is crippled, like Google's 5-review cap. Start with the API and fall back to scraping only when it does not provide what you need.

Web Scraping for Beginners: The Complete 2026 Guide

Copying data off a website by hand stops being an option somewhere around the fiftieth row. Web scraping automates that job: you point a tool at a page, tell it what you want, and it hands back structured results — JSON, CSV, or rows in a database — instead of you alt-tabbing for an afternoon. In 2026, that powers pricing intelligence, lead generation, market research, competitive analysis, and the data that trains AI models.

I run a Google Reviews Scraper on the Apify Store, so this guide is written from the inside: how scraping actually works, the tools worth knowing (with honest trade-offs), how modern sites fight back, what the law allows, and code you can run today.

How Web Scraping Works

Every website you visit is built from HTML — structured markup that your browser renders into what you see on screen. A web scraper does what your browser does, but instead of rendering the page visually, it extracts specific data points from the HTML structure.

The process has four steps:

1. Send a request. The scraper sends an HTTP request to a URL, just like your browser does when you navigate to a page. The server returns HTML (and often JavaScript, CSS, and other assets).

2. Render if necessary. Modern websites often load data dynamically using JavaScript — the initial HTML is just a shell, and the actual content loads after JavaScript executes. Simple scrapers can’t see this content. Browser-based scrapers (like Playwright or Puppeteer) run a real browser engine that executes JavaScript, giving them access to everything a human user would see.

3. Parse and extract. The scraper navigates the HTML structure (the DOM) to find the specific data you want. This might use CSS selectors (div.review-text), XPath expressions (//div[@class='review-text']), or in 2026 increasingly AI-based extraction where an LLM understands the page layout and pulls data without explicit selectors.

4. Structure and store. Raw extracted text gets transformed into structured data — typically JSON objects, CSV rows, or database records. A review becomes { "author": "Jane", "rating": 5, "text": "Great product..." } instead of a blob of HTML.
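
Steps 3 and 4 can be sketched with nothing but the Python standard library. This is a minimal illustration, not how a production scraper would do it: the HTML fragment, the class names, and the extract_reviews helper are made up for the example, and it assumes the markup is well-formed enough for xml.etree.ElementTree to parse, which real pages rarely are (that gap is what BeautifulSoup fills). The XPath is a relative form of the expression shown in step 3.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical page fragment. Real pages are rarely well-formed XML.
HTML = """
<html><body>
  <div class="review">
    <span class="author">Jane</span>
    <span class="rating">5</span>
    <div class="review-text">Great product, fast shipping.</div>
  </div>
  <div class="review">
    <span class="author">Omar</span>
    <span class="rating">2</span>
    <div class="review-text">Arrived late.</div>
  </div>
</body></html>
"""


def extract_reviews(markup: str) -> list:
    """Step 3: find each review node. Step 4: turn it into a plain dict."""
    root = ET.fromstring(markup.strip())
    records = []
    for node in root.findall(".//div[@class='review']"):
        records.append({
            "author": node.findtext("span[@class='author']"),
            "rating": int(node.findtext("span[@class='rating']")),
            "text": node.findtext("div[@class='review-text']"),
        })
    return records


reviews = extract_reviews(HTML)
print(json.dumps(reviews, indent=2))
```

The output is a JSON list with one object per review, which is the shape most scraping tools hand back.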

Static vs Dynamic Websites

This distinction determines which tools you need:

Static sites serve complete HTML — all the content is in the initial response. Blogs, news articles, documentation sites, and many e-commerce product pages are static. Tools like BeautifulSoup or Scrapy handle these efficiently with simple HTTP requests.

Dynamic sites load content via JavaScript after the initial page load. Social media feeds, Google Maps, single-page applications (React, Angular, Vue), and infinite-scroll pages are dynamic. These require browser-based scrapers like Playwright or Puppeteer that execute JavaScript before extracting data.

In 2026, roughly 70% of scraping targets require some form of JavaScript rendering, up from about a third five years ago. This is why browser-based tools have become the default starting point for most projects.
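
One quick way to tell which kind of site you are dealing with: fetch the raw HTML and check whether a piece of text you can see in the browser is actually in it. The sketch below is a rough heuristic under that assumption. The function name and both sample pages are hypothetical, and a match only suggests (it does not prove) that an HTTP-only tool will be enough.

```python
def content_in_raw_html(raw_html: str, expected_text: str) -> bool:
    """Return True if text visible in the browser is already in the raw response."""
    return expected_text.lower() in raw_html.lower()


# Hypothetical static page: the content ships in the initial HTML.
static_page = "<html><body><h1>Pricing</h1><p>Pro plan: $29/month</p></body></html>"

# Hypothetical single-page app: the initial HTML is an empty shell plus a script.
spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

for label, page in [("static", static_page), ("spa", spa_shell)]:
    found = content_in_raw_html(page, "Pro plan")
    print(label, "-> HTTP-only scraper is likely enough" if found else "-> needs a browser")
```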

Do You Need to Code?

Not for most use cases. The web scraping ecosystem in 2026 offers three tiers:

No-code platforms — Point, click, paste a URL, get data. Platforms like Apify host thousands of pre-built scrapers (“Actors”) that handle the infrastructure, anti-bot evasion, proxy rotation, and output formatting for you. My Google Reviews Scraper, for example, returns structured review data from any Google Maps listing by just pasting a URL.

Low-code tools — Visual interfaces where you click on page elements to define what to extract. Tools like Octoparse and Browse AI let you build custom scrapers without writing code, though complex sites may require manual adjustments.

Code-based frameworks — Python libraries and Node.js frameworks for full control. Required when no pre-built tool covers your use case, when you need to handle complex authentication flows, or when you’re building scraping into a larger application.

Understanding the basics of how scraping works helps you make better decisions regardless of which tier you use — you’ll know when a no-code tool is sufficient and when you need to invest in custom development.

Web Scraping Tools Compared (2026)

No-Code Platforms

Apify — The largest marketplace of pre-built scrapers with 30,000+ Actors covering Google Maps, social media, e-commerce, job boards, and more. Run via web UI, API, or integrate with AI assistants via MCP (Model Context Protocol). Serverless hosting with scheduling, dataset storage, and integrations with Zapier, Make, and n8n. Free tier includes $5/month in credits. Best for teams that need data fast without building infrastructure.

Octoparse — Visual scraper with a point-and-click interface. Good for building custom scrapers without code. Handles pagination and form filling. Cloud-based runs available. Best for recurring scrapes of specific sites where no pre-built tool exists.

Browse AI — Focused on website monitoring and change detection. You teach it what to watch by clicking on page elements, and it alerts you when data changes. Best for price monitoring, stock tracking, and competitor surveillance.

Code-Based Frameworks

Playwright — Microsoft’s browser automation library. Controls Chromium, Firefox, and WebKit browsers. Handles JavaScript-heavy sites natively because it runs a real browser. Async architecture in Python and Node.js. In 2026, Playwright is the default choice for new scraping projects targeting dynamic sites. Auto-waiting for elements, built-in network interception, and multi-browser support make it significantly more reliable than earlier tools.

Scrapy — Python’s established scraping framework. Asynchronous via Twisted, built-in request queuing, middleware pipeline, and data export. Extremely efficient for static sites at scale — processes thousands of pages per minute. Does not handle JavaScript natively; requires the scrapy-playwright plugin for dynamic content, which adds complexity.

Crawlee — Node.js (and now Python) scraping library that wraps both HTTP crawling and Playwright browser automation in a unified API. Adds browser fingerprint randomization, session management, and proxy rotation out of the box. The recommended framework for Apify Actor development. If you’re choosing between Scrapy and Crawlee for a new project targeting modern SPAs, Crawlee is the safer default.

BeautifulSoup — Python library for parsing HTML. No networking capability — you provide the HTML string and it lets you search and extract data with CSS selectors or the element tree. Simple, well-documented, great for beginners learning the fundamentals. Best paired with requests or httpx for fetching pages. Not suitable for JavaScript-rendered content.

Choosing Your Tool: Decision Framework

Situation	Best Tool	Why
Need data now, no code	Apify (pre-built Actor)	Fastest path to results, no setup
Static site, Python project	Scrapy or BeautifulSoup + requests	Efficient, no browser overhead
JavaScript-heavy site	Playwright	Renders JS natively, reliable waits
Building on Apify platform	Crawlee	Native integration, built-in anti-detection
Custom scraper, modern stack	Crawlee (Node.js) or Playwright (Python)	Best balance of power and ergonomics
Learning web scraping	BeautifulSoup + requests	Simplest mental model, great docs
Monitoring for changes	Browse AI	Built for watch-and-alert workflows

Handling Anti-Bot Detection in 2026

Modern websites actively defend against scrapers. Understanding these defenses helps you choose the right tools and avoid wasting time on approaches that won’t work.

What Sites Check

TLS fingerprinting — When your scraper connects via HTTPS, the TLS handshake reveals a “fingerprint” based on cipher suites, extensions, and protocol parameters. HTTP libraries such as Python’s requests present a different TLS fingerprint than a real Chrome browser. Sophisticated sites (especially those behind Cloudflare) compare this fingerprint against known browser profiles.

JavaScript challenges — The site serves JavaScript that must execute correctly before the real content loads. Simple HTTP scrapers can’t execute JavaScript, so they fail immediately. Browser-based scrapers handle this natively, but headless mode can be detected via JavaScript API differences (e.g., navigator.webdriver being true).

Behavioral analysis — Sites track mouse movements, scroll patterns, and click timing. A scraper that instantly navigates to the exact element it needs, without any scrolling or mouse movement, looks nothing like a human user.

IP reputation — Data center IP addresses are flagged more aggressively than residential IPs. If your scraper runs from AWS or Google Cloud, the site already suspects it’s a bot before the first request completes.

CAPTCHAs and Turnstile — Cloudflare’s Turnstile has replaced traditional CAPTCHAs for many sites. It runs invisible challenges that score each request’s trustworthiness. Failing the score triggers a visual challenge.

How to Handle It

For most scraping tasks, use a managed platform (Apify, ScrapFly, ScrapingBee) that handles anti-detection for you. These services maintain pools of residential proxies, stealth browser configurations, and CAPTCHA-solving integrations. You focus on what data you want; they handle getting past the defenses.

If you’re building custom scrapers, the 2026 playbook looks like this:

Start with stealth browsers. Nodriver (async, native stealth) and SeleniumBase UC Mode (automatic fingerprint patching) are the leading options. Both handle Cloudflare JavaScript challenges and Turnstile in most cases.
Rotate residential proxies. Data center IPs get blocked fast. Residential proxy services provide IP addresses from real ISPs, which sites have a much harder time distinguishing from real users. Budget $5-15/GB of traffic.
Randomize behavior. Add random delays between requests (1-5 seconds), scroll pages before extracting data, and vary your navigation patterns. Don’t hit 100 pages per second from a single IP.
Monitor success rates. Anti-bot systems update frequently. A scraper that works today might get blocked next week. Track your success rate per site and adjust when it drops below 90%.
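
The last two items in that playbook are easy to sketch in code. This is a minimal version under stated assumptions: polite_delay and SuccessTracker are hypothetical helper names, the 1 to 5 second range and the 90% threshold come from the list above, and the rolling window of 100 requests is an arbitrary choice.

```python
import random
import time
from collections import deque


def polite_delay(min_s: float = 1.0, max_s: float = 5.0, sleep=time.sleep) -> float:
    """Sleep for a random interval between requests and return the delay used."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


class SuccessTracker:
    """Rolling success rate over the most recent `window` requests."""

    def __init__(self, window: int = 100, threshold: float = 0.90):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    @property
    def rate(self) -> float:
        # With no data yet, report 1.0 so a fresh tracker does not raise alarms.
        return sum(self.results) / len(self.results) if self.results else 1.0

    def needs_attention(self) -> bool:
        return self.rate < self.threshold


# Demo with simulated outcomes: 8 successes, then 2 blocked requests.
tracker = SuccessTracker(window=10)
for ok in [True] * 8 + [False] * 2:
    tracker.record(ok)

print(f"success rate: {tracker.rate:.0%}, needs attention: {tracker.needs_attention()}")
```

In a real scraper you would call polite_delay() before each request and tracker.record(...) after it, keeping one tracker per target site.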

What doesn’t work anymore: rotating the User-Agent string alone stopped being enough around 2022. Realistic headers are necessary but nowhere near sufficient. Similarly, basic headless Chrome without fingerprint patching is detected by most Cloudflare-protected sites within a few requests.

Legal and Ethical Considerations

Web scraping occupies a legal space that varies by jurisdiction but has become clearer over the past few years. Key principles:

What’s Generally Allowed

Public data extraction — Information that anyone can see without logging in is generally scrapable. In the US, the Ninth Circuit’s 2022 ruling in hiQ Labs v. LinkedIn held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act. EU courts have reached similar conclusions for publicly available business data.

Your own data — Scraping your own reviews, your own product listings, or data you’ve published is unambiguously fine.

Business analytics — Using scraped data for internal competitive analysis, pricing intelligence, and market research is standard business practice across most industries.

What Requires Caution

Personal data under GDPR — If you’re scraping data that includes personal information about EU residents (names, email addresses, profile information), GDPR applies. You need a legal basis — typically “legitimate interest” for business analytics. Processing should be proportionate to the purpose and you must honor data subject access requests.

Terms of Service — Many websites prohibit scraping in their ToS. This is primarily enforceable as a contractual matter rather than a criminal one. Google’s ToS technically prohibit scraping, but enforcement is through technical measures (CAPTCHAs, rate limiting) rather than lawsuits against individual scrapers.

Rate limiting and server impact — Sending thousands of concurrent requests that degrade a site’s performance can cross legal and ethical lines. Always throttle your requests and use tools that handle rate limiting responsibly.

Behind authentication — Scraping data behind login walls or paywalls raises additional legal concerns. The CFAA in the US specifically addresses “unauthorized access,” and scraping content that requires authentication to view is a riskier proposition.

Practical Rules

Use established platforms like Apify that handle rate limiting and ethical scraping for you
Don’t republish scraped content verbatim — extracting for analysis is different from copying
Respect robots.txt as a signal of the site owner’s intent (though it’s not legally binding in most jurisdictions)
Keep records of your scraping purposes and data processing, especially if handling EU data
When in doubt, start with official APIs and fall back to scraping only when the API doesn’t provide what you need
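
The robots.txt rule above can be checked in code before a crawl starts. Python ships a parser for it in urllib.robotparser. The sketch below parses an inline example file so it runs offline. The rules, the bot name, and the URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. In practice you would fetch the site's real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "my-research-bot"  # hypothetical user agent name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed_public = parser.can_fetch(AGENT, "https://example.com/products")
allowed_private = parser.can_fetch(AGENT, "https://example.com/private/report")
delay = parser.crawl_delay(AGENT)

print("public page allowed:", allowed_public)
print("private page allowed:", allowed_private)
print("requested crawl delay (seconds):", delay)
```

Against a live site, set_url() followed by read() loads the real file instead of the inline string.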

Practical Example: Building a Review Monitoring System

Let’s walk through a real use case that combines multiple scraping concepts into a useful business tool.

The goal: Monitor Google reviews for your business and 4 competitors, track sentiment trends, and automatically generate social media content from your best reviews.

Step 1: Extract the Reviews

Use the Google Reviews Scraper to pull reviews for 5 businesses. At $0.10 per place (up to 50 reviews), one monitoring run costs $0.50.

For a detailed walkthrough of this tool, see my guide to scraping Google reviews.

Step 2: Schedule Regular Runs

Apify supports cron-style scheduling. Set the scraper to run weekly and store results in a named dataset. Each run appends new data, building a historical timeline of reviews.
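
The cost of steps 1 and 2 is simple arithmetic, sketched here as a sanity check. The price and place count come from step 1. The weekly cadence comes from step 2. The cron expression is a standard five-field example for Mondays at 09:00, included as an assumption, so check the scheduler's documentation for the exact dialect it accepts.

```python
PRICE_PER_PLACE = 0.10   # dollars per place, from step 1
PLACES = 5               # your business plus 4 competitors
RUNS_PER_YEAR = 52       # one run per week

# Hypothetical schedule: minute hour day-of-month month day-of-week
WEEKLY_CRON = "0 9 * * 1"

cost_per_run = PRICE_PER_PLACE * PLACES
cost_per_year = cost_per_run * RUNS_PER_YEAR

print(f"schedule: {WEEKLY_CRON}")
print(f"${cost_per_run:.2f} per run, ${cost_per_year:.2f} per year")
```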

Step 3: Analyze the Data

Export to a spreadsheet or use Python for analysis. The Google Reviews Scraper puts each review’s body in a reviewText field and its score in rating — run one place at a time and tag each dataset with the business yourself:

import json
from collections import Counter

with open('reviews.json', encoding='utf-8') as f:
    reviews = json.load(f)

# Average rating across this place's reviews (guard against an empty export)
if reviews:
    avg = sum(r['rating'] for r in reviews) / len(reviews)
    print(f"{avg:.1f}★ ({len(reviews)} reviews)")
else:
    print("No reviews in this export")

# Most frequent words in negative reviews (1-2 stars); skip reviews with no text
negative = [r['reviewText'] for r in reviews if r['rating'] <= 2]
words = ' '.join(text for text in negative if text).lower().split()
common = Counter(words).most_common(20)
print("Top complaints:", common)
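
To compare your business against competitors, each dataset needs the business label mentioned above. The following is one possible way to do that, not the scraper's own feature: the business names, the inline sample data, and both helper functions are hypothetical. Only the rating and reviewText field names come from the text.

```python
from statistics import mean

# Hypothetical: one exported dataset per business, keyed by a label you choose.
datasets = {
    "My Cafe": [
        {"rating": 5, "reviewText": "Lovely staff"},
        {"rating": 4, "reviewText": None},
    ],
    "Rival Roasters": [
        {"rating": 2, "reviewText": "Slow service"},
        {"rating": 3, "reviewText": "Okay"},
    ],
}


def tag_reviews(datasets: dict) -> list:
    """Flatten all datasets into one list, adding a 'business' field to each review."""
    tagged = []
    for business, reviews in datasets.items():
        for review in reviews:
            tagged.append({**review, "business": business})
    return tagged


def average_by_business(tagged: list) -> dict:
    """Group ratings by business and return the mean rating for each."""
    ratings = {}
    for review in tagged:
        ratings.setdefault(review["business"], []).append(review["rating"])
    return {name: round(mean(values), 2) for name, values in ratings.items()}


tagged = tag_reviews(datasets)
averages = average_by_business(tagged)

# Print businesses from best to worst average rating.
for name, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {avg:.1f}")
```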

Step 4: Turn Insights Into Content

Found 5-star reviews worth showcasing? Feed them through the Content to Social MCP Server to generate social media testimonial posts automatically. See my social media automation guide for the full workflow.

Found negative trends in competitor reviews? That’s marketing intelligence — their weakness is your positioning opportunity.

Practical Example: Lead Generation From Google Maps

Another high-value scraping use case is extracting business data from Google Maps for B2B lead generation. The workflow:

Search Google Maps for a business category in a target area (e.g., “dentists in Austin, TX”)
Extract business names, addresses, phone numbers, websites, ratings, and review counts
Enrich with email addresses using tools like Hunter.io or Apollo
Import into your CRM and start outreach

This approach generates targeted lead lists for a fraction of the cost of buying them from data providers. My complete Google Maps lead generation guide covers this in detail, including Python code, legal considerations, and CRM integration workflows.
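
Before the CRM import, scraped listings usually need deduplicating, because the same business can show up under slightly different names. A minimal sketch of that cleanup step follows. The field names, the sample records, and the rule of treating a shared phone number as the same business are all assumptions for illustration, not the output format of any particular scraper.

```python
import csv
import io
import re

FIELDS = ["name", "phone", "website", "rating", "review_count"]

# Hypothetical scraped records. The first two are the same practice listed twice.
raw_leads = [
    {"name": "Austin Smiles", "phone": "(512) 555-0101",
     "website": "https://austinsmiles.example", "rating": 4.8, "review_count": 212},
    {"name": "Austin Smiles Dental", "phone": "512-555-0101",
     "website": "https://austinsmiles.example", "rating": 4.8, "review_count": 212},
    {"name": "Lakeside Dentistry", "phone": "(512) 555-0177",
     "website": "", "rating": 4.5, "review_count": 87},
]


def phone_key(phone: str) -> str:
    """Strip everything except digits so formatting differences do not matter."""
    return re.sub(r"\D", "", phone)


def dedupe(leads: list) -> list:
    """Keep the first record per phone number, falling back to the lowercased name."""
    seen = set()
    unique = []
    for lead in leads:
        key = phone_key(lead["phone"]) or lead["name"].lower()
        if key not in seen:
            seen.add(key)
            unique.append(lead)
    return unique


def to_csv(leads: list) -> str:
    """Serialize leads to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(leads)
    return buffer.getvalue()


unique_leads = dedupe(raw_leads)
csv_text = to_csv(unique_leads)
print(csv_text)
```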

Getting Started: Your First Scraping Project

If you’re new to web scraping, here’s the most productive path. There’s no fixed timeline — move to the next step when the current one stops surprising you.

Start here: use a pre-built tool. Pick a specific use case — monitoring reviews, tracking prices, building a lead list — and find a pre-built scraper on the Apify Store. Run it, examine the output, and get a feel for what structured scraping data looks like. No code needed, and it tells you fast whether scraping even solves your problem.

Then: learn the fundamentals. Install Python and try BeautifulSoup on a static website. The requests library fetches the HTML, BeautifulSoup parses it:

import requests
from bs4 import BeautifulSoup

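# A timeout keeps the script from hanging on a slow server
response = requests.get('https://example.com', timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the text of every h1-h3 heading
headings = [h.get_text(strip=True) for h in soup.find_all(['h1', 'h2', 'h3'])]
print(headings)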

This teaches you how HTML parsing works — the foundation for everything else.

Next: handle dynamic sites. Install Playwright and scrape a JavaScript-heavy site. Reach for the Locator API (page.locator(...)) rather than the older query_selector handles — locators re-resolve on use and carry Playwright’s auto-waiting, so they don’t go stale when the page re-renders:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.com/dynamic-page')
    page.wait_for_selector('.data-loaded')

    # Now extract data after JS has rendered
    for item in page.locator('.item').all():
        print(item.text_content())

    browser.close()

When you’re ready: build something real. Combine what you’ve learned into a scraper that solves an actual problem for you or your business. Or build an Actor with Crawlee and publish it on Apify — the managed platform handles proxies, scheduling, and storage so you can focus on the scraping logic.

If you’re curious how AI is reshaping scraping and automation — particularly the Model Context Protocol that wires AI assistants directly into tools like scrapers — see my guide to deploying MCP servers in production.