Introduction
Have you ever wondered how search engines, aggregators, or large directories gather massive amounts of data automatically? If so, you've probably come across terms like web scraping, spiders, or web crawlers. But what do these actually mean?
In this post, I’ll explain what web scraping is, how to get started, and my personal favorite tool for building a fully functional web crawler.
What is web scraping?

Web scraping is an automated technique used to extract data from websites. The data is usually retrieved in HTML format and then transformed into structured formats such as JSON, spreadsheets, or databases.
In simpler terms, web scraping is a way to automatically collect data from websites by analyzing their HTML structure (or other sources like RSS, JSON, or XML feeds).
So you might be wondering… why would you even need this?
Why use web scraping?
Web scraping is widely used in many real-world scenarios:
- Search engines indexing websites
- Price comparison platforms tracking products
- Job aggregators collecting postings from multiple sources
- News aggregators gathering articles
- Data analysis and research
- Monitoring changes in websites
- Building datasets for machine learning
Instead of manually copying information, scraping allows you to automate the entire process and keep data constantly updated.
So now that you know a little more about web scrapers, how do they actually work?
How web scraping works
At its core, web scraping is just automating what your browser already does:
- Request a page
- Receive HTML
- Parse the content
- Extract data
- Store it
- Repeat
That’s it. Everything else is just scaling this process.
Let's break it down.
1. Requesting the page
A scraper starts by sending an HTTP request to a website, just like your browser does when you open a page.
For example:
GET https://example.com/products
The server then responds with the HTML content of that page.
2. Receiving the HTML
The response usually looks like raw HTML:
<div class="product">
  <h2>Product name</h2>
  <span class="price">$19.99</span>
</div>
This is great for displaying content in a browser, but not very useful if we want structured data.
So the next step is to parse it.
3. Parsing the document
The scraper parses the HTML into a tree structure (DOM). Once we have that, we can locate elements using selectors like:
- CSS selectors
- XPath
- DOM traversal
For example:
- Product name → .product h2
- Price → .product .price
This tells the scraper exactly where the data lives.
4. Extracting the data
Once the selectors are defined, we extract the values:
- Product name: Product name
- Price: $19.99
This is the actual "scraping" part.
We're transforming HTML into structured data.
5. Structuring the output
After extracting the data, we usually convert it into something structured like JSON:
{
  "name": "Product name",
  "price": 19.99
}
From here, the data can be stored in:
- JSON files
- CSV / Excel
- Databases
- APIs
- Data pipelines
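Steps 2 through 5 can be sketched end-to-end with nothing but the Python standard library. This toy example reuses the product snippet from earlier and skips the HTTP request (in a real scraper the HTML would come from step 1):

```python
import json
from html.parser import HTMLParser

# The product snippet from earlier; in a real scraper this HTML would
# come from the HTTP response in step 1.
HTML = """
<div class="product">
  <h2>Product name</h2>
  <span class="price">$19.99</span>
</div>
"""

class ProductParser(HTMLParser):
    """Collects the text of <h2> and <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.field = None  # which field the next text chunk belongs to
        self.data = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2":
            self.field = "name"
        elif tag == "span" and attrs.get("class") == "price":
            self.field = "price"

    def handle_data(self, text):
        if self.field and text.strip():
            self.data[self.field] = text.strip()
            self.field = None

parser = ProductParser()
parser.feed(HTML)

# Structure the output: strip the "$" and store the price as a number.
product = {
    "name": parser.data["name"],
    "price": float(parser.data["price"].lstrip("$")),
}
print(json.dumps(product))
```

Real scrapers use proper selector libraries instead of a hand-written parser, but the shape of the work is the same: locate elements, pull out text, convert it into structured values.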
6. Repeating the process (Crawling)
Real-world scraping rarely involves a single page.
Usually, we want to:
- Follow pagination
- Visit detail pages
- Traverse categories
- Discover new links
For example:
/products?page=1
/products?page=2
/products?page=3
The scraper keeps visiting pages, extracting data, and storing results.
This loop is what turns a simple scraper into a web crawler.
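That crawl loop fits in a few lines. The `fetch`, `extract_links`, and `extract_items` functions below are placeholders for whatever HTTP client and parser you use; here they read from a tiny in-memory "site" so the sketch is self-contained:

```python
from collections import deque

def crawl(start_url, fetch, extract_links, extract_items):
    """Generic crawl loop: fetch pages, collect items, follow new links.

    `fetch`, `extract_links`, and `extract_items` are stand-ins for a real
    HTTP client and parser -- not actual library calls.
    """
    seen = set()
    queue = deque([start_url])
    items = []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        page = fetch(url)
        items.extend(extract_items(page))
        for link in extract_links(page):
            if link not in seen:
                queue.append(link)
    return items

# A toy in-memory "site" with three paginated pages:
# url -> (items on that page, links to follow).
pages = {
    "/products?page=1": (["item-a"], ["/products?page=2"]),
    "/products?page=2": (["item-b"], ["/products?page=3"]),
    "/products?page=3": (["item-c"], []),
}
result = crawl(
    "/products?page=1",
    fetch=lambda url: pages[url],
    extract_items=lambda page: page[0],
    extract_links=lambda page: page[1],
)
print(result)  # ['item-a', 'item-b', 'item-c']
```

The `seen` set is the important detail: without it, pages that link back to each other would make the crawler loop forever.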
In short, web scraping is just:
Request → Parse → Extract → Store → Repeat
This simple loop powers everything from small scripts to large-scale search engines.
Static vs Dynamic Scraping

If you have a background in web development, you're probably familiar with the terms static and dynamic websites. But why do these actually matter when scraping?
This distinction is extremely important, because it directly affects what data you actually receive when requesting a webpage.
Static Websites
A static website usually serves all of its content directly in the initial HTML response. There is no additional data being fetched after the page loads.
That means what you request is exactly what you get.
When scraping static websites, the process is straightforward:
- Request the page
- Parse the HTML
- Extract the data
Everything you see in your browser already exists in the HTML returned by the server.
This makes static websites the easiest targets for scraping.
Dynamic Websites
A dynamic website, on the other hand, loads a minimal HTML skeleton first, and then fetches most of its data using JavaScript.
This means the content is rendered after the initial request.
So what you see in your browser is not necessarily what the server originally sent.
Instead, the browser usually does something like:
- Requests the page
- Loads minimal HTML
- Executes JavaScript
- Fetches data from APIs
- Renders the content dynamically
This is where scraping becomes more complicated.
Why is this an issue?
When you start scraping using a simple approach like:
- Request page
- Save HTML
- Parse data
You may notice that the HTML you receive is not the same as what you see in your browser.
For example, inspecting the page in your browser might show something like this:
<div id="product">
  <p id="price">$6.7</p>
  <p id="name">Product</p>
</div>
But when you request that same page using a script, you might get:
<div id="product">
</div>
So… where did the data go?
That content was rendered dynamically using JavaScript after the page loaded. Your scraper only received the initial static HTML, not the rendered version.
This is one of the most common challenges when scraping modern websites.
There are several ways to deal with dynamic content:
- Reverse-engineering API calls
- Using headless browsers
- Rendering JavaScript
- Intercepting network requests
Each approach has its tradeoffs, and covering them properly could be a post on its own.
For now, we'll focus on static websites to understand the fundamentals first.
p.s. headless browsers are a big clue ;)
Common Scraping Challenges and Considerations

Now that we understand the difference between static and dynamic websites, it's important to talk about some real-world challenges you’ll likely encounter when building scrapers.
I work almost daily with scrapers both at my job and in personal projects, and these are some of the most common issues I've faced so far.
Rate limiting and blocking
One of the first problems you'll encounter is getting blocked.
If you send too many requests too quickly, websites may:
- Return 403 responses
- Return 429 (Too Many Requests)
- Temporarily block your IP
- Serve captcha challenges
- Return incomplete or empty responses
This happens because your scraper doesn't behave like a normal user.
Some common ways to mitigate this:
- Add request delays
- Use randomized intervals
- Respect crawl rate limits
- Rotate user agents
- Retry failed requests
- Implement backoff strategies
Slowing down your scraper often improves reliability more than making it faster.
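A couple of these mitigations, retries with exponential backoff plus jitter, fit in a small helper. The `fetch` argument below is a stand-in for any HTTP call that raises on failures such as 429 responses, and the delay values are illustrative rather than tuned for any real site:

```python
import random
import time

def polite_get(fetch, url, max_retries=4, base_delay=1.0):
    """Retry a request with exponential backoff plus random jitter.

    `fetch` is a placeholder for any HTTP call that raises on errors
    (e.g. 429 Too Many Requests); delays here are illustrative only.
    """
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            # Backoff doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

The jitter matters when you run several scrapers at once: without it, all of them retry at the same moment and hammer the server in sync.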
Pagination and infinite scrolling
Data is rarely contained in a single page. You'll often need to deal with:
- Page-based pagination
- "Load more" buttons
- Infinite scrolling
- Cursor-based APIs
Sometimes pagination is obvious:
/products?page=1
/products?page=2
/products?page=3
Other times it's hidden behind JavaScript requests, which means you need to inspect network calls and reverse engineer them.
Dynamic content
As discussed earlier, modern websites frequently load data dynamically.
This means:
- The initial HTML is mostly empty
- Data comes from APIs
- JavaScript renders content
- Elements appear after delays
Solutions typically include:
- Calling the API directly
- Using headless browsers
- Intercepting network requests
- Waiting for rendered content
Most of the time, calling the API directly is the cleanest solution.
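Once you've spotted the internal endpoint in your browser's network tab, the response is usually plain JSON, which is far easier to handle than rendered HTML. The payload shape below is made up for illustration; in practice you'd fetch it with an HTTP client first:

```python
import json

# A made-up JSON payload of the kind an internal API might return.
# In practice you'd fetch it with an HTTP client after spotting the
# endpoint in your browser's network tab.
payload = '{"results": [{"name": "Product", "price": 6.7}]}'

data = json.loads(payload)
products = [(item["name"], item["price"]) for item in data["results"]]
print(products)  # [('Product', 6.7)]
```

No selectors, no parsing fragile markup: the API already hands you structured data.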
Changing HTML structures
Websites change. A lot.
A scraper that works today might break tomorrow because:
- Class names change
- Layout changes
- Elements move
- IDs get randomized
- Content structure changes
This is one of the biggest maintenance costs of scraping.
Some ways to make scrapers more resilient:
- Avoid overly specific selectors
- Prefer semantic structure over class names
- Add fallbacks
- Validate extracted data
- Log unexpected layouts
Scrapers are not "write once" programs. They require maintenance.
Anti-bot protections
Some websites actively try to prevent scraping using:
- Cloudflare protections
- Captchas
- Browser fingerprinting
- JavaScript challenges
- Token validation
- Session-based access
These protections can make scraping significantly harder, and sometimes not worth the effort depending on the use case.
Understanding when to stop is also part of building scrapers.
Legal and ethical considerations
Scraping exists in a bit of a grey area depending on how it's used.
Some websites explicitly allow scraping. Others forbid it in their Terms of Service. Some provide APIs specifically to avoid scraping.
Before scraping a website, it's good practice to:
- Check robots.txt
- Read the Terms of Service
- Prefer official APIs when available
- Avoid scraping private or authenticated content
- Avoid collecting sensitive data
- Respect rate limits
- Avoid putting load on servers
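Python's standard library can help with the first item on that list: `urllib.robotparser` reads robots.txt rules. The rules below are fed in as lines so the example stays self-contained; against a real site you'd use `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()`:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt rules, supplied inline to keep this self-contained.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("my-scraper", "https://example.com/products"))      # True
print(rp.can_fetch("my-scraper", "https://example.com/private/data"))  # False
```

Checking `can_fetch()` before queueing a URL is a cheap way to stay on the polite side of a site's stated rules.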
Just because something is technically possible doesn't mean it should be scraped.
A good rule of thumb:
Scrape responsibly, and behave like a normal user would.
In many cases, responsible scraping is simply:
- Slower requests
- Minimal load
- Public data only
- Respecting access boundaries
Scraping is an incredibly powerful tool, but like any powerful tool, it should be used carefully.
Why I use Scrapy
With all that out of the way, let's get to my favorite part: Scrapy.
Scrapy is by far my favorite tool for web scraping. I've used it extensively in both personal projects and professional environments, and it has made building and maintaining scrapers significantly easier.
What I like the most about Scrapy is that it doesn't just help you scrape a single page — it provides a full framework for building scalable, maintainable, and production-ready web crawlers.
Instead of writing ad-hoc scripts for each website, Scrapy gives you structure, tools, and patterns that make scraping much more manageable.
It's such a powerful tool that I couldn't write a blog post about web scraping without talking about it.
What exactly is Scrapy?
As mentioned before, Scrapy is not just a scraping tool — it's a full crawling framework that provides everything needed to build scalable web scrapers.
Out of the box, Scrapy integrates:
- Spiders — define how to crawl websites and extract data
- Pipelines — process, clean, and validate extracted data
- Middlewares — handle retries, headers, proxies, redirects, and more
- Item feeds — export data locally or to external storage (JSON, CSV, S3, databases, etc.)
- Request scheduling — manage concurrency and crawl order
- Auto throttling — avoid overwhelming target websites
- Built-in retries and error handling — improve scraper reliability
All of these components work together, allowing you to focus on extracting data instead of building scraping infrastructure from scratch.
Scrapy is also an open-source framework written in Python, which makes it both powerful and accessible. Since Python is widely used and beginner-friendly, it's easy to get started while still having the flexibility to build production-ready crawlers.
Scrapy is such an extensive framework that properly covering it would require a post on its own.
So for now, we'll focus on getting Scrapy running and building your first scraper.
Later, I'll dive deeper into Scrapy internals and more advanced crawling patterns.
Creating your first web scraper
For this example we'll use Python, so make sure you have Python installed on your system, along with pip.
You can download Python from the official Python website (python.org).
First, create a folder for your Scrapy project:
mkdir first-scraper
It's recommended to use a Python virtual environment so your dependencies are isolated, but this step is optional.
cd first-scraper
python -m venv .venv
Activate the virtual environment:
Unix / macOS
source .venv/bin/activate
Windows
.venv\Scripts\activate
Once inside your virtual environment (or globally if you skipped that step), install Scrapy using pip:
pip install scrapy
To verify that Scrapy was installed correctly, run:
scrapy
You should see the Scrapy help prompt in your terminal.

Now that Scrapy is installed, there are two ways you can start exploring it:
- Using the interactive Scrapy shell
- Creating a basic Scrapy spider
The Scrapy shell is great for experimenting and figuring out selectors, while spiders are used for actual crawling.
For these basic examples we'll be using the following page: https://quotes.toscrape.com/. This is a great website to learn the basics of scraping!
Using the shell
Start the Scrapy shell with the following command:
scrapy shell
You should see some logs, and at the end an interactive prompt in your terminal.
2026-04-09 12:46:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x10810f290>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x107bb6a90>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
Now execute the following command:
>>> fetch('https://quotes.toscrape.com/')
This command tells Scrapy to request the page and store the result inside a response object.
Now try running:
>>> response.text
If everything went well, you should see the HTML response in your terminal. You have now successfully crawled your first website!

But now let's extract some actual data.
Let's assume we want to get the first quote from the website:

You might be wondering: how do we extract just that element?
For this, we'll use XPath, which is a query language used to navigate HTML and XML documents.
If you want to learn more about XPath, check this guide: https://www.w3schools.com/xml/xpath_intro.asp

If we inspect the HTML structure, we'll notice that each quote is wrapped in:
<div class="quote">
That makes it easy to select them. Try running:
>>> response.xpath('(//div[@class="quote"])[1]')
You'll see something like:
[<Selector query='(//div[@class="quote"])[1]' data='<div class="quote" itemscope itemtype...'>]
This is a Selector object. Selectors allow you to:
- Extract raw HTML
- Apply additional XPath queries
- Use regex
- Extract text
To get the HTML for that element:
>>> response.xpath('(//div[@class="quote"])[1]').get()
This returns the full HTML of the first quote.
Now let's break down the XPath:
- //div → select all div elements
- [@class="quote"] → where class equals "quote"
- [1] → select only the first one
So this:
response.xpath('(//div[@class="quote"])[1]').get()
Means:
Get all div elements with class "quote", and return only the first one.
Now let's extract only the quote text:
>>> response.xpath('(//div[@class="quote"])[1]/span[@class="text"]/text()').get()
Output:
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
And that's it — you've just extracted your first piece of data using Scrapy.
The shell is extremely useful for experimenting with selectors before writing an actual spider.
Final Thoughts
Web scraping is a powerful technique that allows you to automatically collect and structure data from websites.
From simple scripts to full-scale crawlers, the core idea remains the same: request, parse, extract, and repeat.
In this post we covered:
- What web scraping is
- How scraping works
- Static vs dynamic websites
- Common scraping challenges
- Introduction to Scrapy
- Using the Scrapy shell to extract data
This is only scratching the surface.
Scrapy is an extremely powerful and extensive framework, and properly covering it would require a dedicated post. We haven't even touched things like:
- Spiders
- Pipelines
- Middlewares
- Auto throttling
- Concurrency
- Feed exports
- Retry strategies
- Production crawlers
But hopefully this gives you a solid starting point to begin experimenting with web scraping and Scrapy.
Recommended Resources
If you'd like to continue learning Scrapy, here are some great resources:
- Official Scrapy documentation: https://docs.scrapy.org/
- Scrapy tutorial (quotes to scrape): https://docs.scrapy.org/en/latest/intro/tutorial.html
- XPath cheatsheet: https://devhints.io/xpath
- CSS selectors cheatsheet: https://devhints.io/css
- quotes.toscrape practice site: https://quotes.toscrape.com/
These resources should help you move from experimenting in the shell to building full crawlers.
Closing
Web scraping can start as a small script, but it can quickly grow into complex crawlers and data pipelines.
Tools like Scrapy make that transition much easier by providing structure and scalability from the start.
If you're getting into scraping, my recommendation is simple:
Start small, experiment in the shell, build simple spiders, and gradually add complexity.
Scrapy makes that process surprisingly enjoyable.
I'll likely write a dedicated post diving deeper into Scrapy internals, pipelines, and building production-ready crawlers.
