Introduction
Have you ever wondered how search engines, aggregators, or large directories gather massive amounts of data automatically? If so, you've probably come across terms like web scraping, spiders, or web crawlers. But what do these actually mean?
In this post, I’ll explain what web scraping is, how to get started, and my personal favorite tool for building a fully functional web crawler.
What is web scraping?

Web scraping is an automated technique used to extract data from websites. The data is usually retrieved in HTML format and then transformed into structured formats such as JSON, spreadsheets, or databases.
In simpler terms, web scraping is a way to automatically collect data from websites by analyzing their HTML structure (or other sources like RSS, JSON, or XML feeds).
So you might be wondering… why would you even need this?
Why use web scraping?
Web scraping is widely used in many real-world scenarios:
- Search engines indexing websites
- Price comparison platforms tracking products
- Job aggregators collecting postings from multiple sources
- News aggregators gathering articles
- Data analysis and research
- Monitoring changes in websites
- Building datasets for machine learning
Instead of manually copying information, scraping allows you to automate the entire process and keep data constantly updated.
So now that you know a little more about web scrapers, how do they actually work?
How web scraping works
At its core, web scraping is just automating what your browser already does:
- Request a page
- Receive HTML
- Parse the content
- Extract data
- Store it
- Repeat
That’s it. Everything else is just scaling this process.
Let's break it down.
1. Requesting the page
A scraper starts by sending an HTTP request to a website, just like your browser does when you open a page.
For example:
GET https://example.com/products
The server then responds with the HTML content of that page.
2. Receiving the HTML
The response usually looks like raw HTML:
<div class="product">
  <h2>Product name</h2>
  <span class="price">$19.99</span>
</div>
This is great for displaying content in a browser, but not very useful if we want structured data.
So the next step is to parse it.
3. Parsing the document
The scraper parses the HTML into a tree structure (DOM). Once we have that, we can locate elements using selectors like:
- CSS selectors
- XPath
- DOM traversal
For example:
- Product name → .product h2
- Price → .product .price
This tells the scraper exactly where the data lives.
4. Extracting the data
Once the selectors are defined, we extract the values:
- Product name: Product name
- Price: $19.99
This is the actual "scraping" part.
We're transforming HTML into structured data.
5. Structuring the output
After extracting the data, we usually convert it into something structured like JSON:
{
  "name": "Product name",
  "price": 19.99
}
From here, the data can be stored in:
- JSON files
- CSV / Excel
- Databases
- APIs
- Data pipelines
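Steps 2 through 5 can be sketched end-to-end with nothing but the Python standard library. This toy example reuses the product snippet from earlier and skips the HTTP request (in a real scraper the HTML would come from step 1):

```python
import json
from html.parser import HTMLParser

# The product snippet from earlier; in a real scraper this HTML would
# come from the HTTP response in step 1.
HTML = """
<div class="product">
  <h2>Product name</h2>
  <span class="price">$19.99</span>
</div>
"""

class ProductParser(HTMLParser):
    """Collects the text of <h2> and <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.field = None  # which field the next text chunk belongs to
        self.data = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2":
            self.field = "name"
        elif tag == "span" and attrs.get("class") == "price":
            self.field = "price"

    def handle_data(self, text):
        if self.field and text.strip():
            self.data[self.field] = text.strip()
            self.field = None

parser = ProductParser()
parser.feed(HTML)

# Structure the output: strip the "$" and store the price as a number.
product = {
    "name": parser.data["name"],
    "price": float(parser.data["price"].lstrip("$")),
}
print(json.dumps(product))
```

Real scrapers use proper selector libraries instead of a hand-written parser, but the shape of the work is the same: locate elements, pull out text, convert it into structured values.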
6. Repeating the process (Crawling)
Real-world scraping rarely involves a single page.
Usually, we want to:
- Follow pagination
- Visit detail pages
- Traverse categories
- Discover new links
For example:
/products?page=1
/products?page=2
/products?page=3
The scraper keeps visiting pages, extracting data, and storing results.
This loop is what turns a simple scraper into a web crawler.
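That crawl loop fits in a few lines. The `fetch`, `extract_links`, and `extract_items` functions below are placeholders for whatever HTTP client and parser you use; here they read from a tiny in-memory "site" so the sketch is self-contained:

```python
from collections import deque

def crawl(start_url, fetch, extract_links, extract_items):
    """Generic crawl loop: fetch pages, collect items, follow new links.

    `fetch`, `extract_links`, and `extract_items` are stand-ins for a real
    HTTP client and parser -- not actual library calls.
    """
    seen = set()
    queue = deque([start_url])
    items = []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        page = fetch(url)
        items.extend(extract_items(page))
        for link in extract_links(page):
            if link not in seen:
                queue.append(link)
    return items

# A toy in-memory "site" with three paginated pages:
# url -> (items on that page, links to follow).
pages = {
    "/products?page=1": (["item-a"], ["/products?page=2"]),
    "/products?page=2": (["item-b"], ["/products?page=3"]),
    "/products?page=3": (["item-c"], []),
}
result = crawl(
    "/products?page=1",
    fetch=lambda url: pages[url],
    extract_items=lambda page: page[0],
    extract_links=lambda page: page[1],
)
print(result)  # ['item-a', 'item-b', 'item-c']
```

The `seen` set is the important detail: without it, pages that link back to each other would make the crawler loop forever.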
In short, web scraping is just:
Request → Parse → Extract → Store → Repeat
This simple loop powers everything from small scripts to large-scale search engines.
Static vs Dynamic Scraping

If you have a background in web development, you're probably familiar with the terms static and dynamic websites. But why do these actually matter when scraping?
This distinction is extremely important, because it directly affects what data you actually receive when requesting a webpage.
Static Websites
A static website usually serves all of its content directly in the initial HTML response. There is no additional data being fetched after the page loads.
That means what you request is exactly what you get.
When scraping static websites, the process is straightforward:
- Request the page
- Parse the HTML
- Extract the data
Everything you see in your browser already exists in the HTML returned by the server.
This makes static websites the easiest targets for scraping.
Dynamic Websites
A dynamic website, on the other hand, loads a minimal HTML skeleton first, and then fetches most of its data using JavaScript.
This means the content is rendered after the initial request.
So what you see in your browser is not necessarily what the server originally sent.
Instead, the browser usually does something like:
- Requests the page
- Loads minimal HTML
- Executes JavaScript
- Fetches data from APIs
- Renders the content dynamically
This is where scraping becomes more complicated.
Why is this an issue?
When you start scraping using a simple approach like:
- Request page
- Save HTML
- Parse data
You may notice that the HTML you receive is not the same as what you see in your browser.
For example, inspecting the page in your browser might show something like this:
<div id="product">
  <p id="price">$6.7</p>
  <p id="name">Product</p>
</div>
But when you request that same page using a script, you might get:
<div id="product">
</div>
So… where did the data go?
That content was rendered dynamically using JavaScript after the page loaded. Your scraper only received the initial static HTML, not the rendered version.
This is one of the most common challenges when scraping modern websites.
There are several ways to deal with dynamic content:
- Reverse-engineering API calls
- Using headless browsers
- Rendering JavaScript
- Intercepting network requests
Each approach has its tradeoffs, and covering them properly could be a post on its own.
For now, we'll focus on static websites to understand the fundamentals first.
p.s. headless browsers are a big clue ;)
Common Scraping Challenges and Considerations

Now that we understand the difference between static and dynamic websites, it's important to talk about some real-world challenges you’ll likely encounter when building scrapers.
I work almost daily with scrapers both at my job and in personal projects, and these are some of the most common issues I've faced so far.
Rate limiting and blocking
One of the first problems you'll encounter is getting blocked.
If you send too many requests too quickly, websites may:
- Return 403 responses
- Return 429 (Too Many Requests)
- Temporarily block your IP
- Serve captcha challenges
- Return incomplete or empty responses
This happens because your scraper doesn't behave like a normal user.
Some common ways to mitigate this:
- Add request delays
- Use randomized intervals
- Respect crawl rate limits
- Rotate user agents
- Retry failed requests
- Implement backoff strategies
Slowing down your scraper often improves reliability more than making it faster.
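A couple of these mitigations, retries with exponential backoff plus jitter, fit in a small helper. The `fetch` argument below is a stand-in for any HTTP call that raises on failures such as 429 responses, and the delay values are illustrative rather than tuned for any real site:

```python
import random
import time

def polite_get(fetch, url, max_retries=4, base_delay=1.0):
    """Retry a request with exponential backoff plus random jitter.

    `fetch` is a placeholder for any HTTP call that raises on errors
    (e.g. 429 Too Many Requests); delays here are illustrative only.
    """
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            # Backoff doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

The jitter matters when you run several scrapers at once: without it, all of them retry at the same moment and hammer the server in sync.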
Pagination and infinite scrolling
Data is rarely contained in a single page. You'll often need to deal with:
- Page-based pagination
- "Load more" buttons
- Infinite scrolling
- Cursor-based APIs
Sometimes pagination is obvious:
/products?page=1
/products?page=2
/products?page=3
Other times it's hidden behind JavaScript requests, which means you need to inspect network calls and reverse engineer them.
Dynamic content
As discussed earlier, modern websites frequently load data dynamically.
This means:
- The initial HTML is mostly empty
- Data comes from APIs
- JavaScript renders content
- Elements appear after delays
Solutions typically include:
- Calling the API directly
- Using headless browsers
- Intercepting network requests
- Waiting for rendered content
Most of the time, calling the API directly is the cleanest solution.
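Once you've spotted the internal endpoint in your browser's network tab, the response is usually plain JSON, which is far easier to handle than rendered HTML. The payload shape below is made up for illustration; in practice you'd fetch it with an HTTP client first:

```python
import json

# A made-up JSON payload of the kind an internal API might return.
# In practice you'd fetch it with an HTTP client after spotting the
# endpoint in your browser's network tab.
payload = '{"results": [{"name": "Product", "price": 6.7}]}'

data = json.loads(payload)
products = [(item["name"], item["price"]) for item in data["results"]]
print(products)  # [('Product', 6.7)]
```

No selectors, no parsing fragile markup: the API already hands you structured data.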
Changing HTML structures
Websites change. A lot.
A scraper that works today might break tomorrow because:
- Class names change
- Layout changes
- Elements move
- IDs get randomized
- Content structure changes
This is one of the biggest maintenance costs of scraping.
Some ways to make scrapers more resilient:
- Avoid overly specific selectors
- Prefer semantic structure over class names
- Add fallbacks
- Validate extracted data
- Log unexpected layouts
Scrapers are not "write once" programs. They require maintenance.
Anti-bot protections
Some websites actively try to prevent scraping using:
- Cloudflare protections
- Captchas
- Browser fingerprinting
- JavaScript challenges
- Token validation
- Session-based access
These protections can make scraping significantly harder, and sometimes not worth the effort depending on the use case.
Understanding when to stop is also part of building scrapers.
Legal and ethical considerations
Scraping exists in a bit of a grey area depending on how it's used.
Some websites explicitly allow scraping. Others forbid it in their Terms of Service. Some provide APIs specifically to avoid scraping.
Before scraping a website, it's good practice to:
- Check robots.txt
- Read the Terms of Service
- Prefer official APIs when available
- Avoid scraping private or authenticated content
- Avoid collecting sensitive data
- Respect rate limits
- Avoid putting load on servers
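Python's standard library can help with the first item on that list: `urllib.robotparser` reads robots.txt rules. The rules below are fed in as lines so the example stays self-contained; against a real site you'd use `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()`:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt rules, supplied inline to keep this self-contained.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("my-scraper", "https://example.com/products"))      # True
print(rp.can_fetch("my-scraper", "https://example.com/private/data"))  # False
```

Checking `can_fetch()` before queueing a URL is a cheap way to stay on the polite side of a site's stated rules.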
Just because something is technically possible doesn't mean it should be scraped.
A good rule of thumb:
Scrape responsibly, and behave like a normal user would.
In many cases, responsible scraping is simply:
- Slower requests
- Minimal load
- Public data only
- Respecting access boundaries
Scraping is an incredibly powerful tool, but like any powerful tool, it should be used carefully.
Why I use Scrapy
With all that out of the way, let's get to my favorite part: Scrapy.
Scrapy is by far my favorite tool for web scraping. I've used it extensively in both personal projects and professional environments, and it has made building and maintaining scrapers significantly easier.
What I like the most about Scrapy is that it doesn't just help you scrape a single page — it provides a full framework for building scalable, maintainable, and production-ready web crawlers.
Instead of writing ad-hoc scripts for each website, Scrapy gives you structure, tools, and patterns that make scraping much more manageable.
It's such a powerful tool that I couldn't write a blog post about web scraping without talking about it.
What exactly is Scrapy?
As mentioned before, Scrapy is not just a scraping tool — it's a full crawling framework that provides everything needed to build scalable web scrapers.
Out of the box, Scrapy integrates:
- Spiders — define how to crawl websites and extract data
- Pipelines — process, clean, and validate extracted data
- Middlewares — handle retries, headers, proxies, redirects, and more
- Item feeds — export data locally or to external storage (JSON, CSV, S3, databases, etc.)
- Request scheduling — manage concurrency and crawl order
- Auto throttling — avoid overwhelming target websites
- Built-in retries and error handling — improve scraper reliability
All of these components work together, allowing you to focus on extracting data instead of building scraping infrastructure from scratch.
Scrapy is also an open-source framework written in Python, which makes it both powerful and accessible. Since Python is widely used and beginner-friendly, it's easy to get started while still having the flexibility to build production-ready crawlers.
Scrapy is such an extensive framework that properly covering it would require a post on its own.
So for now, we'll focus on getting Scrapy running and building your first scraper.
Later, I'll dive deeper into Scrapy internals and more advanced crawling patterns.
Creating your first web scraper
For this example we'll use Python, so make sure you have Python installed on your system, along with pip.
You can download Python from the official Python website (python.org).
First, create a folder for your Scrapy project:
mkdir first-scraper
It's recommended to use a Python virtual environment so your dependencies are isolated, but this step is optional.
cd first-scraper
python -m venv .venv
Activate the virtual environment:
Unix / macOS
source .venv/bin/activate
Windows
.venv\Scripts\activate
Once inside your virtual environment (or globally if you skipped that step), install Scrapy using pip:
pip install scrapy
To verify that Scrapy was installed correctly, run:
scrapy
You should see the Scrapy help prompt in your terminal.

Now that Scrapy is installed, there are two ways you can start exploring it:
- Using the interactive Scrapy shell
- Creating a basic Scrapy spider
The Scrapy shell is great for experimenting and figuring out selectors, while spiders are used for actual crawling.
For these basic examples we'll be using the following page: https://quotes.toscrape.com/. This is a great website to learn the basics of scraping!
Using the shell
Start the Scrapy shell with the following command:
scrapy shell
You should see some logs, and at the end an interactive prompt in your terminal.
2026-04-09 12:46:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x10810f290>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x107bb6a90>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
Now execute the following command:
>>> fetch('https://quotes.toscrape.com/')
This command tells Scrapy to request the page and store the result inside a response object.
Now try running:
>>> response.text
If everything went well, you should see the HTML response in your terminal. You have now successfully crawled your first website!

But now let's extract some actual data.
Let's assume we want to get the first quote from the website:

You might be wondering: how do we extract just that element?
For this, we'll use XPath, which is a query language used to navigate HTML and XML documents.
If you want to learn more about XPath, check this guide: https://www.w3schools.com/xml/xpath_intro.asp

If we inspect the HTML structure, we'll notice that each quote is wrapped in:
<div class="quote">
That makes it easy to select them. Try running:
>>> response.xpath('(//div[@class="quote"])[1]')
You'll see something like:
[<Selector query='(//div[@class="quote"])[1]' data='<div class="quote" itemscope itemtype...'>]
This is a Selector object. Selectors allow you to:
- Extract raw HTML
- Apply additional XPath queries
- Use regex
- Extract text
To get the HTML for that element:
>>> response.xpath('(//div[@class="quote"])[1]').get()
This returns the full HTML of the first quote.
Now let's break down the XPath:
- //div → select all div elements
- [@class="quote"] → where class equals "quote"
- [1] → select only the first one
So this:
response.xpath('(//div[@class="quote"])[1]').get()
Means:
Get all div elements with class "quote", and return only the first one.
Now let's extract only the quote text:
>>> response.xpath('(//div[@class="quote"])[1]/span[@class="text"]/text()').get()
Output:
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
And that's it — you've just extracted your first piece of data using Scrapy.
The shell is extremely useful for experimenting with selectors before writing an actual spider.
Final Thoughts
Web scraping is a powerful technique that allows you to automatically collect and structure data from websites.
From simple scripts to full-scale crawlers, the core idea remains the same: request, parse, extract, and repeat.
In this post we covered:
- What web scraping is
- How scraping works
- Static vs dynamic websites
- Common scraping challenges
- Introduction to Scrapy
- Using the Scrapy shell to extract data
This is only scratching the surface.
Scrapy is an extremely powerful and extensive framework, and properly covering it would require a dedicated post. We haven't even touched things like:
- Spiders
- Pipelines
- Middlewares
- Auto throttling
- Concurrency
- Feed exports
- Retry strategies
- Production crawlers
But hopefully this gives you a solid starting point to begin experimenting with web scraping and Scrapy.
Recommended Resources
If you'd like to continue learning Scrapy, here are some great resources:
- Official Scrapy documentation: https://docs.scrapy.org/
- Scrapy tutorial (quotes to scrape): https://docs.scrapy.org/en/latest/intro/tutorial.html
- XPath cheatsheet: https://devhints.io/xpath
- CSS selectors cheatsheet: https://devhints.io/css
- quotes.toscrape practice site: https://quotes.toscrape.com/
These resources should help you move from experimenting in the shell to building full crawlers.
Closing
Web scraping can start as a small script, but it can quickly grow into complex crawlers and data pipelines.
Tools like Scrapy make that transition much easier by providing structure and scalability from the start.
If you're getting into scraping, my recommendation is simple:
Start small, experiment in the shell, build simple spiders, and gradually add complexity.
Scrapy makes that process surprisingly enjoyable.
I'll likely write a dedicated post diving deeper into Scrapy internals, pipelines, and building production-ready crawlers.
