Introduction
Have you ever thought about building your own database of potential business leads, or tracking product prices so you can always buy at the cheapest price, without any of the manual effort? Web scraping is what lets you do that. Rust makes it easier still by letting you handle errors explicitly and run tasks concurrently, so you can do things like attach a web service router to your scraper or add a Discord bot that outputs the data.
In this guide to Rust web scraping, we'll write a web scraper that searches Amazon for Raspberry Pi products, grabs their prices, and stores them in a PostgreSQL database for further processing.
Stuck or want to see the final code? The GitHub repository for this article can be found here.
Getting Started
Let's make a new project by using `cargo shuttle init`. For this project we'll simply call it `webscraper` - you'll want the `none` option for the framework, which will spawn a new Cargo project with `shuttle-runtime` added (as we aren't currently using a web framework, we don't need to pick any of the other options).
Let's install our dependencies with the following one-liner:
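A one-liner along these lines should work; the exact feature flags are an assumption and may need adjusting depending on the crate versions `cargo add` resolves (sqlx's runtime/TLS feature names in particular differ between releases):

```bash
cargo add reqwest scraper tokio sqlx chrono shuttle-shared-db \
  -F tokio/macros -F sqlx/postgres -F sqlx/runtime-tokio-native-tls -F shuttle-shared-db/postgres
```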
We'll also want to install `sqlx-cli`, a useful tool for managing our SQL migrations. We can install it by running the following:
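The Postgres-only install below is one common option; you can also install it with default features if you prefer:

```bash
cargo install sqlx-cli --no-default-features --features native-tls,postgres
```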
If we then run `sqlx migrate add schema` in our project folder, we'll get our SQL migration file, which can be found in the `migrations` folder! The file name is prefixed with the date and time at which the migration was created, followed by the name we gave it (in this case, `schema`). For our purposes, here are the migrations we'll be using:
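A minimal schema along these lines is enough for this article; the column names and types are assumptions, so adjust them to whatever fields you decide to scrape:

```sql
-- One table to hold every scraped search result.
CREATE TABLE IF NOT EXISTS products (
    id SERIAL PRIMARY KEY,
    name VARCHAR NOT NULL,
    price VARCHAR NOT NULL,
    url VARCHAR NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```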
Before we get started, we'll want to make a struct that implements `shuttle_runtime::Service`, which is an async trait. We'll also want to set our user agent so that there's less chance of us getting blocked. Thankfully, we can do all of this by returning a struct from our main function, like so:
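A sketch of that setup is below. The struct name, the user-agent string, and the use of `shuttle_shared_db::Postgres` to provision the database are assumptions; `scrape` is the function we'll build over the rest of this article.

```rust
use reqwest::Client;
use sqlx::PgPool;

struct CustomService {
    ctx: Client,
    db: PgPool,
}

#[shuttle_runtime::main]
async fn main(
    #[shuttle_shared_db::Postgres] db: PgPool,
) -> Result<CustomService, shuttle_runtime::Error> {
    // Run our migrations on startup so the products table always exists.
    sqlx::migrate!()
        .run(&db)
        .await
        .expect("Failed to run migrations");

    // A browser-like user agent makes it less likely that our requests get blocked.
    let ctx = Client::builder()
        .user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0")
        .build()
        .expect("Failed to build the reqwest client");

    Ok(CustomService { ctx, db })
}

#[shuttle_runtime::async_trait]
impl shuttle_runtime::Service for CustomService {
    async fn bind(self, _addr: std::net::SocketAddr) -> Result<(), shuttle_runtime::Error> {
        // Kick off the scraping loop we'll write below.
        scrape(self.ctx, self.db)
            .await
            .expect("scraping loop exited with an error");

        Ok(())
    }
}
```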
Now that we're done, we can get started web scraping in Rust!
Making our Web Scraper
The first part of making our web scraper is making a request to our target URL so we can grab the response body to process. Thankfully, Amazon's URL syntax is quite simple, so we can easily customise the URL query parameters by adding the search terms we want to look for. Because Amazon returns multiple pages of results, we also want to keep the page number in a mutable variable that gets incremented by 1 every time a request is successful.
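A sketch of that setup, assuming the URL format of Amazon's public search pages and the `ctx` client we built in `main()`:

```rust
let mut pagenum = 1;
let mut retry_attempts = 0;

// pagenum gets bumped after every successful page so the next request moves on.
let url = format!("https://www.amazon.com/s?k=raspberry+pi&page={pagenum}");
let response = ctx.get(&url).send().await;
```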
As you may have noticed, we added a variable named `retry_attempts`. This is because sometimes when we're scraping, Amazon (or any other site for that matter) may return a 503 Service Unavailable, meaning the request fails. This can be caused by server overload or by us scraping too quickly, so we can model our error handling like this:
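Here's one way that handling could look, folding the plain `send()` call from above into a small retry loop; the retry limit and the 15-second back-off are arbitrary choices for illustration:

```rust
let res = loop {
    match ctx.get(&url).send().await {
        // Amazon is telling us to slow down - wait a bit, then retry the same page.
        Ok(res) if res.status() == reqwest::StatusCode::SERVICE_UNAVAILABLE => {
            retry_attempts += 1;
            if retry_attempts >= 10 {
                // Repeated 503s usually mean we're being throttled - give up for now.
                panic!("Amazon keeps returning 503 Service Unavailable");
            }
            std::thread::sleep(std::time::Duration::from_secs(15));
        }
        // Any other response is good enough to carry on with.
        Ok(res) => {
            retry_attempts = 0;
            break res;
        }
        Err(e) => panic!("request to {url} failed: {e}"),
    }
};
```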
Assuming the HTTP request is successful, we'll get an HTML body that we can parse using `scraper`.
If you go to Amazon in your browser and search for "raspberry pi", you'll receive a product list. You can examine this product list with your browser's developer tools (in this instance, it's the Inspect function in Firefox, but you can also use Chrome DevTools, Microsoft Edge DevTools, etc.). It should look like the following:

You might notice that the `div` element has a data attribute of `data-component-type` whose value is `s-search-result`. This is helpful for us, as no page components other than the ones we want to scrape have that attribute! Therefore, we can target the data with a CSS selector (see below for more information). We'll want to make sure we prepare our HTML by parsing it as an HTML fragment, and then we can declare our initial `scraper::Selector`:
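A sketch of the parsing step, using the `res` response from the retry loop above:

```rust
use scraper::{Html, Selector};

// Grab the raw HTML body and parse it so we can query it.
let body = res.text().await.expect("failed to read the response body");
let html = Html::parse_fragment(&body);

// Matches every <div data-component-type="s-search-result"> on the page.
let selector = Selector::parse("div[data-component-type='s-search-result']").unwrap();
```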
As you can see, the `Selector` uses CSS selectors to parse the HTML. In this case, we are specifically searching for an HTML `div` element that has a data attribute called `data-component-type` with a value of `s-search-result`.
If you run `html.select(&selector)` now, as per the `scraper` documentation, you'll see that it returns an iterator over the matching HTML elements. However, because the iterator can technically be empty, we'll want to make sure that there are actually elements to iterate over - so let's cover that case by adding an if statement that checks the iterator count:
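A sketch of that check; the early `break` assumes this code ends up inside the scraping loop we add later:

```rust
if html.select(&selector).count() == 0 {
    // No search results on this page - once we're inside the outer loop,
    // this is where we stop paging.
    break;
}
```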
In the final iteration of the app, this check should simply break the loop, as an empty page normally signals that there are no more products to retrieve (the first page should always have product results).
Now that we've done our respective error handling, we can iterate through the entries and create a Product, then append it to our vector of Products.
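A sketch of that extraction step. The `Product` struct, the inner selectors, and the half-second pause are all assumptions about which parts of each search result we want and how gently we want to scrape:

```rust
struct Product {
    name: String,
    price: String,
    url: String,
}

let mut products: Vec<Product> = Vec::new();

// Selectors for the pieces of each search result we care about.
let name_selector = Selector::parse("h2 > a > span").unwrap();
let price_selector = Selector::parse("span.a-price > span.a-offscreen").unwrap();
let link_selector = Selector::parse("h2 > a").unwrap();

for entry in html.select(&selector) {
    let name = entry
        .select(&name_selector)
        .next()
        .map(|e| e.text().collect::<String>())
        .unwrap_or_default();

    let price = entry
        .select(&price_selector)
        .next()
        .map(|e| e.text().collect::<String>())
        .unwrap_or_default();

    let url = entry
        .select(&link_selector)
        .next()
        .and_then(|e| e.value().attr("href"))
        .map(|href| format!("https://www.amazon.com{href}"))
        .unwrap_or_default();

    products.push(Product { name, price, url });

    // A short pause between entries - note the std sleep (see the note below).
    std::thread::sleep(std::time::Duration::from_millis(500));
}
```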
Note that in the above code block we use `sleep` from the standard library - if we attempt to use `tokio::time::sleep`, the compiler returns an error about holding a non-`Send` future across an await point, because the scraper types we're holding (such as `Html`) aren't `Send`.
Now that we've written our code for processing the data we've gathered from the web page, we can wrap what we've written so far in a loop, moving our `Vec<Product>` and `pagenum` declarations to an outer loop that will run indefinitely. Next, we'll want to make sure we have somewhere to save our data! We'll want to use a batched transaction here, which thankfully we can do by calling `begin()` on the database pool to open a transaction and `commit()` on the transaction to finish it. Check the code out below:
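A sketch of the insert step, assuming the `products` table from our migration and the `Product` fields above:

```rust
// Open one transaction for the whole batch of scraped products.
let mut transaction = db.begin().await.expect("failed to begin transaction");

for product in &products {
    sqlx::query("INSERT INTO products (name, price, url) VALUES ($1, $2, $3)")
        .bind(&product.name)
        .bind(&product.price)
        .bind(&product.url)
        .execute(&mut *transaction)
        .await
        .expect("failed to insert a product");
}

// Commit once at the end so the whole batch lands (or fails) together.
transaction.commit().await.expect("failed to commit transaction");
```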
All we're doing here is just running a for loop over the list of scraped products and inserting them all into the database, then committing at the end to finalise it.
Now ideally we'll want the scraper to rest for some time so that the pages are given time to update - otherwise, if you comb the pages constantly you'll more than likely end up with a huge amount of duplicate data. Let's say we wanted it to rest until midnight:
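One way to do that with the `chrono` crate (a sketch; sleeping until local midnight is an arbitrary choice):

```rust
use chrono::Local;

// Work out how long it is until the next local midnight.
let now = Local::now();
let midnight = now
    .date_naive()
    .succ_opt()
    .expect("date out of range")
    .and_hms_opt(0, 0, 0)
    .expect("invalid time");
let until_midnight = (midnight - now.naive_local())
    .to_std()
    .expect("duration should be positive");

// Nothing from scraper is in scope here, so tokio's async sleep is fine.
tokio::time::sleep(until_midnight).await;
```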
Now we're pretty much done!
Your final scraping function should look like this:
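Here's a condensed sketch of how the pieces fit together; `parse_product`, `insert_products`, and `sleep_until_midnight` are hypothetical helpers standing in for the extraction, transaction, and rest-until-midnight snippets above:

```rust
use reqwest::Client;
use scraper::{Html, Selector};
use sqlx::PgPool;

async fn scrape(ctx: Client, db: PgPool) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    loop {
        let mut products: Vec<Product> = Vec::new();
        let mut pagenum = 1;
        let mut retry_attempts = 0;

        loop {
            let url = format!("https://www.amazon.com/s?k=raspberry+pi&page={pagenum}");
            let res = ctx.get(&url).send().await?;

            // Back off and retry on 503, as shown earlier.
            if res.status() == reqwest::StatusCode::SERVICE_UNAVAILABLE {
                retry_attempts += 1;
                if retry_attempts >= 10 {
                    break;
                }
                std::thread::sleep(std::time::Duration::from_secs(15));
                continue;
            }

            let body = res.text().await?;
            let html = Html::parse_fragment(&body);
            let selector =
                Selector::parse("div[data-component-type='s-search-result']").unwrap();

            // An empty page means we've run out of results.
            if html.select(&selector).count() == 0 {
                break;
            }

            // Extract each search result as shown earlier (hypothetical helper).
            for entry in html.select(&selector) {
                products.push(parse_product(entry));
            }

            retry_attempts = 0;
            pagenum += 1;
        }

        // Save the whole batch in one transaction, then rest until midnight
        // (both hypothetical helpers wrapping the snippets above).
        insert_products(&db, &products).await?;
        sleep_until_midnight().await;
    }
}
```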
And we're done!
Deploying
If you initialised your project on the Shuttle servers, you can get started by using `cargo shuttle deploy` (adding `--allow-dirty` if on a dirty Git branch). If not, you'll want to use `cargo shuttle project start --idle-minutes 0` to get your project up and running.
Finishing Up
Thanks for reading this article! I hope you now have a more thorough understanding of how to get started with web scraping in Rust, using the reqwest and scraper crates.
Ways to extend this article:
- Add a frontend so you can show stats for your scraper bot
- Add a proxy for your web scraper
- Scrape more than one website