Web scraping is the automated process of extracting information from websites. It can help you collect data from real estate listings, flights, weather, product reviews, or anything else that is publicly available, quickly and easily.
Although web scraping has had a bad reputation, it is not illegal. Still, most websites will attempt to stop it with a combination of techniques.
To learn more about data scraping and how to bypass these anti-web scraping attempts, keep reading.
In this guide to web scraping, you’ll learn everything about how it works, the legal boundaries, the techniques used to stop it, and how to overcome them responsibly.
Table of Contents.
- What is Web Scraping and How it Works?
- What is Web Scraping Used For?
- Is Web Scraping Legal?
- How Do Websites Attempt to Block Web Scraping?
- Ethical Web Scraping.
- Final Words.
1. What is Web Scraping and How it Works?
Web scraping is the process of automatically extracting data from websites and web applications. Without it, you would have to go into each website and manually pull data, which is a long and inefficient process.
But web scraping introduces automation and other elements to easily and quickly extract lots of data without much human intervention.
a. How Does Web Scraping Work?
Although the implementation of web scraping could be far more complex, the typical elements are the initiator and the target. The initiator uses automatic data extraction software to scrape websites. This software can be accessed from cloud-based services, via APIs, or even by developers who write their web scraping code with Python. The targets are generally content, contact information, forms, or anything publicly available on websites.
The typical process is as follows:
1. The initiator uses a piece of software referred to as a scraper bot (cloud-based, API-based, or home-made). This bot sends an HTTP GET request to a target website.
2. If the page exists, the target website responds to the bot’s request with an HTTP 200 OK status and the page’s HTML, the same response an ordinary visitor would get. When the bot receives the HTML code, it parses the document and collects its unstructured data.
3. The scraper bot extracts the raw data, stores it, and structures (indexes) it in whatever way the initiator specified. The structured data is then accessible in readable formats like XLS, CSV, SQL, or XML.
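The three steps above can be sketched in Python using only the standard library. This is a minimal illustration, not a production scraper: the `LinkExtractor` class, the helper names, and the `example-scraper/0.1` User-Agent are assumptions made for this example, and it only collects links, whereas a real bot would target whatever data the initiator specified.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collects the href and visible text of every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.rows = []      # structured output: one dict per link
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.rows.append({"text": text, "href": self._href})
            self._href = None


def fetch_html(url, timeout=10):
    """Step 1: send an HTTP GET request and return the response body."""
    # Hypothetical User-Agent string, chosen only for this example.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_links(html):
    """Step 2: parse the HTML and collect the raw, unstructured data."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Step 3: give the data a structure, here as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "href"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Against a live page, the three functions would be chained as `to_csv(parse_links(fetch_html("https://example.com/")))`. Keeping the fetch separate from the parsing also makes the parser easy to try on a saved HTML string without touching the network.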
How is Googlebot Related?
Google uses a very similar technique known as Web Crawling.
Google uses its proprietary web crawler, known as Googlebot, to continuously scan documents across the web. Googlebot scans (crawls) the web by sending a massive number of requests to web servers. Each web server responds with web page information, and Googlebot downloads a copy and stores it in the Google index. Pulling this off across the World Wide Web requires enormous computing power.
But before Googlebot goes wild and starts collecting everybody’s data, it checks the website’s robots.txt. The robots.txt file specifies the pages and files that can (or can’t) be crawled.
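A bot can run the same kind of check with Python’s standard `urllib.robotparser` module. The sketch below parses a made-up robots.txt; the rules, the `example-bot` name, and the `example.com` URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Parse robots.txt rules that have already been downloaded as text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# May a bot calling itself "example-bot" fetch these URLs?
allowed = parser.can_fetch("example-bot", "https://example.com/listings/")
blocked = parser.can_fetch("example-bot", "https://example.com/private/data")

# Delay between requests that the site asks for, or None if not stated.
delay = parser.crawl_delay("example-bot")
```

For a live site, the usual pattern is to call `set_url()` with the address of the site’s robots.txt and then `read()` to download it, instead of passing the text in by hand.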
b. Web-Scraping vs. Web-Crawling?
Web scraping techniques search for specific data on particular sites, and they usually don’t check the robots.txt file. They use bots to extract data and structure it in formats like SQL, CSV, or XML.
Web crawling techniques, on the other hand, are used on a massive scale. They visit all sites, build a list, index data (create a copy), and store it in a database. Web crawlers usually check for permissions in the robots.txt file.
c. Other Similar Terms.
There are other similar automatic data extraction techniques:
- Data Scraping: This refers to the method of automatically extracting information from a target data source. It is the umbrella term that covers other automatic data extraction mechanisms.
- Data Mining: The goal of data mining is also to extract information from data sets using automation. The data is also transformed into a comprehensive structure. But the difference is that data mining uses statistics and machine learning to identify patterns and find insights or anomalies within structured data sets.
- Screen Scraping: This refers to automatically collecting visual data from an application’s screen display and transferring it to another application. Instead of parsing data as web scraping does, screen scraping reads text data from a display’s screen.
- Web Harvesting: Although web harvesting is often used interchangeably with web scraping, there are some slight differences. Web harvesting is a more general term for automatic web extraction mechanisms. Usually, the term is used when the data comes through an API rather than from parsing the HTML code.
2. What is Web Scraping Used For?
Web scraping is fantastic if done responsibly. Generally, it can be used to research markets, for example to gain insights and learn about trends in a specific market. It is also popular for monitoring competitors, to keep track of their strategy, prices, etc.
More specific use cases are:
- Online price change monitoring,
- Product reviews,
- SEO campaigns,
- Real estate listings,
- Tracking weather data,
- Tracking a website’s reputation,
- Monitoring availability and prices of flights,
- Testing ads regardless of geography,
- Monitoring financial resources,
- … and more.
3. Is Web Scraping Legal?
Web scraping is legal, BUT it should be done responsibly and ethically.
Web content is publicly available for an obvious reason: so that visitors can access it. In fact, some websites, such as those of government agencies and weather services, make some of their backend data accessible to the public via an API.
But the largest percentage of websites, especially commercial ones, do not make their API publicly available. In these sites, the data is displayed only on-demand, as the visitor goes clicking through the website.
Web scraping makes this data accessible without the need for any API. As mentioned before, web scraping uses automation, a.k.a. bots, to do what a typical visitor would do, but faster and on a larger scale. Although it is often considered an unacceptable or inappropriate practice, web scraping is perfectly legal.
Still, there is a gray area to consider, and some activities may fall into a different legal context.
- What you do with the data is what can get you into trouble. For example, reusing or reselling content or downloading copyrighted material is illegal.
- It is also unlawful to extract data that is not publicly available. Scraping data behind a login page, with user and password login, is against the law in the US, Canada, and most of Europe.
- Web scraping attacks. Depending on the context, sometimes web scraping is referred to as a scraping attack. When spammers use botnets (armies of bots) to target a website with large and fast requests, the entire website’s service may fail. Large-scale data scrapings may bring sites down.
- Massive web vulnerability scans. According to KrebsOnSecurity, back in 2013, hackers used a botnet to scan and extract data from Google to discover websites running vulnerable versions of the vBulletin forum software.
4. How Do Websites Attempt to Block Web Scraping?
Companies want some of their data to be accessible by human visitors. The context is different when a scraping bot visits a site.
Companies on the “target” end often see this as “spying,” so they prefer to deter this type of traffic. Another big reason, as mentioned previously, is that massive and fast data scraping can bring a website’s service to a halt. Common blocking techniques include:
- Unusual and high amounts of traffic from a single source. Target web servers sit behind filters, like WAFs (Web Application Firewalls), that create blacklists of noisy IP addresses. The filter detects the “unusual” rate and size of requests, blocks them, and blacklists the IP. The blacklist can be shared with threat intelligence services so that the “noisy IP” can also be recognized and blocked in other places. Some sites use a combination of WAFs and CDNs (Content Delivery Networks) to filter out or reduce the noise from geo-based IPs. Other websites also use CAPTCHAs to tell humans from robots.
- Some websites can detect bot-like browsing patterns. Similar to the previous technique, websites also block based on the requesting User-Agent (an HTTP header). Many bots do not use a regular browser. They use an HTTP client or a headless browser that, by default, sends a User-Agent identifier different from that of an ordinary human visitor.
- Websites also change the HTML markup often. Web scraping bots follow a consistent “HTML markup” route when traversing the content of a website. Some websites change HTML elements within the markup regularly and randomly to throw a bot off its regular scraping habit. Changing the HTML markup doesn’t stop web scraping but makes it far more challenging.
- To deter bots using headless browsers, some websites require CAPTCHA challenges. Bots using headless browsers have a hard time solving these types of challenges, because CAPTCHAs were made to be solved by a human user in a browser.
- Some sites are traps for scraping bots. Some websites are created only for trapping scraping bots. In cybersecurity, this is referred to as a honeypot. These honeypots are visible only to scraping bots (not to ordinary human visitors) and are built to lead web scrapers into a trap.
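To make the first technique in the list more concrete, here is a rough Python sketch of how a filter might flag a “noisy” IP by counting requests inside a sliding time window. The `RateDetector` class, its thresholds, and the IP addresses are hypothetical; real WAFs are considerably more sophisticated and combine many more signals.

```python
from collections import defaultdict, deque


class RateDetector:
    """Blacklists an IP that exceeds max_requests within window_seconds."""

    def __init__(self, max_requests=5, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blacklist = set()

    def allow(self, ip, now):
        """Record a request made at time `now`; return False if it is blocked."""
        if ip in self.blacklist:
            return False
        hits = self._hits[ip]
        hits.append(now)
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        if len(hits) > self.max_requests:
            self.blacklist.add(ip)  # too many requests: treat the IP as noisy
            return False
        return True


detector = RateDetector(max_requests=3, window_seconds=10.0)

# A fast client sends four requests within three seconds.
fast_client = [detector.allow("203.0.113.5", t) for t in (0, 1, 2, 3)]

# A slow client sends one request every twenty seconds.
slow_client = [detector.allow("198.51.100.7", t) for t in (0, 20, 40, 60)]
```

In this sketch the fast client is blocked on its fourth request and stays blacklisted afterward, while the slow client is never flagged. That is why request pacing matters so much for a scraper.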
5. Ethical Web Scraping.
Web scraping should be done responsibly and ethically, using the data within the legal boundaries. Reading a website’s Terms and Conditions should give you an idea of the restrictions you must adhere to. If you want to get an idea of the rules for a web crawler, check the site’s robots.txt.
If web scraping is entirely disallowed or blocked, use the site’s API (if it is available).
Also, be mindful of the target website’s bandwidth to avoid overloading a server with too many requests. Automating requests with a reasonable rate and appropriate timeouts is crucial to avoid putting a strain on the target server. Ideally, the bot should simulate the pace of a real user. Also, never scrape data behind login pages.
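One simple way to pace requests is to pause for a base delay plus a small random amount between fetches. The sketch below assumes a hypothetical `fetch_all` helper with illustrative default delays. The actual download function is passed in as `fetch`, so the pacing logic stays separate from the HTTP client you choose.

```python
import random
import time


def fetch_all(urls, fetch, base_delay=2.0, jitter=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns its content.
    The pause is base_delay seconds plus a random extra of up to `jitter`
    seconds, so requests do not arrive at a perfectly regular rhythm.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(base_delay + random.uniform(0.0, jitter))
        results[url] = fetch(url)
    return results
```

In practice, `fetch` would be a function that also sets a timeout on the request, and the delay values would be chosen with the site’s Terms and Conditions and any crawl delay in its robots.txt in mind.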
Follow the rules, and you should be ok.
Web Scraping Best Practices
- Use a Proxy. A proxy is an intermediary server that forwards requests. When web scraping with a proxy, you route your original request through it. The proxy replaces your IP address with its own and forwards the request to the target website. Use a proxy to:
  - Reduce the chances of getting your IP blacklisted or blocked. Always make requests through various proxies; IPv6 proxies are a good example. A proxy pool can help you perform larger volumes of requests without being blocked.
  - Access geo-tailored content. A proxy in a specific region is useful to scrape data as it is served to that particular geographical region. This is useful when websites and services are behind a CDN.
- Rotating Proxies. Rotating proxies take (rotate) a new IP from the pool for every new connection. Bear in mind that VPNs are not proxies. Although they do something very similar, which is to provide anonymity, they work at different levels.
- Rotate UA (User Agents) and HTTP Request Headers. To rotate UAs and HTTP headers, you need to collect a list of UA strings from real web browsers. Put the list in your Python web scraping code and set your requests to pick random strings from it.
- Don’t push the limits. Reduce the rate of requests, rotate, and randomize. If you are making a large number of requests to a website, start by randomizing things. Make each request seem random and human-like. First, change the IP of each request with the help of rotating proxies. Also, use different HTTP headers to make it look like the requests are coming from different browsers.
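The rotation ideas above might look roughly like this with Python’s standard library. Everything specific here is an assumption for illustration: the User-Agent strings are sample browser-style values, the proxy addresses are placeholders from a documentation-only IP range, and `build_request` and `proxy_rotation` are hypothetical helper names.

```python
import itertools
import random
from urllib.request import ProxyHandler, Request, build_opener

# Sample User-Agent strings; in practice, collect current ones from real browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

# Placeholder proxy addresses; replace with proxies you are allowed to use.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]


def build_request(url):
    """Build a GET request that carries a randomly chosen User-Agent."""
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
    return Request(url, headers=headers)


def proxy_rotation(proxies):
    """Yield (proxy, opener) pairs forever, cycling through the proxy pool."""
    for proxy in itertools.cycle(proxies):
        handler = ProxyHandler({"http": proxy, "https": proxy})
        yield proxy, build_opener(handler)
```

A scraping loop would take the next pair with `proxy, opener = next(rotation)` and send the request with `opener.open(build_request(url), timeout=10)`, ideally combined with pauses between requests.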
6. Final Words.
Web scraping automatically extracts specific unstructured data from target websites, stores it, and creates a structure out of that data. Web scraping is perfect for tracking real estate listings, monitoring flights, running SEO campaigns, analyzing the competition, and more.
Although most commercial websites try to block web scraping attempts, the process of scraping data is perfectly legal (as long as you do it responsibly and ethically). The legal/illegal context starts blurring when you abuse web scraping with large and fast requests, scrape private or copyrighted data, or resell the results.
To web scrape responsibly, treat the target website as if it were your own. Do not flood it with requests. A good practice is checking the Terms and Conditions or reading the site’s robots.txt. Also, use a proxy to avoid getting blacklisted and always remember to rotate IPs.