    Responsible Web Scraping: Gathering Data Ethically and Legally

    By Denis Kryukov · September 11, 2019 · 12 mins read
    Web scraping done right
    [Artwork: Becoming Dr. Lawful, that is]

    Data is all around us: the things we buy, the articles we read, and the content we enjoy. The “digital footprints” we’re all leaving are immense. This insurmountable amount of information has paved the way for some awesome technological innovations: natural language processing, for instance, thrives on the rich textual data available on the internet.

    This raises an interesting question: “How do we acquire this data?” The answer lies in web scraping, which typically means the automated gathering of data from the internet. Web scraping, however, isn’t always a “white hat” (i.e. abiding by the law and the terms of use) practice — it can be utilized aggressively, causing potential damage to the donor resource.

    The focus of this article, therefore, is ethical web scraping: acquiring the data you need without becoming Dr. Evil. We’ll explore the definition of web scraping, how it works, its use cases, and its legal and ethical issues — along with how to avoid those issues by scraping responsibly.

    Getting the definition right

    Before we begin, let’s define this term. Web scraping (also known as web harvesting or web data extraction) is the process of extracting data from web-based resources. This brief definition holds a few key points which can help us understand it even better:

    • Web-based resources refer to collections/networks of websites.
    • Data can refer to texts, images, videos, and so on.

    OK. So how does it work?

    The power of everything digital lies in the way data is organized — in a strict system, that is. The web is no exception: the “Markup” in “Hypertext Markup Language” refers to the way raw data is marked up and ready for access. In its basic form, web scraping is organized in the following manner:

    1. Make a request to the URL that contains the necessary HTML data.
    2. Using a scraping tool (e.g. Scrapy), parse the HTML → find the element with the particular data you’re looking for (e.g. a picture’s alt text) → extract the data. (A minimal sketch of these two steps follows the diagram below.)
    [Diagram: Web scraping breakdown]
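
    To make these two steps concrete, here is a minimal sketch in Python, assuming the third-party requests and beautifulsoup4 packages are installed and using example.com as a stand-in URL:

    import requests
    from bs4 import BeautifulSoup

    # Step 1: request the URL that contains the necessary HTML data.
    response = requests.get('https://example.com/')
    response.raise_for_status()

    # Step 2: parse the HTML, find the element you're looking for,
    # and extract the data (here, the alt text of the first image).
    soup = BeautifulSoup(response.text, 'html.parser')
    image = soup.find('img')
    if image is not None:
        print(image.get('alt'))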

    Naturally, proficiency in web scraping comes with great benefits: for one, the entire internet pretty much becomes your database. It’s tempting to think that scraping is equivalent to using APIs — after all, they provide similar functionality. However, it’s not so simple…

    The ethical and legal issues

    [Artwork: Your website is next!]

    Site owners are perfectly justified when they try to protect themselves from scraping — otherwise, their websites run the following risks:

    • Content can be stolen, granting the thief a competitive advantage.
    • The website can crash (scraping bots, when sending requests en masse, can accidentally mount a DDoS-like attack on the web server).
    • The website’s algorithms and inner workings (especially if it’s a social media platform) can be exploited and manipulated.
    • And much, much more.

    Naturally, “evil” scrapers disregard the websites’ rules and terms of service completely — but they often fail to realize that their strategy won’t be profitable in the long run. As web scraping involves copying files, it introduces a curious ethical (and almost philosophical) challenge:

    • In Scenario A, the thief steals an apple (or an Apple computer, for that matter). This particular apple is unique and singular — stealing it from its owner means it went from one person to another.
    • In Scenario B, the thief steals a copy of Adobe Photoshop. The software is unique as well — but it is simultaneously abstract, so copying it doesn’t actually deprive the owner of their possession.

    Some people, therefore, argue: “It is impossible to truly steal software: stealing involves the owner ultimately losing the possession, but you’re merely making a copy of it.”

    [Artwork: “So what’s the difference?”, some might say]

    Site owners, on the other hand, aren’t too keen on debating the ethics of web scraping, so many of them view all kinds of scraping as harmful. In theory, we could turn to legal systems or web standards to define the boundaries — in practice, however, our laws are still adapting, so we’ve yet to get a definitive answer.

    In the United States, for instance, legal claims like copyright infringement can, in theory, protect a resource from being scraped: over the last two decades, many companies (including eBay, American Airlines, the Associated Press, Facebook, LinkedIn, and more) have tried to safeguard their online property — but different judges have reached different decisions, leaving web scraping’s legal status uncertain.

    These legal cases show that web scraping is still very much in a grey zone. This lack of regulation hurts both webmasters and white-hat scrapers. For massive scraping projects, therefore, consulting lawyers is absolutely mandatory; for smaller projects, the resource’s terms of use can provide guidelines detailing how to collect data without pissing everyone off.


    Methods of web scraping prevention

    [Artwork: Hold it right there!]

    For webmasters dissatisfied with how their resources are constantly being crawled, there are quite a few tricks and systems to safeguard their websites against scraping. Here are some of them:

    • Honeypots are computer security mechanisms disguised as real parts of the website, isolated and monitored. They lure bots in, allowing webmasters to track the source of an attack and block it.
    • By extension, it’s possible to block the IP addresses of bots based on specific criteria (e.g. geolocation).
    • Some companies resort to disabling their APIs: as detailed in our article about bots in social media and messengers, Instagram chose exactly this option.
    • CAPTCHA services have been battling bots for a long time; Google’s reCAPTCHA v3, which scores requests invisibly instead of showing the familiar “I’m not a robot” checkbox, raises the bar considerably.
    • Application firewalls and other commercial anti-bot services can filter out suspicious traffic.
    • The website’s markup can be obfuscated via small variations to the HTML/CSS code. This confuses bots, as they’re designed to work with a clear HTML structure.
    • robots.txt can be used to explicitly indicate that crawling is not allowed (or allowed only partially, or that the crawl rate is limited) — see the sample below.
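
    For illustration, a hypothetical robots.txt combining these options might look like this (Crawl-delay is a non-standard directive, though many crawlers honor it):

    User-agent: *
    Disallow: /private/
    Crawl-delay: 10

    User-agent: BadBot
    Disallow: /

    Here, every crawler is kept out of /private/ and asked to wait 10 seconds between requests, while a bot named BadBot is banned from the site entirely.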

    Guidelines for ethical and responsible web scraping

    [Artwork: Why can’t we be friends, why can’t we be friends?..]

    These guidelines come directly from the “Don’t be evil” approach, which you could easily formulate yourself; the main idea is this: in the long run, collaboration and respect between the scraper and the site owner trump greed and egoism.

    1. Prefer APIs over explicit scraping (when possible, of course; as discussed above, providing a working API is sometimes not an option for large companies).
    2. Each website is unique — articles like ours provide a general understanding, but different webmasters impose different requirements. Speaking of requirements…
    3. The website’s Terms of Service is the source of everything you need to know when working with the given resource. The more instances of different ToS you read, the better you’ll understand what generally constitutes responsible scraping. The website’s robots.txt can provide similar information.
    4. Set an adequate crawl rate (i.e. the number of requests you send in X seconds). “Adequate” isn’t a fixed number; webmasters usually specify it in robots.txt. If robots.txt doesn’t provide this information, it’s reasonable to limit yourself to 1 request per 10–15 seconds.
    5. When in doubt, ask. The scraping approach you opt for might look conservative to you, but the webmasters can disagree. You can always contact the website’s team to ask for clarifications or even try to negotiate.
    6. Identify your scraper bot via a legitimate user agent string — this signals that your intentions are not malicious. In the user agent string, you can also link to a page explaining why you need the data. (A sketch combining several of these guidelines follows below.)
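
    Here is a minimal sketch of guidelines 3, 4, and 6 combined in Python: it checks robots.txt via the standard library's urllib.robotparser, identifies itself honestly, and throttles its requests. The bot name, contact URL, and target URLs are hypothetical placeholders:

    import time
    import urllib.robotparser

    import requests

    # A hypothetical bot identity that links to a page explaining the project.
    USER_AGENT = 'ExampleResearchBot/1.0 (+https://example.com/bot-info)'

    # Guideline 3: consult robots.txt before fetching anything.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url('https://example.com/robots.txt')
    robots.read()

    for url in ['https://example.com/a', 'https://example.com/b']:
        if not robots.can_fetch(USER_AGENT, url):
            continue  # the site explicitly disallows this path
        # Guideline 6: identify the bot via a legitimate user agent string.
        response = requests.get(url, headers={'User-Agent': USER_AGENT})
        print(response.status_code, url)
        time.sleep(10)  # Guideline 4: roughly 1 request per 10 seconds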

    Web scraping use cases

    [Artwork: Downloading…]

    The issues, problems, and caveats we’ve outlined above do seem pretty challenging — but the power behind web scraping is absolutely worth it. If you ever find yourself wondering what web scraping is used for, these use cases can give you a great answer:

    Aggregators collect categorized data from multiple sources and present it to their users. Platforms like TripAdvisor, booking.com, and Flightradar24 all utilize publicly available data and package it for the end-user’s convenience.

    Lead generators can be used to harvest information about businesses and individuals from websites like LinkedIn; this information typically includes names, email addresses, phone numbers, etc.

    Scientific research thrives on rich data sets: the more information available, the better and more precise the research’s outcome will be. By extension, commercial research (the study of competitors and their products) also needs a lot of domain-specific information.

    The latest advancements in machine learning have become possible, in part, thanks to the increasing amount of information available on the web — rich scraped datasets are routinely used to train machine learning models.

    Web scraping tools and technologies

    [Artwork: In this case, however, ‘Py’ reigns supreme over ‘JS’]

    The tool can absolutely make or break the web scraping process, so it’s crucial to ensure that you’re using the right tool for the job. In terms of programming languages, Python and JavaScript are the most popular choices: even though they’re vastly different, their strengths ensure that you can use either of them and fine-tune it to your needs.

    Honorable mentions: as regular scrapers can only interact with HTML, you might need to utilize software like Selenium to process JavaScript; Selenium, in essence, works like a “fake browser” and can operate where simpler parsers fail. Puppeteer can also be useful, as it automates regular user actions in the browser. Powered by Node.js, it excels at dynamically loaded pages — for instance, those utilizing Ajax — and Single Page Applications (web apps built with Angular, React, or Vue).
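
    As a quick illustration of the “fake browser” idea, here is a minimal Selenium sketch in Python, assuming the selenium package and a compatible browser driver are available (the URL is a placeholder):

    from selenium import webdriver

    # Launch a real browser instance controlled by code.
    driver = webdriver.Chrome()
    try:
        driver.get('https://example.com/')
        # The browser executes JavaScript, so page_source reflects
        # the rendered DOM rather than the raw HTML response.
        print(driver.page_source[:300])
    finally:
        driver.quit()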


    Web scraping libraries in JavaScript

    Naturally, the popularity of JavaScript (and, by extension, Node.js) has given rise to a number of awesome web scraping libraries. Let’s take a closer look at them:

    Request is an HTTP client known for its simplicity and efficiency. It allows the developer to make HTTP calls quickly and easily, with support for HTTPS and for following redirects. Here’s how to get page content using Request:

    const request = require('request');
    
    request('http://stackabuse.com', function(err, res, body) {  
        console.log(body);
    });
    

    Osmosis can process AJAX/CSS content, log URLs, operate on cookies, and more. Despite its rich functionality, it doesn’t overburden you with numerous dependencies (unlike Cheerio). Here’s how you can gather some basic page information:

    const osmosis = require('osmosis');

    // Placeholder URL: substitute the page you want to scrape.
    const url = 'https://example.com/';

    osmosis
      .get(url)
      .set({
        heading: 'h1',
        title: 'title'
      })
      .data(item => console.log(item));
    

    Cheerio is an implementation of core jQuery which excels at parsing markup content; additionally, it offers an API to process and manipulate the data structure you acquire. Here’s how you can load the HTML object and find the parent of the first selected element:

    var request = require('request');
    var cheerio = require('cheerio');

    request('http://www.google.com/', function (err, resp, html) {
        if (!err) {
            // Load the HTML to get a jQuery-like interface.
            const $ = cheerio.load(html);

            // Find the parent of the first selected element and read its id:
            console.log($('.testElement').parent().attr('id'));
            //=> testParent
        }
    });
    

    Lastly, the Apify SDK is the most powerful tool, one that comes to the rescue when other solutions fall flat during heavier tasks: performing a deep crawl of a whole web resource, rotating proxies to mask the browser, scheduling the scraper to run multiple times, caching results to prevent data loss if the code happens to crash, and more. Apify handles such operations with ease — and it can also help you develop web scrapers of your own. Here’s a simple Hello World implementation:

    /**
     * Run the following example to perform a recursive crawl of a website using Puppeteer.
     *
     * To run this example on the Apify Platform, select the `Node.js 8 + Chrome on Debian (apify/actor-node-chrome)` base image
     * on the source tab of your actor configuration.
     */
    
    const Apify = require('apify');
    
    Apify.main(async () => {
        const requestQueue = await Apify.openRequestQueue();
        await requestQueue.addRequest({ url: 'https://www.iana.org/' });
        const pseudoUrls = [new Apify.PseudoUrl('https://www.iana.org/[.*]')];
    
        const crawler = new Apify.PuppeteerCrawler({
            requestQueue,
            handlePageFunction: async ({ request, page }) => {
                const title = await page.title();
                console.log(`Title of ${request.url}: ${title}`);
                await Apify.utils.enqueueLinks({ page, selector: 'a', pseudoUrls, requestQueue });
            },
            maxRequestsPerCrawl: 100,
            maxConcurrency: 10,
        });
    
        await crawler.run();
    });
    

    Web scraping libraries & frameworks in Python

    Python has become the standard: boasting both simplicity and power (qualities that really shine during Python technical interviews), it’s the go-to language for web scraping. As for the tools themselves, it may be tempting to compare them and proclaim: “Tool A reigns supreme! Everyone, abandon Tools B and C before it’s too late!” However, every web scraping tool has its pros and cons, so do remember to ensure that you’ve picked the right tool for the job.

    Scrapy is pretty much Django for web scraping (to understand the comparison better, check our Django interview questions out!): with its batteries included approach, it’ll probably manage to satisfy your project’s requirements. The tricky part, however, is understanding when not to apply Scrapy: choosing it for a simple project would be similar to building a full-blown PWA in React for a simple blog. Scrapy, therefore, excels at large projects — it’s extremely well-optimized, CPU- and memory-wise. Here’s how to scrape famous quotes from a web resource we specify:

    import scrapy
    
    
    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = [
            'http://quotes.toscrape.com/tag/humor/',
        ]
    
        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').get(),
                    'author': quote.xpath('span/small/text()').get(),
                }
    
            next_page = response.css('li.next a::attr("href")').get()
            if next_page is not None:
                yield response.follow(next_page, self.parse)
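
    Assuming the spider above is saved as quotes_spider.py, you can run it with Scrapy's command-line tool and export the scraped quotes to a JSON file:

    scrapy runspider quotes_spider.py -o quotes.json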
    

    Urllib is a handy little library for working with URLs, included in Python’s standard library. Its modules are designed to simplify all operations that are typically carried out with URLs. Here’s what they can do:

    • urllib.request opens and reads URLs.
    • urllib.error provides the exceptions that can be raised by urllib.request.
    • urllib.parse parses URLs.
    • urllib.robotparser parses robots.txt files.

    Here’s how to open the main page of python.org and display its first 300 bytes:

    import urllib.request
    with urllib.request.urlopen('http://www.python.org/') as f:
        print(f.read(300))
    
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
    xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n\n<head>\n
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />\n
    <title>Python Programming '
    

    Conclusion

    All in all, web scraping is worth the effort — many companies have built successful businesses around the ability to gather data and serve it where it’s needed most. Chances are, one of your projects might capitalize on scraping capabilities — so once you finally start harvesting all the data available on the web, do make sure to do so responsibly! 🙂
