Booth Web Scraping Policy and Guidelines

Summary

Policy and Guidelines for scraping public websites for research data.

Body

Overview

Data collection is crucial to research.  As more information has become digital and available on publicly accessible web sites, web scraping has become a key tool in research data collection.  This document outlines Booth’s policy on web scraping as well as standards and guidelines for staff and faculty practicing web scraping techniques.

Definitions

Automated Data Collection- a technique for extracting large amounts of data from websites, whereby the data is extracted and saved to a local file by an automated tool or script.  Internet tools that systematically browse web pages for the purpose of collecting and/or indexing data may include:

  • Bots/Robots
  • Spiders
  • Web crawlers

Terms and Conditions/Terms of Service- An end-user or access agreement published by a web site owner stating access rules, limitations, and legal obligations.

Crawl Rate- the rate at which a web scraper sends requests to a web site.

Robot File- A file (robots.txt) published by a web site owner stating limitations on web scraping activity.

Legislation

Computer Fraud and Abuse Act- USC Title 18 § 1030

Digital Millennium Copyright Act (DMCA)- USC Title 17

GDPR- General Data Protection Regulation

Data- Web Scraping Activities

Web scraping is subject to Booth and University of Chicago data governance policies and processes.  This technique of data collection requires Principal Investigators (PIs) to address the need for:

  • Non-disclosure Agreements (NDA) and/or Data Use Agreements (DUA)
  • Data Retention Policies
  • Data Procurement and Management Processes Compliance

API

If a data source provides an API for data access, the API should be used in place of scraping.

Web Scraping Request

Before any web scraping activity can be undertaken, researchers must request authorization.

Prevent harm to web services

Booth faculty and staff have a moral and ethical responsibility when engaged in web scraping activities to protect the integrity and availability of the web sites.

  1. Only a single source IP address is authorized, and masking of IP addresses is prohibited
  2. Crawl Rate/Delay- if the website does not specify one, the Booth standard is no more than 1 request every 10 seconds
  3. Crawl during the website’s off-peak hours, if possible
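The crawl-rate standard above can be enforced in a script with a simple throttle. The following is a minimal Python sketch, not an official Booth tool: the class name Throttle and its parameters are hypothetical, and it assumes a single-threaded scraper that calls wait() before every request to the site.

```python
import time

# Policy default: no more than 1 request every 10 seconds.
BOOTH_DEFAULT_DELAY_SECONDS = 10.0


class Throttle:
    """Enforce a minimum delay between consecutive requests to one site.

    Hypothetical helper for illustration; clock and sleep are injectable
    so the behavior can be checked without real waiting.
    """

    def __init__(self, delay_seconds=BOOTH_DEFAULT_DELAY_SECONDS,
                 clock=time.monotonic, sleep=time.sleep):
        self.delay_seconds = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # time of the previous request, if any

    def wait(self):
        """Block until delay_seconds have passed since the previous call.

        Returns the number of seconds slept.
        """
        slept = 0.0
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            remaining = self.delay_seconds - elapsed
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_request = self._clock()
        return slept
```

In use, a scraper would create one Throttle per website and call throttle.wait() immediately before each request. If the website publishes its own crawl delay, that value can be passed as delay_seconds.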

Terms of Service/Terms and Conditions

The requester and Information Security must review each website’s Terms and Conditions (T&C) for web scraping requirements.  If the T&C or robots.txt file (see below) prohibits scraping, written permission from the web site administrator must be obtained before any scraping activity begins.  It is the requester’s responsibility to verify that there are no publishing restrictions on collected data.  This information should be contained in the DUA.

Robots.txt

The robots file is usually found at http(s)://<web address>/robots.txt

Example: https://www.amazon.com/robots.txt

The requester must submit a copy of robots.txt with the web scraping request.  Scraping activities must comply with the site’s robots.txt; specifically, scraping any path listed under a “Disallow” rule is not authorized.
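Compliance with robots.txt can be checked in code before each request. The sketch below uses Python’s standard urllib.robotparser module. The helper names (load_rules, is_allowed, effective_delay) and the sample rules are hypothetical. Falling back to the 10-second Booth standard when the site gives no Crawl-delay is one reading of the crawl-rate rule above, not an official implementation.

```python
from urllib.robotparser import RobotFileParser

# Assumed fallback: the Booth standard of 1 request every 10 seconds.
BOOTH_DEFAULT_DELAY_SECONDS = 10


def load_rules(robots_txt_text):
    """Parse the text of a robots.txt file, e.g. the copy submitted with the request."""
    parser = RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True only if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def effective_delay(parser, user_agent):
    """Return the site's Crawl-delay if it specifies one, else the Booth default."""
    delay = parser.crawl_delay(user_agent)
    return delay if delay is not None else BOOTH_DEFAULT_DELAY_SECONDS


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 20
"""

rules = load_rules(SAMPLE_ROBOTS_TXT)
```

With these sample rules, is_allowed(rules, "booth-research-bot", "https://example.org/private/data.html") is False, so that URL must not be scraped, while a URL under /public/ is permitted.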

Copyright

Copyright infringement must be prevented during scraping.  While facts can be used without copyright infringement, caution should be taken to avoid downloading any “creative content” subject to DMCA.

GDPR

Lawful reason- if the data potentially includes personal data of EU residents, the web scraping request must include a statement of the lawful reason for data collection.

  1. Consent- Do you have consent of the subjects to scrape their data?
  2. Legitimate Interest-  usually applies to government and Law Enforcement

Identification

If the web scraping tool supports it, set the USER_AGENT value (sent as the HTTP User-Agent request header) to identify the scraping.

  1. Include Researcher Name
  2. Include Researcher E-mail
  3. Identify project name
  4. All reported violations or complaints must be forwarded to Booth IT Security and Data Governance immediately at research-data.requests@list.chicagobooth.edu
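The identification items above can be combined into a single User-Agent string. The following Python sketch is illustrative only: the function name build_user_agent and the exact string layout are assumptions, since this policy does not prescribe a format.

```python
def build_user_agent(researcher_name, researcher_email, project_name):
    """Return a User-Agent string naming the project, researcher, and contact e-mail.

    Hypothetical helper; the layout "project (researcher: ...; contact: ...)"
    is an assumed convention, not a Booth requirement.
    """
    fields = (
        ("researcher name", researcher_name),
        ("researcher e-mail", researcher_email),
        ("project name", project_name),
    )
    for label, value in fields:
        if not value or not value.strip():
            raise ValueError(f"{label} is required")
    # Join the project name with hyphens so it forms a single token.
    project_token = "-".join(project_name.split())
    return (f"{project_token} (researcher: {researcher_name.strip()}; "
            f"contact: {researcher_email.strip()})")


# Example values are placeholders, not real people or projects.
headers = {
    "User-Agent": build_user_agent("Jane Doe", "jane.doe@example.edu", "Price Study")
}
```

The resulting headers dictionary can be passed to whichever HTTP client the scraper uses, for example as the headers argument of urllib.request.Request.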

 

Blocking

If the scraper is blocked by a website, the PI must report the blocking to Booth IT Security and Data Governance immediately at research-data.requests@list.chicagobooth.edu.  An Incident Response action will take place to determine the cause of the blocking.

Violating website policies will likely result in active blocking of web crawlers, which would prevent data collection.
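A scraper can stop itself as soon as it appears to be blocked, so the blocking can be reported rather than retried. The Python sketch below assumes that HTTP status codes 403 (Forbidden) and 429 (Too Many Requests) indicate blocking. Websites may signal blocking in other ways, and the names ScraperBlocked and check_response_status are hypothetical.

```python
# Assumed signals of blocking; individual sites may use other responses.
BLOCK_STATUS_CODES = {403, 429}

REPORTING_ADDRESS = "research-data.requests@list.chicagobooth.edu"


class ScraperBlocked(Exception):
    """Raised when a response suggests the website is blocking the scraper."""


def check_response_status(status_code, url):
    """Return status_code unchanged, or raise ScraperBlocked for a blocking response.

    The caller should stop the scraping job on ScraperBlocked rather than retry.
    """
    if status_code in BLOCK_STATUS_CODES:
        raise ScraperBlocked(
            f"HTTP {status_code} from {url}: stop scraping and report to "
            f"{REPORTING_ADDRESS}"
        )
    return status_code
```

Calling check_response_status after every request and letting the exception end the job avoids repeated requests against a site that has already refused access.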

Intentional Violation of this Policy

Traffic detected that is considered in violation of acceptable use will be blocked by Booth IT or University ITS Security staff.  All scraping traffic from the offending IP address will be suspended until an Incident Response process has been completed.

Web Site Security Controls

Circumventing web site security controls is strictly prohibited.

Mercury HPC Scraping Jobs

Only one concurrent scraping job per unique website is permitted on the Mercury HPC cluster.  Running multiple scraping threads on a single website is prohibited without express written authorization by the website owner.
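One way a researcher might guard against accidentally starting a second job against the same website is a per-site lock file. The Python sketch below is an illustration under stated assumptions, not a Mercury feature: the class SiteLock is hypothetical, it assumes all jobs share a lock directory, and it assumes the file system honors exclusive file creation. A crashed job can leave a stale lock file that must be removed by hand.

```python
import os
from urllib.parse import urlsplit


class SiteLock:
    """Hypothetical per-website lock based on exclusive creation of a lock file."""

    def __init__(self, url, lock_dir):
        host = urlsplit(url).hostname
        if host is None:
            raise ValueError(f"cannot determine website host from {url!r}")
        # One lock file per website host, regardless of the page being scraped.
        self.path = os.path.join(lock_dir, f"scrape-{host}.lock")
        self._held = False

    def acquire(self):
        """Return True if this job now holds the lock, False if another job does."""
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False
        with os.fdopen(fd, "w") as lock_file:
            lock_file.write(str(os.getpid()))
        self._held = True
        return True

    def release(self):
        """Remove the lock file if this object created it."""
        if self._held:
            os.remove(self.path)
            self._held = False
```

A job would call acquire() before its first request, exit without scraping if it returns False, and call release() in a finally block when it finishes.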

Exceptions

Crawl Rate- If an exception for an increased crawl rate is desired, the researcher must submit a ticket with the helpdesk.  It is the responsibility of the researcher to establish the business need and to confirm that an increased crawl rate will not violate website policies or expose web sites to undue risk.

Web Scraping of Chicago Booth Web Assets

Purpose:

This policy section outlines the guidelines for web scraping activities on Chicago Booth web assets. The purpose of this policy is to ensure that web scraping is conducted responsibly, respects the rights and privacy of users and website owners, and complies with all applicable laws and regulations.

Scope:

This policy applies to all individuals or entities engaging in web scraping activities on all Chicago Booth web sites or any associated services.

Definition:

Web scraping refers to the automated extraction of data from websites using bots, scripts, or other automated means.

Permission and Access:

All web scraping activities on Chicago Booth web sites require prior authorization from the website administrators and/or compliance with the restrictions published in robots.txt. Only authorized users or entities are allowed to perform web scraping.

Responsible Use:

Web scraping should be conducted responsibly and must not disrupt or interfere with the normal functioning of Chicago Booth sites or services. Excessive scraping that puts undue load on the servers is strictly prohibited.

Prohibited Content:

Web scraping of certain sensitive or private data, such as personal information, login credentials, financial data, or copyrighted material, is strictly prohibited.  Site restrictions published within a robots.txt file must be observed.

Intellectual Property Rights:

Web scraping should not infringe upon the intellectual property rights of Chicago Booth or any third party. Respect copyright, trademarks, and any other intellectual property protections.

Privacy and User Data:

Web scraping must respect the privacy of users. Avoid scraping any personally identifiable information (PII) or sensitive user data without explicit consent.

Legal Compliance:

All web scraping activities must comply with local, national, and international laws and regulations, including but not limited to data protection laws, copyright laws, and anti-spam laws.

Attribution:

If data obtained through web scraping is used or shared publicly, proper attribution to Chicago Booth as the source of the data is required.

Frequency and Rate Limiting:

Web scraping should be performed at a reasonable frequency and with rate-limiting mechanisms to avoid overwhelming the servers.

Monitoring and Enforcement:

Chicago Booth reserves the right to monitor web scraping activities and enforce this policy. Violations may result in temporary or permanent suspension of scraping access.

Liability:

Individuals or entities engaged in web scraping activities are solely responsible for any damages, legal consequences, or liabilities resulting from their actions.

Changes to the Policy:

Chicago Booth Information Security may update or modify this web scraping policy from time to time. It is the responsibility of users to stay informed about any changes.

Contact Information:

For inquiries related to web scraping authorization or questions about this policy, please contact security@lists.chicagobooth.edu

Details

Article ID: 14092
Created: Wed 8/14/24 1:49 PM
Modified: Wed 8/14/24 2:04 PM