Data collection is crucial to research. As information has increasingly become digital and publicly accessible on the web, web scraping has become a key tool in research data collection. This document outlines Booth’s policy on web scraping as well as standards and guidelines for staff and faculty practicing web scraping techniques.
Automated Data Collection - a technique employed to extract large amounts of data from websites, whereby the data is extracted and saved to a local file through an automated tool or script. Internet tools that systematically browse web pages for the purpose of collecting and/or indexing data may include:
- Bots/Robots
- Spiders
- Web crawlers
Terms and Conditions/Terms of Service - an end-user or access agreement published by a website owner stating access rules, limitations, and legal obligations.
Crawl Rate - the rate at which a web scraper issues requests to a website.
Robots File (robots.txt) - a file published by a website owner stating limitations on web scraping activity.
Computer Fraud and Abuse Act (CFAA) - 18 U.S.C. § 1030
Digital Millennium Copyright Act (DMCA) - 17 U.S.C.
GDPR- General Data Protection Regulation
Web scraping is subject to Booth and University of Chicago data governance policies and processes. This technique of data collection requires Principal Investigators (PIs) to address the need for:
- Non-disclosure Agreements (NDA) and/or Data Use Agreements (DUA)
- Data Retention Policies
- Data Procurement and Management Processes Compliance
If a data source provides an API for data access, the API should be used in place of scraping, as shown in the sketch below.
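A minimal sketch, assuming the source exposes a JSON API; the endpoint URL and parameter below are hypothetical placeholders, not a real data source:

```python
import requests

# Hypothetical endpoint; substitute the source's documented API URL.
API_URL = "https://api.example.com/v1/records"

response = requests.get(API_URL, params={"page": 1}, timeout=30)
response.raise_for_status()
records = response.json()  # structured data; no HTML parsing or crawling needed
```

An API returns structured data under terms the provider has already defined, which avoids both the server load and the T&C ambiguity of scraping HTML.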
Before any web scraping activity can be undertaken, researchers must request authorization.
Booth faculty and staff engaged in web scraping activities have a moral and ethical responsibility to protect the integrity and availability of the websites they scrape.
- Only a single source IP address is authorized and masking of IP addresses is prohibited
- Crawl Rate/Delay - if the website does not specify one, the Booth standard is no more than 1 request every 10 seconds (see the sketch after this list)
- Crawl during the website’s off-peak hours, if possible
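A minimal throttling sketch that enforces the Booth default of one request every 10 seconds; the URL list is a hypothetical placeholder, and the delay should be raised to match any stricter limit the target site publishes:

```python
import time
import requests

CRAWL_DELAY = 10  # seconds; Booth default when the site specifies no crawl delay

# Hypothetical target pages; substitute the URLs approved in the request.
urls = [
    "https://www.example.com/page1",
    "https://www.example.com/page2",
]

for url in urls:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # ...save response.text to a local file here...
    time.sleep(CRAWL_DELAY)  # wait before issuing the next request
```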
The requester and Information Security must review the individual website’s Terms and Conditions (T&C) for web scraping requirements. If the T&C or robots.txt file (see below) prohibits scraping, written permission from the website administrator must be obtained before any scraping activities begin. It is the requester’s responsibility to verify there are no publishing restrictions on collected data. This information should be contained in the DUA.
The robots file is usually found at http(s)://<web address>/robots.txt
Example: https://www.amazon.com/robots.txt
The requester must submit a copy of robots.txt with the web scraping request. Scraping activities must comply with the site’s robots.txt; specifically, any paths marked “Disallow” are not authorized.
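The Python standard library’s `urllib.robotparser` can check whether a path is permitted before any request is made. A minimal sketch; the site, path, and agent name below are hypothetical placeholders:

```python
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"   # hypothetical target site
USER_AGENT = "BoothResearchBot"    # agent name registered with the request

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetch and parse the site's robots.txt

# Refuse to scrape any path the file disallows for this agent.
if not parser.can_fetch(USER_AGENT, f"{SITE}/products/"):
    raise SystemExit("Path is disallowed by robots.txt; scraping not authorized")

# Honor a published Crawl-delay directive when one is present.
delay = parser.crawl_delay(USER_AGENT)  # returns None if not specified
```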
Copyright infringement must be prevented during scraping. While facts can be used without copyright infringement, caution should be taken to avoid downloading any “creative content” subject to the DMCA.
Lawful basis - if the data potentially holds personal data of EU residents, the web scraping request must include a statement of the lawful basis for data collection.
- Consent - do you have the consent of the subjects to scrape their data?
- Legitimate Interest - does the collection serve a legitimate interest that is not overridden by the data subjects’ rights? (Under the GDPR, public authorities performing their official tasks cannot rely on this basis.)
If the web scraping tool supports it, use the User-Agent string to identify the scraping activity (see the sketch after this list):
- Include Researcher Name
- Include Researcher E-mail
- Identify project name
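A minimal sketch of setting an identifying User-Agent header with the `requests` library; the researcher name, e-mail, and project name shown are hypothetical placeholders to be replaced with the values from the approved request:

```python
import requests

# Hypothetical identification values; substitute the actual researcher
# name, e-mail, and project name from the approved scraping request.
headers = {
    "User-Agent": (
        "BoothResearchScraper/1.0 "
        "(Project: ExampleStudy; Researcher: Jane Doe; "
        "Contact: jdoe@chicagobooth.edu)"
    )
}

response = requests.get("https://www.example.com/data",
                        headers=headers, timeout=30)
```

Identifying the scraper this way lets a site administrator contact the researcher directly rather than simply blocking the traffic.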
- All reported violations or complaints must be forwarded to Booth IT Security and Data Governance immediately at research-data.requests@list.chicagobooth.edu
If the scraper is blocked by a website, the PI must report the blocking to Booth IT Security and Data Governance immediately at research-data.requests@list.chicagobooth.edu. An Incident Response action will take place to determine the cause of the blocking.
Violating website policies will likely result in active blocking of web crawlers, which would prevent data collection.
Traffic detected in violation of acceptable use will be blocked by Booth IT or University ITS Security staff. All scraping traffic from the offending IP will be suspended until an Incident Response process has been completed.
Circumventing web site security controls is strictly prohibited.
Only one concurrent scraping job per unique website is permitted on the Mercury HPC cluster. Running multiple scraping threads against a single website is prohibited without express written authorization from the website owner.
Crawl Rate - if an exception for an increased crawl rate is desired, the researcher must submit a ticket with the helpdesk. It is the researcher’s responsibility to establish the business need and to confirm that an increased crawl rate will not violate website policies or expose websites to undue risk.
This policy section outlines the guidelines for web scraping activities on Chicago Booth web assets. The purpose of this policy is to ensure that web scraping is conducted responsibly, respects the rights and privacy of users and website owners, and complies with all applicable laws and regulations.
This policy applies to all individuals or entities engaging in web scraping activities on all Chicago Booth web sites or any associated services.
Web scraping refers to the automated extraction of data from websites using bots, scripts, or other automated means.
All web scraping activities on Chicago Booth websites require prior authorization from the website administrators and/or compliance with published robots.txt restrictions. Only authorized users or entities are allowed to perform web scraping.
Web scraping should be conducted responsibly and must not disrupt or interfere with the normal functioning of Chicago Booth sites or services. Excessive scraping that puts undue load on the servers is strictly prohibited.
Web scraping of certain sensitive or private data, such as personal information, login credentials, financial data, or copyrighted material, is strictly prohibited. Site restrictions published within a robots.txt file must be observed.
Web scraping should not infringe upon the intellectual property rights of Chicago Booth or any third party. Respect copyright, trademarks, and any other intellectual property protections.
Web scraping must respect the privacy of users. Avoid scraping any personally identifiable information (PII) or sensitive user data without explicit consent.
All web scraping activities must comply with local, national, and international laws and regulations, including but not limited to data protection laws, copyright laws, and anti-spam laws.
If data obtained through web scraping is used or shared publicly, proper attribution to Chicago Booth as the source of the data is required.
Web scraping should be performed at a reasonable frequency and with rate-limiting mechanisms to avoid overwhelming the servers.
Chicago Booth reserves the right to monitor web scraping activities and enforce this policy. Violations may result in temporary or permanent suspension of scraping access.
Individuals or entities engaged in web scraping activities are solely responsible for any damages, legal consequences, or liabilities resulting from their actions.
Chicago Booth Information Security may update or modify this web scraping policy from time to time. It is the responsibility of users to stay informed about any changes.
For inquiries related to web scraping authorization or questions about this policy, please contact security@lists.chicagobooth.edu