Data collection is crucial to research. As information has increasingly become digital and publicly accessible on the web, web scraping has become a key tool in research data collection. This document outlines Booth’s policy on web scraping as well as standards and guidelines for staff and faculty practicing web scraping techniques.
Automated Data Collection - a technique employed to extract large amounts of data from websites, whereby the data is extracted and saved to a local file through an automated tool or script. Internet tools that systematically browse web pages for the purpose of collecting and/or indexing data may include:
- Bots/Robots
- Spiders
- Web crawlers
Terms and Conditions/Terms of Service - an end-user or access agreement published by a website owner stating access rules, limitations, and legal obligations.
Crawl Rate - the rate at which a web scraper issues requests to a website.
Robots File (robots.txt) - a file published by a website owner stating limitations on web scraping activity.
Computer Fraud and Abuse Act (CFAA) - 18 U.S.C. § 1030
Digital Millennium Copyright Act (DMCA) - 17 U.S.C.
GDPR- General Data Protection Regulation
Web scraping is subject to Booth and University of Chicago data governance policies and processes. This technique of data collection requires Principal Investigators (PIs) to address the need for:
- Non-disclosure Agreements (NDA) and/or Data Use Agreements (DUA)
- Data Retention Policies
- Data Procurement and Management Processes Compliance
If a data source provides an API for data access, the API should be used in place of scraping, as shown in the sketch below.
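A minimal sketch, assuming the source exposes a JSON API; the endpoint URL and parameter below are hypothetical placeholders, not a real data source:

```python
import requests

# Hypothetical endpoint; substitute the source's documented API URL.
API_URL = "https://api.example.com/v1/records"

response = requests.get(API_URL, params={"page": 1}, timeout=30)
response.raise_for_status()
records = response.json()  # structured data; no HTML parsing or crawling needed
```

An API returns structured data under terms the provider has already defined, which avoids both the server load and the T&C ambiguity of scraping HTML.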
Before any web scraping activity can be undertaken, researchers must request authorization.
Booth faculty and staff engaged in web scraping activities have a moral and ethical responsibility to protect the integrity and availability of the websites they scrape.
- Only a single source IP address is authorized and masking of IP addresses is prohibited
- Crawl Rate/Delay - if the website does not specify one, the Booth standard is no more than 1 request every 10 seconds (see the sketch after this list)
- Crawl during the website’s off-peak hours, if possible
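A minimal throttling sketch that enforces the Booth default of one request every 10 seconds; the URL list is a hypothetical placeholder, and the delay should be raised to match any stricter limit the target site publishes:

```python
import time
import requests

CRAWL_DELAY = 10  # seconds; Booth default when the site specifies no crawl delay

# Hypothetical target pages; substitute the URLs approved in the request.
urls = [
    "https://www.example.com/page1",
    "https://www.example.com/page2",
]

for url in urls:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # ...save response.text to a local file here...
    time.sleep(CRAWL_DELAY)  # wait before issuing the next request
```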
The requester and Information Security must review the individual website’s Terms and Conditions (T&C) for web scraping requirements. If the T&C or robots.txt file (see below) prohibits scraping, written permission from the website administrator must be obtained before any scraping activities begin. It is the requester’s responsibility to verify there are no publishing restrictions on collected data. This information should be contained in the DUA.
The robots file is usually found at http(s)://<web address>/robots.txt
Example: https://www.amazon.com/robots.txt
The requester must submit a copy of robots.txt with the web scraping request. Scraping activities must comply with the site’s robots.txt; specifically, any paths marked “Disallow” are not authorized.
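The Python standard library’s `urllib.robotparser` can check whether a path is permitted before any request is made. A minimal sketch; the site, path, and agent name below are hypothetical placeholders:

```python
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"   # hypothetical target site
USER_AGENT = "BoothResearchBot"    # agent name registered with the request

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetch and parse the site's robots.txt

# Refuse to scrape any path the file disallows for this agent.
if not parser.can_fetch(USER_AGENT, f"{SITE}/products/"):
    raise SystemExit("Path is disallowed by robots.txt; scraping not authorized")

# Honor a published Crawl-delay directive when one is present.
delay = parser.crawl_delay(USER_AGENT)  # returns None if not specified
```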
Copyright infringement must be prevented during scraping. While facts can be used without copyright infringement, caution should be taken to avoid downloading any “creative content” subject to the DMCA.
Lawful basis - if the data potentially holds personal data of EU residents, the web scraping request must include a statement of the lawful basis for data collection.
- Consent - do you have the consent of the subjects to scrape their data?
- Legitimate Interest - does the collection serve a legitimate interest that is not overridden by the data subjects’ rights? (Under the GDPR, public authorities performing their official tasks cannot rely on this basis.)
If the web scraping tool supports it, use the User-Agent string to identify the scraping activity (see the sketch after this list):
- Include Researcher Name
- Include Researcher E-mail
- Identify project name
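A minimal sketch of setting an identifying User-Agent header with the `requests` library; the researcher name, e-mail, and project name shown are hypothetical placeholders to be replaced with the values from the approved request:

```python
import requests

# Hypothetical identification values; substitute the actual researcher
# name, e-mail, and project name from the approved scraping request.
headers = {
    "User-Agent": (
        "BoothResearchScraper/1.0 "
        "(Project: ExampleStudy; Researcher: Jane Doe; "
        "Contact: jdoe@chicagobooth.edu)"
    )
}

response = requests.get("https://www.example.com/data",
                        headers=headers, timeout=30)
```

Identifying the scraper this way lets a site administrator contact the researcher directly rather than simply blocking the traffic.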
- All reported violations or complaints must be forwarded to Booth IT Security and Data Governance immediately at research-data.requests@list.chicagobooth.edu
If the scraper is blocked by a website, the PI must report the blocking to Booth IT Security and Data Governance immediately at research-data.requests@list.chicagobooth.edu. An Incident Response action will take place to determine the cause of the blocking.
Violating website policies will likely result in active blocking of web crawlers, which would prevent data collection.
Traffic detected in violation of acceptable use will be blocked by Booth IT or University ITS Security staff. All scraping traffic from the offending IP will be suspended until an Incident Response process has been completed.
Circumventing web site security controls is strictly prohibited.
Only one concurrent scraping job per unique website is permitted on the Mercury HPC cluster. Running multiple scraping threads against a single website is prohibited without express written authorization from the website owner.
Crawl Rate - if an exception for an increased crawl rate is desired, the researcher must submit a ticket with the helpdesk. It is the researcher’s responsibility to establish the business need and to confirm that an increased crawl rate will not violate website policies or expose websites to undue risk.
This policy section outlines the guidelines for web scraping activities on Chicago Booth web assets. The purpose of this policy is to ensure that web scraping is conducted responsibly, respects the rights and privacy of users and website owners, and complies with all applicable laws and regulations.
This policy applies to all individuals or entities engaging in web scraping activities on all Chicago Booth web sites or any associated services.
Web scraping refers to the automated extraction of data from websites using bots, scripts, or other automated means.
All web scraping activities on Chicago Booth websites require prior authorization from the website administrators and/or compliance with published robots.txt restrictions. Only authorized users or entities are allowed to perform web scraping.
Web scraping should be conducted responsibly and must not disrupt or interfere with the normal functioning of Chicago Booth sites or services. Excessive scraping that puts undue load on the servers is strictly prohibited.
Web scraping of certain sensitive or private data, such as personal information, login credentials, financial data, or copyrighted material, is strictly prohibited. Site restrictions published within a robots.txt file must be observed.
Web scraping should not infringe upon the intellectual property rights of Chicago Booth or any third party. Respect copyright, trademarks, and any other intellectual property protections.
Web scraping must respect the privacy of users. Avoid scraping any personally identifiable information (PII) or sensitive user data without explicit consent.
All web scraping activities must comply with local, national, and international laws and regulations, including but not limited to data protection laws, copyright laws, and anti-spam laws.
If data obtained through web scraping is used or shared publicly, proper attribution to Chicago Booth as the source of the data is required.
Web scraping should be performed at a reasonable frequency and with rate-limiting mechanisms to avoid overwhelming the servers.
Chicago Booth reserves the right to monitor web scraping activities and enforce this policy. Violations may result in temporary or permanent suspension of scraping access.
Individuals or entities engaged in web scraping activities are solely responsible for any damages, legal consequences, or liabilities resulting from their actions.
Chicago Booth Information Security may update or modify this web scraping policy from time to time. It is the responsibility of users to stay informed about any changes.
For inquiries related to web scraping authorization or questions about this policy, please contact security@lists.chicagobooth.edu