Booth HPC Environments SLA Policy - v1.0

 

 

 

Research Support Service Level Agreement (SLA) Policy for High-Performance Computing (HPC) Environments

 

 

 

 

 

 

Last updated 11/8/2024*

 

 

 

 

 

 

 

 

*If this policy is more than 365 days old, refer to the policy owner for an updated version.


 

Contents

Version Control

Purpose

Parties Involved

Scope of Services

Services Availability

Support Hours

Incident Response and Resolution Times

User Responsibilities

Performance Metrics

Review and Reporting

Contact Information

 

 


 

Version Control

Version | Author    | Approver | Notes
1.0     | J Buenger |          | Initial policy creation

 

 

 

 

Purpose

This Service Level Agreement (SLA) outlines the scope of services, performance expectations, and support levels provided by the University of Chicago Booth Information Technology (IT) HPC Support Team to researchers utilizing the High-Performance Computing (HPC) resources.

Parties Involved

Scope of Services

The following services are covered under this SLA:

  • HPC cluster access and usage
  • Software installation and support
  • Job scheduling and resource allocation
  • Performance monitoring and tuning
  • Data storage and management
  • Troubleshooting and issue resolution
  • User training and documentation

Services Availability

HPC resources will be available to users as follows:

  • HPC cluster uptime: 99% annual availability
  • Scheduled maintenance windows: Weekly on Wednesdays from 5:00 AM to 8:00 AM. Quarterly extended outages will be announced with at least 7 days' advance notice.
  • Emergency maintenance: As required with notification provided as soon as possible
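
As a worked illustration of the availability target above, the following sketch converts an availability percentage into a yearly downtime budget. It is a minimal example, not part of the policy: the function name `downtime_budget_hours` is hypothetical, and a 365-day year is assumed. The sketch also totals the weekly Wednesday windows for comparison; this policy does not state whether scheduled maintenance counts against the availability target.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def downtime_budget_hours(availability: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime (in hours) allowed by an availability target.

    `availability` is a fraction, e.g. 0.99 for 99%.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_hours


# 99% annual availability, as stated in the list above.
budget = downtime_budget_hours(0.99)
print(f"Downtime budget: {budget:.1f} hours per year")

# Weekly window: Wednesdays 5:00 AM to 8:00 AM (3 hours), assumed 52 times a year.
WEEKLY_WINDOW_HOURS = 3
weekly_maintenance_hours = WEEKLY_WINDOW_HOURS * 52
print(f"Scheduled weekly maintenance: {weekly_maintenance_hours} hours per year")
```

Under these assumptions the budget is about 87.6 hours per year, while the weekly windows total 156 hours per year.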

Support Hours

The Booth IT HPC Support Team will provide support during the following hours:

  • Normal support hours: Monday to Friday, 7:30 AM to 5:30 PM (excluding university holidays)

Incident Response and Resolution Times

The following response and resolution times are established for incidents based on their priority levels:

Priority Level | Description                                | Response Time                | Resolution Time
Critical       | Cluster-wide outages and critical failures | Within 1 hour (24x7)         | Within 4 hours (24x7)
High           | Major service degradation or job failures  | Within 4 hours (24x7)        | Within 1 business day (24x7)
Medium         | General issues affecting individual users  | Within 1 business day (M-F)  | Within 22 business hours (M-F)
Low            | Minor issues, requests for information     | Within 2 business days (M-F) | Within 44 business hours (M-F)
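
To illustrate how a resolution target expressed in business hours might translate into a calendar deadline, the sketch below adds business hours to a report time. It rests on assumptions this policy does not confirm: that "business hours" means the normal support hours (Monday to Friday, 7:30 AM to 5:30 PM), that university holidays are ignored, and that timestamps are local and timezone-naive. The function name `add_business_hours` is hypothetical.

```python
from datetime import datetime, time, timedelta

# Assumption: business hours equal the normal support hours in this policy.
BUSINESS_START = time(7, 30)
BUSINESS_END = time(17, 30)


def add_business_hours(start: datetime, hours: float) -> datetime:
    """Advance `start` by `hours` of business time (Mon-Fri, 7:30-17:30).

    University holidays are NOT handled in this sketch.
    """
    remaining = timedelta(hours=hours)
    current = start
    while True:
        day_open = datetime.combine(current.date(), BUSINESS_START)
        day_close = datetime.combine(current.date(), BUSINESS_END)
        if current.weekday() >= 5 or current >= day_close:
            # Weekend or after closing: move to the next day's opening time.
            current = datetime.combine(
                current.date() + timedelta(days=1), BUSINESS_START
            )
            continue
        if current < day_open:
            # Before opening: the clock starts when support hours begin.
            current = day_open
        available = day_close - current
        if remaining <= available:
            return current + remaining
        # Use up the rest of today and continue on the next day.
        remaining -= available
        current = datetime.combine(
            current.date() + timedelta(days=1), BUSINESS_START
        )


# Hypothetical Medium-priority incident reported on Friday, 15 Nov 2024 at 3:00 PM,
# with the 22-business-hour resolution target from the table above.
reported = datetime(2024, 11, 15, 15, 0)
deadline = add_business_hours(reported, 22)
print(deadline)  # 2024-11-19 17:00:00
```

In this example, 2.5 hours are consumed on Friday, 10 on Monday, and the remaining 9.5 on Tuesday, giving a deadline of Tuesday at 5:00 PM.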

User Responsibilities

Users of the HPC resources are expected to:

  • Adhere to the HPC Computing Cluster’s Architecture and Usage Limits
  • Submit jobs that comply with the HPC Computing Cluster’s Running Programs guidelines
  • Promptly report any issues or malfunctions
  • Participate in training sessions and stay informed about best practices

 

Performance Metrics

Performance of the HPC support services will be measured using the following metrics:

  • Incident response and resolution times
  • Uptime and availability of HPC resources
  • User satisfaction surveys
  • Compliance with scheduled maintenance windows
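
As one possible way to quantify the first metric, the sketch below computes the share of incidents whose first response met the target for their priority. It is illustrative only: the `Incident` record, the `response_compliance` function, and the sample data are hypothetical, and only the Critical and High targets (stated in clock hours, 24x7) are included because the business-day targets would need a business-calendar calculation.

```python
from dataclasses import dataclass

# Response targets in clock hours, taken from the incident table in this policy.
RESPONSE_TARGET_HOURS = {"Critical": 1.0, "High": 4.0}


@dataclass
class Incident:
    priority: str          # "Critical" or "High" in this sketch
    response_hours: float  # elapsed clock hours until first response


def response_compliance(incidents: list[Incident]) -> float:
    """Return the fraction of incidents answered within their response target."""
    if not incidents:
        raise ValueError("no incidents to evaluate")
    met = sum(
        1 for i in incidents
        if i.response_hours <= RESPONSE_TARGET_HOURS[i.priority]
    )
    return met / len(incidents)


# Hypothetical sample data, not real measurements.
sample = [
    Incident("Critical", 0.5),
    Incident("Critical", 1.5),  # misses the 1-hour target
    Incident("High", 3.0),
    Incident("High", 4.0),      # exactly on target counts as met
]
print(f"Response compliance: {response_compliance(sample):.0%}")  # 75%
```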

Review and Reporting

The SLA will be reviewed annually by the Booth IT leadership team. The review will include an evaluation of performance metrics, user feedback, and any necessary adjustments to the SLA.

Contact Information

For support and assistance, users can contact the Booth IT HPC Support Team as follows:

 

 

