Basic Computing Services (subMIT) Review

America/New_York
Building 24-506 (MIT)

Summary

The Basic Computing Services conducted their fourth annual review, which showed continuous growth since the launch in December 2021. SubMIT serves as the Physics Department's analysis facility, with 975 users in total and 50-90 weekly active users across all career stages and centers.

Organizational transitions proceeded smoothly with David Walter assuming project leadership from Joshua Bendavid. The Steering Committee increased the contribution for the Hosting team from 0.5 to 0.75 FTE while maintaining 2.0 FTE for the Project team, funded by the Physics Department.

The past year brought major infrastructure achievements. The team completed the migration from CentOS 7 to AlmaLinux 9, modernized storage by transitioning from Gluster to CephFS while expanding capacity from 500 TB to 1.5 PB (45% used), and relocated the servers to a better-supported computing room. Computing power expanded slightly with additional CPU and GPU resources. Current resources show good availability: 70 TB of backed-up home directories (5% used), 63 TB of work space (15% used), and 44 TB of fast NVMe storage (at full capacity). CephFS provides space for large datasets, with 126 active users and 14 groups. About 20% of users use GPUs alongside CPU resources. The full adoption of control groups (cgroups) improved reliability and mitigated resource abuse. Continuous monitoring and benchmarking ensure good system performance.

The team successfully implemented user requests including central MPI installation, Globus Endpoint for data transfers, group websites with granular access controls, and priority access policies. User support evolved substantially with A2rchi, a custom LLM hosted by SubMIT for data privacy and research. Multi-channel support includes email, Slack, in-person help, and anonymous feedback. The Cleo ticket system handles ~20 tickets monthly, with most resolved within 2 days. Community engagement remains strong through annual workshops featuring tutorials and user-contributed talks, monthly User Group meetings, and classroom usage.

Future plans focus on expanding LDAP server management and streamlining account creation, on automated infrastructure improvements including remote management and PXE boot, and on self-cleaning mechanisms for resource optimization.

SubMIT successfully meets all design specifications and is a critical asset to MIT's physics research community. After four years, it has matured into a stable, high-performance analysis facility that has enabled cutting-edge research contributing to over 20 publications, while remaining accessible and user-friendly. Future plans focus on continued optimization and expanded capabilities to meet growing computational demands.

General discussion:

  • Bolek

    • Concern about adding more computers/racks: power consumption will increase and might exceed the budget.

      • Reply: no plan to expand beyond the capacity of B24. We will buy new servers, but retire old hardware to ensure we stay within the power budget.

      • The basement has been fitted for SubMIT and Facilities will take care of the space for us.

    • The budget was not presented with numbers or in detail. Can we do this?

      • The budget numbers need some more detailed discussion. There are some developments that have so far not been properly accounted for and will need some follow-up offline.

      • The budget is being reviewed in the physics department.

    • 13:00 13:10
      Opening Remarks from the Steering Committee 10m
      Speaker: Christoph Paus (MIT)
    • 13:10 13:25
      Overview: The purpose and impact of SubMIT 15m
      • What is the problem we are trying to solve
      • System usage: total and weekly users, by department etc. …
      • Public presence: Web page, Paper on SubMIT, Publications with SubMIT, …
      Speaker: David Walter
    • 13:25 13:40
      User Workflows on SubMIT 15m
      • Account creation and login
      • Access through JupyterHub or terminal
      • Conda, Containers, singularity
      • Batch computing using slurm, htcondor
      • External resources and how to access them
      Speaker: Luca Lavezzo (MIT)
    • 13:40 13:55
      Hardware resources and performance 15m
      • Hardware resources, compute, network, ...
      • status, capacity, usage, ...
      • What resources make SubMIT attractive
      • Benchmarking of the system, analysis challenge
      Speaker: Mariarosaria D'Alfonso (Massachusetts Institute of Technology)
    • 13:55 14:05
      Break 10m
    • 14:05 14:20
      Previous and future upgrades 15m
      • What did we learn the last year(s)
      • Software upgrades policy
      • The future of SubMIT
        • Control groups (partially done): Limit abuse of the system, include CephFS machines in Slurm pool, …
        • Take control over LDAP server (This will allow us to…)
        • Removal of old data (/ceph)
      Speaker: Zhangqier Wang (Massachusetts Institute of Technology)
    • 14:20 14:35
      How SubMIT provides user support 15m
      • Communication channels: Slack, email, …
      • Chatbot
      • User's guide
      • Emails to tickets analysis
      • How the community evolves/grows
      Speaker: Marianne Moore (MIT)
    • 14:35 14:50
      Engagement with the user community 15m
      • SubMIT workshop, tutorials, user meetings
      • Classroom usage, workshops held at MIT using SubMIT resources
      • Customization and user requests:
        • OpenMPI, Mathematica, Globus;
        • Groups: priority access on purchased hardware, storage, webpage
        • Dropbox-like storage? We didn’t follow up on that
      • Current limitations and open challenges
        • Balancing restrictions/rules with fair share usage
      Speaker: Matthew Heine (Massachusetts Institute of Technology)
    • 14:50 15:00
      Discussion & feedback 10m