Basic Computing Services (subMIT) Review

America/New_York
Building 24-506 (MIT)


Attendance:

  • Steering Committee: Matt Cubstead, Mikhail Ivanov, Christoph Paus, Alexander Rudat
  • Project team: Mariarosaria D'Alfonso, Jan Eysermans, Xuejian (Jacob) Shen, David Walter

Summary

Basic Computing Services (subMIT) held its fifth annual review, showing continuous growth since its launch in December 2021. SubMIT serves as the Physics Department's analysis facility, with 1,037 users in total and approximately 50–90 weekly active users across all career stages, departments, and centers. The system now underpins research contributing to 45 known publications (up from over 20 last year), a number that is trending strongly upward year over year, and is documented in a dedicated publication (EPJ Res. Infrastruct. 10, 2 (2026)) and an ACAT 2025 conference poster.

Organizational transitions again proceeded smoothly, with David Walter preparing to step down as project lead and Mariarosaria D'Alfonso taking over, while Jan Eysermans assumed the deputy role from Matthew Heine, who remains an active team member. On the Steering Committee, Alexander Rudat joined for MKI as Rob Simcoe stepped back. The Steering Committee maintained 2.0 FTE for the Project team and 0.75 FTE for the Hosting team, funded by the Physics Department.

The past year brought major infrastructure achievements. The team established an Ansible-based operational automation framework for graceful, rolling system updates, completed the full adoption of Control Groups v2 to improve reliability and mitigate resource abuse, and upgraded the software stack across AlmaLinux, CephFS, HTCondor, Slurm, NVIDIA drivers, and WordPress under a defined upgrade policy. Storage modernization continued: CephFS raw capacity reached 1.6 PB (69% used, 848 TB stored), serving 164 active users, alongside 70 TB of backed-up home space (5% used, quota doubled to 10 GB), 63 TB of work space (20% used, quota doubled to 100 GB), and ~100 TB of fast NVMe scratch storage (at full capacity). Compute grew to 4,296 CPUs and 66 GPUs (from 3,464 and 34), helped by eight recovered GTX 1080 nodes, and 11 newly purchased 22 TB drives were added to CephFS; about 20% of users use GPUs. Work began on a network management interface (IPMI), and continuous monitoring now includes per-user CPU and memory efficiency benchmarking. Security was a dominant theme, with the team rapidly mitigating a succession of public Linux privilege-escalation vulnerabilities.

The team successfully implemented user requests including the new gpu-express interactive queue for fast GPU testing, building on prior accommodations such as central OpenMPI, the Globus endpoint, group websites, JupyterHub customizations, and an MLOps server. User support continued to evolve with A2rchi, the self-hosted LLM that drafts ticket answers while preserving data privacy. Multi-channel support spans email, Slack, in-person office hours, and anonymous feedback, with the Cleo ticket system handling roughly 20 tickets monthly, most resolved within two days. Community engagement remained strong through the annual workshop, monthly User Group meetings dominated by AI/ML topics, and growing classroom usage across five courses and the Gaia astrophysics hackathon.

Future plans focus on token-based OSG and XRootD authentication, extended Slurm user monitoring, completing the remote management interface, continued Ceph performance optimization, and storage lifecycle management including reclamation of unused allocations.

SubMIT meets all design specifications and is a critical asset to MIT's physics research community. After five years, it has matured into a stable, high-performance analysis facility that has enabled cutting-edge research while maintaining user-friendly accessibility, even as the team navigates a sharp, AI-driven surge in memory and storage hardware costs. Future plans focus on continued optimization and expanded capabilities to meet growing computational demands.

General discussion

Alexander Rudat

  • When project team members leave, is there a transition period to teach new members?
    • There is not much repetitive work on SubMIT; problems are addressed as they arise. For this reason a transition period is not very useful. The focus is on documenting the knowledge so that it is available when it is needed.
  • What makes SubMIT attractive compared to Engaging or other analysis facilities?
    • Multiple factors: the flexibility to react to user needs; easy access to significant storage, GPUs, and external resources such as the Open Science Grid (OSG).
  • Do the publication-by-division fractions reflect the user base?
    • NUPAX has a higher fraction of publications. This could be because more papers are published in this division, but also because most team members have been in NUPAX, so information about published papers reaches the team more readily.
  • How is the evolution/direction of the system defined? How do you decide which updates/configurations/settings are performed? And how do you get informed of new updates/possibilities?
    • We get information about how to improve the system from different sources, including email lists or websites of major software, other computing centers, and our user community.
    • We try to keep our software up to date. When new features become available, we evaluate them and, if we decide to adopt them, adapt our system accordingly.
    • If problems arise, we investigate the source and adapt the system so that they do not happen again, which also leads to evolution and continuous improvement.
  • How do external users get access to SubMIT?
    • They have to be sponsored by a PI. The account is usually opened within minutes to hours; the process is very streamlined and has already been used for ~300 accounts.
  • Some of the GPU nodes are quite old, in particular the NVIDIA GTX 1080. Is it worth maintaining them? Are they used?
    • Yes, these GPUs are still very useful for our users since they are easily accessible and can be used for testing and development or to run lightweight workflows that require GPUs.
  • Do you know how much the SubMIT LLM (Archi) is used by the users?
    • We don't monitor this, but it could be done. Our impression is that users prefer mainstream LLMs (ChatGPT, Claude, Gemini) for general problems, where Archi cannot compete since it is a self-hosted open-weights model. For SubMIT-specific problems Archi would be suitable, but we think most users prefer to write help tickets to the project team.
  • Do the groups that purchased the hard drives for CephFS get priority access?
    • Not in general. They get access to an amount of storage corresponding to what they have invested, i.e. if they purchased 6 drives of 22 TB each (raw), they get 80 TB (effective) of space (accounting for data redundancy, margin, etc.).
  • Do you have a weekend shift in the project team?
    • No, we provide support only during working hours. Support from the project team outside these hours is on a purely voluntary basis.
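
The storage accounting quoted in the CephFS answer above can be checked with a short calculation. The sketch below is purely illustrative: it uses only the two figures given in the minutes (6 drives of 22 TB raw, about 80 TB effective) and derives the raw-to-effective ratio they imply. The function names are hypothetical, and the ratio is inferred from this single example rather than taken from an official SubMIT allocation rule.

```python
def raw_capacity_tb(n_drives: int, drive_size_tb: float) -> float:
    """Total raw capacity contributed by a group, in TB."""
    return n_drives * drive_size_tb


def effective_fraction(raw_tb: float, effective_tb: float) -> float:
    """Share of raw capacity left after redundancy, margin, etc."""
    return effective_tb / raw_tb


# Figures quoted in the minutes: 6 drives of 22 TB (raw) -> 80 TB (effective).
raw = raw_capacity_tb(6, 22.0)
frac = effective_fraction(raw, 80.0)

print(f"raw capacity: {raw:.0f} TB")
print(f"implied effective fraction: {frac:.2f}")
```

In this example roughly 61% of the purchased raw capacity is available to the group as usable space; the remainder covers data redundancy and operational margin.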

Mikhail Ivanov

  • <Some comments/questions I forgot>

  • (Asked after the meeting) Do we allow or encourage people to use LLMs on SubMIT to, e.g., install and manage software, test code, or launch jobs? Is it straightforward, or are there concerns that these bots could mess things up on SubMIT?

    • We allow users to install and use LLMs on SubMIT and do not impose any restrictions at the moment. We monitor this, and in case of negative impact we will investigate possible actions, but at the moment we don't see any problem.

    • LLMs running under a user account have the same rights as the user and can only do what the user can do. Their actions are thus, to first order, the responsibility of the user.

    • 13:00 13:10
      Opening Remarks from the Steering Committee 10m
      Speaker: Christoph Paus (MIT)
    • 13:10 13:25
      The purpose and impact of SubMIT 15m
      • What is the problem we are trying to solve
      • How the project team is organized. …
      • Public presence: Web page, Paper on SubMIT, Publications with SubMIT, …
      Speaker: David Walter (Massachusetts Institute of Technology)
    • 13:25 13:40
      User Workflows on SubMIT 15m
      • Account creation and login
      • Access through JupyterHub or terminal
      • Conda, containers, Singularity
      • Batch computing using Slurm, HTCondor
      • External resources and how to access them
      Speaker: Jan Eysermans (Massachusetts Institute of Technology)
    • 13:40 13:55
      Hardware resources and performance 15m
      • Hardware resources, compute, network, GPUs, ...
      • status, capacity, usage, ...
      • What resources make SubMIT attractive
      • Benchmarking of the system, analysis challenge
      Speaker: Mariarosaria D'Alfonso (Massachusetts Institute of Technology)
    • 13:55 14:05
      Break 10m
    • 14:05 14:20
      How SubMIT provides user support 15m
      • Communication channels: Slack, email, …
      • Chatbot
      • User's guide
      • Emails to tickets analysis
      • How the community evolves/grows
      Speaker: Xuejian (Jacob) Shen (Massachusetts Institute of Technology)
    • 14:20 14:35
      Engagement with the user community 15m
      • Workshop
      • Users Group Meetings
      • Classroom Usage / User-Run Workshops
      • System-Level Customization / User Requests
      • Current Limitations & Open Challenges
      Speaker: Matthew Heine (Massachusetts Institute of Technology)
    • 14:35 14:45
      Discussion & feedback 10m