Zoom connection available at
https://mit.zoom.us/j/96743699673?pwd=b3h2Q3c3cVQwYW12blhMUG5SWXZCZz09
Attendance:
The Basic Computing Services conducted their fifth annual review, showing continuous growth since launching in December 2021. SubMIT serves as the Physics Department's analysis facility, with 1,037 total users and approximately 50–90 weekly active users across all career stages, departments, and centers. The system now underpins research contributing to 45 known publications (up from over 20 last year), continuing a strong upward trend, and is documented in a dedicated publication (EPJ Res. Infrastruct. 10, 2 (2026)) and an ACAT 2025 conference poster.
Organizational transitions again proceeded smoothly, with David Walter preparing to step down as project lead and Mariarosaria D'Alfonso taking over, while Jan Eysermans assumed the deputy role from Matthew Heine, who remains an active team member. On the Steering Committee, Alexander Rudat joined for MKI as Rob Simcoe stepped back. The Steering Committee maintained 2.0 FTE for the Project team and 0.75 FTE for the Hosting team, funded by the Physics Department.
The past year brought major infrastructure achievements. The team established an Ansible-based operational automation framework for graceful, rolling system updates, completed the full adoption of Control Groups v2 to improve reliability and mitigate resource abuse, and upgraded the software stack across AlmaLinux, CephFS, HTCondor, Slurm, NVIDIA drivers, and WordPress under a defined upgrade policy. Storage modernization continued: CephFS raw capacity reached 1.6 PB (69% used, 848 TB stored), serving 164 active users, alongside 70 TB of backed-up home space (5% used, quota doubled to 10 GB), 63 TB of work space (20% used, quota doubled to 100 GB), and ~100 TB of fast NVMe scratch storage (at full capacity). Compute grew to 4,296 CPUs and 66 GPUs (from 3,464 and 34), with eight recovered GTX 1080 nodes and 11 purchased 22 TB CephFS drives; about 20% of users use GPUs. Work began on a network management interface (IPMI), and continuous monitoring now includes per-user CPU and memory efficiency benchmarking. Security was a dominant theme, with the team rapidly mitigating a succession of public Linux privilege-escalation vulnerabilities.
The team successfully implemented user requests including the new gpu-express interactive queue for fast GPU testing, building on prior accommodations such as central OpenMPI, the Globus endpoint, group websites, JupyterHub customizations, and an MLOps server. User support continued to evolve with A2rchi, the self-hosted LLM that drafts ticket answers while preserving data privacy. Multi-channel support spans email, Slack, in-person office hours, and anonymous feedback, with the Cleo ticket system handling roughly 20 tickets monthly, most resolved within two days. Community engagement remained strong through the annual workshop, monthly User Group meetings dominated by AI/ML topics, and growing classroom usage across five courses and the Gaia astrophysics hackathon.
Future plans focus on token-based OSG and XRootD authentication, extended Slurm user monitoring, completing the remote management interface, continued Ceph performance optimization, and storage lifecycle management including reclamation of unused allocations.
SubMIT successfully meets all design specifications as a critical asset to MIT's physics research community. After five years, it has matured into a stable, high-performance analysis facility that has enabled cutting-edge research while maintaining user-friendly accessibility, even as the team navigates a sharp, AI-driven surge in memory and storage hardware costs. Future plans focus on continued optimization and expanded capabilities to meet growing computational demands.
<Some comments/questions I forgot>
(Asked after the meeting) Do we allow or encourage people to use LLMs on SubMIT to, e.g., install and manage software, test code, and launch jobs? Is this straightforward, or are there concerns that these bots could mess things up on SubMIT?
We allow users to install and use LLMs on SubMIT and do not impose any restrictions at the moment. We monitor this, and in case of negative impact we will investigate possible actions, but so far we have not seen any problems.
An LLM running under a user account has the same rights as that user and can only do what the user can do; to first order, it is therefore the user's responsibility.