Weekly SubMIT project team meeting

America/New_York
ZOOM

David Walter (Massachusetts Institute of Technology), Jan Eysermans (Massachusetts Institute of Technology), Mariarosaria D'Alfonso (Massachusetts Institute of Technology)

Quick recap

The meeting reviewed recent system issues and updates to the computing infrastructure. David reported that the SSH service drops on the login nodes were resolved by introducing new memory and swap limits: users are restricted to 50% of RAM and at most 5 GB of swap, and CPU usage is limited to half of the available cores. The team discussed the difficulty of upgrading to Alma Linux 9.8, which is blocked by incompatible Nvidia drivers on older GPUs; Alexander agreed to investigate possible solutions. Jan gave an update on token authentication for OSG and CMS, including the fix for connection issues that appeared after a reboot of the submit06 node. The team also addressed submit67 being down, most likely because of a kernel installation problem, and discussed the need to better organize and document configuration files on GitHub.

Next steps

Alexander

  • Investigate and resolve the issue with submit67 (kernel panic/disk failure) and attempt to bring it back online.
  • Look into the compatibility issue between Alma Linux 9.8 and Nvidia drivers for older GPUs (submit60s, 75, 76, 77).

David

  • Share the minutes of the annual review in the project team chat and iterate with steering committee members for feedback.
  • Set up future steering committee meetings using Indico for automatic reminders and calendar invites.
  • Upload the minutes and slides to Indico after meetings.
  • Monitor the new resource limits on login nodes (submit0-submit8) and reassess if issues persist.
  • Propagate relevant updates to the documentation.

Jan

  • Investigate whether it is normal for the SSH service to use swap and provide feedback.
  • Follow up with OSG to activate the generated token for authentication.
  • Investigate enabling token authentication for Condor and document the process.
  • Create a list of all necessary configuration files for Condor and Slurm, organize them on GitHub in the submit admin repository, and provide instructions.

Mariarosaria

  • Investigate and configure the Archie service to restart automatically after a reboot.
  • Post the error message regarding the compilation issue on the 1080 GPUs and create a ticket for Alexander.

Summary

System Issues and Updates Discussion

The team discussed recent system issues and updates. David reported resolving the SSH service problems on the login nodes by introducing per-user memory and CPU limits, setting swap limits, and unifying the restrictions across all machines. The team also addressed GPU-related challenges, including compatibility issues between the new Alma Linux 9.8 kernels and driver updates for older Nvidia cards. Alexander agreed to investigate the submit67 node failure and the compilation issues on the 1080 GPUs. Mariarosaria noted that the Archie service needs to be configured to restart automatically after reboots. The team plans to upgrade all machines to Alma Linux 9.8 once the GPU driver compatibility issues are resolved.
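
As an illustration of the limits reported above (50% of RAM, at most 5 GB of swap, half of the cores), the following minimal Python sketch computes the corresponding values for a given machine. It assumes the limits are enforced through systemd resource-control directives (`MemoryMax`, `MemorySwapMax`, `CPUQuota`) on a user slice; the minutes do not state which mechanism is actually used. The function names and the example node size (128 GiB, 64 cores) are hypothetical, and "5 GB" is taken to mean 5 GiB.

```python
GIB = 1024 ** 3  # bytes per GiB


def login_node_limits(total_ram_bytes: int, total_cores: int) -> dict:
    """Derive per-user limits from the policy stated in the minutes.

    Assumption: values are expressed as systemd resource-control
    directives; the real deployment may use a different mechanism.
    """
    return {
        # 50% of physical RAM, in bytes
        "MemoryMax": total_ram_bytes // 2,
        # at most 5 GB of swap (interpreted here as 5 GiB)
        "MemorySwapMax": 5 * GIB,
        # CPUQuota is a percentage of ONE core, so half of N cores is N * 50%
        "CPUQuota": f"{total_cores * 50}%",
    }


def render_dropin(limits: dict) -> str:
    """Format the limits as the body of a hypothetical slice drop-in file."""
    lines = ["[Slice]"] + [f"{key}={value}" for key, value in limits.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical login node: 128 GiB of RAM and 64 cores
    print(render_dropin(login_node_limits(128 * GIB, 64)))
```

Deriving the values from the hardware of each machine, rather than hard-coding them, is one way to keep the restrictions uniform across nodes of different sizes.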

Token Authentication Implementation Updates

Jan reported implementing token authentication for OSG as requested. The token has been generated on the submit06 node and is awaiting activation on the OSG side before it can be fully deployed. Jan also resolved a problem with job submissions after a recent reboot of submit06, which appeared to affect the token permissions for the CMS pool, by reinstalling the token. Jan plans to investigate enabling token authentication for Condor users and to compile a comprehensive list of configuration files for both Condor and Slurm, to be organized on GitHub.
