The meeting reviewed recent system issues and updates to the computing infrastructure. David reported that the SSH service drops on the login nodes were resolved by introducing new resource limits: users are restricted to 50% of RAM, a maximum of 5 GB of swap, and half of the available CPU cores. The team discussed the obstacles to upgrading to Alma Linux 9.8, where the Nvidia drivers for older GPUs are incompatible, and Alexander agreed to investigate possible solutions. Jan gave an update on token authentication for OSG and CMS, including the fix for connection issues that followed a reboot of the submit06 node. The team also addressed the outage of submit67, most likely caused by a kernel installation problem, and discussed the need to better organize and document configuration files on GitHub.
The team discussed recent system issues and updates. David reported resolving the SSH service problems on the login nodes by setting memory, swap, and CPU usage limits for users and applying the same restrictions on all machines. The team also addressed GPU-related challenges, including compatibility issues with the new Alma Linux 9.8 kernels and driver updates for older Nvidia cards. Alexander agreed to investigate the submit67 node failure and the compile issues with the 1080 GPUs. Maria Rosaria noted that the Archie service needs to be configured to restart automatically after reboots. The team plans to upgrade all machines to Alma Linux 9.8 once the GPU driver compatibility issues are resolved.
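As an illustration of the login-node limits described above (50% of RAM, 5 GB of swap, half of the CPU cores), the following is a minimal sketch of how such values could be derived for one machine. The meeting did not specify the mechanism used, so the systemd slice drop-in format is an assumption; `login_node_limits` and `render_dropin` are hypothetical helper names, while `MemoryMax`, `MemorySwapMax`, and `CPUQuota` are standard systemd resource-control settings.

```python
GIB = 1024 ** 3


def login_node_limits(total_ram_bytes: int, cpu_cores: int) -> dict:
    """Derive per-user limits: 50% of RAM, 5G swap, half of the cores.

    Assumption: the limits are expressed as systemd resource-control
    settings; the meeting notes do not say how they were applied.
    """
    if total_ram_bytes <= 0 or cpu_cores <= 0:
        raise ValueError("RAM size and core count must be positive")
    return {
        # Half of physical memory, in bytes.
        "MemoryMax": str(total_ram_bytes // 2),
        # Fixed swap ceiling; systemd reads the G suffix as base 1024.
        "MemorySwapMax": "5G",
        # CPUQuota is relative to one CPU, so half of N cores is N * 50%.
        "CPUQuota": f"{cpu_cores * 50}%",
    }


def render_dropin(limits: dict) -> str:
    """Format the limits as the text of a [Slice] drop-in file."""
    lines = ["[Slice]"] + [f"{key}={value}" for key, value in limits.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values only: a node with 128 GiB of RAM and 64 cores.
    print(render_dropin(login_node_limits(128 * GIB, 64)))
```

If systemd is in fact the mechanism, text like this would typically go into a drop-in under `/etc/systemd/system/user-.slice.d/` so that it applies to every user slice; generating it from the hardware values of each node is one way to keep the restrictions uniform across machines with different RAM and core counts.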
Jan reported implementing token authentication for OSG as requested. The token was generated on the submit06 node and is awaiting activation on the OSG side before the implementation can be completed. Jan also resolved a job submission problem that followed the recent reboot of submit06, which appeared to have affected the token's permissions for the CMS pool; reinstalling the token fixed it. Jan plans to investigate enabling token authentication for Condor users and to compile a comprehensive list of the Condor and Slurm configuration files so that they can be organized on GitHub.