Weekly SubMIT project team meeting

America/New_York
ZOOM

David Walter (Massachusetts Institute of Technology), Jan Eysermans (Massachusetts Institute of Technology), Mariarosaria D'Alfonso (Massachusetts Institute of Technology)
  • Monday, 1 June
    • 1
      Discussion
      Speakers: Alexander Avdoshkin (MIT), David Walter (Massachusetts Institute of Technology), Luca Lavezzo (MIT), Marianne Moore (MIT), Mariarosaria D'Alfonso (Massachusetts Institute of Technology), Matthew Heine (Massachusetts Institute of Technology), Xuejian(Jacob) Shen (Massachusetts Institute of Technology)
      • Main focus: preparation for the annual review (steering committee) on Wednesday, 1–3 PM, in the 5th floor IndyCor
        office, with Zoom and coffee/cookies.
      • Annual Review Prep

        - Slides: drafts uploaded; everyone to finalize today and cross-review others' slides for overlap/duplication. Emphasize
        what's new since last year (e.g. storage quota increase; Jan's HTCondor CPU/memory-efficiency tool deserves a dedicated
        slide). David to add a year-in-review summary pulled from Cleo tickets.
        - Overlap noted: Maria & Jan both cover user storage quotas (5/10/100 GB) — to be deconflicted offline.
        - Logistics: Matt can't attend (rescheduled clash), so David presents his slides; an editable PowerPoint and the Cleo
        ticket are provided. Jacob (the only other in-person attendee) will arrive ~20–30 min early to help set up; David
        handles coffee. Christoph gives an overview/funding talk similar to last year's.
        - Action (Jan): test X11/X-Win32 (Windows) and X2Go tonight; remove what is dead, but fix X2Go rather than removing it
        (Christoph wants it kept). Matt is reassigning the relevant Cleo ticket to Jan.

      • Storage / Ceph (Maria)

        - A2rchi crashes fixed via retry-on-fail (30s) — stable 10+ days.
        - Ceph perf testing (Analysis Grand Challenge, 1.7 TB): the new ROOT fix gives no improvement over the old version, and
        Ceph remains ~10× slower than scratch. Will re-test with writes to a new directory; for the review this is reported as
        a single bullet (stability recovered, perf unchanged, investigating). David: rebalancing/scrubbing is still running
        (~weeks) and may slightly affect results.
        - Disk-usage monitoring scripts (from Marian) copied over and committed to GitHub. CephFS ~70% full; old-user /data/user
        holds 24 TB (4 users); revisit reclaiming old user/group space at 75–80%. Action (Maria): open ticket summarizing
        storage-management findings by end of week. Also: orphaned groups (no users) need a manual cleanup process.
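The retry-on-fail approach mentioned for A2rchi above can be illustrated with a minimal sketch. This is not the actual fix: the function name `retry_on_fail`, the `max_attempts` cap, and the injectable `sleep` argument are assumptions made for illustration. Only the 30 s delay comes from the notes.

```python
import time


def retry_on_fail(func, delay_s=30.0, max_attempts=5, sleep=time.sleep):
    """Call func(); on an exception, wait delay_s seconds and try again.

    Hypothetical helper: only the 30 s delay is taken from the minutes,
    the attempt cap is an assumed safeguard against retrying forever.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < max_attempts:
                sleep(delay_s)
    # All attempts failed: surface the last error instead of hiding it.
    raise last_error
```

Capping the number of attempts keeps a permanent failure visible instead of letting the service loop silently; whether the real fix has such a cap is not stated in the notes.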

      • Tickets / Other

        - Ronald Garcia's group requesting 8–10 TB backup space (OK to grant) + testing scratch for a large (hundreds of TB)
        workflow — Christoph to follow up on plans/possible funding.
        - OSG (Jan): following Tim's instructions, blocked on broken token retrieval on OSG's website — Jan to chase Tim, document
        in Cleo.
        - Inefficient users (Jan): contact Simon (½M jobs, 8% memory efficiency). Efficiency plots over 30/90 days look broken — Jan
        to check script.
        - Slurm monitoring (Jacob): building HTCondor-style efficiency monitoring; coordinate shared plotting tools with Jan;
        summarize status on one slide for review.
        - submit06 slowdown (Matt): mount-check cron jobs piled up (~50–60) because the timeout stopped working when the script
        was ported from bash to Python (the timeout did not kill the subprocess). Old cron jobs removed; Matt will fix the
        timeout and add a file-lock guard, and may roll the lock out to all cron jobs after validating it on one.
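The two fixes planned for the submit06 mount check (a timeout that really kills the subprocess, plus a file-lock guard) could look roughly like the sketch below. It is not Matt's script: `LOCK_PATH`, `CHECK_CMD`, `TIMEOUT_S`, and all function names are hypothetical, and the notes do not say why the timeout failed. One common cause, assumed here, is that only the direct child is killed while its descendants keep running, so the sketch kills the whole process group.

```python
import fcntl
import os
import signal
import subprocess

LOCK_PATH = "/tmp/mount_check.lock"   # hypothetical lock file location
CHECK_CMD = ["ls", "/mnt/example"]    # hypothetical mount-check command
TIMEOUT_S = 30                        # hypothetical timeout


def run_with_timeout(cmd, timeout_s):
    """Run cmd in its own session; on timeout, kill its whole process group.

    Returns the exit code, or None if the command timed out.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            # The child leads its own session, so its PID is also its group ID.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited between the timeout and the kill
        try:
            proc.wait(timeout=5)  # reap the child without blocking forever
        except subprocess.TimeoutExpired:
            pass  # e.g. stuck in uninterruptible I/O on a hung mount
        return None


def acquire_lock(path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def main():
    """Entry point a cron job would call; returns a shell-style exit code."""
    lock = acquire_lock(LOCK_PATH)
    if lock is None:
        return 0  # a previous run is still active: skip instead of piling up
    with lock:  # closing the file releases the lock
        status = run_with_timeout(CHECK_CMD, TIMEOUT_S)
    return 0 if status == 0 else 1
```

With the lock in place, a hung run can delay later checks but can no longer accumulate dozens of copies. A process blocked in uninterruptible I/O may survive even `SIGKILL`, which is why the final `wait` is itself bounded.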

      • Software

        - New kernel exploit mitigated by disabling the unused vulnerable package.
        - AlmaLinux 9.8 rollout via Ansible in progress (GPU nodes done, rest ongoing).
        - Ceph 20.2.2 in validation (~1–2 weeks out); David plans to apply the minor upgrade — Christoph urged caution.

        Next: finalize/cross-review slides today; steering meeting Wednesday.