Speakers:
Alexander Avdoshkin (MIT),
David Walter (Massachusetts Institute of Technology),
Luca Lavezzo (MIT),
Marianne Moore (MIT),
Mariarosaria D'Alfonso (Massachusetts Institute of Technology),
Matthew Heine (Massachusetts Institute of Technology),
Xuejian (Jacob) Shen (Massachusetts Institute of Technology)
- Main focus: Preparation for the annual review (steering committee), Wednesday, 5th floor IndyCor office, 1–3 PM, with Zoom +
coffee/cookies.
- Annual Review Prep
- Slides: drafts uploaded; everyone to finalize today and cross-review others' slides for overlap/duplication. Emphasize
what's new since last year (e.g. storage quota increase; Jan's HTCondor CPU/memory-efficiency tool deserves a dedicated
slide). David to add a year-in-review summary pulled from Cleo tickets.
- Overlap noted: Maria & Jan both cover user storage quotas (5/10/100 GB) — to be deconflicted offline.
- Logistics: Matt can't attend (rescheduled clash) — David presents his slides; editable PowerPoint + Cleo ticket provided.
Jacob (only other in-person) to arrive ~20–30 min early to help set up; David handles coffee. Christoph gives a similar
overview/funding talk as last year.
- Action (Jan): test X11/X-Win32 (Windows) and X2Go tonight; remove whatever is dead, but fix X2Go rather than dropping it
(Christoph wants it kept). Matt is reassigning the relevant Cleo ticket to Jan.
- Storage / Ceph (Maria)
- A2rchi crashes fixed via retry-on-fail (30s) — stable 10+ days.
- Ceph perf testing (Analysis Grand Challenge, 1.7 TB): the new ROOT fix shows no improvement over the old version, and Ceph
remains ~10× slower than scratch. Will re-test write-to-new-directory; will report as a bullet only (stability recovered,
perf unchanged, investigating). David: rebalancing/scrubbing is still running (~weeks) and may slightly affect results.
- Disk-usage monitoring scripts (from Marian) copied over and committed to GitHub. CephFS ~70% full; old-user /data/user
holds 24 TB (4 users); revisit reclaiming old user/group space at 75–80%. Action (Maria): open ticket summarizing
storage-management findings by end of week. Also: orphaned groups (no users) need a manual cleanup process.
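The 75–80% revisit point above could be checked automatically. The following is a minimal sketch, not the monitoring scripts from Marian: the function names and the default threshold of 0.75 (the lower end of the range in the notes) are assumptions for illustration.

```python
import shutil

# Assumption: lower end of the 75-80% range mentioned in the notes.
RECLAIM_THRESHOLD = 0.75


def fill_fraction(used_bytes: int, total_bytes: int) -> float:
    """Return the used fraction of a filesystem, between 0.0 and 1.0."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def needs_reclaim(used_bytes: int, total_bytes: int,
                  threshold: float = RECLAIM_THRESHOLD) -> bool:
    """True once usage reaches the threshold at which reclaiming is revisited."""
    return fill_fraction(used_bytes, total_bytes) >= threshold


def check_mount(path: str, threshold: float = RECLAIM_THRESHOLD) -> bool:
    """Apply the threshold check to the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return needs_reclaim(usage.used, usage.total, threshold)
```

At the current ~70% fill, `needs_reclaim(70, 100)` is false; it becomes true at 75%. Passing the CephFS mount point to `check_mount` would run the same check against live numbers.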
- Tickets / Other
- Ronald Garcia's group requesting 8–10 TB backup space (OK to grant) + testing scratch for a large (hundreds of TB)
workflow — Christoph to follow up on plans/possible funding.
- OSG (Jan): following Tim's instructions, blocked on broken token retrieval on OSG's website — Jan to chase Tim, document
in Cleo.
- Inefficient users (Jan): contact Simon (½M jobs, 8% memory efficiency). Efficiency plots over 30/90 days look broken — Jan
to check script.
- Slurm monitoring (Jacob): building HTCondor-style efficiency monitoring; coordinate shared plotting tools with Jan;
summarize status on one slide for review.
- submit06 slowdown (Matt): mount-check cron jobs piled up (~50–60) because the timeout stopped working when the script was
migrated from bash to Python (the timeout didn't kill the subprocess). Old cron jobs removed; will fix the timeout and add a
file-lock guard, possibly rolling the lock out to all cron jobs after validating on one.
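The two fixes planned for the mount-check script (a timeout that actually kills the child, plus a file-lock guard) can be sketched as follows. This is an illustration under assumptions, not Matt's script: `LOCK_PATH`, the function names, and the `ls /` placeholder command are hypothetical. It is Linux-only (`fcntl`), and the child is started in its own process group so that a timeout kills any grandchildren as well, which is one common reason a timeout appears not to take effect.

```python
import fcntl
import os
import signal
import subprocess

LOCK_PATH = "/tmp/mount_check.lock"  # hypothetical lock-file location


def acquire_lock(path):
    """Return an open file holding an exclusive lock, or None if already held."""
    handle = open(path, "w")
    try:
        # Non-blocking: fail immediately instead of queueing behind a stuck run.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def run_with_timeout(cmd, timeout_s):
    """Run cmd in its own process group; kill the whole group on timeout.

    Returns the exit code, or None if the command timed out.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # With start_new_session=True the child's PID is also its group ID.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the killed child
        return None


def main():
    lock = acquire_lock(LOCK_PATH)
    if lock is None:
        return 0  # a previous run is still active; skip this invocation
    try:
        rc = run_with_timeout(["ls", "/"], timeout_s=30)  # placeholder check
        return 1 if rc is None else rc
    finally:
        lock.close()  # closing the file releases the lock
```

The lock alone would have prevented the pile-up, since each new cron invocation exits immediately while an earlier one is still running. One caveat: a process blocked in uninterruptible I/O on a hung mount cannot be killed even with SIGKILL, so the final `proc.wait()` could still block in that case; the lock guard at least keeps such a run from multiplying.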
- Software
- New kernel exploit mitigated by disabling the unused vulnerable package.
- AlmaLinux 9.8 rollout via Ansible in progress (GPU nodes done, rest ongoing).
- Ceph 20.2.2 in validation (~1–2 weeks out); David plans to apply the minor upgrade — Christoph urged caution.
Next: finalize/cross-review slides today; steering meeting Wednesday.