2025 storage outage

Dear SC Users,

Our /work directories have been unresponsive for the last day due to a problem with a storage server.

We regret to inform you of possible data loss affecting approximately 20,000 files. Fortunately, that's less than 0.01% of all files in the system, so your individual files may not be affected at all. Unfortunately, we don't have a way to list those 20,000 affected files directly.

Since only files on storage server OST1 are affected, you can run the following command on login01 or login02 to list all of your files stored on OST1:

    lfs find --ost 1 /work/{my-work-directory}
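
If you would like to keep that list for later checking, the lfs find call above can be wrapped in a short Python script. This is only a sketch: the function names and the output file name ost1_files.txt are made up for illustration, and it assumes that Python 3 and the lfs client are available on the login node.

```python
import subprocess
from pathlib import Path


def build_lfs_find_command(work_dir, ost_index=1):
    """Return the lfs find command from this email as an argument list."""
    return ["lfs", "find", "--ost", str(ost_index), str(work_dir)]


def save_ost_file_list(work_dir, out_path="ost1_files.txt", ost_index=1):
    """Run lfs find and write its output (one path per line) to out_path.

    Returns the number of lines written. The output is handled as raw
    bytes so that unusual file names do not cause decoding errors.
    """
    result = subprocess.run(
        build_lfs_find_command(work_dir, ost_index),
        check=True,              # raise an error if lfs reports a failure
        stdout=subprocess.PIPE,  # capture the listing instead of printing it
    )
    Path(out_path).write_bytes(result.stdout)
    return result.stdout.count(b"\n")
```

Calling save_ost_file_list() with your own work directory in place of /work/{my-work-directory} should leave the listing in ost1_files.txt in the current directory.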

For those files, please check manually whether they look right and whether they can be loaded with the respective software. For manually installed software such as conda environments, Python modules or LLMs, it may be best to delete and reinstall them completely. Please find a list of recommendations at the end of this email.

Over the next few days, we will run a scan of the storage system to find all possibly corrupted files. You'll be notified individually when the results are in.

During the storage outage, the SLURM scheduler paused all running and queued jobs. We're now resuming all compute nodes, so the queued jobs should start again. Please check the results of all jobs from the past 30 hours, since their output files may not have been written completely or may show signs of data corruption.

We're available for in-person and video-call support in this matter. If you're interested, please write to us at sc-request@uni-leipzig.de or reply to this email.

Please check the Service Status at https://www.sc.uni-leipzig.de/00_Status/status/ for updates.

Best regards,
Your SC Team

Recommendations:

Affected files should show control characters or other obvious garbage.

  • For text / config files: Run less {myfile}.txt and scroll through the file, e.g. with the space bar. Exit with the 'q' key

  • For .csv files: Either open them with Jupyter/Pandas, or open them with less or any text editor

  • For job checkpoints: There's no easy way to check them. If unsure, delete them and restart the job

  • For LLMs: Delete and download them again

  • For software installations such as Conda environments, Python environments or similar: Reinstall them

  • For source code: Compare against version control such as Git, or try recompiling it

  • For simulation results, e.g. .pdb atomic trajectories: Try opening them with the appropriate software package and see if errors are shown
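
For plain text and .csv files, the manual check for control characters described above could be partly automated. The following Python sketch is a rough heuristic, not an official tool: the function names are made up for illustration, and it assumes that intact text files contain no control bytes other than tab, newline and carriage return. Binary files (checkpoints, models, compiled software) contain such bytes even when intact, so it is not meaningful for them.

```python
# Control bytes that normally do not occur in plain text: everything below
# 0x20 except tab (0x09), newline (0x0A) and carriage return (0x0D), plus DEL.
SUSPICIOUS_BYTES = bytes(set(range(0x20)) - {0x09, 0x0A, 0x0D}) + b"\x7f"


def count_control_bytes(data):
    """Count how many bytes in data are suspicious control characters."""
    # translate(None, delete) removes the listed bytes; the difference in
    # length is the number of suspicious bytes.
    return len(data) - len(data.translate(None, SUSPICIOUS_BYTES))


def looks_corrupted(path, chunk_size=1 << 20):
    """Return True if the file at path contains any suspicious control byte."""
    with open(path, "rb") as fh:
        # Read in chunks so that large files do not have to fit into memory.
        while chunk := fh.read(chunk_size):
            if count_control_bytes(chunk) > 0:
                return True
    return False
```

A file flagged by looks_corrupted() is worth opening with less; a file that is not flagged may still be damaged in other ways, for example if it was truncated.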