This page lists key points about the HPCaVe servers that can help you make the most of your computing time.
- Test the scalability: before running your jobs on a lot of cores, make sure that they scale as expected. Even if your program has been proven scalable on other systems, test its scalability on HPCaVe servers as thoroughly as you can: for instance, run your job on 1 core, then on one full processor (8 cores on MeSU-alpha or 12 cores on MeSU-beta), then on one node (16 or 24 cores), and finally on multiple nodes. This should reassure you that no hardware-related performance issues occur.
- Implement checkpointing: if your jobs run for a long time, consider implementing checkpointing. We cannot guarantee continuous availability of the servers (electricity or water problems, maintenance, and unexpected crashes may occur). Making sure that your job frequently dumps its current state to disk allows you to resume your work after an interruption.
- Mind the wall times: do not risk having your job stopped at the time limit. Most queues interrupt a job after 12 hours of computation (you can check this with qstat -q).
- Don’t request too much: as far as possible, request only the number of processors you actually need. Requesting too many resources places your job on “bigger” queues, which have a lower priority and therefore run less often.
- Launch your jobs during off-peak hours: server usage is lower at night. Do not hesitate to launch jobs then, when you can get access to “bigger” queues.
- Avoid canceling: canceling a job before it runs is “expensive” because job priority increases over time. The longer a job waits, the higher it climbs in the queue, and a canceled job gives up that accumulated priority.
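The checkpointing and wall-time advice above can be combined in one loop. The following is only a minimal sketch, not an HPCaVe-provided tool: the file name, the JSON state layout, the `run` function and its `time_budget_s` parameter are all hypothetical, and the "work" is a placeholder sum. It writes the state atomically (temporary file, then rename) so that a crash during the write cannot corrupt the previous checkpoint, and it stops cleanly once a self-imposed time budget is spent.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path, state):
    """Write state to path atomically: write a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # replaces the old checkpoint in one step


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def run(path, n_steps, time_budget_s, checkpoint_every=100):
    """Resume from path and run until done or until the time budget is spent.

    Returns True when all steps are finished, False when the job stopped
    early and should be resubmitted to continue from the checkpoint.
    """
    start = time.monotonic()
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if time.monotonic() - start > time_budget_s:
            save_checkpoint(path, state)
            return False
        state["total"] += state["step"]  # placeholder for one unit of real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return True
```

With a 12-hour wall time, one might call `run("checkpoint.json", n_steps=1000, time_budget_s=11.5 * 3600)`, leaving a margin (here an assumed 30 minutes) for the final write. Resubmitting the same job then picks up from the last saved step.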
Files and data management
- Use /scratch volumes, which are optimized for performance. You can find more information on those file volumes here.
- Back up your data: /scratchalpha and /scratchbeta are not backed up, and we cannot guarantee a 100% failsafe backup of your /home directories.
- Backups are not an “undelete”: restoring your /home after an incident depends mainly on the time between the deletion of the files and the backup restoration request to the HPCaVe admins, as well as on the life span of the files (did they exist long enough?).
- Do not let your job write heavy output to your /home directory. Not only will your quota be exceeded quickly, but you will also lose performance by not using the /scratch directories.
- Respect your quotas: 30 GB on /home, 250 GB on /scratchalpha and 500 GB on /scratchbeta.
- Keep your data secure! HPCaVe systems are highly secured, but they are often scanned over the network by would-be intruders. Although it is highly improbable, we cannot guarantee that data theft will never happen. Therefore, do not store nuclear launch codes on HPCaVe servers.
- Use a strong personal password: your password is the same as that of your UPMC account. Make sure it is strong, long and complex enough, and do not share it with others (DSI policies).
- Respect the DSI policy
- Keep the software on your laptops and PCs up to date
- Keep in mind that the network traffic is monitored by the DSI
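To keep an eye on the quotas listed under Files and data management, a small script can estimate how much space a directory uses. This is a rough sketch under stated assumptions: the helper names are hypothetical, 1 GB is taken as 10**9 bytes, and the sum of apparent file sizes may differ from what the servers' own quota accounting reports. Walking a very large tree can also be slow, so it is best run sparingly.

```python
import os

# Quotas quoted on this page, in GB. Assumption: 1 GB = 10**9 bytes; the
# accounting actually used by the servers may differ.
QUOTAS_GB = {"/home": 30, "/scratchalpha": 250, "/scratchbeta": 500}


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(full_path):
                    total += os.path.getsize(full_path)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def quota_fraction(used_bytes, quota_gb):
    """Return the used fraction of a quota given in GB (10**9 bytes)."""
    return used_bytes / (quota_gb * 10**9)
```

For example, `quota_fraction(directory_size_bytes(os.path.expanduser("~")), QUOTAS_GB["/home"])` gives an estimate of how full a home directory is, where a value close to 1.0 means it is time to clean up or move data.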