- In your PBS scripts, replace the line /usr/share/modules/init/sh with /etc/profile.d/modules.sh
- Jobs can be submitted to both MeSU-alpha and MeSU-beta (qsub -q alpha script.sh and qsub -q beta script.sh)
- WARNING : scratchalpha is associated with the UV2000 and scratchbeta with the ICE XA
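As an illustration of the notes above, here is a minimal sketch of a job script that loads the modules environment from the new location and is then submitted to one of the two queues. The file name job.sh, the job name and the walltime are placeholders, and sourcing the file with `.` is an assumption about how your existing script uses the old line; adapt it to your own script.

```shell
# Write a minimal job script (job.sh is a hypothetical file name).
cat > job.sh <<'EOF'
#!/bin/sh
#PBS -N example_job
#PBS -l walltime=00:10:00

# Initialise environment modules from the new location
# (this replaces the former /usr/share/modules/init/sh line).
. /etc/profile.d/modules.sh

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"
echo "Running on $(hostname)"
EOF

# Submit to one of the two queues, but only if qsub exists on this machine.
if command -v qsub >/dev/null 2>&1; then
    qsub -q beta job.sh    # or: qsub -q alpha job.sh
else
    echo "qsub not found; job.sh written but not submitted"
fi
```

Remember that the scratch space you use should match the queue: scratchalpha for alpha (UV2000), scratchbeta for beta (ICE XA).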
Incidents and messages
- 27/04/2018: We are still working on the machines, so we will be unable to reboot MeSU before Friday the 4th at the earliest. Until then you will not be able to connect to mesu.dsi.upmc.fr.
- 24/04/2018: Due to the incident that happened on the 16th of April, we will be unable to reboot MeSU before Friday the 27th at the earliest. Until then you will not be able to connect to mesu.dsi.upmc.fr.
- 17/04/2018: We will not be able to reboot MeSU before Friday the 20th, or possibly Monday the 23rd, as there will be another major power cut on Thursday the 19th, and some hardware seems to have been damaged by yesterday’s cut in the water supply.
Until then you will probably be able to connect to mesu.dsi.upmc.fr intermittently, but without home or scratch spaces on most frontal nodes.
Mesu5 is the frontal node on which the /home directories are present, and it is the only one from which you can access your data. The /scratch directories will, however, remain unavailable as long as the Infiniband is down.
- 17/04/2018: An unexpected problem occurred yesterday on the campus water circuits, which are linked to the servers’ cooling circuits, so most machines were automatically shut down to protect them from overheating. We are waiting for approval from the relevant service before restarting, as the circuits could be affected again this afternoon.
- 05/04/2018: Since yesterday there have been malfunctions on scratchbeta, which is why some jobs crashed or were put on “Hold” status. This is due to a malfunction of the rack leader NFS. We are investigating and might have to reboot the system. We will keep you informed.
- 21/03/2018 : As you already know, we are facing issues with the rack leader, which lead to node malfunctions. Our providers are investigating and will intervene on MeSU today. You will still be able to work, but you may encounter some malfunctions.
- 19/03/2018 : Our providers are investigating the issue with the rack leader, which leads to node malfunctions. You will still be able to work, but you might encounter some malfunctions. We are waiting for their diagnosis and will keep you informed as soon as possible.
- 13/03/2018 : We are still encountering an issue with the rack leader, which leads to node malfunctions. A case has been opened with our provider. You will still be able to work, but you might encounter some malfunctions.
- 07/03/2018 : On Monday 05/03/2018 we encountered an issue with the rack leader which led to node malfunctions. The issue is now fixed. The whole ICE side of the cluster has been rebooted.
- 15/02/2018 : Yesterday’s technical intervention went well: the switch has been changed and the authentication issue on the UV is solved. We released and manually relaunched all impacted jobs, so you can go back to normal use on both the ICE and the UV2000.
- 13/02/2018 : Following the bad weather of recent days, issues appeared on the electrical grid of Sorbonne Université. We have therefore been asked to shut down until at least tomorrow. We will take advantage of this break to change the Infiniband switch, and we will let you know as soon as everything is set.
- 08/02/2018 : The data migration from the old file server to the new one is done. You can resume your activities; please tell us if any permissions are wrong. Consider using the scratch spaces for your working directories, because the quota on home will prevent your jobs from succeeding if their output exceeds your 30 GB quota. Using those spaces will also improve performance.
- 06/02/2018 – 12h00 : A technician’s intervention is planned for tomorrow to correct some issues. As mentioned before, we have to migrate data (/home) from the old file server to a new one, so we are obliged to interrupt the services.
- 06/02/2018 : There will be some disturbances with PBS. You might encounter issues when submitting your jobs, such as an error message saying “ERROR 111”.
- 05/02/2018 : The issue with parallel jobs on the ICE XA with Intel MPI is fixed, so your multi-node jobs should work properly from now on. Please don’t hesitate to confirm the fix or to report any remaining problem.
- 05/02/2018 : We are still encountering some issues, and to solve them we have to migrate data from the old file server to a new one. As a result, logging in will probably take longer during the synchronisation process. On Thursday (01/02/2018) the process made the file server crash during the night; we monitored it all weekend to make sure it went well. An intervention is planned this week to fix some recurring issues, including the /home migration, so the cluster will be rebooted. We will let you know which day as soon as we have the information.
- 02/02/2018: An unforeseen operating incident happened this morning, but it is now resolved.
- 30/01/2018 : We are facing issues with Intel licences (mpi…). If you encounter a problem, please contact us at firstname.lastname@example.org.
- 26/01/2018 : The quota issue is solved. You can resume normal activities. There will still be other disruptions with PBS.
- 25/01/2018 – 15h47 : There is a problem with the quotas, and we are investigating with customer service. It will affect jobs that have already been launched.
- 25/01/2018: The provider’s intervention did not completely solve the PBS issue. Our consultants and PBS customer service will be working on it today, so you might experience some disruptions in job submission. However, we have applied security patches for better protection and set up the quotas planned in the HPCaVe charter (30 GB of space on your home). As a result, if a job starts in your home, you will not have enough space and it will fail immediately. Use the work space so that your jobs run properly. You can clean up and move your data to your scratch spaces to fit within the quota.
- 23/01/2018: We have just been informed that the provider has planned an intervention for tomorrow, 24/01/2018, to solve an issue encountered on MeSU. You will still be able to log in and browse your files, but you will not be able to submit any jobs. There may also be brief stops on the frontal nodes. If you are disconnected from one of the mesu nodes, use the nominal addresses mesu1.dsi.upmc.fr, mesu2.dsi.upmc.fr or mesu5.dsi.upmc.fr to reconnect. We will notify you at the end of this maintenance, and we apologize for the inconvenience.
- 20/11/2017 : Mesu is back online
- 15/11/2017: A technician’s intervention planned on 16/11/2017 for mesu2 will cause a service interruption on the PBS side. It will still be possible to log in and to use the front ends and the data, but not to launch PBS jobs.
- 10/26/2017 : The HPCaVe platform is currently in operation, despite the incidents concerning the CDU. The environment for use and compilation has been improved: some software and libraries have been integrated directly into the operating system to make them easier to use. They are now located in /usr/lib or /usr/lib64, as on a standard Unix system. By default gcc links to version 4.8; use gcc-6 for version 6. Version 7 is being installed via module. However, some libraries and compilers in the module avail list are not yet available or may not work properly. They are marked with the suffix -to_reinstall.
- 10/24/2017: We have noticed that the CDU issue is not solved: there is still a slight leak. We are waiting for the provider to find a solution.
- 10/18/2017: The filter has been changed.
- 10/16/2017: The CDU issue is still ongoing. We are waiting for the new filter from the supplier.
- 10/11/2017 : The CDU experienced slight leaks when the servers were restarted. A new filter unit is in transit from the supplier and should be installed soon.
- 10/10/2017 : Water is back; however, the outage has had an impact on the infrastructure. We are still facing issues, so another intervention will be planned in the next few days. Until then the machines will not be available.
- 10/06/2017 : The previous problems, which made the HPCaVe servers unavailable to users, are a consequence of the UPMC water circuit, which does not seem to supply sufficient water to the machines. Technicians from the water services should look into the problem on Monday.
- 10/05/2017 – 12h00 : The Infiniband switch incident was caused by the C.D.U. (cooling unit), and has happened multiple times since the beginning of the week. As the cooling unit is a critical part of the system, the servers will stay down until the problem can be solved.
- 10/04/2017 – 14h00 : During the interruption of service for users, progress is being made on the reconfiguration of software from repositories, which should be deployed before the end of the week.
- 10/04/2017 – 14h00 : Servers are currently inaccessible for users as an Infiniband switch has broken. SGI-HPE has been contacted, and a replacement part should be shipped soon.
- 10/03/2017 – 19h00 : A machine reboot will be scheduled for the 11th or 12th in order to redeploy corrected images on the servers.
- 10/03/2017 – 15h00 : Inventory of missing software on frontal nodes
Frontal nodes should have a uniform environment before the end of the week.
- 09/28/2017 – 18h00 : SSH connection to mesu.dsi.upmc.fr available
You can now connect with “ssh email@example.com”
- 09/28/2017 – 18h00 : Standard libraries re-deployed on compute nodes:
As of now, you should be able to run jobs compiled with gcc, gfortran or mpif90, for instance from the openmpi module. MeSU-beta should behave better for compiled jobs. Please let us know of any trouble on your side.
- 09/28/2017 – 17h00 : Common development tools re-installed on mesu2, 3, 4, 5:
cmake gcc48 gcc48-c++ gcc48-fortran gcc48-info gcc48-locale gcc-c++ gcc-fortran gcc-info gcc-locale gccmakedep glib2-devel glibc-devel gltt glu-devel gmp-devel gsl gsl-devel libstdc++-devel libstdc++48-devel
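
Several messages above (08/02/2018 and 25/01/2018) recommend running jobs from a scratch space rather than from /home, because of the 30 GB home quota. The lines below are a minimal sketch of how a job script could set up a per-job working directory. SCRATCH_ROOT is a hypothetical variable, not something defined on MeSU: set it to your own directory on scratchalpha or scratchbeta. The temporary-directory default is only there so the sketch can be tried anywhere.

```shell
# SCRATCH_ROOT is a placeholder (assumption): point it at your own scratch
# directory (scratchalpha for the UV2000, scratchbeta for the ICE XA).
SCRATCH_ROOT="${SCRATCH_ROOT:-${TMPDIR:-/tmp}}"

# One working directory per job. PBS sets PBS_JOBID inside a job; the
# fallback to the shell PID is only for trying this interactively.
WORKDIR="$SCRATCH_ROOT/job_${PBS_JOBID:-$$}"
mkdir -p "$WORKDIR"
cd "$WORKDIR" || exit 1

# Output written from here lands in scratch, not in the /home quota.
echo "Working in $(pwd)"
```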