This is a new version of my now Python script, for monitoring jobs on the HLRN III.
I am usually on the Hannover partition, but it should work perfectly in Berlin too.
In this version, I keep all the features of the cluster version except the “gossip” section for checking who is calculating what and how many resources they are using, since a supercomputer has too many users and we would need a terminal of a couple of square meters.
now version 0.5, specially modified for working on the HLRN III supercomputer in Hannover and Berlin.
I already shared my now script, intended for a small “in home” computing cluster. This time, I am sharing other variations of that script, designed for the queuing systems of different supercomputers: Mare Nostrum at BSC, Juropa at JSC, and HLRN.
They are the predecessors of the newer now script and have only the very basic features, but if you need a script that works right out of the box on any of the supercomputers mentioned above, this is the quick solution.
Job monitoring while running jobs on a cluster/supercomputer can be done with several tools (such as qstat, showstart, …). Unfortunately, most of these standard tools are made by computer scientists for computer scientists. But now there is an alternative: now.
At some point in my life, I started to get sick of qstats, greps, awks, … because it takes a couple of seconds every time one has to type them. If we multiply these seconds by the number of times I need them and by the number of computational members in my group, we get enough time to prepare a couple of new inputs, or to write some posts/comments on my blog.
So, I wrote my own job monitoring/visualization script, based on some ideas of my friend Iñaki during my time in the theoretical chemistry group in Donostia. Initially, that script did nothing but execute these programs and display ONLY the information I need, and ALL the information I need. This information includes:
Sometimes, especially in small computational groups, people use their workstations for running calculations, and in a “trusted” network, where all colleagues can log in to each other’s computers, it is often the case that everybody calculates everywhere. But when you want to submit a calculation and your own PC is already busy, how do you find a computer in the network that is not running anything at the moment?
You could ssh to each host and run top or something like that, but that is really slow when you have a lot of computers and you have to repeat the task very often.
The solution: “checkpc”
checkpc is a Python script that can scan a whole network of computers and return the load of all PCs within it in less than 3 seconds.
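To illustrate why such a scan can be fast, here is a minimal sketch of the general idea: ask all hosts for their load at the same time instead of one after the other. This is not the actual checkpc code. The function names (`ssh_load`, `scan_hosts`, `idle_hosts`), the 0.5 load threshold, and the use of passwordless SSH plus `/proc/loadavg` on Linux hosts are all assumptions made for this example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_load(host, timeout=3):
    """Return the 1-minute load average of `host`, or None if unreachable.

    Assumption: passwordless SSH works and the host is a Linux machine
    that exposes /proc/loadavg.
    """
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout}",
             host, "cat", "/proc/loadavg"],
            capture_output=True, text=True, timeout=timeout + 1,
        )
        # First field of /proc/loadavg is the 1-minute load average.
        return float(result.stdout.split()[0])
    except (subprocess.SubprocessError, OSError, ValueError, IndexError):
        return None


def scan_hosts(hosts, get_load=ssh_load, max_workers=32):
    """Query all hosts concurrently and return {host: load or None}."""
    hosts = list(hosts)
    if not hosts:
        return {}
    workers = min(max_workers, len(hosts))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the results in the same order as `hosts`.
        return dict(zip(hosts, pool.map(get_load, hosts)))


def idle_hosts(loads, threshold=0.5):
    """Return reachable hosts with load below `threshold`, least busy first."""
    free = [(load, host) for host, load in loads.items()
            if load is not None and load < threshold]
    return [host for load, host in sorted(free)]
```

Because the hosts are queried in parallel threads, the total time is roughly that of the slowest host (bounded by the timeout) rather than the sum over all hosts. Passing a different `get_load` function makes it easy to try the logic without touching the network.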
A typical problem in the life of a computational scientist: your calculations take longer than the wall-time limit on the supercomputer. If the job can be restarted, we can work around that by resending it manually, but that gets quite tedious.
My solution: use the “resend” BASH script 😉 (which you can download here)
As with most of my scripts, there is a “-h” option: if you don’t remember the syntax, it will remind you of the few possibilities.
You need the job script you submit to the queue, and the input(s). The script keeps checking the qstat output for the current user, looking for the path from which resend was executed. Whenever there is no job running or queued with that path, it submits another one. If there is a job with the same path, it waits for one minute and tries again.
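The loop described above can be sketched in a few lines. The real resend is a BASH script; this Python version only illustrates the logic and is not its actual code. The callables `list_job_dirs` and `submit` are hypothetical helpers (for example, thin wrappers around qstat and the queue's submit command) that you would have to write for your own queuing system.

```python
import os
import time


def resend(job_script, times, list_job_dirs, submit,
           workdir=None, poll_seconds=60, sleep=time.sleep):
    """Submit `job_script` `times` times, one job after the other.

    list_job_dirs() -> iterable with the working directories of the current
                       user's queued or running jobs (hypothetical helper).
    submit(path)    -> submits the batch script (hypothetical helper).
    """
    workdir = os.path.abspath(workdir or os.getcwd())
    sent = 0
    while sent < times:
        active = {os.path.abspath(d) for d in list_job_dirs()}
        if workdir in active:
            # A job from this directory is still queued or running:
            # wait a minute and check again.
            sleep(poll_seconds)
            continue
        # Nothing from this directory in the queue: send the next job.
        submit(job_script)
        sent += 1
    return sent
```

Injecting `list_job_dirs`, `submit`, and `sleep` as arguments keeps the sketch independent of any particular queue and lets you dry-run the loop with fake functions before trusting it with real jobs.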
The resend script is as easy to use as:
user> nohup resend -n 5 -f /path/to/script.cmd &
where the “-n” option specifies the number of times the job will be resent (default: 3) and “-f” the path to the batch job script (default: ./job.cmd). It is useful to include the latter option so that you can identify the process (e.g. with “ps”) if necessary.
For killing the script, you can use:
This will break the internal loop and exit the script. Run it from the same directory, or just look up the PID and kill it.
Another trick: if you have already started the program but then realize that you would like to run the job a few more times, you can run resend again, and the job will be sent as many times as the sum of what both resend instances request.