LBM error when running MPI on East with Nimrod

The experiment stops after ~90 jobs have completed, with 3 optimal points found.

From the error, it seems to be a communication error on the cluster side.

Error log from Kepler

ptolemy.kernel.util.IllegalActionException: Nimrod job failed. Experiment name: mpi Jobname 1011
Error: Failed plan file line: - 5 - exec /bin/sh -c /bin/mkdir /home/hoang/mdo-flow-compliance/step1/9.4269441175058_29.0_139.8662186080265; /bin/pwd > /home/hoang/mdo-flow-compliance/step1/9.4269441175058_29.0_139.8662186080265/nimroddir ;cd /home/hoang/md…
Traceback (most recent call last):
File "/home/ngdev/src/nimrod-trunk/level2/agent/Job.py", line 1977, in Run
File "/home/ngdev/src/nimrod-trunk/level2/agent/Job.py", line 1394, in ExecCmd
RuntimeError: 'mkdir' terminated with code 1
in .LBP_Topology_Optimization.LBM_C12_TCA.LBM_C12
with tag colour {sequenceID=0, metadata={Optimizer=object(org.monash.nimrod.optim.SimplexAlgorithm@6765f707), pointIndex=3, creator=object(org.monash.nimrod.optim.SimplexOptimActor {.LBP_Topology_Optimization.Simplex Optim Actor})}, parameters={r=9.4269441175058, re=139.8662186080265, s=29.0}, hashcode=-2058481515}
in .LBP_Topology_Optimization.LBM_C12_TCA.LBM_C12
at org.monash.nimrod.NimrodActor.NimrodGCommonFunctions.startAndWait(NimrodGCommonFunctions.java:208)
at org.monash.nimrod.GridJob.fire(GridJob.java:354)
at org.monash.nimrod.NimrodDirector.NimrodProcessThread.run(NimrodProcessThread.java:448)
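The traceback shows the job dying at the plan file's very first shell step: the /bin/mkdir for the evaluation directory returned exit code 1. A sketch of a more defensive version of that step (same directory path as in the log; the error message text is my own, and this is plain /bin/sh, not Nimrod plan syntax):

# Sketch: tolerate an existing directory and report a real I/O failure clearly.
DIR=/home/hoang/mdo-flow-compliance/step1/9.4269441175058_29.0_139.8662186080265
/bin/mkdir -p "$DIR" || { echo "mkdir failed for $DIR - check the filesystem" >&2; exit 1; }
/bin/pwd > "$DIR/nimroddir"
cd "$DIR" || exit 1

With mkdir -p, a re-submitted job that finds the directory already present no longer aborts; a genuine disk or NFS problem still stops the job, but with a message that points at the filesystem rather than a bare exit code.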

Error log from Nimrod

[EAST-05:07832] opal_os_dirpath_create: Error: Unable to create the sub-directory (/scratch/1088260.1.all.q) of (/scratch/1088260.1.all.q/openmpi-sessions-hoang@EAST-05_0/11376/0/0), mkdir failed [1]
[EAST-05:07832] [[11376,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 106
[EAST-05:07832] [[11376,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 399
[EAST-05:07832] [[11376,0],0] ORTE_ERROR_LOG: Error in file ess_hnp_module.c at line 320
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here’s some additional information (which may only be relevant to an
Open MPI developer):

orte_session_dir failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[EAST-05:07832] [[11376,0],0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 128
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here’s some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_set_name failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[EAST-05:07832] [[11376,0],0] ORTE_ERROR_LOG: Error in file orterun.c at line 694
rm: cannot remove `Vortex00400000*': No such file or directory
rm: cannot remove `3D_Trns00400000*': No such file or directory
cp: cannot stat `lbm-data.tec': No such file or directory
cp: cannot stat `lbm-output': No such file or directory
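So the root cause in this log is not MPI communication itself: Open MPI cannot create its session directory under /scratch/1088260.1.all.q (the per-job scratch area), orte_init aborts before any ranks start, and the later rm/cp failures are just the job script not finding output that was never produced. Until /scratch is fixed, one possible workaround is to point the Open MPI session directory at a writable filesystem via the standard orte_tmpdir_base MCA parameter (a sketch; lbm_solver and the process count are placeholders, and behaviour may differ across Open MPI versions):

# Redirect Open MPI's session files away from the broken /scratch.
export TMPDIR=/tmp                                      # session dir is derived from TMPDIR by default
mpirun --mca orte_tmpdir_base /tmp -np 8 ./lbm_solver   # placeholder binary and -np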

Logging in to node EAST-05, it looks like there is an I/O error: cannot even ls /scratch.
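A few standard checks from a shell on the node narrow down whether /scratch is dead, full, or mounted read-only (generic commands, nothing East-specific):

ls /scratch                                   # hangs or errors when the filesystem is sick
df -h /scratch                                # a full filesystem also makes mkdir exit with 1
touch /scratch/iotest && rm /scratch/iotest   # basic write test
dmesg | tail -n 20                            # kernel-level disk or mount errors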
