RELION Troubleshooting
Motion Correction
ERROR: TIFF support was not enabled during compilation
Symptom
2019/5/14, v3.0.5, built from source on Ubuntu 16.04.6 LTS
When running motion correction in the RELION tutorial with the RELION implementation of MotionCor2, the error in the heading appeared.
Fix
- You need to install the libtiff development package.
- An issue had already been filed about this: https://github.com/3dem/relion/issues/383
- It is also stated in the README of relion.git: https://github.com/3dem/relion
- It was not listed in the sudo apt install ... line, so I overlooked it. Read the manual properly.
<pre>
# Move to the RELION root directory (the one that contains cmake, src, etc.)
$ rm -r build/ install/
$ sudo apt install libtiff5-dev
$ mkdir build/ install/
$ cd build && cmake -DCMAKE_INSTALL_PREFIX=../install ..
$ make -j10
$ make install
</pre>
(Normally the above is enough to build RELION linked against libtiff. However, in environments where, for example, EMAN2 was built with miniconda or anaconda, you may run into the painful situation where CMake's search path gets hijacked by miniconda.)
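If CMake does pick up a conda-provided libtiff, one possible workaround (a sketch, not verified here; the paths are Ubuntu 16.04 examples and may differ on your machine) is to pass the FindTIFF hint variables explicitly:
<pre>
# Hypothetical example: force CMake to use the system libtiff instead of the
# copy under miniconda/anaconda. Adjust the library path to your distribution.
$ cd build && cmake -DCMAKE_INSTALL_PREFIX=../install \
      -DTIFF_INCLUDE_DIR=/usr/include \
      -DTIFF_LIBRARY=/usr/lib/x86_64-linux-gnu/libtiff.so ..
</pre>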
2D classification
[[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
Symptom
2019/5/21, v3.0.5, built from source on Ubuntu 16.04.6 LTS
Running the first 2D classification in the RELION 3 tutorial ended abnormally with an MPI-related error.
The output to stdout was as follows. DL-Box is the host machine name.
<pre>
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host: DL-Box
Framework: ess
Component: pmi
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host: DL-Box
Framework: ess
Component: pmi
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host: DL-Box
Framework: ess
Component: pmi
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_base_open failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_base_open failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_base_open failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host: DL-Box
Framework: ess
Component: pmi
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_base_open failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host: DL-Box
Framework: ess
Component: pmi
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_base_open failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[9193,1],3]
Exit code: 1
--------------------------------------------------------------------------
</pre>
The output to stderr was as follows. DL-Box is the host machine name.
<pre>
[DL-Box:00925] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
[DL-Box:00926] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
[DL-Box:00927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
[DL-Box:00928] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[DL-Box:926] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[DL-Box:925] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[DL-Box:928] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[DL-Box:927] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[DL-Box:00929] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[DL-Box:929] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
</pre>
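The errors above complain that the "pmi" component of the "ess" framework cannot be found. A quick way to see which ess components the Open MPI installation on PATH actually provides (not part of the original log; just a diagnostic idea) is:
<pre>
# ompi_info lists all MCA components of the Open MPI found first on PATH;
# grep for the "ess" framework mentioned in the error messages.
$ ompi_info | grep -i "MCA ess"
</pre>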
The job parameters that might be relevant to MPI (including I/O settings) were:
<pre>
Combine iterations through disc? == No
Use parallel disc I/O? == Yes
Pre-read all particles into RAM? == Yes
Which GPUs to use: == 0:1:2:3
Minimum dedicated cores per node: == 1
Number of MPI procs: == 5
Number of pooled particles: == 30
Number of threads: == 3
Copy particles to scratch directory: ==
Use GPU acceleration? == Yes
</pre>
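For reference, with these settings the RELION GUI ends up launching something along the following lines (a rough sketch, not taken from the actual job; the input STAR file and output directory are placeholders):
<pre>
# Hypothetical reconstruction of the launched command: 5 MPI ranks, 3 threads
# per rank, 4 GPUs shared among the worker ranks.
$ mpirun -n 5 `which relion_refine_mpi` \
      --i Select/job014/particles.star --o Class2D/job015/run \
      --dont_combine_weights_via_disc --preread_images --pool 30 \
      --j 3 --gpu "0:1:2:3"
</pre>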
Fix
Try setting Use parallel disc I/O to No
- No change.
Check MPI
<pre>
$ mpirun --version
mpirun (Open MPI) 2.0.2
</pre>
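Another basic check worth doing (not in the original notes) is whether the mpirun on PATH belongs to the same Open MPI installation that RELION was linked against; a mismatch between build-time and run-time Open MPI is a common cause of this kind of ORTE init failure:
<pre>
# Compare the mpirun being used with the MPI library the RELION binary links to.
$ which mpirun
$ ldd $(which relion_refine_mpi) | grep -i mpi
</pre>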
<pre>
$ mpirun --np 5 --cpus-per-proc 3 ls
--------------------------------------------------------------------------
The following command line options and corresponding MCA parameter have
been deprecated and replaced as follows:

  Command line options:
    Deprecated: --cpus-per-proc, -cpus-per-proc, --cpus-per-rank, -cpus-per-rank
    Replacement: --map-by <obj>:PE=N, default <obj>=NUMA

  Equivalent MCA parameter:
    Deprecated: rmaps_base_cpus_per_proc
    Replacement: rmaps_base_mapping_policy=<obj>:PE=N, default <obj>=NUMA

The deprecated forms *will* disappear in a future version of Open MPI.
Please update to the new syntax.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node: DL-Box

Open MPI uses the "hwloc" library to perform process and memory
binding. This error message means that hwloc has indicated that
processor binding support is not available on this machine.

On OS X, processor and memory binding is not available at all (i.e.,
the OS does not expose this functionality).

On Linux, lack of the functionality can mean that you are on a
platform where processor and memory affinity is not supported in Linux
itself, or that hwloc was built without NUMA and/or processor affinity
support. When building hwloc (which, depending on your Open MPI
installation, may be embedded in Open MPI itself), it is important to
have the libnuma header and library files available. Different linux
distributions package these files under different names; look for
packages with the word "numa" in them. You may also need a developer
version of the package (e.g., with "dev" or "devel" in the name) to
obtain the relevant header files.

If you are getting this message on a non-OS X, non-Linux platform,
then hwloc does not support processor / memory affinity on this
platform. If the OS/platform does actually support processor / memory
affinity, then you should contact the hwloc maintainers:
https://github.com/open-mpi/hwloc.

This is a warning only; your job will continue, though performance may
be degraded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

  Bind to: CORE:IF-SUPPORTED
  Node: DL-Box
  #processes: 2
  #cpus: 1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
</pre>
...?
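Incidentally, the deprecation notice in that output points to the newer --map-by syntax; the equivalent of --cpus-per-proc 3 should be something like the following (a guess based on the message, not tried at the time):
<pre>
# PE=3 asks for 3 processing elements (cores) per rank; NUMA is the default
# mapping object named in the deprecation notice.
$ mpirun --np 5 --map-by NUMA:PE=3 ls
</pre>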
Running without any options works fine in parallel:
<pre>
$ mpirun echo 'hello'
hello
hello
hello
hello
hello
hello
</pre>
It also runs with 10 MPI processes (and it ran with 20 as well):
<pre>
$ mpirun --np 10 echo 'hello'
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello
</pre>
The following runs:
<pre>
$ mpirun --np 5 --cpus-per-proc 1 echo 'hello'
hello
hello
hello
hello
hello
</pre>
The following dies with an error. Apparently it fails whenever cpus-per-proc is greater than 1.
<pre>
$ mpirun --np 5 --cpus-per-proc 2 echo 'hello'
</pre>
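The binding error shown earlier suggested adding "overload-allowed" to the binding directive; a variant like the following might get past that protection (hypothetical, not tried here, and it risks oversubscribing cores):
<pre>
# overload-allowed lets Open MPI bind more processes than there are free cores.
$ mpirun --np 5 --cpus-per-proc 2 --bind-to core:overload-allowed echo 'hello'
</pre>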
I don't fully understand this, but it seems to fail whenever more than one... thread...? (or whatever it is) is specified per MPI process, so I will rerun the job with Number of MPI procs 5 and Number of threads 1 on the RELION side.
(Result) No improvement.
Argh, I don't get it.