RELION Troubleshooting


Motion Correction

ERROR: TIFF support was not enabled during compilation

Symptoms

2019/5/14, v3.0.5, built from source on Ubuntu 16.04.6 LTS

While doing motion correction in the RELION tutorial, using the RELION implementation of MotionCor2 produced the error in the title.

Fix

# cd to the RELION root directory (the one with cmake/, src/, etc. in it)

$ rm -r build/ install/

$ sudo apt install libtiff5-dev

$ mkdir build/ install/

$ cd build && cmake -DCMAKE_INSTALL_PREFIX=../install ..

$ make -j10

$ make install
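
To check that the rebuild really picked up libtiff, looking at the linked libraries is a quick sanity test. This is just a sketch; the path assumes the install layout from the commands above, and that you are still in build/:

# relion_run_motioncorr is the binary that raises the TIFF error; libtiff should now show up here
$ ldd ../install/bin/relion_run_motioncorr | grep -i tiff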

(Normally the steps above should be enough to build RELION linked against libtiff. However, in environments where EMAN2 was built with miniconda or anaconda, you may run into the painful situation where miniconda hijacks CMake's search path.)
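
If CMake does latch onto miniconda's libraries, one workaround is to point it at the system libtiff explicitly. TIFF_INCLUDE_DIR and TIFF_LIBRARY are the standard variables of CMake's FindTIFF module; the paths below are only a guess at where libtiff5-dev puts things on Ubuntu 16.04 (check with dpkg -L libtiff5-dev):

# Hypothetical explicit override; adjust the paths to your machine
$ cd build && cmake -DCMAKE_INSTALL_PREFIX=../install \
    -DTIFF_INCLUDE_DIR=/usr/include \
    -DTIFF_LIBRARY=/usr/lib/x86_64-linux-gnu/libtiff.so ..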


2D classification

[[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129

Symptoms

2019/5/21, v3.0.5, built from source on Ubuntu 16.04.6 LTS

Running the first 2D classification in the RELION 3 tutorial aborted with an MPI-related error.

The full error text is below. DL-Box is the host machine name.

[DL-Box:00925] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
[DL-Box:00926] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
[DL-Box:00927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
[DL-Box:00928] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[DL-Box:926] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[DL-Box:925] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[DL-Box:928] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[DL-Box:927] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[DL-Box:00929] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 129
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[DL-Box:929] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!


The job parameters that might be related to MPI (including I/O and such) were as follows:

Combine iterations through disc? == No
Use parallel disc I/O? == Yes
Pre-read all particles into RAM? == Yes
Which GPUs to use: == 0:1:2:3
Minimum dedicated cores per node: == 1
Number of MPI procs: == 5
Number of pooled particles: == 30
Number of threads: == 3
Copy particles to scratch directory: ==
Use GPU acceleration? == Yes
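
For reference, with these settings RELION should end up launching something roughly like the command below. This is only a sketch: the STAR file and output names are placeholders and most options are omitted, but --i, --o, --j, and --gpu are the usual relion_refine options. The point is that 5 MPI ranks × 3 threads each ask for 15 cores in total:

$ mpirun -n 5 `which relion_refine_mpi` --i particles.star --o Class2D/jobXXX/run \
    --j 3 --gpu "0:1:2:3" ...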

Fix

Try setting Use parallel disc I/O to No

  • No change.

Checking MPI

$ mpirun --version
mpirun (Open MPI) 2.0.2
$ mpirun --np 5 --cpus-per-proc 3 ls
--------------------------------------------------------------------------
The following command line options and corresponding MCA parameter have
been deprecated and replaced as follows:

  Command line options:
    Deprecated:  --cpus-per-proc, -cpus-per-proc, --cpus-per-rank, -cpus-per-rank
    Replacement: --map-by <obj>:PE=N, default <obj>=NUMA

  Equivalent MCA parameter:
    Deprecated:  rmaps_base_cpus_per_proc
    Replacement: rmaps_base_mapping_policy=<obj>:PE=N, default <obj>=NUMA

The deprecated forms *will* disappear in a future version of Open MPI.
Please update to the new syntax.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node:  DL-Box

Open MPI uses the "hwloc" library to perform process and memory
binding. This error message means that hwloc has indicated that
processor binding support is not available on this machine.

On OS X, processor and memory binding is not available at all (i.e.,
the OS does not expose this functionality).

On Linux, lack of the functionality can mean that you are on a
platform where processor and memory affinity is not supported in Linux
itself, or that hwloc was built without NUMA and/or processor affinity
support. When building hwloc (which, depending on your Open MPI
installation, may be embedded in Open MPI itself), it is important to
have the libnuma header and library files available. Different linux
distributions package these files under different names; look for
packages with the word "numa" in them. You may also need a developer
version of the package (e.g., with "dev" or "devel" in the name) to
obtain the relevant header files.

If you are getting this message on a non-OS X, non-Linux platform,
then hwloc does not support processor / memory affinity on this
platform. If the OS/platform does actually support processor / memory
affinity, then you should contact the hwloc maintainers:
https://github.com/open-mpi/hwloc.

This is a warning only; your job will continue, though performance may
be degraded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE:IF-SUPPORTED
   Node:        DL-Box
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

...?
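
The first block is just a deprecation notice plus a performance warning; the second block is the actual failure, where the binding request tried to put 2 processes on 1 cpu. If you wanted to follow the warning's advice about NUMA support anyway, it would mean something like the commands below (package names assumed for Ubuntu 16.04, and Open MPI would likely need rebuilding afterwards to pick libnuma up):

# Only relevant to the performance warning, not to the error itself
$ sudo apt install libnuma-dev hwloc
# lstopo (from hwloc) shows what topology/binding support the machine reports
$ lstopo --no-io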


Running it without any options, it runs in parallel just fine:

$ mpirun echo 'hello'
hello
hello
hello
hello
hello
hello


It also runs with 10 MPI processes (it ran with 20 as well):

$ mpirun --np 10 echo 'hello'
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello


The following runs:

$ mpirun --np 5 --cpus-per-proc 1 echo 'hello'
hello
hello
hello
hello
hello

The following dies with an error. It looks like cpus-per-proc greater than 1 is the problem.

$ mpirun --np 5 --cpus-per-proc 2 echo 'hello'
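
Going by the deprecation notice and the binding error above, the non-deprecated form of the same request, plus the overload-allowed override the error message mentions, would look roughly like this. Untested sketch; whether it actually succeeds on this machine is another question:

$ mpirun --np 5 --map-by numa:PE=3 --bind-to core:overload-allowed echo 'hello'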


I don't really get it, but it started to feel like asking for two or more... threads? or whatever it is, per MPI process makes things fail, so on the RELION side I changed the job to Number of MPI procs 5 and Number of threads 1 and ran the calculation again.

(Result) No improvement.
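
One more thing that might be worth checking (a hedged diagnostic, not a confirmed cause): ORTE_ERROR_LOG failures inside orte_init are often a sign that the mpirun on PATH is not the same Open MPI that relion_refine_mpi was linked against, which can easily happen when miniconda/anaconda brings its own MPI. A quick way to compare the two:

# Which mpirun comes first on PATH?
$ which mpirun && mpirun --version
# Which MPI libraries is the RELION binary actually linked against?
$ ldd `which relion_refine_mpi` | grep -i mpi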


Argh, I don't get it.