TE配列の分類を行うDeepTEの使い方

最近は転移因子(TE、トランスポゾン)の話が連日続いているが、今回は前回の記事で紹介した機械学習によるTE分類ツールの1つ、DeepTE (Yan et al., 2020) を実際に使ってみる。

Yan, H., Bombarely, A., & Li, S. (2020). DeepTE: a computational method for de novo classification of transposons with convolutional neural network. Bioinformatics, 36(15), 4269-4275.

DeepTEとは

TEはいくつかのスーパーファミリーによる分類がなされてきているが、配列のパターンに基づいたものである。そこでこのDeepTEの開発者らはそれらのパターンを分類する分類器の研究をおこなった。DeepTEはその名の通りディープラーニング(深層学習)を適用したもので、事前に学習したモデルから、複雑なパターンを分類することが可能である。CNN(Convolutional Neural Network、畳み込みニューラルネットワーク)によってそれぞれのTEスーパーファミリーの分類を行っている。詳しいモデルの中身については本文を参照して欲しい。

インストール

conda、CUDA、cuDNN等TensorFlowが実行できる環境がすでにインストールされているUbuntu 20.04を前提として行う(32スレッド、64GB RAM、GeForce GT710)。機械学習関連のツールをインストールしたことがある人なら揃っているかもしれないが、そうでない人にとっては一苦労かもしれない。

私もディープラーニングを使うとは思っていなかったのでしょぼいGPUしかないが、計算速度があまりにも遅いようならば購入を考える。

参考: Ubuntu18.10にopenposeをインストール(環境構築)

基本的にはGitHubにすべて書かれている。

$ git clone https://github.com/LiLabAtVT/DeepTE.git
$ cd DeepTE

$ conda create -n deepTE python=3.6
$ conda activate deepTE
$ conda install tensorflow-gpu=1.14.0
$ conda install biopython
$ conda install keras=2.2.4
$ conda install numpy=1.16.0

ただ、tensorflow-gpuが正常に動作するかどうかは確かめたほうがいい。

$ python
> import tensorflow as tf
> tf.test.is_gpu_available()

これでFalseではなくTrueが表示されれば問題なくGPUが認識されている。もしFalseならCUDA関連のインストールを見直すべきだろう。Tensolflowのバージョンが1.14.0なのでCUDAは10.0が必要だとか、cuDNNはv7.4だとかまあ色々ある。

私のラボのワークステーションは当初機械学習とかディープラーニングとかするつもりもなく、グラボはGT710しか積んでいなかった。RAMは1GBしかなく、動作するかどうかわからなかったので上記の方法で環境構築をしたのち、動作はするにはしたが、完了しなかった。

エラーはRAM不足によるもの。当たり前か。

現在グラボも高騰しているし、今年の予算はほぼ使い切ってしまったので、Google Colab. を使うことに。

Google Colab. における初期設定

リソースは予めGPUを使用するように設定しておく。

pipを利用して、モジュールのインストールを行う。

!pip install biopython
!pip install keras
!pip install numpy
!pip install itertools
!pip install tensorflow

Google Driveに接続し、目的のディレクトリに移動する。

from google.colab import drive
drive.mount('/content/drive')
import os
os.chdir('/content/drive/My Drive/Colab Notebooks/DeepTE')

※必要に応じて、DeepTEをCloneする。

!git clone https://github.com/LiLabAtVT/DeepTE.git

テストデータの実行

テストデータで実行してみる。

!python DeepTE.py -d sample -o sample/ -i example_data/input_test.fasta -sp M -m M

ここでは後生動物についての解析を行った。モデルのダウンロードに少々時間がかかるが、最初インストールしたらその後の解析はそのダウンロードしたディレクトリを指定してやれば良い(後述)。

結果については -o で指定したディレクトリ、”sample”以下に保存されている。

ClassIII_Helitron_13	ClassIII_Helitron
ClassIII_Helitron_14	ClassIII_Helitron
ClassIII_Helitron_15	ClassIII_Helitron
ClassIII_Helitron_16	ClassIII_Helitron
ClassIII_Helitron_17	ClassIII_Helitron
ClassIII_Helitron_18	ClassIII_Helitron
ClassIII_Helitron_19	ClassIII_Helitron
ClassIII_Helitron_20	ClassIII_Helitron
ClassIII_Helitron_21	ClassIII_Helitron
ClassIII_Helitron_22	ClassIII_Helitron
ClassIII_Helitron_23	ClassIII_Helitron
ClassIII_Helitron_24	ClassIII_Helitron
ClassIII_Helitron_26	ClassIII_Helitron
ClassIII_Helitron_28	ClassIII_Helitron
ClassIII_Helitron_29	ClassIII_Helitron
ClassIII_Helitron_30	ClassIII_Helitron
...

左が入力データのFASTA名、右が分類結果だ。テストデータについて概ね分類ができていることが伺えた。

実データの実行

今回は前回の記事、REDでゲノム中から抽出してきたデータについて軽く整形を行い、分割をしてからInputを食わせた。

!python DeepTE.py -d polypterus -o polypterus -i polypterus_RED_01.fasta -sp M -m_dir ../download_M_model_dir/Metazoans_model/

こんな具合に、一度ダウンロードしたモデルについて、そのディレクトリを-m_dir オプションで指定してやることで、毎回ダウンロードしてやらなくても済むようになる。

start time is 1631087254.149431
There was an error opening the file!
--- 0.00179290771484375 seconds ---
./RED_out/1-100/short_Redout.part_002.fa
short_Redout.part_002.fa
start time is 1631087257.1582732
Step1: transfer fasta data to CNN input data
Step2: classify TEs
Step2: 2) domain information is not exist
2021-09-08 07:47:37.445267: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-08 07:47:37.464009: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-08 07:47:37.464970: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-08 07:47:37.466368: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-08 07:47:37.467148: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-08 07:47:37.467922: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-08 07:47:38.000121: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-08 07:47:38.001055: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-08 07:47:38.001818: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-08 07:47:38.002634: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-09-08 07:47:38.002710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10819 MB memory:  -> device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7
2021-09-08 07:47:38.619417: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 235699200 exceeds 10% of free system memory.
2021-09-08 07:47:38.903277: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 235699200 exceeds 10% of free system memory.
2021-09-08 07:48:30.456350: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 327680000 exceeds 10% of free system memory.
2021-09-08 07:48:30.522503: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-09-08 07:48:32.411012: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8004
2021-09-08 07:49:14.205145: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 235699200 exceeds 10% of free system memory.
2021-09-08 07:49:14.275653: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 235699200 exceeds 10% of free system memory.
Step3: generate final output
--- 362.22264194488525 seconds ---

こんな具合の出力が出てきて、最後まで実行されればOK。アウトプットのディレクトリに解析データが出てくるのでそれを利用しよう。

実データでGPUメモリ上限によるエラーが生じる場合

実はポリプテルスのTEデータをそのままぶちこんだところ、エラーを吐いて強制終了させられた。RAMリソースモニターを見ていたところ、明らかにメモリを食いつぶしている。

こればかりはもうどうしようもないので、細かくインプット配列を分割してやることで事なきを得た。

TE配列の分類を行うDeepTEの使い方

DeepTEとは

インストール

Google Colab. における初期設定

テストデータの実行

実データの実行

実データでGPUメモリ上限によるエラーが生じる場合

コメントを残すコメントをキャンセル

Recent Posts

【Qiskit】マルチオミクス解析を量子機械学習でやる①[環境構築・基礎]

Cell Hashing による複数サンプルデータ読み込み

肺の起源と進化について

NCBIからリファレンスゲノムを取得する方法〜クォリティなどのチェックも〜

VGP(脊椎動物ゲノムプロジェクト)のポリシー改定について

カテゴリ

TE配列の分類を行うDeepTEの使い方

DeepTEとは

インストール

Google Colab. における初期設定

テストデータの実行

実データの実行

実データでGPUメモリ上限によるエラーが生じる場合

関連記事:

コメントを残す コメントをキャンセル

人気の記事

Recent Posts

【Qiskit】マルチオミクス解析を量子機械学習でやる①[環境構築・基礎]

Cell Hashing による複数サンプルデータ読み込み

肺の起源と進化について

NCBIからリファレンスゲノムを取得する方法〜クォリティなどのチェックも〜

VGP(脊椎動物ゲノムプロジェクト)のポリシー改定について

カテゴリ

コメントを残すコメントをキャンセル