TensorFlow Hub Models for Pose Detection

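After doing a fair amount of machine learning, I started wanting to see results visually and not just as numbers.
So I'm thinking of trying pose detection.
If it goes well, I'd like to get as far as driving a 3D model with it.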

Using TensorFlow Hub models
Sorting out the TensorFlow Hub models

Using TensorFlow Hub models

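I considered starting from training, but training on images and video is a lot of work.
It would be supervised learning, and preparing ground-truth labels for the images and then training on them looks like it would take a lot of machine power and time.
Since I don't plan to do any tuning this time, I'll use the models Google provides on TensorFlow Hub.
TensorFlow Hub makes pretrained models published by Google available for anyone to use.
Being able to jump straight to inference makes it easy to try things out.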
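
Going straight to inference could look roughly like the sketch below. It is not something I have settled on: the model handles and their version suffixes are my assumptions and should be checked on TensorFlow Hub, and `load_movenet` and `run_movenet` are hypothetical helper names. The TensorFlow imports sit inside the functions so the file can be read and imported before anything is installed.

```python
# Sketch only: the handles and version suffixes are assumptions to verify on TensorFlow Hub.
# Each entry maps a variant to (model handle, square input size in pixels).
MOVENET_MODELS = {
    "singlepose/lightning": ("https://tfhub.dev/google/movenet/singlepose/lightning/4", 192),
    "singlepose/thunder": ("https://tfhub.dev/google/movenet/singlepose/thunder/4", 256),
}


def load_movenet(variant="singlepose/lightning"):
    """Return (inference function, input size) for a pretrained MoveNet variant."""
    import tensorflow_hub as hub  # lazy import; needs tensorflow and tensorflow-hub installed

    url, input_size = MOVENET_MODELS[variant]
    model = hub.load(url)  # downloads the model on the first call
    return model.signatures["serving_default"], input_size


def run_movenet(image_path, variant="singlepose/lightning"):
    """Run pose estimation on one image file and return an array of shape [1, 1, 17, 3]."""
    import tensorflow as tf

    movenet, input_size = load_movenet(variant)
    image = tf.io.decode_image(
        tf.io.read_file(image_path), channels=3, expand_animations=False
    )
    # Pad and resize to the square input the model expects, then cast to int32.
    batch = tf.image.resize_with_pad(tf.expand_dims(image, axis=0), input_size, input_size)
    outputs = movenet(tf.cast(batch, dtype=tf.int32))
    return outputs["output_0"].numpy()
```

Calling `run_movenet("person.jpg")` (the file name is just a placeholder) would download the model and return the raw keypoint tensor, with no training step involved.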

As for prediction accuracy, I'm not tuning anything, so I'll simply trust Google.

Sorting out the TensorFlow Hub models

There are several pose detection models, so I'd like to sort out which one to use.

First, the models for 2D detection.
They split by whether they detect a single person or several, but what I want this time is 3D rather than 2D, so I'll pass on these.

Model	FPS	Inputs	Outputs
movenet/singlepose/lightning	>50	A frame of video or an image represented as an int32 tensor of shape: 192x192x3 RGB with values in [0, 255]	A float32 tensor of shape [1, 1, 17, 3]
movenet/singlepose/thunder	>30	A frame of video or an image represented as an int32 tensor of shape: 256x256x3 RGB with values in [0, 255]	A float32 tensor of shape [1, 1, 17, 3].
movenet/multipose/lightning	>30	A frame of video or an image represented as an int32 tensor of dynamic shape: 1xHxWx3, where H and W need to be a multiple of 32 and the larger dimension is recommended to be 256. The channel order is RGB with values in [0, 255].	A float32 tensor of shape [1, 6, 56]. The model can detect up to 6 people in the image frame simultaneously.
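
To make the output shapes in the table less abstract, here is a small sketch of how they could be unpacked. It assumes the layout described in the MoveNet documentation: each keypoint is (y, x, score) in normalized coordinates, the 17 keypoints follow the COCO order, and each multipose row of 56 values is 17 keypoints followed by [ymin, xmin, ymax, xmax, score]. The function names and the 0.3 score threshold are my own choices for illustration.

```python
# The 17 keypoints in COCO order, as assumed from the MoveNet documentation.
KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def decode_singlepose(output, min_score=0.3):
    """Unpack a [1, 1, 17, 3] output into {name: (x, y, score)}, dropping weak keypoints."""
    result = {}
    for name, (y, x, score) in zip(KEYPOINT_NAMES, output[0][0]):
        if score >= min_score:
            result[name] = (float(x), float(y), float(score))
    return result


def decode_multipose(output, min_score=0.3):
    """Unpack a [1, 6, 56] output into one dict per detected person."""
    people = []
    for row in output[0]:
        row = [float(v) for v in row]
        if row[55] < min_score:  # last value is the per-person detection score
            continue
        keypoints = {
            name: (row[3 * i + 1], row[3 * i], row[3 * i + 2])  # stored as (x, y, score)
            for i, name in enumerate(KEYPOINT_NAMES)
        }
        # Bounding box is kept in the model's own order: (ymin, xmin, ymax, xmax).
        people.append({"box": tuple(row[51:55]), "score": row[55], "keypoints": keypoints})
    return people
```

Both functions accept plain nested lists as well as NumPy arrays, so they can be tried on hand-made data before a model is loaded.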

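The 3D models are these. The variants differ by the spec of the PC that runs them, and it seems natural that the ones meant for higher-spec machines are more accurate.
These models appear to be implemented in the MediaPipe project rather than in TensorFlow itself.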

Model	Description
blazepose_3d/landmark/full	A middle ground between performance and accuracy
blazepose_3d/landmark/heavy	our largest model intended for high accuracy, regardless of size
blazepose_3d/landmark/lite	our smallest model that is less accurate but smaller in model size and minimal memory footprint
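
Since these models seem to live in MediaPipe, one way to try the three variants from Python might be MediaPipe's own package rather than TensorFlow Hub. The sketch below assumes the legacy `mp.solutions.pose` API and assumes that its `model_complexity` values 0, 1 and 2 correspond to lite, full and heavy. I have not confirmed that they are the same weights as the Hub entries, and `estimate_pose_3d` is a hypothetical helper name.

```python
# Assumed mapping from the variant names in the table to MediaPipe's `model_complexity`.
MODEL_COMPLEXITY = {"lite": 0, "full": 1, "heavy": 2}


def estimate_pose_3d(rgb_image, variant="full"):
    """Return a list of (x, y, z, visibility) world landmarks for one RGB image array."""
    import mediapipe as mp  # lazy import; needs the mediapipe package installed

    with mp.solutions.pose.Pose(
        static_image_mode=True,
        model_complexity=MODEL_COMPLEXITY[variant],
    ) as pose:
        results = pose.process(rgb_image)

    if results.pose_world_landmarks is None:  # nobody detected in the image
        return []
    return [
        (lm.x, lm.y, lm.z, lm.visibility)
        for lm in results.pose_world_landmarks.landmark
    ]
```

Switching `variant` between "lite" and "heavy" on the same image would be a quick way to see how much accuracy the heavier model buys on a given PC.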

Besides the full-body models, the MediaPipe project also has models for estimating the state of the hands.
It has other per-part detection models as well, such as for the face.

Model
handpose_3d/detector/full
handpose_3d/detector/lite

I organized this around TensorFlow Hub, but it might be better to sort out the MediaPipe side instead.