Trying out Decision Forests with TensorFlow

I've started to get a rough picture of how this works, so I'll go through the tutorial.

Build, train and evaluate models with TensorFlow Decision Forests

Loading the data
Training the model

Loading the data

The sample uses pandas, but I plan to implement that kind of processing without pandas, so I read the CSV directly with make_csv_dataset instead.
I disable shuffling and specify "species" as the label.

import tensorflow as tf
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv
dataset = tf.data.experimental.make_csv_dataset("/tmp/penguins.csv",batch_size=1,shuffle=False,label_name="species")
print(dataset)
------------------------------------------------------------------
<PrefetchDataset element_spec=(OrderedDict([('island', TensorSpec(shape=(1,), dtype=tf.string, name=None)), ('bill_length_mm', TensorSpec(shape=(1,), dtype=tf.string, name=None)), ('bill_depth_mm', TensorSpec(shape=(1,), dtype=tf.string, name=None)), ('flipper_length_mm', TensorSpec(shape=(1,), dtype=tf.string, name=None)), ('body_mass_g', TensorSpec(shape=(1,), dtype=tf.string, name=None)), ('sex', TensorSpec(shape=(1,), dtype=tf.string, name=None)), ('year', TensorSpec(shape=(1,), dtype=tf.int32, name=None))]), TensorSpec(shape=(1,), dtype=tf.string, name=None))>

Looking at the dataset, everything seems to have loaded properly, but numeric columns such as "bill_length_mm" were read in as strings.
The dtype is tf.string as well:
('bill_length_mm', TensorSpec(shape=(1,), dtype=tf.string, name=None))

def show_batch(dataset):
  for batch,label  in dataset.take(1):
    print(label)
    for key, value in batch.items():
      print("{:20s}: {}".format(key,value.numpy()))

show_batch(dataset)
--------------------------------------------------------------------
tf.Tensor([b'Adelie'], shape=(1,), dtype=string)
island              : [b'Torgersen']
bill_length_mm      : [b'39.1']
bill_depth_mm       : [b'18.7']
flipper_length_mm   : [b'181']
body_mass_g         : [b'3750']
sex                 : [b'male']
year                : [2007]
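
One way to see why a numeric column ends up as a string is to count the cells that hold a non-numeric marker. The sketch below is only an illustration: count_na_cells is a hypothetical helper of my own, it uses the standard csv module rather than TensorFlow, and it assumes the marker is the literal text "NA".

```python
import csv
from collections import Counter


def count_na_cells(path, na_token="NA"):
    """Return {column name: number of cells equal to na_token}.

    Hypothetical helper for inspection only; columns without any
    na_token cells are left out of the result.
    """
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for column, value in row.items():
                if value == na_token:
                    counts[column] += 1
    return dict(counts)


# Example call (the path is the one used for the download above):
# print(count_na_cells("/tmp/penguins.csv"))
```

Any column that shows up in the result contains text that cannot be parsed as a number, which would explain a tf.string dtype for that column.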

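Looking at the CSV, missing values are written as "NA", so I pass na_value='NA' as an option to mark "NA" as the missing-value string.
I also want the label to be numeric, so I replace values such as "Adelie" with numbers.
With these changes, the numeric columns are recognized as numbers.
Also, num_epochs has to be set: by default make_csv_dataset repeats the data indefinitely, so training falls into an infinite loop.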

import tensorflow as tf
dataset = tf.data.experimental.make_csv_dataset("penguins.csv",
    batch_size=1,
    shuffle=False,
    label_name="species",
    num_epochs=1,
    na_value='NA',
    ignore_errors=True)

def show_batch(dataset):
  for batch,label  in dataset.take(1):
    print(label)
    for key, value in batch.items():
      print("{:20s}: {}".format(key,value.numpy()))

show_batch(dataset)
-------------------------------------------------------------
tf.Tensor([0], shape=(1,), dtype=int32)
island              : [b'Torgersen']
bill_length_mm      : [39.1]
bill_depth_mm       : [18.7]
flipper_length_mm   : [181]
body_mass_g         : [3750]
sex                 : [b'male']
year                : [2007]
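
The label replacement mentioned above is not shown in the code. As a rough sketch of one way to do it without pandas, the snippet below rewrites the label column of a CSV with integer ids. encode_label_column is a hypothetical helper, and assigning ids in order of first appearance is my assumption; it merely happens to agree with the output above, where the first row ("Adelie") has label 0.

```python
import csv


def encode_label_column(src_path, dst_path, label_name="species"):
    """Copy a CSV, replacing each distinct label string with an integer id.

    Ids are handed out in order of first appearance (an assumption of
    this sketch), so the first label encountered becomes 0.
    Returns the label -> id mapping that was used.
    """
    mapping = {}
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            label = row[label_name]
            # setdefault keeps an existing id, or stores the next free one.
            row[label_name] = mapping.setdefault(label, len(mapping))
            writer.writerow(row)
    return mapping


# Example call (both paths are assumptions based on the code above):
# print(encode_label_column("/tmp/penguins.csv", "penguins.csv"))
```

Keeping the returned mapping around makes it possible to translate predicted ids back into species names later.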

Training the model

Let's just run it for now.

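import tensorflow_decision_forests as tfdf
# sys_pipes shows the training logs in the notebook output
from wurlitzer import sys_pipes

model_1 = tfdf.keras.RandomForestModel()

model_1.compile(
    metrics=["accuracy"])

with sys_pipes():
  model_1.fit(x=dataset)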

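Strictly speaking the dataset should be split into training and test sets, but since this is only a trial run I evaluate on the same dataset that was used for training.
Setting aside how meaningful the number is, the model does appear to have been trained.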

evaluation = model_1.evaluate(dataset, return_dict=True)
print()

for name, value in evaluation.items():
  print(f"{name}: {value:.4f}")
---------------------------------------------------------
344/344 [==============================] - 1s 2ms/step - loss: 0.0000e+00 - accuracy: 0.9942

loss: 0.0000
accuracy: 0.9942
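
Since the evaluation above reuses the training data, here is a minimal sketch of how the CSV could be split first, again without pandas. split_csv is a hypothetical helper, and the 0.3 test ratio and the fixed seed are arbitrary choices of mine, not values from the tutorial run above.

```python
import csv
import random


def split_csv(src_path, train_path, test_path, test_ratio=0.3, seed=42):
    """Shuffle the data rows of a CSV and write them to two files.

    Both output files receive the header row.
    Returns (number of training rows, number of test rows).
    """
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)

    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    test_rows, train_rows = rows[:n_test], rows[n_test:]

    for path, part in ((train_path, train_rows), (test_path, test_rows)):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)
    return len(train_rows), len(test_rows)


# Example call (file names are assumptions):
# split_csv("penguins.csv", "penguins_train.csv", "penguins_test.csv")
```

Each output file could then be loaded with make_csv_dataset using the same options as before, fitting the model on the training file and calling evaluate on the test file.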