Machine learning (Keras) and Steem #1: User vs bid-bot binary classification

Repository

https://github.com/keras-team/keras

What Will I Learn?

  • Collect and preprocess data
  • Visualize dataset
  • Build neural network model
  • Measure model performance

Requirements

  • python
  • basic concepts of machine learning

Tools:

It looks like a lot of libraries, but it's a standard python toolset for data analysis / machine learning.

Difficulty

  • Intermediate

Tutorial Contents

  • Problem description
  • Collecting data
  • Visualization of the dataset
  • Building model
  • Measuring model performance
  • Conclusions

Problem description

The purpose of this tutorial is to learn how to build a machine learning model that can distinguish a user from a bid-bot.

Collecting data

A list of bots has been collected from https://steembottracker.com and can be found here. Now we need a list of users. Let's choose the users who published a post in the #utopian-io category in July 2018. We will use the following query against the SteemSQL database.

SELECT DISTINCT author
FROM Comments (NOLOCK)
WHERE depth = 0 AND
      category = 'utopian-io' AND
      YEAR(created) = 2018 AND
      MONTH(created) = 7

The resulting list is several times larger than the bot list, but we will not modify it: unbalanced classes are very common in machine learning, and data is rarely perfect. The list can be found here: users.txt

Now we want to extract statistics for each account from the list. We will use the beem library.
We will use the following attributes:

  • number of followers
  • number of followings
  • followings / followers ratio
  • reputation
  • number of users who muted the given account
  • effective STEEM POWER (with delegation)
  • own STEEM POWER (without delegation)
  • effective STEEM POWER / own STEEM POWER ratio

from beem.account import Account

# Read both account lists; the context managers close the files afterwards.
with open('bid_bots.txt') as f:
    bid_bots = {line.strip() for line in f if line.strip()}
with open('users.txt') as f:
    users = {line.strip() for line in f if line.strip()}

# The label column is named 'bid-bot?' to match the column used in the later code.
header = ['name', 'followers', 'followings', 'foll. ratio', 'muters',
          'rep', 'eff. sp', 'own sp', 'sp ratio', 'bid-bot?']

with open('data.csv', 'w') as f:
    f.write(','.join(header) + '\n')
    for name in bid_bots.union(users):
        account = Account(name)
        foll = account.get_follow_count()
        # Query each STEEM POWER value once and reuse it for the ratio.
        eff_sp = account.get_steem_power()
        own_sp = account.get_steem_power(onlyOwnSP=True)
        row = (name,
               foll['follower_count'],
               foll['following_count'],
               foll['following_count'] / foll['follower_count'],
               len(account.get_muters()),
               account.get_reputation(),
               eff_sp,
               own_sp,
               eff_sp / own_sp,
               1 if name in bid_bots else 0)  # label: 1 = bid-bot, 0 = user

        f.write(','.join(map(str, row)) + '\n')
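The two ratio columns divide by values that can be zero (an account with no followers, or with no STEEM POWER of its own), and in that case the division in the loop raises ZeroDivisionError. A minimal sketch of one possible guard is shown here; the helper name safe_ratio and the fallback value 0.0 are assumptions made for illustration and are not part of the original script.

```python
def safe_ratio(numerator, denominator, default=0.0):
    """Return numerator / denominator, or `default` when the denominator is zero.

    Hypothetical helper: the fallback of 0.0 is an arbitrary choice for this sketch.
    """
    if denominator == 0:
        return default
    return numerator / denominator


# Values from the first sample row: 66 followings / 1144 followers
print(round(safe_ratio(66, 1144), 6))  # 0.057692
# An account with zero followers no longer stops the export
print(safe_ratio(5, 0))  # 0.0
```

Whether 0.0 is a sensible stand-in depends on the attribute; skipping such accounts entirely would be another reasonable choice.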

The result has been saved as data.csv. A sample of the file is shown below. The last column indicates whether an account is a bid-bot; the columns between the name and the label are the attributes we will use to train our model.

                 name  followers  followings  foll. ratio  muters        rep       eff. sp        own sp     sp ratio  bid-bot?
0             beggars       1144          66     0.057692       4  58.485711    457.254297    457.254297     1.000000         0
1     anthonyadavisii       3477        2295     0.660052      19  59.185663    575.470517   1210.696343     0.475322         0
2           amvanaken       3026         175     0.057832      10  60.324424    491.175030    716.217811     0.685790         0
3       shaikhraz1986        229          11     0.048035       3  46.766451     15.051121     12.462512     1.207712         0
4            harpagon        306          39     0.127451       0  55.577631    389.858866    389.858866     1.000000         0
5               famil        661         274     0.414523       3  58.858699      7.859833    207.836914     0.037817         0
6           swapsteem        197          31     0.157360       0  39.812361     15.441267      0.907584    17.013591         0
7            kr-nahid        773         366     0.473480       6  58.179101    130.890891    130.890891     1.000000         0
8        thefairypark        153           1     0.006536       0  32.825241     15.000498      0.559973    26.787898         0
9             eosbake        115           0     0.000000       0  25.000000      3.002430      3.002430     1.000000         0
10      official-hord        619         139     0.224556       0  59.131711     65.469489     65.469489     1.000000         0
..                ...        ...         ...          ...     ...        ...           ...           ...          ...       ...

Visualization of the dataset

Let us first look at the dataset. We can expect that:

  • more people mute the bots
  • bots have a higher effective STEEM POWER
  • bots have most of the STEEM POWER from delegation
  • bots have a lower average reputation (because they do not add posts)
  • bots follow fewer users

Let's check if our assumptions are confirmed.

# df: the DataFrame holding data.csv; columns: the attribute column names
for c in columns:
    print('%s|%.3f|%.3f' % (c, df[df['bid-bot?'] == 0][c].mean(), df[df['bid-bot?'] == 1][c].mean()))

It turns out that our assumptions hold: the average values differ greatly between the two classes.

Attribute     User average value   Bid-bot average value
followers               1078.974                1926.841
followings               382.238                 204.068
foll. ratio                0.389                   0.170
muters                     3.524                  12.330
rep                       55.065                  45.564
eff. sp                 5494.488              224684.003
own sp                  2110.639               13383.966
sp ratio                  18.165                 126.915

However, calculating the average is not enough. Let's look at some charts.

followers + followings:

import matplotlib.pyplot as plt
import seaborn as sns
sns.FacetGrid(df1, hue="bid-bot?", height=7).map(plt.scatter, "followers", "followings")
plt.legend(['user', 'bid-bot'])

muters + rep:

# seaborn 0.9 renamed the FacetGrid "size" argument to "height"
sns.FacetGrid(df1, hue="bid-bot?", height=7).map(plt.scatter, "muters", "rep")
plt.legend(['user', 'bid-bot'])

foll. ratio + sp ratio:

sns.FacetGrid(df1, hue="bid-bot?", height=7).map(plt.scatter, "foll. ratio", "sp ratio")
plt.legend(['user', 'bid-bot'])

Visualization gives us a much better picture of the situation. We can see that the two classes are fairly well separable.

Building model

We now have to separate the attributes (X) from the decision class (y), and then split the data into a training set and a test set. The training set is used to train the neural network, while the test set is used to check how well it classifies (training and testing on the same data would give an overly optimistic result).

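import pandas as pd
from sklearn.model_selection import train_test_split

X_cols = ['followers', 'followings', 'foll. ratio', 'muters', 'rep', 'eff. sp', 'own sp', 'sp ratio']
y_cols = ['bid-bot?']
X = df[X_cols].apply(pd.to_numeric)
y = df[y_cols].apply(pd.to_numeric)
# Hold out 30% of the records for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)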

Our model is a neural network, consisting of:

  • the input layer, which receives the attribute values
  • the hidden layer
  • the output layer which returns the classification result

The code building the neural network model is relatively simple.

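from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# input_dim=8: one input per attribute in X_cols
model.add(Dense(24, activation='relu', input_dim=8))
model.add(Dense(12, activation='relu'))
# A single sigmoid unit outputs the probability of the bid-bot class
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])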

  • model.add - appends a new layer to the model
  • Dense - a fully connected layer
  • relu - the ReLU activation function, max(0, x)
  • sigmoid - the sigmoid activation function, which maps any value into the range (0, 1)
  • binary_crossentropy - the loss function for binary classification, where the output is a probability between 0 and 1
  • adam - the Adaptive Moment Estimation optimizer, roughly RMSProp with momentum
  • accuracy - the fraction of correctly classified inputs
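To make these terms more concrete, here is a minimal pure-Python sketch of the two activation functions and the loss. It is written for illustration only and is not how Keras implements them internally; in particular, it assumes every predicted probability lies strictly between 0 and 1, and the function names are chosen for this example.

```python
import math


def relu(x):
    """ReLU activation: negative inputs become 0, positive inputs pass through."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid activation: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_crossentropy(y_true, y_prob):
    """Mean binary cross-entropy over paired labels (0 or 1) and predicted probabilities."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)


print(relu(-2.0), relu(3.0))  # 0.0 3.0
print(sigmoid(0.0))           # 0.5
# Confident and correct predictions give a small loss...
print(round(binary_crossentropy([1, 0], [0.9, 0.1]), 4))  # 0.1054
# ...while confident and wrong predictions are penalized heavily.
print(round(binary_crossentropy([1, 0], [0.1, 0.9]), 4))  # 2.3026
```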

We can display the characteristics of the neural network.

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_10 (Dense)             (None, 24)                216       
_________________________________________________________________
dense_11 (Dense)             (None, 12)                300       
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 13        
=================================================================
Total params: 529
Trainable params: 529
Non-trainable params: 0

We can see that we have 3 layers:

  • the first has 24 neurons
  • the second has 12 neurons
  • the third has one neuron
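The Param # column of the summary can be checked by hand: a fully connected layer has one weight for every input-neuron pair plus one bias per neuron. The short sketch below reproduces the numbers; the helper name dense_params is chosen for this example.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: one weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_sizes = [8, 24, 12, 1]  # 8 input attributes, then the three Dense layers
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [216, 300, 13]
print(sum(counts))  # 529
```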

Of course, you can set different values here or even add more layers and check what the results will be.

Now it is time to train the neural network. The epochs parameter sets how many passes over the entire training set are performed, and the batch_size parameter determines how many samples are processed before each weight update.
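As a rough illustration of how the two parameters interact, the sketch below counts the weight updates for the 305 training records of this tutorial (the figure reported in its training log). The batch sizes 32 and 305 are hypothetical alternatives added only for comparison; the tutorial itself uses batch_size=1.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one epoch (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)


n_train = 305  # training-set size, as reported in the training log
epochs = 50

for batch_size in (1, 32, 305):
    steps = updates_per_epoch(n_train, batch_size)
    print(batch_size, steps, steps * epochs)
# 1 305 15250
# 32 10 500
# 305 1 50
```

With batch_size=1 the weights are updated after every single record, which gives many updates per epoch at the cost of slower, noisier training.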


model.fit(X_train, y_train, epochs=50, batch_size=1, verbose=1)

Measuring model performance

Let's look at the output of the last training epochs.

...
Epoch 45/50
305/305 [==============================] - 0s 2ms/step - loss: 3.0651 - acc: 0.8098
Epoch 46/50
305/305 [==============================] - 0s 1ms/step - loss: 3.0651 - acc: 0.8098
Epoch 47/50
305/305 [==============================] - 0s 1ms/step - loss: 3.0651 - acc: 0.8098
Epoch 48/50
305/305 [==============================] - 0s 1ms/step - loss: 3.0651 - acc: 0.8098
Epoch 49/50
305/305 [==============================] - 0s 2ms/step - loss: 3.0651 - acc: 0.8098
Epoch 50/50
305/305 [==============================] - 0s 1ms/step - loss: 3.0651 - acc: 0.8098

Frankly speaking, it does not look good. The loss should be as close as possible to 0 and the acc value as close as possible to 1. Here the loss is stuck at 3.0651 and no longer changes between epochs.

Let's look at the confusion matrix.

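from sklearn.metrics import confusion_matrix

# Threshold the sigmoid output at 0.5 (predict_classes was removed from newer Keras releases)
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
cm = confusion_matrix(y_test, y_pred)
print(cm)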

It turns out that the classification has failed completely: every record except one was classified as a user (rows of the matrix are the true classes, columns are the predicted ones). Incidentally, this shows that the acc metric is not reliable when the classes are unbalanced.

[[101   1]
 [ 30   0]]
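The matrix can be turned into numbers that show why accuracy is misleading here. The sketch below computes accuracy, precision and recall by hand from the values printed above, and compares them with a trivial model that always answers "user"; the variable names are chosen for this example.

```python
# Confusion matrix from the text: rows are true classes, columns are predictions
# (index 0 = user, index 1 = bid-bot).
cm = [[101, 1],
      [30, 0]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
# Guard against division by zero when no positives are predicted or present.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
# Accuracy of a trivial model that always answers "user".
baseline = (tn + fp) / total

print(round(accuracy, 4))  # 0.7652
print(precision, recall)   # 0.0 0.0
print(round(baseline, 4))  # 0.7727
```

Not a single bid-bot was detected (recall 0), and the network is slightly worse than the trivial baseline, even though its accuracy looks respectable at first glance.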

We need to make some modifications so that the network correctly classifies the objects. The choice is very wide:

  • add new layers
  • change the number of neurons in the layers
  • change the optimizer
  • increase the number of epochs

However, there is not much data, so a simple network should be good enough. Maybe the problem is in the data itself? Let's try some preprocessing before feeding the data to the neural network. As we saw in the earlier charts, the attribute value ranges are quite large, whereas a neural network works better when these ranges are as close as possible to [0, 1]. That is why we will normalize the input data. We will first try StandardScaler, which standardizes each feature by removing the mean and scaling to unit variance.

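from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler

# Standardize every attribute, then repeat the split with the same random_state
X = pd.DataFrame(StandardScaler().fit_transform(X))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Rebuild and recompile the model first; otherwise fit() continues from the weights trained above
model.fit(X_train, y_train, epochs=50, batch_size=1, verbose=1)
# Threshold the sigmoid output at 0.5 (predict_classes was removed from newer Keras releases)
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(4, 4))
sns.heatmap(cm, annot=True, cmap="Greens")
print('precision:', precision_score(y_test, y_pred),
      '\nrecall:', recall_score(y_test, y_pred),
      '\nf1:', f1_score(y_test, y_pred))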

As we can see, the results are much better: the loss has decreased, the acc has increased, and the confusion matrix looks much better. However, these results are still far from ideal.

Epoch 50/50
305/305 [==============================] - 0s 1ms/step - loss: 0.1310 - acc: 0.9410
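To see what standardization does to a single attribute, here is a small hand computation on the follower counts of the first five accounts in the sample table. It is a sketch of the formula (value minus mean, divided by the standard deviation) and not a replacement for StandardScaler.

```python
import statistics

# Follower counts of the first five accounts in the sample table
followers = [1144, 3477, 3026, 229, 306]

mean = statistics.mean(followers)
std = statistics.pstdev(followers)  # population standard deviation, as StandardScaler uses

z_scores = [(x - mean) / std for x in followers]
print([round(z, 2) for z in z_scores])

# The result has mean 0 and unit variance, but it is not confined to [0, 1]:
# values below the mean become negative and large values exceed 1.
print(min(z_scores) < 0, max(z_scores) > 1)  # True True
```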

Let us now take a different approach to normalization: QuantileTransformer.

from sklearn.preprocessing import QuantileTransformer
X = pd.DataFrame(QuantileTransformer().fit_transform(df[X_cols]))

The results are already acceptable.

We have been training for 50 epochs all along. Let's see what happens if we reduce the number of epochs to 10.

model.fit(X_train, y_train, epochs=10, batch_size=1, verbose=1)

It turns out that the result is basically the same, so a large number of epochs was not necessary.

Let's look at the difference between the raw data and the normalized data.

[Figure panels: Raw data, StandardScaler, QuantileTransformer]

It turns out that StandardScaler is not a very good choice here, because the transformed data do not fall within the range [0, 1]. QuantileTransformer, in contrast, not only maps the data into [0, 1] but also spreads them evenly across that range.
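The even spread comes from the fact that a quantile transform works on the rank of each value and not on its magnitude. The sketch below is a deliberately simplified stand-in, not scikit-learn's implementation (which estimates quantiles and interpolates between them); it assumes at least two values and no ties, and the function name is chosen for this example.

```python
def rank_to_unit_interval(values):
    """Map each value to rank / (n - 1), so the output is evenly spread over [0, 1].

    Simplified stand-in for a uniform quantile transform; assumes distinct values
    and at least two of them.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank / (len(values) - 1)
    return result


# Effective STEEM POWER of the first five accounts in the sample table (rounded)
eff_sp = [457.25, 575.47, 491.18, 15.05, 389.86]
print(rank_to_unit_interval(eff_sp))  # [0.5, 1.0, 0.75, 0.0, 0.25]
```

Because only the ordering matters, a single account with an enormous STEEM POWER would still map to 1.0 without compressing all the other values towards 0.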

Conclusions

  • building the optimal model requires experimenting
  • it is worthwhile to examine what the input data looks like
  • proper preprocessing of the input data is very important
  • the decision classes need not be of equal size
  • we should try to make the neural network as simple as possible
  • a large number of epochs is not always needed
  • here we had a relatively small dataset, but the bigger the dataset, the better we can train the neural network

Proof of Work Done

Scripts used in this work (as Jupyter Notebook)
