Python Scikit-learn Version Backward Incompatibility

I ran into this issue at work while deploying models trained on my laptop to a Spark cluster. The scikit-learn version on my laptop is 0.17.1, while the one running on the Spark cluster is 0.15.2.

I use virtualenv to replicate the issue.

$ pip install virtualenv
$ cd /path/to/myfolder/

I first create two separate virtual environments, one for scikit-learn 0.15 and one for scikit-learn 0.17. I keep the versions of the dependency packages the same in both (numpy 1.11.1 and scipy 0.18.0). I use virtualenv to create two folders, vesklearn15 and vesklearn17, and put the Python code in a code folder. The directory layout is as follows:

/path/to/myfolder/vesklearn15
/path/to/myfolder/vesklearn17
/path/to/myfolder/code

The following commands show how I set up the two virtual environments and installed the corresponding scikit-learn version in each.

$ virtualenv vesklearn15
$ cd vesklearn15/bin
$ source activate
(vesklearn15) $ ./pip install numpy==1.11.1
(vesklearn15) $ ./pip install scipy==0.18.0
(vesklearn15) $ ./pip install scikit-learn==0.15.2
(vesklearn15) $ deactivate
$ cd ../../
$ virtualenv vesklearn17
$ cd vesklearn17/bin
$ source activate
(vesklearn17) $ ./pip install numpy==1.11.1
(vesklearn17) $ ./pip install scipy==0.18.0
(vesklearn17) $ ./pip install scikit-learn==0.17.1
(vesklearn17) $ deactivate
$ cd ../../
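
Since the pip commands above decide which versions actually end up in each environment, a quick check before running anything can be useful. The following is a minimal sketch and not part of my original setup: the helper name report_versions is hypothetical, and it assumes each package exposes a __version__ attribute (numpy, scipy and sklearn do).

```python
import importlib


def report_versions(module_names):
    """Return a dict mapping each module name to its version string.

    The value is None when the module cannot be imported or does not
    define a __version__ attribute.
    """
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # the package is not installed in the active environment
            versions[name] = None
            continue
        versions[name] = getattr(module, "__version__", None)
    return versions


if __name__ == "__main__":
    found = report_versions(["numpy", "scipy", "sklearn"])
    for name in sorted(found):
        print("%s: %s" % (name, found[name]))
```

Run with the python of each activated virtualenv, the numpy and scipy lines should be identical in both environments, and only the sklearn line should differ.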

I use the Iris data to train a simple linear-kernel SVM model and score a single instance from the Iris dataset. The following code snippet is saved as ‘save_model.py’ in the folder /path/to/myfolder/code/

"""%%file save_model.py"""

from sklearn.svm import SVC
from sklearn import datasets
import numpy as np
import pickle
import sklearn

# define a function to save the model as pickle file
def save_object(obj, filename):
    with open(filename, 'wb') as output:
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

iris = datasets.load_iris()
X = iris.data
Y = iris.target

# only use classes 0 and 1 and train a binary classification model
ind = np.nonzero(Y<2)
X = X[ind]
Y = Y[ind]

# split the training and test data.
train_data = X[:-1,:]
train_label = Y[:-1]
test_data = X[-1,:].reshape(1, -1)
test_label = Y[-1]

# train a simple classifier
eval_clf = SVC(C=1, kernel='linear', class_weight='auto', probability=True)
eval_clf.fit(train_data,train_label)

print('Sklearn version is ' + sklearn.__version__)

# print the data
print(test_data)

# print the probability
out_prob = eval_clf.predict_proba(test_data)
print(out_prob)

# print the prediction
out_pred = eval_clf.predict(test_data)
print(out_pred)

# print the decision function (distance to the decision boundary)
out_decision = eval_clf.decision_function(test_data)
print(out_decision)

# save the trained model as pickle file
save_object(eval_clf,"/path/to/myfolder/code/model.pickle")

The next code snippet, saved as ‘load_model.py’ in /path/to/myfolder/code/, loads the model back and applies it to the same test instance:

"""%%file load_model.py"""

from sklearn.svm import SVC
from sklearn import datasets
import numpy as np
import pickle
import sklearn

with open('/path/to/myfolder/code/model.pickle', 'rb') as fp:
    trained_clf = pickle.load(fp)

iris = datasets.load_iris()
X = iris.data
Y = iris.target

# only use classes 0 and 1, matching the data the saved model was trained on
ind = np.nonzero(Y<2)
X = X[ind]
Y = Y[ind]

# split the training and test data.
train_data = X[:-1,:]
train_label = Y[:-1]
test_data = X[-1,:].reshape(1, -1)
test_label = Y[-1]

print('Sklearn version is ' + sklearn.__version__)

# print the data
print(test_data)

# print the probability
out_prob = trained_clf.predict_proba(test_data)
print(out_prob)

# print the prediction
out_pred = trained_clf.predict(test_data)
print(out_pred)

# print the decision function (distance to the decision boundary)
out_decision = trained_clf.decision_function(test_data)
print(out_decision)

First, activate the virtualenv for scikit-learn 0.17 and run save_model.py:

$ cd vesklearn17/bin
$ source activate
(vesklearn17) $ python ../../code/save_model.py

The output shows the scikit-learn version, the test instance, the predicted probabilities, the predicted label and the decision function score, in that order. It should be similar to the following:

Sklearn version is 0.17.1
[[ 5.7  2.8  4.1  1.3]]
[[ 0.01596097  0.98403903]]
[1]
[ 2.06736897]

The pickle file ‘model.pickle’ is created in the code folder.

Next, activate the virtualenv for scikit-learn 0.15 and load the pickle file back. For convenience, I open a second terminal window and activate the other virtualenv there.

$ cd vesklearn15/bin
$ source activate
(vesklearn15) $ python ../../code/load_model.py

The output shows:

Sklearn version is 0.15.2
[[ 5.7  2.8  4.1  1.3]]
[[  9.99994689e-01   5.31146239e-06]]
[0]
[[-4.97305797]]

Comparing the last three lines of the two outputs, the predicted probabilities are swapped between the two classes and the predicted label changes from 1 to 0. The decision function value also changes from positive to negative. This issue starts to occur between scikit-learn 0.15.X and 0.16.X. The main risk is that the error is silent, so it can catch people off guard during implementation. In contrast, when I save the pickle file in 0.15.2 and load it back in 0.17.1, a “dual coefficient” error is raised and the code breaks, which means the forward incompatibility also exists but is caught by an error. We hope newer versions of scikit-learn will add an alert message or an intentional failure to avoid this kind of issue.
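
Until such an alert exists in the library, one way to guard against the silent case is to store the scikit-learn version next to the model and check it before unpickling the model itself. The following is only a sketch under my own assumptions: the helper names save_with_version and load_with_version and the strict flag are hypothetical and not part of scikit-learn, and the version string is supplied by the caller (for example sklearn.__version__).

```python
import pickle
import warnings


def save_with_version(obj, filename, library_version):
    """Write the version string first, then the object, into one file."""
    with open(filename, 'wb') as output:
        pickle.dump(library_version, output, pickle.HIGHEST_PROTOCOL)
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)


def load_with_version(filename, current_version, strict=True):
    """Read the stored version, compare it, and only then load the object.

    With strict=True a mismatch raises RuntimeError before the object is
    unpickled; with strict=False it only emits a warning.
    """
    with open(filename, 'rb') as fp:
        saved_version = pickle.load(fp)
        if saved_version != current_version:
            message = ('model was saved with version %s but is being '
                       'loaded with version %s'
                       % (saved_version, current_version))
            if strict:
                raise RuntimeError(message)
            warnings.warn(message)
        return pickle.load(fp)
```

In save_model.py this would take the place of save_object, called as save_with_version(eval_clf, "/path/to/myfolder/code/model.pickle", sklearn.__version__), and load_model.py would call load_with_version with its own sklearn.__version__. With strict=True, loading the 0.17.1 model under 0.15.2 stops with an error instead of quietly returning different predictions.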
