The TA just want to write more machine learning techniques and plot figures to make all students understand how the homework can be done. **Please do not upload jupyter notebook or plot figure in your submitted python code.** You can plot figure on your report if you want to show your figure!!
You can only import numpy,scipy and pandas for calculation. You cannot import any other toolkit for logistic regression and probablistic generative model.
You can also import some python built-in package for I/O, for example, csv, etc.
import numpy as np
In the following, I use numpy to load data. You can use other python built-in package as well.
**IMPORTANT!!**
Please use
import sys
X_train_fpath = sys.argv[1]
Y_train_fpath = sys.argv[2]
X_test_fpath = sys.argv[3]
output_fpath = sys.argv[4]
instead the next following block in your submitted code.
or using argparse package, then write the corresponding shell script parameters.
X_train_fpath = 'data/X_train'
Y_train_fpath = 'data/Y_train'
X_test_fpath = 'data/X_test'
output_fpath = 'output.csv'
X_train = np.genfromtxt(X_train_fpath, delimiter=',', skip_header=1)
Y_train = np.genfromtxt(Y_train_fpath, delimiter=',', skip_header=1)
How can we know that the model we have trained is good or bad without testing data? We may train our model then output result to the kaggle's competition..... Then, you can fine tuned your model to make the preformance better. However, the quota is limited. It is not so useful, isn't it?
Here are some tips for training.
You can first divide your data into training set and validation set. When training on the training set, you can use the validation set to fine tune your hyperparameters. Your training data may be wasted, so you can choose N-fold cross validation. For more information, please find out in the lecture 2.
There are many data processing methods for machine learning, you can try out yourself that which processing method is better for the model and for the dataset. Different data processing method may be suitable for different dataset. Here are just several common processing methods, (may not all suitable for the homework this time.... You can try out!!!)
One-hot Encoding : Actually, in this dataset, most condition is already represented with one-hot encoding. The following is just an explanation of one-hot encoding, (Not in our dataset).
Origin Representaion: (0 = others, 1 = graduate school, 2 = university, 3 = high school)
Education |
---|
2 |
1 |
1 |
2 |
3 |
One-hot Encoding:
others | graduate school | university | high school |
---|---|---|---|
0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 |
fnlwgt |
---|
226802 |
89814 |
336951 |
160323 |
103497 |
into
0~100k | 100k~200k | 200k~300k | 300k~400k |
---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 0 | 0 |
0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 |
0 | 1 | 0 | 0 |
or (where 0: 0~100k, 1:100k~200k, 2: 200k~300k, 3: 300k~400k, etc.)
fnlwgt |
---|
2 |
0 |
3 |
1 |
1 |
Normalize all input data to 0 and 1, MUST specify training or testing. In real world, we don't know how the test data look like. We can only normalize the test data with the input data distribution. So, when testing, please input the corresponding normalizing hyperparameters.
We need to calculate by:
$z_i = \frac{x_i - min(x)}{max(x)-min(x)}$
def _normalize_column_0_1(X, train=True, specified_column = None, X_min = None, X_max=None):
# The output of the function will make the specified column of the training data
# from 0 to 1
# When processing testing data, we need to normalize by the value
# we used for processing training, so we must save the max value of the
# training data
if train:
if specified_column == None:
specified_column = np.arange(X.shape[1])
length = len(specified_column)
X_max = np.reshape(np.max(X[:, specified_column], 0), (1, length))
X_min = np.reshape(np.min(X[:, specified_column], 0), (1, length))
X[:, specified_column] = np.divide(np.subtract(X[:, specified_column], X_min), np.subtract(X_max, X_min))
return X, X_max, X_min
Normalize the specified column to a normal distribution. You can try out different kinds of normalization to make the model learn the data distribution more easily.
For Normal distribution, we can calculate by
$z_i = \frac{x_i - min(x)}{max(x)-min(x)}$
def _normalize_column_normal(X, train=True, specified_column = None, X_mean=None, X_std=None):
# The output of the function will make the specified column number to
# become a Normal distribution
# When processing testing data, we need to normalize by the value
# we used for processing training, so we must save the mean value and
# the variance of the training data
if train:
if specified_column == None:
specified_column = np.arange(X.shape[1])
length = len(specified_column)
X_mean = np.reshape(np.mean(X[:, specified_column],0), (1, length))
X_std = np.reshape(np.std(X[:, specified_column], 0), (1, length))
X[:,specified_column] = np.divide(np.subtract(X[:,specified_column],X_mean), X_std)
return X, X_mean, X_std
def _shuffle(X, Y):
randomize = np.arange(len(X))
np.random.shuffle(randomize)
return (X[randomize], Y[randomize])
def train_dev_split(X, y, dev_size=0.25):
train_len = int(round(len(X)*(1-dev_size)))
return X[0:train_len], y[0:train_len], X[train_len:None], y[train_len:None]
# These are the columns that I want to normalize
col = [0,1,3,4,5,7,10,12,25,26,27,28]
X_train, X_mean, X_std = _normalize_column_normal(X_train, specified_column=col)
Logistic regression is a classification model.
From the algorithm description, we implement:
_sigmoid
: to compute the sigmoid of the input.
Use np.clip to avoid overflow.
The smallest representable positive number is
>> np.finfo(np.float32).eps
>> 1.1920929e-07
Hence, we choose to clip at 1e-6 and 1-1e-6.
get_prob
: given weight and bias, find out the model predict the probability to output 1
infer
: if the probability > 0.5, then output 1, or else output 0.
_cross_entropy
: compute the cross-entropy between the model output and the true label.
_compute_loss
: to compute the loss function $L(w)$ with input $X$, $Y$ and $w$
_gradient
: With math derivation, the gradient of the cross entropy is
$\sum_{n} - (\hat{y}^n - f_{w,b}(x^n))x_i^n$
def _sigmoid(z):
# sigmoid function can be used to output probability
return np.clip(1 / (1.0 + np.exp(-z)), 1e-6, 1-1e-6)
def get_prob(X, w, b):
# the probability to output 1
return _sigmoid(np.add(np.matmul(X, w), b))
def infer(X, w, b):
# use round to infer the result
return np.round(get_prob(X, w, b))
def _cross_entropy(y_pred, Y_label):
# compute the cross entropy
cross_entropy = -np.dot(Y_label, np.log(y_pred))-np.dot((1-Y_label), np.log(1-y_pred))
return cross_entropy
def _gradient(X, Y_label, w, b):
# return the mean of the graident
y_pred = get_prob(X, w, b)
pred_error = Y_label - y_pred
w_grad = -np.mean(np.multiply(pred_error.T, X.T), 1)
b_grad = -np.mean(pred_error)
return w_grad, b_grad
def _gradient_regularization(X, Y_label, w, b, lamda):
# return the mean of the graident
y_pred = get_prob(X, w, b)
pred_error = Y_label - y_pred
w_grad = -np.mean(np.multiply(pred_error.T, X.T), 1)+lamda*w
b_grad = -np.mean(pred_error)
return w_grad, b_grad
def _loss(y_pred, Y_label, lamda, w):
return _cross_entropy(y_pred, Y_label) + lamda * np.sum(np.square(w))
def accuracy(Y_pred, Y_label):
acc = np.sum(Y_pred == Y_label)/len(Y_pred)
return acc
Gradient Desent We construct a loss function $L(w)$ with a parameter $w$: We want to find out $w^* = \underset{w}{\operatorname{argmin}} L(w)$
def train(X_train, Y_train):
# split a validation set
dev_size = 0.1155
X_train, Y_train, X_dev, Y_dev = train_dev_split(X_train, Y_train, dev_size = dev_size)
# Use 0 + 0*x1 + 0*x2 + ... for weight initialization
w = np.zeros((X_train.shape[1],))
b = np.zeros((1,))
regularize = True
if regularize:
lamda = 0.001
else:
lamda = 0
max_iter = 40 # max iteration number
batch_size = 32 # number to feed in the model for average to avoid bias
learning_rate = 0.2 # how much the model learn for each step
num_train = len(Y_train)
num_dev = len(Y_dev)
step =1
loss_train = []
loss_validation = []
train_acc = []
dev_acc = []
for epoch in range(max_iter):
# Random shuffle for each epoch
X_train, Y_train = _shuffle(X_train, Y_train)
total_loss = 0.0
# Logistic regression train with batch
for idx in range(int(np.floor(len(Y_train)/batch_size))):
X = X_train[idx*batch_size:(idx+1)*batch_size]
Y = Y_train[idx*batch_size:(idx+1)*batch_size]
# Find out the gradient of the loss
w_grad, b_grad = _gradient_regularization(X, Y, w, b, lamda)
# gradient descent update
# learning rate decay with time
w = w - learning_rate/np.sqrt(step) * w_grad
b = b - learning_rate/np.sqrt(step) * b_grad
step = step+1
# Compute the loss and the accuracy of the training set and the validation set
y_train_pred = get_prob(X_train, w, b)
Y_train_pred = np.round(y_train_pred)
train_acc.append(accuracy(Y_train_pred, Y_train))
loss_train.append(_loss(y_train_pred, Y_train, lamda, w)/num_train)
y_dev_pred = get_prob(X_dev, w, b)
Y_dev_pred = np.round(y_dev_pred)
dev_acc.append(accuracy(Y_dev_pred, Y_dev))
loss_validation.append(_loss(y_dev_pred, Y_dev, lamda, w)/num_dev)
return w, b, loss_train, loss_validation, train_acc, dev_acc # return loss for plotting
After training, let's take a look our the result. You can use the matplotlib to plot out the result in order to check whether the result is correct. **BUT PLEASE COMMENT THEM BEFORE YOUR SUBMISSION.**
import matplotlib.pyplot as plt
%matplotlib inline
# return loss is to plot the result
w, b, loss_train, loss_validation, train_acc, dev_acc= train(X_train, Y_train)
We can show that the both the training loss and validation loss decays during training. When you submit your final result, you can use N-fold cross validation to find the best parameters for the model.
plt.plot(loss_train)
plt.plot(loss_validation)
plt.legend(['train', 'dev'])
plt.show()
plt.plot(train_acc)
plt.plot(dev_acc)
plt.legend(['train', 'dev'])
plt.show()
Don't forget to do the same data processing for the data, or else, your model cannot fit.
X_test = np.genfromtxt(X_test_fpath, delimiter=',', skip_header=1)
# Do the same data process to the test data
X_test, _, _= _normalize_column_normal(X_test, train=False, specified_column = col, X_mean=X_mean, X_std=X_std)
result = infer(X_test, w, b)
with open(output_fpath, 'w') as f:
f.write('id,label\n')
for i, v in enumerate(result):
f.write('%d,%d\n' %(i+1, v))
Let's see what infulence the output most.
ind = np.argsort(np.abs(w))[::-1]
with open(X_test_fpath) as f:
content = f.readline().rstrip('\n')
features = np.array([x for x in content.split(',')])
for i in ind[0:10]:
print(features[i], w[i])