Income Prediction -- Logistic Regression

Please DON'T upload your Jupyter notebook to your GitHub!! Please DON'T plot any figures in your submitted Python code!!

The TA only wrote out the extra machine learning techniques and plotted figures so that all students can understand how the homework can be done. **Please do not upload the Jupyter notebook or plot figures in your submitted Python code.** You can put figures in your report if you want to show them!!

Built-in Packages and Numpy, Scipy, Pandas ONLY!!!

You may only import numpy, scipy, and pandas for calculation. You cannot import any other toolkit for the logistic regression and probabilistic generative models.

You may also import Python built-in packages for I/O, for example csv, etc.

In [1]:
import numpy as np

Load Data

In the following, I use numpy to load the data. You can use other Python built-in packages as well.

**IMPORTANT!!**

Please use

import sys
X_train_fpath = sys.argv[1]
Y_train_fpath = sys.argv[2]
X_test_fpath = sys.argv[3]
output_fpath = sys.argv[4]

instead of the next block in your submitted code,

or use the argparse package and pass the corresponding arguments from your shell script.
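
For example, a minimal argparse sketch could look like the following (the positional argument names are just an illustration; adjust them to your own script):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('X_train_fpath')
parser.add_argument('Y_train_fpath')
parser.add_argument('X_test_fpath')
parser.add_argument('output_fpath')
args = parser.parse_args()

X_train_fpath = args.X_train_fpath
Y_train_fpath = args.Y_train_fpath
X_test_fpath = args.X_test_fpath
output_fpath = args.output_fpath

The corresponding shell script line would then be something like python3 hw2.py data/X_train data/Y_train data/X_test output.csv (the script name and paths here are placeholders for your own setup).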

In [2]:
X_train_fpath = 'data/X_train'
Y_train_fpath = 'data/Y_train'
X_test_fpath = 'data/X_test'
output_fpath = 'output.csv'
In [3]:
X_train = np.genfromtxt(X_train_fpath, delimiter=',', skip_header=1)
Y_train = np.genfromtxt(Y_train_fpath, delimiter=',', skip_header=1)

Tips for Machine Learning Training

How can we know whether the model we have trained is good or bad without testing data? We could train our model, upload the results to the Kaggle competition, and then fine-tune the model to improve its performance. However, the submission quota is limited, so that approach is not very useful, is it?

Here are some tips for training.

I. Validation Set

You can first divide your data into a training set and a validation set. While training on the training set, you can use the validation set to fine-tune your hyperparameters. Since holding out a validation set wastes some training data, you can use N-fold cross validation instead; a minimal sketch is shown below. For more information, please refer to Lecture 2.
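
For instance, an N-fold cross validation split can be built with numpy alone (the fold count and seed below are arbitrary choices for illustration, not part of the homework spec):

import numpy as np

def k_fold_indices(num_examples, k=5, seed=0):
    # Shuffle the example indices once, then split them into k roughly equal folds.
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_examples)
    return np.array_split(indices, k)

# Hypothetical usage: train k models, each time holding out one fold as the validation set.
# folds = k_fold_indices(len(X_train), k=5)
# for i, dev_idx in enumerate(folds):
#     train_idx = np.concatenate([folds[j] for j in range(len(folds)) if j != i])
#     # fit on X_train[train_idx], Y_train[train_idx]; validate on X_train[dev_idx], Y_train[dev_idx]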

II. Data Processing

There are many data processing methods for machine learning; you can try out for yourself which method works better for your model and for the dataset. Different data processing methods suit different datasets. Here are several common processing methods (not all of them are necessarily suitable for this homework, so try them out!):

  1. Normalization: Without normalization, a single feature with a large scale may dominate the model, which is not what we want. **We can normalize each feature to the range 0~1, or standardize our data towards a Normal distribution.** It is one of the most useful data processing methods. Both normalization methods are implemented below, so you can try out which one performs better.
  2. One-hot Encoding: Actually, most of the categorical features in this dataset are already one-hot encoded. The following is just an explanation of one-hot encoding (not taken from our dataset).

    Original representation (0 = others, 1 = graduate school, 2 = university, 3 = high school):

| Education |
|-----------|
| 2 |
| 1 |
| 1 |
| 2 |
| 3 |

One-hot encoding:

| others | graduate school | university | high school |
|--------|-----------------|------------|-------------|
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 |
  3. Discretization: You can discretize a continuous column into ranges (a small numpy sketch of one-hot encoding and discretization follows these tables). For example, you can turn the column

| fnlwgt |
|--------|
| 226802 |
| 89814 |
| 336951 |
| 160323 |
| 103497 |

into

| 0~100k | 100k~200k | 200k~300k | 300k~400k |
|--------|-----------|-----------|-----------|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 0 | 0 |

or (where 0: 0~100k, 1: 100k~200k, 2: 200k~300k, 3: 300k~400k, etc.)

| fnlwgt |
|--------|
| 2 |
| 0 |
| 3 |
| 1 |
| 1 |
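
A small numpy sketch of both tricks on the toy values from the tables above (these numbers are only the illustration, not the real dataset files):

import numpy as np

education = np.array([2, 1, 1, 2, 3])                 # categorical codes from the first table
fnlwgt = np.array([226802, 89814, 336951, 160323, 103497])

# One-hot encoding: index the rows of an identity matrix by the category code.
num_categories = 4                                     # others, graduate school, university, high school
education_onehot = np.eye(num_categories, dtype=int)[education]

# Discretization: map each value to its 100k-wide bin index (0: 0~100k, 1: 100k~200k, ...),
# then optionally one-hot encode the bin index as well.
bin_edges = np.arange(100_000, 400_000, 100_000)       # 100k, 200k, 300k
fnlwgt_bin = np.digitize(fnlwgt, bin_edges)
fnlwgt_onehot = np.eye(4, dtype=int)[fnlwgt_bin]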

Data Processing -- Normalization

Normalize all input features to the range 0~1. You MUST specify whether you are processing training or testing data. In the real world, we do not know what the test data look like, so we can only normalize the test data with the statistics of the training data. Therefore, when processing testing data, please pass in the normalization parameters computed from the training data.

We compute:

$z_i = \frac{x_i - min(x)}{max(x)-min(x)}$

In [4]:
def _normalize_column_0_1(X, train=True, specified_column = None, X_min = None, X_max=None):
    # Rescale the specified columns of the data to the range 0~1.
    # When processing testing data, we need to normalize with the values
    # we used for the training data, so we must save the max and min values
    # of the training data.
    if train:
        if specified_column is None:
            specified_column = np.arange(X.shape[1])
        length = len(specified_column)
        X_max = np.reshape(np.max(X[:, specified_column], 0), (1, length))
        X_min = np.reshape(np.min(X[:, specified_column], 0), (1, length))
        
    X[:, specified_column] = np.divide(np.subtract(X[:, specified_column], X_min), np.subtract(X_max, X_min))

    return X, X_max, X_min
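
A self-contained toy usage sketch (toy_X and toy_test are made-up arrays, just to show that the training min/max must be reused at test time):

toy_X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[1.5, 25.0]])
toy_X, toy_max, toy_min = _normalize_column_0_1(toy_X, train=True)
toy_test, _, _ = _normalize_column_0_1(toy_test, train=False,
                                       specified_column=np.arange(toy_X.shape[1]),
                                       X_min=toy_min, X_max=toy_max)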

Standardize the specified columns to zero mean and unit variance (a Normal-distribution-style normalization). You can try out different kinds of normalization to see which one makes the model learn the data distribution more easily.

For this standardization, we compute

$z_i = \frac{x_i - \mu_x}{\sigma_x}$

where $\mu_x$ and $\sigma_x$ are the mean and standard deviation of the column, estimated from the training data.

In [5]:
def _normalize_column_normal(X, train=True, specified_column = None, X_mean=None, X_std=None):
    # Standardize the specified columns to zero mean and unit variance.
    # When processing testing data, we need to normalize with the values
    # we used for the training data, so we must save the mean and the
    # standard deviation of the training data.
    if train:
        if specified_column is None:
            specified_column = np.arange(X.shape[1])
        length = len(specified_column)
        X_mean = np.reshape(np.mean(X[:, specified_column],0), (1, length))
        X_std  = np.reshape(np.std(X[:, specified_column], 0), (1, length))
    
    X[:,specified_column] = np.divide(np.subtract(X[:,specified_column],X_mean), X_std)
     
    return X, X_mean, X_std
In [6]:
def _shuffle(X, Y):
    randomize = np.arange(len(X))
    np.random.shuffle(randomize)
    return (X[randomize], Y[randomize])
    
def train_dev_split(X, y, dev_size=0.25):
    train_len = int(round(len(X) * (1 - dev_size)))
    return X[:train_len], y[:train_len], X[train_len:], y[train_len:]
In [7]:
# These are the columns that I want to normalize
col = [0,1,3,4,5,7,10,12,25,26,27,28]
X_train, X_mean, X_std = _normalize_column_normal(X_train, specified_column=col)

Logistic Regression Tutorial

Logistic regression is a classification model.

From the algorithm description, we implement:

_sigmoid: compute the sigmoid of the input. We use np.clip to keep the output strictly inside (0, 1) so that the logarithms in the cross entropy never see exactly 0 or 1. For reference, the machine epsilon of float32 is

>> np.finfo(np.float32).eps
>> 1.1920929e-07

Hence, we choose to clip at 1e-6 and 1 - 1e-6.

get_prob: given the weights and bias, return the probability that the model predicts output 1.

infer: if the probability is > 0.5, output 1; otherwise output 0.

_cross_entropy: compute the cross entropy between the model output and the true label.

_loss: compute the loss function $L(w)$, i.e. the cross entropy plus the L2 regularization term $\lambda \sum_i w_i^2$.

_gradient: by the usual derivation, the gradient of the cross entropy with respect to $w_i$ is $\sum_{n} - (\hat{y}^n - f_{w,b}(x^n))x_i^n$.
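
For reference, the cross entropy summed over the training examples, which _cross_entropy below computes in vectorized form, is

$C(f) = \sum_{n} -\left[\hat{y}^n \ln f_{w,b}(x^n) + (1-\hat{y}^n)\ln\left(1 - f_{w,b}(x^n)\right)\right]$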

In [8]:
def _sigmoid(z):
    # sigmoid function can be used to output probability
    return np.clip(1 / (1.0 + np.exp(-z)), 1e-6, 1-1e-6)

def get_prob(X, w, b):
    # the probability to output 1
    return _sigmoid(np.add(np.matmul(X, w), b))

def infer(X, w, b):
    # use round to infer the result
    return np.round(get_prob(X, w, b))

def _cross_entropy(y_pred, Y_label):
    # compute the cross entropy
    cross_entropy = -np.dot(Y_label, np.log(y_pred))-np.dot((1-Y_label), np.log(1-y_pred))
    return cross_entropy

def _gradient(X, Y_label, w, b):
    # return the mean of the gradient over the batch
    y_pred = get_prob(X, w, b)
    pred_error = Y_label - y_pred
    w_grad = -np.mean(np.multiply(pred_error.T, X.T), 1)
    b_grad = -np.mean(pred_error)
    return w_grad, b_grad

def _gradient_regularization(X, Y_label, w, b, lamda):
    # return the mean of the gradient over the batch, plus the L2 regularization term
    y_pred = get_prob(X, w, b)
    pred_error = Y_label - y_pred
    w_grad = -np.mean(np.multiply(pred_error.T, X.T), 1)+lamda*w
    b_grad = -np.mean(pred_error)
    return w_grad, b_grad

def _loss(y_pred, Y_label, lamda, w):
    return _cross_entropy(y_pred, Y_label) + lamda * np.sum(np.square(w))
In [9]:
def accuracy(Y_pred, Y_label):
    acc = np.sum(Y_pred == Y_label)/len(Y_pred)
    return acc

Logistic Regression

**Gradient Descent.** We construct a loss function $L(w)$ with a parameter $w$, and we want to find $w^* = \underset{w}{\operatorname{argmin}} L(w)$:

  1. Pick an initial value $w_0$.
  2. Compute $\frac{dL}{dw}\big|_{w=w_0}$.
  3. Update $w_{i+1} \leftarrow w_i - \eta \frac{dL}{dw}\big|_{w=w_i}$.

Here the function we want to learn is $y = f_{w,b}(x) = \sigma(w \cdot x + b)$; a minimal full-batch sketch of these three steps is given below.
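
A minimal full-batch sketch using the helpers defined above (illustration only; the actual train() below uses mini-batches, L2 regularization, and a decaying learning rate):

w = np.zeros((X_train.shape[1],))   # step 1: pick an initial value
b = np.zeros((1,))
eta = 0.2                           # learning rate
for i in range(100):
    w_grad, b_grad = _gradient(X_train, Y_train, w, b)  # step 2: compute the gradient
    w = w - eta * w_grad                                 # step 3: update the parameters
    b = b - eta * b_grad
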
In [11]:
def train(X_train, Y_train):
    # split a validation set
    dev_size = 0.1155
    X_train, Y_train, X_dev, Y_dev = train_dev_split(X_train, Y_train, dev_size = dev_size)
    
    # Initialize the weights and bias to zero (i.e. the model 0 + 0*x1 + 0*x2 + ...)
    w = np.zeros((X_train.shape[1],)) 
    b = np.zeros((1,))

    regularize = True
    if regularize:
        lamda = 0.001
    else:
        lamda = 0
    
    max_iter = 40  # maximum number of epochs
    batch_size = 32  # number of examples per mini-batch
    learning_rate = 0.2  # how much the model learns at each step
    num_train = len(Y_train)
    num_dev = len(Y_dev)
    step = 1

    loss_train = []
    loss_validation = []
    train_acc = []
    dev_acc = []
    
    for epoch in range(max_iter):
        # Random shuffle for each epoch
        X_train, Y_train = _shuffle(X_train, Y_train)
        
        total_loss = 0.0
        # Logistic regression train with batch
        for idx in range(int(np.floor(len(Y_train)/batch_size))):
            X = X_train[idx*batch_size:(idx+1)*batch_size]
            Y = Y_train[idx*batch_size:(idx+1)*batch_size]
            
            # Find out the gradient of the loss
            w_grad, b_grad = _gradient_regularization(X, Y, w, b, lamda)
            
            # gradient descent update
            # learning rate decay with time
            w = w - learning_rate/np.sqrt(step) * w_grad
            b = b - learning_rate/np.sqrt(step) * b_grad
            
            step = step+1
            
        # Compute the loss and the accuracy of the training set and the validation set
        y_train_pred = get_prob(X_train, w, b)
        Y_train_pred = np.round(y_train_pred)
        train_acc.append(accuracy(Y_train_pred, Y_train))
        loss_train.append(_loss(y_train_pred, Y_train, lamda, w)/num_train)
        
        y_dev_pred = get_prob(X_dev, w, b)
        Y_dev_pred = np.round(y_dev_pred)
        dev_acc.append(accuracy(Y_dev_pred, Y_dev))
        loss_validation.append(_loss(y_dev_pred, Y_dev, lamda, w)/num_dev)
    
    return w, b, loss_train, loss_validation, train_acc, dev_acc  # return loss for plotting

After training, let's take a look at the result. You can use matplotlib to plot the result in order to check whether it is correct. **BUT PLEASE COMMENT THOSE LINES OUT BEFORE YOUR SUBMISSION.**

In [12]:
import matplotlib.pyplot as plt
%matplotlib inline
In [13]:
# the returned losses and accuracies are for plotting the result
w, b, loss_train, loss_validation, train_acc, dev_acc= train(X_train, Y_train)

We can see that both the training loss and the validation loss decay during training. When you prepare your final submission, you can use N-fold cross validation to find the best hyperparameters for the model.

In [14]:
plt.plot(loss_train)
plt.plot(loss_validation)
plt.legend(['train', 'dev'])
plt.show()
In [15]:
plt.plot(train_acc)
plt.plot(dev_acc)
plt.legend(['train', 'dev'])
plt.show()

Test data and output result

Don't forget to apply the same data processing to the test data; otherwise your model will not fit it properly.

In [16]:
X_test = np.genfromtxt(X_test_fpath, delimiter=',', skip_header=1)
# Apply the same data processing to the test data
X_test, _, _= _normalize_column_normal(X_test, train=False, specified_column = col, X_mean=X_mean, X_std=X_std)
In [17]:
result = infer(X_test, w, b)
In [18]:
with open(output_fpath, 'w') as f:
        f.write('id,label\n')
        for i, v in  enumerate(result):
            f.write('%d,%d\n' %(i+1, v))

Let's see which features influence the output the most.

In [19]:
ind = np.argsort(np.abs(w))[::-1]
with open(X_test_fpath) as f:
    content = f.readline().rstrip('\n')
features = np.array([x for x in content.split(',')])
for i in ind[0:10]:
    print(features[i], w[i])
capital_gain 1.2775074614148736
 Never-married -0.7941006191751273
 Married-civ-spouse 0.7093272716255729
 Bachelors 0.5658462127622863
 Own-child -0.5574883078780752
 Exec-managerial 0.49496914446155543
 Other-service -0.48182828393472404
 Unmarried -0.4782921134912772
 Not-in-family -0.4314282456914995
 Divorced -0.40120930144106776