This post collects details I use as a quick reference when things slip my mind. It's a snapshot of frequently used operations, jargon and methods that trip me up while working with data, PyTorch or Fastai. The writing is rough in places; the intent is simply to have a quick reference to each concept.

  1. Count and sort with Pandas https://stackoverflow.com/questions/40454030/count-and-sort-with-pandas
  2. Using isin for selecting from a list https://stackoverflow.com/questions/40454030/count-and-sort-with-pandas

Topics

  1. Collaborative filtering with Fastai https://towardsdatascience.com/collaborative-filtering-with-fastai-3dbdd4ef4f00

Valuable snippets


Dropping duplicates

dedup_df = without_reviews_df.drop_duplicates(subset=['product_id', 'user_nickname', 'review_text'], keep='first', 
                inplace=False)
dedup_df



Group by

grouped_by_category_df = dedup_df.groupby(by='product_category_primary')
grouped_by_category_df



Group by user_nickname in sorted order, count the number of products in each group, and return the n largest groups


x = (relevant_makeup_df.groupby(by=['user_nickname'], sort=True)['product']
                       .count()
                       .nlargest(38679))



Find percentage of missing values in the dataset

  1. relevant_makeup_df.isna().sum() finds all the NaN or null values in each column and sums them up
  2. relevant_makeup_df.shape[0] fetches the number of rows in the dataset; dividing the result of step 1 by it and multiplying by 100 gives the percentage
relevant_makeup_df.isna().sum()/relevant_makeup_df.shape[0] * 100



Find and assign all not null values to a dataframe for a specific column

# For each group - in this case for user `Mochapj`
d = grouped_by_user_df.get_group('Mochapj')

# Find all the `skinConcerns` that are not null
skin_concerns_df = d[d['skinConcerns'].notnull()]

# Fetch the unique value if only one
skin_concerns_df['skinConcerns'].unique()[0]

# select all records from `relevant_makeup_df` for the user `Mochapj` and assign the attribute `skinConcerns` with the value `aging`
relevant_makeup_df.loc[relevant_makeup_df['user_nickname']=='Mochapj', 'skinConcerns'] = 'aging'

# List and view the updated values
relevant_makeup_df.loc[relevant_makeup_df['user_nickname']=='Mochapj']



Drop na values for a specific column in a dataframe

df.dropna(subset=['EPS'], how='all', inplace=True)



What is an Embedding

An embedding matrix is a weight matrix, randomly initialised by PyTorch's nn.Embedding module.

So an embedding is just a matrix that you look up by index to obtain a row of weights.

In collaborative filtering an additional bias term is added on top of these weights.
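As a minimal sketch of that lookup idea (my own, not from the fastai source), an nn.Embedding for 10 users with 3 factors is just a randomly initialised 10x3 weight matrix indexed by id:

import torch
from torch import nn

emb = nn.Embedding(10, 3)      # 10 users, 3 latent factors: a 10x3 weight matrix
print(emb.weight.shape)        # torch.Size([10, 3])

users = torch.tensor([0, 4])   # look up the rows for users 0 and 4
print(emb(users))              # returns the corresponding weight rows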


class EmbeddingDotBias(Module):
    "Base dot model for collaborative filtering."
    def __init__(self, n_factors:int, n_users:int, n_items:int, y_range:Tuple[float,float]=None):
        self.y_range = y_range
        (self.u_weight, self.i_weight, self.u_bias, self.i_bias) = [embedding(*o) for o in [
            (n_users, n_factors), (n_items, n_factors), (n_users,1), (n_items,1)
        ]]

    def forward(self, users:LongTensor, items:LongTensor) -> Tensor:
        dot = self.u_weight(users)* self.i_weight(items)
        res = dot.sum(1) + self.u_bias(users).squeeze() + self.i_bias(items).squeeze()
        if self.y_range is None: return res
        return torch.sigmoid(res) * (self.y_range[1]-self.y_range[0]) + self.y_range[0]

If you look at the forward block, the weights for the users are multiplied element-wise with the weights for the items, summed, and then the user bias and item bias are added.

The torch.sigmoid is a non-linearity that keeps the result within the target output range (y_range). Think between 0 and 5 for a movie rating system.



What is weight decay

Weight decay penalises complexity by adding the sum of the squared parameter weights, multiplied by a small constant, to the loss. The constant is kept small so that the lowest loss isn't achieved simply by driving all the parameters to 0. Hence wd is usually a small value such as 1e-1 or 1e-2.
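In equation form, this is the w2*wd term that gets added to the loss in the code further below:

\(loss_{wd} = loss + wd \cdot \sum_i w_i^2\)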

Standard SGD

Here the following things happen

y = ax + b

Where a is the slope and b is the intercept

  1. Loss is calculated by the mse function, which squares the difference (y_hat - y) and takes the mean of it
  2. We start with an initial guess for a, wrap it in nn.Parameter so its gradient is tracked, and repeatedly update it
def mse(y_hat, y): 
    return ((y_hat-y)**2).mean()

a = nn.Parameter(a); a

def update():
    # x, y, lr and the iteration counter t are globals defined in the notebook
    y_hat = x@a
    loss = mse(y, y_hat)
    if t % 10 == 0: print(loss)
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)   # gradient descent step
        a.grad.zero_()        # reset gradients for the next iteration
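In the lesson notebook this function is driven by a simple loop (assuming x, y and the learning rate lr have been set up beforehand):

lr = 1e-1
for t in range(100): update()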

Here the gradient is the rate of change of loss with respect to the change in weights.

Here $a$ represents the weights in each layer.

\(a_t = a_{t-1} - (lr * \frac{dLoss}{da})\)

SGD with weight decay

def update(x,y,lr):
    wd = 1e-5
    y_hat = model(x)
    # weight decay
    w2 = 0.
    for p in model.parameters(): w2 += (p**2).sum()
    # add to regular loss
    loss = loss_func(y_hat, y) + w2*wd
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p.sub_(lr * p.grad)
            p.grad.zero_()
    return loss.item()



What is momentum

Momentum is a constant used to keep an exponentially weighted moving average of the gradients (a.grad or p.grad in the earlier examples), so that each update keeps some of the direction the model has already been moving in.

assuming b = 0.9 (beta/momentum)

  1. So if the step at epoch 1 was 0.59 and the gradient at epoch 2 is 0.74, the new step multiplies 0.59 (the previous step) by 0.9, thus retaining the old momentum, multiplies the epoch 2 gradient by 0.1, adds the two together, and subtracts that step (scaled by the learning rate) from the weights (see the sketch below)

Thus directional momentum is maintained (momentum = 0.9 is the usual default).
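A minimal sketch of that update rule (illustrative names, not fastai's implementation):

beta, lr = 0.9, 1e-2
weight, step = 1.0, 0.0                     # a single toy parameter and its running step
for grad in [0.59, 0.74, 0.31]:             # pretend gradients from successive iterations
    step = beta * step + (1 - beta) * grad  # keep 90% of the old direction, add 10% of the new gradient
    weight -= lr * step                     # move the parameter along the averaged direction
print(weight)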



Optimisers

  1. SGD (often paired with a loss function such as mean squared error)
  2. Adam
  3. RMSProp

RMSProp

RMSProp is very similar to Adagrad, with the aim of resolving Adagrad's primary limitation: Adagrad will continually shrink the learning rate for a given parameter (effectively stopping training on that parameter eventually). RMSProp, however, is able to shrink or increase the learning rate.

RMSProp will divide the overall learning rate by the square root of the sum of squares of the previous update gradients for a given parameter (as is done in Adagrad). The difference is that RMSProp doesn’t weight all of the previous update gradients equally, it uses an exponentially weighted moving average of the previous update gradients. This means that older values contribute less than newer values. This allows it to jump around the optimum without getting further and further away.

Further, it allows us to account for changes in the hypersurface as we travel down the gradient, and adjust the learning rate accordingly. If our parameter is stuck in a shallow plain, we'd expect its recent gradients to be small, and therefore RMSProp increases our learning rate to push through it. Likewise, when we quickly descend a steep valley, RMSProp lowers the learning rate to avoid popping out of the minimum.
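A minimal sketch of the idea (illustrative names, not a library implementation):

lr, alpha, eps = 1e-2, 0.99, 1e-8
weight, sq_avg = 1.0, 0.0
for grad in [0.5, 0.1, 0.9]:                        # pretend gradients from successive iterations
    sq_avg = alpha * sq_avg + (1 - alpha) * grad**2 # moving average of squared gradients
    weight -= lr * grad / (sq_avg**0.5 + eps)       # large recent gradients shrink the step, small ones grow it
print(weight)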



Adam

Adam (Adaptive Moment Estimation) combines the benefits of momentum with the benefits of RMSProp. Momentum is looking at the moving average of the gradient, and continues to adjust a parameter in that direction. RMSProp looks at the weighted moving average of the square of the gradients; this is essentially the recent variance in the parameter, and RMSProp shrinks the learning rate proportionally. Adam does both of these things - it multiplies the learning rate by the momentum, but also divides by a factor related to the variance.
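Combining the two previous sketches gives a rough picture of Adam's update (ignoring Adam's bias-correction terms for brevity):

lr, beta1, beta2, eps = 1e-3, 0.9, 0.99, 1e-8
weight, step, sq_avg = 1.0, 0.0, 0.0
for grad in [0.5, 0.1, 0.9]:
    step = beta1 * step + (1 - beta1) * grad          # momentum: moving average of gradients
    sq_avg = beta2 * sq_avg + (1 - beta2) * grad**2   # RMSProp: moving average of squared gradients
    weight -= lr * step / (sq_avg**0.5 + eps)         # momentum in the numerator, variance term in the denominator
print(weight)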

Universal Approximation Theorem

When matrix (weight) multiplications are stacked together with non-linear activation functions, the result can approximate any arbitrarily complex function to a high level of accuracy.

The loss function is what backpropagation relies on to go back and adjust the weights. The adjustment is essentially the gradient (scaled by the learning rate) subtracted from the weights.
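A minimal sketch of such a stack in PyTorch (layer sizes are arbitrary):

import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(784, 50),   # matrix multiplication (weights) plus bias
    nn.ReLU(),            # non-linear activation
    nn.Linear(50, 10),    # another weight matrix
)
out = model(torch.randn(64, 784))   # a batch of 64 fake inputs
print(out.shape)                    # torch.Size([64, 10])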



What is dropout

Dropout lets us regularise rather than reduce the number of parameters. During training we periodically drop activations at random, based on a probability. In tabular_learner this is controlled by emb_drop and p (passed as ps below): emb_drop drops parts of the embedding output while p drops activations in the layers.

last_learner = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04, 
                        y_range=y_range, metrics=accuracy)



p in tabular_learner

It’s a form of regularisation. p in tabular_learner is the probability of dropping activations for each layer. It’s specified on a per-layer basis, e.g. ps=[0.001, 0.01].

A common value is 0.5. The layer’s activations are dropped with probability p.

Dropout isn’t applied at test time, only during training; the library handles this (see the sketch below).
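A minimal sketch in plain PyTorch (not fastai-specific) showing dropout active in training mode and disabled in eval mode:

import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()       # training mode
print(drop(x))     # roughly half the values zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()        # evaluation / test mode
print(drop(x))     # identity: dropout is a no-op at test time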



layers in tabular_learner

It’s the sizes of the fully connected hidden layers in the model; each entry defines the shape of the weight matrix that layer’s input is multiplied with. You can specify multiple layers for tabular data, e.g. layers=[100, 50].



emb_drop in tabular_learner

Embedding dropout zeroes parts of the embedding layer’s output at random with some probability.



Predictions for tabular learner

for index, row in test_df.iloc[0:50].iterrows():
    actual = row['rating']
    prediction = last_learner.predict(row)
    prediction_ratios = prediction[2].numpy()   # per-class probabilities from the prediction tuple
    scores = [(c, k) for c, k in zip(last_learner.data.train_dl.classes, prediction_ratios)]   # pair each class with its probability
    top_scores = sorted(scores, key=lambda x: x[1], reverse=True)[:2]
    first_score = top_scores[0][1]
    second_score = top_scores[1][1]
    
    if first_score > 0.65:
        print(f'Sure - a: {actual} p: {prediction[0]} r: {prediction_ratios}')
    else:
        print(f'Unsure a: {actual} p*: {top_scores[0][0], top_scores[1][0]} r: {scores}')



What is BatchNormalisation

It was originally described as reducing something called internal covariate shift. It has since been shown mathematically that it doesn’t actually reduce internal covariate shift, and that this isn’t why it works.

[image: system schema]

  1. For each mini-batch of activations x, first we find the mean
  2. Then we find the variance of all the activations
  3. We normalise, i.e. (values - mean) / standard deviation
  4. Scale and shift: we add a bias term (the shift) and multiply by another bias-like term (the scale), hence “scale and shift” (see the sketch after this list)
  5. It’s used on continuous values and is a form of regularisation
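A minimal sketch of those steps on a mini-batch of activations (illustrative, not the library implementation):

import torch

x = torch.randn(32, 10)                        # a mini-batch of activations: 32 samples, 10 features
mean = x.mean(dim=0)                           # 1. per-feature mean
var = x.var(dim=0, unbiased=False)             # 2. per-feature variance
x_hat = (x - mean) / (var + 1e-5).sqrt()       # 3. normalise
gamma, beta = torch.ones(10), torch.zeros(10)  # learned parameters in practice
out = gamma * x_hat + beta                     # 4. scale and shift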



Convolution2D

Convolution2D essentially scans over the pixels of an image with a small weight matrix (2x2 in this example, as defined by the kernel size), multiplies each pixel in the patch by its corresponding weight, and reduces that 2x2 patch of pixels to a single value.

[image: system schema]

In the above image the weights alpha, beta, gamma and theta represent weights that identify certain colours, reducing the result to a 2x2 matrix.

[image: system schema]

This image represents how the image matrix having

[
    [a, b, c],
    [d, e, f],
    [g, h, i]
]

is now flattened into a 1-dimensional tensor [a, b, c, d, e, f, g, h, i]. The weights are called the kernel.

The b in the example represents the bias.
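A minimal sketch of a 2D convolution in PyTorch (a single 2x2 kernel over a single-channel image; sizes are arbitrary):

import torch
from torch import nn

img = torch.randn(1, 1, 28, 28)        # batch of 1, 1 channel, 28x28 image
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=2)
out = conv(img)                        # each 2x2 patch is reduced to a single value
print(out.shape)                       # torch.Size([1, 1, 27, 27])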





Learning Rates

When we pass

fit(1, 1e-3)

Every layer is trained with this learning rate



Discriminative learning rates

When we pass a slice with a single value, the final layers get 1e-3 but all the layers before get 1e-3 divided by 3.

fit(1, slice(1e-3))

When it’s

fit(1, slice(1e-5, 1e-3))

When two values are passed, the first layers get 1e-5 and the final layers get 1e-3, with the layer groups in between getting gradually increasing learning rates spread between the two.



Kaiming Initialisation

The process of initialising weights so that the activations they produce keep a standard deviation of about 1 and a mean close to 0.

Essentially it suggests that given

  1. An input matrix of shape n x m (say 50,000 rows by 784 columns, like MNIST), a good initialisation for the weight matrix is

Assuming we want nh hidden units or attributes (nh=50 for this example)

m, nh = 784, 50
w = torch.rand(m, nh)/math.sqrt(m)

This keeps the resulting activations at roughly mean 0 and standard deviation 1. Strictly, the weights should come from a zero-centred distribution (torch.randn rather than torch.rand), and when the network uses a ReLU activation Kaiming initialisation scales by math.sqrt(2/m) instead, to compensate for ReLU zeroing out half the activations.

import torch, math
torch.rand(784, 50)/math.sqrt(784)
tensor([[0.0292, 0.0176, 0.0236,  ..., 0.0146, 0.0024, 0.0097],
        [0.0249, 0.0037, 0.0325,  ..., 0.0228, 0.0183, 0.0339],
        [0.0320, 0.0271, 0.0021,  ..., 0.0319, 0.0205, 0.0293],
        ...,
        [0.0234, 0.0118, 0.0051,  ..., 0.0259, 0.0341, 0.0027],
        [0.0008, 0.0269, 0.0101,  ..., 0.0331, 0.0341, 0.0064],
        [0.0322, 0.0173, 0.0029,  ..., 0.0299, 0.0224, 0.0220]])
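A minimal sketch (my own, not from the post) checking the activation statistics this kind of initialisation aims for:

import torch, math

n, m, nh = 50_000, 784, 50
x = torch.randn(n, m)                    # normalised fake input
w = torch.randn(m, nh) / math.sqrt(m)    # zero-centred init scaled by 1/sqrt(m)
a = x @ w
print(a.mean().item(), a.std().item())   # approximately 0 and 1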



Broadcasting

How a matrix multiplication can be implemented with broadcasting in PyTorch.

The trick with using broadcasting for the multiplication is that each row of a is turned into a column, multiplied element-wise against the whole of matrix b (broadcast across its columns), and summed down the rows to get the same result.

So in the following example, consider matrices a and b

from torch import tensor

a = tensor([[1., 2.], [3., 5.], [4., 6.]])
b = tensor([[1., 3., 5., 7.], [2., 4., 6., 8.]])
a.shape, b.shape
(torch.Size([3, 2]), torch.Size([2, 4]))
a[0, None].shape
torch.Size([1, 2])

The expected output for standard matrix multiplication is this

a.matmul(b)
tensor([[ 5., 11., 17., 23.],
        [13., 29., 45., 61.],
        [16., 36., 56., 76.]])

Our matrices are shaped as

ar, ac = a.shape
br, bc = b.shape

a.shape, b.shape
(torch.Size([3, 2]), torch.Size([2, 4]))

Writing matmul with broadcasting achieves higher computational speed because the inner loop is pushed down into PyTorch’s optimised low-level (C/CUDA) code rather than being run in Python.

The method looks like this

def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
#       c[i,j] = (a[i,:]          * b[:,j]).sum() # previous
        c[i]   = (a[i  ].unsqueeze(-1) * b).sum(dim=0)
    return c

Breaking down the line that does the actual multiplication, c[i] = (a[i].unsqueeze(-1) * b).sum(dim=0). Row 0 of matrix a is

a[0], a[0].shape
(tensor([1., 2.]), torch.Size([2]))

It does a[0].unsqueeze(-1) to introduce a dimension, converting the row into a column. This means the tensor is now shaped differently

a[0].unsqueeze(-1), a[0].unsqueeze(-1).shape
(tensor([[1.],
         [2.]]),
 torch.Size([2, 1]))

So now our matrix which was of shape (2,) is now of shape (2, 1).

Now the expected output for row 0 of matrix c is [ 5., 11., 17., 23.]. Here is what a[0].unsqueeze(-1) * b gives us

a[0].unsqueeze(-1) * b
tensor([[ 1.,  3.,  5.,  7.],
        [ 4.,  8., 12., 16.]])

We now sum down each column (dim=0) to collapse the two rows into a single row of outputs

(a[0].unsqueeze(-1) * b).sum(dim=0)
tensor([ 5., 11., 17., 23.])

The loop essentially does this for each row of matrix a to generate the rows of matrix c.