Understanding Pytorch 1 dimensional CNN (Conv1d) Shapes For Text Classification

Sumanshu Arora
5 min read · Aug 17, 2020


Hello Readers,

I am a Data Scientist working with a major bank in Australia in the machine learning automation space. For a project I was working on, I needed to build a text classification model, and having recently shifted my focus from TensorFlow to PyTorch (for no reason other than learning a new framework), I started exploring PyTorch's 1-dimensional CNN (Conv1d) architecture for my model.

I personally found it a bit confusing after seeing examples on the internet that pass the word embedding dimension as the input to the Conv1d layer. This may be easy and obvious for many, but I like to understand things visually, so I started exploring how best to make sense of it. Reading some articles gave me a rough idea, but I couldn't say with 100% confidence that I completely understood it. So I decided to explore it myself, using some dummy variables and comparing the Conv1d layer's output with a manual calculation, which I am going to share in this article.

Before we move on, I am assuming the reader understands the basic concepts of deep learning, ANNs, CNNs, word embeddings, etc. If not, this may be a bit difficult to grasp. Let's get started.

For simplicity, I am going to use an example where we have a sentence length of 5 and a word embedding dimension of 3, so

n = 1 (number of samples / batch size)

d = 3 (word embedding dimension)

l = 5 (sentence length)

so the shape of one sample should be (1, 5, 3). Hence, I am going to create a random PyTorch tensor with the same shape.

>> import torch
>> import torch.nn as nn
>> n = 1 #batch size
>> l = 5 #sentence length
>> d = 3 #embedding dimension
>> rand_arr = torch.rand(n, l, d)

As mentioned earlier, the embedding dimension serves as the number of input channels to the Conv1d layer, and just for demonstration purposes we will ask the Conv1d layer to output 1 channel. Let's define the Conv1d layer as —

input channels = 3

output channels = 1

kernel size = 2

stride = 1 (default)

>> conv1 = nn.Conv1d(d, 1, 2)

Now, let's look at the random tensor we generated and its shape.

>> print(rand_arr)
tensor([[[0.9527, 0.9451, 0.2209],
[0.0332, 0.8993, 0.8718],
[0.7281, 0.4627, 0.7274],
[0.1490, 0.4004, 0.3260],
[0.3055, 0.7935, 0.1360]]])

>> print(rand_arr.shape)
torch.Size([1, 5, 3])

Assume the 5-word sentence for which we created the 3-dimensional word embeddings in the steps above was "Word embedding are so cool".

We can view this above sentence-embedding relationship horizontally as -

Word — [0.9527, 0.9451, 0.2209]

embedding — [0.0332, 0.8993, 0.8718]

are — [0.7281, 0.4627, 0.7274]

so — [0.1490, 0.4004, 0.3260]

cool — [0.3055, 0.7935, 0.1360]

To make the random tensor fit the defined conv1 layer, we have to reshape it so that the channel dimension (which in this case is the embedding) comes first.

>> rand_arr_permute = rand_arr.permute(0,2,1)
>> print(rand_arr_permute)
tensor([[[0.9527, 0.0332, 0.7281, 0.1490, 0.3055],
[0.9451, 0.8993, 0.4627, 0.4004, 0.7935],
[0.2209, 0.8718, 0.7274, 0.3260, 0.1360]]])

Now, we can view the sentence-embedding relationship vertically as —

  Word    embedding   are      so      cool
[0.9527,  0.0332,   0.7281,  0.1490,  0.3055],
[0.9451,  0.8993,   0.4627,  0.4004,  0.7935],
[0.2209,  0.8718,   0.7274,  0.3260,  0.1360]

In the case of Conv1d, we do not have filters that stride through a two-dimensional matrix both horizontally and vertically, as in the 2d convolution shown below —

2d Convolution

Image Source — https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d

Instead, the filter strides through only 1 dimension, i.e. horizontally. So if our kernel size is 2, the filter strides over pairs of 2 words (bi-grams), which are represented as columns after the permutation step we performed above.
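In case the illustrations do not render for you, the same idea can be sketched in code: each slice below is one position of the kernel as it strides over the permuted tensor. (This is a sketch with freshly generated random values, so the numbers will not match the ones above.)

```python
import torch

# Reproduce the setup: a (1, 5, 3) tensor permuted to channels-first (1, 3, 5).
torch.manual_seed(0)  # for reproducibility of the sketch only
rand_arr_permute = torch.rand(1, 5, 3).permute(0, 2, 1)

l, k = 5, 2  # sentence length and kernel size
for i in range(l - k + 1):  # 4 kernel positions for stride 1
    window = rand_arr_permute[:, :, i:i + k]  # the (1, 3, 2) bi-gram slice under the kernel
    print(f"window {i}: shape {tuple(window.shape)}")
```

Each window covers the full embedding (all 3 channels) of exactly 2 adjacent words, which is why the kernel only needs to move along one axis.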

Can you now guess what would be the shape of our conv1d filter weight matrix?

It is (1, 3, 2), where shape[0] = 1 is the number of output channels, shape[1] = 3 is the number of input channels (the embedding size) and shape[2] = 2 is the kernel size.

Since we set the number of input channels equal to the embedding dimension, shape[1] will always equal the embedding size, which lets the filter stride over a full word (or a pair of full words) at a time.
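A quick sanity check confirms both shapes (a sketch; the random weight values themselves will differ on every run):

```python
import torch.nn as nn

# Same layer as in the article: 3 input channels, 1 output channel, kernel size 2.
conv1 = nn.Conv1d(in_channels=3, out_channels=1, kernel_size=2)
print(conv1.weight.shape)  # torch.Size([1, 3, 2]) -> (out_channels, in_channels, kernel_size)
print(conv1.bias.shape)    # torch.Size([1])       -> one bias per output channel
```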

I hope this gives you some clarity in imagining how 1d convolution takes place on text data, with the embedding acting as the channel dimension.

If you are someone who does not care about the maths behind the results, you can skip the remainder of the article.

But for those who are interested in the calculation: for the sake of simplicity, let's begin by taking only the first two words of our sentence, i.e. "Word" and "embedding".

So our new array would look like —

  Word   embedding
0.9527,  0.0332
0.9451,  0.8993
0.2209,  0.8718

Let's now look at the initial weights of the conv1 layer we defined earlier.

>> print(conv1.weight)
Parameter containing:
tensor([[[ 0.1952, -0.1954],
[ 0.3689, -0.2420],
[-0.1060, 0.0735]]], requires_grad=True)

If we were to perform the convolution calculation on the above weight and embedding values, we would do an element-wise multiplication of the weights (a 3×2 matrix) with the embeddings (also 3×2), sum everything up to a single value, and add the bias term to it. That is exactly what we are going to do in the next step.

But before that, I should mention that the bias in Conv1d has one term per output channel. Since we defined the number of output channels as 1, the bias has shape (1,), and its randomly initialized value here is 0.2665.

>> print(conv1.bias)
Parameter containing:
tensor([0.2665], requires_grad=True)
>> conv1.bias.shape
torch.Size([1])

Let's now do the convolution calculation.

>> print(torch.sum(rand_arr_permute[:,:,:2] * conv1.weight) + conv1.bias)
tensor([0.6177], grad_fn=<AddBackward0>)

which is equivalent to —

((0.9527*0.1952)+(0.0332*-0.1954)+(0.9451*0.3689)+(0.8993*-0.242)+(0.2209*-0.106)+(0.8718*0.0735)) + (0.2665) = 0.6177

After this calculation, the filter shifts 1 step to the right in the matrix and performs the same calculation on the next pair of words, i.e. "embedding are". So after one full pass we will have 4 outputs, one per pair —

  1. (Word, embedding) — 0.6177
  2. (embedding, are) — 0.3115
  3. (are, so) — 0.4002
  4. (so, cool) — 0.1670
>> print(conv1(rand_arr_permute))
tensor([[[0.6177, 0.3115, 0.4002, 0.1670]]], grad_fn=<SqueezeBackward1>)
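The count of 4 outputs follows from Conv1d's output-length formula: with no padding and dilation 1, the output length is (L_in − kernel_size) / stride + 1 = (5 − 2) / 1 + 1 = 4. A quick check with the same shapes as above (a sketch with fresh random values):

```python
import torch
import torch.nn as nn

n, d, l = 1, 3, 5  # batch size, embedding dimension, sentence length
conv1 = nn.Conv1d(d, 1, kernel_size=2)
out = conv1(torch.rand(n, d, l))  # channels-first input, as after the permute step

l_out = (l - 2) // 1 + 1  # (L_in - kernel_size) // stride + 1 = 4
print(out.shape)  # torch.Size([1, 1, 4])
```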

So your output has shape torch.Size([1, 1, 4]), where shape[0] = 1 is the batch size, shape[1] = 1 is the number of output channels, and shape[2] = 4 is the convolved (reduced) sequence length. We can now apply max pooling to it, which will reduce this dimension even further.
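For instance, a global max pool over the whole convolved sequence (a common choice in text CNNs; this step is a sketch, not code from the article) collapses the four values to a single one per channel:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv1d(3, 1, 2)
out = conv1(torch.rand(1, 3, 5))                      # shape (1, 1, 4), as above
pooled = F.max_pool1d(out, kernel_size=out.shape[2])  # pool over all 4 positions
print(pooled.shape)  # torch.Size([1, 1, 1])
```

With more output channels, this would give one pooled feature per filter, which is what typically gets fed into the final classification layer.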

I hope you found this article helpful in understanding how 1d convolution takes place in PyTorch, and in visualizing how the kernel strides through pairs of words in a sentence.

Thanks & Regards,
Sam

For feedback please email at sumanshusamarora@gmail.com
