What is your question?

My high-level goal is to use one of CUTLASS's 2D depthwise convolution kernels with PyTorch tensors. I am starting with the SIMT kernel because it works on older devices, so I am basically copying code from this example (cutlass/examples/46_depthwise_simt_conv2dfprop/depthwise_simt_conv2dfprop.cu, line 80 in affd1b6):
using ElementAccumulator = cutlass::half_t; // Data type of accumulator
My tensors have NCHW layout and use float, so I replaced that section of the example with this snippet:
// The code section below describes the data types for the input and output
// tensors and the computation between elements.
using ElementAccumulator = float; // Data type of accumulator
using ElementComputeEpilogue = float; // Data type of epilogue computation (alpha, beta)
using ElementInputA = float; // Data type of elements in input tensor
using ElementInputB = float; // Data type of elements in input tensor
using ElementOutput = float; // Data type of elements in output tensor
using LayoutInputA = cutlass::layout::TensorNCHW;
using LayoutInputB = cutlass::layout::TensorNCHW;
using LayoutOutput = cutlass::layout::TensorNCHW;
This throws an error with nvcc.

Questions:

1. How do I fix this error? Are convolutions on NCHW tensors even supported by CUTLASS?
2. Basic question: how do I convert a PyTorch tensor into a cutlass::TensorRef? I am currently using this snippet:
auto makeDeviceRef = [](const Tensor &tensor) {
    auto tensorLayout = cutlass::layout::TensorNCHW::packed(
        {tensor.stride(0), tensor.stride(1), tensor.stride(2), tensor.stride(3)});
    auto ret = cutlass::make_TensorRef(tensor.data_ptr<scalar_t>(), tensorLayout);
    return ret;
};
Does that look correct?
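For comparison, here is the variant I would write for a contiguous tensor, passing the extents rather than the strides to packed() — part of what I am unsure about is which of the two packed() actually expects. The helper name and the float specialization are mine, and I am assuming Tensor4DCoord's (n, h, w, c) ordering:

#include <cutlass/layout/tensor.h>
#include <cutlass/tensor_ref.h>
#include <torch/extension.h>

// Hypothetical variant of makeDeviceRef: build the layout from the tensor
// extents (assuming packed() expects extents, not strides) and require the
// tensor to be densely packed.
cutlass::TensorRef<float, cutlass::layout::TensorNCHW>
makeDeviceRefFromExtents(const torch::Tensor &tensor) {
  TORCH_CHECK(tensor.is_contiguous(), "expected a contiguous NCHW tensor");
  // Tensor4DCoord is ordered (n, h, w, c) regardless of the memory layout,
  // if I am reading the CUTLASS headers correctly.
  auto layout = cutlass::layout::TensorNCHW::packed(
      {static_cast<int>(tensor.size(0)),    // N
       static_cast<int>(tensor.size(2)),    // H
       static_cast<int>(tensor.size(3)),    // W
       static_cast<int>(tensor.size(1))});  // C
  return cutlass::make_TensorRef(tensor.data_ptr<float>(), layout);
}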
3. Can CUTLASS handle convolutions with filter sizes that are not known at compile time? It seems like the templates need to be instantiated with concrete numbers, so CUTLASS would only work for filter sizes known ahead of time (presumably the same runtime-dispatch workaround sketched after question 5 would be needed).
4. How can I choose which CUDA stream to run the kernel on? The sketch below shows what I have in mind.
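A minimal sketch, assuming the depthwise operator follows the same pattern as other CUTLASS device-level operators, whose initialize() and operator() accept an optional cudaStream_t (I have not verified this for the depthwise/direct-conv operator). ConvOp stands in for whatever operator type the example builds:

#include <ATen/cuda/CUDAContext.h>
#include <cutlass/cutlass.h>

// Launch a CUTLASS device-level op on PyTorch's current CUDA stream.
// ConvOp is assumed to expose the usual nested Arguments type and the
// initialize()/operator() overloads that take a cudaStream_t.
template <typename ConvOp>
cutlass::Status runOnCurrentStream(typename ConvOp::Arguments const &args) {
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  ConvOp conv_op;
  cutlass::Status status =
      conv_op.initialize(args, /*workspace=*/nullptr, stream);
  if (status != cutlass::Status::kSuccess) {
    return status;
  }
  return conv_op(stream);  // enqueue the kernel on `stream`
}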
5. Is it possible to build the kernel for all SM architectures and dynamically choose the right one at runtime, depending on the machine I am running on? It seems like that is not possible, because the architecture has to be passed to DefaultDepthwiseDirect2dConvFprop at compile time. The sketch below shows the pattern I am imagining.
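The pattern I have in mind: instantiate the kernel once per architecture (and compile for each one via -gencode), then branch on the device's compute capability at runtime. runConv<> is a hypothetical wrapper around the example's kernel, templated on the arch tag; the same dispatch would work for a fixed set of filter sizes (question 3):

#include <cuda_runtime.h>
#include <cutlass/arch/arch.h>

// Hypothetical wrapper: forwards ArchTag into the kernel instantiation
// (e.g. into DefaultDepthwiseDirect2dConvFprop).
template <typename ArchTag>
void runConv(/* tensors, problem size, ... */) {
  // ... instantiate and launch the kernel for ArchTag here ...
}

void runConvForCurrentDevice() {
  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, /*device=*/0);
  int cc = props.major * 10 + props.minor;

  // Pick the newest instantiation the device can run.
  if (cc >= 80) {
    runConv<cutlass::arch::Sm80>();
  } else if (cc >= 70) {
    runConv<cutlass::arch::Sm70>();
  } else {
    runConv<cutlass::arch::Sm60>();
  }
}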
6. How do I choose meta-parameters like the number of pipeline stages, groups per CTA, thread block output shape, etc. in an automated way? (I am afraid I may choose values that slow my kernel down.) Do I have to tune everything by hand, or is there a way to choose these values more automatically based on the GPU I am using? If they are compile-time constants, do I have to maintain a look-up table per GPU?
7. Depthwise 2D convolution specific question: what is the tensor_b_transpose variable (cutlass/examples/46_depthwise_simt_conv2dfprop/depthwise_simt_conv2dfprop.cu, line 424 in affd1b6)? In the code it does not seem to be derived from tensor_b at all. I thought it would be the transpose of tensor_b, but it does not appear to be initialized in the example. Is it just a temporary tensor that CUTLASS fills in itself? Why is it required, and why can't CUTLASS allocate it by itself?