阿新 • Published: 2019-01-28

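Functions for matrix multiplication in cuBLAS

The first thing to note is that cuBLAS stores matrices in column-major order, unlike the row-major order used by C/C++ arrays. The commented code below explains how to handle the difference.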

// SOME PRECAUTIONS:
// IF WE WANT TO CALCULATE THE ROW-MAJOR MATRIX PRODUCT C = A * B,
// WE JUST NEED TO CALL THE CUBLAS API IN REVERSE ORDER: cublasSgemm(B, A)!
//
// The reason is explained as follows:
//
// The CUBLAS library uses column-major storage, but C/C++ uses row-major storage.
// When a matrix pointer is passed to CUBLAS, the memory layout is reinterpreted
// from row-major to column-major, which is equivalent to an implicit transpose.
//
// In the case of row-major C/C++ matrices A and B and a simple matrix multiplication
// C = A * B, we can't use the input order cublasSgemm(A, B) because of the
// implicit transpose. The actual result of cublasSgemm(A, B) is A(T) * B(T).
// If col(A(T)) != row(B(T)), which is the same as row(A) != col(B), A(T) and B(T)
// cannot be multiplied. Moreover, even if A(T) and B(T) can be multiplied, the result C
// is a column-major CUBLAS matrix, which reads as C(T) in C/C++, so we would need
// extra transpose code to convert it to a row-major C/C++ matrix.
//
// To solve the problem, consider our desired result C, a row-major matrix.
// In CUBLAS format it is actually C(T) (because of the implicit transpose).
// C = A * B, so C(T) = (A * B)(T) = B(T) * A(T). The CUBLAS matrices B(T) and A(T)
// happen to be the C/C++ matrices B and A (again because of the implicit transpose)!
// We don't need extra transpose code; we only need to swap the input order.
//
// CUBLAS provides high-performance matrix multiplication.
// See also:
// V. Volkov and J. Demmel, "Benchmarking GPUs to tune dense linear algebra,"
// in Proc. 2008 ACM/IEEE Conf. on Supercomputing (SC '08),
// Piscataway, NJ: IEEE Press, 2008, pp. Art. 31:1-11.
//
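The implicit transpose described in the comment can be checked on the CPU, without cuBLAS. The following is only a minimal sketch: `rowMajorAt`, `colMajorAt` and `reinterpretationIsTranspose` are hypothetical helper names made up for this illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element (i, j) of a matrix stored row by row (C/C++ convention);
// cols is the number of columns of the stored matrix.
inline float rowMajorAt(const std::vector<float>& buf, std::size_t cols,
                        std::size_t i, std::size_t j) {
    return buf[i * cols + j];
}

// Element (i, j) of a matrix stored column by column (cuBLAS convention);
// ld is the leading dimension, i.e. the number of rows of the stored matrix.
inline float colMajorAt(const std::vector<float>& buf, std::size_t ld,
                        std::size_t i, std::size_t j) {
    return buf[j * ld + i];
}

// Returns true if reading a row-major rows x cols buffer as a column-major
// cols x rows matrix (leading dimension = cols) yields exactly the transpose.
inline bool reinterpretationIsTranspose(const std::vector<float>& buf,
                                        std::size_t rows, std::size_t cols) {
    assert(buf.size() == rows * cols);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            if (rowMajorAt(buf, cols, i, j) != colMajorAt(buf, cols, j, i))
                return false;
    return true;
}
```

Both accessors end up reading the same offset `i * cols + j`, which is why no data has to move: the transpose comes purely from how the buffer is interpreted.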

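A small example. In C++ (row-major):

A:  0 3 5      B:  1 1 1
    0 0 4          1 1 1
    1 0 0          1 1 1

We want C = A * B.

The result in C++ is:

C:  8 8 8
    4 4 4
    1 1 1

In cuBLAS the same buffers are read as column-major, so cuBLAS sees the transposed matrices:

A:  0 0 1      B:  1 1 1
    3 0 0          1 1 1
    5 4 0          1 1 1

In cuBLAS we compute C2 = B * A.

The result C2 is stored in column-major order by cuBLAS. Out of habit, it is easier to first write the result down row by row, as an ordinary matrix:

C2: 8 4 1
    8 4 1
    8 4 1

cuBLAS actually stores it column by column, so the buffer holds 8 8 8 4 4 4 1 1 1, which C++ reads row by row as:

C2: 8 8 8
    4 4 4
    1 1 1

So B * A in cuBLAS gives the same result as A * B in C++. When using cuBLAS, you only need to swap the positions of the arguments to get the desired result.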
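The example above can be reproduced on the CPU. This is a sketch under stated assumptions: `gemmColMajor` is a hypothetical stand-in that only mimics the column-major convention of cublasSgemm for the no-transpose case (with alpha = 1 and beta = 0), and `exampleProduct` is an illustrative name; neither is a cuBLAS function.

```cpp
#include <cassert>
#include <vector>

// CPU stand-in for a column-major GEMM without transposition:
// C (m x n) = A (m x k) * B (k x n), every matrix stored column by column
// with leading dimensions lda, ldb and ldc.
void gemmColMajor(int m, int n, int k,
                  const float* A, int lda,
                  const float* B, int ldb,
                  float* C, int ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p)
                sum += A[p * lda + row] * B[col * ldb + p];
            C[col * ldc + row] = sum;
        }
    }
}

// Multiplies the 3 x 3 row-major matrices from the example by passing the
// operands in reverse order (B first, then A); returns C in row-major order.
std::vector<float> exampleProduct() {
    const int n = 3;
    const std::vector<float> A = {0.0f, 3.0f, 5.0f,
                                  0.0f, 0.0f, 4.0f,
                                  1.0f, 0.0f, 0.0f};
    const std::vector<float> B(n * n, 1.0f);
    std::vector<float> C(n * n, 0.0f);
    gemmColMajor(n, n, n, B.data(), n, A.data(), n, C.data(), n);
    return C;
}
```

Calling `exampleProduct()` returns the buffer 8 8 8 4 4 4 1 1 1, which is the row-major C = A * B from the example, with no transpose step anywhere.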

cublas<t>gemm()

cublasStatus_t cublasSgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const float *alpha,
                           const float *A, int lda,
                           const float *B, int ldb,
                           const float *beta,
                           float *C, int ldc);
cublasStatus_t cublasDgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const double *alpha,
                           const double *A, int lda,
                           const double *B, int ldb,
                           const double *beta,
                           double *C, int ldc);
cublasStatus_t cublasCgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const cuComplex *alpha,
                           const cuComplex *A, int lda,
                           const cuComplex *B, int ldb,
                           const cuComplex *beta,
                           cuComplex *C, int ldc);
cublasStatus_t cublasZgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *B, int ldb,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, int ldc);

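Parameter meanings: each function computes C = alpha * op(A) * op(B) + beta * C on column-major matrices, where op(X) is X for CUBLAS_OP_N, the transpose of X for CUBLAS_OP_T and the conjugate transpose of X for CUBLAS_OP_C. op(A) is m x k, op(B) is k x n and C is m x n. lda, ldb and ldc are the leading dimensions of A, B and C, that is, the stride in elements between consecutive columns.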
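For non-square row-major matrices, choosing m, n, k and the leading dimensions is the step where mistakes are easy to make. The sketch below works the mapping out on the CPU. It assumes A is rowsA x colsA and B is colsA x colsB, both row-major; `GemmArgs`, `rowMajorGemmArgs`, `gemmColMajor` and `multiplyRowMajor` are hypothetical names for illustration, and `gemmColMajor` only imitates the column-major, no-transpose convention of cublasSgemm with alpha = 1 and beta = 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Arguments for a column-major, no-transpose GEMM call that produces the
// row-major product C = A * B when B is passed first and A second.
struct GemmArgs {
    int m, n, k;     // sizes as seen by the column-major routine
    int ldFirst;     // leading dimension of the first operand (buffer of B)
    int ldSecond;    // leading dimension of the second operand (buffer of A)
    int ldc;         // leading dimension of the result buffer
};

GemmArgs rowMajorGemmArgs(int rowsA, int colsA, int colsB) {
    assert(rowsA > 0 && colsA > 0 && colsB > 0);
    // The routine computes C(T) = B(T) * A(T): (colsB x colsA) * (colsA x rowsA).
    return GemmArgs{colsB, rowsA, colsA, colsB, colsA, colsB};
}

// CPU stand-in for a column-major GEMM: C (m x n) = A (m x k) * B (k x n).
void gemmColMajor(int m, int n, int k, const float* A, int lda,
                  const float* B, int ldb, float* C, int ldc) {
    for (int col = 0; col < n; ++col)
        for (int row = 0; row < m; ++row) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p)
                sum += A[p * lda + row] * B[col * ldb + p];
            C[col * ldc + row] = sum;
        }
}

// Row-major C (rowsA x colsB) = A * B, computed with swapped operands.
std::vector<float> multiplyRowMajor(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    int rowsA, int colsA, int colsB) {
    assert(A.size() == static_cast<std::size_t>(rowsA) * colsA);
    assert(B.size() == static_cast<std::size_t>(colsA) * colsB);
    const GemmArgs g = rowMajorGemmArgs(rowsA, colsA, colsB);
    std::vector<float> C(static_cast<std::size_t>(rowsA) * colsB, 0.0f);
    gemmColMajor(g.m, g.n, g.k, B.data(), g.ldFirst,
                 A.data(), g.ldSecond, C.data(), g.ldc);
    return C;
}
```

With the field names used later in this post, the same mapping reads m = uiWB, n = uiHA, k = uiWA, with leading dimensions uiWB for B, uiWA for A and uiWB for C.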

The following code implements a simple matrix multiplication with cublasSgemm:

Header file: matrix.h

#pragma once

// SOME PRECAUTIONS:
// IF WE WANT TO CALCULATE THE ROW-MAJOR MATRIX PRODUCT C = A * B,
// WE JUST NEED TO CALL THE CUBLAS API IN REVERSE ORDER: cublasSgemm(B, A)!
// The reason is explained as follows:
//
// The CUBLAS library uses column-major storage, but C/C++ uses row-major storage.
// When a matrix pointer is passed to CUBLAS, the memory layout is reinterpreted
// from row-major to column-major, which is equivalent to an implicit transpose.
//
// In the case of row-major C/C++ matrices A and B and a simple matrix multiplication
// C = A * B, we can't use the input order cublasSgemm(A, B) because of the
// implicit transpose. The actual result of cublasSgemm(A, B) is A(T) * B(T).
// If col(A(T)) != row(B(T)), which is the same as row(A) != col(B), A(T) and B(T)
// cannot be multiplied. Moreover, even if A(T) and B(T) can be multiplied, the result C
// is a column-major CUBLAS matrix, which reads as C(T) in C/C++, so we would need
// extra transpose code to convert it to a row-major C/C++ matrix.
//
// To solve the problem, consider our desired result C, a row-major matrix.
// In CUBLAS format it is actually C(T) (because of the implicit transpose).
// C = A * B, so C(T) = (A * B)(T) = B(T) * A(T). The CUBLAS matrices B(T) and A(T)
// happen to be the C/C++ matrices B and A (again because of the implicit transpose)!
// We don't need extra transpose code; we only need to swap the input order.
//
// CUBLAS provides high-performance matrix multiplication.
// See also:
// V. Volkov and J. Demmel, "Benchmarking GPUs to tune dense linear algebra,"
// in Proc. 2008 ACM/IEEE Conf. on Supercomputing (SC '08),
// Piscataway, NJ: IEEE Press, 2008, pp. Art. 31:1-11.
//

#include <stdio.h>
#include <stdlib.h>

// CUDA runtime and cuBLAS
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Libraries to link against (these pragmas are MSVC-specific)
#pragma comment (lib,"cudart")
#pragma comment (lib,"cublas")

// This macro maps the (i, j) index of a matrix to its column-major offset,
// which makes it easy to convert the familiar row-major data to column-major data.
//#define  IDX2C(i,j,leading) (((j)*(leading))+(i))

typedef struct _matrixSize      // widths (W) and heights (H) of matrices A, B and C
{
    unsigned int uiWA, uiHA, uiWB, uiHB, uiWC, uiHC;
} sMatrixSize;

// Computes the row-major product h_C = h_A * h_B with cuBLAS.
cudaError_t matrixMultiply(float *h_C, const float *h_A, const float *h_B, int devID, sMatrixSize &matrix_size);

Source file: matrix.cpp

#include "matrix.h"

cudaError_t matrixMultiply(float *h_C, const float *h_A, const float *h_B, int devID, sMatrixSize &matrix_size){
    float *dev_A = NULL;
    float *dev_B = NULL;
    float *dev_C = NULL;
    float *h_CUBLAS = NULL;

    // Declared up front so that the cleanup code under Error can release them.
    cublasHandle_t handle = NULL;
    cudaEvent_t start = NULL;
    cudaEvent_t stop = NULL;

    cudaDeviceProp devicePro;
    cudaError_t cudaStatus;

    // Sizes are computed before the first goto, because C++ does not allow
    // a jump to Error to skip the initialization of a variable in scope there.
    size_t size_A = (size_t)matrix_size.uiWA * matrix_size.uiHA;
    size_t mem_size_A = sizeof(float) * size_A;

    size_t size_B = (size_t)matrix_size.uiWB * matrix_size.uiHB;
    size_t mem_size_B = sizeof(float) * size_B;

    size_t size_C = (size_t)matrix_size.uiWC * matrix_size.uiHC;
    size_t mem_size_C = sizeof(float) * size_C;

    cudaStatus = cudaGetDeviceProperties(&devicePro, devID);
    if(cudaStatus != cudaSuccess){
        fprintf(stderr, "cudaGetDeviceProperties returned error code %d, line(%d)\n", cudaStatus, __LINE__);
        goto Error;
    }

    // allocate device memory for matrices dev_A, dev_B and dev_C
    cudaStatus = cudaMalloc((void**)&dev_A, mem_size_A);
    if(cudaStatus != cudaSuccess){
        fprintf(stderr, "cudaMalloc dev_A returned error code %d, line(%d)\n", cudaStatus, __LINE__);
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_B, mem_size_B);
    if(cudaStatus != cudaSuccess){
        fprintf(stderr, "cudaMalloc dev_B returned error code %d, line(%d)\n", cudaStatus, __LINE__);
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_C, mem_size_C);
    if(cudaStatus != cudaSuccess){
        fprintf(stderr, "cudaMalloc dev_C returned error code %d, line(%d)\n", cudaStatus, __LINE__);
        goto Error;
    }

    // allocate host memory for the result matrix h_CUBLAS
    h_CUBLAS = (float*)malloc(mem_size_C);
    if(h_CUBLAS == NULL && size_C > 0){
        fprintf(stderr, "malloc h_CUBLAS error, line(%d)\n", __LINE__);
        cudaStatus = cudaErrorMemoryAllocation;
        goto Error;
    }

    // copy the host inputs h_A and h_B to the device buffers dev_A and dev_B
    cudaStatus = cudaMemcpy(dev_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
    if(cudaStatus != cudaSuccess){
        fprintf(stderr, "cudaMemcpy h_A to dev_A returned error code %d, line(%d)\n", cudaStatus, __LINE__);
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_B, h_B, mem_size_B, cudaMemcpyHostToDevice);
    if(cudaStatus != cudaSuccess){
        fprintf(stderr, "cudaMemcpy h_B to dev_B returned error code %d, line(%d)\n", cudaStatus, __LINE__);
        goto Error;
    }

    // CUBLAS version 2.0
    {
        cublasStatus_t ret;

        ret = cublasCreate(&handle);
        if(ret != CUBLAS_STATUS_SUCCESS){
            fprintf(stderr, "cublasCreate returned error code %d, line(%d)\n", ret, __LINE__);
            cudaStatus = cudaErrorUnknown;   // cuBLAS errors have no cudaError_t equivalent
            goto Error;
        }

        cudaStatus = cudaEventCreate(&start);
        if(cudaStatus != cudaSuccess){
            fprintf(stderr, "Failed to create start event (error code %s)!\n", cudaGetErrorString(cudaStatus));
            goto Error;
        }

        cudaStatus = cudaEventCreate(&stop);
        if(cudaStatus != cudaSuccess){
            fprintf(stderr, "Failed to create stop event (error code %s)!\n", cudaGetErrorString(cudaStatus));
            goto Error;
        }

        // record the start event
        cudaStatus = cudaEventRecord(start, NULL);
        if(cudaStatus != cudaSuccess){
            fprintf(stderr, "Failed to record start event (error code %s)!\n", cudaGetErrorString(cudaStatus));
            goto Error;
        }

        // Row-major C = A * B. Because cuBLAS is column-major, the operands are
        // passed in the order B, A (see matrix.h for the reason):
        // m = width of B, n = height of A, k = width of A (= height of B),
        // and each leading dimension is the row-major width of its matrix.
        float alpha = 1.0f;
        float beta = 0.0f;
        ret = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA,
            &alpha, dev_B, matrix_size.uiWB, dev_A, matrix_size.uiWA,
            &beta, dev_C, matrix_size.uiWB);
        if(ret != CUBLAS_STATUS_SUCCESS){
            fprintf(stderr, "cublasSgemm returned error code %d, line(%d)\n", ret, __LINE__);
            cudaStatus = cudaErrorUnknown;
            goto Error;
        }

        // record the stop event
        cudaStatus = cudaEventRecord(stop, NULL);
        if(cudaStatus != cudaSuccess){
            fprintf(stderr, "Failed to record stop event (error code %s)!\n", cudaGetErrorString(cudaStatus));
            goto Error;
        }

        // wait for the stop event to complete
        cudaStatus = cudaEventSynchronize(stop);
        if(cudaStatus != cudaSuccess){
            fprintf(stderr, "Failed to synchronize on the stop event (error code %s)!\n", cudaGetErrorString(cudaStatus));
            goto Error;
        }

        // cudaEventElapsedTime reports milliseconds
        float msecTotal = 0.0f;
        cudaStatus = cudaEventElapsedTime(&msecTotal, start, stop);
        if(cudaStatus != cudaSuccess){
            fprintf(stderr, "Failed to get time elapsed between events (error code %s)!\n", cudaGetErrorString(cudaStatus));
            goto Error;
        }
        printf("cublasSgemm done in %f ms.\n", msecTotal);

        // copy the result from device to host
        cudaStatus = cudaMemcpy(h_CUBLAS, dev_C, mem_size_C, cudaMemcpyDeviceToHost);
        if(cudaStatus != cudaSuccess){
            fprintf(stderr, "cudaMemcpy dev_C to h_CUBLAS returned error code %d, line(%d)!\n", cudaStatus, __LINE__);
            goto Error;
        }
    }

    // Thanks to the swapped operand order the result is already row-major,
    // so it is copied to h_C and printed row by row without any transpose.
    for(unsigned int i = 0; i < matrix_size.uiHC; i++){
        for(unsigned int j = 0; j < matrix_size.uiWC; j++){
            unsigned int at = i*matrix_size.uiWC + j;
            h_C[at] = h_CUBLAS[at];
            printf("%f    ", h_CUBLAS[at]);
        }
        printf("\n");
    }

Error:
    if(start != NULL) cudaEventDestroy(start);
    if(stop != NULL) cudaEventDestroy(stop);
    if(handle != NULL) cublasDestroy(handle);
    cudaFree(dev_A);
    cudaFree(dev_B);
    cudaFree(dev_C);
    free(h_CUBLAS);
    return cudaStatus;
}

cudaError_t reduceEdge(){
    cudaError_t cudaStatus = cudaSuccess;
    return cudaStatus;
}
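One way to check the matrix that matrixMultiply writes to h_C is to compare it with a naive CPU product. The sketch below is only an illustration: `cpuMatMulRowMajor` and `nearlyEqual` are hypothetical helpers, and the relative tolerance of 1e-5 is an assumed value that may need loosening for large matrices.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive row-major reference product:
// C (rowsA x colsB) = A (rowsA x colsA) * B (colsA x colsB).
std::vector<float> cpuMatMulRowMajor(const std::vector<float>& A,
                                     const std::vector<float>& B,
                                     std::size_t rowsA, std::size_t colsA,
                                     std::size_t colsB) {
    assert(A.size() == rowsA * colsA);
    assert(B.size() == colsA * colsB);
    std::vector<float> C(rowsA * colsB, 0.0f);
    for (std::size_t i = 0; i < rowsA; ++i)
        for (std::size_t p = 0; p < colsA; ++p)
            for (std::size_t j = 0; j < colsB; ++j)
                C[i * colsB + j] += A[i * colsA + p] * B[p * colsB + j];
    return C;
}

// True if every element of actual is within tol * max(1, |expected|) of the
// reference. actual must point to at least expected.size() floats.
bool nearlyEqual(const float* actual, const std::vector<float>& expected,
                 float tol = 1e-5f) {
    for (std::size_t i = 0; i < expected.size(); ++i) {
        const float scale = std::max(1.0f, std::fabs(expected[i]));
        if (std::fabs(actual[i] - expected[i]) > tol * scale)
            return false;
    }
    return true;
}
```

After a successful call, `nearlyEqual(h_C, cpuMatMulRowMajor(A, B, uiHA, uiWA, uiWB))` should hold, where A and B are vectors holding the same data as h_A and h_B. If the operands were passed to cublasSgemm in the wrong order, this comparison would be expected to fail for most inputs.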

Category: cuda
