Benchmarking Different Parallel Sum Reduction Algorithms in CUDA over Varied GPU Architectures

Munesh Singh Chauhan
Information Technology
College of Applied Sciences
Ibri, Sultanate of Oman
[email protected]

Abstract— SIMD architecture has become the main architecture for fine-grained parallelism since the introduction of GPUs (Graphics Processing Units). Many serial applications have now become the focus of GPU computing in order to enhance performance and attain higher throughput. A current initiative focuses on parallel algorithms that can easily be coupled with GPUs to shorten the software development cycle in terms of both time and cost. CUDA (Compute Unified Device Architecture) is a parallel computing framework from NVIDIA Corporation that is used to program NVIDIA-manufactured GPUs. Parallel algorithms and techniques such as stencil operations, prefix sum reduction, sort, and scan provide optimal to near-optimal performance enhancements. Sum reduction using CUDA is analyzed, and different versions are surveyed and benchmarked. The benchmarking process uses bandwidth and execution time as metrics to differentiate between the versions of the reduction algorithm.

Keywords—Graphics Processing Unit; Compute Unified Device Architecture; Parallel Sum Reduction

I. INTRODUCTION

GPUs provide phenomenal computing power at commodity prices. This was not true a decade ago, when supercomputers provided the bulk of High Performance Computing (HPC) at exorbitant prices. With the introduction of GPU computing to the masses, HPC is now within the reach of the common programmer [1]. In order to parallelize legacy applications, a thorough understanding of the conventional parallel algorithms [2], along with their related optimizations, is essential. Parallel algorithms give a developer essential and fast parallelization options, thus saving time and work.
In most circumstances, these collections of parallel algorithms, often grouped in libraries such as “Thrust” [3] and “CUBLAS” [4], provide peer-reviewed logical and numerical solutions that have been vetted through a strict process over a period of time. Using them saves critical development time and avoids re-inventing the wheel. There exist many categories of parallel algorithms; here, the important and widely used parallel sum reduction algorithm is implemented and benchmarked on different GPU architectures [5].

II. PARALLEL REDUCTION ANALYSIS

Parallel reduction [6] plays a key role in summing up vectors, arrays and multi-dimensional matrices. Reduction is widely used in various branches of engineering and the sciences. As a result, an efficient and fast reduction that can be processed in parallel can become a key factor in attaining optimum performance gain in diverse applications. Various parallel reductions using interleaved addressing have been proposed [7], but they lack benchmarking across different GPU architectures. The GPU architecture plays a crucial role in performance enhancement in the following respects:

1. Number of multiprocessors
2. Number of threads per block
3. Number of warps that can be executed concurrently
4. Dynamic memory allocation

Hence, it becomes imperative to evaluate each architecture on the basis of the different reduction algorithm categories.

A GPU code can be termed memory-bound or compute-bound, and profiling results can be tailored on the basis of this distinction. A compute-bound GPU kernel spends most of its time in computation and processing; as a result, the GFLOPS (Giga Floating Point Operations per Second) metric is often used to measure its performance. A memory-bound kernel, in contrast, spends most of its time in memory operations; hence the GB/s (Gigabytes per Second) metric is used.
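As a rough illustration of this distinction (an estimate added here, not a measurement from the paper), consider summing N 4-byte integers. The kernel performs N - 1 additions while reading 4N bytes from global memory, so its arithmetic intensity, written I here purely for this sketch, is about a quarter of an operation per byte:

```latex
I = \frac{\text{operations}}{\text{bytes read}} = \frac{N - 1}{4N} \approx 0.25 \ \text{operations per byte}
```

With so little arithmetic per byte, memory traffic rather than computation can be expected to dominate the running time of such a kernel.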
In the present case, reductions have very low arithmetic intensity; thus GB/s is used as the relevant metric for comparing the various reduction algorithms.

III. REDUCTION 1: INTERLEAVED ADDRESSING

The Reduction1 algorithm [7] uses shared memory to perform the array sum with negligible latency, as shared memory runs at almost the same speed as the L1 cache. The threads of a block copy the contents of the array to be reduced into a shared memory array whose size equals the thread block size (number of threads per block). The threads of the block wait for the entire copy operation to finish by calling __syncthreads(). After the copy, the array elements in shared memory are reduced, with the final result stored in the first element of the shared memory array. This result represents the reduced sum of a block. Since multiple blocks are scheduled, all the block sums are accumulated. The final reduction of the block sums is not explained, as the emphasis is on the cost of reduction per block. The last steps are common to all categories of reduction algorithms.

    __global__ void kernel1(const int *in, int *out)
    {
        // shared memory buffer; its size (blockDim.x ints) is set at launch
        extern __shared__ int sdata[];

        // each thread loads one element from global to shared memory
        unsigned int tid = threadIdx.x;
        unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
        sdata[tid] = in[i];
        __syncthreads();

        // do reduction in shared memory (interleaved addressing)
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            // only threads whose id is a multiple of 2*s do work
            if (tid % (2 * s) == 0) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // write result for this block to global memory
        if (tid == 0) out[blockIdx.x] = sdata[0];
    }

Fig 1: Reduction 1: Interleaved Addressing

Disadvantage of Reduction1 algorithm

The threads become highly divergent on successive strides. This results in idle threads, so the GPU is not fully utilized.

IV. REDUCTION 2: THREAD DIVERGENCE

The Reduction2 algorithm tries to remove the divergence of the threads. Divergence of threads in a block leaves the majority of the threads idle and wastes precious GPU computing resources.

    __global__ void redux2(const int *in, int *out)
    {
        extern __shared__ int sdata[];

        // each thread loads one element from global to shared memory
        unsigned int tid = threadIdx.x;
        unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
        sdata[tid] = in[i];
        __syncthreads();

        // do reduction in shared memory (strided index, no divergent branch)
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            unsigned int index = 2 * s * tid;
            if (index < blockDim.x) {
                sdata[index] += sdata[index + s];
            }
            __syncthreads();
        }

        // write result for this block to global memory
        if (tid == 0) out[blockIdx.x] = sdata[0];
    }

Fig 2: Reduction 2: Thread Divergence

Disadvantages of Reduction2 algorithm

The Reduction2 algorithm uses a strided index and successfully removes the problem of thread divergence. However, it suffers from an additional problem related to shared memory bank conflicts, as two different threads try to access the same memory bank.

V. REDUCTION 3: SHARED MEMORY BANK CONFLICT

The Reduction3 algorithm solves the shared memory bank conflict issue by using sequential addressing, in which each thread accesses contiguous locations in shared memory. The major change in Reduction3 as compared to Reduction2 is the use of a reverse loop and thread-based indexing.

    __global__ void redux3(const int *in, int *out)
    {
        extern __shared__ int sdata[];

        // each thread loads one element from global to shared memory
        unsigned int tid = threadIdx.x;
        unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
        sdata[tid] = in[i];
        __syncthreads();

        // do reduction in shared memory (reverse loop, sequential addressing)
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // write result for this block to global memory
        if (tid == 0) out[blockIdx.x] = sdata[0];
    }

Fig 3: Reduction 3: Shared memory Bank Conflict

Disadvantage of Reduction3 algorithm

The major disadvantage of the algorithm is that half of the threads remain idle in every iteration. In fact, the first iteration already starts with only half of the threads doing work.

VI. REDUCTION 4: LOAD WHILE INITIALIZING SHARED MEMORY

In this algorithm the number of blocks is halved, as a single per-thread “add” is done during the first load into shared memory. This tends to increase performance further, as an addition is embedded in the shared memory load.
    __global__ void redux4(const int *in, int *out)
    {
        extern __shared__ int sdata[];

        // each thread loads two elements from global memory and stores
        // their sum in shared memory; each block therefore covers
        // 2 * blockDim.x input elements and half as many blocks are launched
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
        sdata[tid] = in[i] + in[i + blockDim.x];
        __syncthreads();

        // do reduction in shared memory
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // write result for this block to global memory
        if (tid == 0) out[blockIdx.x] = sdata[0];
    }

Fig 4: Reduction 4: Load while initializing shared memory

Advantages of Reduction4 algorithm

Folding the first-level “add” operation into the store to shared memory amortizes the cost of the add operations. This provides avenues for further speed-up.

Disadvantage of Reduction4 algorithm

The number of blocks is halved, which can result in less parallelism, especially when the input array to be reduced is small.

VII. REDUCTION 5: LOOP UNROLLING

Reduction 5 targets instructions that cause unnecessary overhead. These ancillary instructions are not related to loads, stores or arithmetic; they are used for loop iteration and addressing, and thus cause extra overhead. This overhead can be mitigated if part of the loop is unrolled. When s <= 32 (the number of active threads left in the block is at most 32), only the last warp is still working and it need not be synchronized, which induces additional parallelism in the reduction.

    __global__ void redux5(const int *in, int *out)
    {
        extern __shared__ int sdata[];

        // each thread loads two elements from global memory and stores
        // their sum in shared memory
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
        sdata[tid] = in[i] + in[i + blockDim.x];
        __syncthreads();

        // do reduction in shared memory until a single warp is left
        for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
            if (tid < s) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // unrolled last warp: requires blockDim.x >= 64 and relies on the
        // 32 threads of a warp executing in lockstep; the volatile pointer
        // keeps the compiler from caching shared memory values in registers
        if (tid < 32) {
            volatile int *v = sdata;
            v[tid] += v[tid + 32];
            v[tid] += v[tid + 16];
            v[tid] += v[tid + 8];
            v[tid] += v[tid + 4];
            v[tid] += v[tid + 2];
            v[tid] += v[tid + 1];
        }

        // write result for this block to global memory
        if (tid == 0) out[blockIdx.x] = sdata[0];
    }

Fig 5: Reduction 5: Loop Unrolling

Advantage of Reduction5 algorithm

Loop unrolling exposes additional parallelism.

Disadvantage of Reduction5 algorithm

The halving of the number of blocks may slow the program down, especially when the input array is small. This factor is similar to the one described for the Reduction4 algorithm.

VIII. PROFILING RESULTS

As pointed out earlier, a common myth is that one should run as many threads as possible on a multiprocessor. This notion [8] is shown to be flawed in this work. According to Little's law:

$$\text{Parallelism (number of operations per multiprocessor)} = \text{Latency} \times \text{Throughput} \tag{1}$$

The law relates the required parallelism directly to the number of cores per multiprocessor. A throughput of 100% is not achievable when too little parallelism is available (a smaller number of cores per multiprocessor).

Effective bandwidth is used as the performance metric for parallel sum reduction, as the algorithm is heavily memory-bound. The effective bandwidth is calculated as follows:

$$B_{\mathrm{eff}} = \frac{(B_r + B_w) / 10^9}{T} \tag{2}$$

where
$B_{\mathrm{eff}}$ = effective bandwidth in GB/s
$B_r$ = number of bytes read
$B_w$ = number of bytes written
$T$ = execution time in seconds

The profiling results are benchmarked over three different GPUs (two with the Fermi architecture and one with the Kepler architecture). Each of the five reduction algorithms is compared across different block sizes (number of threads per block) and the corresponding bandwidth (in gigabytes per second). The profiling results are displayed in Appendix A (Tables I, II and III) for each GPU device.
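As a worked instance of Eq. (2), take the Reduction2 entry for block size 8 in Table I (NVS 5400M). The kernel reads 2^22 4-byte integers, writes one 4-byte block sum per block of 8 threads, and runs in 4.186080 ms. The arithmetic below only re-derives the tabulated value; it assumes that T is this kernel time converted to seconds.

```latex
\begin{aligned}
B_r &= 2^{22} \times 4 = 16\,777\,216 \ \text{bytes} \\
B_w &= \frac{2^{22} \times 4}{8} = 2\,097\,152 \ \text{bytes} \\
T &= 4.186080 \ \text{ms} = 0.004186080 \ \text{s} \\
B_{\mathrm{eff}} &= \frac{(16\,777\,216 + 2\,097\,152) / 10^{9}}{0.004186080} \approx 4.5088 \ \text{GB/s}
\end{aligned}
```

This agrees with the 4.508841 GB/s listed for that row in Table I.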
Each of the five reduction algorithms is implemented, and the best bandwidths are plotted against block size in Figs. 6, 7 and 8.

Fig 6: Best Bandwidth vs. Block Sizes (NVS 5400M)

Issues in CUDA Device Synchronization while Profiling Results

1) Host-device synchronization must be considered before aggregating profiling results. A CUDA kernel launch from the CPU host is asynchronous: program control returns to the CPU immediately after the kernel is launched, and the CPU does not wait for the launched kernel to complete. In contrast, the CUDA memory copy function (cudaMemcpy) is synchronous and waits for all previously issued CUDA work to finish before it starts.
2) If CPU timers are used to measure performance, then cudaDeviceSynchronize() must be called after each kernel launch. Its use is discouraged, as it stalls the GPU pipeline and thus adversely affects performance.
3) A better option for profiling is to use CUDA event timers. cudaEventSynchronize() blocks the CPU until the event has been recorded. This routine is much lighter than cudaDeviceSynchronize().
4) The most accurate CUDA timings can be obtained using the Nsight Visual Profiler. This tool is not explored in this paper.
5) The maximum number of threads per block is constrained and differs between GPU architectures. If more threads are allocated per block than the limit, multiple blocks are queued in the SMs (Streaming Multiprocessors). This may result in further latency-related delays.

Fig 7: Best Bandwidth vs. Block Sizes (Tesla C2050)

IX. CONCLUSION

The best overall bandwidth is achieved on the NVS 5400M (4.508841 GB/s) with block size 8 and the Reduction2 algorithm. As discussed before, this shows that keeping a large number of small blocks in flight can actually benefit memory bandwidth, often leading to peak values.
This is in contrast to a common belief among developers of parallel code that stuffing blocks with a large number of threads provides a performance gain, which is not the case, as shown. Creating many blocks is the key. A Streaming Multiprocessor (SM) can run many blocks together at the hardware level. Inside each block the basic unit of execution is the warp (a group of 32 threads), whose threads in most cases run concurrently. As a result, if many blocks are running inside an SM, many warps can also be expected to execute simultaneously, thus substantially increasing the performance gains.

Fig 8: Best Bandwidth vs. Block Sizes (GT 740M)

ACKNOWLEDGMENT

This work has been sponsored by TRC, Sultanate of Oman, and is part of a project that investigates the use of high-performance parallel GPUs for solving compute-intensive and complex applications related to weather modeling and fractal image compression.

REFERENCES

[1] J. Owens, M. Houston, D. Luebke, S. Green, J. Stone and J. Phillips, 'GPU Computing', Proc. IEEE, vol. 96, no. 5, pp. 879-899, 2008.
[2] S. Sengupta, M. Harris and M. Garland, 'Efficient parallel scan algorithms for GPUs', NVIDIA, Santa Clara, CA, Tech. Rep. NVR2008-003, no. 1, pp. 1-17, 2008.
[3] 2015. [Online]. Available: http://thrust.googlecode.com. [Accessed: 10-Apr-2015].
[4] NVIDIA CUDA, 'CUBLAS library', NVIDIA Corporation, Santa Clara, California, 15, 2008.
[5] J. Nickolls, I. Buck, M. Garland and K. Skadron, 'Scalable parallel programming with CUDA', Queue, vol. 6, no. 2, p. 40, 2008.
[6] Y. Zhang and J. D. Owens, 'A quantitative performance analysis model for GPU architectures', in High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pp. 382-393, IEEE, February 2011.
[7] M. Harris, 'Optimizing Parallel Reduction in CUDA', NVIDIA Developer Technology.
[8] V. Volkov, 'Better Performance at Lower Occupancy', UC Berkeley, 22 September 2010.
[9] M. Harris, 'Implementing Performance Metrics in CUDA C/C++', NVIDIA Developer Zone.

Appendix A

In Tables I to III, Time is the execution time in milliseconds, Br is the number of bytes read per kernel, Bw is the number of bytes written per kernel, Beff is the effective bandwidth in GB/s, and Best Beff is the best effective bandwidth obtained for the algorithm.

TABLE I. PROFILE DATA FOR REDUCTION ALGORITHMS (FERMI ARCHITECTURE: NVIDIA NVS5400M)

Algorithm   Block Size  Time (ms)   Br (bytes)  Bw (bytes)         Beff (GB/s)  Best Beff (GB/s)
Reduction1  8           4.227936    2^22*4      2^22*4/BLOCK_SIZE  4.464204     4.464204
Reduction1  32          4.203616    2^22*4      2^22*4/BLOCK_SIZE  4.115862
Reduction1  128         36.506687   2^22*4      2^22*4/BLOCK_SIZE  0.463156
Reduction1  512         43.082817   2^22*4      2^22*4/BLOCK_SIZE  0.390178
Reduction1  1024        51.052639   2^22*4      2^22*4/BLOCK_SIZE  0.328947
Reduction1  2048        4.203040    2^22*4      2^22*4/BLOCK_SIZE  3.993635
Reduction1  4096        4.211136    2^22*4      2^22*4/BLOCK_SIZE  3.984985
Reduction1  8192        4.210816    2^22*4      2^22*4/BLOCK_SIZE  3.984801
Reduction2  8           4.186080    2^22*4      2^22*4/BLOCK_SIZE  4.508841     4.508841
Reduction2  32          4.547904    2^22*4      2^22*4/BLOCK_SIZE  3.804281
Reduction2  128         24.314495   2^22*4      2^22*4/BLOCK_SIZE  0.695400
Reduction2  512         28.322912   2^22*4      2^22*4/BLOCK_SIZE  0.593512
Reduction2  1024        43.103970   2^22*4      2^22*4/BLOCK_SIZE  0.389607
Reduction2  2048        4.488896    2^22*4      2^22*4/BLOCK_SIZE  3.739318
Reduction2  4096        4.399776    2^22*4      2^22*4/BLOCK_SIZE  3.814129
Reduction2  8192        4.230272    2^22*4      2^22*4/BLOCK_SIZE  3.966474
Reduction3  8           4.208448    2^22*4      2^22*4/BLOCK_SIZE  4.484876     4.484876
Reduction3  32          4.484288    2^22*4      2^22*4/BLOCK_SIZE  3.858250
Reduction3  128         23.401920   2^22*4      2^22*4/BLOCK_SIZE  0.722517
Reduction3  512         26.102688   2^22*4      2^22*4/BLOCK_SIZE  0.643994
Reduction3  1024        32.283329   2^22*4      2^22*4/BLOCK_SIZE  0.520194
Reduction3  2048        4.362560    2^22*4      2^22*4/BLOCK_SIZE  3.847605
Reduction3  4096        4.214208    2^22*4      2^22*4/BLOCK_SIZE  3.982080
Reduction3  8192        4.237696    2^22*4      2^22*4/BLOCK_SIZE  3.959525
Reduction4  8           4.177152    2^22*4      2^22*4/BLOCK_SIZE  4.267451     4.267451
Reduction4  32          4.350880    2^22*4      2^22*4/BLOCK_SIZE  3.916302
Reduction4  128         14.031360   2^22*4      2^22*4/BLOCK_SIZE  1.200365
Reduction4  512         15.659936   2^22*4      2^22*4/BLOCK_SIZE  1.072393
Reduction4  1024        18.828575   2^22*4      2^22*4/BLOCK_SIZE  0.891486
Reduction4  2048        4.372736    2^22*4      2^22*4/BLOCK_SIZE  3.837714
Reduction4  4096        4.209664    2^22*4      2^22*4/BLOCK_SIZE  3.985892
Reduction4  8192        4.215648    2^22*4      2^22*4/BLOCK_SIZE  3.979990
Reduction5  8           4.412512    2^22*4      2^22*4/BLOCK_SIZE  4.039829     4.039829
Reduction5  32          4.559392    2^22*4      2^22*4/BLOCK_SIZE  3.737200
Reduction5  128         14.301600   2^22*4      2^22*4/BLOCK_SIZE  1.177683
Reduction5  512         15.412768   2^22*4      2^22*4/BLOCK_SIZE  1.089590
Reduction5  1024        19.115744   2^22*4      2^22*4/BLOCK_SIZE  0.878093
Reduction5  2048        4.426848    2^22*4      2^22*4/BLOCK_SIZE  3.790804
Reduction5  4096        4.650528    2^22*4      2^22*4/BLOCK_SIZE  3.608034
Reduction5  8192        4.350976    2^22*4      2^22*4/BLOCK_SIZE  3.856201

TABLE II. PROFILE DATA FOR REDUCTION ALGORITHMS (FERMI ARCHITECTURE: NVIDIA TESLA C2050)

Algorithm   Block Size  Time (ms)   Br (bytes)  Bw (bytes)         Beff (GB/s)  Best Beff (GB/s)
Reduction1  8           6.451168    2^22*4      2^22*4/BLOCK_SIZE  2.925729     2.925729
Reduction1  32          6.475936    2^22*4      2^22*4/BLOCK_SIZE  2.671661
Reduction1  128         11.808000   2^22*4      2^22*4/BLOCK_SIZE  1.431935
Reduction1  512         12.770624   2^22*4      2^22*4/BLOCK_SIZE  1.316301
Reduction1  1024        14.195264   2^22*4      2^22*4/BLOCK_SIZE  1.183042
Reduction1  2048        6.529248    2^22*4      2^22*4/BLOCK_SIZE  2.570803
Reduction1  4096        6.762432    2^22*4      2^22*4/BLOCK_SIZE  2.481550
Reduction1  8192        6.836160    2^22*4      2^22*4/BLOCK_SIZE  2.454487
Reduction2  8           7.271616    2^22*4      2^22*4/BLOCK_SIZE  2.595622     2.608565
Reduction2  32          6.632576    2^22*4      2^22*4/BLOCK_SIZE  2.608565
Reduction2  128         9.582592    2^22*4      2^22*4/BLOCK_SIZE  1.764480
Reduction2  512         9.883584    2^22*4      2^22*4/BLOCK_SIZE  1.700798
Reduction2  1024        12.133664   2^22*4      2^22*4/BLOCK_SIZE  1.384050
Reduction2  2048        6.539328    2^22*4      2^22*4/BLOCK_SIZE  2.566840
Reduction2  4096        6.713440    2^22*4      2^22*4/BLOCK_SIZE  2.499659
Reduction2  8192        6.685536    2^22*4      2^22*4/BLOCK_SIZE  2.509786
Reduction3  8           7.137216    2^22*4      2^22*4/BLOCK_SIZE  2.644500     2.656499
Reduction3  32          6.512896    2^22*4      2^22*4/BLOCK_SIZE  2.656499
Reduction3  128         10.225344   2^22*4      2^22*4/BLOCK_SIZE  1.653567
Reduction3  512         10.293088   2^22*4      2^22*4/BLOCK_SIZE  1.633133
Reduction3  1024        10.870112   2^22*4      2^22*4/BLOCK_SIZE  1.544933
Reduction3  2048        6.699008    2^22*4      2^22*4/BLOCK_SIZE  2.505656
Reduction3  4096        6.645376    2^22*4      2^22*4/BLOCK_SIZE  2.525261
Reduction3  8192        6.653024    2^22*4      2^22*4/BLOCK_SIZE  2.522051
Reduction4  8           7.227424    2^22*4      2^22*4/BLOCK_SIZE  2.466410     2.569996
Reduction4  32          6.630112    2^22*4      2^22*4/BLOCK_SIZE  2.569996
Reduction4  128         8.590912    2^22*4      2^22*4/BLOCK_SIZE  1.960531
Reduction4  512         8.638432    2^22*4      2^22*4/BLOCK_SIZE  1.944057
Reduction4  1024        8.860736    2^22*4      2^22*4/BLOCK_SIZE  1.894358
Reduction4  2048        7.021984    2^22*4      2^22*4/BLOCK_SIZE  2.389825
Reduction4  4096        6.645088    2^22*4      2^22*4/BLOCK_SIZE  2.525063
Reduction4  8192        7.297760    2^22*4      2^22*4/BLOCK_SIZE  2.299095
Reduction5  8           7.006688    2^22*4      2^22*4/BLOCK_SIZE  2.544111     2.544111
Reduction5  32          6.622656    2^22*4      2^22*4/BLOCK_SIZE  2.572889
Reduction5  128         8.507424    2^22*4      2^22*4/BLOCK_SIZE  1.979771
Reduction5  512         8.143008    2^22*4      2^22*4/BLOCK_SIZE  2.062334
Reduction5  1024        8.737728    2^22*4      2^22*4/BLOCK_SIZE  1.921027
Reduction5  2048        7.122016    2^22*4      2^22*4/BLOCK_SIZE  2.356259
Reduction5  4096        6.628576    2^22*4      2^22*4/BLOCK_SIZE  2.531353
Reduction5  8192        7.257088    2^22*4      2^22*4/BLOCK_SIZE  2.311980

TABLE III. PROFILE DATA FOR REDUCTION ALGORITHMS (KEPLER ARCHITECTURE: NVIDIA GT 740M)

Algorithm   Block Size  Time (ms)   Br (bytes)  Bw (bytes)         Beff (GB/s)  Best Beff (GB/s)
Reduction1  8           5.478656    2^22*4      2^22*4/BLOCK_SIZE  3.445073     3.999024
Reduction1  32          4.903104    2^22*4      2^22*4/BLOCK_SIZE  3.528684
Reduction1  128         13.635552   2^22*4      2^22*4/BLOCK_SIZE  1.240015
Reduction1  512         17.617887   2^22*4      2^22*4/BLOCK_SIZE  0.954143
Reduction1  1024        18.840544   2^22*4      2^22*4/BLOCK_SIZE  0.891354
Reduction1  2048        6.134880    2^22*4      2^22*4/BLOCK_SIZE  2.736061
Reduction1  4096        4.196352    2^22*4      2^22*4/BLOCK_SIZE  3.999024
Reduction1  8192        4.441664    2^22*4      2^22*4/BLOCK_SIZE  3.777698
Reduction2  8           5.288512    2^22*4      2^22*4/BLOCK_SIZE  3.568937     3.840409
Reduction2  32          4.505120    2^22*4      2^22*4/BLOCK_SIZE  3.840409
Reduction2  128         11.047904   2^22*4      2^22*4/BLOCK_SIZE  1.530452
Reduction2  512         14.091296   2^22*4      2^22*4/BLOCK_SIZE  1.192934
Reduction2  1024        16.617439   2^22*4      2^22*4/BLOCK_SIZE  1.010601
Reduction2  2048        6.115392    2^22*4      2^22*4/BLOCK_SIZE  2.744780
Reduction2  4096        4.476576    2^22*4      2^22*4/BLOCK_SIZE  3.748694
Reduction2  8192        4.418048    2^22*4      2^22*4/BLOCK_SIZE  3.797891
Reduction3  8           5.651808    2^22*4      2^22*4/BLOCK_SIZE  3.339528     3.853932
Reduction3  32          4.489312    2^22*4      2^22*4/BLOCK_SIZE  3.853932
Reduction3  128         10.668032   2^22*4      2^22*4/BLOCK_SIZE  1.584949
Reduction3  512         12.560352   2^22*4      2^22*4/BLOCK_SIZE  1.338337
Reduction3  1024        13.891552   2^22*4      2^22*4/BLOCK_SIZE  1.208907
Reduction3  2048        6.007520    2^22*4      2^22*4/BLOCK_SIZE  2.794066
Reduction3  4096        4.665568    2^22*4      2^22*4/BLOCK_SIZE  3.596842
Reduction3  8192        4.431072    2^22*4      2^22*4/BLOCK_SIZE  3.786728
Reduction4  8           5.224032    2^22*4      2^22*4/BLOCK_SIZE  3.412267     3.707161
Reduction4  32          5.095104    2^22*4      2^22*4/BLOCK_SIZE  3.344261
Reduction4  128         7.535552    2^22*4      2^22*4/BLOCK_SIZE  2.235105
Reduction4  512         9.200608    2^22*4      2^22*4/BLOCK_SIZE  1.825271
Reduction4  1024        9.025536    2^22*4      2^22*4/BLOCK_SIZE  1.859769
Reduction4  2048        6.024928    2^22*4      2^22*4/BLOCK_SIZE  2.785313
Reduction4  4096        4.526176    2^22*4      2^22*4/BLOCK_SIZE  3.707161
Reduction4  8192        4.789728    2^22*4      2^22*4/BLOCK_SIZE  3.502963
Reduction5  8           5.336128    2^22*4      2^22*4/BLOCK_SIZE  3.340585     3.728831
Reduction5  32          5.284000    2^22*4      2^22*4/BLOCK_SIZE  3.224709
Reduction5  128         8.436736    2^22*4      2^22*4/BLOCK_SIZE  1.996359
Reduction5  512         9.103328    2^22*4      2^22*4/BLOCK_SIZE  1.844776
Reduction5  1024        9.831392    2^22*4      2^22*4/BLOCK_SIZE  1.707328
Reduction5  2048        6.342688    2^22*4      2^22*4/BLOCK_SIZE  2.645773
Reduction5  4096        4.499872    2^22*4      2^22*4/BLOCK_SIZE  3.728831
Reduction5  8192        5.234880    2^22*4      2^22*4/BLOCK_SIZE  3.205086
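The execution times in Tables I to III depend on how each kernel is timed. The following is a minimal sketch, not the author's benchmarking code, of how one kernel time and the effective bandwidth of Eq. (2) might be obtained with CUDA event timers. The kernel name block_sum, the helper effective_bandwidth_gbs, the block size of 128 and the all-ones input are assumptions chosen for illustration; the kernel follows the sequential addressing scheme of Fig 3, and the per-block sums are added up on the host.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel name; sequential-addressing block reduction as in Fig 3.
// Assumes blockDim.x is a power of two and the input length is a multiple of it.
__global__ void block_sum(const int *in, int *out)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    sdata[tid] = in[i];
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Eq. (2): Beff = ((Br + Bw) / 10^9) / T, with T in seconds (hypothetical helper).
double effective_bandwidth_gbs(double bytes_read, double bytes_written, double seconds)
{
    return ((bytes_read + bytes_written) / 1e9) / seconds;
}

int main()
{
    const int n = 1 << 22;               // 2^22 integers, as in Appendix A
    const int block_size = 128;          // assumed threads per block
    const int blocks = n / block_size;   // one partial sum per block
    const size_t shmem = block_size * sizeof(int);

    std::vector<int> h_in(n, 1);         // all ones, so the expected sum is n
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Warm-up launch so that one-time start-up costs are not timed.
    block_sum<<<blocks, block_size, shmem>>>(d_in, d_out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    block_sum<<<blocks, block_size, shmem>>>(d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);          // host waits until 'stop' has been recorded

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    // Final reduction of the per-block sums on the host.
    std::vector<int> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    long long total = 0;
    for (int v : h_out) total += v;

    const double br = double(n) * sizeof(int);        // bytes read
    const double bw = double(blocks) * sizeof(int);   // bytes written
    std::printf("sum = %lld (expected %d)\n", total, n);
    std::printf("time = %f ms, Beff = %f GB/s\n", ms,
                effective_bandwidth_gbs(br, bw, ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return total == n ? 0 : 1;
}
```

Compiled with nvcc, the program prints the recovered sum together with the measured time and bandwidth. The figures will differ from those in the tables, since they depend on the device and on the assumed block size.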