A Computer hardware and components forum. ComputerBanter.com

Skybuck's RAM Test version 0.07 (For CUDA and CPU) now available !



 
 
  #1  
Old July 14th 11, 02:34 PM posted to alt.comp.hardware.pc-homebuilt,alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,comp.arch,rec.games.corewar
Skybuck Flying[_7_]
external usenet poster
 
Posts: 463

The number of blocks doesn't really matter.

I tested with 2000 blocks and it gives the same performance results; the
test just finishes sooner, since the result is after all divided by the
seconds taken.

The higher block numbers were just to test if it might help.

Anyway I have managed to find a little optimization trick via ptx.

It's quite significant too.

By adding a ".cop" qualifier, which stands for "cache operator", to the load
instruction, cuda can be made to run faster:

The following instruction was changed from:

ld.global.s32 %r34, [%r38+0];
To:
ld.global.cg.s32 %r34, [%r38+0];

This seems to give 50% more performance for random memory access with cuda !

However, more careful tests should be done to be sure... maybe it only holds
for this particular situation, but the difference is so big there is probably
something to it !

Surprisingly, the .cs ("cache streaming") operator, which is what I tried
first, did not give more performance.

I still have other operators to try, but this is already pretty spectacular,
since nothing else I tried with code adjustments helped !

So there is still hope yet to squeeze some more performance out of it !
=D

The CPU is still more than twice as fast, though !

Bye,
Skybuck.

  #2  
Old July 14th 11, 02:42 PM posted to alt.comp.hardware.pc-homebuilt,alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

The following technique works more or less the same way at the source level:

The following parameter is altered

from:

int *Memory,

to:

volatile int *Memory,

This produces the instruction:

ld.volatile.global.s32 %r34, [%r38+0];

I also tried adding .cg after .global, but that is not allowed on a volatile
load; it would be redundant...

volatile already indicates that no cache operators are allowed.

This gives the same 50% performance increase which is very nice !

Bye,
Skybuck.


  #3  
Old July 22nd 11, 05:36 AM posted to alt.comp.hardware.pc-homebuilt,alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

Interesting news in short: GPU cache 4 times faster than CPU cache ! =D

(Version 0.10 which still uses GPU ram instead of GPU cache also available)

(Version 0.12 is the gpu cache version but still unreleased =D)

Ok, the shared memory kernel is done... it also executes 4000 blocks, but
this time sequentially...

The results of this test made my jaw drop ! LOL... and they offer some hope
for cuda:

Just a single cuda thread did this:

http://www.skybuck.org/CUDA/RAMTest/...MemoryTest.png

Text:

"
Test Cuda Random Memory Access Performance.
version 0.12 created on 21 july 2011 by Skybuck Flying.
program started.
Device[0].Name: GeForce GT 520
Device[0].MemorySize: 1008402432
Device[0].MemoryClockFrequency: 600000000
Device[0].GlobalMemoryBusWidthInBits: 64
Device[0].Level2CacheSize: 65536
Device[0].MultiProcessorCount: 1
Device[0].ClockFrequency: 1620000000
Device[0].MaxWarpSize: 32
Setup...
ElementCount: 8000
BlockCount: 4000
LoopCount: 80000
Initialize...
LoadModule...
OpenEvents...
OpenStream...
SetupKernel...
mKernel.Parameters.CalculateOptimalDimensions successfull.
mKernel.Parameters.ComputeCapability: 2.1
mKernel.Parameters.MaxResidentThreadsPerMultiProcessor: 1536
mKernel.Parameters.MaxResidentWarpsPerMultiProcessor: 48
mKernel.Parameters.MaxResidentBlocksPerMultiProcessor: 8
mKernel.Parameters.OptimalThreadsPerBlock: 256
mKernel.Parameters.OptimalWarpsPerBlock: 6
mKernel.Parameters.ThreadWidth: 256
mKernel.Parameters.ThreadHeight: 1
mKernel.Parameters.ThreadDepth: 1
mKernel.Parameters.BlockWidth: 16
mKernel.Parameters.BlockHeight: 1
mKernel.Parameters.BlockDepth: 1
ExecuteKernel...
ReadBackResults...
DisplayResults...
CloseStream...
CloseEvents...
UnloadModule...
ExecuteCPU...
Kernel execution time in seconds: 0.3385913085937500
CPU execution time in seconds : 1.4263124922301578
Cuda memory transactions per second: 945092186.0015719590000000
CPU memory transactions per second : 224354762.1879504710000000
program finished.
"

Conclusion: shared memory is HELL/SUPER FAST !

About 4 times faster than the CPU ?!?!
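
As a sanity check on the figures above: the reported rates are consistent with a simple formula, total transactions = BlockCount x LoopCount, divided by the elapsed seconds. This is an inference from the logged numbers, not something taken from the program source, and the function names below are made up for illustration. A minimal C++ sketch:

```cpp
#include <cassert>
#include <cstdint>

// Assumption (inferred from the log, not from the program source):
// one run performs BlockCount * LoopCount memory transactions in total.
inline std::uint64_t total_transactions(std::uint64_t block_count,
                                        std::uint64_t loop_count) {
    return block_count * loop_count;
}

// Throughput as printed in the log: transactions divided by elapsed seconds.
inline double transactions_per_second(std::uint64_t block_count,
                                      std::uint64_t loop_count,
                                      double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(total_transactions(block_count, loop_count)) / seconds;
}

// How many times faster the first rate is than the second.
inline double speed_ratio(double rate_a, double rate_b) {
    assert(rate_b > 0.0);
    return rate_a / rate_b;
}
```

With the values from the log (4000 blocks, 80000 loops, 0.3385913 s for the kernel and 1.4263125 s for the CPU) this gives roughly 945 million and 224 million transactions per second, a ratio of about 4.2.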

I am gonna do a little debug test with VS 2010, because this is almost
unbelievable ! LOL. I do believe it, but jeez ?! Cool.

Though the GPU L1 cache is probably smaller than the CPU L1 cache, which
could explain its higher speed.

For real purposes I might require an even larger cache and then maybe the
results will be different... but for now it's hopeful.

Bye,
Skybuck.


  #4  
Old July 22nd 11, 05:40 AM posted to alt.comp.hardware.pc-homebuilt,alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

In reality this probably means the gpu is only twice as fast as a dual core,
since the dual core will probably be twice as fast as a single core.

So if a quad core processor were to face a gt 520, my estimate is that they
would both be about the same speed, unless newer CPUs have even faster caches.


Bye,
Skybuck.

  #5  
Old July 22nd 11, 02:11 PM posted to alt.comp.hardware.pc-homebuilt,alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

Whoops, there was something wrong with the kernel and also with the kernel
launch parameters.

The kernel was processing only 1 block, and the launch parameters specified
4000 threads.

Now the situation has been corrected.

The kernel now processes 4000 blocks with only 1 thread.

It turns out it's fricking slow: about 17 times slower than the CPU !


Test Cuda Random Memory Access Performance.
version 0.12 created on 21 july 2011 by Skybuck Flying.
program started.
Device[0].Name: GeForce GT 520
Device[0].MemorySize: 1008402432
Device[0].MemoryClockFrequency: 600000000
Device[0].GlobalMemoryBusWidthInBits: 64
Device[0].Level2CacheSize: 65536
Device[0].SharedMemoryPerMultiProcessor: 49152
Device[0].RegistersPerMultiProcessor: 32768
Device[0].ConstantMemory: 65536
Device[0].MultiProcessorCount: 1
Device[0].ClockFrequency: 1620000000
Device[0].MaxWarpSize: 32
Setup...
ElementCount: 8000
BlockCount: 4000
LoopCount: 80000
Initialize...
LoadModule...
OpenEvents...
OpenStream...
SetupKernel...
mKernel.Parameters.CalculateOptimalDimensions successfull.
mKernel.Parameters.ComputeCapability: 2.1
mKernel.Parameters.MaxResidentThreadsPerMultiProcessor: 1536
mKernel.Parameters.MaxResidentWarpsPerMultiProcessor: 48
mKernel.Parameters.MaxResidentBlocksPerMultiProcessor: 8
mKernel.Parameters.OptimalThreadsPerBlock: 256
mKernel.Parameters.OptimalWarpsPerBlock: 6
mKernel.Parameters.ThreadWidth: 1
mKernel.Parameters.ThreadHeight: 1
mKernel.Parameters.ThreadDepth: 1
mKernel.Parameters.BlockWidth: 1
mKernel.Parameters.BlockHeight: 1
mKernel.Parameters.BlockDepth: 1
ExecuteKernel...
ReadBackResults...
DisplayResults...
CloseStream...
CloseEvents...
UnloadModule...
ExecuteCPU...
Kernel execution time in seconds: 24.2583750000000000
CPU execution time in seconds : 1.4263193366754714
Cuda memory transactions per second: 13191320.5233244183900000
CPU memory transactions per second : 224353685.5819891260000000
program finished.

(Picture already updated above).

  #6  
Old August 1st 11, 03:35 PM posted to alt.comp.hardware.pc-homebuilt,alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

Just for the record,

I also wrote a CPU test which can run on any x86 cpu.

For the AMD X2 3800+ the results for a single core were as follows:

Test CPU Random Memory Access Performance.
version 0.01 created on 1 august 2011 by Skybuck Flying.
program started.
Setup...
ElementCount: 8000
BlockCount: 4000
LoopCount: 80000
Initialize...
ExecuteCPU...
CPU execution time in seconds : 0.7778037432131737
CPU memory transactions per second : 411414836.7016757590000000
program finished.

So that's:

411,414,836 random 32-bit integers per second (mostly from cpu cache).

This number is higher than the roughly 224 million of the earlier CPU runs
because the code is optimized (no slow dynamic indexes and no slow get
element routine, so no call overhead).
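
For readers who want to try something similar, here is a rough C++ sketch of a random-access test in the same spirit. It is not the actual test program: the memory layout (BlockCount blocks of ElementCount 32-bit integers, each element holding the index of the next element to read), the function names and the random fill are all assumptions made for illustration. The inner loop indexes a raw pointer directly, so there is no per-element function call.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical layout: block_count blocks of element_count ints each.
// Every element stores a random index into its own block.
std::vector<std::int32_t> make_blocks(std::size_t block_count,
                                      std::size_t element_count,
                                      std::uint32_t seed) {
    assert(element_count > 0);
    std::vector<std::int32_t> memory(block_count * element_count);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::int32_t> pick(
        0, static_cast<std::int32_t>(element_count - 1));
    for (std::int32_t& element : memory) element = pick(rng);
    return memory;
}

// Follow the index chain loop_count times in each block, one block after
// another. Each step is one dependent 32-bit load. The returned checksum
// keeps the compiler from discarding the loads.
std::int64_t chase(const std::vector<std::int32_t>& memory,
                   std::size_t block_count,
                   std::size_t element_count,
                   std::size_t loop_count) {
    assert(memory.size() == block_count * element_count);
    std::int64_t checksum = 0;
    for (std::size_t b = 0; b < block_count; ++b) {
        const std::int32_t* block = memory.data() + b * element_count;
        std::int32_t index = 0;
        for (std::size_t i = 0; i < loop_count; ++i) {
            index = block[index];  // direct indexing, no accessor call
        }
        checksum += index;
    }
    return checksum;
}

// Time one full pass and report block_count * loop_count loads per second.
// Very small runs may finish faster than the clock can resolve.
double measure_transactions_per_second(const std::vector<std::int32_t>& memory,
                                       std::size_t block_count,
                                       std::size_t element_count,
                                       std::size_t loop_count,
                                       std::int64_t& checksum_out) {
    const auto start = std::chrono::steady_clock::now();
    checksum_out = chase(memory, block_count, element_count, loop_count);
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return static_cast<double>(block_count) *
           static_cast<double>(loop_count) / elapsed.count();
}
```

Calling make_blocks(4000, 8000, seed) followed by measure_transactions_per_second(memory, 4000, 8000, 80000, checksum) would mirror the ElementCount, BlockCount and LoopCount values from the log. Note that a purely random fill can produce short index cycles, so a real test may want a different fill.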

Bye,
Skybuck.






 




All times are GMT +1.