A Computer hardware and components forum. ComputerBanter.com


An idea how to speed up computer programs and avoid waiting.("event driven memory system")



 
 
  #1  
Old August 9th 11, 08:46 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Bernhard Schornak
external usenet poster
 
Posts: 13
Default An idea how to speed up computer programs and avoid waiting.("event driven memory system")

Skybuck Flying wrote:


In the code below, processing 3 blocks in parallel is attempted.


"
A very bad idea.
"

This was your idea, after all.



Not that I knew. If I remember correctly, you posted this
stuff...


It might be bad for current settings, but maybe there is some value in it for other settings.

"
L1 cache is 65,536 bytes on recent CPUs,
but only 32,768 bytes on future processors, e.g. Zambezi.
Three blocks occupy 3 * 32,768 = 98,304 bytes, so 32,768
(or 65,536!) bytes do not fit into L1. The result: memory
access is slowed down from 3 (L1) to 6 (L2), 12 (L3) or
27 (RAM) clocks per access, depending on where the cache
lines have to be fetched from or written to.

If you want to process multiple blocks in parallel, it's
better to check how many physical processors exist on a
user's machine, then create that many threads (minus 1
for the OS), each running on one physical processor.
"

So far the examples have all executed well inside the cache because of the low settings,
which almost fit in the L1 cache.

However, it would also be interesting to change the settings to, for example, a block size of
1920x1200 elements to mimic a single frame of a video codec, which might want to do some
random memory access.

So then the L1 cache will probably be woefully short, and then you can still try and see
if your "pairing/parallel" processing of frames might give higher throughput.



Read some recent processor manuals to get an overview.


However, I already do see a little problem with this idea. Usually video codec frames are
processed sequentially, so it might be difficult or even impossible to process multiple
frames at the same time.



Possible with multiple cores. GPUs have hundreds of them.


However, a frame does consist of R, G, B and sometimes maybe even an alpha channel.



These four bytes fit into a dword.


So there could be 3 to 4 channels, so your idea of processing 3 blocks at the same time
might still have some merit, when each color channel is considered a block.



I never wrote anything about "processing three blocks at
the same time" - this probably is the result of your
misunderstanding of how my examples really work.


Another interesting setting could be to reduce the settings to make it fit entirely into
the L1 cache, to see if your pairing/multiple loading code also becomes faster when it
does fit in cache.



As you can read everywhere else, it does. If caches did
not work as accelerators, no recent processor would have
-any- cache. Caches are the most expensive part of any
processor. They occupy huge areas on a die and increase
production costs.


Writing multithreading code for "random access memory" could become tricky, especially if
writes are involved as well.
Processors might conflict with each other.



Writes are not the bottleneck, because they are scheduled
by the memory controller (no wait cycles). The real brake
is reads. The processor has to wait until requested data
are present - it cannot work with "guessed" data.


As long as processors operate within their own blocks and can never access each other's
blocks, then it should be OK.



As far as I understood your postings, your definitions do
not allow inter-block access.


Some care should still be taken at the borders of the blocks; inserting some padding there
could help prevent accidental/unlucky simultaneous access. So it does become a little bit
more tricky to write code like that... because of the padding, allocations have to be done
a little bit differently, as well as the formulas. I can see it becoming a bit messy, though
some variable names could help solve the mess, as follows:

Allocate( mElementCount + mPaddingCount );

^ Then the padding count could be used in formulas/code where necessary or left out where
not necessary.

Like mod mElementCount to set up indexes correctly.



Which is faster than an "and reg,mask" (1 clock, 1 pipe),
executed while the processor waits for data?
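
In Free Pascal, the two ways of wrapping an index look like this - a minimal
sketch (the constants are just the thread's example sizes); both lines compute
the same wrapped index, but the AND is a single one-clock instruction while
MOD compiles to a slow DIV:

program MaskWrap;

{$mode objfpc}

const
  ElementCount = 8192;              // must be a power of two for the mask trick
  IndexMask    = ElementCount - 1;  // $1FFF

var
  rawIndex, wrapped: LongWord;
begin
  rawIndex := 123456;

  // general case: MOD works for any element count, but costs a DIV
  wrapped := rawIndex mod ElementCount;
  WriteLn(wrapped);                 // 576

  // power-of-two case: one AND keeps the index within the valid range
  wrapped := rawIndex and IndexMask;
  WriteLn(wrapped);                 // 576
end.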


"
With 8,192 elements per block, the multiplication can be
replaced by a shift:

BlockBaseN = BlockIndexN << 15

reducing the VectorPath MUL (blocking all pipes for five
clocks) to a DirectPath SHL (one clock, one pipe).
"

While the explanation of this optimization idea/trick is interesting, unfortunately it
cannot really be used in the example code.

The example code needs to remain flexible to try out different scenarios, to see if your
idea can be useful for something.



Which idea? What I posted until now were suggestions how
your ideas could be improved. If I wanted to advertise my
own ideas, I would start some threads rather than reply
to other people's postings.


// initialise each element index to the first index of the block.
ElementIndexA = 0
ElementIndexB = 0
ElementIndexC = 0

// loop 80000 times through the block arrays/elements
SecondLoopBegin "LoopIndex goes from 0 to 79999"

// Seek into memory at location BlockBase + ElementIndex
// do this for A, B and C:
// retrieve the new element index via the old element index
ElementIndexA = mMemory[ BlockBaseA + ElementIndexA ]
ElementIndexB = mMemory[ BlockBaseB + ElementIndexB ]
ElementIndexC = mMemory[ BlockBaseC + ElementIndexC ]

SecondLoopEnd


"
I still do not understand what this is good for, but you
should consider processing one block after the other. If
you are eager to introduce something like "parallel pro-
cessing", divide your entire array into as many parts as
there are processors (hopefully a multiple of two), then
let those threads process a contiguous part of memory in
parallel.
"

The code above represents your "parallel block" idea, where there would be 3 loads in
parallel.



This still is your code, and this stupid idea is based on
your misunderstanding of my explanations of how Athlons work
(three execution pipes do not speed up reads - especially
not if each of them has to read data from a different, in
most cases uncached, memory location).


The mMemory[ ... ] is a load. See one of your own earlier postings where you wrote
about that idea.

You "claimed" the X2 processor can do 3 loads in parallel more or less.





I explain, but don't claim anything. What I wrote here is
a short summary of AMD's reference manuals and practice -
no more, no less.


The code above tries to take advantage of that.



No. The fastest way to perform this task is some -minor-
changes to avoid time-consuming instructions like MUL and
a change to sequential (not random) reads as long as this
is possible.


"
If you prefer a single threaded solution, it's better to
process one block after the other, because entire blocks
fit into L1 completely.
"

Perhaps your parallel loading idea might also be faster in such a situation.



If you implement it properly, yes. Accessing three memory
areas with distances of 32,768 bytes surely isn't a proper
implementation. My example in



still shows how to speed up reads. Clearly in sequential,
not in random order...


If not, then the opposite situation - not fitting into L1 - could be tested as well, to see
if your parallel loading idea gives more performance.

"
Any "touched" element is present
in L1 after the first access, three registers are enough
to manage the entire loop, faster algorithms can be used
to process multiple elements in parallel (all pipes busy
most of the time), and so on,
"

?


// store some bull**** in block result, this step could be skipped.
// but let's not, so the compiler doesn't optimize anything away
// perhaps block result should be printed just in case.
// the last retrieved index is stored into it.
// do this for A, B, C

BlockResult[ BlockIndexA ] = ElementIndexA
BlockResult[ BlockIndexB ] = ElementIndexB
BlockResult[ BlockIndexC ] = ElementIndexC

// once the 3 blocks have been processed go to the next 3 blocks
BlockIndexA = BlockIndexA + 3
BlockIndexB = BlockIndexB + 3
BlockIndexC = BlockIndexC + 3

FirstLoopEnd

Perhaps this clarifies it a little bit.

Let me know if you have any further questions.


"
As shown, your current approach is quite slow
"

Which one?

The single block version (which should be quite fast) or the three block version (which
was your idea!).

"
- it might
be worth a thought to change your current design towards
something making use of built-in acceleration mechanisms
like caching and multiple execution pipelines.
"

It already does caching more or less by accident; multiple execution pipelines were
tried with your idea.

I think it does do both a bit... I don't know exactly how much it already does these things.

A processor simulator would be necessary to see exactly what is happening.



The best processor simulator is the human brain (and some
knowledge about the target machine).


But I can tell a little bit from re-arranging the code that it does some of these things
that you mentioned.

So perhaps the "three block" version is already optimal, but it's still slower than the
"one block" version, at least for 32 bit because then it fits in the L1 cache; however,
different settings should still be tried to see which one is faster then.

Also, your 64 bit assembly code should still be tried; it hasn't been tried yet.

"
Using multiples of two (rather than ten) simplifies cal-
culations markedly. Hence, the first thing to change was
the size of your array - 4,096 blocks and 8,192 elements
don't occupy that much additional memory, 6,217,728 bytes
are peanuts compared to the entire array. In return, you
save a lot of time wasted on calculating addresses and
validating indices (AND automatically keeps your indices
within the valid range).
"

As long as loading and writing are not affected by the extra padding bytes, and as long as
the example executes correctly, then arrays could be padded with extra bytes if it makes
address calculations faster without changing the nature of the example.

However if the example suddenly produces different results then it would be a problem.

"
The slightly changed design used addressing modes like

mov REG,block_offset[block_base + index * 4] or

mov block_offset[block_base + index * 4],REG
"

What's the difference here?

One seems to be reading, the other seems to be writing?



Yes.


So apparently you are comparing this code to something else? But what?



RDI = block count
RSI = current block address
RBP = current index to read

prep:
prefetch [rsi]                     # preload the first cache line
mov edi,block_count
mov ebp,(loop_cnt - 1)             # start at the last index of the block
inner_loop:
mov eax,[rsi+0x00+rbp*4]           # movl 0x00(%rsi, %rbp, 4),%eax
mov ebx,[rsi+0x04+rbp*4]
mov ecx,[rsi+0x08+rbp*4]
mov edx,[rsi+0x0C+rbp*4]
mov r10d,[rsi+0x10+rbp*4]
mov r11d,[rsi+0x14+rbp*4]
mov r12d,[rsi+0x18+rbp*4]
mov r13d,[rsi+0x1C+rbp*4]
nop
....
[process data here]
....
sub ebp, 8
jae inner_loop
outer_loop:
mov ebp,(loop_cnt - 1)
add esi,block_size
dec edi
jne inner_loop
....
....


Reads 8 elements per iteration. EAX is present after NOP,
EBX...R13 are present in the following cycles (one access
per cycle). The loop reads half of an Athlon's cache line
size (one Bulldozer cache line). Replacing the NOP with a
PREFETCH [RSI+RBP*4] preloads the next cache line to save
a few clock cycles. This is the code for the entire loop,
just add some processing of the retrieved data. There are
four registers left for calculations, temporary stores or
whatever you want to do.


Greetings from Augsburg

Bernhard Schornak
  #2  
Old August 9th 11, 11:39 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]
external usenet poster
 
Posts: 463
Default An idea how to speed up computer programs and avoid waiting. ("event driven memory system")

The Pascal code with the 3 blocks is roughly based on this code/posting of
yours:

"
An example which does one load per pipe would be nice !


....
mov eax,[mem] \
mov ebx,[mem + 0x04] cycle 1
mov ecx,[mem + 0x08] /
nop \
nop cycle 2
nop /
nop \
nop cycle 3
nop /
eax present \
nop cycle 4
nop /
ebx present \
nop cycle 5
nop /
ecx present \
nop cycle 6
nop /
....


It takes 3 clocks to load EAX - Athlons can fetch
32 bit (or 64 bit in long mode) per cycle. Hence,
the new content of EAX will be available in cycle
four, while the other two loads still are in pro-
gress. Repeat this with EBX and ECX - NOPs should
be replaced by some instructions not depending on
the new content of EAX, EBX or ECX. It is the
programmer's job to schedule instructions wisely.
Unfortunately, an overwhelming majority of coders
does not know what is going on inside the machine
they write code for (= "machine independent").
"

I do twist this text a little bit: when I write "parallel" I kinda mean 3
loads in whatever number of clock cycles it takes.

So it's like a more efficient form of batch processing.

I requested an example of a "load per pipe".

I also doubted if it was possible.

Yet somehow you claimed it was possible, but then you came with some kind of
story.

Not sure if it was an answer to my question/request or a half answer.

It's also not clear what is meant with a "load per pipe".

Is that parallel, or really just short sequential?

For me it's more or less the same as long as it's faster than what the code
is currently doing.

So what it's actually called/named doesn't really matter for me as long as
it's somehow faster.

But now it seems we are starting to misunderstand each other on both sides.


The manuals seem very cluttered with operating-system-specific information
which I don't need.

I only need to know hardware architecture for optimization purposes.

I already read the "optimization tricks" manual.

I feel I could be wasting my time reading an instruction set like x86 or x64
which could die any day now.

It's probably also quite a bad instruction set with many oddities.

I'd rather read and switch to a more properly designed instruction set.

I had a fun time reading about CUDA/PTX; I do not think reading x86 would be
any fun at all.

But perhaps I will give it a looksy to see if I can find something valuable
in it... I doubt it though.

The processors also seem quite complex and change a lot...

Bye,
Skybuck.

  #3  
Old August 11th 11, 09:50 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Bernhard Schornak
external usenet poster
 
Posts: 13
Default An idea how to speed up computer programs and avoid waiting.("event driven memory system")

Skybuck Flying wrote:


The Pascal code with the 3 blocks is roughly based on this code/posting of yours:



Definitely not!


"
An example which does one load per pipe would be nice !


...
mov eax,[mem] \
mov ebx,[mem + 0x04] cycle 1
mov ecx,[mem + 0x08] /
nop \
nop cycle 2
nop /
nop \
nop cycle 3
nop /
eax present \
nop cycle 4
nop /
ebx present \
nop cycle 5
nop /
ecx present \
nop cycle 6
nop /
...


It takes 3 clocks to load EAX - Athlons can fetch
32 bit (or 64 bit in long mode) per cycle. Hence,
the new content of EAX will be available in cycle
four, while the other two loads still are in pro-
gress. Repeat this with EBX and ECX - NOPs should
be replaced by some instructions not depending on
the new content of EAX, EBX or ECX. It is the
programmer's job to schedule instructions wisely.
Unfortunately, an overwhelming majority of coders
does not know what is going on inside the machine
they write code for (= "machine independent").
"

I do twist this text a little bit: when I write "parallel" I kinda mean 3 loads in
whatever number of clock cycles it takes.

So it's like a more efficient form of batch processing.

I requested an example of a "load per pipe".

I also doubted if it was possible.



Your profound misunderstanding of this code snippet led
to these assumptions. Hint:

[mem], [mem+4] and [mem+8] are subsequent dwords stored
in contiguous order

0x00401000 = address of 1st read
0x00401004 = 2nd read
0x00401008 = 3rd read.

Cache line 0x00401000 will be loaded after the 1st read
(if it was not in L1, yet). Your implementation of this
example does this

0x00401000 = address of 1st read
0x00507000 = 2nd read
0x0080F000 = 3rd read,

which is not the same thing - three cache lines have to
be loaded if you access memory like that.
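
A small Free Pascal sketch of the difference - purely illustrative, using the
thread's 8,192-element (32,768-byte) blocks; the first group of reads stays
within one 64-byte cache line, while the second group touches three different
cache lines:

program CacheLines;

{$mode objfpc}

const
  BlockElements = 8192;             // 32,768-byte blocks, as in the thread

var
  mem: array[0..3 * BlockElements - 1] of LongWord;   // zero-initialised global
  a, b, c: LongWord;
begin
  // contiguous reads, like [mem], [mem+4], [mem+8]: all three dwords share
  // one cache line, so at most one cache miss is taken
  a := mem[0];
  b := mem[1];
  c := mem[2];

  // the "three block" layout: reads 32,768 bytes apart land in three
  // different cache lines, so each read can miss on its own
  a := mem[0 * BlockElements];
  b := mem[1 * BlockElements];
  c := mem[2 * BlockElements];

  WriteLn(a + b + c);               // keep the reads from being optimised away
end.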


Yet somehow you claimed it was possible, but then you came with some kind of story.

Not sure if it was an answer to my question/request or a half answer.

It's also not clear what is meant with a "load per pipe".



The term speaks for itself.


Is that parallel, or really just short sequential?



Depends on the code. If you use subsequent addresses
for reads or writes, it is both. Otherwise, it still is
parallel, but performs multiple random reads or writes.


For me it's more or less the same as long as it's faster than what the code is currently
doing.



One reason why you did not understand what I told you.


So what it's actually called/named doesn't really matter for me as long as it's somehow
faster.



As long as you don't know what you are doing, you can't
expect your code will become faster.


But now it seems we are starting to misunderstand each other on both sides.



No. I perfectly understand your position. However, it's
one thing to offer help or do some work for someone, and
another to -demand- more help than was already given.

I hope you are able to understand this point.


The manuals seem very cluttered with operating-system-specific information which I don't
need.



Which manuals? Try

http://support.amd.com/us/Processor_TechDocs/24592.pdf
http://support.amd.com/us/Processor_TechDocs/24593.pdf
http://support.amd.com/us/Processor_TechDocs/24594.pdf
http://support.amd.com/us/Processor_TechDocs/26568.pdf
http://support.amd.com/us/Processor_TechDocs/26569.pdf
http://support.amd.com/us/Processor_TechDocs/40546.pdf

for AMD or

http://www.intel.com/Assets/PDF/manual/253665.pdf
http://www.intel.com/Assets/PDF/manual/253666.pdf
http://www.intel.com/Assets/PDF/manual/253667.pdf
http://www.intel.com/Assets/PDF/manual/253668.pdf
http://www.intel.com/Assets/PDF/manual/253669.pdf

for LETNi references and manuals.


I only need to know hardware architecture for optimization purposes.

I already read the "optimization tricks" manual.



This is not a -manual-. It's a paper analysing code and
providing hints to improve programming techniques.


I feel I could be wasting my time reading an instruction set like x86 or x64 which could
die any day now.

It's probably also quite a bad instruction set with many oddities.



It's the most popular platform worldwide. After all, it
became that popular because it was the first (and only)
open standard where you can plug in any hardware of any
manufacturer without being bound to proprietary systems
where you are forced to buy every add-on from the
manufacturer who already sold you the "naked" computer.


I'd rather read and switch to a more properly designed instruction set.



Which?

If you think x86 assembler is that disgusting, you are
better off staying with Pascal. It is machine independent,
much more comfortable and does most of the required work
for you. Of course, you have to pay the price in the form
of slow and bloated code, but, hey: who cares?


Greetings from Augsburg

Bernhard Schornak
  #4  
Old August 15th 11, 03:11 AM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]
external usenet poster
 
Posts: 463
Default An idea how to speed up computer programs and avoid waiting. ("event driven memory system")

Ok,

I am a bit surprised I didn't respond yet to this posting of yours, but that
might be because most of it is true and I have nothing further to add,
except of course confirming that it was indeed somewhat of a misunderstanding.

But it was also wishful thinking.

I was hoping that

[mem + 0x04] somehow meant that it was using "mem" as a base address and
that "0x04" would contain another pointer which would point to the random
memory location.

I am not much of an x86 programmer and each assembler probably has its own
pointer syntax/whatever.

I am glad you cleared this up.

I also don't hold your example against you, because at the time of writing I
probably had not yet explained what the program was doing.

Only after your posting did I explain further...

So I think it's good of me to clarify that further, just to be clear that
you were not trying to sabotage this thread with a wrong example.

At the time you probably simply didn't understand what I was trying to do...
or maybe you did understand but provided a sequential example anyway.

Only you know the truth about what was in your head at the time and your true
intention... though a sequential example which more or less assumes that
data is inside the cache could still be interesting.

So there could still be some value in this... perhaps my code is already
doing something like this... but I guess not because it's not really giving
any higher performance but that might be because the data set is too large.

If the elements were reduced from 8000 to 800 then 3 pairs of blocks might
fit in the cache and then maybe your code example might still have some
value.

However not really since the single program is probably already maxing out
the system/cpu.

It's not the cache or memory which is holding back the program, it's simply
the amount of instructions.

SSE is not going to help, since there is no SSE instruction which can
retrieve multiple random memory locations. So too bad for that.

If you truly want to write a faster program in assembler, and especially a
significantly faster program, then you would either:

1. Have to reduce the number of instructions further, which will probably be
hard to do (inside the inner loop).

or

2. Find a way to "fetch" multiple elements per cycle (per instruction).

My take on it is: 1 might be possible, but 2 probably not.

Also:

3. Writing a 64 bit program is probably useless, since 64 bit instructions
execute twice as slowly, at least on my processor; perhaps you should test
this on your Phenom and see what happens.

However, under different circumstances perhaps you can do better, like
not-fitting-in-cache circumstances. But this again would probably require
some kind of "anti-stalling"/"anti-waiting" code.

Which was what my original posting was more or less about... letting code
proceed as much as possible and jumping back when memory results are in...
something along those lines...

Perhaps you are now starting to see that my posting is about something new
or extra which might not be possible with current hardware; though you do
keep insisting that it is possible, I believe you 50%, a little bit... even if
it's possible it will be little, and you still have to convince me 100%...
my lack of time has still prevented me from running your program.

Perhaps I don't want to know the results for now, since it wouldn't really be
that helpful I guess.

However, if you could somehow re-write your program from "optimized
assembler" back to "Free Pascal code" in such a way that the Free Pascal
compiler produces more or less the same code, then that would be highly
interesting! Especially if the code is indeed faster for certain
circumstances! It would probably still not be interesting for my current
project, but maybe for future projects.
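
A minimal Free Pascal sketch of what such a rewrite might look like - the
procedure name, sizes and the unroll-by-eight structure are only illustrations
of the sequential-read loop shown in the assembly earlier in the thread, not
the actual program:

program UnrolledReads;

{$mode objfpc}

procedure ProcessBlocks(const mem: array of LongWord;
                        blockCount, elementsPerBlock: Integer);
var
  block, base, i: Integer;
  sum: LongWord;
begin
  sum := 0;
  for block := 0 to blockCount - 1 do
  begin
    base := block * elementsPerBlock;
    i := 0;
    // eight independent sequential loads per iteration, mirroring the
    // eight-loads-per-iteration assembly loop
    while i + 7 < elementsPerBlock do
    begin
      sum := sum + mem[base + i + 0] + mem[base + i + 1]
                 + mem[base + i + 2] + mem[base + i + 3]
                 + mem[base + i + 4] + mem[base + i + 5]
                 + mem[base + i + 6] + mem[base + i + 7];
      Inc(i, 8);
    end;
  end;
  WriteLn(sum);                     // keep the loads from being optimised away
end;

var
  data: array[0..4 * 8192 - 1] of LongWord;   // 4 blocks of 8,192 elements
begin
  ProcessBlocks(data, 4, 8192);
end.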

However, I see a big mistake in your reasoning about optimizing your program,
which is at the same time a big mistake in my reasoning and original post.

"CUDA/GPU'S" can probably already do what I propose to a certain extent at
least... and they probably do it much better than CPU... which means
x86/intel/amd potentially has huge problem, since GPU can do something which
their older processors apperently cannot which is: process very large data
sets in a random fashion.

However, I am unsure about ATI cards and the newer AMD GPUs; even Intel seems
to have "embedded GPUs"... I find Intel kinda secretive about that... they
are not coming forward with details... they probably did the same during the
80486 age... they keep "secrets" "secret" to give their selected programmers
an advantage over others... not a good practice for us it seems, which makes
me less enthusiastic about buying their processors.

Nvidia's GPUs seem better documented, especially CUDA... but I am thinking
this is necessary to attract more programmers to it... CUDA has few
benefits for now... but maybe that will change in the future... Nvidia is in
business with big companies: Apple and Microsoft and probably also Google
and maybe even more... they are positioned pretty well... I do wonder what
will happen to their documentation if they do make a big x86 killer...
perhaps they will become more secretive again, which would suck. At least
that's how I experience it a little bit... perhaps the things I wrote about
Intel might not be entirely true... but that is how I feel about the
latest/greatest... or maybe their documentation is just lagging behind a
little bit... or I didn't look... or it doesn't concern me yet since I don't
have those chips or proper simulators.

I have been programming x86 for at least 18 years as well, or so... and I
still don't have a proper x86 simulator, which kinda sucks!

Having one which could show the "amount of cycles" taken by programs would be
great. Quite strange that it apparently doesn't exist, or is too
****ty/complex, or I can't understand it, or whatever, or takes too long to
execute?! Hmm...

This is what I do like about "virtual instruction sets"... total insight
into how everything executes and such!

This probably explains why there are so many virtual instruction sets:

JavaScript, Java, PHP, .NET, Flash, and other scripting languages.

^ The sad thing is these all have security issues, so it doesn't really help at
protecting computers; it takes only one hole to screw up a computer.
But it does make computers run slower.

Bye,
Skybuck.

  #5  
Old August 15th 11, 03:14 AM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]
external usenet poster
 
Posts: 463
Default An idea how to speed up computer programs and avoid waiting. ("event driven memory system")

"
This probably explains why there are so many virtual instruction sets:

JavaScript, Java, PHP, .NET, Flash, and other scripting languages.

^ The sad thing is these all have security issues, so it doesn't really help at
protecting computers; it takes only one hole to screw up a computer.
But it does make computers run slower.
"

Oh, I forgot one, which could be a pretty serious one in the future.

How could I forget my favorite one at the moment:

"PTX"

And perhaps I should also mention "redcode", but it's not really that serious,
more meant for fun.

But PTX is of course serious.

And there is also AMD's version "vm" or something...

Which suddenly makes me wonder what Intel's instruction set is for their
integrated CPU+GPU thingy... hmmm.

Bye,
Skybuck.

 



