PowerBASIC Forums
  PowerBASIC for Windows
  Dwords are bad m'kay?...

Post New Topic  Post A Reply
profile | register | preferences | faq | search

UBBFriend: Email This Page to Someone! next newest topic | next oldest topic
Author Topic:   Dwords are bad m'kay?...
Mike Joseph
Member
posted April 01, 2000 09:13 PM     Click Here to See the Profile for Mike Joseph     Edit/Delete Message   Reply w/Quote
... well dwords seem to be quite fast when used with SHIFT LEFT and SHIFT RIGHt. But otherwise, they seem to be very slow.

I've been spending quite a few hours today and yesterday trying to optimize a piece of code which I had translated over from C source. The C source was about roughly 6x as fast as mine and I couldnt figure out why. I was using DWords in every case where they were using "unsigned longs" When compiled, my app was taking ~40 seconds to go from start to finish, the C app was taking ~7 seconds. After converting all my Dwords to Longs, my code is now ~11 seconds which is much better. Im going to make a few more changes and hopefully I can catch or even beat 7 seconds.


Anyone know why the C code wasnt getting hammered when using the "unsigned long" variable (the PB equiv of dword right?) I compiled a version of the C code using "longs" instead of "unsigned longs" and it seemed to actually get slower by about 1.5 seconds... why could that be?

Here are the results to a few tests I was running. If you can give any insight into why some of the numbers are the way they are (some of which seem counter intuitive) please do!


Test Machine = Pentium II Celeron 266 w/ 128megs PC100 RAM, running Windows98 original release.
Data types tested = Integer, Long, Word, DWord, Single, Double
Declarations Tested = LOCAL, REGISTER, EXT

---------------------------------------
---------------------------------------
Test 1: XOR, AND, OR
50 million itterations
--------------------------------------
#COMPILE EXE
#REGISTER NONE
#DIM ALL

FUNCTION PBMAIN()

REGISTER TestVariable1 AS LONG
REGISTER TestVariable2 AS LONG

DIM Counter AS LONG
DIM lngTime AS LONG
DIM Itterations AS LONG

Itterations = 50000000

TestVariable1 = 1
TestVariable1 = 2

lngTime = TIMER
FOR Counter = 1 TO Itterations
TestVariable1 = TestVariable1 XOR TestVariable2
TestVariable2 = TestVariable2 AND TestVariable1
TestVariable1 = TestVariable2 OR TestVariable1
NEXT Counter

MSGBOX FORMAT$(Itterations,"#,###,###") & " itterations in " & FORMAT$(TIMER - lngTime,"####.##") & " seconds."

END FUNCTION
--------
Results |
--------
Registered Longs 1.37 seconds
Registered Words 1.87 seconds
Local Words 3.33 seconds
Local Longs 3.84 seconds
Local Integers 4.12 seconds
Registered Integers 4.81 seconds (Note these were consistantly slower than LOCAL integers
EXT Precision Floats 18.49 seconds
Local Singles 19.98 seconds
Local Doubles 25.14 seconds
Registered Dwords 39.25 seconds
Local Dwords 43.55 seconds


---------------------------------------
---------------------------------------
Test 2: SHIFT LEFT , SHIFT RIGHT
50 million itterations
--------------------------------------
#COMPILE EXE
#REGISTER NONE
#DIM ALL

FUNCTION PBMAIN()

REGISTER TestVariable1 AS LONG
REGISTER TestVariable2 AS LONG

DIM Counter AS LONG
DIM lngTime AS LONG
DIM Itterations AS LONG

Itterations = 50000000

TestVariable1 = 16
TestVariable1 = 32

lngTime = TIMER
FOR Counter = 1 TO Itterations
SHIFT LEFT TestVariable1, 8&
SHIFT LEFT TestVariable2, 8&
SHIFT RIGHT TestVariable1, 8&
SHIFT RIGHT TestVariable2, 8&
NEXT Counter

MSGBOX FORMAT$(Itterations,"#,###,###") & " itterations in " & FORMAT$(TIMER - lngTime,"####.##") & " seconds."

END FUNCTION
--------
Results |
--------
Local Words 6.89 seconds (~2x faster than registered word)
Local Integers 7.07 seconds (locals seem to be faster than their registered counterparts overall)
Local Dwords 7.89 seconds
Local Longs 7.90 seconds
Registered Dwords 8.20 seconds
Registered Longs 8.47 seconds
Registered Integers 12.20 seconds
Registered Words 12.23 seconds
EXT Precision Floats N/A
Local Singles N/A
Local Doubles N/A


---------------------------------------------
---------------------------------------------
TEST 3: MULIPLY, DIVIDE
5 million itterations
---------------------------------------------
#COMPILE EXE
#REGISTER NONE
#DIM ALL

FUNCTION PBMAIN()

REGISTER TestVariable1 AS LONG
REGISTER TestVariable2 AS LONG

DIM Counter AS LONG
DIM lngTime AS LONG
DIM Itterations AS LONG

Itterations = 5000000

TestVariable1 = 100
TestVariable2 = 200

lngTime = TIMER
FOR Counter = 1 TO Itterations
TestVariable1 = TestVariable1 * TestVariable2
TestVariable2 = TestVariable2 * TestVariable1
TestVariable1 = TestVariable1 / TestVariable2
TestVariable2 = TestVariable2 / TestVariable1
NEXT Counter

MSGBOX FORMAT$(Itterations,"#,###,###") & " itterations in " & FORMAT$(TIMER - lngTime,"####.##") & " seconds."

END FUNCTION
--------
Results |
--------
Local Longs 5.26 seconds
Local Integers 5.29 seconds
Local Singles 9.06 seconds
Registered Longs 9.44 seconds (why are these slower than Local declared longs?)
EXT Precision Floats 9.66 seconds
Registered Words 9.88 seconds
Registered Integers 9.88 seconds (same as registered Words)
Local Doubles 10.65 seconds
Local Words 10.71 seconds
Registered DWords 12.13 seconds
Local DWords 13.22 seconds


---------------------------------------------
---------------------------------------------
TEST 4: ADDITIONS, SUBTRACTION, INCR
50 million itterations
---------------------------------------------
#COMPILE EXE
#REGISTER NONE
#DIM ALL

FUNCTION PBMAIN()

REGISTER TestVariable1 AS LONG
REGISTER TestVariable2 AS LONG

DIM Counter AS LONG
DIM lngTime AS LONG
DIM Itterations AS LONG

Itterations = 50000000

TestVariable1 = 999
TestVariable2 = 111

lngTime = TIMER
FOR Counter = 1 TO Itterations
TestVariable1 = TestVariable1 + TestVariable1
TestVariable2 = TestVariable2 + TestVariable2
TestVariable1 = TestVariable1 - TestVariable2
TestVariable2 = TestVariable1 - TestVariable2
NEXT Counter

MSGBOX FORMAT$(Itterations,"#,###,###") & " itterations in " & FORMAT$(TIMER - lngTime,"####.##") & " seconds."

END FUNCTION
--------
Results |
--------
Registered Longs 1.34 seconds
Registered Words 2.74 seconds
Registered DWordS 2.96 seconds
Local Integers 3.74 seconds
Local Words 3.92 seconds
Local Longs 3.96 seconds
Local DWords 4.17 seconds
Registered Integers 6.18 seconds (slower than Local Integers, and significantly slower than registered Words)
EXT Precision Floats 91.99 seconds
Local Singles 92.34 seconds
Local Doubles 101.05 seconds

---------------------------------------------
---------------------------------------------
TEST 5: SQUARE ROOTS and EXPONENTS
50 million itterations
---------------------------------------------
#COMPILE EXE
#REGISTER NONE
#DIM ALL

FUNCTION PBMAIN()

REGISTER TestVariable1 AS LONG
REGISTER TestVariable2 AS LONG

DIM Counter AS LONG
DIM lngTime AS LONG
DIM Itterations AS LONG

Itterations = 50000000

TestVariable1 = 2
TestVariable2 = 3

lngTime = TIMER
FOR Counter = 1 TO Itterations
TestVariable1 = TestVariable1^1
TestVariable2 = TestVariable2^1
TestVariable1 = sqr(TestVariable2)
TestVariable2 = sqr(TestVariable1)
NEXT Counter

MSGBOX FORMAT$(Itterations,"#,###,###") & " itterations in " & FORMAT$(TIMER - lngTime,"####.##") & " seconds."

END FUNCTION
--------
Results |
--------
EXT Precision Floats 38.94 seconds EXT registered precision floating points top in this scenario
Local Singles 39.91 seconds
Local Longs 43.85 seconds
Local Integers 43.75 seconds
Local Doubles 45.27 seconds
Registered Words 46.20 seconds
Registered Integers 46.40 seconds
Registered Longs 46.41 seconds
Registered DWORDS 53.37 seconds
Local Words 55.92 seconds
Local DWORDS 60.02 seconds

-----------------------------------------------------

CONCLUSIONS
What I learned (i think)
- Avoid using Singles, Doubles and ESPECIALLY DWords and Words with bitwise operations like AND, OR, and XOR
- Floating points are extremely slow when it comes to addition and substraction
- When bitshifting, Local variables are faster than registered variables. (Anyone know why?)
- EXT Floating points, and Doubles and Singles are slightly faster than other types when it comes to square roots and raising power.
- During multiplication and divisions, register integers, longs and words are slower than local declared integers, longs and words. Anyone know why?

Thanks!
-Mike

[This message has been edited by Mike Joseph (edited April 01, 2000).]

IP: Logged

Enoch S Ceshkovsky
unregistered
posted April 02, 2000 05:22 PM           Edit/Delete Message   Reply w/Quote
quote:

What I learned (i think)
- Avoid using Singles, Doubles and ESPECIALLY DWords and Words with bitwise operations like AND, OR, and XOR
- Floating points are extremely slow when it comes to addition and substraction
- When bitshifting, Local variables are faster than registered variables. (Anyone know why?)
- EXT Floating points, and Doubles and Singles are slightly faster than other types when it comes to square roots and raising power.
- During multiplication and divisions, register integers, longs and words are slower than local declared integers, longs and words. Anyone know why?


1. It's bad for performance of your app and other apps running.
2. Yep
3. To put it simply, 2 extra copy operations per SHIFT call. SHIFT is a function and wants a variable pointer in Ebx. So the compiler creates a hidden LOCAL LONG, copies ESI/EDI into the long, calls SHIFT, copies the long back into ESI/EDI.
4. sqr is a floating. Power uses a function. (see #3)

I'm not too good with ASM, but I think PB thinks of DWORDs as floating point numbers trapped in an integer body.


' Integer divide with DWORD REGISTER
mov esi,20h ' var1 = 32
mov edi,10h ' var2 = 16
mov dword ptr [ebp-54h],edi ' edi into stack var
mov dword ptr [ebp-50h],0 ' 0 into stack var
fild qword ptr [ebp-54h] ' convert stack var to 80-bit float on float stack
mov dword ptr [ebp-54h],esi
mov dword ptr [ebp-50h],0
fild qword ptr [ebp-54h] ' convert stack var to 80-bit float on float stack
call 00401827 ' integer floating point divide function
00401827 frndint ' round float stack var to integer
00401829 fxch st(1) ' swap float stack
0040182B frndint ' round float stack var to integer
0040182D fdivp st(1),st ' floating divide
0040182F fldcw word ptr ds:[401516h] ' load float control register with constant?
00401835 frndint ' round float stack var to integer
00401837 fldcw word ptr ds:[401510h] ' load float control register with constant?
0040183D ret ' return
fistp qword ptr [ebp-54h] ' convert float stack var to 64-bit integer
mov esi,dword ptr [ebp-54h] ' copy stack var back to esi
' Integer divide with LONG REGISTER
mov esi,20h
mov edi,10h
mov eax,esi
cdq ' convert eax to 64-bit int
idiv eax,edi
mov esi,eax
' Multiply with DWORD REGISTER
mov esi,20h
mov edi,10h
mov dword ptr [ebp-54h],edi ' edi into stack var
mov dword ptr [ebp-50h],0 ' 0 into stack var
fild qword ptr [ebp-54h] ' convert stack var to 80-bit float on float stack
mov dword ptr [ebp-54h],esi
mov dword ptr [ebp-50h],0
fild qword ptr [ebp-54h] ' convert stack var to 80-bit float on float stack
fmulp st(1),st ' floating point multiply
fistp qword ptr [ebp-54h] ' convert float stack var to 64-bit integer
mov esi,dword ptr [ebp-54h] ' copy stack var back to esi
' Multiply with LONG REGISTER
mov esi,20h
mov edi,10h
mov eax,edi
imul esi
mov esi,eax

SUMMARY: DWORDs are slow for your app and other apps on the system! After thinking it over more, I would hope this is a compiler bug and not something the compiler does on purpose.

[This message has been edited by Enoch S Ceshkovsky (edited April 02, 2000).]

IP: Logged

Mike Joseph
Member
posted April 03, 2000 12:07 AM     Click Here to See the Profile for Mike Joseph     Edit/Delete Message   Reply w/Quote
Thanks for the feeback. I think you are right about DWord performance being a bug in the compiler. I ran the same 5 tests below against QUAD integers and even QUAD integers are significantly faster than DWORDS in 3 of the 5 tests.

Test1: XOR, AND, OR
Quad Integers = 29.41 seconds (faster than DWORDS by about ~10 secs)

Test2: SHIFT LEFT, SHIFT RIGHT
Quad Integers = 70.32 (quads are the slowest integer type for SHIFTing bits)

Test3: MULTIPLY, DIVIDE
Quad Integers = 5.77 seconds ( over twice as fast as DWORDS)

TEST4: ADDITIONS, SUBTRACTION
Quad Integers: 55.20 seconds (slowest of all integer types, but not as slow as floats)

TEST5: SQUARE ROOTS AND EXPONENTS
Quad Integers: 50.39 seconds (faster than DWORDS by ~9 seconds)

So if you need integers larger than LONG type and you dont plan on using SHIFT and are mostly doing multiplies/divides, bitwise operations with very few additions/subtractions, then you are better off going with QUAD integers over DWORDs.

-Mike

------------------

[This message has been edited by Mike Joseph (edited April 03, 2000).]

IP: Logged

Enoch S Ceshkovsky
unregistered
posted April 03, 2000 12:16 AM           Edit/Delete Message   Reply w/Quote
It appears PTR's are affected by this also, but worse they have to call to PB runtime functions that do the dword->quad->float, float->quad->dword. Quad should be faster because it doesnt have the extra step of dword->quad.
Regardless, PB shouldn't be using floating math with a quad, dword, or a pointer.

[This message has been edited by Enoch S Ceshkovsky (edited April 03, 2000).]

IP: Logged

Ron Pierce
Member
posted April 03, 2000 12:53 AM     Click Here to See the Profile for Ron Pierce     Edit/Delete Message   Reply w/Quote
Mike, try using inline asm to shift bits. I found a while ago that there was
something about PowerBASIC's Shift staement that needed optimizing. Hopefully
they'll change the Shift statement to a function which would permit more efficient
coding.

'SHIFT LEFT TestVariable1, 8&
'SHIFT LEFT TestVariable2, 8&
!shl TestVariable1, 8&
!shl TestVariable2, 8&
'SHIFT RIGHT TestVariable1, 8&
'SHIFT RIGHT TestVariable2, 8&
!shr TestVariable1, 8&
!shr TestVariable2, 8&


Ron

IP: Logged

Mike Joseph
Member
posted April 03, 2000 02:03 AM     Click Here to See the Profile for Mike Joseph     Edit/Delete Message   Reply w/Quote
Awesome, thank you! You made my day. The code ive been working on i was able improve the performance of processing a simulated 4 meg file to ~6.3 seconds. Switching to the inline assembly for shifting just dropped it down to ~4.8 seconds! I guess you really can squeeze blood out of a turnip

The VC6 compiled code I'm using for comparison is churning out ~2.9seconds (originally i thought it was ~7seconds, but i was seriously disappointed to find out i was compiling a debug version.. when i switched it to Release configuration, it improved from 7 to 2.9)

The LCC-Win32 compiler produces an exe that can process the file in ~4.9 seconds which makes the PB version faster by about .1 seconds.

Whats interesting is, i used the _asm shr x,8; in place of the x >>= 8; in both the VC compiled version and the LCC-Win32 versions and the performance stayed pretty much the same.

Guess now i'm going to have to start learnging asm in my spare time
Thanks again.
-Mike

------------------

IP: Logged

Ron Pierce
Member
posted April 03, 2000 05:07 AM     Click Here to See the Profile for Ron Pierce     Edit/Delete Message   Reply w/Quote
Mike, declare the same register variables in C and see if the results are different.
Were you able to do inline asm in LCC?

Ron

IP: Logged

Enoch S Ceshkovsky
unregistered
posted April 03, 2000 11:40 AM           Edit/Delete Message   Reply w/Quote
Another comment to make here.

MSGBOX FORMAT$(Itterations,"#,###,###") & " itterations in " & FORMAT$(TIMER - lngTime,"####.##") & " seconds
Should be:
Results = Timer-lngTime
MSGBOX FORMAT$(Itterations,"#,###,###") & " itterations in " & FORMAT$(Results,"####.##") & " seconds

The first isnt very fair to PB if you want to compare it against VC6.

------------------

IP: Logged

Mike Joseph
Member
posted April 03, 2000 02:40 PM     Click Here to See the Profile for Mike Joseph     Edit/Delete Message   Reply w/Quote
Enoch, i did update the position of the timer as you suggested. Results seem more or less the same. Unless the change is at least .1 second worth its hard to notice since even during seperate time tests on the same exact .exe results in variations of +/- 0.2 seconds (seems OS timing related)

Ron, in my PB code im already using register variables but I wasnt in the C version. I did a couple of tests on the C version compiled with VC and i was able to improve it from 2.9 to just about 2.09.

Using register variables with the LCC-win32 compiled version didnt seem to help at all. Regarding the use of _asm in the LCC version, i was in error. I in fact i had to comment those out to get it to compile.

I've added !xor assembly operators in place of some of the basic XOR operator and now my PB version is ~4.3 seconds which is better than the LCC version by .7 seconds. I have a few other XOR location's id like to do this to, but Im not sure how to get it to work with a Pointer. Any ideas? Here is the line I'd like to convert to asm.

X = X XOR @data.P(i)

Is there a faster way to do this then the following?

temp = @data.P(i)
!xor x,temp

It also seems that !xor in PB only works if either the source or the destination is a register variable... the AND operator seems to work with both. Also, I have used asm on the AND operator and have replaced the following line:

d = x AND 255 REM this is the original line

with an asm version which looks as follows

!and x,255 REM this version requires 3 lines of code
d = x
x= xOrig

This gives me a speed boost, but because the results are stored in the x variable, it needs to be restored since i must retain the value of x. Is there a way to avoid this?

-Mike

------------------

IP: Logged

Enoch S Ceshkovsky
unregistered
posted April 03, 2000 03:32 PM           Edit/Delete Message   Reply w/Quote
Valid for XOR,OR,AND:

DIM A AS LONG, B AS LONG, pB as LONG PTR
REGISTER R as LONG
pB = VARPTR(B)
!XOR R,A ' R has result
!XOR R,255
!XOR A,R
!XOR A,255
!XOR R, OFFSET pB ' R = R XOR @pB
!XOR OFFSET pB, R ' @pB = @pB XOR R
' or this
!MOV R, B
!XOR A, R ' A has result
' or this
!MOV R, A
!XOR B, R ' B has result

Here's a website you might find useful:
http://www.ece.uiuc.edu/ece291/books/artofasm.html
There are also ASM tutorials for PB programmers, if you search the Programming Forum.

[This message has been edited by Enoch S Ceshkovsky (edited April 03, 2000).]

IP: Logged

Mike Joseph
Member
posted April 03, 2000 06:23 PM     Click Here to See the Profile for Mike Joseph     Edit/Delete Message   Reply w/Quote
Cool, thanks Enoch.

-Mike

------------------

IP: Logged

Steve Hutchesson
Member
posted April 03, 2000 08:18 PM     Click Here to See the Profile for Steve Hutchesson     Edit/Delete Message   Reply w/Quote
I have got much the same type of results as Ron Pierce with the difference
between using inline asm instructions and the intrinsic PB functions. The
times below are specific to my own computer which is a 600 PIII so if you
have either a faster or slower box, just alter the loop count so that the
minimum duration is over a half a second and you will get results within a
percent or so.

I have constructed the loop in asm to remove any interaction with
different loop types and variables and what I have found is that there is
no difference between DWORD and LONG when using a variable of that size.

In the 3 comparisons, the shl/shr instructions using a register are a lot
faster that using a memory variable and the intrinsic functions are slower
again.

The analysis by Enoch makes this slower operation clear as it is doing
more than the inline functions per loop.

The speed difference between a global variable and a local variable has
been documented by Intel with stack variables being generally faster. I
think it has some to do with the relocations in the PE header but even if
you need to use a global variable, copy it to a stack variable if it is
possible to try and get additional speed out of the loop.

Another factor from the Intel documentation is the order and size of stack
variables. It is generally preferred to put the DWORD/LONG size variables
first as they align on a 4 byte boundary and are accessed in one read.
put WORD, BYTE and others after it.

LOCAL var1 as LONG
LOCAL var2 as DWORD
LOCAL var3 as WORD
LOCAL var4 as BYTE

If you need to control the alignment, there is a trick in PowerBASIC where
you can use dummy structure (UDT) which is DWORD aligned so if you have
a long mess of different size data types on the stack, you can use the
structure as a method of aligning the following piece of data.

LOCAL dummy as My_DWORD_Aligned_Dummy_UDT

I tested the speed diference with Do loops and For loops and it seems that
the exit condition in a Do loop is what slows it down if it is done using
If <condition> Then Exit Do. A Do loop is faster if the exit is done with
a ! cmp REG, NUMBER.

If the loop is truly speed critical, construct the loop using a label and
a conditional jump at the end of the loop and use a register as a counter
if it is possible.

Regards,

hutch@pbq.com.au



SUB LoopTest

#REGISTER NONE

LOCAL lvar1 as LONG 'DWORD

lvar1 = 10000

tc& = GetTickCount()

! xor edx, edx ; zero counter
! mov eax, 10000

ltSt:
! shl eax, 8 ; 500 ms
! shr eax, 8
' ! shl lvar1, 8 ; 2005 ms
' ! shr lvar1, 8
' SHIFT LEFT lvar1, 8 ' 3860 ms
' SHIFT RIGHT lvar1, 8
! inc edx
! cmp edx, 100000000
! jne ltSt

MessageBox hWnd,ByCopy str$(GetTickCount() - tc&), _
"clock",%MB_OK or %MB_ICONINFORMATION

END SUB

------------------

IP: Logged

Enoch S Ceshkovsky
unregistered
posted April 04, 2000 10:48 AM           Edit/Delete Message   Reply w/Quote
I wonder if any of the PB staff have a comment about why DWORDs & QUADs are using floating math...

------------------

IP: Logged

Dave Navarro
Member
posted April 04, 2000 11:46 AM     Click Here to See the Profile for Dave Navarro     Edit/Delete Message   Reply w/Quote
I don't know why DWORD's are using floating point, but the reason that QUADs (64-bit) using floating point is because the last time I looked, Intel wasn't shipping a 64-bit CPU yet.

--Dave

------------------
PowerBASIC Support
support@powerbasic.com

IP: Logged

Enoch S Ceshkovsky
unregistered
posted April 05, 2000 11:25 AM           Edit/Delete Message   Reply w/Quote
quote:

the reason that QUADs (64-bit) using floating point is because the last time I looked, Intel wasn't shipping a 64-bit CPU yet.


Cute
Quads can be manipulated faster with integer math, but it's easier for the compiler to use floats. Same idea as multiplying 32-bit variables in 16-bit. Something to add to the bottom of the list of "PB Optimizations".

------------------

IP: Logged

All times are EasternTime (US)

next newest topic | next oldest topic

Administrative Options: Close Topic | Archive/Move | Delete Topic
Post New Topic  Post A Reply
Hop to:

Contact Us | PowerBASIC BASIC Compilers

Copyright © 1999-2007 PowerBASIC, Inc. All Rights Reserved.


Ultimate Bulletin Board 5.45c