PowerBASIC Forums
  Third-Party Addons
  A Very Small Fast Parser ( Assembly code) (Page 4)

Post New Topic  Post A Reply
profile | register | preferences | faq | search

UBBFriend: Email This Page to Someone!
This topic is 4 pages long:   1  2  3  4 
next newest topic | next oldest topic
Author Topic:   A Very Small Fast Parser ( Assembly code)
Michael Mattias
Member
posted August 09, 2006 10:32 AM     Click Here to See the Profile for Michael Mattias     Edit/Delete Message   Reply w/Quote
>File input was not the issue here

I could not help myself.

This is simply a golden opportunity to extol the virtues of memory-mapped files!

IP: Logged

Charles Pegge
Member
posted August 19, 2006 02:52 PM     Click Here to See the Profile for Charles Pegge     Edit/Delete Message   Reply w/Quote
New: Fast Parser with Word List (PB Assembler)

Verily, this code addeth further functionality by enabling the use
of a Token word list in addition to the TokenType table. So thou
canst get specific tokens returned for specific words in ye list.

As before, it canst read the entire Bible in about 50-70 milliseconds
and returneth counts of various word types, but also provideth counts
on ye specified words.

(posting 19 Aug 2006) http://www.forum.it-berater.org/index.php?topic=158.0

------------------
Charles www.pegge.net

[This message has been edited by Charles Pegge (edited August 19, 2006).]

IP: Logged

Paul Squires
Member
posted August 19, 2006 03:31 PM     Click Here to See the Profile for Paul Squires     Edit/Delete Message   Reply w/Quote
Wow! I have a HUGE need for your ParserWords code. I am extremely
impressed with the speed at which it parses the code. It is
almost unbelievable. Thanks a million for sharing this with
us. I greatly appreciated it

------------------
Paul Squires
FireFly Visual Designer, Cheetah Database System, JellyFish Pro Editor
www.planetsquires.com

IP: Logged

Charles Pegge
Member
posted August 20, 2006 12:54 AM     Click Here to See the Profile for Charles Pegge     Edit/Delete Message   Reply w/Quote
Thanks Paul,
Always glad to help.

It's been a long time since I did assembler, but I am getting the
hang of it again. x86 processors are so sophisticated it is like
using a high level language but with added flexibility. With
hard coded string handling, iterative loops and floating point
maths in a variety of types, what else do we need? (Well maybe a
few more registers would be useful.)

If you have any ideas about semantics checking, which could be
handled by the parser, they would be very welcome.

------------------
Charles
www.pegge.net

[This message has been edited by Charles Pegge (edited August 20, 2006).]

IP: Logged

Nick Luick
Member
posted August 20, 2006 11:00 AM     Click Here to See the Profile for Nick Luick     Edit/Delete Message   Reply w/Quote
Charles,

I keep a working 20meg working ram disk, so exection & bible read
numbers are from ram.

247 asm, 83 bas, 330 msec total

barely let the enter key up and it was over.

on a P3-500mz win98 256 ram

------------------

IP: Logged

Paul Dixon
Member
posted August 20, 2006 11:41 AM     Click Here to See the Profile for Paul Dixon     Edit/Delete Message   Reply w/Quote
Charles,
I haven't read through the whole of your code but if you want it to be a little faster..

The LOOP opcode is slow compared to coding the same thing yourself. You should try to avoid using it.


When reading large blocks of data from start to finish, that data is always in main memory as the cache is too small to fit it all in. This runs a lot slower than cache.
To help out, there is a PREFETCH opcode which allows you to give the CPU a hint that you'll be needing data soon and, if the CPU has available memory bandwidth without slowing down your main code, it'll fetch the data into cache in advance of you needing it so when the data is needed the CPU doesn't stall while it's fetched from main memory.
It may be worth giving it a go in this program.

Unfortunately, the PB compiler doesn't recognise the PREFETCH opcodes so you have to assemble them by hand which can be a bit of a nuisance.

Paul.

------------------

IP: Logged

Charles Pegge
Member
posted August 20, 2006 05:27 PM     Click Here to See the Profile for Charles Pegge     Edit/Delete Message   Reply w/Quote
Paul,

I've just tried removing the LOOP instruction, replacing it with
DEC ECX: JNZ short.
The current demo is not sensitive enough to show any difference.

Couldnt get PREFETCH to improve performance but got it to slow
things down on various mem indexing registers, so maybe this Athlon is
already doing the right thing.

I need to check this further.


------------------
Charles
www.pegge.net

[This message has been edited by Charles Pegge (edited August 22, 2006).]

IP: Logged

Charles Pegge
Member
posted August 23, 2006 12:41 PM     Click Here to See the Profile for Charles Pegge     Edit/Delete Message   Reply w/Quote
Paul,

I confirm that LOOP takes a lot longer than DEC ECX: JNZ short ..

In an empty loop with 2gig repeats, the LOOP instruction took
3 seconds instead of 2 seconds (Athlon 3200).

I also found The best way to get a speed improvement is to run
an entire reading in pure ASM without using a Function call for
each token.

------------------
Charles
www.pegge.net

IP: Logged

Michael Mattias
Member
posted August 24, 2006 08:36 AM     Click Here to See the Profile for Michael Mattias     Edit/Delete Message   Reply w/Quote
quote:

I confirm that LOOP takes a lot longer than DEC ECX: JNZ short ..

In an empty loop with 2gig repeats, the LOOP instruction took
3 seconds instead of 2 seconds (Athlon 3200).


Um, don't the processor chip people publish tables of "clocks per instruction" anymore?


IP: Logged

Charles Pegge
Member
posted August 24, 2006 10:01 AM     Click Here to See the Profile for Charles Pegge     Edit/Delete Message   Reply w/Quote
Not on the Intel Architecture Software Developer’s Manual.
Too many CPUs, Clock speeds, architectures. Not like it used to
be.

Easier just to test

------------------
Charles
www.pegge.net

IP: Logged

Eddy Van Esch
Member
posted August 24, 2006 10:22 AM     Click Here to See the Profile for Eddy Van Esch     Edit/Delete Message   Reply w/Quote
quote:
Originally posted by Michael Mattias:
Um, don't the processor chip people publish tables of "clocks per instruction" anymore?


Due to pipelining and other optimisations, the sum of the clock cycles does not equal the amount of clock cycles needed to execute a series of commands.

Kind regards
------------------
Eddy
email: raimundo4u at yahoo dot com
www.devotechs.com -- HIME Huge Integer Math and Encryption library--

[This message has been edited by Eddy Van Esch (edited August 24, 2006).]

IP: Logged

Charles Pegge
Member
posted August 24, 2006 11:19 AM     Click Here to See the Profile for Charles Pegge     Edit/Delete Message   Reply w/Quote
For a mind-bending discourse on what a Pentium can get up to: http://www.intel.com/design/pentiumii/manuals/24319202.pdf
(Volume 3) Chapter 14. Optimization.

Cacheing, pipelining, parallel execution, branch prediction. And
that is within a single core.

------------------
Charles
www.pegge.net

PS
What kind of coffee do they drink at Intel?

[This message has been edited by Charles Pegge (edited August 24, 2006).]

IP: Logged

Eddy Van Esch
Member
posted August 24, 2006 03:45 PM     Click Here to See the Profile for Eddy Van Esch     Edit/Delete Message   Reply w/Quote
quote:
Originally posted by Charles Pegge:
For a mind-bending discourse on what a Pentium can get up to: http://www.intel.com/design/pentiumii/manuals/24319202.pdf
(Volume 3) Chapter 14. Optimization.


Indeed. You can read there why 2 + 2 sometimes equals 3 ..

BTW you can order that book (and the other 4 volumes) as hardcopy for free at Intel on simple request.

Kind regards

------------------
Eddy
email: raimundo4u at yahoo dot com
www.devotechs.com -- HIME Huge Integer Math and Encryption library--

IP: Logged

Paul Dixon
Member
posted August 24, 2006 07:13 PM     Click Here to See the Profile for Paul Dixon     Edit/Delete Message   Reply w/Quote
Charles,
I can't get much more speed out of this without spending a lot of time making sure every jump target is aligned, and that's rarely worth the effort.

I did get a little improvement using a PREFETCH but it was only about 2% and it messed up code alignment elsewhere so it's probably not worth it.

Two things I did notice:
1) in your loop which calls GetNextToken you calculate STRPTR(dString) everytime when it never changes. You could save time by doing that once outside the loop and assigning the result to a LONG.

2) your timing uses GetTickCount which has a resolution of only 15-16ms. Since you're timing things in the order of a few tens of ms then you really need to use a better resolution timer such as the Performance Counters. Here's a simple example of how to use it:

#INCLUDE "win32api.inc"
FUNCTION WINMAIN (BYVAL hInstance AS DWORD,BYVAL hPrevInstance AS DWORD,BYVAL lpCmdLine AS ASCIIZ PTR,BYVAL iCmdShow AS LONG) AS LONG

LOCAL freq AS QUAD
LOCAL TimeStart AS QUAD
LOCAL TimeEnd AS QUAD
LOCAL r AS LONG
LOCAL j AS LONG

'Get timer frequency.
QueryPerformanceFrequency freq

QueryPerformanceCounter TimeStart

FOR r = 1 TO 100000000
'waste a bit of time
j=2*r
NEXT

QueryPerformanceCounter TimeEnd

PRINT "That took " +FORMAT$(INT((TimeEnd-TimeStart)/freq *1000000),"###,###,###")+"uS."

WAITKEY$

END FUNCTION

There are some instruction latency numbers in the Athlon Optimisation Manual:
www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

There are more, including Pentium here, plus lots of other information:
http://www.agner.org/optimize/


<<I confirm that LOOP takes a lot longer than DEC ECX: JNZ short ..>>

If you read the Athlon Optimisation Guide it lists the LOOP instruction as having a latency of 8clks and it's a vector decode instruction which stops other instructions executing in parallel.
DEC ECX: JNZ short are listed with latencies of 1 and are directly encoded so they can run in parallel with other instructions.


And my figures after, on a 1900MHz Athlon, are Asm 49ms, Bas 17ms, Tot 66ms.

Paul.

------------------

IP: Logged

Charles Pegge
Member
posted August 25, 2006 11:52 AM     Click Here to See the Profile for Charles Pegge     Edit/Delete Message   Reply w/Quote
Many thanks Paul for the info and your investigations.

That AMD Volume on optimization is very useful, especially
the list of opcode latencies in the last chapter.

One further area of optimization is to exploit parallel execution
using pairs of instructions that do not immediately depend on each
other. (Athlons can do up to 3 DirectPath instructions simultaneously.)

At first glance, it seems that there are no further opportunities
of this available in the parsing routine.

But here is a test rig for investigating the effects of both
alignment and parallel execution. With this simple loop there
are no alignment effects but the twinning of instructions is
very obvious.

The absolute clock count is not significant, but is good for
making reliable comparative measurements. On the Core 2 Duo,
instructions can execute on a half clock cycle!


#COMPILE EXE
#DIM ALL
'
FUNCTION PBMAIN () AS LONG
LOCAL TimeStart AS QUAD, TimeEnd AS QUAD ' for time stamp, measuring cpu clock cycles
LOCAL st AS QUAD PTR , en AS QUAD PTR: st=VARPTR(TimeStart):en=VARPTR(TimeEnd)
'
'---------------------------'
! push ebx '
' ' approx because it is not a serialised instruction
' ' it may execute before or after other instructions
' ' in the pipeline.
! mov ebx,st ' var address where count is to be stored.
! db &h0f,&h31 ' RDTSC read time-stamp counter into edx:eax hi lo.
! mov [ebx],eax ' save low order 4 bytes.
! mov [ebx+4],edx ' save high order 4 bytes.
'---------------------------'
'
'
'------------------------------'
! db &h53,&h55 ' 2 ' push: ebx ebp
'------------------------------'
! db &hb9 ' 1 ' mov ecx, ...
! dd 10000 ' 4 ' number of loops dword
'! db &h90,&h90,&h90 ' x ' NOPs for alignment padding tests
'------------------------------'
repeats:
! db &hb8,&h00,&h00,&h00,&h00 ' mov eax,0
! db &hba,&h00,&h00,&h00,&h00 ' mov edx,0
'! db &hb2,&h00 ' mov dl,0

! db &hbb,&h00,&h00,&h00,&h00 ' mov ebx,0
! db &hbd,&h00,&h00,&h00,&h00 ' mov ebp,0

! db &hbe,&h00,&h00,&h00,&h00 ' mov esi,0
! db &hbf,&h00,&h00,&h00,&h00 ' mov edi,0
'
! db &h49 ' dec ecx
! jg repeats ' 3 ' jg repeats
'------------------------------'
! db &h5d,&h5b ' pop: ebp ebx
'------------------------------'
'
'
'---------------------------'
! mov ebx,[esp] ' restore ebx value without popping the stack
' ' approx because it is not a serialised instruction
' ' it may execute before or after other instructions
' ' in the pipeline.
! mov ebx,en ' var address where count is to be stored.
! db &h0f,&h31 ' RDTSC read time-stamp counter into edx:eax hi lo.
! mov [ebx],eax ' save low order 4 bytes.
! mov [ebx+4],edx ' save high order 4 bytes.
! pop ebx '
'---------------------------'
'
MSGBOX "That took "+STR$(TimeEnd-TimeStart)+" clocks."
END FUNCTION

------------------
Charles
www.pegge.net

[This message has been edited by Charles Pegge (edited August 25, 2006).]

IP: Logged


This topic is 4 pages long:   1  2  3  4 

All times are EasternTime (US)

next newest topic | next oldest topic

Administrative Options: Close Topic | Archive/Move | Delete Topic
Post New Topic  Post A Reply
Hop to:

Contact Us | PowerBASIC BASIC Compilers

Copyright © 1999-2006 PowerBASIC, Inc. All Rights Reserved.


Ultimate Bulletin Board 5.45c