PowerBASIC Forums
  Third-Party Addons
  A Very Small Fast Parser ( Assembly code) (Page 3)

Post New Topic  Post A Reply
profile | register | preferences | faq | search

UBBFriend: Email This Page to Someone!
This topic is 4 pages long:   1  2  3  4 
next newest topic | next oldest topic
Author Topic:   A Very Small Fast Parser ( Assembly code)
Paul Dixon
Member
posted August 05, 2006 06:18 PM     Click Here to See the Profile for Paul Dixon     Edit/Delete Message   Reply w/Quote
Charles,
I tried an alternative table solution with smaller tables but it didn't work too well so I gave up due to lack of time.
I tried tweaking your code and made it 20% faster so there are still savings to be made there.

If you're going to try jump tables, they don't need to be near the code.
With PB it's dificult to assemble the tables along with the code so I'd usually put them in an array.

Paul.

------------------

IP: Logged

Charles Pegge
Member
posted August 05, 2006 07:15 PM     Click Here to See the Profile for Charles Pegge     Edit/Delete Message   Reply w/Quote
Eros,

I am working on Paul's idea of a jump table system. It is more
complex to set up but should be faster still.

Meanwhile here is the existing code in SUB form.


' A Slightly Larger but Faster Parser (PB Assembler)
' enhanced version with look up table
' Presented as a SUB procedure for Eros Olmi

' Charles E V Pegge
' 6 August 2006
' charles@pegge.net
' www.pegge.net


' The table gives added ease of configuration and flexibility. The upper part of the table
' may be dispensed with if the your source text uses 7 bit ascii only.

' General parsing rules:
'
' space, and all lower ascii and all non alphanumerics are treated as delimiters. eg:
' But an underscore _ (ascii 95) is treated as an alphanumeric eg:
'
' FREE.LUNCH is interpreted as three words
' FREE_LUNCH is one word

' all other non alphanumerics including '-' (minus) are self delimiting eg:
'
' + - . , { } are always recognised as individual words even when unspaced
'
'
' But quoted text is treated as an entire word including the outer quotes.
' inner quotes can be embedded eg:
'
' "She said 'HI' "
' 'She said "HI" '
'
' this code uses the EAX ECX EDX EBX EBP and ESI registers -
' with minimal use of memory, most of the work is done in registers.
'
'
' Here we are defining the lookup table and its pointer as globals
' to incorporate into your initialisation code somewhere


' quick Ascii code reference:

' various control chars 0-31:
' null 0
' htab 9
' linefeed 10
' vtab 11
' formfeed 12
' creturn 13
' escape 27
' space 32
' !"#$%&' 33-39
' ()*+,-./ 40-47
' 0-9 48-57
' :;<=>?@ 58-64
' A-Z 65-90
' [\]^_` 91-96
' a-z 97-122
' {|}~ 123-126

#COMPILE EXE
#DIM ALL
#INCLUDE "WIN32API.INC"


DECLARE SUB GetNextToken(BYREF dString AS STRING, BYVAL lString AS LONG, BYREF TokenStart AS LONG , BYREF TokenLen AS LONG)


FUNCTION PBMAIN () AS LONG

' Lookup Table

DIM tbl AS GLOBAL STRING
'0123456789ABCDEF' ' codings
tbl= _ '
"0000000000000000" + _ '0 LOWER ASCII
"0000000000000000" + _ '1 space/ignore 0
"0121111211111111" + _ '2 symbols 1 general 2 quote marks
"5555555555111111" + _ '3 numbers 5
"1666666666666666" + _ '4 alpha uppercase 6
"6666666666611118" + _ '5 alpha uppercase 6
"1777777777777777" + _ '6 alpha lowercase 7
"7777777777711111" + _ '7 alpha lowercase 7
"0000000000000000" + _ '8 HIGHER ASCII
"0000000000000000" + _ '9
"0000000000000000" + _ 'A
"0000000000000000" + _ 'B
"0000000000000000" + _ 'C
"0000000000000000" + _ 'D
"0000000000000000" + _ 'E
"0000000000000000" + _ 'F
""

' other codings available:
' self delimiting types: (like sybols) 3 4
' other types (like alphanumerics) 8 9 eg underscore
DIM tblp AS GLOBAL BYTE PTR: tblp=STRPTR(tbl)

'test example

DIM dString AS STRING
DIM lString AS LONG
DIM TokenStart AS LONG: TokenStart=1
DIM TokenLen AS LONG
dString="Old_wine Free.Meal(Lunch,12.34) "+CHR$(34)+"bon mot (7)"+CHR$(34)+" "+CHR$(39)+CHR$(34)+"okay"+CHR$(34)+CHR$(39)
lString=LEN(dString)

' you can set TokenStart to any initial position
' and lString does not have to specify the whole string

DIM s AS STRING: s=dString
' lstring=0 ' testing null entry
WHILE TokenStart<=lString
GetNextToken(dString, lString, TokenStart, TokenLen)
s=s+$CRLF+MID$(dString,TokenStart,TokenLen) ' +str$(TokenLen)
TokenStart=TokenStart+TokenLen
WEND
MSGBOX s

END FUNCTION

SUB GetNextToken(BYREF dString AS STRING, BYVAL lString AS LONG, BYREF TokenStart AS LONG , BYREF TokenLen AS LONG)

DIM ip AS BYTE PTR: ip=STRPTR(dString) '
DIM ii AS LONG : ii=TokenStart ' byrefs must be transferred for Asm
DIM jj AS LONG

! push esi ' save register contents
! push ebx ' since we will be using them
! push ebp ' save original base pointer
! mov edx,ip ' pointer to start of string-1
! dec edx ' negative offset because StartPos starts at 1 not 0
! push edx ' hold this pointer on the stack
! push tblp ' hold the tbl pointer on the stack until all other variable
' have been loaded into registers
! mov eax,edx ' to find end of string
! add edx,lString ' points to the last character in the string
! add eax,ii ' points to the character indexed by StartPos
! mov ecx,eax ' default value for the start of word pointer
! dec eax ' prepare for pre incrementing loop
' with basic variables installed,
' it is now safe to overwrite the base pointer register
! mov ebp,eax ' move the char pointer to the base pointer
! xor eax,eax ' clear eax for another use
! pop ebx ' pop the tbl pointer into ebx ready to do lookups


next_space: ' for skipping leading spaces tabs etc
! inc ebp ' advance the char pointer
! cmp edx,ebp ' check if end of string
! jl short start_of_word ' exit loop if end of string
! mov al,[ebp] ' get character
! cmp al,32 ' is it a space or less ?
! jle short next_space ' if so then try text character

start_of_word: ' first non space char encountered

! mov ecx,ebp ' remember start position

! cmp edx,ebp ' check if end of string
! jl short end_of ' exit if end of string

! mov esi,eax ' use the char as an index
! mov al,[ebx+esi] ' to look up its type code in tbl
! cmp al,&h34 ' if it is more than a symbol
! jg short next_letter ' then it is an alphanumeric so try next char
! cmp al,&h32 ' is it a quote symbol?
! jz quote_string ' if so then process whole quote
! inc bp ' else increment char pointer and
! jmp short end_of ' return a self-delimiring symbol

quote_string: ' this is a quoted item, we need to find the end quote mark

quote_string_traverse: ' look for a matching quote mark
! mov al,[ebp] ' reload quote mark
! inc ebp ' advance pointer for next char
! cmp edx,ebp ' check if past the end of string
! jl short end_of ' finish if so
! cmp al,[ebp] ' check for matching quote mark
! jnz short quote_string_traverse
' if no match then try next
! inc ebp ' advance past end of quote
! jmp short end_of ' we have our quote so finish

next_letter: ' scanning an alpha numeric. Could be delimited by a symbol, space or lesser ascii

! inc ebp ' advance pointer
! cmp edx,ebp ' past end of string?
! jl short end_of ' if so then finish
! mov al,[ebp] ' get char
! mov esi,eax ' use the char as an index
! mov al,[ebx+esi] ' then lookup its type in tbl
! cmp al,&h34 ' if it is more than a space or symbol
! jg short next_letter ' then it is an alphanumeric so try then next char

end_of: ' results are ready to pass back to BASIC variables

! pop edx ' pop original position of string pointer
! mov eax,ebp ' hold the end position in eax before we pop the bp register
! pop ebp ' restore base pointer register
! pop ebx ' and the base register
! pop esi ' and the si index register
' so it is now safe to store to BASIC variables
! sub ecx,edx ' derive our start of word index
! mov ii,ecx ' store result in StartPos
! sub eax,edx ' derive the after-word index
! sub eax,ecx ' derive length of word
! mov jj,eax ' assign value to ii

' if the string is null or contains no words or symbols
' then StartPos will be greater than the length of the string

TokenStart=ii ' indirectly for byref vars
TokenLen=jj ' indirectly for byref vars


END SUB


------------------
Charles

IP: Logged

Michael Mattias
Member
posted August 06, 2006 11:14 AM     Click Here to See the Profile for Michael Mattias     Edit/Delete Message   Reply w/Quote
I just looked at the source code for the demo.

While that sample (one dinky little sentence) is nicely constructed to test various data condition, it is simply too tiny to claim any kind of 'performance benchmarking' data.

Why don't you test it on the "bible search" file used for the contest last year?

Let's see if I can find a link to that....yup, here it is...

Here's the original contest announcement, contains link to the package containing the test file:

http://www.powerbasic.com/support/forums/Forum12/HTML/001733.html

(I tested the download, and the file is still out there on my web site!)


IP: Logged

Charles Pegge
Member
posted August 06, 2006 12:52 PM     Click Here to See the Profile for Charles Pegge     Edit/Delete Message   Reply w/Quote
Michael,
I have just downloaded the zip. Thanks!

The little test string was there primarily to test functionality.
I hadnt anticipated that relative performance would become so
interesting.

I have just written a pure jump-table version of the parser
that uses no conditional logic at all. Surprisingly, it performed
a lot slower than the version you see above. The Athlon did not
seem to appreciate calculating indirect jumps for each character.

I'll try out the 2 versions on your text and see what happens.

Perhaps there ought to be a standard measure of parsing speed:
Bibles per Second?

Paul,
Thanks for looking at the code. Devising the jump table system
is not quite as straight forward as one might anticipate but an
educational exercise for me anyway.

How did you manage to tweak the original for an extra 20%?

------------------
Charles

IP: Logged

Brad D Byrne
Member
posted August 06, 2006 02:12 PM     Click Here to See the Profile for Brad D Byrne     Edit/Delete Message   Reply w/Quote
Charles, Impressive numbers!! Thanks!

also, as far as the jump table, wouldn't it be better to access it as
a data block instead of through a string? here's a snip from one of Hutch's
posts,

 

chtbl:
! db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
! db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ' 31
! db 0,1,0,1,1,1,1,0,0,0,0,0,0,0,0,0 ' 47
! db 1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,1 ' 63 ' numbers
! db 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 ' 79 ' upper case
! db 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 ' 95
! db 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 ' 111 ' lower case
! db 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 ' 127
! db 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
! db 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
! db 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
! db 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
! db 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
! db 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
! db 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
! db 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1

and then call ie,

! cmp BYTE PTR chtbl[eax], 1 ' test if BYTE is allowable character

also, I wouldn't be suprised if the tree type structure as your original
does not ultimately win out, havn't had time to play with it yet

------------------
Washington DC Area
Borje's "Poff's" is likely the BEST tool for learning PB.. http://www.tolkenxp.com/pb
And a few PBTool's & Beginner Help:http://sweetheartgames.com/PBTools/JumpStart.html
& Another Good Resource: http://www.fredshack.com/docs/powerbasic.html

IP: Logged

Michael Mattias
Member
posted August 06, 2006 02:17 PM     Click Here to See the Profile for Michael Mattias     Edit/Delete Message   Reply w/Quote
quote:

Perhaps there ought to be a standard measure of parsing speed: Bibles per Second?

I'm not sure we are ready for a new unit of measure for speed, but I certainly would not object to "the standard" for text-parsing, file access and other "Hey, look how fast my text processing is!" code were to be that same bible text file.

There's some sneaky data conditions in that file, too.


IP: Logged

Brad D Byrne
Member
posted August 06, 2006 02:23 PM     Click Here to See the Profile for Brad D Byrne     Edit/Delete Message   Reply w/Quote
just an idea, can we store the jump to label in the table, allowing us to
eliminate the compare? ie.


chtbl:
! db next_space:,next_space:,next_space:,next_space:.....
! db next_space:,next_space:,next_space:,next_space:.....
! db next_space:,delimitr:,delimitr:,delimitr:,delimitr:,....
! db want_char,want_char,want_char,want_char,want_char,....
! db want_char,want_char,want_char,want_char,want_char,....

etc.
then

! jmp BYTE PTR chtbl[eax] ' jmp to stored label

------------------
Washington DC Area
Borje's "Poff's" is likely the BEST tool for learning PB.. http://www.tolkenxp.com/pb
And a few PBTool's & Beginner Help:http://sweetheartgames.com/PBTools/JumpStart.html
& Another Good Resource: http://www.fredshack.com/docs/powerbasic.html

IP: Logged

Paul Dixon
Member
posted August 06, 2006 02:36 PM     Click Here to See the Profile for Paul Dixon     Edit/Delete Message   Reply w/Quote
Charles,
the extra speed from tweaking your code was straight forward.

First, esi.
You only use esi twice as follows:

! mov esi,eax               ' use the char as an index
! mov al,[ebx+esi] ' to look up its type code in tbl

Why don't you just use eax in the index and then you don't need to transfer the value to esi at all saving 2 opcodes:
! mov al,[ebx+eax]  

Of course now you aren't using esi at all so you no longer need to push/pop it to the stack saving another 2 opcodes.

Second, tblp.
At the start you have:

! push tblp                 ' hold the tbl pointer on the stack until all other variable
' have been loaded into registers
.. other code
! pop ebx ' pop the tbl pointer into ebx ready to do lookups

But you aren't using ebx between that first instruction and the last so why are you pushing tblp? Why not just load it into ebx as soon as ebx becomes available saving 1 opcode:
'! mov ebx,tblp               ' hold the tbl pointer on the stack until all other variable
' have been loaded into registers

Third. Use the size extension instructions, they're as fast as the normal ones but can save opcodes and time:
You have

! xor eax,eax               ' clear eax for another use  

Delete that saving 1 opcode and load eax with
! movzx eax,byte ptr [ebp]              ' get character  

instead. This will clear the upper part of the register for you.
More importantly, you should use the movzx version of the instruction anywhere that you use move to al and you don't need upper part of the register. This is because the CPU has trouble handling a reference to any full 32-bit register immediatley after part of that register has been written to.
So e.g. in your code where you have:
! mov al,[ebp]              ' get character 

you should change it to
! movzx eax,byte ptr [ebp]              ' get char 

Try them out and see if it improves your code speed.

Paul.

------------------

IP: Logged

Eros Olmi
Member
posted August 06, 2006 05:28 PM     Click Here to See the Profile for Eros Olmi     Edit/Delete Message   Reply w/Quote
Charles, thanks a lot for SUB version.
What about transforming it into a function or adding another BYREF param
in order to return Token type? Delimiter, Number, String, QuotedString, ...

Thanks again.

------------------
Eros Olmi
eros.olmi@thinbasic.com

IP: Logged

Charles Pegge
Member
posted August 06, 2006 06:58 PM     Click Here to See the Profile for Charles Pegge     Edit/Delete Message   Reply w/Quote
Paul,

Many thanks.

My last major encounter with x86 was back in 1984, designing
a processor board built around the 8088. (5 meg clocking then).
It even used an 8087 coprocessor, and 64k dram!

So all these luxury Pentium opcodes are a new experience for me.
I have some catching up to do. I've fed those changes into
the code but I cant detect any performance change on the Athlon 3200.
Must depend on how these devices are microcoded, whether movzx is
a compound operation.

But still happy to lose a few opcodes.

Incidently I found that you cant do this in PB:
! fld double ptr [ebx] ' etc
Am I missing spmething?


Brad,
Thanks for introducing me to inline datablocks.
Apart from lookup tables its A good opportunity for tweaking
opcode, when ASM doesnt give you what you want.

For the jump-table version, I used an array of LONGs and generated the
entries iteratively. (there were 1024 addresses to put in!).
As it does not perform as well as the other code, I'll work on it
a little longer before posting.

------------------
Charles

IP: Logged

Paul Dixon
Member
posted August 06, 2006 07:27 PM     Click Here to See the Profile for Paul Dixon     Edit/Delete Message   Reply w/Quote
Charles,
no speed increase? When I tried it on my Athlon 2600+ I got similar figures to your Athlon 3200, that's 23% faster.

<<! fld double ptr [ebx] >>

Try DWORD instead of DOUBLE.

Paul.

------------------

IP: Logged

Charles Pegge
Member
posted August 08, 2006 10:26 AM     Click Here to See the Profile for Charles Pegge     Edit/Delete Message   Reply w/Quote
Eros, Paul, Michael

I have posted the latest version.

Managed to shed a few more opcodes but more functionality
added. And you can easily change the parsing rules. whole
Bible reading demo included, which the Athlon 3200 accomplishes
in about 1/18th second.

I think this has the right logistics to do useful parsing. But
putting it into a Function and interfacing a high level environment
does incur a speed penalty. I try to show this in the Message box
(with partial success). Not getting a stable time reading.

The code is quite heavily annotated so its too long to show here.

download from:
http://www.forum.it-berater.org/index.php?topic=158.0

------------------
Charles

IP: Logged

Michael Mattias
Member
posted August 08, 2006 12:29 PM     Click Here to See the Profile for Michael Mattias     Edit/Delete Message   Reply w/Quote
quote:

... whole Bible reading demo included, which the Athlon 3200 accomplishes in about 1/18th second.

I thought that 1/18th second number was bogus [well, the word which actually occurred to me started with 'b'] so I d/l'd the package again.

It's not really doing the "whole Bible reading demo" in that your timer does not include the time it takes to open the file and load it into a string.


' another sample with slighty different lexical rules
OPEN "bible12.txt" FOR BINARY ACCESS READ AS 1:GET$ 1,LOF(1),dString:CLOSE 1

Ok, so this is supposed to be showing off the pure "text parsing" code, not necessarily the time it takes to actually do something, which perforce includes the time required to open the file and load to a string.

But as long as I brought it up...instead of loading to a string (which takes time and resouces), since all you need is the address and length of the string data, this is an IDEAL place to use File Mapping.

Mapping a file gets you the address and length of the file data (as a pointer) without the overhead (both performance and user memory resources) of creating an OLE string.

And you know, it's just so darned easy to do. All you do is download this demo ....

Memory-Mapped File Version of LINE INPUT
Overview: http://www.powerbasic.com/support/forums/Forum6/HTML/004300.html

.. and call "MapThisFile" and bingo, there's your address and length.

MCM

IP: Logged

Charles Pegge
Member
posted August 08, 2006 05:29 PM     Click Here to See the Profile for Charles Pegge     Edit/Delete Message   Reply w/Quote
File input was not the issue here - just getting the most
efficient x86-friendly algorithm. Making comparative measurements..
And it's more geared towards computer scripts.

Strictly speaking it should be called a lexer rather than a
parser, since it does not itself translate words or check syntax
yet.

------------------
Charles

IP: Logged

Eros Olmi
Member
posted August 09, 2006 03:32 AM     Click Here to See the Profile for Eros Olmi     Edit/Delete Message   Reply w/Quote
Charles, thanks a lot for your code!

------------------
Eros Olmi
eros.olmi@thinbasic.com

IP: Logged


This topic is 4 pages long:   1  2  3  4 

All times are EasternTime (US)

next newest topic | next oldest topic

Administrative Options: Close Topic | Archive/Move | Delete Topic
Post New Topic  Post A Reply
Hop to:

Contact Us | PowerBASIC BASIC Compilers

Copyright © 1999-2006 PowerBASIC, Inc. All Rights Reserved.


Ultimate Bulletin Board 5.45c