Compilers
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
 
User Name:
Password:
Remember me
Go Back   Web Development Archives FAQs Compilers

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Display Modes
 
Unread Web Development Archives Sponsor:
  #1  
Old June 6th, 2008, 10:20 AM
mailings
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Intricate problem with scannerless LALR(1) parser

Hi there!

I'm currently developing a LALR(1) parser generator which generates
parsers running directly on the input character stream, which means
that it is scannerless. Lexemes or higher syntacical elements are
directly recognized by the grammar itself. The development of the
generator has nearly finished, and I added many utilizing features to
it, for example whitespace recognition and a keyword-feature, matching
terminals which exist of more than one character using a definite
state machine.

However, that keyword-feature has one side effect which I would
discuss on the mailing list. Given the following grammar:

start: a

a: b 'XX'

b: c | '[' b ']'

c: 'X' | c 'X'


where "start", "a", "b" and "c" are nonterminals ("start" is goal
symbol) and "XX", "X", "[" and "]" are the terminals ("XX" is a so
called keyword, "X", "[" and "]" are characters), my generator
constructs the following LALR(1) states (I used the state reduction
algorithm as described by [Goose], using SHIFTREDUCE-Actions):

State 0

Kernel:
start: .a

State 1

Kernel:
a: b .'XX'

State 2

Kernel:
b: c . { 'XX' ']' }
c: c .'X'

State 3

Kernel:
b: '[' .b ']'

State 4

Kernel:
b: '[' b .']'


The configuration in State 2 is now the sticking point: The parser
will reduce here if it detects a keyword "XX", which is even made up
from the same character as the string matched by nonterminal "c". So a
parser with such a grammar, which has no shift-reduce or reduce-reduce
conflicts at all, will only accept the input strings "XXX" and
"[XXX]". The parse tree for "XXX":

start
|
a
/
b "XX"
/
c
/
"X"

Strings like "XXXX" or "[XXX]XX" cannot be parsed using this parsing
technique, althought they look valid.

My question is now: Is there a possiblity to detect problems of this
kind beginning from state 2? My first idea was that it must be tried
to parse the keyword that should be reduced (in this case: "XX") using
nonterminal "c", but how could my parser generator invoke this in any
possibly situation this kind of problems comes up? How can it find out
that "c" is the nonterminal which possibly matches the keyword? Please
note, that now shift-reduce conflict will come up, because "X" and
"XX" are two different terminals.

Any ideas on this topic?

Best regards
Jan
[Your grammar is ambiguous. To see where, replace "XX" with xx and
define it like this: xx: "X" "X"

Assuming you want to use normal tokenizing rules, your "X" token is
really "X followed by something other than a letter or digit, and if
the something is white space, skip over the white space. , and skip
comments, too." Now you know why we use separate lexer and parser
generators, because they need separate state machines to keep the
parser grammar frome exploding. -John]


Reply With Quote
  #2  
Old June 9th, 2008, 07:40 PM
kamal
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Intricate problem with scannerless LALR(1) parser

possibly situation this kind of problems comes up? How can it find out
that "c" is the nonterminal which possibly matches the keyword? Please
note, that now shift-reduce conflict will come up, because "X" and
"XX" are two different terminals.
>

Any ideas on this topic?
>

Best regards
Jan
[Your grammar is ambiguous. To see where, replace "XX" with xx and
define it like this: xx: "X" "X"

yes -if the grammar had terminal y instead of XX -then (I think) there
wouldn't have been a conflict. The LALR(1) grammar will resolve shift-
reduce conflicts in favour of shifts, and you can use yyerror()[error
recovery] to recover from a reduce-reduce conflict.

Further, I am not sure why you want a scannerless parser while most
people would want to use a scanner to provide syntactic sugar [as in
hide inherent ambiguities in the language].

regards
-kamal

Assuming you want to use normal tokenizing rules, your "X" token is
really "X followed by something other than a letter or digit, and if
the something is white space, skip over the white space. , and skip
comments, too." Now you know why we use separate lexer and parser
generators, because they need separate state machines to keep the
parser grammar frome exploding. -John]


Reply With Quote
  #3  
Old June 26th, 2008, 08:00 PM
glen herrmannsfeldt
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Intricate problem with scannerless LALR(1) parser

Paul B Mann wrote:

>>However, that keyword-feature has one side effect which I would
>>discuss on the mailing list. Given the following grammar:


>>start: a
>>a: b 'XX'
>>b: c | '[' b ']'
>>c: 'X' | c 'X'


>>[Your grammar is ambiguous. To see where, replace "XX" with xx and
>>define it like this: xx: "X" "X"

(snip)

I agree with John.

XX is valid, but ambiguous, either a keyword or two X's.

Are you sure it is ambiguous? Maybe it isn't LALR(1), though. Since
there is no a in b or c, it looks to me that it only matches strings
ending in 'XX'. If b is 'XX' then it still must be followed by
another 'XX'.

It looks to me that it matches 0 or more '['
followed by 1 or more 'X' followed by the same number
of ']' as there were '[', followed by 'XX'.

Looking again, though, it is ambiguous without enough
look ahead to see where the end is. You find out too
late that the last 'XX' are already gone. Easier to parse
from the end back to the start.

-- glen
[You're right. The grammar isn't really ambiguous, but it needs more than
one token of lookahead. -John]


Reply With Quote
  #4  
Old June 27th, 2008, 01:01 AM
Paul B Mann
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Intricate problem with scannerless LALR(1) parser

start: a
a: b 'XX'
b: c | '[' b ']'
c: 'X' | c 'X'
>
>XX is valid, but ambiguous, either a keyword or two X's.


Are you sure it is ambiguous? Maybe it isn't LALR(1), though.
Since there is no a in b or c, it looks to me that it only matches
strings ending in 'XX'. If b is 'XX' then it still must be followed
by another 'XX'.

You are right. I was wrong.

X XX is valid
XX XX is valid
XXX XX is valid.

The parser needs 2 characters of lookahead to correctly
parse this language, depending on how spaces are processed.


Paul B Mann

Reply With Quote
  #5  
Old June 28th, 2008, 03:20 PM
SLK Mail
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Intricate problem with scannerless LALR(1) parser

The grammar itself as defined in the original note is trivially LL(1)
after the left recursion is removed. A quick yacc check would probably
say it is LALR(1) as well. Another reason that scanners and parsers
are kept separate is that the more powerful pushdown automata is not
needed for scanning. If you use a sledge hammer to crack a walnut, do
not be too surprised if you end up with squashed nut meat.


Reply With Quote
Reply

Viewing: Web Development Archives FAQs Compilers > Intricate problem with scannerless LALR(1) parser


Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are Off
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest Threads | Shoutbox
Forum Jump


Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 





© 2003-2009 by Developer Shed. All rights reserved. DS Cluster 3 hosted by Hostway
Stay green...Green IT