|
|
|
| |||||||||
![]() |
|
|
«
Previous Thread
|
Next Thread
»
|
Thread Tools | Search this Thread | Display Modes |
|
#1
|
|||
|
|||
|
Intricate problem with scannerless LALR(1) parser
Hi there!
I'm currently developing a LALR(1) parser generator which generates parsers running directly on the input character stream, which means that it is scannerless. Lexemes or higher syntacical elements are directly recognized by the grammar itself. The development of the generator has nearly finished, and I added many utilizing features to it, for example whitespace recognition and a keyword-feature, matching terminals which exist of more than one character using a definite state machine. However, that keyword-feature has one side effect which I would discuss on the mailing list. Given the following grammar: start: a a: b 'XX' b: c | '[' b ']' c: 'X' | c 'X' where "start", "a", "b" and "c" are nonterminals ("start" is goal symbol) and "XX", "X", "[" and "]" are the terminals ("XX" is a so called keyword, "X", "[" and "]" are characters), my generator constructs the following LALR(1) states (I used the state reduction algorithm as described by [Goose], using SHIFTREDUCE-Actions): State 0 Kernel: start: .a State 1 Kernel: a: b .'XX' State 2 Kernel: b: c . { 'XX' ']' } c: c .'X' State 3 Kernel: b: '[' .b ']' State 4 Kernel: b: '[' b .']' The configuration in State 2 is now the sticking point: The parser will reduce here if it detects a keyword "XX", which is even made up from the same character as the string matched by nonterminal "c". So a parser with such a grammar, which has no shift-reduce or reduce-reduce conflicts at all, will only accept the input strings "XXX" and "[XXX]". The parse tree for "XXX": start | a / b "XX" / c / "X" Strings like "XXXX" or "[XXX]XX" cannot be parsed using this parsing technique, althought they look valid. My question is now: Is there a possiblity to detect problems of this kind beginning from state 2? My first idea was that it must be tried to parse the keyword that should be reduced (in this case: "XX") using nonterminal "c", but how could my parser generator invoke this in any possibly situation this kind of problems comes up? How can it find out that "c" is the nonterminal which possibly matches the keyword? Please note, that now shift-reduce conflict will come up, because "X" and "XX" are two different terminals. Any ideas on this topic? Best regards Jan [Your grammar is ambiguous. To see where, replace "XX" with xx and define it like this: xx: "X" "X" Assuming you want to use normal tokenizing rules, your "X" token is really "X followed by something other than a letter or digit, and if the something is white space, skip over the white space. , and skip comments, too." Now you know why we use separate lexer and parser generators, because they need separate state machines to keep the parser grammar frome exploding. -John] |
|
#2
|
|||
|
|||
|
Intricate problem with scannerless LALR(1) parser
possibly situation this kind of problems comes up? How can it find out
that "c" is the nonterminal which possibly matches the keyword? Please note, that now shift-reduce conflict will come up, because "X" and "XX" are two different terminals. > Any ideas on this topic? > Best regards Jan [Your grammar is ambiguous. To see where, replace "XX" with xx and define it like this: xx: "X" "X" yes -if the grammar had terminal y instead of XX -then (I think) there wouldn't have been a conflict. The LALR(1) grammar will resolve shift- reduce conflicts in favour of shifts, and you can use yyerror()[error recovery] to recover from a reduce-reduce conflict. Further, I am not sure why you want a scannerless parser while most people would want to use a scanner to provide syntactic sugar [as in hide inherent ambiguities in the language]. regards -kamal Assuming you want to use normal tokenizing rules, your "X" token is really "X followed by something other than a letter or digit, and if the something is white space, skip over the white space. , and skip comments, too." Now you know why we use separate lexer and parser generators, because they need separate state machines to keep the parser grammar frome exploding. -John] |
|
#3
|
|||
|
|||
|
Intricate problem with scannerless LALR(1) parser
Paul B Mann wrote:
>>However, that keyword-feature has one side effect which I would >>discuss on the mailing list. Given the following grammar: >>start: a >>a: b 'XX' >>b: c | '[' b ']' >>c: 'X' | c 'X' >>[Your grammar is ambiguous. To see where, replace "XX" with xx and >>define it like this: xx: "X" "X" (snip) I agree with John. XX is valid, but ambiguous, either a keyword or two X's. Are you sure it is ambiguous? Maybe it isn't LALR(1), though. Since there is no a in b or c, it looks to me that it only matches strings ending in 'XX'. If b is 'XX' then it still must be followed by another 'XX'. It looks to me that it matches 0 or more '[' followed by 1 or more 'X' followed by the same number of ']' as there were '[', followed by 'XX'. Looking again, though, it is ambiguous without enough look ahead to see where the end is. You find out too late that the last 'XX' are already gone. Easier to parse from the end back to the start. -- glen [You're right. The grammar isn't really ambiguous, but it needs more than one token of lookahead. -John] |
|
#4
|
|||
|
|||
|
Intricate problem with scannerless LALR(1) parser
start: a
a: b 'XX' b: c | '[' b ']' c: 'X' | c 'X' > >XX is valid, but ambiguous, either a keyword or two X's. Are you sure it is ambiguous? Maybe it isn't LALR(1), though. Since there is no a in b or c, it looks to me that it only matches strings ending in 'XX'. If b is 'XX' then it still must be followed by another 'XX'. You are right. I was wrong. X XX is valid XX XX is valid XXX XX is valid. The parser needs 2 characters of lookahead to correctly parse this language, depending on how spaces are processed. Paul B Mann |
|
#5
|
|||
|
|||
|
Intricate problem with scannerless LALR(1) parser
The grammar itself as defined in the original note is trivially LL(1)
after the left recursion is removed. A quick yacc check would probably say it is LALR(1) as well. Another reason that scanners and parsers are kept separate is that the more powerful pushdown automata is not needed for scanning. If you use a sledge hammer to crack a walnut, do not be too surprised if you end up with squashed nut meat. |
![]() |
| Viewing: Web Development Archives > FAQs > Compilers > Intricate problem with scannerless LALR(1) parser |
| Thread Tools | Search this Thread |
| Display Modes | Rate This Thread |
|
|
|
|