|
|
|
|
|||||||||||||||||||||||||||
![]() |
|
|
«
Previous Thread
|
Next Thread
»
|
Thread Tools | Search this Thread | Display Modes |
|
#1
|
|||
|
|||
|
Number of compiler passes
I was wondering, with the computers of today, with as much memory as they
have, is there still a reason to limit the amount of passes a compiler makes? Sure, with each additional pass there will be some overhead, but other than that, is there a disadvantage I'm missing? In particular, I'm working on a source-to-source compiler at the moment. >From my own language to C. It now has the following stages: * Flex + Bison to create AST * AST to find declarations (and re-declarations) * AST to detect which variables are used in which functions * AST to find undeclared references * AST to find the type of each expression and note type mismatches * AST to find the access-type (read/write/both) of each expression * AST to perform optimizations * AST to translate to target language Pass 2 to 4 may seem excessive, but I believe they are necessary because of the following features: * Var/function declarations anywhere in the code (incl. nested functions) * Forward referencing of constant functions and constant variables Some of the stages could be merged, I suppose. But it would not help code readability or provability to do so. I would like to hear some opinions about this. Thanks! -- Michiel Helvensteijn [Back in the era of coal fired mainframes, compilers were divided into multiple passes so the code for each pass could be overlaid. These days I agree that there isn't a strong argument either way except perhaps better cache data locality if you combine passes. -John] |
|
#2
|
|||
|
|||
|
Number of compiler passes
Michiel wrote:
I was wondering, with the computers of today, with as much memory as they have, is there still a reason to limit the amount of passes a compiler makes? Sure, with each additional pass there will be some overhead, but other than that, is there a disadvantage I'm missing? (snip) John wrote: [Back in the era of coal fired mainframes, compilers were divided into multiple passes so the code for each pass could be overlaid. These days I agree that there isn't a strong argument either way except perhaps better cache data locality if you combine passes. -John] Also, in those days the intermediate data usually went to disk. I know discussion about changes in the S/360 assembler, I believe from F to XF, went from four passes to three, one less write and read back from disk, with significant speed-up. Not that I think we should go back to those days, but I am still surprised at the amount of memory needed by some modern compilers. There are people trying to run gcc on MVS with 24 bit addressing, with the ability to compile itself, but it doesn't seem to fit. There was also a discussion on extended addressing for the PDP-10, such that one might be able to run gcc. Again, even with the full extended addressing of TPS-20, it seems that it won't fit. -- glen |
|
#3
|
|||
|
|||
|
Number of compiler passes
George Neuner wrote:
Flex + Bison to create AST 1 AST to find declarations (and re-declarations) 2 AST to detect which variables are used in which functions 3 AST to find undeclared references 4 AST to find the type of each expression and note type mismatches 5 AST to find the access-type (read/write/both) of each expression 6 AST to perform optimizations 7 AST to translate to target language > You don't give much detail of your language, but I don't see how "readability or provability" would be harmed by combining passes 1 & 2 Assuming variables must be declared, processing declarations gives you all you need - the symbol table(s) you construct will provide scoping for non-local variables and functions. I don't see a need for two passes. In the source language, a function can use variables from outside its scope. More importantly, these variables could be declared after the function is. This is okay as long as this function is not called before the declaration of these variables. and passes 3 & 4. [I numbered them above for convenience] Similarly, undeclared references will not be represented in your symbol table. They will be caught when you look them up to determine their type. You can assign them a special nil type so going forward any expressions using them fail to type properly. You're right about these, though. They could be merged and I will certainly consider it. The only reason for me to keep them separate is because they are conceptually different operations. I don't know what you mean by "access-type" of an expression - unless you're talking about I/ in which case I still don't know what you mean. We could be having a terminology disconnect - to my mind an "expression" always computes or side effects, so it always reads operands and sometimes writes a result. Yes, I suppose I am stretching the definition a bit. An expression has a type (bool, int, etc.). But to me it also has an access type. This is either readonly, writeonly or readwrite. It basically specifies whether it is an r-value, an l-value or both. I would think that your optimization phase itself would consist of multiple passes as a number of optimizations are possible even on an AST form. Yep, probably. I haven't actually started on this stage yet. And since you are targeting C from a language that has nested functions, you might actually need more than one pass to affect the translation. Simple function nesting as in Pascal can be handled easily, but full closures as in Lisp requires constructing runtime data structures and a lot more analysis. For now, functions in the source language cannot be passed, assigned or returned yet. But in the future I hope to include functional aspects into the language. I'm already planning for it in the design. Thanks for your reply, -- Michiel Helvensteijn |
|
#4
|
|||
|
|||
|
Number of compiler passes
Tue, 2008-07-22 at 21:31 +0200, Michiel wrote:
For now, functions in the source language cannot be passed, assigned or returned yet. But in the future I hope to include functional aspects into the language. I'm already planning for it in the design. > Thanks for your reply, of pure curiosity: what kind of language are you trying to develop? What are you planning to do with it? That would interest me. Regards, Denis |
|
#5
|
|||
|
|||
|
Number of compiler passes
of pure curiosity: what kind of language are you trying to develop?
What are you planning to do with it? That would interest me. I have already answered you by mail, but I'll repeat it for the group (minus the question about your own language): It's a language I'm designing and implementing with a colleague for our Masters degree in Computer Science. the surface it's just an imperative language with some nice tricks (like the nested functions). But we're using it primarily for research on automatic correctness proving. The language has a special syntax for function preconditions, postconditions and invariants. The hope is that most of these assertions can be proved at compile time. This framework would also clear the way for a lot of new compiler optimizations. It also has other benefits. That's its most important feature. As a part of my research I hope to add new paradigms (P, functional) to the language one by one as long as they can still be supported by this framework. By the way, I'm still not sure what to name the language. Any ideas? -- Michiel Helvensteijn |
|
#6
|
|||
|
|||
|
Number of compiler passes
George Neuner wrote:
>>In the source language, a function can use variables from outside its >>scope. More importantly, these variables could be declared after the >>function is. This is okay as long as this function is not called >>before the declaration of these variables. > As long as non-local variables are transitively in the lexical scope chain of the function that uses them you can still gather all the declarations in one pass. If they're not, you have a language that will be error prone and confusing to use. They are. I think I understand. You are proposing a breadth first traversal? We are using the visitor pattern right now, which lends itself best to a depth first traversal. The simplest way to do it is to organize your symbol "table" as a stack of lists[*]. This allows for lexical shadowing of same-named variables in enclosing scopes. This is pretty much what I'm doing already. :-) >>Yes, I suppose I am stretching the definition a bit. An expression has a >>type (bool, int, etc.). But to me it also has an access type. This is >>either readonly, writeonly or readwrite. It basically specifies whether it >>is an r-value, an l-value or both. > I believe you are overthinking the problem. > L-value or R-value is a matter of usage, not type. I've seen the terms used in both contexts. As in: A variable is an l-value. A constant is an r-value. I believe the Dragon book does this. it matters whether a particular variable is in an operand position or in a result position, but that depends on the class of the expression - unary, binary, ternary, etc. - and not on the operand or result types or the operators involved. Maybe I should clarify. This information decides whether an expression CAN be used as an l/r-value or not. If it cannot be an l-value, it is a readonly expression. You will see this often with a simple arithmetic operation or with a constant or literal. If it cannot be a an r-value, it is a writeonly expression. This is seen less often, but can occur for formal parameters of the 'out' direction. (As opposed to 'in' or 'inout'.) for properties that have a setter method but no getter method. If it can be both, it is a readwrite expression. The access-type of your basic variable. But also of some other expressions, like properties and array-subscriptings. This information is stored in the symbol table and is used to check for referencing errors. It may also help for some optimizations later. Since you are targeting C which already has a fairly sophisticated understanding of complex data types, your compiler really doesn't have to bother. If your source language says "A[X] := A[X] + 1", you can just say that directly in C after appropriately defining A and X. You don't need to transform it all the way down into address computations and dereferences. But I'm sure our source language handles a lot of constructs a little differently than C does. Also, we want to write our own error messages. Anyway, the translation to C is only a temporary solution. We're focussing on the design and proof of correctness concept right now, and do not want to have to bother with assembly or learning to use the gcc backend until later. Translating to C seemed like the easiest solution for now. But all processing of the AST is completely separate from the final translation. -- Michiel Helvensteijn |
|
#7
|
|||
|
|||
|
Number of compiler passes
George Neuner wrote:
(snip) As long as non-local variables are transitively in the lexical scope chain of the function that uses them you can still gather all the declarations in one pass. If they're not, you have a language that will be error prone and confusing to use. The simplest way to do it is to organize your symbol "table" as a stack of lists[*]. You attach a local symbol list to each function's definition node in the AST. I think that is fine, but some languages and language features complicate the problem. interesting feature (though some might disagree on whether it really is a feature) is partial qualification for structure references in PL/I. Referencing a variable in a structure only needs enough qualification to be unambiguous. It gets more interesting with nested scope when a name could agree with partial qualification at more than one level. I believe it works such that, given the stack of lists as you indicate, you can go up the list and take the first one that agrees, within partial qualification, with the given symbol. That makes it relatively easy for compilers, and means that a change at a higher level of nesting won't change the meaning of a name at a lower level. -- glen |
|
#8
|
|||
|
|||
|
Number of compiler passes
Michiel wrote:
(snip) If it cannot be a an r-value, it is a writeonly expression. This is seen less often, but can occur for formal parameters of the 'out' direction. (As opposed to 'in' or 'inout'.) for properties that have a setter method but no getter method. I believe for Fortran INTENT(UT) dummy variables, you must write first, but you can then read the value just as any other variable. They are not write-only. Among other reasons, Fortran allows either call by reference or call by value result (sometimes called copy-in copy-out). In the latter case, INTENT(UT) might not do the copy in. It can also affect the way the optimizer works. -- glen |
|
#9
|
|||
|
|||
|
Number of compiler passes
George Neuner wrote:
(snip) In practice, I think partial qualification is only an issue for languages with latent or no typing. (snip) As for shadowing definitions, I would simply go with the inner one. I can see some minor utility to searching until you find a definition that works, but I think it would make the language very confusing to allow simultaneous access to multiple types of the same name. Yes, I believe that is what PL/I does. (I believe that partial qualification came from CBL, but I don't know about that at all.) It came up once when someone discovered that it would use a partially qualified inner structure over a fully qualified structure at an outer level. It is a compilation error if it is ambiguous at the same level, but not at different levels. DCL B FIXED BIN(31); BEGIN; DCL 1 A, 2 B FLAT(6); B=3; PUT LIST(A.B); END; -- glen |
|
#10
|
|||
|
|||
|
Number of compiler passes
glen herrmannsfeldt wrote:
I believe for Fortran INTENT(UT) dummy variables, you must write first, but you can then read the value just as any other variable. They are not write-only. I thought about this. But there is no (easy) way at compile time to figure out if a variable has actually been written to. Think of this example: int f(out int a) { if (rnd(0,10) 5) { a <- 5; } result <- a; // is this allowed? } Note: * rnd returns a random number * <- is the assignment operator * the language has no return statement, but rather a result variable Among other reasons, Fortran allows either call by reference or call by value result (sometimes called copy-in copy-out). In the latter case, INTENT(UT) might not do the copy in. It can also affect the way the optimizer works. I believe the optimizer could do much with the guarantee that a variable will never be read. Anyway, there are ways around this, of course. Just assign to a local variable first, so you can use the value again. -- Michiel Helvensteijn |
|
#11
|
|||
|
|||
|
Number of compiler passes
George Neuner wrote:
No, depth first is fine (and I think, preferred). As long as the compiler can tell a declaration from other expressions in the AST, the actual traversal order doesn't matter. Then I still don't understand how you want to merge those two stages (finding declarations + finding references). A variable may be declared *textually after* a function that references it. So it seems I have to traverse twice to fully qualify the reference, or I need to do a breadth first traversal. >>I've seen the terms used in both contexts. As in: A variable is an >>l-value. A constant is an r-value. I believe the Dragon book does this. > I've also seen this usage, but IM it can be confusing and I don't like it. The practical usage: L-value = result/target, R-value = operand, has some meaning. That's fine. I don't actually use these terms in the language docs. I use read-only, write-only and read/write. I only used the l-value and r-value terms to try to explain it here. , we have different definitions of expression. I consider an expression to compute something or cause a side effect like altering control flow. To me, a variable reference like "A[X]" is a subexpression that can't stand on its own. To me, a subexpression is still an expression. It has a type (the content-type of the type of array A), an access-type (most probably read/write) and a location in the source code. To find the type and access-type of an expression, you need this knowledge of its subexpressions. In the limited state our language is in now, I would report an error if X did not evaluate to an `int' expression at this point. (No associative arrays yet.) I see what now what you're getting at - it's your terminology that's throwing me. You want to know whether the variable is ever written to or only read, but that's different than _being_ an L-value or R-value. Yes, of course. I admit I don't know all the compiler-theory terminology. (Plus, I sometimes think the existing terminology is too limiting.) L-value and R-value at the expression level are not synonymous with "write" and "read" at the symbol level. Take, for example, a write-only function parameter passed by reference. How do you assign to that value? > <> > As you can clearly see, the "write-only" variable "A" is used only in an operand position, ie. as an R-value. It is never used (directly) as an L-value. But our syntax tree and our symbol table are both relatively high level. They understand the nature of arrays. And to them, indexing an array is not the same as reading from it. To be blunt, I'll worry about the implementation details later. As long as it works conceptually. -- Michiel Helvensteijn |
|
#12
|
|||
|
|||
|
Number of compiler passes
George Neuner wrote:
, we have different definitions of expression. I consider an expression to compute something or cause a side effect like altering control flow. To me, a variable reference like "A[X]" is a subexpression that can't stand on its own. I think a symbolic location reference like A[X] is an appropriate structure to have in an expression tree, and that it has a type, and that when evaluated for computation it acts as an rvalue, but it can also serve as an lvalue if it occurs e.g. as the operand to an address-of unary expression node, or on the lhs of an assignment. I suggest that it should be up to a later visitor that puts meaning to a tree like (index (symbol 'A') (symbol 'X')). The tree itself is just a symbolic description of a location; it is neither read-only nor write-only (unless the types and/or members have constness associated with them like C++). To make this rather subtle distinction clearer with a concrete example, an assignment might look like: (assign (symbol 't') (index (symbol 'A') (symbol 'X'))) "t := A[X]" or: (assign (index (symbol 'A') (symbol 'X')) (symbol 't')) "A[X] := t" That is, A[X] is just as much an expression that can stand on its own as any other expression - when computed for its value it returns A's Xth entry (presumably). However, a code generator might compute it for its *location*, in which case it acts differently - but the decision is in a visitor to the tree, not in the tree construction itself. In practice, however, it's probably best to annotate nodes based on whether they can be assigned to or not, so that errors like "(4 + 5) := 10" can be found nice and early; and similarly for rvalue-ness or not, if that's appropriate for the source language. -- Barry |
![]() |
| Viewing: Web Development Archives > FAQs > Compilers > Number of compiler passes |
| Thread Tools | Search this Thread |
| Display Modes | Rate This Thread |
|
|
|
|