Compilers
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
 
User Name:
Password:
Remember me
Go Back   Web Development Archives FAQs Compilers

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Display Modes
 
Unread Web Development Archives Sponsor:
  #1  
Old July 21st, 2008, 11:50 AM
Michiel
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Number of compiler passes

I was wondering, with the computers of today, with as much memory as they
have, is there still a reason to limit the amount of passes a compiler
makes? Sure, with each additional pass there will be some overhead, but
other than that, is there a disadvantage I'm missing?

In particular, I'm working on a source-to-source compiler at the moment.
>From my own language to C. It now has the following stages:


* Flex + Bison to create AST
* AST to find declarations (and re-declarations)
* AST to detect which variables are used in which functions
* AST to find undeclared references
* AST to find the type of each expression and note type mismatches
* AST to find the access-type (read/write/both) of each expression
* AST to perform optimizations
* AST to translate to target language

Pass 2 to 4 may seem excessive, but I believe they are necessary because of
the following features:
* Var/function declarations anywhere in the code (incl. nested functions)
* Forward referencing of constant functions and constant variables

Some of the stages could be merged, I suppose. But it would not help code
readability or provability to do so.

I would like to hear some opinions about this.

Thanks!

--
Michiel Helvensteijn
[Back in the era of coal fired mainframes, compilers were divided into multiple
passes so the code for each pass could be overlaid. These days I agree that
there isn't a strong argument either way except perhaps better cache data locality
if you combine passes. -John]


Reply With Quote
  #2  
Old July 21st, 2008, 06:50 PM
glen herrmannsfeldt
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Number of compiler passes

Michiel wrote:

I was wondering, with the computers of today, with as much memory as they
have, is there still a reason to limit the amount of passes a compiler
makes? Sure, with each additional pass there will be some overhead, but
other than that, is there a disadvantage I'm missing?
(snip)

John wrote:
[Back in the era of coal fired mainframes, compilers were divided into multiple
passes so the code for each pass could be overlaid. These days I agree that
there isn't a strong argument either way except perhaps better cache data locality
if you combine passes. -John]

Also, in those days the intermediate data usually went to disk.

I know discussion about changes in the S/360 assembler, I believe
from F to XF, went from four passes to three, one less write and read
back from disk, with significant speed-up.

Not that I think we should go back to those days, but I am still
surprised at the amount of memory needed by some modern compilers.
There are people trying to run gcc on MVS with 24 bit addressing, with
the ability to compile itself, but it doesn't seem to fit.

There was also a discussion on extended addressing for the PDP-10,
such that one might be able to run gcc. Again, even with the full
extended addressing of TPS-20, it seems that it won't fit.

-- glen

Reply With Quote
  #3  
Old July 25th, 2008, 08:20 AM
Michiel
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Number of compiler passes

George Neuner wrote:

Flex + Bison to create AST
1 AST to find declarations (and re-declarations)
2 AST to detect which variables are used in which functions
3 AST to find undeclared references
4 AST to find the type of each expression and note type mismatches
5 AST to find the access-type (read/write/both) of each expression
6 AST to perform optimizations
7 AST to translate to target language
>

You don't give much detail of your language, but I don't see how
"readability or provability" would be harmed by combining passes 1 & 2

Assuming variables must be declared, processing declarations gives you
all you need - the symbol table(s) you construct will provide scoping
for non-local variables and functions. I don't see a need for two
passes.

In the source language, a function can use variables from outside its
scope. More importantly, these variables could be declared after the
function is. This is okay as long as this function is not called
before the declaration of these variables.

and passes 3 & 4. [I numbered them above for convenience]

Similarly, undeclared references will not be represented in your
symbol table. They will be caught when you look them up to determine
their type. You can assign them a special nil type so going forward
any expressions using them fail to type properly.

You're right about these, though. They could be merged and I will
certainly consider it. The only reason for me to keep them separate is
because they are conceptually different operations.

I don't know what you mean by "access-type" of an expression - unless
you're talking about I/ in which case I still don't know what you
mean. We could be having a terminology disconnect - to my mind an
"expression" always computes or side effects, so it always reads
operands and sometimes writes a result.

Yes, I suppose I am stretching the definition a bit. An expression has a
type (bool, int, etc.). But to me it also has an access type. This is
either readonly, writeonly or readwrite. It basically specifies whether it
is an r-value, an l-value or both.

I would think that your optimization phase itself would consist of
multiple passes as a number of optimizations are possible even on an
AST form.

Yep, probably. I haven't actually started on this stage yet.

And since you are targeting C from a language that has nested
functions, you might actually need more than one pass to affect the
translation. Simple function nesting as in Pascal can be handled
easily, but full closures as in Lisp requires constructing runtime
data structures and a lot more analysis.

For now, functions in the source language cannot be passed, assigned or
returned yet. But in the future I hope to include functional aspects into
the language. I'm already planning for it in the design.

Thanks for your reply,

--
Michiel Helvensteijn

Reply With Quote
  #4  
Old July 25th, 2008, 01:20 PM
Denis Washington
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Number of compiler passes

Tue, 2008-07-22 at 21:31 +0200, Michiel wrote:
For now, functions in the source language cannot be passed, assigned or
returned yet. But in the future I hope to include functional aspects into
the language. I'm already planning for it in the design.
>

Thanks for your reply,

of pure curiosity: what kind of language are you trying to develop?
What are you planning to do with it? That would interest me.

Regards,
Denis


Reply With Quote
  #5  
Old July 26th, 2008, 07:10 AM
Michiel
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Number of compiler passes

of pure curiosity: what kind of language are you trying to develop?
What are you planning to do with it? That would interest me.

I have already answered you by mail, but I'll repeat it for the group
(minus the question about your own language):

It's a language I'm designing and implementing with a colleague for
our Masters degree in Computer Science. the surface it's just an
imperative language with some nice tricks (like the nested
functions). But we're using it primarily for research on automatic
correctness proving.

The language has a special syntax for function preconditions,
postconditions and invariants. The hope is that most of these
assertions can be proved at compile time. This framework would also
clear the way for a lot of new compiler optimizations. It also has
other benefits.

That's its most important feature. As a part of my research I hope to
add new paradigms (P, functional) to the language one by one as long
as they can still be supported by this framework.

By the way, I'm still not sure what to name the language. Any ideas?

--
Michiel Helvensteijn

Reply With Quote
  #6  
Old July 28th, 2008, 09:50 AM
Michiel
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Number of compiler passes

George Neuner wrote:

>>In the source language, a function can use variables from outside its
>>scope. More importantly, these variables could be declared after the
>>function is. This is okay as long as this function is not called
>>before the declaration of these variables.

>

As long as non-local variables are transitively in the lexical scope
chain of the function that uses them you can still gather all the
declarations in one pass. If they're not, you have a language that
will be error prone and confusing to use.

They are.

I think I understand. You are proposing a breadth first traversal? We are
using the visitor pattern right now, which lends itself best to a depth
first traversal.

The simplest way to do it is to organize your symbol "table" as a
stack of lists[*].

This allows for
lexical shadowing of same-named variables in enclosing scopes.

This is pretty much what I'm doing already. :-)

>>Yes, I suppose I am stretching the definition a bit. An expression has a
>>type (bool, int, etc.). But to me it also has an access type. This is
>>either readonly, writeonly or readwrite. It basically specifies whether it
>>is an r-value, an l-value or both.

>

I believe you are overthinking the problem.
>

L-value or R-value is a matter of usage, not type.

I've seen the terms used in both contexts. As in: A variable is an l-value.
A constant is an r-value. I believe the Dragon book does this.

it
matters whether a particular variable is in an operand position or in
a result position, but that depends on the class of the expression -
unary, binary, ternary, etc. - and not on the operand or result types
or the operators involved.

Maybe I should clarify. This information decides whether an expression CAN
be used as an l/r-value or not.

If it cannot be an l-value, it is a readonly expression. You will see this
often with a simple arithmetic operation or with a constant or literal.

If it cannot be a an r-value, it is a writeonly expression. This is seen
less often, but can occur for formal parameters of the 'out' direction. (As
opposed to 'in' or 'inout'.) for properties that have a setter method
but no getter method.

If it can be both, it is a readwrite expression. The access-type of your
basic variable. But also of some other expressions, like properties and
array-subscriptings.

This information is stored in the symbol table and is used to check for
referencing errors. It may also help for some optimizations later.

Since you are targeting C which already has a fairly
sophisticated understanding of complex data types, your compiler
really doesn't have to bother. If your source language says "A[X] :=
A[X] + 1", you can just say that directly in C after appropriately
defining A and X. You don't need to transform it all the way down
into address computations and dereferences.

But I'm sure our source language handles a lot of constructs a little
differently than C does. Also, we want to write our own error messages.

Anyway, the translation to C is only a temporary solution. We're focussing
on the design and proof of correctness concept right now, and do not want
to have to bother with assembly or learning to use the gcc backend until
later. Translating to C seemed like the easiest solution for now. But all
processing of the AST is completely separate from the final translation.

--
Michiel Helvensteijn


Reply With Quote
  #7  
Old July 28th, 2008, 09:50 AM
glen herrmannsfeldt
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Number of compiler passes

George Neuner wrote:
(snip)

As long as non-local variables are transitively in the lexical scope
chain of the function that uses them you can still gather all the
declarations in one pass. If they're not, you have a language that
will be error prone and confusing to use.

The simplest way to do it is to organize your symbol "table" as a
stack of lists[*]. You attach a local symbol list to each function's
definition node in the AST.

I think that is fine, but some languages and language features
complicate the problem.

interesting feature (though some might disagree on
whether it really is a feature) is partial qualification
for structure references in PL/I.

Referencing a variable in a structure only needs enough
qualification to be unambiguous. It gets more interesting
with nested scope when a name could agree with partial
qualification at more than one level. I believe it works
such that, given the stack of lists as you indicate, you can
go up the list and take the first one that agrees, within
partial qualification, with the given symbol.

That makes it relatively easy for compilers, and means
that a change at a higher level of nesting won't change
the meaning of a name at a lower level.

-- glen


Reply With Quote
  #8  
Old July 29th, 2008, 07:50 PM
glen herrmannsfeldt
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Number of compiler passes

Michiel wrote:
(snip)

If it cannot be a an r-value, it is a writeonly expression. This is seen
less often, but can occur for formal parameters of the 'out' direction. (As
opposed to 'in' or 'inout'.) for properties that have a setter method
but no getter method.

I believe for Fortran INTENT(UT) dummy variables, you must
write first, but you can then read the value just as any other
variable. They are not write-only.

Among other reasons, Fortran allows either call by reference
or call by value result (sometimes called copy-in copy-out).
In the latter case, INTENT(UT) might not do the copy in.
It can also affect the way the optimizer works.

-- glen


Reply With Quote
  #9  
Old August 1st, 2008, 07:10 AM
glen herrmannsfeldt
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Number of compiler passes

George Neuner wrote:
(snip)

In practice, I think partial qualification is only an issue for
languages with latent or no typing.
(snip)

As for shadowing definitions, I would simply go with the inner one. I
can see some minor utility to searching until you find a definition
that works, but I think it would make the language very confusing to
allow simultaneous access to multiple types of the same name.

Yes, I believe that is what PL/I does. (I believe that partial
qualification came from CBL, but I don't know about that at all.)

It came up once when someone discovered that it would use
a partially qualified inner structure over a fully qualified
structure at an outer level.

It is a compilation error if it is ambiguous at the same
level, but not at different levels.

DCL B FIXED BIN(31);
BEGIN;
DCL 1 A, 2 B FLAT(6);
B=3;
PUT LIST(A.B);
END;

-- glen


Reply With Quote
  #10  
Old August 1st, 2008, 07:10 AM
Michiel
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Number of compiler passes

glen herrmannsfeldt wrote:

I believe for Fortran INTENT(UT) dummy variables, you must
write first, but you can then read the value just as any other
variable. They are not write-only.

I thought about this. But there is no (easy) way at compile time to figure
out if a variable has actually been written to. Think of this example:

int f(out int a) {
if (rnd(0,10) 5) {
a <- 5;
}
result <- a; // is this allowed?
}

Note:
* rnd returns a random number
* <- is the assignment operator
* the language has no return statement, but rather a result variable

Among other reasons, Fortran allows either call by reference
or call by value result (sometimes called copy-in copy-out).
In the latter case, INTENT(UT) might not do the copy in.
It can also affect the way the optimizer works.

I believe the optimizer could do much with the guarantee that a variable
will never be read.

Anyway, there are ways around this, of course. Just assign to a local
variable first, so you can use the value again.

--
Michiel Helvensteijn


Reply With Quote
  #11  
Old August 1st, 2008, 07:10 AM
Michiel
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Number of compiler passes

George Neuner wrote:

No, depth first is fine (and I think, preferred). As long as the
compiler can tell a declaration from other expressions in the AST, the
actual traversal order doesn't matter.

Then I still don't understand how you want to merge those two stages
(finding declarations + finding references). A variable may be declared
*textually after* a function that references it. So it seems I have to
traverse twice to fully qualify the reference, or I need to do a breadth
first traversal.

>>I've seen the terms used in both contexts. As in: A variable is an
>>l-value. A constant is an r-value. I believe the Dragon book does this.

>

I've also seen this usage, but IM it can be confusing and I don't
like it. The practical usage: L-value = result/target, R-value = operand,
has some meaning.

That's fine. I don't actually use these terms in the language docs. I use
read-only, write-only and read/write. I only used the l-value and r-value
terms to try to explain it here.

, we have different definitions of expression. I consider an
expression to compute something or cause a side effect like altering
control flow. To me, a variable reference like "A[X]" is a
subexpression that can't stand on its own.

To me, a subexpression is still an expression. It has a type (the
content-type of the type of array A), an access-type (most probably
read/write) and a location in the source code.

To find the type and access-type of an expression, you need this knowledge
of its subexpressions. In the limited state our language is in now, I would
report an error if X did not evaluate to an `int' expression at this point.
(No associative arrays yet.)

I see what now what you're getting at - it's your terminology that's
throwing me. You want to know whether the variable is ever written to
or only read, but that's different than _being_ an L-value or R-value.

Yes, of course. I admit I don't know all the compiler-theory terminology.
(Plus, I sometimes think the existing terminology is too limiting.)

L-value and R-value at the expression level are not synonymous with
"write" and "read" at the symbol level. Take, for example, a
write-only function parameter passed by reference. How do you assign
to that value?
>

<>
>

As you can clearly see, the "write-only" variable "A" is used only in
an operand position, ie. as an R-value. It is never used (directly)
as an L-value.

But our syntax tree and our symbol table are both relatively high level.
They understand the nature of arrays. And to them, indexing an array is not
the same as reading from it.

To be blunt, I'll worry about the implementation details later. As long as
it works conceptually.

--
Michiel Helvensteijn

Reply With Quote
  #12  
Old August 1st, 2008, 07:10 AM
Barry Kelly
Guest
Dev Archives Newbie (0 - 499 posts)
 
Posts: n/a  
Time spent in forums:
Reputation Power:
Number of compiler passes

George Neuner wrote:

, we have different definitions of expression. I consider an
expression to compute something or cause a side effect like altering
control flow. To me, a variable reference like "A[X]" is a
subexpression that can't stand on its own.

I think a symbolic location reference like A[X] is an appropriate
structure to have in an expression tree, and that it has a type, and
that when evaluated for computation it acts as an rvalue, but it can
also serve as an lvalue if it occurs e.g. as the operand to an
address-of unary expression node, or on the lhs of an assignment.

I suggest that it should be up to a later visitor that puts meaning to a
tree like (index (symbol 'A') (symbol 'X')). The tree itself is just a
symbolic description of a location; it is neither read-only nor
write-only (unless the types and/or members have constness associated
with them like C++).

To make this rather subtle distinction clearer with a concrete example,
an assignment might look like:

(assign (symbol 't') (index (symbol 'A') (symbol 'X')))

"t := A[X]"

or:

(assign (index (symbol 'A') (symbol 'X')) (symbol 't'))

"A[X] := t"

That is, A[X] is just as much an expression that can stand on its own as
any other expression - when computed for its value it returns A's Xth
entry (presumably). However, a code generator might compute it for its
*location*, in which case it acts differently - but the decision is in a
visitor to the tree, not in the tree construction itself.

In practice, however, it's probably best to annotate nodes based on
whether they can be assigned to or not, so that errors like "(4 + 5) :=
10" can be found nice and early; and similarly for rvalue-ness or not,
if that's appropriate for the source language.

-- Barry


Reply With Quote
Reply

Viewing: Web Development Archives FAQs Compilers > Number of compiler passes


Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are Off
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest Threads | Shoutbox
Forum Jump


Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 





© 2003-2009 by Developer Shed. All rights reserved. DS Cluster 4 Hosted by Hostway
For more Enterprise Application Development news, visit eWeek