IronSmalltalk Compiler Services

This is a brief description of the compiler services.

The compiler and related components are divided into several sub-categories. Those are based on the X3J20 document structure, but do not adhere to it 100%.

  1. Scanner
  2. Parser
  3. Interchange Processor
  4. Installer
  5. Encoder

The scanner (or lexer) is documented primarily in X3J20 chapter 3.5 Lexical Grammar. See the Lexical Grammar section below.

The parser is divided in two: the standard parser, which parses methods and initializers as described in X3J20 3.4 Method Grammar, and the interchange format parser, which is used by the interchange format processor. See the Method Grammar section below.

The interchange format is the file format chosen for IronSmalltalk source code files. The interchange format processor is responsible for reading those files. Instead of using the classical Smalltalk approach of evaluating chunks, we've implemented a special parser that has hardcoded knowledge of the format. A description of the format is found in X3J20 4. Smalltalk Interchange Format. The interchange processor uses the interchange format parser to create interchange parse nodes. Those nodes are converted into definitions and added to the installer. The interchange format processor does not modify the Smalltalk environment - the installer does that.

The installer accepts definitions from the interchange processor and installs them into the Smalltalk environment (the image). It does this in several steps, validating as it goes. The rules are described in X3J20 3.3 Smalltalk Abstract Program Grammar, 3.4 Method Grammar and, to a small extent, in 4. Smalltalk Interchange Format. The installer is technically not part of the compiler services, but part of the runtime services. It is theoretically possible to create a Smalltalk application programmatically, by code, instead of reading source files. See the Abstract Program Grammar section below as well as the separate read me file in the Runtime folder.

The encoder is neither required by nor described in X3J20. It is an entirely IronSmalltalk-specific component. It is responsible for converting the parse trees into compiled methods / initializers and into DLR-specific expression trees.

The compiler service is implemented to adhere strictly to the X3J20 specification. Any deviation, unless thoroughly documented, is to be considered a bug.

The compiler services are to be tolerant of source code errors. They should try to recover from (but not automatically fix) malformed source code. This is necessary if we later use the same services in a development environment to process unfinished source code, i.e. to dynamically parse code as the user types it.

Compiler services are not allowed to throw exceptions due to inconsistencies in or malformed source code. They should use an error-sink to report errors to their client. If possible, errors should be descriptive, including the source location that was responsible for the issue and optionally a suggestion on how to fix it. It is however OK to throw exceptions due to internal inconsistencies, i.e. due to a bug in the C# implementation.
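The error-sink idea can be illustrated with a small sketch. It is written in Python purely for illustration and is not the IronSmalltalk C# API; every name in it (SourceLocation, CompileError, CollectingErrorSink, parse_return_statement) is hypothetical. It assumes a sink is simply an object with a report method that the client passes in.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SourceLocation:
    line: int
    column: int


@dataclass(frozen=True)
class CompileError:
    message: str
    location: SourceLocation
    suggestion: Optional[str] = None


@dataclass
class CollectingErrorSink:
    """Hypothetical sink that simply remembers every reported error."""
    errors: List[CompileError] = field(default_factory=list)

    def report(self, message, location, suggestion=None):
        self.errors.append(CompileError(message, location, suggestion))


def parse_return_statement(tokens, sink):
    """Toy service: expects ['^', <expression>].

    Bad source is reported to the sink and None is returned;
    no exception is raised for malformed input.
    """
    if not tokens or tokens[0] != "^":
        sink.report("Expected '^'", SourceLocation(1, 1),
                    "Start the statement with '^'")
        return None
    if len(tokens) < 2:
        sink.report("Missing expression after '^'", SourceLocation(1, 2))
        return None
    return ("return", tokens[1])
```

The client decides what to do with the collected errors (show them in an editor, abort a build, and so on); the service itself only reports and carries on.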

The compiler services are written in C#. Together they make up what we call the bootstrap compiler, which is the default compiler service. This compiler is responsible for reading the IronSmalltalk base class library. Once enough of the base class library has been processed, the implementation may hand over control and continue processing source code using a compiler service implemented in Smalltalk. This feature will be deferred to version 1.x.

Compiler services do not directly access Smalltalk elements. They communicate with the Smalltalk environment (the Smalltalk image in a legacy implementation) via a Smalltalk Environment Service. The Environment Service is responsible for manipulating the state of the environment, e.g. creating classes, defining globals, initializing etc. This decouples the compiler from the environment implementation and lets us change implementation-specific features if we choose to.

We strive to avoid hardcoding constants inside the business logic source code. If possible, we'll define a class containing error message string constants and a class containing grammatical string constants.

Abstract Program Grammar

The Abstract Program Grammar is described in X3J20 section "3.3 Smalltalk Abstract Program Grammar". It is open to interpretation, but X3J20 includes "4. Smalltalk Interchange Format", which is one possible concrete implementation of the program grammar. We've chosen to implement the interchange format as our program grammar.

To accommodate integration with the .NET framework, the grammar may be extended with features that make such integration easy. However, this requires the interchange files to declare a specific interchange version. For this purpose, we've defined the interchange version "IronSmalltalk 1.0". Using this interchange version tells the compiler services to accept the grammar extensions introduced by IronSmalltalk. In the future, we'd like to make it possible for 3rd parties to implement their own interchange version and extend the compiler services.
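As a rough illustration of version-gated grammar extensions, here is a sketch in Python (not the actual C# implementation). The version string "IronSmalltalk 1.0" comes from the text above; the standard version string "1.0", the function name select_grammar and the report callback are assumptions made for the example.

```python
# Assumption: the plain X3J20 interchange version is written as "1.0".
STANDARD_VERSION = "1.0"
# From the text above: the version that enables IronSmalltalk extensions.
IRONSMALLTALK_VERSION = "IronSmalltalk 1.0"


def select_grammar(version, report):
    """Return which grammar to use for a file declaring `version`.

    Unknown versions are reported through `report` (a hypothetical
    error callback) rather than raised as exceptions.
    """
    if version == STANDARD_VERSION:
        return "standard"
    if version == IRONSMALLTALK_VERSION:
        return "extended"
    report("Unsupported interchange version: " + repr(version))
    return None
```

A registry mapping version strings to grammar implementations would be one way to let 3rd parties add their own versions later, but that is a design option, not something the current text commits to.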

The program grammar is mostly implemented in the IronSmalltalk.Compiler.Interchange namespace. The portions of it which deal with manipulating the Smalltalk environment are implemented in the IronSmalltalk.Runtime.SmalltalkEnvironmentService.

The interchange format specification is finite and relatively small. Instead of using the traditional technique of evaluating chunks of file-in code, we've decided to hardcode all parts of the interchange format in C# classes. This is needed anyway to be able to bootstrap the Smalltalk environment.

The elements described in X3J20 "4. Smalltalk Interchange Format" are modelled in the IronSmalltalk.Compiler.Interchange.ParseNodes namespace as subclasses of the InterchangeParseNode base class. Each node that is an instance of a subclass of InterchangeUnitNode can be asked to Evaluate itself. That is analogous to evaluating a chunk of file-in code in a traditional implementation. When evaluated, the node will manipulate the Smalltalk environment and, if necessary, read and process (file-in) more source code.
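The "nodes evaluate themselves" idea can be sketched as follows. This is an illustrative Python sketch, not the C# classes; only the name InterchangeUnitNode is taken from the text, while the node subclasses, their fields and FakeEnvironment are hypothetical.

```python
class FakeEnvironment:
    """Hypothetical stand-in for the Smalltalk environment."""

    def __init__(self):
        self.globals = {}
        self.classes = {}


class InterchangeUnitNode:
    """Base class: each unit knows how to apply itself to an environment."""

    def evaluate(self, env):
        raise NotImplementedError


class GlobalDefinitionNode(InterchangeUnitNode):
    def __init__(self, name):
        self.name = name

    def evaluate(self, env):
        # Define the global with no value yet.
        env.globals[self.name] = None


class ClassDefinitionNode(InterchangeUnitNode):
    def __init__(self, name, superclass):
        self.name = name
        self.superclass = superclass

    def evaluate(self, env):
        env.classes[self.name] = self.superclass


def evaluate_all(nodes, env):
    # Analogous to filing in chunks one after another.
    for node in nodes:
        node.evaluate(env)
```

Because each node type carries its own evaluation logic, the code that drives processing does not need a switch over node kinds.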

Processing an interchange format file is done by an InterchangeFormatProcessor instance. The InterchangeFormatProcessor is responsible for reading the given source code, splitting it into chunks and processing each chunk. The processor acts as the processing context: it knows about the Smalltalk environment it belongs to and is working on.

The InterchangeFormatProcessor delegates the parsing work to an InterchangeFormatParser instance. The interchange parser is a specialized (hard-coded) parser that knows exactly how to parse the interchange format. It generates the interchange nodes mentioned earlier.
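The split-then-delegate flow described above can be sketched like this. It is a Python illustration under stated assumptions, not the C# code: it assumes chunks end with "!" and that a doubled "!!" stands for a literal "!", as is conventional for Smalltalk chunk files; the function names and the parse_chunk callback are hypothetical.

```python
def split_chunks(source):
    """Split source text into chunks.

    Assumption: a single '!' terminates a chunk and '!!' inside a
    chunk stands for one literal '!'.
    """
    chunks, current = [], []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "!":
            if i + 1 < len(source) and source[i + 1] == "!":
                current.append("!")   # escaped exclamation mark
                i += 2
                continue
            chunks.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    tail = "".join(current).strip()
    if tail:
        chunks.append(tail)           # unterminated final chunk
    return chunks


def process(source, parse_chunk):
    """Delegate each chunk to a parser callback and collect the nodes."""
    return [parse_chunk(chunk) for chunk in split_chunks(source)]
```

For example, process("a! b!!c!", str.upper) yields ["A", "B!C"]; in the real processor the callback role is played by the interchange format parser.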

A substantial part of the implementation of the program grammar is in the Smalltalk environment service. That service is responsible for manipulating the environment. It is documented in a separate file.

See: IronSmalltalk.Compiler.Interchange and IronSmalltalk.Compiler.Interchange.ParseNodes namespaces and IronSmalltalk.Runtime.SmalltalkEnvironmentService class.

Method Grammar

The method grammar is described in X3J20 "3.4 Method Grammar". It is implemented in the IronSmalltalk.Compiler.SemanticAnalysis and IronSmalltalk.Compiler.SemanticNodes namespaces.

The main component is the Parser, which is implemented in the IronSmalltalk.Compiler.SemanticAnalysis.Parser class. It contains almost all of the logic described in the X3J20 "3.4 Method Grammar". The parser's main function is to parse method definitions and initializers.

The parser produces an AST (Abstract Syntax Tree) of parse nodes. Those are defined in the IronSmalltalk.Compiler.SemanticNodes namespace. In general, each node class represents an element described in the X3J20 chapter.

ISSUE: How does the AST convert to source code?

See: IronSmalltalk.Compiler.SemanticAnalysis and IronSmalltalk.Compiler.SemanticNodes namespaces.

Lexical Grammar

The lexical grammar is described in X3J20 "3.5 Lexical Grammar". The logic is mainly implemented in the IronSmalltalk.Compiler.LexicalAnalysis and IronSmalltalk.Compiler.LexicalTokens namespaces.

The main component is the Scanner (or Lexer). It is implemented in the IronSmalltalk.Compiler.LexicalAnalysis.Scanner class. It is responsible for creating tokens out of the given source code. The main method is GetToken, which returns the next token.

When the scanner reaches the end of the source code, it returns an EofToken. If the client insists on getting more tokens, it returns null, i.e. EofToken is never returned more than once. This ensures that if the client is in some sort of loop and doesn't recognize EofToken correctly, it will exit gently or just crash instead of getting stuck in an infinite loop.
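The end-of-input behaviour can be sketched as follows. This is an illustrative Python sketch, not the C# Scanner: it splits on whitespace instead of doing real lexical analysis, and the names ToyScanner and get_token are hypothetical stand-ins for Scanner and GetToken, with None playing the role of null.

```python
class EofToken:
    """Marker token for the end of the source."""


class ToyScanner:
    def __init__(self, source):
        # Whitespace splitting stands in for real tokenization.
        self._words = source.split()
        self._pos = 0
        self._eof_returned = False

    def get_token(self):
        if self._pos < len(self._words):
            word = self._words[self._pos]
            self._pos += 1
            return word
        if not self._eof_returned:
            self._eof_returned = True
            return EofToken()       # returned exactly once
        return None                 # every later call


def count_tokens(scanner):
    """A careless client that forgets to check for EofToken."""
    count = 0
    while scanner.get_token() is not None:
        count += 1
    return count
```

Even though count_tokens never looks for EofToken, its loop still terminates because the scanner eventually returns None.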

The GetToken method has a preference argument that tells the scanner, in case of ambiguity, which token the client (the parser) would prefer. An example is "|" (vertical bar), which can either be a binary selector or a special character used for defining temporaries or inside blocks.
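A minimal sketch of such a preference argument, in Python and for illustration only: the Preference enum, the token kind names and the get_token signature are hypothetical and do not mirror the actual C# API.

```python
from enum import Enum


class Preference(Enum):
    BINARY_SELECTOR = "binary selector"
    SPECIAL_CHARACTER = "special character"


def get_token(source, pos, preference=Preference.BINARY_SELECTOR):
    """Return (kind, text, next_pos) for the token starting at pos.

    Only '|' is ambiguous in this toy scanner; the caller's
    preference decides how it is classified.
    """
    while pos < len(source) and source[pos].isspace():
        pos += 1
    if pos >= len(source):
        return ("eof", "", pos)
    ch = source[pos]
    if ch == "|":
        if preference is Preference.SPECIAL_CHARACTER:
            return ("special_character", ch, pos + 1)
        return ("binary_selector", ch, pos + 1)
    end = pos
    while end < len(source) and not source[end].isspace() and source[end] != "|":
        end += 1
    return ("word", source[pos:end], end)
```

A parser that is about to read temporaries would ask with Preference.SPECIAL_CHARACTER, while a parser in the middle of a binary expression would keep the default.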

Tokens are defined in the IronSmalltalk.Compiler.LexicalTokens namespace and model the descriptions in the X3J20 chapter. Some tokens are extensions to the X3J20 specification, primarily the SpecialCharacter token, which is often used for method grammar constants.

Some inconsistencies exist between current implementations (VSE, Squeak and maybe VW) and the X3J20 standard. Namely, X3J20 says:

  1. Each token is to be recognized as the longest string of characters that is syntactically valid, except where otherwise specified.
  2. Unless otherwise specified, white space or another separator must appear between any two tokens if the initial characters of the second token would be a valid extension of the first token.
  3. White space is not allowed within a token unless explicitly specified as being allowed.
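Rule 1 above (longest match) can be illustrated with a toy tokenizer. This is a Python sketch with a deliberately tiny, assumed token set (keyword, identifier, integer); it is not the X3J20 lexical grammar and not the IronSmalltalk scanner.

```python
import re

# Assumed, heavily simplified token patterns for illustration only.
TOKEN_PATTERNS = [
    ("keyword", re.compile(r"[A-Za-z_][A-Za-z0-9_]*:")),
    ("identifier", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("integer", re.compile(r"[0-9]+")),
]


def tokenize(source):
    """Tokenize by always taking the longest match at each position."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        # Fallback: a single unrecognized character.
        best_kind, best_text, best_len = "unknown", source[pos], 0
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(source, pos)
            if match and len(match.group()) > best_len:
                best_kind, best_text = kind, match.group()
                best_len = len(best_text)
        tokens.append((best_kind, best_text))
        pos += len(best_text)
    return tokens
```

With these patterns, "at:" is scanned as one keyword token rather than the identifier "at" followed by ":", and "foobar" is one identifier while "foo bar" is two, which is why rule 2 requires a separator between such tokens.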

I've identified two issues that deviate from some existing implementations:

Adherence to the X3J20 standard might create some headaches for people who try to migrate code from an existing implementation. They must re-write their code, sorry! Future versions might implement a lexer/parser that handles this differently, but that will also require an extension to the interchange format by defining a new implementation-specific version. In other words, we will not bend X3J20 to introduce compatibility with legacy code.

See: IronSmalltalk.Compiler.LexicalAnalysis and IronSmalltalk.Compiler.LexicalTokens namespaces.