HP Forums

Full Version: monic part 5: writing C programs on the calculator
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Pages: 1 2
Previous: monic part 4: arbitrary-precision rational arithmetic

Next: monic part 6: the mC programming language and compiler

I've started writing an editor and translator for a subset of the C programming language. I'm not sure what the proper computer science term for the translator is. Given a program, e.g. find the factorial of the positive integer on stack level 1:

Code:

double product=1.0;
double n;

int main(void) {
 n = level1; //pop value on stack level 1 into n
 do {
  product = product*n;
  n--; 
 } while (n>1.0);
}

the translator produces a sequence of codes. When the program is run the codes are fed to a switch statement, which executes the corresponding instruction. Does this make the translator a compiler, or is it an interpreter?
(01-01-2021 01:35 AM)F-73P Wrote: [ -> ]the translator produces a sequence of codes. When the program is run the codes are fed to a switch statement, which executes the corresponding instruction. Does this make the translator a compiler, or is it an interpreter?

More like an interpreter. An interpreter executes the "translated" program as it is converting it (as does your system). A compiler transforms the original into something else that you can store and then run at a later date once the "translation" is complete.

A couple of optimisations of your code, BTW. You can't do "n--" if n is a floating point number. Also, this assumes that the value of n has been checked for validity and that it is in fact a positive integer value (expressed as a floating point).

Code:
int main(void) {
 double product = 1.0;
 unsigned int n = (unsigned int)level1;
 for(; n > 1;) product *= n--;
}
(01-01-2021 01:35 AM)F-73P Wrote: [ -> ]the translator produces a sequence of codes. When the program is run the codes are fed to a switch statement, which executes the corresponding instruction. Does this make the translator a compiler, or is it an interpreter?

Very nice work, judging from your description!

This is not a trivial accomplishment.

I assume you are lexing/parsing the source code just once, so this is a compiler that produces (byte) codes. Your virtual machine executes them in a loop with a switch statement. This approach is in principle similar to the JVM. Just to compare, I wrote a small C compiler that produces JVM class files from "mini C" source code.
(01-01-2021 11:20 AM)grsbanks Wrote: [ -> ]You can't do "n--" if n is a floating point number.

"n--" compiles without errors for n of type double on the compilers I tried (and gives n-1.0), so I assume it's supported by the standard they are using. However it's not good programming style, as "n--" could be mistaken for the predecessor of the floating-point number, which in general is not equal to n-1.0, so your suggestion is better.

(01-05-2021 03:50 AM)robve Wrote: [ -> ]I wrote a small C compiler that produces JVM class files from "mini C" source code.

Great work! I also like the manual, comprehensive and very professional.

I wrote the code before I read anything about compilers. The front end consists of (what I now know is called) a lexical analyser and the parser is just Dijkstra's shunting yard algorithm.

I'm now wondering whether I should:

a) Continue with my ad hoc approach. This is the easiest option but there is no syntax or semantic checking, so it will never be a complete programming environment.

b) Learn about grammars, abstract syntax trees and top-down parsing and write the parser by hand. This sounds very difficult.

c) Find an open source compiler and try to port it across.

My preference is for option b), despite the difficulties. In that case I think I should choose a programming language with a simpler grammar than C as the source language.
Option b, definitely. You wrote a compiler without knowing anything about compilers-- imagine what you'll accomplish once you know what you're doing! Wink
(01-06-2021 06:34 AM)mpark Wrote: [ -> ]Option b, definitely. You wrote a compiler without knowing anything about compilers-- imagine what you'll accomplish once you know what you're doing! Wink

I too would recommend b. Tools like flex and bison can help tremendously. It's been a while since I used them last time but IIRC you get by with just cursory knowledge of the grammars theory.

Definitely option b). By contrast, it will quickly become much more complicated to complete option a) when the language gets more complex.

Option b) is not as difficult to comprehend as you may think.

You could implement this with a hand-written lexical analyzer and write a recursive descent parser to do the translation. Or use flex and bison. The first chapters of the https://en.wikipedia.org/wiki/Compilers:..._and_Tools aka the "(Red) Dragon Book" shows how. Here are some lecture notes to get started.

To limit the syntax to arithmetic syntax, operator precedence parsers are very handy and relatively easy to implement, see https://en.wikipedia.org/wiki/Operator-p...nce_parser

The syntax of Prolog is parsed with operator precedence parsing. The interesting part is that the language syntax itself can be extended on the fly by the user defining his/her prefix, index, and postfix operators.

Also Pakrat parsing has gained some popularity, see https://en.wikipedia.org/wiki/Parsing_ex...on_grammar

Hope this helps!
(01-06-2021 09:59 AM)vaklaff Wrote: [ -> ]Tools like flex and bison can help tremendously. It's been a while since I used them last time but IIRC you get by with just cursory knowledge of the grammars theory.

I second using Flex and Bison instead of rolling your own parser as it's *much* easier. You should get the book "flex & bison: Text Processing Tools" by by John Levine and published by O'Reilly. The book has a well written introduction to BNF grammars. I skipped that part though as I've read the "Dragon book" and due to my knowledge of theoretical computer science from uni. A fellow HP48 hacker and I were coding a "high-level assembler" for the Saturn processor with the "assembler" being a kind of C and Saturn assembly hybrid. I had bought both the paperback and Kindle editions of the book and it was a really great read -- it made coding the assembler / compiler or whatever-you-want-to-call-it *much* easier than rolling our own as the C99 grammar is very complicated. The project is on hold now as I've had numerous other things to deal with in my Real Life [TM], but I'll eventually get back to it and finish it, hopefully Wink

Regards,

Jonathan
Thank you to everyone for the links and great advice - option b) was definitely the best choice.

I've ported the source code for the "TINY" compiler presented in Compiler Construction, Principles and Practice to the calculator:

[Image: 50896808981_660dfcbc80.jpg]

Execution time is not too bad, with 1000! (2568 digits, on stack level 6) calculated in about 6 seconds:

[Image: 50896490623_85bd6b6928.jpg]

I haven't added scrolling yet so I reduced the font size to see more digits:

[Image: 50896097733_fffa246e02.jpg]

The "TINY" language has the following BNF grammar:

[Image: 50900539612_74fea8abef.jpg]

and the following extended BNF grammar:

[Image: 50900535492_85c2827121.jpg]

I added bytecode generation to the parser and this was pretty straightforward, except for if-then-else and repeat-until statements; for these I used a stack to keep track of the destination labels and jump-if-false bytecodes for nested statements.

I would now like to implement the "C-" language from the same book, which is more powerful than "TINY" ("C-" adds support for functions and arrays). Here is a factorial program written in "C-":

[Image: 50900418856_667b86024d.jpg]

"C-" has the following BNF grammar:

[Image: 50900538422_445b7496f8_z.jpg]
[Image: 50900418246_a36bc4a18b.jpg]

My impression is that a recursive descent parser is not too difficult to write provided the programming language is small and the grammar is written in extended BNF.

Flex and Bison are of course very powerful and useful tools but my philosophy with the "monic" calculator project has been to minimise the number of tools and libraries used.

So if I can rewrite the grammar for "C-" in extended BNF I can use the "TINY" parser as a guide to write a parser for "C-". If anyone is interested in helping me with this please let me know.

EDIT: Thank you to Jonathan Busby for translating the "C-" BNF grammar above into an ISO/IEC 14977 "standard" (see the Wikipedia article) EBNF grammar, which I will use to write the "C-" parser:

Code:
program = declaration-list ;
declaration-list = { declaration } , declaration ;
declaration = var-declaration , fun-declaration ;
var-declaration = type-specifier , ID , [ "[" NUM "]" ] , ";" ;
type-specifier = "int" | "void" ;
fun-declaration = type-specifier , ID , "(" , params , ")" , compound-stmt ;
params = ( { param , "," } , param ) | "void" ;
param = type-specifier , ID [ "[" , "]" ] ;
compound-stmt = "{" , local-declarations , statement-list , "}" ;
local-declarations = { var-declaration } ;
statement-list = { statement } ;
statement = expression-stmt | compound-stmt | selection-stmt | iteration-stmt | return-stmt ;
expression-stmt = [ expression ] , ";" ;
selection-stmt = if , "(" , expression , ")" , statement , [ "else" , statement ] ;
iteration-stmt =  while , "(" , expression , ")" , statement ;
return-stmt = "return" , [ expression ] , ";" ;
expression = { var , "=" } , simple-expression ;
simple-expression = { additive-expression , relop } , additive-expression ;
relop = "<=" | "<" | ">" | ">=" | "==" | "!=" ;
additive-expression = { term , addop } , term ;
addop = "+" | "-" ;
term = { factor , mulop } , factor ;
mulop = "*" | "/" ;
factor = "(" , expression , ")" | var | call | NUM ;
call = ID , "(" , args , ")" ;
args = [ { expression , "," } , expression ] ;
(02-01-2021 06:36 AM)F-73P Wrote: [ -> ][snip]

EDIT: Thank you to Jonathan Busby for translating the "C-" BNF grammar above into an ISO/IEC 14977 "standard" (see the Wikipedia article) EBNF grammar, which I will use to write the "C-" parser:

Code:
program = declaration-list ;
declaration-list = { declaration } , declaration ;
declaration = var-declaration , fun-declaration ;
var-declaration = type-specifier , ID , [ "[" NUM "]" ] , ";" ;
type-specifier = "int" | "void" ;
fun-declaration = type-specifier , ID , "(" , params , ")" , compound-stmt ;
params = ( { param , "," } , param ) | "void" ;
param = type-specifier , ID [ "[" , "]" ] ;
compound-stmt = "{" , local-declarations , statement-list , "}" ;
local-declarations = { var-declaration } ;
statement-list = { statement } ;
statement = expression-stmt | compound-stmt | selection-stmt | iteration-stmt | return-stmt ;
expression-stmt = [ expression ] , ";" ;
selection-stmt = if , "(" , expression , ")" , statement , [ "else" , statement ] ;
iteration-stmt =  while , "(" , expression , ")" , statement ;
return-stmt = "return" , [ expression ] , ";" ;
expression = { var , "=" } , simple-expression ;
simple-expression = { additive-expression , relop } , additive-expression ;
relop = "<=" | "<" | ">" | ">=" | "==" | "!=" ;
additive-expression = { term , addop } , term ;
addop = "+" | "-" ;
term = { factor , mulop } , factor ;
mulop = "*" | "/" ;
factor = "(" , expression , ")" | var | call | NUM ;
call = ID , "(" , args , ")" ;
args = [ { expression , "," } , expression ] ;

Note that the above EBNF grammar has a slight glitch, which is my fault. I accidentally incorrectly translated the "C-" BNF grammar arithmetic expressions into an EBNF grammar which made all the arithmetic expressions change from left-associative to right-associative. Here is the fix :

Code:
additive-expression = term , { addop , term } ;
term = factor , { mulop , factor } ;

Hope this doesn't cause any inconvenience or confusion.

N.B. Also, note that I assumed when writing the EBNF grammar that the Monic author would be implementing an LL(1) recursive descent parser, which reads its input from left to right and recurses in a depth-first manner. If this is not the case, then the EBNF grammar will need to be modified.

Regards,

Jonathan
Well, I found more absent-minded stupid associativity errors in the EBNF grammar for C- . Here is the corrected grammar :

Code:
program = declaration-list ;
declaration-list = declaration , { declaration } ;
declaration = var-declaration , fun-declaration ;
var-declaration = type-specifier , ID , [ "[" NUM "]" ] , ";" ;
type-specifier = "int" | "void" ;
fun-declaration = type-specifier , ID , "(" , params , ")" , compound-stmt ;
params = ( param , { "," , param } ) | void ;
param = type-specifier , ID [ "[" , "]" ] ;
compound-stmt = "{" , local-declarations , statement-list , "}" ;
local-declarations = { var-declaration } ;
statement-list = [ statement , { statement } ] ;
statement = expression-stmt | compound-stmt | selection-stmt | iteration-stmt | return-stmt ;
expression-stmt = [ expression ] , ";" ;
selection-stmt = if , "(" , expression , ")" , statement , [ "else" , statement ] ;
iteration-stmt =  while , "(" , expression , ")" , statement ;
return-stmt = "return" , [ expression ] , ";" ;
expression =  { var , "=" } , simple-expression ;
var = ID , [ "[" , expression , "]" ] ;
simple-expression = additive-expression , { relop , additive-expression } ;
relop = "<=" | "<" | ">" | ">=" | "==" | "!=" ;
additive-expression = term , { addop , term } ;
addop = "+" | "-" ;
term = factor , { mulop , factor } ;
mulop = "*" | "/" ;
factor = "(" , expression , ")" | var | call | NUM ;
call = ID , "(" , args , ")" ;
args = [ expression , { "," , expression } ] ;

Sorry for any inconvenience.

Regards,

Jonathan

NOTE #1 : I had also accidentally left out some non-superfluous non-terminals. Also, the "C-" assignment operator ( "=" ) is *right associative*, just like in full blown C. So, if that part of the grammar looks like it needs correcting, then just know that it doesn't because I intentionally made the EBNF version right associative to be fully compatible with the "C-" BNF grammar.

EDIT #1 : Incorporated changes mentioned in later posts into above EBNF grammar.
(02-12-2021 11:01 PM)Jonathan Busby Wrote: [ -> ]Well, I found more absent-minded stupid associativity errors in the EBNF grammar for C- . Here is the corrected grammar :

Code:
program = declaration-list ;
declaration-list = declaration , { declaration } ;
declaration = var-declaration , fun-declaration ;
var-declaration = type-specifier , ID , [ "[" NUM "]" ] , ";" ;
type-specifier = "int" | "void" ;
fun-declaration = type-specifier , ID , "(" , params , ")" , compound-stmt ;
params = ( param , { "," , param } ) | void ;
param = type-specifier , ID [ "[" , "]" ] ;
compound-stmt = "{" , local-declarations , statement-list , "}" ;
local-declarations = { var-declaration } ;
statement-list = [ statement , { statement } ] ;
statement = expression-stmt | compound-stmt | selection-stmt | iteration-stmt | return-stmt ;
expression-stmt = [ expression ] , ";" ;
selection-stmt = if , "(" , expression , ")" , statement , [ "else" , statement ] ;
iteration-stmt =  while , "(" , expression , ")" , statement ;
return-stmt = "return" , [ expression ] , ";" ;
expression =  { var , "=" } , simple-expression ;
var = ID , [ "[" , expression , "]" ] ;
simple-expression = additive-expression , { relop , additive-expression } ;
relop = "<=" | "<" | ">" | ">=" | "==" | "!=" ;
additive-expression = term , { addop , term } ;
addop = "+" | "-" ;
term = factor , { mulop , factor } ;
mulop = "*" | "/" ;
factor = "(" , expression , ")" | var | call | NUM ;
call = ID , "(" , args , ")" ;
args = [ expression , { "," , expression } ] ;

Sorry for any inconvenience.

Regards,

Jonathan

NOTE #1 : I had also accidentally left out some non-superfluous non-terminals. Also, the "C-" assignment operator ( "=" ) is *right associative*, just like in full blown C. So, if that part of the grammar looks like it needs correcting, then just know that it doesn't because I intentionally made the EBNF version right associative to be fully compatible with the "C-" BNF grammar.

The following can be simplified :

Code:
statement-list = statement , { statement } | empty ;

The simplified version is :

Code:
statement-list = [ statement , { statement } ] ;

Regards,

Jonathan

EDIT #1 : Incorporated above changes into EBNF grammar
Great work Jonathan!

I've added "++" and "--" to the grammar, as well as "for" loops. I've modified the scanner accordingly and completed a parser recogniser.

Now for the fun part: adding code generation.
(02-14-2021 11:40 AM)F-73P Wrote: [ -> ]I've added "++" and "--" to the grammar

Just make sure you used the correct associativity and precedence for the post and pre increment and decrement operators. The post increment and decrement operators ( eg. "x++" and "x--" ) are one level of operator precedence up from the pre increment and decrement operators and they're also left associative. For the lower precedence pre increment and decrement operators ( eg. "++x" and "--x" ) note they're right associative. Furthermore, you have to take into account the differences between the two classes of post and pre increment and decrement operators with respect to how they alter an expression. What I mean is that something like "y = x++" will assign x to y and then increment x whereas "y = ++x" will increment x first and then assign it to y. You should consult the C99 grammar which can be found here ( Look in Annex A ). If you already know all of this, then please don't interpret my comments as patronizing you in any way, I just like to be sure Smile

Regards,

Jonathan
Thanks Jonathan for your helpful advice, always appreciated.

I removed "int" from the type_specifier grammar production and replaced it with "apq" (arbitrary precision rational), "apf" (arbitrary precision float - not supported on the calculator yet) and "int32_t" (maybe?).

I also added "read" and "write" from the TINY programming language, and post increment and decrement operators (just to increment/decrement variables for now, not used in assignments).

There were a few aspects of the original BNF grammar I found confusing:

1) expression_stmt -> expression ; | ;

I couldn't figure out what the ; on its own is for, so I changed the production to

expression_stmt -> expression ;

2) expression -> var = expression | simple expression

At first I thought this was a typo as I couldn't see how you can have e.g.

expression -> var = expression
Replace non-terminal expression with var = expression to obtain:
expression -> var = var = expression

e.g. a=b=100

and so I changed the production to

expression -> var = simple expression | simple expression

But then I saw your comment explaining that = is right-associative and that a=b=100 is indeed legal. So I'll go back and fix that Smile

I added some code generation instructions to the parser and now simple C programs like the factorial program below can be executed:

[Image: 50951738358_9280a3f65a.jpg]
(02-17-2021 09:14 AM)F-73P Wrote: [ -> ]Thanks Jonathan for your helpful advice, always appreciated.


You're welcome Smile

Quote:I removed "int" from the type_specifier grammar production and replaced it with "apq" (arbitrary precision rational), "apf" (arbitrary precision float - not supported on the calculator yet) and "int32_t" (maybe?).

I also added "read" and "write" from the TINY programming language, and post increment and decrement operators (just to increment/decrement variables for now, not used in assignments).


Nice Smile Keep up the good work Smile

Quote:There were a few aspects of the original BNF grammar I found confusing:

1) expression_stmt -> expression ; | ;

I couldn't figure out what the ; on its own is for, so I changed the production to

expression_stmt -> expression ;

This is to allow empty expressions. You need this in cases such as :

Code:
while(x != somevalue) ;

Which could be used for busy waiting ( eg. a "spinlock" )

Quote:2) expression -> var = expression | simple expression

At first I thought this was a typo as I couldn't see how you can have e.g.

expression -> var = expression
Replace non-terminal expression with var = expression to obtain:
expression -> var = var = expression

e.g. a=b=100

and so I changed the production to

expression -> var = simple expression | simple expression

But then I saw your comment explaining that = is right-associative and that a=b=100 is indeed legal. So I'll go back and fix that Smile

Yep Smile In the EBNF grammar it is :

Code:
expression =  { var , "=" } , simple-expression ;

Quote:I added some code generation instructions to the parser and now simple C programs like the factorial program below can be executed:

[Image: 50951738358_9280a3f65a.jpg]

Nice work! Smile I look forward to seeing how your project evolves Smile

Regards,

Jonathan
(02-17-2021 10:48 PM)Jonathan Busby Wrote: [ -> ]
Quote:There were a few aspects of the original BNF grammar I found confusing:

1) expression_stmt -> expression ; | ;

I couldn't figure out what the ; on its own is for, so I changed the production to

expression_stmt -> expression ;

This is to allow empty expressions. You need this in cases such as :

Code:
while(x != somevalue) ;

Which could be used for busy waiting ( eg. a "spinlock" )

The above was just a random example. If you google for eg. "C empty expression uses" then you'll get many examples of different uses of "null" or empty expressions other than my simple ( and somewhat contrived ) example.

Regards,

Jonathan
(02-18-2021 06:53 PM)Jonathan Busby Wrote: [ -> ]
(02-17-2021 10:48 PM)Jonathan Busby Wrote: [ -> ]This is to allow empty expressions. You need this in cases such as :

Code:
while(x != somevalue) ;

Which could be used for busy waiting ( eg. a "spinlock" )

The above was just a random example. If you google for eg. "C empty expression uses" then you'll get many examples of different uses of "null" or empty expressions other than my simple ( and somewhat contrived ) example.

Regards,

Jonathan

Also, googling for "C while loop empty loop statement" turns up a number of excellent examples of the usage of while loops with empty loop statements ( ie. "while(something) ;" ), especially the StackOverflow answers.

Regards,

Jonathan
I've created a new type specifier ("local") for local variables (which can be declared at the beginning of a function and are valid only in that function) and another type specifier for arrays ("array"). I've removed the other type specifiers from the grammar, as the first word of each object pushed onto the stack determines the object type. I've also added Boolean data types. Below is the EBNF grammar for the mC programming language (tokens are in uppercase/quotation marks):

Code:

program -> code ENDFILE

code -> { array-declaration } { function-declaration } main-function-declaration

array-declaration -> ARRAY ID dimensions SEMICOLON

dimensions -> LEFTBRACKET NUM RIGHTBRACKET { LEFTBRACKET NUM RIGHTBRACKET }

function-declaration -> ID LEFTPARENTHESIS params RIGHTPARENTHESIS LEFTBRACE local-declarations statement-list RIGHTBRACE

params -> [ param { COMMA param } ]

param -> ID { LEFTBRACKET NUM RIGHTBRACKET }

main-function-declaration -> MAIN LEFTBRACE local-declarations statement-list RIGHTBRACE

local-declarations -> { LOCAL var-declaration }

var-declaration -> ID SEMICOLON | array-declaration

statement-list -> { statement }

statement -> compound-stmt | if-stmt | while-stmt | do-stmt | for-stmt | switch-stmt | return-stmt | read-stmt | write-stmt | assignment-stmt | expression-stmt | goto-stmt | label-stmt | nop-stmt 

compound-stmt -> LEFTBRACE statement-list RIGHTBRACE

if-stmt -> IF LEFTPARENTHESIS Boolean-expression RIGHTPARENTHESIS statement [ ELSE statement ]

while-stmt -> WHILE LEFTPARENTHESIS Boolean-expression RIGHTPARENTHESIS statement

do-stmt -> DO statement WHILE LEFTPARENTHESIS Boolean-expression RIGHTPARENTHESIS SEMICOLON

for-stmt -> FOR LEFTPARENTHESIS [ assignment ] SEMICOLON Boolean-expression SEMICOLON [ addditive-expression ] RIGHTPARENTHESIS statement 

switch-stmt -> SWITCH LEFTPARENTHESIS ID { LEFTBRACKET additive-expression RIGHTBRACKET } RIGHTPARENTHESIS LEFTBRACE { CASE NUM COLON statement-list BREAK SEMICOLON } [ DEFAULT COLON statement-list ] RIGHTBRACE  

return-stmt -> RETURN [ expression ] SEMICOLON    

read-stmt -> READ ID { LEFTBRACKET additive-expression RIGHTBRACKET } SEMICOLON 

write-stmt -> WRITE expression SEMICOLON 

assignment-stmt -> assignment SEMICOLON

expression-stmt -> [ expression ] SEMICOLON

goto-stmt -> GOTO ID SEMICOLON

label-stmt -> ID COLON statement

nop-stmt -> NOPR SEMICOLON

assignment -> ID { LEFTBRACKET additive-expression RIGHTBRACKET } "=" expression 

expression -> Boolean-expression | additive-expression
 
Boolean-expression -> logical-expression { boolop logical-expression } 

logical-expression -> additive-expression relop additive-expression | ID { LEFTBRACKET additive-expression RIGHTBRACKET } | TRUE | FALSE 

additive-expression -> term { addop term }

term -> [ NEGATE ] factor { mulop [ NEGATE ] factor }

factor -> LEFTPARENTHESIS additive-expression RIGHTPARENTHESIS | NUM | ID { LEFTBRACKET additive-expression RIGHTBRACKET } | ID LEFTPARENTHESIS args RIGHTPARENTHESIS | ID "++" | ID "--" 

args -> [ expression { COMMA expression } ]

boolop -> "and" | "or" | "not"

relop -> "<=" | "<" | ">" | ">=" | "==" | "!=" 

addop -> "+" | "-"

mulop -> "*" | "/"

Some test programs:

1) factorial using a while loop

Code:

/*factorial program 1*/

main{
   read x;                          
   factorial=1;
   while (x>0) {
     factorial = factorial*x;
     x--;                          
   }
   write factorial;
}

2) factorial using recursion

Code:

/*factorial program 2*/

fact(x){
   if (x==1) return 1;
   else return x*fact(x-1);
}  

main{                          
    read y;
    if (y>1) fact(y);
    else write 1;                        
}

3) squaring a number using nested for loops

Code:

/*square program*/

main{
   read x;
   product=0;    
   for(i=x;i>=1;i--) {
      for(j=1;j<=x;j++) {
        product++;
      }
   }                          
   write product;
}
As the compiler for the programming language (mC) introduced in the previous post nears completion, I would like to provide a brief overview of how the compiler works and share some source code.

The compiler is comprised of 3 main components - the scanner, parser and virtual machine. In this post I'll describe the scanner. Below is the C code for the scanner (which is based on the scanner in Kenneth C. Louden's excellent "Compiler Construction:Principles and Practice"):

Code:

//scanner.c v1.00

//DFA scanner - read the source code and return the next token to the parser

//includes

#include <stdbool.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

#include "definitions.h"

//structs

struct { //lookup table of reserved words
char* str;
TokenType tok;
} reservedWords[RESERVED] = {{"main",MAIN},{"if",IF},{"else",ELSE},{"do",DO},{"while",WHILE},{"for",FOR}, {"return",RETURN},
                             {"read",READ},{"write",WRITE},{"and",AND},{"or",OR},{"not",NOT},{"true",TRUE},
                             {"false",FALSE},{"local",LOCAL},{"array",ARRAY},{"switch",SWITCH},
                             {"case",CASE},{"break",BREAK},{"default",DEFAULT},{"goto",GOTO},{"nop",NOPR}};
//global variables    
                                                         
char source[SOURCE+1]={"/*factorial program*/\n" //example program 

                     "main {\n"
                          "read x;\n"                            
                          "factorial=1;\n"
                          "while (x>0) {\n"
                            "factorial = factorial*x;\n"
                            "x--;\n"
                        "}\n"
                        "write factorial;\n"
                      "}\n"                              
                     };

uint32_t sourceIndex=0; //current position in source
uint16_t lineno=0; //current line number in source    
char lexemeString[MAXLEXEMELEN+1];

//prototypes
                                         
TokenType reservedLookup(char *lexemeString);

//functions

TokenType getToken(void) { //returns the next token in source    
    char c;    //current character in source
    
    uint16_t lexemeStringIndex = 0; //index for storing lexeme into lexemeString
    _Bool save; //flag to indicate save to lexemeString
    
    TokenType currentToken; //holds current token to be returned to the parser    
  
    StateType state = START; //current state - always begins at START

  while (state!=DONE) {
        c=source[sourceIndex++]; //next character in program
        save = true;
        switch (state) { //begin switch (state)
            case START:
                if (isdigit(c))
                    state = INNUM;
         else if (isalpha(c))
           state = INID;
                 else if (c == '<')
                     state = INLESSTHAN;
                 else if (c == '>')
                     state = INGREATERTHAN;
                 else if (c == '=')
                     state = INEQUAL;
                 else if (c == '!')
                     state = INNOTEQUAL;
                 else if (c == '/') {
                     save = false;
                     state = INDIV;
                 }
                 else if (c =='+') {
                     state = INPLUS;
                 }
                 else if (c =='-') {
                     state = INMINUS;
                 }
         else if ((c == ' ') || (c == '\t') || (c == '\n')) {
                     save = false;
                     if (c == '\n') lineno++;
                 }
         else
         { state = DONE;
           switch (c) {
                         case ENDFILE:
               save = false;
               currentToken = ENDFILE;
               break;
             case '*':
               currentToken = TIMES;
               break;
                         case ':':
               currentToken = COLON;
               break;
             case ';':
               currentToken = SEMI;
               break;
             case ',':
                             currentToken = COMMA;
               break;
             case '(':
               currentToken = LPAREN;
               break;
             case ')':
               currentToken = RPAREN;
               break;
                         case '[':
               currentToken = LBRACKET;
               break;
             case ']':
               currentToken = RBRACKET;
               break;
                         case '{':
               currentToken = LBRACE;
               break;
             case '}':
               currentToken = RBRACE;
               break;
                         case '&':
               currentToken = ADDR;
               break;
             default:
                 //insert error code
                 state = DONE;
           }
         }
         break;
            case INNUM:
         if (!isdigit(c)) { //backup in the input
                     if(c == 'e') {
                         state=INEXP;
                     }
           else if(c != '.') {
                         sourceIndex--;
                         save = false;
                         state = DONE;
                         currentToken = NUM;
                     }
         }
         break;
            case INEXP:
                state=INNUM;
                break;
      case INID: //ID is letters/digits (cannot start with a digit)
         if (!isalpha(c)) { //backup in the input
                     if (!isdigit(c)) {
                         sourceIndex--;
                         save = false;
                         state = DONE;
                         currentToken = ID;
                     }
         }
         break;
            case INLESSTHAN:
              state = DONE;
              if (c == '=') currentToken = LTE;
              else {
          save = false;
                    sourceIndex--;
                    currentToken = LT;
        }
              break;
            case INGREATERTHAN:
              state = DONE;
              if (c == '=') currentToken = GTE;
              else {
          save = false;
                    sourceIndex--;
                    currentToken = GT;
        }
              break;
            case INEQUAL:
                state = DONE;
              if (c == '=') currentToken = EQ;
              else {
                    sourceIndex--;
                    save=false;
                    currentToken = ASSIGN;
                }
                break;
            case INNOTEQUAL:
                state = DONE;
              if (c == '=') currentToken = NEQ;
              else {
                    //insert error code
                    state = DONE;
                }
              break;
            case INDIV:
        save = false;
              if (c == '*') state = INCOMMENT;
              else {
                    state = DONE;
                    sourceIndex--;
                    currentToken = DIV;                        
                    if (lexemeStringIndex < MAXLEXEMELEN) {
                        lexemeString[lexemeStringIndex++] = '/';
                        lexemeString[lexemeStringIndex] = '\0';
                    }
                    else {
                        //insert error code
                    }    
        }
              break;
            case INPLUS:
        state = DONE;
              if (c == '+') currentToken = INC;
              else {
          save = false;
                    sourceIndex--;
                    currentToken = PLUS;
        }
              break;
            case INMINUS:
        state = DONE;
              if (c == '-') currentToken = DEC;
              else {
          save = false;
                    sourceIndex--;
                    currentToken = MINUS;
        }
              break;
      case INCOMMENT:
              save = false;
              if (c == ENDFILE) { 
                    state = DONE;
                  currentToken = ENDFILE;
              }
              else if (c == '*')
                  state = OUTCOMMENT;
                else if (c == '\n') lineno++;
              break;
            case OUTCOMMENT:
                save = false;
                if (c == '/') state = START;
              else { //backup in the input
           sourceIndex--;
           state = INCOMMENT;
        }
                break;
      case DONE:
      default: //should never happen
               //insert error code
                 state = DONE;
         break;
    } //end switch (state)

        if (save) {
            if (lexemeStringIndex < MAXLEXEMELEN) lexemeString[lexemeStringIndex++] = c;
      else {
                //insert error code
            }                
        }

    if (state == DONE) {
            lexemeString[lexemeStringIndex] = '\0';
       if (currentToken == ID) currentToken = reservedLookup(lexemeString);
        }
  }
    return currentToken;
}

TokenType reservedLookup (char *lexemeString) { //lookup an identifier to see if it is a reserved word (linear search)
    uint16_t i;
    for (i=0;i<RESERVED;i++) {
        if (!strcmp(lexemeString,reservedWords[i].str)) return reservedWords[i].tok;
    }
    return ID;
}

The code for the "definitions.h" header file is:

Code:

//definitions.h v1.00

//tokens
typedef enum { ENDFILE, //book-keeping tokens
    MAIN,IF,ELSE,DO,WHILE,FOR,RETURN,READ,WRITE,AND,OR,NOT,TRUE,FALSE,LOCAL,ARR​AY,SWITCH, //reserved words
      CASE,BREAK,DEFAULT,GOTO,NOPR, //reserved words
    ID,NUM, //multicharacter tokens
    //special symbols
    PLUS,MINUS,TIMES,DIV,LT,LTE,GT,GTE,EQ,NEQ,ASSIGN,COLON,SEMI,COMMA,LPAREN,RP​AREN,LBRACKET,RBRACKET,
    LBRACE,RBRACE,INC,DEC,ADDR
} TokenType;

//states in scanner DFA
typedef enum {START,INNUM,INEXP,INID,INLESSTHAN,INGREATERTHAN,INEQUAL,INNOTEQUAL,INDIV,I​NPLUS,INMINUS,
              INCOMMENT,OUTCOMMENT,DONE} StateType;

//definitions
#define RESERVED 22 //the number of reserved words
#define SOURCE 1000 //maximum character length of source code
#define MAXLEXEMELEN 40 //maximum length of lexeme

The program source code is an array of characters terminated by the null character (i.e. a string). To improve readability, the string is padded with white space (blanks,tabs and newlines). Comments (text between /* and */) are ignored. The source code for the example program in "scanner.c v1.00" is:

Code:

/*factorial program 1*/

main{
   read x;                          
   factorial=1;
   while (x>0) {
     factorial = factorial*x;
     x--;                          
   }
   write factorial;
}

The numbers,letters,words and special symbols that make up the source code are called "lexemes". Each lexeme belongs to a special category called a "token". The scanner extracts each lexeme from the source code and determine its token type.

To test the scanner, add the following two files to the project:

Code:

//compiler.c v1.00

//prototypes

void parse(void);

//functions

int main (void) {
    parse();    
}

Code:

//parser.c v1.00

#include "definitions.h"

//global variables

extern char lexemeString[MAXLEXEMELEN+1];

//prototypes

TokenType getToken(void);

//functions

void parse(void) {
    TokenType token;
    
    while(token!=ENDFILE) {
        token=getToken();
    }        
}

Build the project, start a Debug Session (I use the Keil uVision5 simulator) and repeatedly step over the line "token=getToken();" in "parser.c" to extract the lexemes and determine the corresponding token (the lexemes are stored in the variable "lexemeString"):

Code:

lexeme                                                token

main                                                  MAIN
{                                                     LBRACE
read                                                  READ
x                                                     ID
;                                                     SEMI
factorial                                             ID
=                                                     ASSIGN
1                                                     NUM
;                                                     SEMI
while                                                 WHILE
(                                                     LPAREN
x                                                     ID
>                                                     GT
0                                                     NUM
)                                                     RPAREN
{                                                     LBRACE
factorial                                             ID
=                                                     ASSIGN
factorial                                             ID
*                                                     TIMES
x                                                     ID
;                                                     SEMI
x                                                     ID
--                                                    DEC
;                                                     SEMI
}                                                     RBRACE
write                                                 WRITE
factorial                                             ID
;                                                     SEMI
}                                                     RBRACE
0                                                     ENDFILE

This information is passed to the parser, which will be the subject of the next post.
Pages: 1 2
Reference URL's