Brass 2 source syntax - preliminary document

Text between -> and <- is an aside.

==============================================================================================================

Basic syntax:

==============================================================================================================

GENERAL

1. Label, module, and directive names are *not* case sensitive. They never will be, and there will never be a
switch for this. Case sensitivity only serves to confuse matters, in my opinion.

==============================================================================================================

TOKENS

At the atomic level, a line of source is made up of tokens.
A token is represented by a series of characters that the parser attaches some significance to.
There are a variety of different sorts of token:

1. Constant token
This is a text-based token that the assembler doesn't attach any specific importance to. This could be, for
example, {LD}, {$7463} or {AF'} (braces for illustrative purposes).
You can try to extract a numeric value from it. If it's clearly a numeric constant you get that value back;
if not, it's treated as a label name and the label is looked up.
For the {=} assignment operator, if no label exists one will be created (see section on expressions).

2. String token
Unlike the constant token, this is recognised as being a string constant. Escape sequences in it are processed
(same escape characters as TASM - currently no support for Unicode escape sequences but I will probably add them).
Strings start and end with a matching " or ' symbol. Strings using the double quotes (") are run through an
extra step in which each character value can be remapped to another (user-defined) value.
Examples: {'a'}, {'\n'}, {"AB\tCD"}
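
As a rough illustration of the two steps (unescape, then remap double-quoted strings only), here is a minimal
Python sketch. The function name, the escape table and the remap dictionary are all assumptions made for the
example; the real escape set is whatever TASM accepts, and only a small guessed subset is listed here.

```python
# Assumed subset of escape characters; the real table follows TASM.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\", "'": "'", '"': '"'}


def string_token_values(token, remap=None):
    """Return the character values of a quoted string token."""
    if len(token) < 2 or token[0] not in "\"'" or token[-1] != token[0]:
        raise ValueError("string must start and end with a matching quote")
    quote, body = token[0], token[1:-1]
    values, i = [], 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            # Unknown escapes fall back to the escaped character itself.
            ch = ESCAPES.get(body[i + 1], body[i + 1])
            i += 1
        values.append(ord(ch))
        i += 1
    if quote == '"' and remap:
        # Only double-quoted strings go through the user-defined remapping.
        values = [remap.get(v, v) for v in values]
    return values
```

With a remap of {65: 1}, the token {"AB\tCD"} would give 1, 66, 9, 67, 68, while {'A'} would still give 65.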

3. Operator token
This is a mathematical operator. Once swallowed by the parser the string representation of the operator is
ignored - the token is given a value from an enumeration.
The currently supported operators are:
{*}, {/}, {+}, {-}, {%}, {|}, {&}, {^}, {>}, {<}, {!}, {~}, {?}, {:}, {>>}, {<<}, {==}, {!=}, {>=}, {<=},
{**}, {&&}, {||}, {=}, {+=}, {-=}, {*=}, {/=}, {%=}, {&=}, {|=}, {^=}, {<<=}, {>>=}, {++}, {--}
Most of them should be recognisable from their C counterparts.

{?} and {:} are ternary conditional operators -> current note: ternary operators not yet supported <-
{**} is a power operator (2**8 returns 256).
See the section below on expressions for more information about the mathematics involved.

4. Punctuation token
These represent an item of punctuation. Recognised items of punctuation are:
{,} (Comma), {[} (OpenBracket), {]} (CloseBracket), {(} (OpenParenthesis),
{)} (CloseParenthesis), {\} (LineBreak)

5. Comment token
An entire comment is wrapped up into a single token. C-style (/* */) and assembly-style (;) will both be
supported. For example, {/* Comment */} or {; Comment}

All tokens know their index in the original string and their basic string representation.

Operator tokens also know their "parenthesis" index - how many parentheses have been opened but not yet closed
between the start of the source line and the operator. For example:
 
 .  . .  .  .   .  .  .
3+(2+4/(4-3)+((5*4)/2)-1)
 0  1 1  2  1   3  2  1  <- each operator token's parenthesis index.

This is used by the expression parser.
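
A small Python sketch of how the parenthesis index could be computed for the example above. It scans raw
text rather than real tokens, so it ignores strings, comments and prefix characters such as % (all of which
a real tokeniser would have to deal with first); the function name is made up for the illustration.

```python
# Operators from the list above, longest first so that e.g. "<<=" wins over "<<".
OPERATORS = sorted(
    ["*", "/", "+", "-", "%", "|", "&", "^", ">", "<", "!", "~", "?", ":",
     ">>", "<<", "==", "!=", ">=", "<=", "**", "&&", "||", "=", "+=", "-=",
     "*=", "/=", "%=", "&=", "|=", "^=", "<<=", ">>=", "++", "--"],
    key=len, reverse=True)


def operator_depths(source):
    """Return (operator, string index, parenthesis index) for each operator."""
    depth, i, found = 0, 0, []
    while i < len(source):
        if source[i] == "(":
            depth += 1
            i += 1
        elif source[i] == ")":
            depth -= 1
            i += 1
        else:
            op = next((o for o in OPERATORS if source.startswith(o, i)), None)
            if op:
                found.append((op, i, depth))
                i += len(op)
            else:
                i += 1  # part of a constant, or whitespace
    return found
```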

==============================================================================================================

EXPRESSIONS

Some tokens can be grouped together to form entire expressions. An expression of more than one token will be
a mathematical expression. An expression with one token might or might not be.

A single-token expression could be any token, so they are not especially interesting. If you had the following
sequence of tokens, though:

ld a, (50 * -5) / $10 ; Load something into the accumulator

...you would end up with these expressions.

1. Constant    {ld}
2. Constant    {a}
3. Punctuation {Comma}
4. Punctuation {OpenParenthesis}
   Constant    {50}
   Operator    {Multiplication}
   Operator    {UnarySubtraction}
   Constant    {5}
   Punctuation {CloseParenthesis}
   Operator    {Division}
   Constant    {$10}
5. Comment     {; Load something into the accumulator}

4 is the one of interest, being made up of multiple tokens.
Expressions can be evaluated. The process for this is rather involved:

1. Cycle through the sequence of tokens. Create a list of all operator tokens, and strip out punctuation.

2. Sort the revised list of operators (OperatorToken : IComparable). The sort takes into consideration these
   factors, in order:
   - Are the parenthesis indices different? If so, compare them and return the result.
   - Are the operators different? If so, compare them and return the result.
   - Compare the token's index within the original string and return the result. For right-associative
     operators (unary and assignment, see the notes after the table), the further right it is the higher
     precedence it is. For all other binary operators, the further left it is the higher precedence.
     
     -~1   <- the ~ has precedence over the -
     1+2-3 <- the + has precedence over the -
     
     See table below for operator order of precedence.
        
3. Go through the sorted list of operators. If it's binary:

       {4} {+} {5}
        |___ ___|
            |______ Grab the outer tokens and perform the operation on them.
            
       {4} {9} {5}
            |______ Replace the operator with the result
            
        .  {9}  .
        |_______|__ Delete the two constants.
		
   If it's unary:

        {-} {1}
             |___ Grab the rightmost token and perform the operation on it.
		     
         .  {-1}
              |__ Replace the token with the result and delete the operator.

4. Return the only remaining token's value.
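
The four steps above can be sketched in Python roughly as follows. This is only an illustration under a few
assumptions: tokens arrive as a plain list of numbers and strings, only a handful of operators are included,
"different operators" is taken to mean "different rows of the precedence table", and an operator counts as
unary when no value or closing parenthesis comes before it. The names (Op, sort_key, evaluate) are invented
for the example and are not Brass 2's own.

```python
import operator
from typing import NamedTuple


class Op(NamedTuple):
    symbol: str
    unary: bool
    depth: int  # parenthesis index
    index: int  # position in the original token sequence


UNARY = {"+": operator.pos, "-": operator.neg, "~": operator.invert}
BINARY = {  # symbol -> (precedence rank, function); lower rank is evaluated first
    "**": (1, operator.pow),
    "*": (2, operator.mul), "/": (2, operator.truediv), "%": (2, operator.mod),
    "+": (3, operator.add), "-": (3, operator.sub),
    "<<": (4, operator.lshift), ">>": (4, operator.rshift),
}


def sort_key(op):
    rank = 0 if op.unary else BINARY[op.symbol][0]
    # Deepest parentheses first, then precedence, then position:
    # rightmost first for unary operators, leftmost first for binary ones.
    return (-op.depth, rank, -op.index if op.unary else op.index)


def evaluate(tokens):
    # Step 1: note each operator's parenthesis index and strip the punctuation.
    items, depth = [], 0
    for index, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
        elif isinstance(tok, str):
            prev = tokens[index - 1] if index else "("
            unary = isinstance(prev, str) and prev != ")"
            items.append(Op(tok, unary, depth, index))
        else:
            items.append(tok)
    # Steps 2 and 3: sort the operators, then collapse them one by one.
    for op in sorted((i for i in items if isinstance(i, Op)), key=sort_key):
        pos = items.index(op)
        if op.unary:
            items[pos:pos + 2] = [UNARY[op.symbol](items[pos + 1])]
        else:
            func = BINARY[op.symbol][1]
            items[pos - 1:pos + 2] = [func(items[pos - 1], items[pos + 1])]
    # Step 4: a single value is left.
    return items[0]
```

For the earlier example, evaluate(["(", 50, "*", "-", 5, ")", "/", 16]) works through the unary minus, then
the multiplication inside the parentheses, then the division.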

Order of precedence:

                +-----------------+-----------------------------------+-------------------+
             -> | Unary           | + - ! ~ ++ --                     | Higher precedence |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Power           | **                                |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Multiplicative  | * / %                             |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Additive        | + -                               |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Shift           | << >>                             |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Relational      | < > <= >=                         |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Equality        | == !=                             |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Logical         | &                                 |                   |
                |                 | ^                                 |                   |
                |                 | |                                 |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Conditional     | &&                                |                   |
                |                 | ||                                |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
             -> | Assignment      | = += -= *= /= %= &= |= ^= <<= >>= | Lower precedence  |
                +-----------------+-----------------------------------+-------------------+
                    
Note that both unary and assignment operators are regarded as being right-associative; that is, they are
evaluated from right-to-left as opposed to from left-to-right when compared to operators of the same
precedence and at the same parenthesis level.

Note that ++ and -- are both prefix unary operators. That is to say, ++x works, but x++ doesn't.

For a constant token to be able to be evaluated it needs to contain either:
    - a constant numeric value, or
    - a label's name.

The parser will try to evaluate it as a numeric constant before trying to look it up as a label.

Numeric constants can have either a prefix or a suffix to denote a base. You may not have both.

                                        +------+--------+--------+
                                        | Base | Prefix | Suffix |
                                        +------+--------+--------+
                                        |    2 |   %    |   b    |
                                        |    8 |   @    |   o    |
                                        |   10 |        |   d    |
                                        |   16 |   $    |   h    |
                                        +------+--------+--------+

All suffixes can be in lower or upper case. As base 10 has no prefix, a number without prefix or suffix is
assumed to be a decimal constant.
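
A possible Python sketch of the prefix/suffix rules in the table. Two things here are assumptions rather than
anything decided above: a suffixed constant is required to start with a decimal digit (so that a name such as
"ah" is not mistaken for a number), and a prefixed constant is read entirely in the prefix's base (so "$1d"
is hexadecimal 1D). Returning None stands for "not a numeric constant, try a label instead".

```python
PREFIXES = {"%": 2, "@": 8, "$": 16}
SUFFIXES = {"b": 2, "o": 8, "d": 10, "h": 16}
DIGITS = "0123456789abcdef"


def _digits(text, base):
    """True if text is non-empty and only uses digits valid in base."""
    return bool(text) and all(c in DIGITS[:base] for c in text)


def parse_numeric_constant(text):
    """Return the value of a numeric constant, or None if text is not one."""
    body = text.lower()  # suffixes and hex digits may be in either case
    base = 10
    if body[:1] in PREFIXES:
        base, body = PREFIXES[body[0]], body[1:]
    elif body[:1].isdigit() and body[-1:] in SUFFIXES:
        # Assumption: a suffixed constant starts with a decimal digit.
        base, body = SUFFIXES[body[-1]], body[:-1]
    if _digits(body, base):
        return int(body, base)
    whole, dot, fraction = body.partition(".")
    if base == 10 and dot and _digits(whole + fraction, 10):
        return float(body)  # e.g. 1.1
    return None
```

Note that a lone $ comes back as None, which is what lets it fall through to the label lookup.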

The assignment operators will try and write back the result of the operation to the argument to the left of
the operator. For example (running Brass 2 in interactive mode):

    > $ = 10
    = 10
    > $ *= 2
    = 20
    > $ *= 2
    = 40
    > $ *= 2
    = 80
    > $ += $
    = 160
    > $ = 1.1
    = 1.1
    > $ *= $
    = 1.21
    > $ *= $
    = 1.4641
    > $ *= $
    = 2.14358881
    > $ *= $
    = 4.59497298635722
    > 4 = 3
    E Cannot assign to constants
    
(Here $ represents the label that is used to store the current instruction pointer).
 
Unlike Brass 1, where '=' was treated as a 'magic' directive (aliased to .equ), the new Brass 2 expression
parser handles assignment operators natively.
 
 -> Shouldn't be a problem: Brass 1 didn't accept = as an operator at all as far as I can tell, so potential
    pitfalls with code such as ".if x=1" shouldn't crop up. <-

As far as label names are concerned - if you put a colon on the END of the name, it means you are referring
to the label's internal value. If you put a colon at the BEGINNING of the name, it is assumed that you are
referring to the label's page number.
If you omit the colon completely it is assumed that you are referring to the value.

Hence:
    $  = 1 <- sets value of current instruction pointer to 1
    $: = 2 <- sets value of current instruction pointer to 2
    :$ = 3 <- sets page of current instruction pointer to 3

So you can do things like this:

    > $: = 10
    = 10
    > :$ = 2
    = 2
    > :$ *= $:
    = 20
    > :$
    = 20
    > $:
    = 10
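
The colon rules could be modelled along these lines in Python. The Label class and the helper names are
hypothetical, and rejecting a name with a colon at both ends is an assumption (that case isn't covered above).

```python
from dataclasses import dataclass


@dataclass
class Label:
    value: float = 0
    page: int = 0


def split_reference(text):
    """Return (name, field) for 'name', 'name:' (value) or ':name' (page)."""
    if text.startswith(":") and text.endswith(":"):
        raise ValueError(f"ambiguous label reference {text!r}")  # assumption
    if text.startswith(":"):
        return text[1:].lower(), "page"
    name = text[:-1] if text.endswith(":") else text
    return name.lower(), "value"  # label names are not case sensitive


def read(labels, reference):
    name, field = split_reference(reference)
    return getattr(labels[name], field)


def write(labels, reference, value):
    name, field = split_reference(reference)
    setattr(labels[name], field, value)
    return value


# Replaying the session above:
labels = {"$": Label()}
write(labels, "$:", 10)
write(labels, ":$", 2)
write(labels, ":$", read(labels, ":$") * read(labels, "$:"))  # :$ *= $:
```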
    
-> At which point it starts to look worryingly like $ := 10 and, well, 'nuff said. <-

If an assignment operator cannot write back to the argument to the left of it AND that argument is visibly not
a numeric constant AND the token name before the operator is a valid name, a new label is created with the
result of the assignment. Otherwise, an error is displayed.
 
    > x = y
    E Invalid constant/label name 'y'
    > y = 5
    = 5
    > x = y += y
    = 10
    > x
    = 10
 
Note that an expression made up purely of a single label name, like this:
 
some_token
 
would be viewed as this:
 
some_token = $
 
...and thus given the current value of the instruction pointer.
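
A minimal Python sketch of the write-back-or-create rule and the bare-name shorthand, assuming labels live in
a plain dictionary. The rule for what counts as a valid name is a guess (nothing above pins it down), the
"is it a number" check is deliberately crude, and duplicate-label detection is not modelled at all.

```python
import re

# Assumed naming rule: letters, digits and underscores, not starting with a digit.
VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def assign(labels, target, value):
    """Write value back to an existing label, or create a new one."""
    key = target.lower()  # label names are not case sensitive
    if key in labels:
        labels[key] = value  # write back to the existing label
    elif target[:1].isdigit():
        raise ValueError("Cannot assign to constants")
    elif VALID_NAME.match(target):
        labels[key] = value  # no such label yet: create it
    else:
        raise ValueError(f"Invalid constant/label name '{target}'")
    return value


def define_here(labels, name):
    """A bare name on its own is treated as `name = $`."""
    return assign(labels, name, labels["$"])


labels = {"$": 0x100}
assign(labels, "y", 5)
assign(labels, "x", assign(labels, "y", labels["y"] + labels["y"]))  # x = y += y
define_here(labels, "some_token")
```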

-> Some sort of readonly label "locking" would seem useful to stop fixed labels from being modified <- 


==============================================================================================================

COMMANDS

A line of source code is made up of commands. There are a number of different types of command. For example:

.align 256 \ Function: ld a, 10 /* Load 10 into the accumulator */
1111111111   222222222 33333333 4444444444444444444444444444444444

1: Directive command.
Directives are made up of a constant token starting with either a '.' or '#' character. There is NO difference
in behaviour; all directives can be invoked using '.' or '#' variants. Picking one or the other would break
existing code. The two variants can be mixed for clarity in code, if you so wish:

#if condition1
.if condition2
.else
.endif
#else
.if condition2
.else
.endif
#endif

Brass 1 used . for all directives, but offered # aliases for a few TASM ones for backwards compatibility. I
think the best action is just to allow both. Feedback on this would be welcome.

2: Expression command.
This is just an expression. As seen above, if the expression only consists of a label name then that label
is assigned the current value of the instruction pointer.
If no assignments were made by the expression in this context, an error is displayed (to stop duplicate
label names).


3: Assembler command.
The exact internal syntax depends on the currently loaded assembler plugin.

4: Comment command.
This just contains a comment.


To group expressions into commands and identify each command, the process is this:

1. Set all comment expressions as comment commands. These are easy.
2. Group all remaining expression groups by the {\} LineBreak punctuation character.
3. Detect whether it's a directive or an assembler command. For directives, is the first token a constant that
   starts with a '.' or '#'? If so, look up the directive from the current set of loaded plugins and see if
   there is a match. For assembler commands, pass them to the assembler plugin and see if it can make head or
   tail of them.
4. If we still don't know what it is, remove the first expression from the sequence - chances are it's a label.
   Try and evaluate that, and also try and work out whether what follows it is a directive or assembler command
   using the methods outlined in 3.
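
The identification steps might look something like this in Python. Everything concrete here is a stand-in:
expressions are plain strings, `directives` is a set of names standing in for the loaded directive plugins,
and `is_instruction` is a callable standing in for the assembler plugin. Comments are simply appended at the
end, and anything that still can't be identified after step 4 is tagged "unknown" (what should really happen
there isn't decided above).

```python
def is_comment(expr):
    return expr.startswith(";") or expr.startswith("/*")


def identify(expressions, directives, is_instruction):
    """Return a list of (kind, expressions) commands for one source line."""

    def known(exprs):
        # Step 3: directive or assembler command?
        head = exprs[0]
        if head[:1] in (".", "#") and head[1:].lower() in directives:
            return "directive"
        if is_instruction(exprs):
            return "assembler"
        return None

    # Step 1: comment expressions become comment commands.
    comments = [("comment", [e]) for e in expressions if is_comment(e)]

    # Step 2: group what is left by the LineBreak punctuation.
    groups, current = [], []
    for expr in (e for e in expressions if not is_comment(e)):
        if expr == "\\":
            groups.append(current)
            current = []
        else:
            current.append(expr)
    groups.append(current)

    commands = []
    for group in filter(None, groups):
        kind = known(group)
        if kind:
            commands.append((kind, group))
            continue
        # Step 4: the first expression is probably a label; retry on the rest.
        commands.append(("expression", group[:1]))
        rest = group[1:]
        if rest:
            commands.append((known(rest) or "unknown", rest))
    return commands + comments
```

Fed the expressions of the example line at the top of this section, with "align" as the only directive and
an `is_instruction` that accepts anything starting with ld, this gives the same four commands as marked there.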

... more to come ...