Brass 2 source syntax - preliminary document

Stuff between -> <- are asides.

==============================================================================================================

Basic syntax:

==============================================================================================================

GENERAL

1. Label, module, or directive names are *not* case sensitive. They will never be, there will never be a
   switch for this. Case sensitivity only serves to confuse matters, in my opinion.

==============================================================================================================

TOKENS

At the atomic level, a line of source is made up of tokens. A token is represented by a series of
characters that the parser attaches some significance to. There are a variety of different sorts of token:

1. Constant token
   This is a text-based token that the assembler doesn't attach any specific importance to. This could be,
   for example, {LD}, {$7463} or {AF'} (braces for illustrative purposes).
   You can try and drag a numeric value from it. If it's clear that it's a numeric constant it'll return
   that value, if it's a label it'll look up the label. For the {=} assignment operator, if no label
   exists one will be created (see section on expressions).

2. String token
   Unlike the constant token, this is recognised as being a string constant. It is suitably escaped (same
   escape characters as TASM - currently no support for Unicode escape sequences but I will probably add
   them). They start and end with a matching " or ' symbol.
   Strings using the double quotes (") will be run through an extra step by which each character value
   can be remapped to another (user-defined).
   Examples: {'a'}, {'\n'}, {"AB\tCD"}

3. Operator token
   This is a mathematical operator. Once swallowed by the parser the string representation of the
   operator is ignored - the token is given a value from an enumeration.
   The currently supported operators are:
   {*}, {/}, {+}, {-}, {%}, {|}, {&}, {^}, {>}, {<}, {!}, {~}, {?}, {:}, {>>}, {<<}, {==}, {!=}, {>=},
   {<=}, {**}, {&&}, {||}, {=}, {+=}, {-=}, {*=}, {/=}, {%=}, {&=}, {|=}, {^=}, {<<=}, {>>=}, {++}, {--}
   Most of them should be recognisable from their C counterparts.
   {?} and {:} are ternary conditional operators -> current note: ternary operators not yet supported <-
   {**} is a power operator (2**8 returns 256).
   See the section below on expressions for more information about the mathematics involved.

4. Punctuation token
   These represent an item of punctuation. Recognised items of punctuation are:
   {,} (Comma), {[} (OpenBracket), {]} (CloseBracket), {(} (OpenParenthesis), {)} (CloseParenthesis),
   {\} (LineBreak)

5. Comment token
   An entire comment is wrapped up into a single token. C-style (/* */) and assembly-style (;) will both
   be supported. For example, {/* Comment */} or {; Comment}

All tokens know their index in the original string and their basic string representation.
Operator tokens also know their "parenthesis" index - how many opened parentheses are between them and
the start of the source line. For example:

     .  . .  .  .   .  .  .
    3+(2+4/(4-3)+((5*4)/2)-1)
     0  1 1  2  1   3  2  1   <- each operator token's parenthesis index.

This is used by the expression parser.

==============================================================================================================

EXPRESSIONS

Some tokens can be grouped together to form entire expressions. An expression of more than one token will
be a mathematical expression. An expression with one token might or might not be.

A single-token expression could be any token, so they are not especially interesting. If you had the
following sequence of tokens, though:

    ld a, (50 * -5) / $10 ; Load something into the accumulator

...you would end up with these expressions.

    1. Constant {ld}
    2. Constant {a}
    3. Punctuation {Comma}
    4. Punctuation {OpenParenthesis}
       Constant {50}
       Operator {Multiplication}
       Operator {UnarySubtraction}
       Constant {5}
       Punctuation {CloseParenthesis}
       Operator {Division}
       Constant {$10}
    5. Comment {; Load something into the accumulator}

4 is the one of interest, being made up of multiple tokens.

Expressions can be evaluated. The process for this is rather involved:

1. Cycle through the sequence of tokens. Create a list of all operator tokens, and strip out punctuation.

2. Sort the revised list of operators (OperatorToken : IComparable).
   The sort takes into consideration these factors, in order:
   - Are the parenthesis indices different? If so, compare them and return the result.
   - Are the operators different? If so, compare them and return the result.
   - Compare the token's index within the original string and return the result.
     For unary operators, the further right it is the higher precedence it is.
     For binary operators, the further left it is the higher precedence.

         -~1   <- the ~ has precedence over the -
         1+2-3 <- the + has precedence over the -

   See table below for operator order of precedence.

3. Go through the sorted list of operators.

   If it's binary:

       {4} {+} {5}
        |___ ___|
            |______ Grab the outer tokens and perform the operation on them.

       {4} {9} {5}
            |______ Replace the operator with the result

        .  {9}  .
        |_______|__ Delete the two constants.

   If it's unary:

       {-} {1}
            |___ Grab the rightmost token and perform the operation on it.

        .  {-1}
            |__ Replace the token with the result and delete the operator.

4. Return the only remaining token's value.

Order of precedence:

   +-----------------+-----------------------------------+-------------------+
-> | Unary           | + - ! ~ ++ --                     | Higher precedence |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +         |         |
   | Power           | **                                |         |         |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +         |         |
   | Multiplicative  | * / %                             |         |         |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +         |         |
   | Additive        | + -                               |         |         |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +         |         |
   | Shift           | << >>                             |         |         |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +         |         |
   | Relational      | < > <= >=                         |         |         |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +         |         |
   | Equality        | == !=                             |         |         |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +         |         |
   | Logical         | &                                 |         |         |
   |                 | ^                                 |         |         |
   |                 | |                                 |         |         |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +         |         |
   | Conditional     | &&                                |         |         |
   |                 | ||                                |         |         |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +         |         |
-> | Assignment      | = += -= *= /= %= &= |= ^= <<= >>= | Lower precedence  |
   +-----------------+-----------------------------------+-------------------+

Note that both unary and assignment operators are regarded as being right-associative; that is, they are
evaluated from right-to-left as opposed to from left-to-right when compared to operators of the same
precedence and at the same parenthesis level.

Note that ++ and -- are both unary operators. That is to say, ++x works, but x++ doesn't.

For a constant token to be able to be evaluated it needs to contain either:
- a constant numeric value, or;
- a label's name.
The parser will try to evaluate it as a constant token before trying to look it up as a label.

Numeric constants can have a prefix xor a suffix to denote a base. You may not have both.

    +------+--------+--------+
    | Base | Prefix | Suffix |
    +------+--------+--------+
    |    2 | %      | b      |
    |    8 | @      | o      |
    |   10 |        | d      |
    |   16 | $      | h      |
    +------+--------+--------+

All suffixes can be in lower or upper case. As base 10 has no prefix, a number without prefix or suffix is
assumed to be a decimal constant.

The assignment operators will try and write back the result of the operation to the argument to the left
of the operator.
For example (running Brass 2 in interactive mode):

    > $ = 10
    = 10
    > $ *= 2
    = 20
    > $ *= 2
    = 40
    > $ *= 2
    = 80
    > $ += $
    = 160
    > $ = 1.1
    = 1.1
    > $ *= $
    = 1.21
    > $ *= $
    = 1.4641
    > $ *= $
    = 2.14358881
    > $ *= $
    = 4.59497298635722
    > 4 = 3
    E Cannot assign to constants

(Here $ represents the label that is used to store the current instruction pointer).

Unlike Brass 1, where '=' was treated as a 'magic' directive (aliased to .equ), the new Brass 2 expression
parser handles assignment operators natively.
-> Shouldn't be a problem: Brass 1 didn't accept = as an operator at all as far as I can tell, so
   potential pitfalls with code such as ".if x=1" shouldn't crop up. <-

As far as label names are concerned - if you put a colon on the END of the name, it means you are
referring to the label's internal value. If you put a colon at the BEGINNING of the name, it is assumed
that you are referring to the label's page number. If you omit the colon completely it is assumed that you
are referring to the value.

Hence:

    $ = 1    ~ sets value of current instruction pointer to 1
    $: = 2   ~ sets value of current instruction pointer to 2
    :$ = 3   ~ sets page of current instruction pointer to 3

So you can do things like this:

    > $: = 10
    = 10
    > :$ = 2
    = 2
    > :$ *= $:
    = 20
    > :$
    = 20
    > $:
    = 10

-> At which point it starts to look worryingly like $ := 10 and, well, 'nuff said. <-

If an assignment operator cannot write back to the argument to the left of it AND that argument is visibly
not a numeric constant AND the token name before the operator is a valid name, a new label is created with
the result of the assignment. Otherwise, an error is displayed.

    > x = y
    E Invalid constant/label name 'y'
    > y = 5
    = 5
    > x = y += y
    = 10
    > x
    = 10

Note that an expression made up purely of a single token name, like this:

    some_token

would be viewed as this:

    some_token = $

...and thus given the latest version of the instruction counter.
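The sort-then-reduce evaluation described in this section can be sketched in a few lines of Python. This is only an illustration of the idea, not the Brass 2 implementation: the names (Op, evaluate, UNARY, BINARY) are made up for the example, only decimal integer constants and a handful of operators are handled, labels and assignment are left out, and {/} is assumed to be floating-point division.

```python
import operator
import re

# Precedence levels follow the table in this section (0 binds tightest).
# Assumption: only a subset of operators, and "/" is floating-point division.
UNARY = {"-": operator.neg, "+": operator.pos, "~": operator.invert,
         "!": lambda value: int(not value)}
BINARY = {"**": (1, operator.pow),
          "*": (2, operator.mul), "/": (2, operator.truediv), "%": (2, operator.mod),
          "+": (3, operator.add), "-": (3, operator.sub)}
TOKEN = re.compile(r"\s*(\d+|\*\*|[-+*/%~!()])")


class Op:
    """Hypothetical operator token: remembers parenthesis index and position."""

    def __init__(self, symbol, unary, depth, index):
        self.symbol, self.unary, self.depth, self.index = symbol, unary, depth, index

    def sort_key(self):
        level = 0 if self.unary else BINARY[self.symbol][0]
        # Unary operators: rightmost first. Binary operators: leftmost first.
        order = -self.index if self.unary else self.index
        # Deeper parenthesis index first, then precedence level, then position.
        return (-self.depth, level, order)


def evaluate(source):
    source = source.rstrip()
    items, ops = [], []
    depth, pos, after_operand = 0, 0, False
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected input at index {pos}")
        text, pos = match.group(1), match.end()
        if text == "(":            # punctuation only changes the parenthesis index
            depth, after_operand = depth + 1, False
        elif text == ")":
            depth, after_operand = depth - 1, True
        elif text.isdigit():
            items.append(int(text))
            after_operand = True
        else:
            unary = not after_operand   # an operator with no operand before it
            if text not in (UNARY if unary else BINARY):
                raise ValueError(f"unsupported operator {text!r}")
            op = Op(text, unary, depth, match.start(1))
            items.append(op)
            ops.append(op)
            after_operand = False
    for op in sorted(ops, key=Op.sort_key):
        i = items.index(op)
        if op.unary:
            # Replace the operator and the token to its right with the result.
            items[i:i + 2] = [UNARY[op.symbol](items[i + 1])]
        else:
            # Replace the operator and both neighbours with the result.
            items[i - 1:i + 2] = [BINARY[op.symbol][1](items[i - 1], items[i + 1])]
    if len(items) != 1:
        raise ValueError("malformed expression")
    return items[0]
```

Calling evaluate("-~1") applies the ~ before the -, and evaluate("1+2-3") applies the + before the -, matching the ordering rules given for the sort. Note that, following the precedence table, a unary minus binds tighter than {**} in this sketch.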
-> Some sort of readonly label "locking" would seem useful to stop fixed labels from being modified <-

==============================================================================================================

COMMANDS

A line of source code is made up of commands. There are a number of different types of command.
For example:

    .align 256 \ Function: ld a, 10 /* Load 10 into the accumulator */
    1111111111   222222222 33333333 4444444444444444444444444444444444

1: Directive command.
   Directives are made up of a constant token starting with either a '.' or '#' character. There is NO
   difference in behaviour; all directives can be invoked using '.' or '#' variants. Picking one or the
   other would break existing code. Can be used for clarity in code, if you so wish:

       #if condition1
           .if condition2
           .else
           .endif
       #else
           .if condition2
           .else
           .endif
       #endif

   Brass 1 used . for all directives, but offered # aliases for a few TASM ones for backwards
   compatibility. I think the best action is just to allow both. Comments would be useful.

2: Expression command.
   This is just an expression. As seen above, if the expression only consists of a label name then it will
   be assigned with the current value of the instruction counter. If no assignments were made by the
   expression in this context, an error is displayed (to stop duplicate label names).

3: Assembler command.
   The exact internal syntax depends on the currently loaded assembler plugin.

4: Comment command.
   This just contains a comment.

To identify and group commands from expressions, the process is this:

1. Set all comment expressions as comment commands. These are easy.

2. Group all remaining expression groups by the {\} LineBreak punctuation character.

3. Detect whether it's a directive or an assembler command.
   For directives, is the first token a constant and does it start with a '.' or '#'? If so, look up the
   directive from the current set of loaded plugins and see if there is a match.
   For assembler commands, pass them to the assembler plugin and see if it can make head or tail of them.

4. If we still don't know what it is, remove the first expression from the sequence - chances are it's a
   label. Try and evaluate that, and also try and work out whether what follows it is a directive or
   assembler command using the methods outlined in 3.

... more to come ...
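As a rough illustration of the four identification steps in the COMMANDS section, here is a small Python sketch. Everything in it is an assumption made for the example: KNOWN_DIRECTIVES and KNOWN_MNEMONICS stand in for whatever the loaded plugins would report, it splits on whitespace rather than on real tokens, it ignores string constants (so a ';' or '\' inside a string would confuse it), and comments are simply collected at the end of the result.

```python
import re

# Hypothetical stand-ins for what the directive and assembler plugins know.
KNOWN_DIRECTIVES = {"align", "if", "else", "endif"}
KNOWN_MNEMONICS = {"ld", "ret", "nop"}

# C-style /* ... */ comments, or ';' to the end of the line.
COMMENT = re.compile(r"/\*.*?\*/|;.*$")


def classify(text):
    """Return 'directive', 'assembler', or None if neither lookup matches."""
    first = text.split()[0].lower()      # names are not case sensitive
    if first[0] in ".#" and first[1:] in KNOWN_DIRECTIVES:
        return "directive"
    if first in KNOWN_MNEMONICS:
        return "assembler"
    return None


def split_commands(line):
    """Split one source line into a list of (kind, text) commands."""
    commands = []
    # Step 1: pull the comments out; they become comment commands.
    comments = COMMENT.findall(line)
    code = COMMENT.sub("", line)
    # Step 2: group what is left by the \ LineBreak character.
    for group in code.split("\\"):
        group = group.strip()
        if not group:
            continue
        # Step 3: directive or assembler command?
        kind = classify(group)
        if kind is not None:
            commands.append((kind, group))
            continue
        # Step 4: peel off the first item (probably a label) and retry.
        parts = group.split(None, 1)
        rest_kind = classify(parts[1]) if len(parts) == 2 else None
        if rest_kind is None:
            commands.append(("expression", group))
        else:
            commands.append(("expression", parts[0]))
            commands.append((rest_kind, parts[1]))
    commands.extend(("comment", text) for text in comments)
    return commands
```

Run on the example line from the COMMANDS section, split_commands yields a directive, an expression, an assembler command and a comment, in that order. A real implementation would work on the token and expression lists described earlier rather than on raw text.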