Tokeniser

Introduction

BBC BASIC programs are internally stored in a tokenised format. Certain components of a line of program code, such as keywords, are replaced with a single-byte token. For example, the keyword ENDPROC is represented by the single byte value &E1. This saves space (ENDPROC only takes up 1 byte instead of 7) and allows the interpreter to recognise keywords more efficiently.

Tokeniser Process

The tokeniser reads the input string from left to right and substitutes recognised keywords with their tokens (as listed in the Keyword Tokens section below).
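
As a rough illustration of the substitution step, the following C sketch scans a line from left to right and replaces a handful of keywords with the token values from the Keyword Tokens table. The function name tokenise_simple and the cut-down keyword table are assumptions made for this example. It deliberately ignores keyword flags, string literals, abbreviations and line numbers, all of which a real tokeniser has to handle.

```c
#include <stddef.h>
#include <string.h>

/* A small subset of the keyword table; token values are taken from the
   Keyword Tokens section. In this naive scheme a longer keyword that
   shares a prefix with a shorter one (ENDPROC / END) must come first. */
struct keyword { const char *name; unsigned char token; };

static const struct keyword keywords[] = {
    { "ENDPROC", 0xE1 },
    { "END",     0xE0 },
    { "PRINT",   0xF1 },
    { "GOTO",    0xE5 },
    { "REPEAT",  0xF5 },
};

/* Hypothetical helper: copies src to dst, replacing each recognised
   keyword with its single-byte token. Returns the number of bytes
   written. dst must be at least strlen(src) bytes long. */
size_t tokenise_simple(const char *src, unsigned char *dst)
{
    const size_t count = sizeof keywords / sizeof keywords[0];
    size_t out = 0;

    while (*src != '\0') {
        size_t i;
        for (i = 0; i < count; i++) {
            size_t len = strlen(keywords[i].name);
            if (strncmp(src, keywords[i].name, len) == 0) {
                dst[out++] = keywords[i].token;  /* keyword found: emit token */
                src += len;
                break;
            }
        }
        if (i == count) {
            /* No keyword starts here: copy the character unchanged. */
            dst[out++] = (unsigned char)*src++;
        }
    }
    return out;
}
```

For example, tokenise_simple("ENDPROC", buf) writes the single byte &E1, matching the example in the introduction.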

Keyword Flags

The tokeniser maintains a state which controls the way that it tokenises statements. Each keyword has a number of flags which can be used to control the state of the tokeniser.
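
The Keyword Tokens table writes each keyword's flags as an eight-character string such as -P----MC, with a letter where a flag is set and a dash where it is clear. One way to hold these in a program is as a bit mask. The sketch below is an assumption about representation only: the names parse_flags and FLAG_*, and the choice of bit 0 for the rightmost column, are made up for this example, and the code attaches no meaning to the individual letters.

```c
#include <stddef.h>

/* Bit assigned to each column of the flag string, rightmost column = bit 0.
   The letters are those used in the Keyword Tokens table. */
enum {
    FLAG_C = 1 << 0,
    FLAG_M = 1 << 1,
    FLAG_S = 1 << 2,
    FLAG_F = 1 << 3,
    FLAG_L = 1 << 4,
    FLAG_R = 1 << 5,
    FLAG_P = 1 << 6
};

/* Hypothetical helper: converts an 8-character flag string such as
   "-P----MC" into a bit mask. Returns -1 if the string is malformed. */
int parse_flags(const char *text)
{
    static const char letters[] = "-PRLFSMC";  /* expected letter per column */
    int mask = 0;
    size_t i;

    for (i = 0; i < 8; i++) {
        if (text[i] == '-') {
            continue;                  /* flag clear */
        }
        if (text[i] != letters[i]) {
            return -1;                 /* wrong letter, or string too short */
        }
        mask |= 1 << (7 - i);          /* flag set */
    }
    return text[8] == '\0' ? mask : -1;
}
```

With this layout, the string ---L-S-- listed for ELSE and THEN becomes FLAG_L | FLAG_S.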

Other Symbols

The following symbols also affect tokenising:

Line Numbers

Certain keywords are followed by a line number. These line numbers are tokenised to speed up decoding. However, each byte of the tokenised line number must be below &80 (so that it cannot be mistaken for another token) and above &20 (so that it can never be &0D, the program line terminator). These constraints make converting a line number into its tokenised form a relatively complex procedure.

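Each tokenised line number is four bytes in size, and is made up like this:

Byte Value
0 &8D (the line number token)
1 bits 6 and 7 of the LSB and MSB, combined as described below
2 bits 0 to 5 of the LSB, ORed with &40
3 bits 0 to 5 of the MSB, ORed with &40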

Byte 1 is more complicated and is formed as shown:

Byte 1 Bit Value
0 0
1 0
2 bit 6 of MSB (inverted)
3 bit 7 of MSB
4 bit 6 of LSB (inverted)
5 bit 7 of LSB
6 1
7 0

In C, the conversion may be carried out using the following code:

#include <stdint.h>

uint16_t line = 1234; // Line number to convert.
uint8_t byte0 = 0x8D; // Line number token.
uint8_t byte1 = (uint8_t)((((line & 0x00C0) >> 2) | ((line & 0xC000) >> 12)) ^ 0x54); // Bits 6 and 7 of LSB and MSB.
uint8_t byte2 = (uint8_t)((line & 0x3F) | 0x40); // Bits 0 to 5 of LSB.
uint8_t byte3 = (uint8_t)(((line >> 8) & 0x3F) | 0x40); // Bits 0 to 5 of MSB.

The process can be reversed to retrieve the original line number.
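
As a sketch of that reversal, the C functions below encode a line number using the same expressions as the code above and then decode it again. The names encode_line_number and decode_line_number are hypothetical, and the decoder assumes its input is a well-formed four-byte sequence beginning with &8D.

```c
#include <stdint.h>

/* Encode a line number into four bytes: the &8D token, the combined
   top bits, the low 6 bits of the LSB and the low 6 bits of the MSB. */
void encode_line_number(uint16_t line, uint8_t out[4])
{
    out[0] = 0x8D;
    out[1] = (uint8_t)((((line & 0x00C0) >> 2) | ((line & 0xC000) >> 12)) ^ 0x54);
    out[2] = (uint8_t)((line & 0x3F) | 0x40);
    out[3] = (uint8_t)(((line >> 8) & 0x3F) | 0x40);
}

/* Decode four bytes back into a line number. in[0] (the token) is not
   checked here; a caller would normally test it before decoding. */
uint16_t decode_line_number(const uint8_t in[4])
{
    /* Undo the XOR so that bits 2-3 hold MSB bits 6-7 and
       bits 4-5 hold LSB bits 6-7. */
    uint8_t top = (uint8_t)(in[1] ^ 0x54);
    uint16_t lsb = (uint16_t)((in[2] & 0x3F) | ((top << 2) & 0xC0));
    uint16_t msb = (uint16_t)((in[3] & 0x3F) | ((top << 4) & 0xC0));
    return (uint16_t)((msb << 8) | lsb);
}
```

With these definitions, line 1234 encodes to the bytes &8D &64 &52 &44, and decoding those bytes gives 1234 again.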

Keyword Tokens

Token Keyword Flags
80 AND --------
81 DIV --------
82 EOR --------
83 MOD --------
84 OR --------
85 ERROR -----S--
86 LINE --------
87 OFF --------
88 STEP --------
89 SPC --------
8A TAB( --------
8B ELSE ---L-S--
8C THEN ---L-S--
8D line no. --------
8E OPENIN --------
8F PTR -P----MC
90 PAGE -P----MC
91 TIME -P----MC
92 LOMEM -P----MC
93 HIMEM -P----MC
94 ABS --------
95 ACS --------
96 ADVAL --------
97 ASC --------
98 ASN --------
99 ATN --------
9A BGET -------C
9B COS --------
9C COUNT -------C
9D DEG --------
9E ERL -------C
9F ERR -------C
A0 EVAL --------
A1 EXP --------
A2 EXT -------C
A3 FALSE -------C
A4 FN ----F---
A5 GET --------
A6 INKEY --------
A7 INSTR --------
A8 INT --------
A9 LEN --------
AA LN --------
AB LOG --------
AC NOT --------
AD OPENUP --------
AE OPENOUT --------
AF PI -------C
B0 POINT( --------
B1 POS -------C
B2 RAD --------
B3 RND -------C
B4 SGN --------
B5 SIN --------
B6 SQR --------
B7 TAN --------
B8 TO --------
B9 TRUE -------C
BA USR --------
BB VAL --------
BC VPOS -------C
BD CHR$ --------
BE GET$ --------
BF INKEY$ --------
C0 LEFT$( --------
C1 MID$( --------
C2 RIGHT$( --------
C3 STR$ --------
C4 STRING$( --------
C5 EOF -------C
C6 AUTO ---L----
C7 DELETE ---L----
C8 LOAD ------M-
C9 LIST ---L----
CA NEW -------C
CB OLD -------C
CC RENUMBER ---L----
CD SAVE ------M-
CE PUT --------
CF PTR --------
D0 PAGE --------
D1 TIME --------
D2 LOMEM --------
D3 HIMEM --------
D4 SOUND ------M-
D5 BPUT ------MC
D6 CALL ------M-
D7 CHAIN ------M-
D8 CLEAR -------C
D9 CLOSE ------MC
DA CLG -------C
DB CLS -------C
DC DATA --R-----
DD DEF --------
DE DIM ------M-
DF DRAW ------M-
E0 END -------C
E1 ENDPROC -------C
E2 ENVELOPE ------M-
E3 FOR ------M-
E4 GOSUB ---L--M-
E5 GOTO ---L--M-
E6 GCOL ------M-
E7 IF ------M-
E8 INPUT ------M-
E9 LET -----S--
EA LOCAL ------M-
EB MODE ------M-
EC MOVE ------M-
ED NEXT ------M-
EE ON ------M-
EF VDU ------M-
F0 PLOT ------M-
F1 PRINT ------M-
F2 PROC ----F-M-
F3 READ ------M-
F4 REM --R-----
F5 REPEAT --------
F6 REPORT -------C
F7 RESTORE ---L--M-
F8 RETURN -------C
F9 RUN -------C
FA STOP -------C
FB COLOUR ------M-
FC TRACE ---L--M-
FD UNTIL ------M-
FE WIDTH ------M-
FF OSCLI ------M-
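
One pattern visible in the table is that the five keywords carrying the P flag (PTR, PAGE, TIME, LOMEM and HIMEM) each appear twice, at &8F to &93 and again at &CF to &D3, exactly &40 apart. The C sketch below converts between the two forms. The function names are hypothetical, and the reading of the two forms is an assumption not spelled out in this section: the lower token is taken to be the form used when the value is read, and the upper token the form used when the keyword begins a statement that assigns to it.

```c
/* Returns nonzero if token is one of the lower-form tokens for PTR, PAGE,
   TIME, LOMEM or HIMEM (&8F to &93), which carry the P flag in the table. */
int is_pseudo_variable(unsigned char token)
{
    return token >= 0x8F && token <= 0x93;
}

/* Maps &8F..&93 to &CF..&D3; any other token is returned unchanged. */
unsigned char to_statement_form(unsigned char token)
{
    return is_pseudo_variable(token) ? (unsigned char)(token + 0x40) : token;
}

/* Maps &CF..&D3 back to &8F..&93; any other token is returned unchanged. */
unsigned char to_function_form(unsigned char token)
{
    return (token >= 0xCF && token <= 0xD3) ? (unsigned char)(token - 0x40) : token;
}
```

For instance, to_statement_form(0x90) gives &D0, the second PAGE entry in the table.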