Replacing the parser code on the teletype

I’m planning on replacing the code that’s used on the teletype to parse user input.

The current version

The current function parse uses strtok, strcmp and strtol. An input is split into tokens using strtok, then if it starts with a digit (or -) it is parsed as a number, otherwise it’s strcmp'd one by one with each entry in to ops table until a match is found.

There are a few downsides to this:

  • strtol is very permissive with inputs (i.e. it will parse 123ADD as 123), on the plus side you can type hex and octal numbers in, e.g. ADD 0x100 0x10 will work on your teletype!

  • Using strcmp to identify ops is probably a bit inefficient. I’m not sure how much difference it really makes though, as parsing isn’t done very often.

  • strtok has hidden state! It’s definitely one of those “what were they thinking” kind of functions. Due to it’s simplicity it really limits what we can do with the language, the most obvious case being the : separator, which currently requires spaces around it.

The new version

I’ve spent the last few days playing with Ragel, I’ll be honest, my head hurts quite a bit now. Ragel can construct quite complex state machines in C (or other languages), but one of it’s uses is as a lexer (or scanner). Anyway, the plan is to use it in 2 ways, in fact pretty similarly to how the current code works.

  • The first will tokenise the input, much as strtok did, but with the ability to recognise different delimiters. This function will be very liberally in what it allows as a token. That token will be handed over to…

  • The second function will identify the token as either a number or an op, replacing the current use of strcmp. This function will be extremely conservative in what it allows to match, e.g. the entire token must be consumed to form a match.

Challenges

The biggest downside will be the additional complexity of building and contributing to the teletype code. Everyone will need to install Ragel on their computers (FYI, brew install ragel, pacman -Sy ragel).

Another complexity is informing Ragel what ops are available and how it returns information on what op it has found. Currently an op is referenced by it’s index in the ops table (stored as a 16bit int), the simplest solution would be for Ragel to return a pointer directly to the op struct, but that would require storing a 32bit pointer, doubling the size that a scene/script takes up in RAM/ROM. My current plan is to write a small C program that reads the op table a generates the Ragel file mapping op name to op index, e.g.

%%{
    "ADD" => { return 0; };
    "SUB" => { return 1; };
}%%

Another option is to keep using strcmp for the 2nd function.

Anyway, school run beckons…

9 Likes

ambitious and wonderful!

to address the challenges:

i don’t think this should present any considerable barriers to entry for those interesting in contributing code.

i haven’t done any calculations, but off the top of my head i don’t think doubling the script size would be an issue as we’re still talking about pretty small byte counts. of course another solution could be moving over to some sort of dynamic allocation of flash and ram, which is a much bigger undertaking.

1 Like

i started building a (simple) bees patching language in lax/yacc, not specifically intending to run it on the hardware (though it could do so, taking maybe too much code space.)

i’m curious if you have strong opinions about ragel vs lex/yacc. i’ll start looking at the ragel manual… would be nice to have something easier to use for that purpose.

[edit] on the one hand, ragel seems much better at producing minimal code. on the other, it’s not as mature. is it in widespread use?

2 Likes

Good point, I’ll do an actual calculation tomorrow, it’s not like it’s that hard to do.

AFAIK you need to jump through more hoops with (f)lex to get it to work on embedded systems, it needs a bunch of stuff from <stdin.h>, fread, stdin, etc. From my cursory examination of lex, you get more nuts and bolts, but it’s way more verbose.

Apparently Lemon is a common parser to use with it.

Heh, no idea. I’m so far outside of my coding comfort zone.

it’s true, but they aren’t that bad. you can write your i/o functions. this article is interesting. as far as i can tell the output only really depends on malloc and realloc. but for sure, the whole system is verbose as hell.

i am a total noob to this stuff, btw. been curious for years about lexers/parsers, finally motivated enough to try and educate myself. will look at lemon.

Here is the tokeniser / scanner code written in Ragel:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const size_t kMaxTokenLength = 256;

%%{
    machine scanner;

    seperator = ' ';

    mod_seperator = ' : ';

    token = any+ -- (seperator | mod_seperator);

    action print_token {
        char buf[kMaxTokenLength];
        size_t len = te-ts;
        if (len > kMaxTokenLength) len = kMaxTokenLength;
        memcpy(buf, ts, len);
        buf[len] = '\0';
        printf("token: '%s'\n", buf);
    }

    main := |*
        seperator;
        mod_seperator => { printf("mod sep\n"); };
        token => print_token;
    *|;
}%%

%% write data;

void scanner(const char* data, size_t len) {
    int cs;
    int act;
    const char* ts;
    const char* te;

    const char* p = data;
    const char* pe = data + len;
    const char* eof = pe;

    %% write init;

    %% write exec;

    if (cs == scanner_error) {
        printf("ERROR");
    }
}

int die() {
    printf("Need exactly 1 argument\n");
    return -1;
}

int main(int argc, const char **argv) {
    if (argc != 2) return die();

    const char* input = argv[1];
    scanner(input, strlen(input));
    return 0;
}

You can compile and run it with:

ragel -C scanner.rl
cc scanner.c -o scanner
./scanner "this is my: sample : input"

It definitely produces very small code, scanner.c using the default -T0 is 252 lines long, with -G2 it’s only 149 lines.

Useful links that I’ve found along the way…

(I’ve also made my own version of the Ragel manual without using Computer Modern as the font - I detest Computer Modern - shout out if anyone wants a copy)

4 Likes

Right. I’ve successfully managed to replace strtok with a custom Ragel scanner. The code is a bit yucky still, I’ll tidy it up before I push it to my parser branch. Basically I extracted out a match_token function from parse, then replaced parse with Ragel code, while keeping match_token.

One quick question, should we commit the Ragel generated .c files? Or put it in the .gitignore file? My preferences is to not commit it, Ragel appears to be available in all Unix package managers, as well as Homebrew and MSYS2/MinGW on Windows. I expect it’s also available under the Windows Subsystem for Linux.

Secondly, what to do about the : separator? Currently we put a space on either side of it (due to how strtok works). Options are:

  1. Keep it as it is, space on either side
  2. Only have a space after the :, gain an extra character per line
  3. No spaces surrounding the : but maybe render it dimmer on the LCD, gain 2 extra characters!

If we do change it, the old format will still be accepted for input, but any extra spaces will be removed. We’ll need to update print_command to output the command correctly.

I favour either option 2 or 3. One weirdness, currently fonts are rendered propertionataly on the teletype! : is 1 pixel wide, ; is 2.

agreed that ragel .c files probably don’t need commit.

i’m up for removing the spaces around the separator. of course people can always add space if they prefer, for readability.

one argument in favour of commiting the ragel generated .c files is that if somebody is not making changes in that area they wouldn’t have to install/run ragel locally. but if it’s a simple enough to set up the toolchain and it could be incorporated into make then yeah, it wouldn’t be necessary.

i wonder if ragel would also be useful for the code to deserialize presets stored as text on usb.

2 Likes

Whitespace isn’t saved… (excess whitespace is ignored on input).

typedef enum { NUMBER, MOD, SEP, OP } tele_word_t;

The amount of whitespace rendered is controlled by print_command. So if we remove all whitespace then that’s what everyone will see for saved commands.

my vote would be for not removing whitespace, it’s nice to have the option to format the code to make it more readable (but also being able to gain 2 extra spaces when needed…).

I have managed to integrate compiling the .rl file (and cleaning the generated code) into the Makefile. I suspect make will have a major grumble if ragel isn’t installed and on the path, it would come down to how a git clone manages file timestamps… On OSX/Linux it’s really easy to get Ragel, it’s Windows where it’s more complex (I assume…)

Probably… it’s a good idea! I know people have written JSON parsers with Ragel.

The current codebase ignores whitespace, e.g. if you type the following into a script

ADD<space><space>1<space>1

when you re-edit it will only have a single space:

ADD<space>1<space>1

The code is saved internally as data only, when it needs to be redisplayed the print_command (in command.c) is run to convert it back to a string. Remembering whitespace would come at the cost of a approximate doubling of the size a script takes up (probably not a bit deal). But print_command would get a lot more complicated as it would become possible to construct internal code representations that can’t be converted to valid string representations (can supply an example if you’re interested).

1 Like

you’d just have to store the string representation for each script as well, no? but not sure it’s worth it for the extra space needed.

1 Like

:blush:

I’ve got to do some calculations for the next bit relating to how large the stored representation of the code is. I’ll add saving the string representation into the mix.

1 Like

tbh not sure if it adds that much, the scripts are only 6 lines anyway… perhaps just add a pixel or 2 on each side of :? curious you say the font is proportional, never specifically noticed! so i guess the max line length is based on the widest character? in this case there should be some room to play…

Max line length is no. of characters not how many pixels used. I think the single pixel width for : is a bug really.

I can’t type any more : on that line.

The data adds up though, 32 scenes * 10 scripts * 6 lines * 31 chars = 59520 bytes

(I’m not 100% sure it’s 31 characters, ± 1)

edit: maybe it’s not a bug, the . is also 1px wide, anyway, as you say we could manually adjust the : glyph to be wider, or use an alternative glyph for rendering

1 Like

also note how N takes 4 pixels (‘M’ is probably five? don’t have a tt nearby right now…), so it’s definitely proportional. that’s why i think the max length right now is coupled to how wide the widest character is to adhere to the constraint of not having horizontal scrolling. which is a nice constraint to have from the perspective of having a complete script fit on one screen (and forcing you to think creatively…) but maybe scrolling wouldn’t be so bad.

The up arrow is a ^ (also it’s a single quote followed by a double quote just before the ?).

I think I’ll get my Ragel code working with the existing <space>:<space> rules, then try to mull over the other options.

i’m happy to consider modifying the font, by the way. i’d defer to any type designers, if they are here. i do have some modification suggestions in my e-mail inbox, by they relate to legibility of a few letter characters.

re: storing scripts as text, seems like overkill. perhaps better to correct the typeface.

Not sure if this is the right place to ask, but what’s the rationale for the 6 line limit? Is this a memory footprint issue?