My gut reaction was that it is in the USB code, but it’s not based on any hard evidence. Just some odd interactions between pressing keys and the “event queue full” message displayed over serial.

Yeah we should. I just haven’t got the time at the moment.

I’ll cut a new Teletype 2.0 beta tomorrow or Wednesday with the change in it, in the hopes that more people test the code. (So that I can be sure it really does fix the issues we’ve seen.)

2 Likes

It sounds very reasonable to me - especially if it keeps the unit from getting totally borked and requiring a reboot. It is clearly a “test it and see” kind of change at this point given your early validation.

I’ll make it a point to ensure that those using expanders are totally clear on the fact that three or more devices requires a powered bus board in order to ensure stability as well. Not related - but can be confused with other problems due to the similarity of symptoms.

You’ve done some damn fine work on this stuff, @sam.

trying to wrap my head around it a bit. i searched for all ISR registrations in TT/asf code and only see the following:

SYS_IRQ_PRIORITY 1 - not used anywhere
APP_TC_IRQ_PRIORITY 2 - timer interrupts
UI_IRQ_PRIORITY 3 - trigger and front panel button interrupts
UDD_USB_INT_LEVEL 0 - USB
UHD_USB_INT_LEVEL 0 - USB
CONF_TWI_IRQ_LEVEL 3 - i2c

hitting trigger inputs at high rate would mean that trigger interrupts are disabled most of the time, which would result in screen not getting refreshed / keyboard less responsive since those are driven by timers (and even then they would probably fail to add an event since the queue is likely full from trigger events). but i don’t understand how changing to SYS_IRQ_PRIORITY results in less crashing… perhaps it’s simply because timer interrupts are not disabled anymore?

how does masking work exactly? does it mask the specified group only or everything above / or below as well? also, what happens if a function disables an interrupt level but while it’s executed another ISR is called which also disables/enables the same level? does that mean that the 2nd ISR would re-enable it while the first ISR still expects it to be disabled?

edit: found some info, if i’m reading correctly 3 is the highest priority and masking does disable the specified level and anything below. so it looks like the change simply stops timer interrupts being disabled when events are added/retrieved… which is strange as i would expect it to be less stable if an event was added while it was already in event_post, and not disabling timers would make this more likely to happen.

3 Likes

Awesome @scanner_darkly

AVR32101: Configuring the AVR32 Interrupt Controller

According to that PDF, interrupts from highest priority to lowest are (page 3):

  • NMI
  • 3
  • 2
  • 1
  • 0

I’ll admit, going by the names in conf_tc_irq.h (below), I’d assumed that SYS_IRQ_PRIORITY had higher priority than UI_IRQ_PRIORITY. My mistake?

#define SYS_IRQ_PRIORITY       1
#define APP_TC_IRQ_PRIORITY    2
#define UI_IRQ_PRIORITY        3

That seems too high then?

Also, reading that document, I’m not entirely sure that masking a given interrupt will disable lower priority ones. Only that it disables the given interrupt.

Also, what exactly is the deal with interrupt groups?

I’m going to do some more tests today.

tl;dr: see the next post

Right while you’re all asleep, I’m going to start systematically working through this…

Test will be an output from Dr. Octature (i.e. self oscillating filter sine wave) into trigger 1.

The contents of script 1 will vary.

I’ll test for a crash by removing the trigger, replugging the keyboard, and seeing if the front panel button works.


09:51am

Clean checkout of b1817b6.

With an empty script 1, crash after a few minutes.

With TR.P 1 in script 1, either a crash or stuck trigger out.


10:11am

Clean checkout of b1817b6. Plus change to init_teletype.c as follows (make the trigger interrupts use APP_TC_IRQ_PRIORITY):

   gpio_enable_pin_interrupt( A07,      GPIO_RISING_EDGE);

   // PA00-A07
-  INTC_register_interrupt( &irq_port0_line0, AVR32_GPIO_IRQ_0, UI_IRQ_PRIORITY);
+  INTC_register_interrupt( &irq_port0_line0, AVR32_GPIO_IRQ_0, APP_TC_IRQ_PRIORITY);

   // PA08 - PA15
-  INTC_register_interrupt( &irq_port0_line1, AVR32_GPIO_IRQ_0 + (AVR32_PIN_PA08 / 8), UI_IRQ_PRIORITY);
+  INTC_register_interrupt( &irq_port0_line1, AVR32_GPIO_IRQ_0 + (AVR32_PIN_PA08 / 8), APP_TC_IRQ_PRIORITY);

   // PB08 - PB15
   // INTC_register_interrupt( &irq_port1_line1, AVR32_GPIO_IRQ_0 + (AVR32_PIN_PB08 / 8), UI_IRQ_PRIORITY);

With TR.P 1 in script 1, no crashes until you do something with the keyboard. And even then typing text works fine, but up or down arrow, or tab crashes immediately. Is it the keyboard or is it the screen drawing? Is it possible that we’re deadlocking on APP_TC_IRQ_PRIORITY somehow?


10:17am

Retesting clean checkout of b1817b6.

Can use the keyboard without causing an immediate crash.


10:27am

Clean b18117b6 with the following change (re-enable GPIO interrupt after event is posted).

diff --git a/src/init_teletype.c b/src/init_teletype.c
index 1dca206..2f762c3 100644
--- a/src/init_teletype.c
+++ b/src/init_teletype.c
@@ -77,10 +77,10 @@ __attribute__((__interrupt__))
 static void irq_port0_line0(void) {
   for(int i=0;i<8;i++) {
     if(gpio_get_pin_interrupt_flag(i)) {
-      gpio_clear_pin_interrupt_flag(i);
       // print_dbg("\r\n # A00");
       event_t e = { .type = kEventTrigger, .data = i };
       event_post(&e);
+      gpio_clear_pin_interrupt_flag(i);
     }
   }
 }
@@ -89,10 +89,10 @@ static void irq_port0_line0(void) {
 __attribute__((__interrupt__))
 static void irq_port0_line1(void) {
     if(gpio_get_pin_interrupt_flag(NMI)) {
-      gpio_clear_pin_interrupt_flag(NMI);
       // print_dbg("\r\n ### NMI ### ");
       event_t e = { .type = kEventFront, .data = gpio_get_pin_value(NMI) };
       event_post(&e);
+      gpio_clear_pin_interrupt_flag(NMI);
     }
 }

Wow, makes a big difference to responsiveness. Some keyboard input get’s through. Sometimes the screen redraws don’t always complete, is that code re-entrant safe?

With TR.P 1 in script 1, trigger output still gets stuck. Haven’t seen any crashes yet though. Regardless I think this change is worth keeping.


10:37am

As 10.27am, but also commenting out all the IRQ masking in events.c

diff --git a/src/events.c b/src/events.c
index 34bff5a..1b2c241 100644
--- a/src/events.c
+++ b/src/events.c
@@ -52,7 +52,7 @@
 // Returns non-zero if an event was available
 u8 event_next( event_t *e ) {
   u8 status;
-  cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
+  //cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
   
   // if pointers are equal, the queue is empty... don't allow idx's to wrap!
   if ( getIdx != putIdx ) {
@@ -66,7 +66,7 @@ u8 event_next( event_t *e ) {
     status = false;
   }
 
-  cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
+  //cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
   return status;
 }
 
@@ -79,7 +79,7 @@ u8 event_post( event_t *e ) {
    // print_dbg("\r\n posting event, type: ");
    // print_dbg_ulong(e->type);
 
-  cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
+  //cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
   
   // increment write idx, posbily wrapping
   saveIndex = putIdx;
@@ -93,7 +93,7 @@ u8 event_post( event_t *e ) {
     putIdx = saveIndex;
   } 
 
-  cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
+  //cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
 
   if (!status)
     print_dbg("\r\n event queue full!");

No crashes. Feels dirty. System responsiveness is impressive though.


10:44am

Combination of 10.27am and 10.11am, but not 10.37am.

e.g. trigger inputs at APP_TC_IRQ_PRIORITY and re-enabling GPIO interrupt after event is posted.

Instant lockup as soon as trigger occurs.


11:10am

As 10.27am, but with aggressive interrupt masking:

diff --git a/src/events.c b/src/events.c
index 34bff5a..3901f12 100644
--- a/src/events.c
+++ b/src/events.c
@@ -52,8 +52,9 @@
 // Returns non-zero if an event was available
 u8 event_next( event_t *e ) {
   u8 status;
-  cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
-  
+
+  irqflags_t flags = cpu_irq_save();
+
   // if pointers are equal, the queue is empty... don't allow idx's to wrap!
   if ( getIdx != putIdx ) {
     INCR_EVENT_INDEX( getIdx );
@@ -66,7 +67,8 @@ u8 event_next( event_t *e ) {
     status = false;
   }
 
-  cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
+  cpu_irq_restore(flags);
+
   return status;
 }
 
@@ -79,8 +81,8 @@ u8 event_post( event_t *e ) {
    // print_dbg("\r\n posting event, type: ");
    // print_dbg_ulong(e->type);
 
-  cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
-  
+  irqflags_t flags = cpu_irq_save();
+
   // increment write idx, posbily wrapping
   saveIndex = putIdx;
   INCR_EVENT_INDEX( putIdx );
@@ -93,7 +95,7 @@ u8 event_post( event_t *e ) {
     putIdx = saveIndex;
   } 
 
-  cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
+  cpu_irq_restore(flags);
 
   if (!status)
     print_dbg("\r\n event queue full!");

Bomb proof again.

cpu_irq_save implicitly disables all interrupts.

FYI @zebra cpu_irq_save is used in a few places in the ASF, e.g. in sysclk, osc, twi, and usbb_host as well as pdca.

11:15am

Have tested the 11:10am code with also changing the GPIO interrupts to APP_TC_IRQ_PRIORITY (see 10.11am).

Everything is still rock solid (c.f. 10.44am)

Conclusions coming up…

11:40am

Once more for luck, so 11:10am code, plus less aggressive IRQ masking.

But saving and restoring mask status

diff --git a/src/events.c b/src/events.c
index 34bff5a..b7777cb 100644
--- a/src/events.c
+++ b/src/events.c
@@ -52,8 +52,13 @@
 // Returns non-zero if an event was available
 u8 event_next( event_t *e ) {
   u8 status;
-  cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
-  
+
+  bool app_tc = cpu_irq_level_is_enabled(APP_TC_IRQ_PRIORITY);
+  bool ui = cpu_irq_level_is_enabled(UI_IRQ_PRIORITY);
+
+  if (app_tc) cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
+  if (ui) cpu_irq_disable_level(UI_IRQ_PRIORITY);
+
   // if pointers are equal, the queue is empty... don't allow idx's to wrap!
   if ( getIdx != putIdx ) {
     INCR_EVENT_INDEX( getIdx );
@@ -66,7 +71,9 @@ u8 event_next( event_t *e ) {
     status = false;
   }
 
-  cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
+  if (app_tc) cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
+  if (ui) cpu_irq_enable_level(UI_IRQ_PRIORITY);
+
   return status;
 }
 
@@ -79,8 +86,12 @@ u8 event_post( event_t *e ) {
    // print_dbg("\r\n posting event, type: ");
    // print_dbg_ulong(e->type);
 
-  cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
-  
+  bool app_tc = cpu_irq_level_is_enabled(APP_TC_IRQ_PRIORITY);
+  bool ui = cpu_irq_level_is_enabled(UI_IRQ_PRIORITY);
+
+  if (app_tc) cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
+  if (ui) cpu_irq_disable_level(UI_IRQ_PRIORITY);
+
   // increment write idx, posbily wrapping
   saveIndex = putIdx;
   INCR_EVENT_INDEX( putIdx );
@@ -93,7 +104,8 @@ u8 event_post( event_t *e ) {
     putIdx = saveIndex;
   } 
 
-  cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
+  if (app_tc) cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
+  if (ui) cpu_irq_enable_level(UI_IRQ_PRIORITY);
 
   if (!status)
     print_dbg("\r\n event queue full!");

Rock solid

3 Likes

Okay here is what I’ve figured out.

When you mask an interrupt level, it only masks that level, not lower priority ones too.

(unless someone can point to a document where it contradicts that, proving negative statements is hard…)

Conclusion, either mask all applicable levels, or all interrupts.

Interrupt masks are stored in the ‘status register’ (SR) and are not shadowed/saved in an interrupt

So if you use cpu_irq_disable_level then you should save the existing value to restore it when you cpu_irq_enable_level.


Proposal 1

Make the following change in init_teletype.c:

diff --git a/src/init_teletype.c b/src/init_teletype.c
index 1dca206..2f762c3 100644
--- a/src/init_teletype.c
+++ b/src/init_teletype.c
@@ -77,10 +77,10 @@ __attribute__((__interrupt__))
 static void irq_port0_line0(void) {
   for(int i=0;i<8;i++) {
     if(gpio_get_pin_interrupt_flag(i)) {
-      gpio_clear_pin_interrupt_flag(i);
       // print_dbg("\r\n # A00");
       event_t e = { .type = kEventTrigger, .data = i };
       event_post(&e);
+      gpio_clear_pin_interrupt_flag(i);
     }
   }
 }
@@ -89,10 +89,10 @@ static void irq_port0_line0(void) {
 __attribute__((__interrupt__))
 static void irq_port0_line1(void) {
     if(gpio_get_pin_interrupt_flag(NMI)) {
-      gpio_clear_pin_interrupt_flag(NMI);
       // print_dbg("\r\n ### NMI ### ");
       event_t e = { .type = kEventFront, .data = gpio_get_pin_value(NMI) };
       event_post(&e);
+      gpio_clear_pin_interrupt_flag(NMI);
     }
 }

This isn’t specific to the issue we’re seeing but I think it makes sense, I’d also make it on all the other devices. In effect it also acts as a slight rate limiter on triggers too.

Proposal 2

Mask and restore IRQ levels in events.c. Either by masking/restoring both APP_TC_IRQ_PRIORITY & UI_IRQ_PRIORITY, or by masking/restoring all interrupts.

I’m in favour of masking all interrupts. It’s done it plenty of other places in the ASF (git grep cpu_irq_save), including sysclk, osc, usbb, twi, pdca. I can conceive of scenarios where only masking the 2 levels will still result in issues unless we have constant vigilance against them. So unless someone wants to step up for that responsibility I’d rather take the more extreme approach and make my life easier.

@zebra if you think this is going to cause Aleph big problems, let me know and we can see what we can work out.

I’m going to cut a new release of the Teletype 2.0 beta with those changes in it, just to get it in peoples’ hands for testing. I’ve only got till Saturday before I need to move my Teletype back into it’s case, and then testing becomes really hard.

2 Likes

The following commit screws up Ansible:

Basically a TR.PULSE 5, locks the trigger high, and it never comes back down. Though a TR 5 0 will bring it back to 0.

So close… anyone know the Ansible code well enough to tell me why?

edit:

Right, there are a bunch of

cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);

calls through out the Ansible source code, and also in libavr32.

If my above theories are correct, then that will not be compatible with my flag saving way of doing to same thing.

I’ve commented out all of those calls, and that’s made the trigger pulses work again. That’s obviously very crude. I will try writing a more correct replacement later on today. But I’m would really appreciate some help with testing the Ansible stuff, otherwise I’m fighting on too many fronts.

no, it should be ok.

the whole reason we have separate levels is just to be able to explicitly mask some but not others at certain times, like when we want to stop UI and timer processing while saving to sdcard. the event q crticial section isn’t actually one of those times (though it is important to do something to make sure BEES patches can’t accidentally halt sdcard transfers in the middle of an operation… not your problem)

maybe a clarification about priority levels: as you say, they aren’t nested and masking one doesn’t affect the others. the difference between level 0 and level 3 is that the latter has more shadowed registers, so context changes are faster. thus, higher levels are for interrupts requiring lower latency. SYS_IRQ_LEVEL doesn’t need lowest latency. it’s arguable that UI and APP should be switched.

Basically a TR.PULSE 5, locks the trigger high, and it never comes back down. Though a TR 5 0 will bring it back to 0.

i am not familiar with the ansible code, will have a look though…

1 Like

I’ve just edited my post. I think I know the cause. Would still love any help / guidance you can give.

so, while i don’t like that this isn’t wrapped in a function, there is a performance consideration: we don’t actually want to halt UI interrupts any time any code anywhere manipulates the timer queue (say.) it would mess with encoder responsiveness pretty severely.

i only have aleph and TT. i guess maybe aleph requirements are too divergent from the modules that don’t have high-density encoders.

1 Like

I’m not planning on blocking all interrupts. It’s more that calling cpu_irq_disable_level(APP_TC_IRQ_PRIORITY) doesn’t save the current state of the interrupt. I think that’s the issue that’s been causing problems on the Teletype.

I’ve just done the school run, and that’s given me a bit of time to think.

I’ve previously added the following functions:

void timers_resume( void ) {
    cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
}

void timers_pause( void ) {
    cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
}

What I’m going to do is replace those with:

bool timers_pause() {
    bool state = cpu_irq_level_is_enabled(APP_TC_IRQ_PRIORITY);
    cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
    return state;
}

void timers_restore(bool state) {
    if (state)
        cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
    else
        cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
}

then I can replace all the

cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);

with:

tc_state_t state = timers_pause();

....

timers_restore(state);
2 Likes

will comment in a bit, for now a quick question - i’m trying to clone your TT fork and when i switch to 2.0 branch and do git submodule update i get the following error:

Fetched in submodule path 'libavr32', but it did not contain c621c1518f4d48c135105d3105a98ec93b6e2084. Direct fetching of that commit failed.```

i tried doing a manual fetch/merge on libavr32 but doesn't seem to help (says already up-to-date). i can see the commit on github but looks like it's on its own tree - not sure how to pull that in?
cd libavr32
git remote add samdoshi https://github.com/samdoshi/libavr32.git   # or use ssh url if you'd rather
git fetch samdoshi
cd ..
git checkout 2.0
1 Like

perfect, that worked! i was confused by the fact that if i browse your tt fork on github (switched to 2.0 branch) it links to monome\libavr32, not samdoshi/libavr32.

Yeah, that’s because .gitmodules links to monome/libavr32 (which it should). Once the changes have been merged back the problem will go away.

ok, i see.

so, you think this is a case where some TT code calls timers_pause() (or equivalent) twice without a corresponding timers_resume()?

in aleph someone fixed this with a nesting counter:
(https://github.com/catfact/aleph/blob/master/avr32_lib/src/app.c#L21)

anyways, great! i hope that is actually the reason.

1 Like

Yep, same problem different solution.

In fact even though it has the global variable, that might be better way to do it, as my solution will still have issues with nested interrupts unless we’re careful.

edit: @ngwese you made the commit, care to comment?

I’m pretty confident it is, it’s definitely a bug even if that’s not all of it. You can easily see how a trigger input (at UI_IRQ_PRIORITY) can re-enable the APP_TC_IRQ_PRIORITY IRQ before it should be.

Just need to get Ansible fixed for the new order. Unfortunately I don’t really know the first thing about it’s source code…

1 Like

great looking progress so far.

let me look through the code and figure out when/why the IRQ enabling/disabling is happening.

From what I can tell so far.

I’ve added the following code to timers.c in libavr32:

static volatile s8 timer_pause_resume_nesting = 0;

void timers_pause() {
    print_dbg("timer pause\r\n");
    if(timer_pause_resume_nesting == 0) {
        cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
        print_dbg("timer paused\r\n");
    }
    timer_pause_resume_nesting++;
    print_dbg("timer depth:");
    print_dbg_ulong(timer_pause_resume_nesting);
    print_dbg("\r\n");

}

void timers_restore() {
    print_dbg("timer resume\r\n");
    timer_pause_resume_nesting--;
    print_dbg("timer depth:");
    print_dbg_ulong(timer_pause_resume_nesting);
    print_dbg("\r\n");
    if(timer_pause_resume_nesting == 0) {
        cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
        print_dbg("timer resumed\r\n");
    }
}

And replaced all calls to cpu_irq_disable_level(APP_TC_IRQ_PRIORITY) with timers_pause()

And also calls to cpu_irq_enable_level(APP_TC_IRQ_PRIORITY) with timers_restore().

Serial debug on boot as follows:

// ansible ////////////////////////////////
== FLASH struct size: 80808
i2c addr: 000000A0timer pause
timer paused
timer depth:1
timer resume
timer depth:0
timer resumed
timer pause
timer paused
timer depth:1
timer resume
timer depth:0
timer resumed
timer pause
timer paused
timer depth:1
timer resume
timer depth:0
timer resumed
timer pause
timer paused
timer depth:1
timer pause
timer depth:2
timer pause
timer depth:3
timer pause
timer depth:4

> mode tt

So… too many calls to timers_pause. Now I just need to figure out where and why… (oh for gdb!)

Woohoo fixed.

There is an early return in timers_remove.

You can see it here:

Notice there is no call to cpu_irq_enable_level(...) before the return statement.

Will tidy up and commit in the morning…

5 Likes