Yep, same problem different solution.

In fact, even though it has the global variable, that might be a better way to do it, as my solution will still have issues with nested interrupts unless we’re careful.

edit: @ngwese you made the commit, care to comment?

I’m pretty confident it is; it’s definitely a bug even if that’s not all of it. You can easily see how a trigger input (at UI_IRQ_PRIORITY) can re-enable the APP_TC_IRQ_PRIORITY IRQ before it should be.

Just need to get Ansible fixed for the new order. Unfortunately I don’t really know the first thing about its source code…

1 Like

great looking progress so far.

let me look through the code and figure out when/why the IRQ enabling/disabling is happening.

Here’s what I can tell so far.

I’ve added the following code to timers.c in libavr32:

static volatile s8 timer_pause_resume_nesting = 0;

// nesting counter: only the outermost timers_pause actually masks the timer IRQ level,
// and only the matching outermost timers_restore unmasks it again
void timers_pause() {
    print_dbg("timer pause\r\n");
    if(timer_pause_resume_nesting == 0) {
        cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
        print_dbg("timer paused\r\n");
    }
    timer_pause_resume_nesting++;
    print_dbg("timer depth:");
    print_dbg_ulong(timer_pause_resume_nesting);
    print_dbg("\r\n");
}

void timers_restore() {
    print_dbg("timer resume\r\n");
    timer_pause_resume_nesting--;
    print_dbg("timer depth:");
    print_dbg_ulong(timer_pause_resume_nesting);
    print_dbg("\r\n");
    if(timer_pause_resume_nesting == 0) {
        cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
        print_dbg("timer resumed\r\n");
    }
}

I then replaced all calls to cpu_irq_disable_level(APP_TC_IRQ_PRIORITY) with timers_pause(), and all calls to cpu_irq_enable_level(APP_TC_IRQ_PRIORITY) with timers_restore().

Serial debug on boot as follows:

// ansible ////////////////////////////////
== FLASH struct size: 80808
i2c addr: 000000A0timer pause
timer paused
timer depth:1
timer resume
timer depth:0
timer resumed
timer pause
timer paused
timer depth:1
timer resume
timer depth:0
timer resumed
timer pause
timer paused
timer depth:1
timer resume
timer depth:0
timer resumed
timer pause
timer paused
timer depth:1
timer pause
timer depth:2
timer pause
timer depth:3
timer pause
timer depth:4

> mode tt

So… unbalanced calls to timers_pause (the nesting depth climbs to 4 and never comes back down). Now I just need to figure out where and why… (oh for gdb!)

Woohoo fixed.

There is an early return in timer_remove.

You can see it here:

Notice there is no call to cpu_irq_enable_level(...) before the return statement.
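
Roughly, the pattern is this (a simplified sketch, not the verbatim libavr32 source; the “already removed” check and its name are stand-ins):

bool timer_remove(softTimer_t* t) {
    timers_pause();                 // was cpu_irq_disable_level(APP_TC_IRQ_PRIORITY)
    if(!timer_is_linked(t)) {       // hypothetical "not in the list" check
        return false;               // BUG: returns without timers_restore() / cpu_irq_enable_level()
    }
    // ... unlink t from the timer list ...
    timers_restore();               // was cpu_irq_enable_level(APP_TC_IRQ_PRIORITY)
    return true;
}

Any code path that hits the early return leaves the IRQ mask (and now the nesting count) unbalanced, which matches the climbing depth in the debug output above.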

Will tidy up and commit in the morning…

5 Likes

good find!

so ansible itself doesn’t explicitly enable/disable IRQ as you likely discovered already.

you’re saying that the ansible is back working now?

It is. More stable too I think.

I need to try the same libavr32 changes on Teletype too, but I’m feeling hopeful.

edit: finally managed to crash Ansible, but it took an awful lot of abuse.

2 Likes

amazing. you’ve really pulled together this huge step forward, with the expanders and everything!

will merge the PRs after the next one arrives for this timer fix…

Watching this over the last few days has been like watching a suspenseful investigation drama (but then I work at a software company!)

So! Very! Excited!

Sam, how can we reward you (and anyone else) for your efforts? Come to Boston, we will find you and feed you.

3 Likes

Bug reports! :stuck_out_tongue: Need to get as many as possible by this weekend as my time is running out…

(thank you for the kind words)

“AVR32101: Configuring the AVR32 Interrupt Controller” is where i got the info about masking. under 2.2.2 Masking it says:

Whenever an interrupt occurs, hardware automatically sets the mask bits corresponding to all sources with equal or lower priority.

and under 5.1 Nested interrupts:

As all interrupts with the same and lower priority are automatically masked whenever an interrupt is triggered, lower priority interrupts will not preempt the current running interrupt. An interrupt with a higher priority may nevertheless preempt the current running interrupt and thus nested interrupts are enabled by default.

which seems to imply that when you call cpu_irq_disable_level it will disable the specified level and anything below. but maybe it’s not the case…


going over your changes:

  • updating event_post to disable SYS_IRQ_PRIORITY / removing IRQ masking completely: this would mean not blocking the timer interrupts when an event gets added, which i think we agree would cause issues if, say, a timer would try to add to the event queue while a trigger interrupt was doing the same thing. not sure why it worked.
  • registering the trigger interrupts at the same level as the timer interrupts: this would mean with high rate triggers post_event would be called all the time blocking both the trigger and the timer interrupts, and as soon as they would be enabled another trigger would arrive, so makes sense this would mean everything is blocked most of the time.
  • moving gpio_clear_pin_interrupt_flag until after event_post means no 2 trigger interrupts can try adding to the queue at the same time, so makes sense this should have a positive effect on stability.

i think the bottom line is - event_post needs to be protected from corrupting the queue. this is the whole reason why interrupts are disabled there. but i wonder if this should be achieved by checking if the queue is full before modifying it and then restoring it as the existing code does. there might still be a possible race condition where 2 interrupts check the queue one after another, both confirm it’s not full and both add, so checking if it’s full should still be within a block that disables interrupts, but it’s a smaller block, don’t need to disable all for the whole event_post duration.
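
a minimal sketch of that smaller critical section, using the queue names from events.c (putIdx, getIdx, sysEvents); MAX_EVENTS, the types and the return value are assumptions, and i’m using the ASF cpu_irq_save/cpu_irq_restore pair here just for brevity - which levels actually need masking is the open question:

u8 event_post(event_t* e) {
    u8 posted = 0;
    irqflags_t flags = cpu_irq_save();      // mask IRQs only around the full-check and index update
    s16 next = putIdx + 1;
    if(next >= MAX_EVENTS) { next = 0; }
    if(next != getIdx) {                    // queue not full
        sysEvents[putIdx] = *e;             // copy the event into the ring buffer
        putIdx = next;
        posted = 1;
    }
    cpu_irq_restore(flags);
    return posted;
}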

i like the counter idea - clever! and i think it should take care of all the scenarios with nested interrupts. and good catch on timer_remove!


i2c and USB should have the highest priority imo. i think it’s more dangerous to not let a transaction complete and leave devices hanging in some unknown state. the downside of this is that an i2c transmission has the power to block everything, and fixing that would be hard (as you’d need some timer-based solution to reset i2c after a timeout, very tricky). but hopefully the hardware improvements should make i2c code reliable enough that this shouldn’t happen (btw, monome teletype repo still references libavr32 with pullups enabled).

i also agree with @zebra that UI should be higher priority - i’d rather drop more triggers (we’re talking about super high rate anyway!) than have a less responsive UI. perhaps what should be done is not making screen refresh / keyboard / grid / arc handlers timer based (so, not use events for them at all), but process them in the main event loop - this could still be done based on some frequency, by storing the timestamp and comparing against tcTicks.
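
a rough sketch of that main-loop approach (SCREEN_REFRESH_MS, handle_screen_refresh and the exact type of tcTicks are made up for illustration):

#define SCREEN_REFRESH_MS 20

static u64 last_refresh_tick = 0;

static void poll_screen_refresh(void) {
    u64 now = tcTicks;                      // tick counter incremented by the timer ISR
    if(now - last_refresh_tick >= SCREEN_REFRESH_MS) {
        last_refresh_tick = now;
        handle_screen_refresh();            // redraw screen / poll keyboard, grid, arc
    }
}

this would be called from the main event loop alongside check_events(), so the UI work never has to go through the interrupt-driven event queue at all.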

1 Like

which seems to imply that when you call cpu_irq_disable_level it will disable the specified level and anything below. but maybe it’s not the case…

no, it’s not the case. this describes what happens in the interrupt controller when an IRQ is entered. the cpu_irq_disable_level macro just atomically sets the relevant mask bits directly.

that’s interesting about lower priorities being preempted by higher priorities. i could have sworn this was not the case by default. will have a harder look at the datasheet - IIRC there are some subtle differences between UC3A and UC3B here.

on second thought we don’t need to optimize event_post as we also need to protect actually modifying the queue, so might as well just disable the timer interrupts for its whole duration.

so, disabling the timer interrupts in event_post and moving gpio_clear_pin_interrupt_flag until after event_post is called in the trigger interrupt ISR should ensure that event_post is never called from a timer or trigger ISR while it’s executing already (interestingly, that’s exactly what init_ansible.c and init_trilogy.c already do). but it’s still possible for HID/USB to try and post an event while event_post is executing. so we either have to disable all IRQs (which would result in poor responsiveness) or switch all the UI stuff to not use the event queue (have a separate queue or just process in the main loop with needed frequency).
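
to make that ordering concrete, a sketch of a trigger-input ISR with the flag cleared only after posting (the pin, event type and handler names are placeholders, not the actual ansible code):

__attribute__((__interrupt__))
static void irq_trigger_in(void) {
    if(gpio_get_pin_interrupt_flag(TRIGGER_PIN)) {
        static event_t e;
        e.type = kEventTrigger;                    // placeholder event type
        e.data = gpio_get_pin_value(TRIGGER_PIN);
        event_post(&e);
        // the flag is cleared only after event_post returns, so this input
        // cannot queue another call into event_post while one is in progress
        gpio_clear_pin_interrupt_flag(TRIGGER_PIN);
    }
}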

i’ll try to do some testing tonight as well - been working on clock mult/div code and need a break from trying to fix a stubborn bug (and secretly hoping it’s caused by the same issues we’re discussing here).

@zebra - ah, makes sense! i couldn’t find a good description of what cpu_irq_disable_level did exactly.

As soon as I get home…as soon as I get home…

1 Like

this bit of comment cruft might be interesting:
(https://github.com/catfact/aleph/blob/master/avr32_lib/src/interrupts.c#L133-L140)

there was a time when everything worked as you suggest: cpu_irq_disable() in event_post(), and clear the interrupt flag afterwards. somehow the fact that cpu_irq_disable() also has the effect of clearing pending flags caused an infinite loop of interrupt requests. (or so i believed at the time.) i can’t follow the reason right now.


you’re right, the datasheet basically says higher-priority modes can preempt lower-priority. (i swear there is some reason i didn’t think this would happen.)

Just some quick comments, I’ll try to read the replies more fully tomorrow. It’s bed time here in the UK.

Yes as @zebra says, the lower priority masking inside event_post only applies if it is called from an IRQ.

Consider the following:

  1. I manually run event_post.
  2. It disables IRQs at priority 2.
  3. A priority 1 IRQ happens, and inside that IRQ event_post is called.

Because the manual call isn’t running inside an ISR, the hardware never masked the priority 1 level, and the explicit mask only covered priority 2, so event_post can be re-entered while it’s mid-update.

Also, as it stands with beta 9, the UI is still really responsive, even with me masking all interrupts in the event code.

You have to start sending i2c at 10ms intervals, or use high audio-rate triggers, to slow it down. And most importantly, it doesn’t crash when you do.

(fyi, in 1ms you can send about 12.5 bytes over i2c at 100kbps, ignoring ACK and start/stop bits - are we using 100kbps??)

I think i2c bus recovery is a much bigger issue.

Plus some simple tweaks to the Teletype metro code to stop it completely taking over the event queue.

1 Like

we should be at 400kbps, which is the max. but i see that we’re at 132k (??)

i2c bus recovery, yes. i suspect the ansible/trilogy are the issue, crashing while holding the i2c lines hostage.

yeah, still confused about masking then… so, the lower priority masking only happens inside an ISR?

regardless, we might still have cases where a non masked interrupt tries to add an event while an ISR is in progress trying to do the same thing. so we either have to disable all interrupts, or if there are interrupts we’d rather not disable we have to make them stop using the event code.

also, looking at the latest timers_pause and timers_restore i think we might still have the following scenario:

  • process_timers is called from the timer ISR, so the timer interrupts are already masked
  • one of the timer callbacks adds an event
  • event_post calls timers_pause, timer_pause_resume_nesting is 0 so it disables the timer interrupts again and increments the count
  • event_post calls timers_restore which decrements the count and enables the timer interrupts
  • process_timers goes through the rest of the timers - but at this point the timer interrupts are already enabled, which means process_timers can be called again while it’s already in progress…

i wonder if your initial idea of storing the state from cpu_irq_level_is_enabled is more stable.
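
for illustration, the state-saving version could look something like this (returning the saved flag to the caller instead of keeping it in a global is my assumption, so that nesting keeps working):

bool timers_pause(void) {
    bool was_enabled = cpu_irq_level_is_enabled(APP_TC_IRQ_PRIORITY);
    cpu_irq_disable_level(APP_TC_IRQ_PRIORITY);
    return was_enabled;                     // caller holds on to this and passes it back
}

void timers_restore(bool was_enabled) {
    if(was_enabled) {
        cpu_irq_enable_level(APP_TC_IRQ_PRIORITY);
    }
}

// usage: bool was = timers_pause(); ... ; timers_restore(was);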


i still feel i’d rather have a more responsive UI than process more triggers… even though it doesn’t crash (which is a great improvement on its own!) it’s not entirely intuitive to realize you need to lower the rate to get it to respond better again.

in events.c should putIdx, getIdx and sysEvents be declared volatile since they can be changed from an ISR?

I don’t think this is possible, because process_timers is only called from one place, inside the timer interrupt handler, which doesn’t clear the timer IRQ flag until after process_timers has finished running:

  process_timers();
  // clear interrupt flag by reading timer SR
  tc_read_sr(APP_TC, APP_TC_CHANNEL);
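
For context, the handler is shaped roughly like this (a sketch; the exact ISR name and registration vary per module):

__attribute__((__interrupt__))
static void irq_tc(void) {
    tcTicks++;                          // millisecond tick counter
    process_timers();                   // run the soft timer callbacks
    // clear interrupt flag by reading timer SR
    tc_read_sr(APP_TC, APP_TC_CHANNEL);
}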

The only gotcha is if calling cpu_irq_enable_level(APP_TC_IRQ_PRIORITY) also clears the interrupt flag.

Probably… I’ll have a look. I’m a bit clueless when it comes to that stuff if I’m honest…


@scanner_darkly, I’ve read through your comments about changes to how the event queue works… I’m afraid I’m running out of time (and sanity) for bigger changes like that. The 2.0 thread was started on the 17th of March, and the coding goes back even further, to the 13th of November.

What I really want to do is make the simplest changes to get the system as stable as possible. Bigger changes will have to wait till 2.1.

It sounds like you’re up for doing some experiments though, here are some of the thoughts I’ve had…

  • Maybe the Teletype should get its own event code rather than the version shared with all the other modules. Priority queue?
  • AFAIK the AVR32 supports memory barriers… (git grep barrier in libavr32) which is a requirement for lock-free queues, etc. (I think). Probably impossible to do with a JTAG though.
  • I’m also wondering if init_teletype / init_trilogy / etc, should all be moved out of libavr32 into their respective modules’ repos.
1 Like

Okay…

libavr32 changes have been updated in the following PR:

Ansible PR to update libavr32 and make changes to use new timers_pause and timers_resume functions (as well as the older msc changes):

This is the Ansible firmware hex built from that PR: ansible.zip (79.4 KB)

I have (briefly) tested Teletype, Kria and Meadowphysics modes. I can’t test the Arc ones.

@tehn can you test the Ansible firmware in more depth (either the zip or build it yourself from my repo)?

I have also built a version of the latest Teletype beta but using the new changes to libavr32, no problems to report either.

Huff. Time for a cup of tea.

2 Likes