It is. More stable too I think.

I need to try the same libavr32 changes on Teletype too, but I’m feeling hopefully.

edit: finally managed to crash Ansible, but it took an awful lot of abuse.

2 Likes

amazing. you’ve really pulled together this huge step forward, with the expanders and everything!

will merge the PR’s after the next one arrives for this timer fix…

Watching this over the last few days has been like watching a suspenseful investigation drama (but then I work at a software company!)

So! Very! Excited!

Sam, how can we reward you (and anyone else) for your efforts? Come to Boston, we will find you and feed you.

3 Likes

Bug reports! :stuck_out_tongue: Need to get as many as possible by this weekend as my time is running out…

(thank you for the kind words)

“AVR32101: Configuring the AVR32 Interrupt Controller” is where i got the info about masking. under 2.2.2 Masking it says:

Whenever an interrupt occur, hardware automatically sets the mask bits corresponding to all sources with equal or lower priority.

and under 5.1 Nested interrupts:

As all interrupts with the same and lower priority is automatically masked whenever an interrupt is riggered, lower priority interrupts will not preempt the current running interrupt. An interrupt with a higher priority may nevertheless preempt the current running interrupt and thus nested interrupts is enabled by default.

which seems to imply that when you call cpu_irq_disable_level it will disable the specified level and anything below. but maybe it’s not the case…


going over your changes:

  • updating event_post to disable SYS_IRQ_PRIORITY / removing IRQ masking completely: this would mean not blocking the timer interrupts when an event gets added, which i think we agree would cause issues if, say, a timer would try to add to the event queue while a trigger interrupt was doing the same thing. not sure why it worked.
  • registering the trigger interrupts at the same level as the timer interrupts: this would mean with high rate triggers post_event would be called all the time blocking both the trigger and the timer interrupts, and as soon as they would be enabled another trigger would arrive, so makes sense this would mean everything is blocked most of the time.
  • moving gpio_clear_pin_interrupt_flag until after event_post means no 2 trigger interrupts can try adding to the queue at the same time, so makes sense this should have a positive effect on stability.

i think the bottom line is - event_post needs to be protected from corrupting the queue. this is the whole reason why interrupts are disabled there. but i wonder if this should be achieved by checking if the queue is full before modifying it and then restoring it as the existing code does. there might still be a possible race condition where 2 interrupts check the queue one after another, both confirm it’s not full and both add, so checking if it’s full should still be within a block that disables interrupts, but it’s a smaller block, don’t need to disable all for the whole event_post duration.

i like the counter idea - clever! and i think it should take care of all the scenarios with nested interrupts. and good catch on timer_remove!


i2c and USB should have the highest priority imo. i think it’s more dangerous to not let a transaction complete and leave devices hanging in some unknown state. the downside of this is that an i2c transmission has the power to block everything, and fixing that would be hard (as you’d need some timer based solution to reset i2c after a timeout, very tricky). but hopefully the hardware improvements should make i2c code reliable enough that this shouldn’t happen (btw, monome teletype repo still references livabr32 with pullups enabled).

i also agree with @zebra that UI should be higher priority - i’d rather drop more triggers (we’re talking about super high rate anyway!) than have a less responsive UI. perhaps what should be done is not making screen refresh / keyboard / grid / arc handlers timer based (so, not use events for them at all), but process them in the main event loop - this could still be done based on some frequency, by storing the timestamp and comparing against tcTicks.

1 Like

which seems to imply that when you call cpu_irq_disable_level it will disable the specified level and anything below. but maybe it’s not the case…

no, it’s not the case. this describes what happens in the interrupt controller when an IRQ is entered. the cpu_irq_disable_level macro just atomically sets the relevant mask bits directly.

that’s interesting about lower priorities being preempted by higher priorities. i could have sworn this was not the case by default. will have a harder look at the datasheet - IIRC there are some subtle differences between UC3A and UC3B here.

on a 2nd thought we don’t need to optimize event_post as we also need to protect actually modifying the queue, so might as well just disable the timer interrupts for its whole duration.

so, disabling the timer interrupts in event_post and moving gpio_clear_pin_interrupt_flag until after event_post is called in the trigger interrupt ISR should ensure that event_post is never called from a timer or trigger ISR while it’s executing already (interestingly, that’s exactly what init_ansible.c and init_trilogy.c already do). but it’s still possible for HID/USB to try and post an event while event_post is executing. so we either have to disable all IRQs (which would result in poor responsiveness) or switch all the UI stuff to not use the event queue (have a separate queue or just process in the main loop with needed frequency).

i’ll try to do some testing tonight as well - been working on clock mult/div code and need a break from trying to fix a stubborn bug (and secretly hoping it’s caused by the same issues we’re discussing here).

@zebra - ah, make sense! i couldn’t find a good description of what cpu_irq_disable_level did exactly.

As soon as I get home…as soon as I get home…

1 Like

this bit of comment cruft might be interesting:
(https://github.com/catfact/aleph/blob/master/avr32_lib/src/interrupts.c#L133-L140)

was a time when everything worked as you suggest: cpu_irq_disable() in event_post(), and clear the interrupt flag afterwards. somehow the fact that cpu_irq_disable() also has the effect of clearing pending flags, caused an infinite loop of interrupt requests. (or so i believed at the time.) i can’t follow the reason right now.


you’re right, the datasheet basically says higher-priority modes can preempt lower-priority. (i swear there is some reason i didn’t think this would happen.)

Just some quick comments, I’ll try to read the replies more fully tomorrow. It’s bed time here in the UK.

Yes as @zebra says, the lower priority masking inside event_post only applies if it is called from an IRQ.

Consider the following:

  1. I manually run event_post.
  2. It disables IRQs at priority 2.
  3. A priority 1 IRQ happens, and inside that IRQ event_post is called.

Also, as it stands with beta 9, the UI is really really responsive still. Even with me masking all interrupts in the event code.

You have to start sending i2c at 10ms intervals, or using high audio rate triggers to slow it down. And most importantly it doesn’t crash when you do.

(fyi, in 1ms you can send 12.8 bytes over i2c at 100kbs - are we using 100kbs??)

I think i2c bus recovery is a much bigger issue.

Plus some simple tweaks to the Teletype metro code to stop it completely taking over the event queue.

1 Like

we should be at 400kbs, which is the max. but i see that we’re at 132k (??)

i2c bus recovery, yes. i suspect the ansible/trilogy are the issue, crashing while holding the i2c lines hostage.

yeah, still confused about masking then… so, the lower priority masking only happens inside an ISR?

regardless, we might still have cases where a non masked interrupt tries to add an event while an ISR is in progress trying to do the same thing. so we either have to disable all interrupts, or if there are interrupts we’d rather not disable we have to make them stop using the event code.

also, looking at the latest timers_pause and timers_restore i think we might still have the following scenario:

  • process_timer is called, it’s the timer ISR, so it disables the timer interrupts
  • one of the timer callbacks adds an event
  • event_post calls timers_pause, timer_pause_resume_nesting is 0 so it increments and disables the timer interrupts again
  • event_post calls timers_resume which decrements the count and enables the timer interrupts
  • process_timer goes through the rest of the timers - but at this point the timer interrupts are already enabled, which means process_timer can be called again while it’s already in progress…

i wonder if your initial idea of storing the state from cpu_irq_level_is_enabled is more stable.


i still feel i’d rather have a more responsive UI than process more triggers… even though it doesn’t crash (which is a great improvement on its own!) it’s not entirely intuitive to realize you need to lower the rate to get it to respond better again.

in events.c should putIdx, getIdx and sysEvents be declared volatile since they can be changed from an ISR?

I don’t think this is possible, because process_timer is only called from one place, inside the timer interrupt handler, which doesn’t clear the timer IRQ flag until after process_timer has finished running:

  process_timers();
  // clear interrupt flag by reading timer SR
  tc_read_sr(APP_TC, APP_TC_CHANNEL);

The only gotcha, is if calling cpu_irq_enable_level(APP_TC_IRQ_PRIORITY) will also clear the interrupt flag.

Probably… I’ll have a look. I’m a bit clueless when it comes to that stuff if I’m honest…


@scanner_darkly, I’ve read through your comments about changes to how the event queue works… I’m afraid I’m running out of time (and sanity) for bigger changes like that. The 2.0 thread was started on the 17th of March, and the coding goes back even further to the 13 of November.

What I really want to do is make the simplest changes to get the system as stable as possible. Bigger changes will have to wait till 2.1.

It sounds like you’re up for doing some experiments though, here are some of the thoughts I’ve had…

  • Maybe the Teletype should get it’s own event code rather than the version shared with all the other modules. Priority queue?
  • AFAIK the AVR32 supports memory barriers… (git grep barrier in libavr32) which is the requirement for lock free queues, etc, (I think). Probably impossible to do with a JTAG though.
  • I’m also wondering if init_teletype / init_trilogy / etc, should all be moved out of libavr32 into their respective modules’ repos.
1 Like

Okay…

libavr32 changes have been updated in the following PR:

Ansible PR to update libavr32 and make changes to use new timers_pause and timers_resume functions (as well as the older msc changes):

This is the Ansible firmware hex built from that PR: ansible.zip (79.4 KB)

I have (briefly) tested Teletype, Kria and Meadowphysics modes. I can’t test the Arc ones.

@tehn can you test the Ansible firmware in more depth (either the zip or build it yourself from my repo)?

I have also built a version of the latest Teletype beta but using the new changes to libavr32, no problems to report either.

Huff. Time for a cup of tea.

2 Likes

ok! pull and testing coming up within the next couple hours.

yeah, that’s what i meant. i need to read up more on exactly when the flag is set/cleared and how does the masking work…

re: volatile - this seems to be the recommendation. i don’t think it’ll make a difference (we’d probably see some issues already if it did) but seems like a prudent thing to do, i can make this change when i experiment with doing keyboard and screen refresh on their own priority. i have an idea on how to do it without modifying event code.

memory barriers - need to read up on this as well. i thought declaring volatile was enough but now looks like maybe adding memory barriers would also be needed… going through http://stackoverflow.com/questions/26307071/does-the-c-volatile-keyword-introduce-a-memory-fence

i should be able to test your version tonight (almost done setting up the test case with a bunch of telex modules) and then can experiment with some of these ideas.

Continuing from here:

I think @scanner_darkly is now going to try to take us up to the next level of fast. It looks like he’s already made some decent progress too. Ludicrous speed ho!

Some thoughts… (apologies if this is all known to you, but it’s good info to share with everyone else).

I think you/we are going to need a proper testing protocol (luckily USB saving / loading works well now!).

I’d suggest:

  • audio rate triggers only (I was using a Dr Octature sine, AFAIK that will go well above hearing ranges, best to try both triggers and a waveform).
  • audio rate triggers sending i2c.
  • i2c in fast M
  • i2c in fast M plus audio rate triggers
  • i2c in fast M plus audio triggers sending i2c

Anything else? It might a good idea to list the specifics of what you do too.

It’s pain, but I think we’re at the stage where we might fix one thing and break another. I know it also seems redundant to do the lower level tests (e.g. audio rate only), but I think you’ll hit different performance limits in each of the tests.

Can you make each change you do a separate commit. If we do get a regression it makes it easier to track down (and git bisect is magic).

Don’t forget to test with Ansible! We can deal with the Trilogy modules later.

You mentioned changing the i2c IRQ priority. Do you want to add a new I2C_IRQ_PRIORITY define to conf_tc_irq.h and then reference that in conf_twi.h. I think it will be helpful to keep the level definitions in one place.

Lastly I think it’s worth keeping the encapsulation of timers_pause / timers_resume, even if all they do is call cpu_irq_disable_level / cpu_irq_enable_level without the nesting stuff. You could also think about changing them to app_pause / app_resume if you think that makes more sense (and move them to a new file app.c), similar to how it was done in the Aleph code base.

1 Like

I did not follow the multiple threads about this over the last days due to massive day job demands and a lack of understanding for all those irq stuff you are discussing.

I have some time now over the next days and can follow the suggested protocol. Just in case I missed something, what are the actual firmware versions for Ansible and Teletype I should use?

Don’t worry, I feel like that too sometimes :stuck_out_tongue:

That’s more for the future changes that @scanner_darkly is going to make.

Most likely there will be a 2.0b10 out later today, that would be the best thing to test. (That wont include the changes that @scanner_darkly is making though, they will come later, probably in 2.1).

I would suggest trying to test all the features in 2.0, as I’d like to get a release made once the documentation is up to speed. Concentrate on sensible i2c testing rather than the 10ms stuff that’s going on here.

There is no new Ansible firmware to test really. That came up as some of the changes I made to the lower level library for all modules caused issues for Ansible, so it required a fix for future releases. The zip file several posts above contains that firmware. In theory it should add some extra stability to Ansible, but it might not.

1 Like