update: it appears i changed the i2c interrupt priority in the wrong file, so my change from 3 to 1 didn’t have any effect. if i do change it in the correct location (libavr32\conf\conf_twi.h) i can get it to freeze very quickly.

this gives more weight again to my previous theory, i2c exchanges getting interrupted cause the system to lock, and the i2c code isn’t very resilient to that. so instead i reverted the i2c priority to 3 and changed both the trigger and timer interrupts to 2. this version proved to be stable so far - it ran over 14 hours with no issues, until i stopped it.

to remove one of the variables i also disconnected ansible from i2c, so the test above didn’t include any ansible remotes. i now reconnected ansible and added some remotes to metro. running it on 2ms (!) metro with no issues so far. at such a high rate keyboard response is not very reliable, takes a few times to register each keypress, but it still works.

what i’m planning to do next:

  • update ansible to use the same libavr32 code
  • try reverting some of the other changes i made to see if they make a difference as well (triggers ISR change)
  • look into updating SPI code to be more atomic
  • increase i2c rate.

i’ve pushed my code to git:


https://github.com/monome/libavr32/tree/irqs (just realized i branched in monome repo. d’oh!)

… and it just locked. double d’oh!

4 Likes

If you hit delete branch in the web UI, it will be as if it never happened.

Otherwise…


In this commit:

Reordering stack local variables after the IRQ mask won’t make a difference. AFAIK there is no danger of those variables being overwritten by another thread with a call to the same function. Each function invocation will have it’s own stack frame. It’s only static and global variables that you need to worry about.


Did this change make much difference? Which appears to be a reversion of this commit, which I’ll be honest I made on the off chance that it improved performance.


To what extent is the code just WIP experiments so that we can follow along? (i.e. should I be nagging you about putting a .c file in the conf directory?)

1 Like

i’ll transfer it at a later point. or maybe i should just leave it there until it’s ready for merge - wouldn’t that commit history disappear if i delete the branch without merging? and there are already commits in my tt branch referencing those.

iirc my worry was compiler reordering operations, but yeah in this case doesn’t make a difference. doesn’t hurt either though.

in one of my tests it did but i plan to test reverting it back to what you had. this is from the exploratory testing i did when i was changing multiple things, so it’s possible it only made a difference together with some other changes.

at this point yeah, mostly WIP. style/refactoring comments will be probably easier for me to track when i do a proper PR. i’ll transfer the 2 functions from /conf to /src

2 Likes

Just push it to your own fork on GitHub instead. (.gitmodules is only about the initial checkout, you can manually add a remote inside the sub module)

I can go into this further if you want?

:thumbsup:

If you do keep this change:

static volatile irqflags_t irq_flags;

void irqs_pause( void ) {
  irq_flags = cpu_irq_save();
}

void irqs_resume( void ) {
  cpu_irq_restore(irq_flags);
}

It might be wise to return irq_flags rather than save it in a global. I’d be worried about nested calls to irqs_pause otherwise.

3 Likes

that’s what i did, i have a remote added in the submodule and i have a branch in that remote i’m using for libavr32 changes. but somehow i missed the fact i created it in monome repo instead of my own. at this point i can probably just create another branch on top of monome/irqs, and just make sure that’s what i have checked out in the submodule folder, and no changes will be needed in my tt branch - correct?

this was my worry also. but then you should never have a condition where these calls are nested since all irqs are disabled.

i thought i did this because there were some cases where they don’t get called from the same function, but looking now i don’t see it, i’ll change as you suggest - agreed it’s a better way!

ran some more tests with ansible connected to i2c - inconclusive so far. part of the problem is that it’s getting into the territory now where it can run for hours before freezing, which makes testing somewhat time consuming…

1 Like

When you clone the Teletype repo, what you need to do is change the origin of the submodule to point somewhere else, that way when you git push it will create it there instead of at Monome. (git push defaults to origin as the remote, unless told otherwise)

cd libavr32
git remote set-url origin https://github.com/scannerdarkly/libavr32.git  # assuming this exists
git remote add upstream https://github.com/monome/libavr32.git           # so you can fetch from 'upstream'

On the one hand I always feel bad about setting it up this way, as the submodules always cause confusion. On the other hand, learning how to use git is definitely a transferable skill!


This is the example I had in mind:

// in file 1
void set_cv() {
    irqs_pause();
    do_something();
    irqs_resume();
}

// in file 2
void run_script() {
    irqs_pause();
    set_cv();       // uh oh!
    irqs_resume();
}

A call to run_script will cause problems. There is no code in there like that now, but if it were introduced later…


That’s brilliant progress, but I do feel your pain too. Do you have a gut feeling as to what change has made the biggest improvement?

2 Likes

ah, that explains it. i had origin set to monome repo and had 2 other remotes added for your and my repos. at this point submodules feel pretty natural, this was the part i missed.

yep, you’re right, this would definitely cause problems. i’ll make the change.

my feeling is that the biggest change is changing UI_IRQ_PRIORITY to a lower level so that it doesn’t compete with i2c. but i really need to prove it by running more tests. another thing that seems to make a difference is having ansible connected - testing this now… if this is the case there might still be i2c related bugs in ansible.

if my initial theory is correct (i2c communications get interrupted by some ISR with the same priority level, which leaves i2c hanging waiting for a response and thus blocking everything else) then this means that i2c communications should have the highest level and should never be allowed to be masked. but this also means that if there are other i2c related bugs they can cause the whole system to freeze. giving i2c lower priority would mean that it doesn’t block anything else but it can easily get corrupted and once it does all the remotes will stop working, not a good scenario either…

one argument against this theory is that the latest implementation with cpu_irq_save would also mask i2c interrupts which doesn’t seem to cause any issues…

2 Likes

back to basics: testing with a simpler set up.
disconnected everything from i2c except one TXo connected directly to TT.

5427406 beta 10 version

  • one trigger updating CV 1 and TO.CV 1 - seems fine
  • 2 triggers, updating their respective CV and TO.CV - managed to crash only once, seemed fine otherwise.
    interestingly, at some point it updated TO.TR output 3 at some point even though the scripts didn’t touch it - possibly pointing to i2c corruption?
  • added TO.TR.PULSE to each of the scripts
    fairly easy to get it to crash, very reproducible
    as the trigger rate increases it seems to lock completely (keyboard unresponsive, CV updates stop) but decreasing the rate fixes it, but at some point it locks completely
    adding 2 more triggers or metro with i2c doesn’t seem to make much difference

5427406 beta 10 version with UI_IRQ_PRIORITY changed to 2

  • with 4 triggers and metro at 10ms seems fine, consistent with the behaviour above except i can’t manage to crash it

5427406 beta 10 version with UI_IRQ_PRIORITY changed to 1

  • same as above

this seems to confirm that i2c must not have anything interfere with it (and as a consequence of that i2c bugs including ansible might affect tt operation as well).

@sam - i tried both with this change and without and didn’t seem to make a difference. however, there was this interesting comment from @zebra: Does your teletype stop responding after an hour or two?

i do notice a difference in behaviour between just changing UI_IRQ_PRIORITY level vs https://github.com/scanner-darkly/teletype/tree/dev - at extreme trigger rates the latter seems to be more responsive. also with everything reconnected to i2c and super i2c heavy scripts the latter doesn’t crash but the simpler version one does. likely it’s the irqs_pause / irqs_resume change.

at this point would be great to have somebody else also try the version which changes UI_IRQ_PRIORITY level. if somebody could test this it’d be super helpful (you just need one TXo):

script 1:

TO.CV 1 RAND 10000
TO.TR.PULSE 1```

script 2:
```CV 2 RAND 10000
TO.CV 2 RAND 10000
TO.TR.PULSE 2```

and then try hitting it with 2 fast triggers into 1&2. play with the trigger rate, see if you can get it to lock with the official beta 10. 

if you can reproduce locking with the above conditions then try this version instead and see if it makes a difference: <a class="attachment" href="/uploads/default/original/2X/0/06006533f7351448c00b4510696c983afeaa76ef.hex">teletype.hex</a> (323.0 KB)

if you ran into any other locking issues you could also give this version a try and see if it fixes your issue. this is the official beta 10 version with one change described above (disclaimer: this is not the ultimate solution, there is still a long path to happiness ahead but this will help a lot to get there faster!)
1 Like

great!

the best way to test is to have triggers temporarily disabled, then enter the scripts above, confirm they work by manually triggering them using F1 and F2 (you should see the trigger outputs on TXo pulse and both teletype and TXo CV outputs 1&2 will be set to a random value). then connect something that can produce fast triggers to trigger inputs 1&2 (you can use any LFO here), start with a slow rate and then observer it as you increase the rate.

at some point it should stop updating the outputs and the keyboard will likely become unresponsive. when that happens lower the rate again and see if it unlocks. if it does try increasing the rate again. repeat this several times and see if you can get it into a state where it stays unresponsive even if you lower the rate or disconnect the triggers altogether.

if you manage to get it to this state try repeating this process several times to make sure it’s reproducible, and then try the version i posted and see if you can reproduce it with that version.

after that i’d just try any scripts that involve i2c, remotes or TX commands, try using a metro script at 25ms or 10ms, and see if you can get it to freeze with each version.

awesome, thanks for your help!

redraw errors will happen at higher rates - this is a known issue and there is a couple of things we could try once we get it not to freeze anymore.

for i2c i’d try doing lots of loops and commands that involve both reading and writing over i2c (although i don’t think there are any ES or MP remotes for reading, i don’t have i2c on either though so not 100% sure).

How fast is the trigger input?

yes, definitely. i’ll try to post versions for ES and MP firmwares that will have the same fix tonight or tomorrow night.

do you use a iiBackpack? or is everything connected directly to TT i2c? when it locks does it happen quickly or does it take a while? how reproducible is it - can you get it to lock every time?

Nah, think you might have. The question I has was looking for soemething more specific. I’ve found that you can run TO writes at up to about 1100Hz before things break down, so was wondering if you had similar experience.

if you could try it a few more times that would definitely be helpful! just want to make sure that the steps to reproduce can consistently get it to lock - this will help when testing new versions.

could you try these 2 versions and see if you can still get it to lock? (standard warning - updating firmware will erase your presets!)

teletype.hex (322.7 KB)
https://github.com/scanner-darkly/teletype/commit/414758d27db798f415a4133218ed0fd084a34391

meadowphysics.hex (102.9 KB)
https://github.com/scanner-darkly/meadowphysics/commit/79de8b21d5a4103cb6e4d20b623320ad64a3c011

there are some changes in teletype that should make it more stable. meadowphysics version has the same changes applied to it (plus some other ones it wasn’t yet updated with). i did a quick test - seems to be functional, but i can’t test remote MP commands unfortunately (mine doesn’t have the i2c header, need to do that…).

that’s great news!

my next step is to try a couple of other things that i think should make it more stable still and then focus on getting ansible working (the last several tests i did with it i was able to get tt to freeze fairly quickly when using ansible remotes).

if you run into a problem with earthsea i could also post an updated firmware using the latest libavr32, but i think makes sense to update the trilogy once the libavr32 changes are finalized.

i can do it tonight - it’s not much work, and if you don’t mind giving it a try this will actually help to confirm the fix works. my main concern was having to test again as it’s possible it might regress as i introduce more changes, so made sense to save more extensive testing for later once things will be more finalized.

speaking of which, i should have a new version of TT in the next few days.

3 Likes

here is the earthsea version that includes the same tweaks as the above versions of teletype and meadowphysics, could you give this a try if you get a chance?

earthsea.hex (133.4 KB)

2 Likes

so when ES crash this way sounds like it doesn’t affect TT? TT is still operational? is it the same behaviour you had with the previous version of ES?