might be worth trying i2c rate increase to 400k if you already have the scope out!

i can hook up i2c to an oscilloscope but not sure i’ll be able to determine if it’s within the i2c spec… what’s the max time it should take according to the i2c spec? i could try and measure that.

but i’m not convinced anymore that freezing has something to do with i2c, since i gave it lower priority - in theory, if the i2c communication got corrupted somehow, this should only affect the remote commands, and everything else should continue working. in my testing, though, even with lower i2c priority, when it froze the tt became completely unresponsive, including triggers / front panel button / keyboard (even after disconnecting/reconnecting). which makes me think maybe there is something else at play.

@tehn i can give this a try. do you mean this should make it easier to flush out i2c issues?

It would seriously reduce the time that the TT has to wait around for i2c responses. I had thought it was running at 400k per previous discussions, but had never verified it in the TT’s firmware. A 3x rate increase on these communications should really help, one would think. The TXi already returns as fast as humanly possible. :slight_smile:

1 Like

The pull-up resistors affect the rise time of the SDA and SCL lines. If the resistance is too high, the rise time takes too long, and instead of a nice square wave on the SCL line you’ll get a ramp wave.

If the resistance is too low, the MCU won’t be able to pull the line down, plus I think you start to waste more current (which might be an issue in some applications).

The faster you run the i2c bus, the more critical the rise time becomes (you have less time…).
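To put rough numbers on it (this is just from the spec, not measured): the i2c spec measures rise time from 30% to 70% of VDD, which for an RC charge curve works out to roughly 0.85 × R × C. Standard mode (100k) allows 1000 ns, fast mode (400k) only 300 ns. Assuming ~100 pF of total bus capacitance:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    const double c_bus = 100e-12;     /* assumed total bus capacitance: 100 pF */
    const double k = log(0.7 / 0.3);  /* ~0.847: 30% -> 70% of an RC charge curve */
    printf("max pull-up @ 100k: %.0f ohms\n", 1000e-9 / (k * c_bus)); /* ~11.8k */
    printf("max pull-up @ 400k: %.0f ohms\n", 300e-9 / (k * c_bus));  /* ~3.5k */
    return 0;
}
```

So a pull-up value that is comfortably in spec at 100k can be 3-4x too large once you go to 400k.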

This is a great link with lots of oscilloscope pictures:

http://dsscircuits.com/articles/effects-of-varying-i2c-pull-up-resistors

(I actually own an oscilloscope now… when/if I get some time I’ll hook it up to the i2c lines… I also need to figure out how to use it properly)

1 Like

experimented yesterday: adjusted interrupt levels again (1 for i2c, 2 for triggers, 3 for timers), and changed the event and timer code to use cpu_irq_save / cpu_irq_restore again (and changed timers_pause / timers_restore to do the same).

a 10ms metro reading from both ansible and TXi and writing to 4 TXos ran for over 2 hours before crashing. when it crashed, ansible continued running but tt froze completely.

forgot to try increasing the i2c rate - will try that with this variation today, and will also try updating ansible with lower i2c priority.

@sam - what scope did you get? i use picoscope which has i2c decoding - super handy!

3 Likes

Just moving this discussion back onto this thread.

My feeling is that there is an elegance and a simplicity to a multi-producer single-consumer (MPSC) queue to be used for the main run loop.

Under this model, anything can be a “producer” and post an event to the queue (trigger IRQ, timer IRQ, “consumer” code). But only “consumer” code can run an event, and consumer code cannot be preempted by other consumer code. Contested operations - access to SPI, synchronous I2C reads/writes, running a script, etc - only happen in consumer code, and thus are guaranteed to only be touched by one piece of code at a time.
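As a sketch of the shape I have in mind (illustrative names only, not the actual Teletype code):

```c
#include <stdbool.h>
#include <stdint.h>
#include "interrupt.h"  // ASF: irqflags_t, cpu_irq_save, cpu_irq_restore

#define QUEUE_SIZE 32

typedef struct {
    uint8_t type;
    int16_t data;
} event_t;

void handle_event(const event_t *e);  // consumer-side dispatch, defined elsewhere

static volatile event_t queue[QUEUE_SIZE];
static volatile uint8_t head;  // only the consumer writes this
static volatile uint8_t tail;  // only producers write this, under the irq mask

// producers: trigger IRQ, timer IRQ, or consumer code itself
bool event_post(event_t e) {
    irqflags_t flags = cpu_irq_save();  // producers contend with each other
    uint8_t next = (uint8_t)((tail + 1) % QUEUE_SIZE);
    bool ok = (next != head);           // if full, drop rather than block
    if (ok) {
        queue[tail] = e;
        tail = next;
    }
    cpu_irq_restore(flags);
    return ok;
}

// the single consumer: SPI, synchronous i2c and script runs only ever
// happen in here, so they can never preempt each other
void event_loop(void) {
    while (true) {
        if (head != tail) {
            event_t e = queue[head];
            head = (uint8_t)((head + 1) % QUEUE_SIZE);
            handle_event(&e);
        }
    }
}
```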

Now sadly “perfect” models like these never survive their first encounter with the real world… still it’s something to strive towards.

Taking your example of CV and using two different timers: both CV and screen are updated over SPI. None of the SPI code I’ve seen is re-entrant - you always need to spi_selectChip, and there is no masking done. So you’re in a situation where the screen SPI is done in the main run loop, which can be preempted by the high-priority CV timer. The locking overhead needed to make that work might defeat any performance gain. (See below for what happens in 2.0 as it stands.)

Before going down that road it would be worth taking some measurements, so we can quantify what the performance actually is - and what needs improvement.

As an alternative to going down the multiple-priority-timer route, could I instead suggest a multiple-priority event queue? You still have the benefit of only one “consumer” running at a time (e.g. less locking required for resource access), but you gain the ability for, say, CV updates to run before a screen refresh.

There are dangers with this; the 2 big ones I can think of are:

  1. High priority tasks starving low priority tasks of CPU time. You could get into the unfortunate situation where e.g. the keyboard handling code never gets the time to run. I think this will be hard to get perfect, but should be easy to get good.

  2. Slow-running events can hold up the queue: because only one event runs at a time and can’t be preempted by another event, you could have a slow screen render in progress while several high priority events queue up, but the run loop must wait until the render has finished before it can process the next event.

The simple solution for this is to make events short. Screen rendering (when it’s needed) does 2 slow things: the first is a lot of string manipulation, the second is sending data over SPI to the OLED. These could be split into 2 separate events on the queue, 1 to prep the data, 1 to send the data to the screen.
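Reusing the hypothetical event queue from the sketch above, the split could look something like this (helper names made up):

```c
enum { kEventScreenPrep, kEventScreenSend };

static uint8_t framebuffer[128 * 64 / 8];

// step 1: the slow string manipulation; posts step 2 to the back of the
// queue so any higher priority events get a chance to run in between
void handle_screen_prep(void) {
    render_text_to(framebuffer);  // hypothetical helper
    event_post((event_t){ .type = kEventScreenSend, .data = 0 });
}

// step 2: the slow SPI transfer to the OLED
void handle_screen_send(void) {
    oled_write(framebuffer, sizeof framebuffer);  // hypothetical helper
}
```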

Synchronous I2C is another really really slow thing, that will be much trickier to fix.

Personally I quite like this idea, but then again, I’m emphatically not an embedded developer, and I hate all things IRQ. This could just be my way to not have to deal with them too much.


Just coming back to the 2.0 code and SPI. I believe what happens is that screen rendering SPI is done in the main loop, but that can be preempted by the CV timer - though not the other way round. Thus you can (and do) end up with the following code running…

spi_selectChip(OLED, ....)
spi_write(OLED, ....)

// preempted by CV timer

spi_selectChip(DAC, ....)
spi_write(DAC, ....)
spi_unselectChip(DAC, ....)

// return to main thread

spi_write(OLED, ....)  // uh oh, no chip selected!!!
spi_unselectChip(OLED, ....)

Thus the screen can occasionally get garbled, but not the CV.
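A minimal way to close that window (sticking with the pseudocode above rather than real driver calls) would be to mask interrupts around the whole OLED transaction:

```c
// sketch: make the full OLED transaction atomic, so the CV timer's
// spi_selectChip can't land in the middle of it
irqflags_t flags = cpu_irq_save();
spi_selectChip(OLED, ....);
spi_write(OLED, ....);
spi_unselectChip(OLED, ....);
cpu_irq_restore(flags);
```

Though note this is exactly the locking overhead I mentioned earlier: the CV timer now has to wait out the (slow) screen transfer.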


A Rigol DS1054Z, hacked to be faster, stronger, etc, etc. I was planning on doing a bit more electronics, but then I fell into the rabbit hole of 2.0…

2 Likes

Interesting. I really wish we had something concrete about pending interrupts when you mask all of them.

cpu_irq_save / cpu_irq_restore are used within the ASF, and code that we run is already using those functions.

I wonder if when you mask an interrupt, the pending bit is still set, so that it fires when you unmask. (I guess at most you’d only ever receive one request per type of interrupt.)

a single event queue with a single consumer would definitely make for less bug-prone code (and i’m a big proponent of “simple is beautiful” when it comes to coding). this would only work, though, if the queue gets processed at a rate equal to or faster than the rate of filling it. as soon as we run into a scenario where things just don’t get processed fast enough, i think we have to start thinking about introducing different priorities, which then necessitates making sure that any operations that use shared resources (such as updating CV) are atomic (good point about SPI updates not being atomic - i should play with changing that as well).

this could be done several ways: having a separate queue for high priority updates, or system triggers that process immediately rather than generating an event, for instance. in both cases you have the danger of higher priority updates blocking lower priority ones. so either way it needs to be coded in such a way that both priorities get a chance to run - by allocating time for lower ones, by defining a high priority event for “execute a lower priority event”, or by some other mechanism, such as a “smart” event loop which would check when the keyboard input was last processed and make sure it doesn’t lapse too much, but could also adjust the rate based on how full the event queue is. you could then have it processed at a regular frequency when not under stress and less often when it is (but still often enough as to not become unresponsive).
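a rough sketch of what that “smart” consumer could look like (two queues plus a deadline for the low priority one - all names made up):

```c
#include <stdbool.h>
#include <stdint.h>

#define LOW_PRI_DEADLINE_MS 20  // low priority must run at least this often

typedef struct queue queue_t;   // opaque queue type (hypothetical)
extern queue_t high_q, low_q;
extern bool queue_empty(queue_t *q);  // hypothetical helpers
extern void run_next(queue_t *q);
extern uint32_t time_now_ms(void);    // hypothetical tick source

void event_loop(void) {
    uint32_t low_last_run = 0;
    while (true) {
        uint32_t now = time_now_ms();
        if (!queue_empty(&low_q) && now - low_last_run > LOW_PRI_DEADLINE_MS) {
            run_next(&low_q);   // anti-starvation: keyboard et al. get a turn
            low_last_run = now;
        } else if (!queue_empty(&high_q)) {
            run_next(&high_q);  // CV updates, trigger offs, ...
        } else if (!queue_empty(&low_q)) {
            run_next(&low_q);   // nothing urgent - service the backlog
            low_last_run = now;
        }
    }
}
```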

and then, of course, one could look into optimizing the event processing - so, say, if you add an event that updates CVs and the last event in the queue also does that, you could simply replace that last event with the new one… but this is an interesting can of worms on its own and not really needed unless performance becomes a big problem. i do want to keep an eye on performance though as i make some things more atomic, as that will definitely reflect on performance - the question is to what extent (and in any case stability is better than speed). i agree that all the code that works with exclusive resources should be reviewed to make it as performant as possible (so, the screen refresh as you mentioned, the event queue code, the timers code perhaps…)


triggers off and CV slewing are actually really interesting cases to consider within this context. a trigger off is an event that follows a trigger on event - which can be delayed based on how full the queue is, but the trigger off would be expected to happen a specific amount of time after the trigger on, not whenever the event loop gets around to executing it. so the trigger on is a regular event that then initiates a higher priority event of turning the trigger off. same for CV slewing - setting CV level is a regular event, but if slewing is enabled it should either be done with higher priority (so that the slew time doesn’t increase as your queue gets fuller and some of the slew timer interrupts get masked) or, if done with regular timers, it should check how much time actually elapsed and adjust accordingly. not sure the increased complexity is worth it in this case, but interesting to consider nonetheless. and then you get into other interesting stuff, such as: consider you have a 50ms long trigger pulse, you trigger it, and 20ms after that you do TR 1 1. the first command turns the trigger on and creates an event to turn it off in 50ms. 20ms later the 2nd command also sets it to on (so nothing changes), but 30ms after that the event from TR.PULSE turns it off. should this happen? probably not, since you indicated the desire to have it on with the later command…
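one way to get “the later command wins” (sketch only, hypothetical helpers):

```c
#include <stdbool.h>
#include <stdint.h>

#define TR_COUNT 4

extern void tr_set(uint8_t n, bool on);   // hardware output (hypothetical)
extern uint16_t pulse_time_ms[TR_COUNT];  // TR.TIME per output (hypothetical)
extern void oneshot_ms(uint16_t ms, void (*fn)(uint8_t), uint8_t arg);

static bool pulse_pending[TR_COUNT];

void tr_pulse_end(uint8_t n);  // forward declaration

// TR.PULSE n: turn on now, schedule the off
void tr_pulse(uint8_t n) {
    tr_set(n, true);
    pulse_pending[n] = true;
    oneshot_ms(pulse_time_ms[n], tr_pulse_end, n);
}

// TR n x: a manual set cancels any pending pulse-off,
// so the later command wins
void tr_manual(uint8_t n, bool on) {
    pulse_pending[n] = false;
    tr_set(n, on);
}

// fires when the pulse timer elapses; does nothing if a manual
// set happened in the meantime
void tr_pulse_end(uint8_t n) {
    if (pulse_pending[n]) {
        pulse_pending[n] = false;
        tr_set(n, false);
    }
}
```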


i don’t hate IRQs (thread safety is a great topic to read about while drinking morning coffee…), i just hate investigating thread related bugs :slight_smile: i’m trying hard to not have to read the ASF code. *looks up how cpu_irq_save and barrier are implemented* damn it! :slight_smile:

2 Likes

Yeah, maybe not when it’s just about to be bedtime though…!

There is definitely a lot of interesting discussion to be had about how to fairly schedule tasks. One nice property is that any changes to the scheduling algorithm are self-contained - the actual tasks don’t change. It should make for tidy experiments.

One more thought before I go to bed… we’ve got a 60MHz CPU, I suspect it spends a large amount of time doing nothing. I wonder if we could use that idle time (i.e. no events in the queue) to do optimistic updates of CV slews and trigger times, or something else even.
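Something like this (pure sketch):

```c
// use otherwise-idle loop iterations for opportunistic housekeeping
void event_loop(void) {
    while (true) {
        if (event_pending())
            run_next_event();
        else
            idle_tick();  // e.g. advance CV slews, check trigger-off deadlines
    }
}
```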

2 Likes

update: it appears i changed the i2c interrupt priority in the wrong file, so my change from 3 to 1 didn’t have any effect. if i do change it in the correct location (libavr32/conf/conf_twi.h) i can get it to freeze very quickly.

this gives more weight to my previous theory again: i2c exchanges getting interrupted cause the system to lock, and the i2c code isn’t very resilient to that. so instead i reverted the i2c priority to 3 and changed both the trigger and timer interrupts to 2. this version has proved stable so far - it ran over 14 hours with no issues, until i stopped it.

to remove one of the variables i also disconnected ansible from i2c, so the test above didn’t include any ansible remotes. i’ve now reconnected ansible and added some remotes to metro. running it with a 2ms (!) metro with no issues so far. at such a high rate keyboard response is not very reliable - it takes a few tries to register each keypress - but it still works.

what i’m planning to do next:

  • update ansible to use the same libavr32 code
  • try reverting some of the other changes i made to see if they make a difference as well (triggers ISR change)
  • look into updating SPI code to be more atomic
  • increase i2c rate (see the sketch below).
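for the rate increase, the ASF master driver takes the bus speed at init time - something along these lines (FPBA_HZ and TT_ADDR are placeholder symbols; the actual libavr32 init code may differ):

```c
#include "twi.h"  // AVR32 ASF TWI (i2c) master driver

void i2c_init_400k(void) {
    twi_options_t opt = {
        .pba_hz = FPBA_HZ,  // peripheral bus clock (placeholder)
        .speed  = 400000,   // 400 kHz fast mode, up from 100 kHz
        .chip   = TT_ADDR,  // own address on the bus (placeholder)
    };
    twi_master_init(&AVR32_TWI, &opt);
}
```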

i’ve pushed my code to git:


https://github.com/monome/libavr32/tree/irqs (just realized i branched in monome repo. d’oh!)

… and it just locked. double d’oh!

4 Likes

If you hit delete branch in the web UI, it will be as if it never happened.

Otherwise…


In this commit:

Reordering stack-local variables after the IRQ mask won’t make a difference. AFAIK there is no danger of those variables being overwritten by another thread with a call to the same function. Each function invocation has its own stack frame. It’s only static and global variables that you need to worry about.
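i.e.:

```c
void tick(void) {
    int count = 0;     // per-invocation: each call (IRQ or not) gets a
                       // fresh copy in its own stack frame
    static int total;  // one shared copy: this is what needs the mask
    total += ++count;
}
```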


Did this change make much difference? It appears to be a reversion of this commit - which, I’ll be honest, I made on the off chance that it improved performance.


To what extent is the code just WIP experiments so that we can follow along? (i.e. should I be nagging you about putting a .c file in the conf directory?)

1 Like

i’ll transfer it at a later point. or maybe i should just leave it there until it’s ready for merge - wouldn’t that commit history disappear if i delete the branch without merging? and there are already commits in my tt branch referencing those.

iirc my worry was the compiler reordering operations, but yeah, in this case it doesn’t make a difference. doesn’t hurt either, though.

in one of my tests it did, but i plan to test reverting it back to what you had. this is from the exploratory testing i did when i was changing multiple things, so it’s possible it only made a difference together with some other changes.

at this point, yeah, mostly WIP. style/refactoring comments will probably be easier for me to track when i do a proper PR. i’ll transfer the 2 functions from /conf to /src

2 Likes

Just push it to your own fork on GitHub instead. (.gitmodules is only about the initial checkout; you can manually add a remote inside the submodule.)

I can go into this further if you want?

:thumbsup:

If you do keep this change:

static volatile irqflags_t irq_flags;

void irqs_pause( void ) {
  irq_flags = cpu_irq_save();
}

void irqs_resume( void ) {
  cpu_irq_restore(irq_flags);
}

It might be wise to return irq_flags rather than save it in a global. I’d be worried about nested calls to irqs_pause otherwise.
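i.e. something like:

```c
// nesting-safe variant: each caller keeps the saved state in its own
// stack frame instead of a shared global
irqflags_t irqs_pause(void) {
    return cpu_irq_save();
}

void irqs_resume(irqflags_t flags) {
    cpu_irq_restore(flags);
}
```

That way nested pairs each restore exactly what they saved.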

3 Likes

that’s what i did - i have a remote added in the submodule, and a branch in that remote i’m using for libavr32 changes. but somehow i missed the fact that i created it in the monome repo instead of my own. at this point i can probably just create another branch on top of monome/irqs and make sure that’s what i have checked out in the submodule folder, and no changes will be needed in my tt branch - correct?

this was my worry also. but then you should never have a condition where these calls are nested since all irqs are disabled.

i thought i did this because there were some cases where they don’t get called from the same function, but looking now i don’t see it - i’ll change as you suggest; agreed, it’s a better way!

ran some more tests with ansible connected to i2c - inconclusive so far. part of the problem is that it’s now getting into the territory where it can run for hours before freezing, which makes testing somewhat time consuming…

1 Like

When you clone the Teletype repo, what you need to do is change the origin of the submodule to point somewhere else; that way, when you git push, it will create the branch there instead of at Monome. (git push defaults to origin as the remote, unless told otherwise.)

cd libavr32
git remote set-url origin https://github.com/scannerdarkly/libavr32.git  # assuming this exists
git remote add upstream https://github.com/monome/libavr32.git           # so you can fetch from 'upstream'

On the one hand I always feel bad about setting it up this way, as the submodules always cause confusion. On the other hand, learning how to use git is definitely a transferable skill!


This is the example I had in mind:

// in file 1
void set_cv() {
    irqs_pause();
    do_something();
    irqs_resume();
}

// in file 2
void run_script() {
    irqs_pause();
    irqs_pause();
    set_cv();       // uh oh! overwrites the saved irq_flags, so the
                    // outer irqs_resume leaves interrupts masked
    irqs_resume();
}

A call to run_script will cause problems. There is no code in there like that now, but if it were introduced later…


That’s brilliant progress, but I do feel your pain too. Do you have a gut feeling as to which change has made the biggest improvement?

2 Likes

ah, that explains it. i had origin set to the monome repo and had 2 other remotes added for your and my repos. at this point submodules feel pretty natural - this was the part i missed.

yep, you’re right, this would definitely cause problems. i’ll make the change.

my feeling is that the biggest change is changing UI_IRQ_PRIORITY to a lower level so that it doesn’t compete with i2c. but i really need to prove it by running more tests. another thing that seems to make a difference is having ansible connected - testing this now… if this is the case there might still be i2c related bugs in ansible.

if my initial theory is correct (i2c communications get interrupted by some ISR with the same priority level, which leaves i2c hanging waiting for a response and thus blocking everything else), then i2c communications should have the highest level and should never be allowed to be masked. but this also means that if there are other i2c related bugs, they can cause the whole system to freeze. giving i2c lower priority would mean that it doesn’t block anything else, but it can easily get corrupted - and once it does, all the remotes will stop working. not a good scenario either…

one argument against this theory is that the latest implementation with cpu_irq_save would also mask i2c interrupts, which doesn’t seem to cause any issues…

2 Likes

back to basics: testing with a simpler setup.
disconnected everything from i2c except one TXo connected directly to TT.

5427406 beta 10 version

  • one trigger updating CV 1 and TO.CV 1 - seems fine
  • 2 triggers, updating their respective CV and TO.CV - managed to crash it only once, seemed fine otherwise.
    interestingly, at some point it updated TO.TR output 3 even though the scripts didn’t touch it - possibly pointing to i2c corruption?
  • added TO.TR.PULSE to each of the scripts -
    fairly easy to get it to crash, very reproducible.
    as the trigger rate increases it seems to lock (keyboard unresponsive, CV updates stop) and decreasing the rate fixes it, but at some point it locks completely.
    adding 2 more triggers or metro with i2c doesn’t seem to make much difference

5427406 beta 10 version with UI_IRQ_PRIORITY changed to 2

  • with 4 triggers and metro at 10ms it seems fine - consistent with the behaviour above, except i can’t manage to crash it

5427406 beta 10 version with UI_IRQ_PRIORITY changed to 1

  • same as above

this seems to confirm that i2c must not have anything interfering with it (and, as a consequence, i2c bugs - including ansible ones - might affect tt operation as well).

@sam - i tried both with this change and without, and it didn’t seem to make a difference. however, there was this interesting comment from @zebra: Does your teletype stop responding after an hour or two?

i do notice a difference in behaviour between just changing the UI_IRQ_PRIORITY level vs https://github.com/scanner-darkly/teletype/tree/dev - at extreme trigger rates the latter seems to be more responsive. also, with everything reconnected to i2c and super i2c heavy scripts, the latter doesn’t crash but the simpler version does. likely it’s the irqs_pause / irqs_resume change.

at this point it would be great to have somebody else also try the version that just changes the UI_IRQ_PRIORITY level. if somebody could test this it’d be super helpful (you just need one TXo):

script 1:

```
CV 1 RAND 10000
TO.CV 1 RAND 10000
TO.TR.PULSE 1
```

script 2:

```
CV 2 RAND 10000
TO.CV 2 RAND 10000
TO.TR.PULSE 2
```

and then try hitting it with 2 fast triggers into inputs 1 & 2. play with the trigger rate, and see if you can get it to lock with the official beta 10.

if you can reproduce locking with the above conditions, then try this version instead and see if it makes a difference: [teletype.hex](/uploads/default/original/2X/0/06006533f7351448c00b4510696c983afeaa76ef.hex) (323.0 KB)

if you run into any other locking issues you could also give this version a try and see if it fixes your issue. this is the official beta 10 version with the one change described above (disclaimer: this is not the ultimate solution - there is still a long path to happiness ahead, but this will help a lot to get there faster!)

1 Like

great!

the best way to test is to start with the trigger inputs disconnected, then enter the scripts above and confirm they work by manually triggering them using F1 and F2 (you should see the trigger outputs on the TXo pulse, and both teletype and TXo CV outputs 1 & 2 get set to random values). then connect something that can produce fast triggers to trigger inputs 1 & 2 (you can use any LFO here), start with a slow rate, and observe as you increase the rate.

at some point it should stop updating the outputs and the keyboard will likely become unresponsive. when that happens, lower the rate again and see if it unlocks. if it does, try increasing the rate again. repeat this several times and see if you can get it into a state where it stays unresponsive even if you lower the rate or disconnect the triggers altogether.

if you manage to get it into this state, try repeating the process several times to make sure it’s reproducible, and then try the version i posted and see if you can reproduce it there as well.

after that i’d just try any scripts that involve i2c, remotes or TX commands, try using a metro script at 25ms or 10ms, and see if you can get it to freeze with each version.

awesome, thanks for your help!

redraw errors will happen at higher rates - this is a known issue, and there are a couple of things we could try once we get it to not freeze anymore.

for i2c i’d try doing lots of loops and commands that involve both reading and writing over i2c (although i don’t think there are any ES or MP remotes for reading - i don’t have i2c on either of those, though, so not 100% sure).