Okay, so just to be clear: I download Ansible 1.5.0, replace the hex file inside that folder with the one from the zip posted above, update the module with it, and then wait for 2.0b10?

With beta9 on TT, metro at 25, metro toggling TXo TRs, and script 1 setting random TXo CVs, I can ratchet up the rate of triggering script 1 to ~1100Hz before it crashes. In terms of write performance I think that’s pretty amazing.


The Ansible firmware is only if you’re curious.

And yes, beta 10 coming later today. You can play with beta 9 now if you want. It’s pretty good as far as performance goes.

:smiley:

@scanner_darkly’s changes look like they go even further than that too.

Still, we’d love to get rid of all the crashes.

Okay, my curiosity led me to notice that Ansible becomes unresponsive in Teletype mode when putting out triggers at audio rate over i2c. Teletype itself remains unaffected. Ansible stays in that state even when I turn down the audio rate input to Teletype, but starts working again when I connect a grid to leave the mode, without needing to power cycle. Disconnecting the grid and switching back to Teletype mode works too.

I had M 25 running while doing this, plus several audio rate inputs on Teletype driving trigger outs. The trigger outs on Teletype kept working when Ansible froze.

EDIT: I just noticed that the front panel buttons stay unresponsive even after all Teletype scripts are cleared and Ansible is in Kria mode (I could program patterns, but could not get to the config page without a power cycle).

Is this all with the Ansible firmware from the zip in this thread?

Next question, do the same bugs appear in 1.5.0?

Yes and no. It’s with the new firmware from above. I have one module with this installed now and one back on v1.5.0. They both freeze, sometimes at the same moment, sometimes one takes a bit longer (I think the new one, but only by seconds). Both then react to switching to a grid app by hotplugging the grid, but on 1.5.0 the grid becomes unresponsive, and unplugging it and then switching back to the Teletype app does not consistently make it responsive again; a power cycle is mostly needed. Not on the new one. The lack of front panel responsiveness is not consistent. On the new one it sometimes works; for v1.5.0 I cannot say, since the grid is unresponsive.


Sounds like there is no difference between the zip version and 1.5.0.

Which is fine by me. The test version of Ansible was just to check that the changes hadn’t broken anything.

Thanks for testing.

agreed! i think we should concentrate on getting the 2.0 release ready, and if i manage to improve i2c / fast triggers stability this will be 2.1.

for the i2c / fast triggers issue my main goal is to get it to the point where it’s absolutely stable. i hope that this goal is achievable, but if not, perhaps we could at least harden it to the point where it crashes significantly less often and only under significantly higher load (which my testing suggests is possible). speed is not a concern - i’d rather have a system that is not capable of processing audio rate triggers (as it was never designed for that) but is stable (as a side note, there is an interesting question of whether triggers or timers should have higher priority - i’ll experiment with this as well).

for now my plan is to continue experimenting with IRQs / events / timers / i2c to see if i can get it not to crash. with my set up i was able to get it to freeze very quickly with the previous beta version. the changes i posted made it take longer to crash (~10-15 min). the latest change i tried was giving i2c lower priority (1 instead of 3) - my theory was that the i2c interrupts getting masked caused i2c to hang, so my previous changes were aimed at making sure this doesn’t happen (only masking the timer and the trigger interrupts, since i2c does not use the event code anyway). but changing it to lower priority actually made it more stable - looks like the i2c code is much more resilient than i thought… it did freeze eventually - not sure how long it took as i left it running overnight, but it ran for at least 1.5 hours, which is a huge improvement.

with that in mind i’m going to try complete masking again, going back to using cpu_irq_save / cpu_irq_restore, and see if that makes a difference. i also plan to make the same changes in ansible, to eliminate the possibility of broken ansible i2c taking down tt as well.
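for reference, this is roughly the shape of the change i mean - a minimal sketch only, using the ASF cpu_irq_save / cpu_irq_restore calls; the queue structure and names below are made up, not the actual tt code:

#include <stdint.h>
// irqflags_t, cpu_irq_save and cpu_irq_restore come from the ASF interrupt header

typedef struct {
    uint8_t data[32];
    uint8_t head;
    uint8_t tail;
} fake_queue_t;                          // hypothetical shared structure

static volatile fake_queue_t q;

void queue_post(uint8_t ev) {
    irqflags_t flags = cpu_irq_save();   // mask all interrupts, remember previous state
    q.data[q.head] = ev;                 // touch the shared state with no chance of preemption
    q.head = (q.head + 1) % 32;
    cpu_irq_restore(flags);              // restore whatever was enabled before
}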

agreed on the testing plan - but i think there is no point in starting testing until i manage to get it stable under a heavy load in my own testing, or at least improve it significantly. i’ll post my progress here, and once i get it to a point where i’m happy i’ll start a new thread and ask for help with beta testing.

these are not incremental changes though, they’re all related - they have to be done in conjunction.

good idea, i’ll do this!

agreed on this as well, i’ll make it part of my changes!

in my tests (beta9, metro at 10, doing some i2c stuff in metro) i was able to get it to freeze pretty quickly even with triggers below audio rate…


:heart:

:heart: :heart:

As incremental as possible then?

I hate to bring up the dreaded resistance again… do you want to try with a smaller number of modules on the bus? You’ve got way more than most. Also, try building the latest master of Ansible.

I was running i2c to Ansible (only thing on the bus) at 2ms metro times.

yeah, definitely. i’ll bundle those changes that need to be bundled, and the rest will be separate commits.

my thinking is - if i can get it to be more reliable in such a heavy setup it should be even more stable for something less taxing :slight_smile: my worry is that there is some bug somewhere, and testing like this will flush it out sooner. if there is some race condition that only happens under very specific conditions, this gives me a way to reproduce it more easily.


That’s a really good point.

It wasn’t until I could easily trigger a crash with an audio rate trigger that I was able to get anywhere with this.

Still, if you have an oscilloscope, check to see if your i2c lines are pulled up quickly enough.

might be worth trying i2c rate increase to 400k if you already have the scope out!

i can hook up i2c to an oscilloscope but not sure i’ll be able to determine if it’s within the i2c spec… what’s the max time it should take according to the i2c spec? i could try and measure that.

but i’m not convinced anymore that freezing has something to do with i2c, since i gave it lower priority, so in theory if the i2c communication got corrupted somehow this should only affect the remote commands but everything else should continue working. in my testing though even with lower i2c priority when it froze tt became completely unresponsive, including triggers / front panel button / keyboard (even after disconnecting/reconnecting). which makes me think maybe there is something else at play.

@tehn i can give this a try. do you mean this should make it easier to flush out i2c issues?

It would seriously reduce the time that the TT has to wait around for i2c responses. I had thought it was running at 400k per previous discussions, but had never verified it in the TT’s firmware. A 3x rate increase on these communications should really help, one would think. The TXi already returns as fast as humanly possible. :slight_smile:
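For reference, here’s roughly what that change might look like, assuming the i2c init uses the ASF TWI master driver (the option values and clock constant below are placeholders, not necessarily how the TT firmware actually sets it up):

// hypothetical sketch - bump the TWI clock from 100 kHz to 400 kHz fast mode
twi_options_t opt = {
    .pba_hz = FPBA_HZ,   // peripheral bus clock (placeholder constant)
    .speed  = 400000,    // was 100000
    .chip   = 0x50,      // target address (placeholder)
};
twi_master_init(&AVR32_TWI, &opt);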


The pull-up resistors affect the rise time of the SDA and SCL lines. If the resistance is too high, the rise time takes too long, and instead of a nice square wave on the SCL line you’ll get a ramp wave.

If the resistance is too low, the MCU won’t be able to pull the line down, plus I think you start to waste more current (which might be an issue in some applications).

The faster you run the i2c line, the more critical the rise time becomes (you have less time…).
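As a rough answer to the earlier question about the spec: if I’m remembering it right, the i2c spec allows a maximum rise time of about 1000ns in 100kHz standard mode and about 300ns in 400kHz fast mode. The rise from 30% to 70% of the supply on an RC charge takes roughly 0.85 × R × C, so with, say, 4.7kΩ pull-ups and ~200pF of bus capacitance you’d get around 0.85 × 4700 × 200pF ≈ 800ns: fine at 100kHz, well out of spec at 400kHz. (Those numbers are just illustrative, the real bus capacitance would need to be measured.)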

This is a great link with lots of oscilloscope pictures:

http://dsscircuits.com/articles/effects-of-varying-i2c-pull-up-resistors

(I actually own an oscilloscope now… when/if I get some time I’ll hook it up to the i2c lines… I also need to figure out how to use it properly)


experimented yesterday: adjusted interrupt levels again (1 for i2c, 2 for triggers, 3 for timers), and changed the event and timer code to use cpu_irq_save / cpu_irq_restore again (and changed timers_pause / timers_restore to do the same).
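for context, the level assignments look something like this with the ASF INTC calls (the handler names and irq lines below are placeholders, not the actual ones in the firmware):

// placeholders only - the real handler names / irq numbers differ
INTC_register_interrupt(&i2c_irq,     AVR32_TWI_IRQ,    AVR32_INTC_INT1);  // i2c: level 1 (lowest of the three)
INTC_register_interrupt(&trigger_irq, AVR32_GPIO_IRQ_0, AVR32_INTC_INT2);  // triggers: level 2
INTC_register_interrupt(&timer_irq,   AVR32_TC_IRQ0,    AVR32_INTC_INT3);  // timers: level 3 (highest)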

10ms metro with reading from both ansible and TXi and writing to 4 TXos ran for over 2 hours before crashing. when it crashed ansible continued running but tt froze completely.

forgot to try increasing the i2c rate - will try that with this variation today, and also update ansible to use lower i2c priority.

@sam - what scope did you get? i use picoscope which has i2c decoding - super handy!


Just moving this discussion back onto this thread.

My feeling is that there is an elegance and simplicity to using a multi-producer single-consumer (MPSC) queue for the main run loop.

Under this model, anything can be a “producer” and post an event to the queue (trigger IRQ, timer IRQ, “consumer” code). But only “consumer” code can run an event. Consumer code cannot be preempted by other consumer code. Contested resources, such as access to SPI, synchronous I2C read/writes, running a script, etc, only happen in consumer code, and thus are guaranteed to only be accessed by one piece of code at a time.
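To make that concrete, here’s a minimal sketch of the shape I have in mind: a ring buffer with interrupts masked only around the queue bookkeeping itself. All names and sizes are made up, this is not the actual TT event code.

#include <stdbool.h>
#include <stdint.h>
// irqflags_t / cpu_irq_save / cpu_irq_restore come from the ASF interrupt header

typedef struct { uint8_t type; int16_t data; } event_t;

#define QUEUE_SIZE 64
static volatile event_t queue[QUEUE_SIZE];
static volatile uint8_t head, tail;

void handle_event(const event_t *e);          // consumer-side dispatch (SPI, i2c, scripts live in here)

// producer: anything may call this, including trigger and timer IRQ handlers
bool event_post(event_t e) {
    irqflags_t flags = cpu_irq_save();        // protect head against other producers
    uint8_t next = (head + 1) % QUEUE_SIZE;
    bool ok = (next != tail);                 // queue full: drop the event
    if (ok) { queue[head] = e; head = next; }
    cpu_irq_restore(flags);
    return ok;
}

// consumer: only the main run loop ever calls this, so handlers never run concurrently
void event_loop(void) {
    while (true) {
        if (head != tail) {
            event_t e = queue[tail];
            tail = (tail + 1) % QUEUE_SIZE;
            handle_event(&e);
        }
    }
}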

Now sadly “perfect” models like these never survive their first encounter with the real world… still it’s something to strive towards.

Taking your example of CV and using two different timers: both CV and the screen are updated over SPI. None of the SPI code I’ve seen is re-entrant, you always need to spi_selectChip, and there is no masking done. Now you’re in a situation where the screen SPI is done in the main run loop, which can be preempted by the high priority CV timer. The locking overhead needed to make that work might defeat any performance gain. (See below for what happens in 2.0 as it stands.)

Before going down that road it would be worth taking measurements to see if we can quantify what the performance is, and what needs improvement.

As an alternative to going down the multiple priority timer route, could I instead suggest a multiple priority event queue? You still have the benefits of only one “consumer” running at a time (e.g. less locking required for resource access), but you gain the ability for, say, CV updates to run before a screen refresh.

There are dangers with this, the 2 big ones I can think of are:

  1. High priority tasks starving low priority tasks of CPU time. You could get into the unfortunate situation where e.g. the keyboard handling code never got the time to run. I think this will be hard to get perfect, but should be easy to get good.

  2. Slow running events can hold up the queue, because only one event runs at a time, and can’t be preempted by another event. So you could have a slow screen render in progress while several high priority events have queued up, but we must wait until the screen render has finished before the run loop can process the next event.

The simple solution for this is to make events short... Screen rendering (when it's needed) does 2 slow things. The first is a lot of string manipulation, the second is sending data over SPI to the OLED. These could be split into 2 separate events on the queue: 1 to prep the data, 1 to send the data to the screen.
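Something like this is what I’m picturing for the consumer side, reusing the event_t from the sketch above (the queue helpers and event names here are hypothetical): drain the high priority queue first, and split the render into a prep event and a send event.

extern bool hi_queue_pop(event_t *e);    // e.g. CV updates, trigger-off events
extern bool lo_queue_pop(event_t *e);    // e.g. screen prep, screen send, keyboard
extern void lo_queue_post(event_t e);

enum { E_SCREEN_PREP, E_SCREEN_SEND };

void handle_event(const event_t *e) {
    switch (e->type) {
        case E_SCREEN_PREP:                                     // slow part 1: string manipulation
            // render strings into the frame buffer here
            lo_queue_post((event_t){ .type = E_SCREEN_SEND });  // part 2 becomes its own event
            break;
        case E_SCREEN_SEND:                                     // slow part 2: SPI transfer to the OLED
            // push the frame buffer over SPI here
            break;
    }
}

void event_next(void) {
    event_t e;
    if (hi_queue_pop(&e)) handle_event(&e);       // anything high priority jumps the queue
    else if (lo_queue_pop(&e)) handle_event(&e);  // then one normal event per pass
}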

Synchronous I2C is another really really slow thing, that will be much trickier to fix.

Personally I quite like this idea, but then again, I’m emphatically not an embedded developer, and I hate all things IRQ. This could just be my way to not have to deal with them too much.


Just coming back to the 2.0 code and SPI. I believe what happens is that screen rendering SPI is done in the main loop, but that can be preempted by the CV timer. But not the other way round. Thus you can (and do) end up with the following code running…

spi_selectChip(OLED, ....)
spi_write(OLED, ....)

// preempted by CV timer

spi_selectChip(DAC, ....)
spi_write(DAC, ....)
spi_unselectChip(DAC, ....)

// return to main thread

spi_write(OLED, ....)  // uh oh, no chip selected!!!
spi_unselectChip(OLED, ....)

Thus the screen can occasionally get garbled, but not the CV.
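(For completeness, the brute-force way to close that particular hole would be to mask interrupts around the whole OLED transaction, something like the sketch below, using the same placeholder arguments as above. The downside is that the CV timer is then delayed for the whole screen transfer, which is exactly the locking overhead mentioned earlier.)

irqflags_t flags = cpu_irq_save();  // the CV timer can no longer interleave its chip select
spi_selectChip(OLED, ....);
spi_write(OLED, ....);
spi_unselectChip(OLED, ....);
cpu_irq_restore(flags);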


A Rigol DS1054Z, hacked to be faster, stronger, etc, etc. I was planning on doing a bit more electronics, but then I fell into the rabbit hole of 2.0…


Interesting. I really wish we had something concrete about pending interrupts when you mask all of them.

cpu_irq_save / cpu_irq_restore are used within the ASF, and code that we run is already using those functions.

I wonder if when you mask an interrupt, the pending bit is still set, so that it fires when you unmask. (I guess at most you’d only ever receive one request per type of interrupt.)

a single event queue with a single consumer would definitely make for less bug-prone code (and i’m a big proponent of “simple is beautiful” when it comes to coding). this would only work, though, if the queue gets processed at a rate equal to or faster than the rate at which it gets filled. as soon as we run into a scenario where things just don’t get processed fast enough, i think we have to start thinking about introducing different priorities, which then necessitates making sure that any operations that use shared resources (such as updating CV) are atomic (good point about SPI updates not being atomic - i should play with changing that as well).

this could be done several ways: having a separate queue for high priority updates, or system triggers that process immediately rather than generating an event, for instance. in both cases you have the danger of higher priority updates blocking lower priority ones - so either way it needs to be coded so that both priorities get a chance to run. that could mean allocating time for lower priority events, defining a high priority event for “execute a lower priority event”, or some other mechanism, such as a “smart” event loop that checks when keyboard input was last processed and makes sure it doesn’t lapse too much, while also adjusting the rate based on how full the event queue is - so keyboard input would be processed at a regular frequency when not under stress, and less often under load (but still often enough not to become unresponsive).
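something along these lines in the loop, roughly (a sketch only, all names and numbers made up):

#define KBD_MAX_LAPSE_MS 20           // made-up budget: never go longer than this without polling

extern uint32_t ms_now(void);         // hypothetical millisecond tick
extern void poll_keyboard(void);
extern void process_one_event(void);  // pop and handle one queued event, if any

void run_loop_tick(void) {
    static uint32_t last_kbd = 0;
    uint32_t now = ms_now();
    if (now - last_kbd >= KBD_MAX_LAPSE_MS) {
        poll_keyboard();              // force keyboard handling even when the queue is busy
        last_kbd = now;
    }
    process_one_event();              // otherwise just keep draining the queue
}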

and then, of course, one could look into optimizing the event processing itself - say, if you add an event that updates CVs and the last event in the queue also does that, you could simply replace that last event with the new one… but this is an interesting can of worms on its own and not really needed unless performance becomes a big problem. i do want to keep an eye on performance though as i make more things atomic, since that will definitely affect it - the question is to what extent (and in any case stability is better than speed). i agree that all the code that works with exclusive resources should be reviewed to make it as performant as possible (the screen refresh as you mentioned, the event queue code, perhaps the timers code…)
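to illustrate, a tiny sketch of that kind of coalescing at post time, reusing the hypothetical ring buffer / event_post sketch from earlier in the thread. this is only safe for “last value wins” events like an absolute CV level, and the consumer would need to take the same lock around its dequeue for this to be fully safe:

enum { E_CV_UPDATE = 1 };                          // hypothetical event type

bool event_post_coalesce(event_t e) {
    irqflags_t flags = cpu_irq_save();
    bool ok = true;
    uint8_t last = (head + QUEUE_SIZE - 1) % QUEUE_SIZE;
    if (head != tail && e.type == E_CV_UPDATE && queue[last].type == E_CV_UPDATE) {
        queue[last] = e;                           // newest CV level wins, queue doesn't grow
    } else {
        uint8_t next = (head + 1) % QUEUE_SIZE;
        ok = (next != tail);
        if (ok) { queue[head] = e; head = next; }  // normal enqueue
    }
    cpu_irq_restore(flags);
    return ok;
}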


trigger off and CV slewing are actually really interesting cases to consider in this context. a trigger off is an event that follows a trigger on event, and it can be delayed based on how full the queue is - but the trigger off is expected to happen a specific amount of time after the trigger on, not whenever the event loop gets around to executing it. so the trigger on is a regular event that then schedules a higher priority event to turn the trigger off. same for CV slewing - setting a CV level is a regular event, but if slewing is enabled it should either be done at higher priority (so that the slew time doesn’t increase as the queue gets fuller and some of the slew timer interrupts get masked) or, if done with regular timers, it should check how much time actually elapsed and adjust accordingly. not sure the increased complexity is worth it in this case, but interesting to consider nonetheless. and then you get into other interesting stuff, such as: say you have a 50ms trigger pulse, you trigger it, and 20ms after that you do TR 1 1. the first command turns the trigger on and creates an event to turn it off in 50ms. 20ms later the 2nd command also sets it to on (so nothing changes), but 30ms later the event from TR.PULSE turns it off. should this happen? probably not, since you indicated the desire to have it on with the later command…
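a possible way to express that last behaviour (a sketch only - the timer api and output helpers here are placeholders):

extern void timer_cancel(uint8_t out);           // hypothetical: drop the pending pulse-off for this output
extern void set_tr_output(uint8_t out, bool on);

// an explicit TR command wins over any pulse-off still queued from TR.PULSE
void tr_set(uint8_t out, bool on) {
    timer_cancel(out);
    set_tr_output(out, on);
}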


i don’t hate IRQs (thread safety is a great topic to read about while drinking morning coffee…), i just hate investigating thread-related bugs :slight_smile: i’m trying hard to not have to read the ASF code. looks up how cpu_irq_save and barrier are implemented. damn it! :slight_smile:


Yeah, maybe not when it’s just about to be bedtime though…!

There is definitely a lot of interesting discussion to be had about how to fairly schedule tasks. One nice property is that any changes to the scheduling algorithm are self-contained. The actual tasks don’t change. It should make for tidy experiments.

One more thought before I go to bed… we’ve got a 60MHz CPU, I suspect it spends a large amount of time doing nothing. I wonder if we could use that idle time (i.e. no events in the queue) to do optimistic updates of CV slews and trigger times, or something else even.
