(Modern) C Programming Tips and Tricks


Have you got the FTDI serial cable set up yet? It’s definitely worth getting that up and running before you need it to debug something.

FYI debugging is just printf() style, no gdb or anything else (unless you can recreate your bug in the simulator or a test case).
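
Since printf() is the only debugging tool on the device, it can help to wrap it in a macro that tags each message with file and line and can be compiled out. This is only a sketch: `DBG`, `DEBUG_ENABLED` and `scale_param` are made-up names, and it writes to stderr via stdio, whereas on the hardware you would route it to whatever serial print function the firmware provides.

```c
#include <stdio.h>

/* Hypothetical switch: build with -DDEBUG_ENABLED=0 to silence all output. */
#ifndef DEBUG_ENABLED
#define DEBUG_ENABLED 1
#endif

/* Prints "file:line: message" to stderr.
 * The do/while(0) wrapper makes the macro behave like a single statement.
 * Needs a format string plus C99 variadic macro support. */
#define DBG(...)                                             \
    do {                                                     \
        if (DEBUG_ENABLED) {                                 \
            fprintf(stderr, "%s:%d: ", __FILE__, __LINE__);  \
            fprintf(stderr, __VA_ARGS__);                    \
            fputc('\n', stderr);                             \
        }                                                    \
    } while (0)

/* Example (hypothetical) function instrumented with the macro. */
int scale_param(int raw)
{
    int scaled = raw * 4;
    DBG("scale_param: raw=%d scaled=%d", raw, scaled);
    return scaled;
}
```

Because the condition is a compile-time constant, the compiler can drop the calls entirely when `DEBUG_ENABLED` is 0, while still type-checking the format arguments.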

Regarding the JI bug, it’s also worth writing up a test for it in the tests directory if you can figure it out. If you’re coming from the Python world, you’re probably more than aware that automated testing helps keep you sane.


I haven’t, that’s good to know. I’ll get one set up.


ok here’s one for the more grizzled embedded programmers in here…

any rules-of-thumb, tradeoffs, gotchas etc that go with gcc’s -Os optimisation flag?

We’re running out of space in BEES & I’ve just discovered -Os decreases .hex binary size from 614kB to 425kB! So what are we likely to lose here in terms of performance?


good question. i’ll add that we are starting from -O3.

so (just pasting here from gcc docs)

we would be losing these:

enabled by -O3


and keeping these:

enabled by -O2

-fdevirtualize -fdevirtualize-speculatively
-fgcse -fgcse-lm

except for these:

disabled by -Os



I think we talked about it earlier, maybe even in this same thread, but a lot of the optimizations (especially inlining) in -O3 can make code bigger. Sometimes it can even make code smaller. It’s worth noting the description for -Os is specifically “Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size”

Of course the answer is always profile :slight_smile:


I was worried that might be the case!


I will say that in my experience with ARM Cortex-M, -O3 usually performed worse than -Os or even -O1 for the code we were trying it with. I don’t know if that means anything for the AVR32 in the Aleph though :slight_smile:


well, O3 definitely helped for the aleph 4 years ago when we set it. :slight_smile: but always figured eventually we would want to strike a different balance for more code space

( of course profiling is good but it is also hard for something like bees (an object-oriented patching environment) running on something like avr32. my method in evaluating O2 vs O3 back when, was just to make a large scene with long chains of operators fired by different events (metro, UI and USB) and measure idle time in the main event loop. hard to get very granular about function calls without just setting GPIO flips in particular functions - that’s not super helpful for a bigger picture…)
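
The idle-time measurement described above could look roughly like this. It is a host-runnable sketch with invented names (`event_t`, `post_event`, `main_loop_pass`), not the actual BEES event loop: it counts loop passes that find the queue empty, so a build that handles events faster leaves more idle passes over the same period. On real hardware you would sample the counter from a timer, and the queue would need interrupt protection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_SIZE 16u

/* Hypothetical event record; a real firmware event would carry more. */
typedef struct {
    int type;
    int data;
} event_t;

static event_t queue[QUEUE_SIZE];
static unsigned head;         /* next slot to write */
static unsigned tail;         /* next slot to read */
static uint32_t idle_passes;  /* loop passes that found no work */
static uint32_t handled;      /* events processed so far */

/* Adds an event; returns false if the ring buffer is full.
 * In firmware this would be called from interrupt handlers. */
bool post_event(event_t e)
{
    unsigned next = (head + 1u) % QUEUE_SIZE;
    if (next == tail)
        return false;
    queue[head] = e;
    head = next;
    return true;
}

/* Removes the oldest event; returns false if the queue is empty. */
static bool next_event(event_t *out)
{
    assert(out != NULL);
    if (head == tail)
        return false;
    *out = queue[tail];
    tail = (tail + 1u) % QUEUE_SIZE;
    return true;
}

/* Stand-in for the real work (operator chains, UI, USB...). */
static void handle_event(const event_t *e)
{
    (void)e;
    handled++;
}

/* One pass of the main loop: do work if there is any, else count idle. */
void main_loop_pass(void)
{
    event_t e;
    if (next_event(&e))
        handle_event(&e);
    else
        idle_passes++;
}

uint32_t get_idle_passes(void) { return idle_passes; }
uint32_t get_handled(void)     { return handled; }
```

Comparing `get_idle_passes()` after the same scene has run for the same wall-clock time under two sets of flags gives a coarse whole-system number, which is the big-picture view that per-function GPIO toggles don't provide.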

question is, which flags in O3 matter. and my suspicion is that unfortunately -finline-functions is probably a big one both for speed and size.

i don’t think the alignment flags in O2 matter much because avr32 doesn’t allow unaligned accesses anyway.

gonna try some tests explicitly setting/unsetting some of the O3 and O2 flags that Os would disable…

and well yeah, we certainly talked about inlining and the inline keyword, flags, and attributes. but there are a lot of other flags in O2 and O3 that i frankly don’t understand at all, would be interested to hear any other specific ideas about them…
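
For reference, here is a small sketch of the per-function inlining attributes mentioned above. `always_inline` and `noinline` are GCC extensions; the function names and the Q12 gain format are invented for illustration and are not from the Aleph code. The idea is to force inlining only where it pays (a tiny helper in a hot loop) and forbid it where it would just duplicate code, independent of the global -O level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tiny helper called once per sample: ask GCC to inline it even at -Os.
 * gain_q12 is fixed point, 4096 == 1.0. */
static inline __attribute__((always_inline))
int32_t scale_q12(int16_t x, int16_t gain_q12)
{
    return ((int32_t)x * gain_q12) / 4096;
}

static int clip_events;  /* how many samples had to be clipped */

/* Rarely taken path: keep a single out-of-line copy even at -O3. */
__attribute__((noinline))
static int16_t clip(int32_t v)
{
    clip_events++;
    return (v > INT16_MAX) ? INT16_MAX : INT16_MIN;
}

/* Applies a gain to n samples, saturating to the int16_t range. */
void apply_gain(const int16_t *in, int16_t *out, int n, int16_t gain_q12)
{
    assert(in != NULL && out != NULL);
    for (int i = 0; i < n; i++) {
        int32_t v = scale_q12(in[i], gain_q12);
        out[i] = (v > INT16_MAX || v < INT16_MIN) ? clip(v) : (int16_t)v;
    }
}
```

Whether this beats simply letting the optimizer decide is something only measurement on the target will tell.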

ok, starting a more granular investigation of the gcc opt flags on avr32. immediately discovered that in fact not all the flags are supported by avr32-gcc.

spinning this into a GH issue for further discussion:
[ https://github.com/monome/aleph/issues/292 ]


so i did a bunch of builds with O2 and additional individual flags. as expected, -finline-functions and -finline-small-functions emit a lot more code, but so do -fpeel-loops (complete unrolling) and -fipa-cp-clone (clone external functions with constant arguments.)

for this particular codebase (aleph bees) and avr32-gcc, the following set produces almost no size increase compared to -O2:

-fpredictive-commoning \
-ftree-loop-distribution \
-fexpensive-optimizations \
-funswitch-loops \

i’m gonna do a profiling session with a few of these flags. but expect that predictive commoning and gcse could both be pretty valuable for speed. ftree-loop-distribution is probably useless, and -funswitch-loops doesn’t seem to be doing much for this particular codebase cause we don’t like to put conditionals in loops in the first place.


Specifically for STM32, the F4 and F7 series have what’s called an “ART Accelerator”, which does some prefetch from the flash and puts the instructions in a cache. So there’s a fair bit of an advantage in having your code be small enough that it all fits in the cache. The cache size varies but it’s around 1 kbyte on the chips I’ve been working with. Some of the processors also have a special core-coupled memory, which is great for copying things like ISR code into so that it can be used instantaneously instead of having to be read from flash memory.


ahhh that makes a lot of sense. (sorry i deleted my edit but i didn’t want to muddy the waters too much by bringing in x86 and whatever)

so in other words: sometimes a smaller executable will run faster simply by virtue of being smaller (cache.) this isn’t quite the same as saying -Os produces faster code than -O3.

(avr32 uc3 doesn’t have code cache, btw.)


Here’s a really insightful article on optimization and benchmarking from Raymond Chen:


I just realised basically everything in puredata source code “is a” t_pd…

I’m definitely in the camp that prefers to “compose” objects using “has a”. In this case, consistency is king & the machinery of “is a” (cast to desired parent pointer type before calling method) is straightforward and hard to screw up. However…

The two approaches are mashed together a bit in the code for BEES - for example a grid operator “is a” op_t but also “has a” op_monome_t. Something about the way “has a” works in this instance feels a little janky or something… I dunno, just throws me every time I use it and requires much double-checking.

So my philosophical conundrum is:

“is there a clean, intuitive way to express ‘has a’ in pure C?”

(EDIT: thinking harder about this, “has a” is trivial in C and we use it all the time. The reason the monome_op abstraction of BEES was throwing me is it’s more like a mixin, which is where things get a bit gnarly)
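
To make the distinction concrete, here is a minimal sketch of both patterns in plain C. The types are invented stand-ins (`op_t` here is not the real BEES definition, and `grid_state_t` only plays the role of something like op_monome_t). “Is a” relies on the parent struct being the first member, so a pointer to the child can be cast to the parent and back. “Has a” is just a member. The mixin case gets awkward when code is handed only a pointer to the member and has to find its owner, which is what the `offsetof` helper does.

```c
#include <assert.h>
#include <stddef.h>

/* "Parent" type with one virtual method. */
typedef struct op {
    const char *name;
    void (*handle)(struct op *self, int value);
} op_t;

/* Hypothetical mixin state that several operator types might embed. */
typedef struct {
    int focused;
    int presses;
} grid_state_t;

typedef struct {
    op_t super;         /* "is a": must be the FIRST member */
    grid_state_t grid;  /* "has a": an ordinary member */
    int last;
} op_grid_t;

/* Method implementation: downcast is valid because super is first. */
static void grid_handle(op_t *self, int value)
{
    op_grid_t *g = (op_grid_t *)self;
    g->grid.presses++;
    g->last = value;
}

void op_grid_init(op_grid_t *g)
{
    assert(g != NULL);
    g->super.name = "GRID";
    g->super.handle = grid_handle;
    g->grid = (grid_state_t){ .focused = 1, .presses = 0 };
    g->last = 0;
}

/* Generic dispatch: works on any type that starts with an op_t. */
void op_send(op_t *op, int value)
{
    assert(op != NULL && op->handle != NULL);
    op->handle(op, value);
}

/* Mixin direction: recover the owning operator from its embedded state. */
op_grid_t *op_grid_from_state(grid_state_t *gs)
{
    return (op_grid_t *)((char *)gs - offsetof(op_grid_t, grid));
}
```

The `offsetof` trick only works when the callback knows which concrete type owns the mixin, which is probably part of why that direction feels less tidy than the plain cast.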


I usually keep a state-tracking variable named “is_…” or “has_…” alongside the target datatype and manage the state as I initialize/allocate/close/free.


Hi! Newish to lines, have been writing C for a while.

  • seconding what others said about learning to use a debugger. It’s honestly world changing and I lean to C sometimes when I shouldn’t just because the debuggers are so. damned. useful.
  • if your platform has them, learn to use manpages or other searchable local documentation. I like using emacs for reading manpages (with the regrettably named M-x woman) and infopages.
  • also I like using emacs’ gdb integration.
  • and setting up some one-click building and testing in your editor has a lot of value.
  • if your platform has it, ASAN and UBSAN are cool. -Wall and -Werror are useful. valgrind can also be useful.
  • Check error codes from functions. The goto error pattern helps.
  • C99 struct initializers are cool: f(&(struct whatever){ .x = something });. Lots of examples here:
  • Initialize things to 0, -1, or {0} when it makes sense. Uninitialized var bugs are the worst to debug.
  • Learn a little about the type system and the different guarantees different types have. Using size_t and ssize_t when it’s appropriate helps with cross-platform dev but also helps with documentation.
  • I like klib (https://github.com/attractivechaos/klib) for datastructures.
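
A quick sketch of the goto error pattern from the list above, combined with the initialize-everything tip. `copy_prefix` is a made-up function; only standard library calls are used. Every resource starts out as NULL, every failure jumps to one cleanup label, and the cleanup code is safe to run no matter how far the function got.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Copies up to nbytes from src_path to dst_path.
 * Returns 0 on success, -1 on any failure. */
int copy_prefix(const char *src_path, const char *dst_path, size_t nbytes)
{
    int rc = -1;          /* assume failure until everything has worked */
    FILE *src = NULL;
    FILE *dst = NULL;
    char *buf = NULL;
    size_t got = 0;

    assert(src_path != NULL && dst_path != NULL);

    src = fopen(src_path, "rb");
    if (src == NULL)
        goto done;

    dst = fopen(dst_path, "wb");
    if (dst == NULL)
        goto done;

    buf = malloc(nbytes ? nbytes : 1);  /* avoid a zero-size request */
    if (buf == NULL)
        goto done;

    got = fread(buf, 1, nbytes, src);
    if (ferror(src))
        goto done;

    if (fwrite(buf, 1, got, dst) != got)
        goto done;

    rc = 0;

done:
    /* Single exit path: free(NULL) is a no-op, and the NULL checks
     * skip files that were never opened. */
    free(buf);
    if (dst != NULL && fclose(dst) != 0)
        rc = -1;          /* a failed close can mean lost data */
    if (src != NULL)
        fclose(src);
    return rc;
}
```

The alternative is nested ifs or repeated cleanup at each return, both of which tend to rot as the function grows.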

I don’t think K&R is as good as its reputation makes people think. I liked “Learn C the Hard Way” more than anything else.


yeah i was thinking of it more like a mixin or double inheritance. it’s not a great pattern and i’d like to see some better examples


I have been reading 21st Century C, and it is enlightening. Despite being aware of the additions to C99 and C11 and using some of them in various contexts, seeing the ways that they can be used together is kind of mind-bending.

The combination of variadic macros and compound literals is really interesting.

#include <math.h> //NAN
#include <stdio.h>

#define make_a_list(...) (double[]){__VA_ARGS__, NAN}
#define matrix_cross(list1, list2) matrix_cross_base(make_a_list list1, \
    make_a_list list2)

void matrix_cross_base(double *list1, double *list2){
    int count1 = 0, count2 = 0;
    while (!isnan(list1[count1])) count1++;
    while (!isnan(list2[count2])) count2++;
    for (int i=0; i<count1; i++){
        for (int j=0; j<count2; j++)
            printf("%g\t", list1[i]*list2[j]);
        printf("\n");   //end of one row
    }
    printf("\n\n");     //blank lines between tables
}

int main(){
    matrix_cross((1, 2, 4, 8), (5, 11.11, 15));
    matrix_cross((17, 19, 23), (1, 2, 3, 5, 7, 11, 13));
    matrix_cross((1, 2, 3, 5, 7, 11, 13), (1));   //a column vector
}

I used to do a lot with the preprocessor, but have been less crazy with it recently. I may reconsider.


Not sure how I didn’t know this before, but gcc has an -fstack-usage flag. If you enable it, it will output .su files alongside your .o files which contain a report of the stack usage of each function in that compilation unit. Really useful for embedded programming especially.
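
A small way to try the flag out, assuming the file is saved as stack_demo.c (a made-up name, as is the function). Compiling with `gcc -c -fstack-usage stack_demo.c` should leave a stack_demo.su next to stack_demo.o, with one line per function giving its location and name, a byte count, and a qualifier such as `static`, `dynamic` or `bounded`. The exact byte count depends on target and optimization level, so none is quoted here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Uses a fixed-size local buffer, so its frame size is known at
 * compile time and the .su report should mark it as "static". */
int checksum_block(const unsigned char *data, size_t len)
{
    unsigned char scratch[256];
    size_t n;
    int sum = 0;

    assert(data != NULL);

    /* Never copy more than the scratch buffer can hold. */
    n = (len < sizeof scratch) ? len : sizeof scratch;
    memcpy(scratch, data, n);

    for (size_t i = 0; i < n; i++)
        sum += scratch[i];
    return sum;
}
```

Note that the numbers are per function and don't include callees, so the worst case for a call chain still has to be added up by hand or with a script.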


So… turns out that all our CPUs are borked.

Either they’re really really borked if made by Intel (but patchable). Or just really really borked if made in the last 20 years.

The Meltdown vulnerability is surprisingly easy to understand how to exploit, and seems to be fixable by unmapping most of the kernel in user mode for a modest performance penalty. But so far Spectre is making my head hurt a bit, and appears to have much further reaching consequences, particularly for running code in sandboxes (e.g. Javascript in a web browser).

One of the LLVM patches to add a flag to defend against part of the Spectre attack has the potential for a serious slowdown:

When manually apply similar transformations to -mretpoline to the
Linux kernel we observed very small performance hits to applications
running typical workloads, and relatively minor hits (approximately 2%)
even for extremely syscall-heavy applications. This is largely due to
the small number of indirect branches that occur in performance
sensitive paths of the kernel.

When using these patches on statically linked applications, especially
C++ applications, you should expect to see a much more dramatic
performance hit. For microbenchmarks that are switch, indirect-, or
virtual-call heavy we have seen overheads ranging from 10% to 50%.



Sometimes I make the joke that in the future all human beings will be security engineers. And after that we’ll start recruiting dolphins.


Yeah the Meltdown & Spectre thing is pretty interesting. Not sure what it has to do with C programming Tips and Tricks though, so I’m hesitant to discuss more in this thread :slight_smile: