1. Run one thread locked to each core.
(NOTE : this is only appropriate on something like a game console where you are in control of all the threads! Do not do this on an OS like Windows where other apps may also be locking to cores, and you have the thread affinity scheduler problems, and so on).
The one-thread-per-core set of threads is your thread pool. All code runs as "tasks" (or jobs or whatever) on the thread pool.
The threads never actually do ANY OS Waits. They never switch. They're not really threads, you're not using any of the OS threading any more. (I suppose you still are using the OS to handle signals and such, and there are probably some OS threads that are running which will grab some of your time, and you want that; but you are not using the OS threading in your code).
2. All functions are coroutines. A function with no yields in it is just a very simple coroutine. There's no special syntax to be a coroutine or call a coroutine.
All functions can take futures or return futures. (a future is just a value that's not yet ready). Whether you want this to be totally implicit or not is up to your taste about how much of the operations behind the scenes are visible in the code.
For example if you have a function like :
    int func(int x);
and you call it with a future<int>, it is promoted automatically to :
    future<int> func( future<int> x )
    {
        // implicitly yields here until x is ready
        return func( x.value );
    }
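The promotion above can be approximated in plain C++ with a callback-based future. This is only a sketch under assumptions: `Future`, `on_ready` and `set` are hypothetical names invented here, and a continuation callback stands in for the implicit yield that a real coroutine language would do.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical minimal future: a shared slot plus callbacks to run once it is filled.
template <typename T>
class Future {
    struct State {
        std::optional<T> value;
        std::vector<std::function<void(const T&)>> waiters;
    };
    std::shared_ptr<State> s_ = std::make_shared<State>();
public:
    bool ready() const { return s_->value.has_value(); }
    const T& value() const { assert(ready()); return *s_->value; }
    // Fill the slot and run everything that was waiting on it.
    void set(T v) const {
        assert(!ready());
        s_->value = std::move(v);
        auto waiters = std::move(s_->waiters);
        s_->waiters.clear();
        for (auto& w : waiters) w(*s_->value);
    }
    // Run fn now if the value is ready, otherwise when it arrives.
    void on_ready(std::function<void(const T&)> fn) const {
        if (ready()) fn(*s_->value);
        else s_->waiters.push_back(std::move(fn));
    }
};

// The ordinary function.
int func(int x) { return x * 2; }

// The "promoted" overload: same work, but deferred until x has a value.
Future<int> func(Future<int> x) {
    Future<int> result;
    x.on_ready([result](const int& v) { result.set(func(v)); });
    return result;
}
```

Calling `func` with an unfilled `Future<int>` returns immediately with another unfilled future; the result is filled as soon as the input is set. In the language described here the compiler would generate the second overload for you.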
When you call a function, it is not a "branch", it's just a normal function call. If that function yields, it yields the whole current coroutine. That is, it's just like threading and waits, but with coroutines and yields instead.
To branch I would use a new keyword, like "start" :
    future<int> some_async_func(int x);

    int current_func(int y)
    {
        // execution will step directly into this function;
        // when it yields, current_func will yield
        future<int> f1 = some_async_func(y);

        // with "start" a new coroutine is made and enqueued to the thread pool
        // my coroutine immediately continues to the f1.wait
        future<int> f2 = start some_async_func(y);

        return f1.wait + f2.wait;
    }
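To make the difference between a plain call and "start" concrete, here is a rough single-threaded model in C++. It is a sketch, not a coroutine implementation: `run_queue`, `start` and `run_until_idle` are hypothetical names, the queue stands in for the thread pool, and nothing here actually yields.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

std::deque<std::function<void()>> run_queue;  // stand-in for the thread pool's work queue
std::vector<std::string> trace;               // records the order things actually ran in

// Hypothetical "start": enqueue the work instead of stepping into it.
void start(std::function<void()> task) {
    assert(task);
    run_queue.push_back(std::move(task));
}

// Drain the queue the way a pool worker would.
void run_until_idle() {
    while (!run_queue.empty()) {
        auto task = std::move(run_queue.front());
        run_queue.pop_front();
        task();
    }
}

void some_async_func(int id) { trace.push_back("async " + std::to_string(id)); }

void current_func() {
    trace.push_back("enter");
    some_async_func(1);                 // plain call: runs inline, right now
    start([] { some_async_func(2); });  // start: queued, runs after we return
    trace.push_back("leave");
}

std::vector<std::string> demo() {
    trace.clear();
    start(current_func);
    run_until_idle();
    return trace;
}
```

The plain call shows up in the trace between "enter" and "leave"; the started one only runs once the current task gives the worker back.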
"start" should really be an abbreviation for a two-phase launch, which allows a lot more flexibility.
That is, "start" should be a shorthand for something like :
    coro * c = new coro( some_async_func(y); );
    c->start();
because that allows batch-starting, and things like setting dependencies after creating the coro, which I have found to be very useful in practice. eg :
    coro * c[32];
    for(i in 32)
    {
        c[i] = new coro( );
        if ( i > 0 )
            c[i-1]->depends( c[i] );
    }
    start_all( c, 32 );
Batch starting is one of those things that people often leave out. Starting tasks one by one is just like waiting for them one by one (instead of using a wait_all), it causes bad thread-thrashing (waking up and going back to sleep over and over, or switching back and forth).
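Here is a small C++ sketch of create-then-batch-start with dependencies. The `Coro` and `Scheduler` types are hypothetical and single-threaded; `depends` and `start_all` follow the pseudocode above, and the bodies are plain callbacks rather than real coroutines.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

struct Coro {
    std::function<void()> body;
    int pending = 0;                // dependencies that have not finished yet
    std::vector<Coro*> dependents;  // coroutines waiting on this one
    // "this" may not run until "other" has finished.
    void depends(Coro* other) { ++pending; other->dependents.push_back(this); }
};

struct Scheduler {
    std::deque<Coro*> ready;

    // Enqueue everything that is already runnable in one pass.
    // A real pool would take its lock and signal workers once here, not once per coroutine.
    void start_all(Coro* const* coros, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i)
            if (coros[i]->pending == 0) ready.push_back(coros[i]);
    }

    void run() {
        while (!ready.empty()) {
            Coro* c = ready.front();
            ready.pop_front();
            assert(c->pending == 0);
            c->body();
            // Finishing c may release the coroutines that depended on it.
            for (Coro* d : c->dependents)
                if (--d->pending == 0) ready.push_back(d);
        }
    }
};

// Build the same chain as the pseudocode (c[i-1] depends on c[i]) and record run order.
std::vector<int> demo_chain() {
    std::vector<int> order;
    std::vector<Coro> storage(4);
    std::vector<Coro*> c;
    for (int i = 0; i < 4; ++i) {
        storage[i].body = [&order, i] { order.push_back(i); };
        c.push_back(&storage[i]);
        if (i > 0) c[i - 1]->depends(c[i]);
    }
    Scheduler s;
    s.start_all(c.data(), c.size());
    s.run();
    return order;
}
```

Because dependencies are set between creation and launch, only the last coroutine in the chain is runnable at `start_all`, and the rest are released one by one as their dependency finishes.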
3. Full stack-saving is crucial.
For this to be efficient you need a very small minimum stack size (4k is probably good) and you need stack-extension on demand.
You may have lots of pending coroutines sitting around and you don't want them gobbling all your memory with 64k stacks.
Full stack saving means you can do full variable capture for free, even in a language like C where tracking references is hard.
4. You stop using the OS mutex, semaphore, event, etc. and instead use coroutine variants.
Instead of a thread owning a lock, a coroutine owns a lock. When you block on a lock it's a yield of the coroutine instead of a full OS wait.
Getting access to a mutex or semaphore is an event that can trigger coroutines being run or resumed; ie. it's a future just like the return from an async procedure call. So you can do things like :
    future<int> y = some_async_func();
    yield( y , my_mutex.when_lock() );
which yields your coroutine until the joint condition is met that the async func is done AND you can get the lock on "my_mutex".
Joint yields are very important because they prevent unnecessary coroutine wakeup. Coroutine thrashing is not nearly as bad as thread thrashing (which is one of the big advantages of coroutine-centric architecture, in fact perhaps the biggest), but it is still wasted work.
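The core of a joint yield is just a countdown: the coroutine is resumed once, when the last condition fires, rather than once per condition. A minimal C++ sketch, assuming a hypothetical one-shot `Event` type and a `yield_until_all` helper that takes the resume step as a callback:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical one-shot event that callbacks can subscribe to.
class Event {
    bool fired_ = false;
    std::vector<std::function<void()>> waiters_;
public:
    void subscribe(std::function<void()> fn) {
        if (fired_) fn();
        else waiters_.push_back(std::move(fn));
    }
    void fire() {
        if (fired_) return;
        fired_ = true;
        auto waiters = std::move(waiters_);
        waiters_.clear();
        for (auto& fn : waiters) fn();
    }
};

// Call resume exactly once, after every event in the list has fired.
void yield_until_all(const std::vector<Event*>& events, std::function<void()> resume) {
    assert(resume);
    if (events.empty()) { resume(); return; }
    auto remaining = std::make_shared<std::size_t>(events.size());
    for (Event* e : events)
        e->subscribe([remaining, resume] {
            if (--*remaining == 0) resume();
        });
}
```

This only models the wakeup counting. A real `when_lock` also has to hand over ownership of the mutex at the moment the joint condition is met, which this sketch does not attempt.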
You must have coroutine versions of all the ops that have delays (file IO, networking, GPU, etc) so that you can yield on them instead of doing thread-waits.
5. You must have some kind of GC.
Because coroutines will constantly be capturing values, you must ensure their lifetime is >= the life of the coroutine. GC is the only reasonable way to do this.
I would also go ahead and put an RW-lock in every object, since that will be necessary.
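In today's C++ the nearest stand-in is reference counting, which is not a real GC (it does not handle cycles) but shows the lifetime rule. A sketch, with `make_length_task` as a made-up example function:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>

// Build a deferred task that captures its input by shared ownership,
// so the string lives at least as long as the task does.
std::function<std::size_t()> make_length_task(std::shared_ptr<const std::string> text) {
    assert(text);
    return [text] { return text->size(); };
}
```

The caller can drop its own reference right after creating the task; the captured copy keeps the value alive until the task itself is destroyed. Doing this by hand for every captured value is exactly the bookkeeping a GC would take off your plate.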
6. Dependencies and side effects should be expressed through args and return values.
You really need to get away from funcs that have various un-knowable inputs and outputs. All inputs & outputs need to be values so that they can be used to create dependency chains.
When that's not directly possible, you must use a convention to express it. eg. for file manipulation I recommend using a string containing the file name to express the side effects that go through the file system (eg. for Rename, Delete, Copy, etc.).
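As a sketch of that convention in C++: each file op declares the file names it touches, and an op waits on the most recent earlier op that named any of the same files. `FileOp` and `build_dependencies` are hypothetical, and this version treats every use as a conflict (no reader/writer distinction) and compares names as plain strings.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct FileOp {
    std::string what;                // e.g. "Copy", "Rename", "Delete"
    std::vector<std::string> files;  // every file name the op reads or writes
};

// For each op, list the indices of the earlier ops it must wait for.
std::vector<std::vector<std::size_t>> build_dependencies(const std::vector<FileOp>& ops) {
    std::map<std::string, std::size_t> last_user;  // file name -> last op that touched it
    std::vector<std::vector<std::size_t>> deps(ops.size());
    for (std::size_t i = 0; i < ops.size(); ++i) {
        for (const std::string& f : ops[i].files) {
            auto it = last_user.find(f);
            if (it == last_user.end()) continue;
            assert(it->second < i);
            // Record each earlier op once, even if several files are shared.
            if (std::find(deps[i].begin(), deps[i].end(), it->second) == deps[i].end())
                deps[i].push_back(it->second);
        }
        for (const std::string& f : ops[i].files) last_user[f] = i;
    }
    return deps;
}
```

Ops that share no file names get no edge between them and are free to run in parallel. In practice the names would need to be canonicalized first so that two spellings of the same path compare equal.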
7. Note that coroutines do not fundamentally alter the difficulties of threading.
You still have races, deadlocks, etc. Basic async ops are much easier to write with coroutines, but they are no panacea and do not try to be anything other than a nicer way of writing threading. (eg. they are not transactional memory or any other auto-magic).
to be continued (perhaps) ....
Add 3/15/13 :
8. No static size anything. No resources you can run out of.
This is another "best practice" that goes with modern thread design that I forgot to list.
Don't use fixed-size queues for thread communication; they seem like an optimization or simplification at first, but if you can ever hit the limit (and you will) they cause big problems. Don't assume a fixed number of workers or a maximum number of async ops in flight either; hitting those limits can cause deadlocks.
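A toy C++ model of how a fixed-size queue bites: one worker, and every task spawns two children. `run_tree` is a hypothetical function for illustration; it reports failure at the point where a blocking push would deadlock, because the only consumer is the task doing the push.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Each task of depth d > 0 spawns two children of depth d - 1.
// capacity == 0 means unbounded. Returns true if the whole tree ran,
// false as soon as a push would exceed the capacity.
bool run_tree(int depth, std::size_t capacity, std::size_t* tasks_run) {
    assert(tasks_run != nullptr);
    std::deque<int> queue;
    queue.push_back(depth);
    *tasks_run = 0;
    while (!queue.empty()) {
        int d = queue.front();
        queue.pop_front();
        ++*tasks_run;
        if (d == 0) continue;
        for (int child = 0; child < 2; ++child) {
            if (capacity != 0 && queue.size() >= capacity) return false;
            queue.push_back(d - 1);
        }
    }
    return true;
}
```

With a capacity of 4 a shallow tree of depth 2 runs fine, so the limit looks safe in testing; a tree of depth 4 hits it. The unbounded version runs any depth that fits in memory.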
The thing is that a "coroutine centric" program is no longer so much like a normal imperative C program. It's moving towards a functional program where the path of execution is all nonlinear. You're setting up a big graph to evaluate, and then you just need to be able to hit "go" and wait for the graph to close. If you run into some limit at some point during the graph evaluation, it's a big mess figuring out how to deal with that.
Of course the OS can impose limits on you (eg. running out of memory) and that is a hassle you have to deal with.