Session 8 Summary

C++ Programming Course, Summer Term 2018

Code discussed in this session

Enforcing Compile-Time Evaluation: constexpr

Ranges and Views

Making actual use of constexpr, perfect forwarding, and expression templates: ranges and views.

Talks on the proposed Ranges Technical Specification:

Iterator vs. Sentinel

Not without audible weeping and gnashing of teeth, we observed that end iterators are not real iterators. Most otherwise valid expressions on an iterator type are not allowed if an instance happens to represent an end iterator, such as:

  • Dereference: *it
  • Member dereference: it->member
  • Incrementing: ++it, it++
  • Offset: it += n, it -= n
  • Random access: it[n]

Valid expressions for end iterators:

  • Random access (random-access iterators only):
    • Distance: it_a - it_b
    • Order: it_a < it_b, …
  • Equality: it_a == it_b, it_a != it_b

Generally speaking, this leaves us with equality / inequality comparison for end iterators. Arguably, inequality is the only relevant expression we need for iteration:

for (auto & val : container) {
  // ...
}

// which is syntactic sugar for:
for (auto it = std::begin(container);
     it     != std::end(container);
     ++it) {
  // ...
}

// same as:
for (auto it = container.begin(); it != container.cend(); ++it) {
  // ...
}

Note that in the last variant of the loop, the type of it (decltype(container)::iterator) is different from the type of container.cend() (decltype(container)::const_iterator). The comparison still works because STL containers provide an implicit conversion from iterator to const_iterator.

Iterator vs. Container

STL container concepts and algorithms depend on iterators.

In the lazy_sequence and sparse_array assignments, however, we learned that iterators are conceptually independent of containers:

  • Iterators may have different reference types for const and non-const contexts
    (see sparse_array)
  • An iterator does not require a container of elements
    (see lazy_sequence)
  • Values referenced by an iterator do not necessarily have to exist until it is dereferenced
    (see all index-based iterators we came across)

We also discussed why it is important for STL-style algorithms to rely on the least complex iterator category that suffices, for example preferring ++it over it += 1.

Input and forward iterators would suffice to iterate over a sequence, but they do not provide distance-based expressions.

This is inconvenient for ranges!

Let’s revisit the lazy_sequence in particular:

// sequence (here: closed range) of squares:
auto ls = lazy_sequence<int>(
              20, // size
              [](int i) { return i*i; });

The only reason we required the size of the sequence to be specified, and only allowed indices starting from 0, was to mimic an STL sequence container.
But apart from that, there is no need to restrict the beginning and length of the sequence. We could also allow:

// half-open range [0...] of squares:
auto ls = lazy_sequence<int>(
            [](int i) { return i*i; });

But if no size is specified, there is no lazy_sequence::end(), so how could we iterate over the sequence?

Perhaps lazy_sequence was a bad idea to begin with?

So let’s stick to the plain STL: how would you specify an iteration on an std::list over exactly n elements?

I assume you have acquired a healthy aversion to StackOverflow answers at this point.
The wide spectrum of solutions recommended for this very problem in this StackOverflow discussion, masterfully avoiding any correct or useful insight, should eliminate any residual doubt.

You would probably use a loop counter.
For classic loops, this is ugly but would get the job done:

// two declarations of different types are not allowed in the loop
// header, so the counter has to live outside:
int ct = 20;

for (auto it = list.begin();
     it != list.end() &&
     ct > 0;
     ++it, --ct) {
  //
  //   This is just as useful as it looks
  //
}

But we don’t like old-fashioned for loops, do we?
We prefer:

  • range-based for in case of in-order sequential iteration, or
  • algorithms like std::for_each, std::transform, … if we can be specific about operation semantics

Range-based for loops do not allow decrementing a counter in the loop header, and STL algorithms cannot be “canceled” at all. This is why there is a std::<algorithm>_n variant (std::copy_n, std::fill_n, std::generate_n, …) at least for the most basic algorithms, as sketched below.
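
For example, a minimal sketch using std::copy_n, which processes exactly n elements without any end iterator arithmetic:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <list>

int main() {
  std::list<int> values = { 1, 2, 3, 4, 5, 6, 7 };

  // copy exactly 5 elements; no end iterator needed:
  std::copy_n(values.begin(), 5,
              std::ostream_iterator<int>(std::cout, " "));
  return 0;
}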

In conclusion, it would be highly desirable to:

  • … specify iteration for half-open ranges
  • … decouple sequence length from iterator distance
  • … use a non-iterator concept for end

Sentinels as a concept replacing end iterators achieve all of this, and much more. They only provide equality comparison operators which effectively serve as a break condition.

class seq_length_sentinel {
  public:
    seq_length_sentinel (int n) : _num_compare(n) { }

    // every comparison consumes one step; post-decrement so that
    // exactly n comparisons succeed:
    template <class Iter>
    bool operator!=(const Iter &) const noexcept {
      return _num_compare-- > 0;
    }

    template <class Iter>
    bool operator==(const Iter & rhs) const noexcept {
      return not (this->operator!=(rhs));
    }

  private:
    mutable int _num_compare = 0;
};
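
Note that a range-based for loop evaluates it != end with the iterator on the left-hand side, so the member operators above are never selected by overload resolution; symmetric free operators are also needed (a sketch, assuming the seq_length_sentinel above; also note that range-based for only permits begin and end of different types since C++17):

template <class Iter>
bool operator!=(const Iter & it, const seq_length_sentinel & s) noexcept {
  return s != it;  // forward to the counting member operator
}

template <class Iter>
bool operator==(const Iter & it, const seq_length_sentinel & s) noexcept {
  return s == it;
}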

To specify a sub-range of a lazy_sequence, we need a thin wrapper type which implements the Range concept.

A range just provides the expressions:

  • std::begin(range) -> Iterator
  • std::end(range) -> Sentinel

Applied to lazy_sequence, we can finally put the pieces together and fix range iteration for good:

auto ls         = lazy_sequence<int>(generator_fun);

int  sub_offset = 10;
int  sub_length =  7;
auto sub_range  = make_range(ls, sub_offset, sub_length);

auto sub_begin  = std::begin(sub_range);
                  // an iterator advanced by 10, i.e.
                  // std::next(std::begin(ls), 10)
auto sub_end    = std::end(sub_range);
                  // returns seq_length_sentinel(7)
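
make_range itself is not spelled out in the session code; a minimal sketch under the assumption that it wraps the sequence together with offset and length (all names hypothetical) could look like this:

#include <iterator>  // std::begin, std::next

template <class Sequence>
class sub_range {
  public:
    sub_range(Sequence & seq, int offset, int length)
    : _seq(seq), _offset(offset), _length(length) { }

    auto begin() const { return std::next(std::begin(_seq), _offset); }
    auto end()   const { return seq_length_sentinel(_length); }

  private:
    Sequence & _seq;
    int        _offset;
    int        _length;
};

template <class Sequence>
sub_range<Sequence> make_range(Sequence & seq, int offset, int length) {
  return sub_range<Sequence>(seq, offset, length);
}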

Range adaptors are explained in this article in more detail.

Multithreading and Parallelism

Creating threads is easy:

#include <iostream>
#include <thread>
using namespace std;

void func(int x) {
    cout << "Inside thread " << x << endl;
}

int main() {
    thread th(&func, 100);
    th.join();
    cout << "Outside thread" << endl;
    return 0;
}

Tasks

An even higher level of abstraction avoids the concept of threads altogether and talks in terms of tasks instead. Consider the following example:

#include <iostream>
#include <future>
#include <chrono>
using namespace std;

int square(int x) {
    return x * x;
}

int main() {
    auto a = async(&square, 10);
    int v = a.get();

    cout << "The thread returned " << v << endl;
    return 0;
}

The async construct uses an object pair called a promise and a future. The former makes a promise to eventually provide a value. The future is linked to the promise and can try to retrieve the value at any time by calling get(). If the promise hasn’t been fulfilled yet, get() simply blocks until the value is ready. async hides most of this from us, except that in this case it returns a future object. Again, since the compiler knows what this call to async returns, we can use auto to declare the future.
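
To make that pair explicit, here is a minimal sketch of roughly what async sets up behind the scenes:

#include <future>
#include <iostream>
#include <thread>

int main() {
    std::promise<int> p;
    std::future<int>  f = p.get_future();  // future linked to the promise

    std::thread worker([&p]() {
        p.set_value(10 * 10);  // fulfill the promise
    });

    std::cout << "The thread returned " << f.get() << std::endl;

    worker.join();
    return 0;
}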

Exercise:

Use async to solve the sum of squares problem. Iterate up to 20 and add your future objects to a vector<future<int>>. Then iterate over all your futures, retrieve each value, and add it to your accumulator. This should require only a few modifications to the code above.
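
One possible solution (a sketch; variable names are ours):

#include <future>
#include <iostream>
#include <vector>

int square(int x) {
    return x * x;
}

int main() {
    std::vector<std::future<int>> futures;
    for (int i = 1; i <= 20; i++) {
        futures.push_back(std::async(&square, i));
    }

    int accum = 0;
    for (auto & f : futures) {
        accum += f.get();  // blocks until this future's value is ready
    }
    std::cout << "accum = " << accum << std::endl;  // prints accum = 2870
    return 0;
}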

this_thread

Let’s make sure this really runs in parallel. Using the code from the last exercise, now add a cout that prints x inside square. Run your program again. Every time you run it, it should be listing the values of x in order. This seems awfully deterministic and is not characteristic of running things in parallel. We did start them in that order, so maybe the threads aren’t overtaking each other. We can check this by adding a sleep inside square, which we can pretend is the heavy computation of x * x:

this_thread::sleep_for(chrono::milliseconds(100));

Note that all these seemingly global objects are in the std namespace, but since we issued using namespace std, we made them visible globally. Okay, run this and time the execution. They are clearly taking turns and they are not running in parallel. Use cout to print this_thread::get_id().

Since the main execution is also considered a thread, try printing the thread ID inside main using this same function. What does this tell you?
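
A sketch of the instrumented square (note that unsynchronized cout output from several threads may interleave):

#include <chrono>
#include <iostream>
#include <thread>

int square(int x) {
    // pretend x * x is a heavy computation:
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::cout << "thread " << std::this_thread::get_id()
              << " squares " << x << std::endl;
    return x * x;
}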

The function async by default gives the program the option of running the task asynchronously or deferred. The latter means square will be called only when we call get(), and it will be executed in the same thread. Ideally, the program should make an intelligent decision, optimized for performance, but for some reason GCC always defers, so let’s not give it a choice.

Change the call to async as follows:

async(launch::async, &square, ...)

Run it again, timing the execution.

Note

We got a 20x speedup only because our threads aren’t actually heavy on the CPU while sleeping, so sleeping is not an ideal surrogate for an x * x that really takes 100 ms of CPU time. In the real case, the speedup would largely be determined by how many cores the computer has.

Ideally, you should avoid starting more computationally intensive threads than your computer can truly run in parallel, since otherwise your CPU cores will start switching their attention between different threads. Each switch is called a context switch and comes with an overhead that hurts performance.
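
std::thread::hardware_concurrency() provides a hint for that limit:

#include <iostream>
#include <thread>

int main() {
    // number of threads the hardware can truly run in parallel;
    // may return 0 if the value is not computable:
    unsigned int n = std::thread::hardware_concurrency();
    std::cout << n << " concurrent threads are supported" << std::endl;
    return 0;
}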

Shared Access

Let us imagine that x * x is a very costly operation and we want to calculate the sum of squares up to a certain number. It would make sense to parallelize the calculation of each square across threads.

We can do something like this:

#include <iostream>
#include <vector>
#include <thread>
using namespace std;

int accum = 0;

void square(int x) {
    accum += x * x;
}

int main() {
    vector<thread> ths;
    for (int i = 1; i <= 20; i++) {
        ths.push_back(thread(&square, i));
    }

    for (auto& th : ths) {
        th.join();
    }
    cout << "accum = " << accum << endl;
    return 0;
}

This should sum all squares up to and including 20. We iterate up to 20, and launch a new thread in each iteration that we give the assignment to. After this, we call join() on all our threads, which is a blocking operation that waits for the thread to finish, before continuing the execution.

This is important to do before we print accum, since otherwise our threads might not be done yet. You should always join your threads before leaving main, if you haven’t already.

Before moving on, also note that C++11 offers a terser iteration syntax over the vector, very close to Java’s syntax. We are also using the keyword auto instead of spelling out the type thread, which we can do whenever the compiler can unambiguously deduce the correct type. We added an & to obtain a reference and not a copy of the object; this matters because std::thread is not copyable, and join() changes the state of the object.

Now, run this. Chances are it spits out 2870, which is the correct answer.

Let’s list all distinct outputs from 1000 separate runs, including the count for each output:

for i in {1..1000}; do ./a.out; done | sort | uniq -c

We should see several runs with incorrect results.

When the compiler processes accum += x * x, this expression is non-atomic and consists of machine-level instructions equivalent to this:

int temp  = accum;
temp     += x * x;
accum     = temp;

With two threads interleaving, it could look like this:

// Thread 1             // Thread 2
int temp1 = accum;      int temp2 = accum;          // temp1 = temp2 = 0
                        temp2 += 2 * 2;             // temp2 = 4
temp1 += 1 * 1;                                     // temp1 = 1
accum = temp1;                                      // accum = 1
                        accum = temp2;              // accum = 4

We end up with accum as 4, instead of the correct 5.

Exercise:

Before we fix the race condition: since keeping accum as a global variable is poor style, we would rather pass it into the thread. Add a parameter int& accum to square. It is important that it’s a reference, since we want to be able to change the accumulator. However, we can’t simply write thread(&square, accum, i): the thread constructor copies its arguments, and the copy cannot be passed on as a non-const reference (this does not even compile). To fix this, we wrap accum in ref(), making it thread(&square, ref(accum), i).
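
A sketch of the modified program (the race condition is still there; we fix it next):

#include <functional>  // std::ref
#include <iostream>
#include <thread>
#include <vector>

void square(int & accum, int x) {
    accum += x * x;  // note: the race condition is still present
}

int main() {
    int accum = 0;
    std::vector<std::thread> ths;
    for (int i = 1; i <= 20; i++) {
        // std::ref prevents the thread constructor from copying accum:
        ths.push_back(std::thread(&square, std::ref(accum), i));
    }
    for (auto & th : ths) {
        th.join();
    }
    std::cout << "accum = " << accum << std::endl;
    return 0;
}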

Mutex

A mutex (mutual exclusion) allows us to encapsulate blocks of code that should only be executed by one thread at a time. Keeping the main function the same:

#include <mutex>

int accum = 0;
std::mutex accum_mutex;

void square(int x) {
    int temp = x * x;
    accum_mutex.lock();
    accum += temp;
    accum_mutex.unlock();
}

The problem should now be fixed. The first thread that calls lock() gets the lock. During this time, all other threads that call lock() will simply block, waiting at that line for the mutex to be unlocked. It is important to introduce the variable temp, since we want the x * x calculation to happen outside the lock/unlock block; otherwise we would be hogging the lock while running our calculation.

As a final polish, you should use std::lock_guard, a mutex wrapper that provides a convenient RAII-style mechanism for owning a mutex for the duration of a scoped block.

In the following variant, you also have unambiguous control over the locked section and obvious prevention of reordering:

int accum = 0;
std::mutex accum_mutex;

void square(int x) {
    int temp = x * x;
    {
      std::lock_guard<std::mutex> lock(accum_mutex);
      accum += temp;
    }
}

Atomic

Recommended Talk: C++ atomics, from basic to advanced. What do they really do?

C++11 offers even nicer abstractions to solve this problem, for instance the std::atomic wrapper:

#include <atomic>

std::atomic<int> accum(0);

void square(int x) {
    accum += x * x;
}

We don’t need to introduce temp here, since x * x is evaluated before being handed off to accum, so it stays outside the atomic operation.

Condition Variables

It is a common scenario to have one thread wait for another thread to finish processing, essentially sending a signal between the threads.

This can be done with mutexes, but it would be awkward. It can also be done using a global boolean “ready flag” that is set by one thread and polled by the other.

Setting such a flag may look harmless, but strictly speaking concurrent unsynchronized access to a plain bool is a data race in the C++ memory model, so the flag should be a std::atomic<bool> or be protected by a mutex. Worse, reacting to the flag with low latency requires a busy loop, pumping the CPU load in the waiting thread to 100%. We could add a sleep inside the polling loop to reduce the CPU load, but this effectively raises the reaction latency to the sleep period length.
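
For illustration, a sketch of the polling variant we are about to replace (wait_for_signal is a hypothetical name):

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> ready(false);

void wait_for_signal() {
    while (!ready) {
        // without the sleep this is a busy loop at 100% CPU load;
        // with it, the reaction latency grows to the sleep period:
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    // ... react to the signal
}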

A more principled way, however, is to wait on a condition variable inside the polling loop.

Condition variables can be woken up spuriously, so we still need the ready flag.

#include <iostream>
#include <thread>
#include <condition_variable>
#include <mutex>

std::condition_variable cond_var;
std::mutex m;

int main() {
    int value  = 100;
    bool ready = false;

    std::thread consumer([&]() {
        std::unique_lock<std::mutex> lock(m);
        while (!ready) {
            cond_var.wait(lock);  // releases the lock while waiting
        }
        std::cout << "The value is " << value << std::endl;
    });

    std::thread producer([&]() {
        {
            // hold the lock while writing the shared state:
            std::lock_guard<std::mutex> lock(m);
            value = 20;
            ready = true;
        }
        cond_var.notify_one();
    });

    consumer.join();
    producer.join();
    return 0;
}

Shared Ownership