Rewriting the renderer for Cathedral

So. Slack is down, which means I can’t actually talk to my teammates right now. It’s 01:18 AM, and I’m sipping a nice Nikka whisky after just solving a really stupid and annoying bug. I figured I might as well write the first post for this blog that I set up ages ago and never used. I’ve actually started writing a bunch of posts, but I can never seem to keep them short enough to have time to finish them up.

Just to be clear – I seldom have a goal when I start writing here. It’s usually just to clear my mind and get new ideas while working on Cathedral. Don’t expect things to always be coherent or worthwhile to read.

Anyway: In short – Cathedral is built around an OpenGL-based renderer. Initially, we just made a renderer that mapped draw calls 1:1 to sprites, and over the last few days, I decided to rewrite it.

Whenever I can rip something vital out of a game engine (or any other software project, for that matter), rewrite it, plug the new thing in and just have it work, I always get a happy feeling. It’s a really good indicator that I made the right call when it comes to the architecture. This time, it ALMOST worked that way. It worked fine on OSX, Debian and Ubuntu, but Windows just produced the dreaded black screen, meaning that something was wrong with my OpenGL state.

So, some background info: The initial renderer was never supposed to be used in production, but on the other hand, it actually scaled quite well, seeing how we’re rendering a bunch of quads and nothing else.

The engine that Cathedral is built on (we call it “Ganymede”) uses a quad tree to figure out what to render on-screen. We build our maps up using layers, with a designated collision layer. Anything below the collision layer is drawn behind the action, and anything above it is drawn in front of it.
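In pseudo-ish terms, the rule looks something like this (a minimal sketch with made-up names – the actual Ganymede code is more involved):

// Query the quad tree for everything on-screen, then paint
// back-to-front: a lower layer index means further behind the action.
std::vector<Sprite*> visible = quadTree.query(cameraBounds);

std::sort(visible.begin(), visible.end(),
          [](const Sprite* a, const Sprite* b) {
              return a->layer < b->layer;
          });

for (Sprite* s : visible)
    draw(s);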


This has been working out surprisingly well, but we’ve designed a few levels with very dense screens. We have several layers of 16×16 tiles at a resolution of 400×240, and some of these are parallax layers. First of all, each non-parallax layer is 25×15 = 375 tiles. Secondly, the way Ganymede renders parallax layers is by simply not tracking them in our quad tree. They’re usually a small enough portion of our maps that keeping track of their spatial locations in a tree is more expensive than just brute forcing it (seeing how they change location as you move left/right).

So. In some areas, we might actually have a couple of thousand sprites getting rendered each frame. For the old renderer, this simply meant several thousand draw calls, rebinding vertex buffer objects each frame, and a noticeable drop in performance, especially on low-end hardware such as cheaper laptops. After checking with apitrace, we noticed that we were doing a whopping 33k OpenGL calls per frame in our heavier areas. Of course, only part of these were draw calls, but still.
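For reference, the old 1:1 path boiled down to something like this (a simplified sketch, not the actual Ganymede code):

// One buffer bind and one draw call per sprite, every frame.
for (Sprite* s : visibleSprites) {
    glBindBuffer(GL_ARRAY_BUFFER, s->vbo);
    // ... glVertexAttribPointer setup, uniforms, etc.
    glDrawArrays(GL_TRIANGLE_STRIP, 0, 4); // 4 verts per quad
}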

… So, what to do?

I decided to write a batch renderer instead. One of the things that put us in a good position for this was that we pack all of our textures into a texture atlas. It did produce a new set of issues, though:

  1. We don’t always use the same shader. Luckily, the “extra” shaders we use besides the default one simply apply things such as vertex colors and clipping on UI elements.
  2. We do have some generated textures, and we still needed to keep these outside of the texture atlas.
  3. We used to create all sprites at origin, and then transform them to the correct location.

The first point could be partially solved by simply killing off our GUI shaders and adding an extra attribute to our default shader. The only issue left was clipping (I’ll get back to that). The second issue could be solved by simply adding more texture samplers to the default shader.
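On the C++ side, the sampler part looks roughly like this (a sketch – names like u_samplers and MAX_SLOTS are assumptions, not our exact code):

// Point the default shader's sampler array at texture units 0..N-1;
// this mapping never changes, so it's done once at startup.
constexpr GLint MAX_SLOTS = 8; // assumed slot count
GLint units[MAX_SLOTS];
for (GLint i = 0; i < MAX_SLOTS; ++i)
    units[i] = i;

glUseProgram(defaultShader);
glUniform1iv(glGetUniformLocation(defaultShader, "u_samplers"),
             MAX_SLOTS, units);

Each vertex then just carries a slot index (the sampler field you’ll see in the Vertex struct below), which the fragment shader uses to pick a texture.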

Now, we still needed to solve the transformation part, as well as the clipping. And just to show you what I mean by having to clip things, here’s a shot of what our map should look like when the inventory screen is opened:

We originally did this completely on the GPU, using GLSL shaders (by simply discarding pixels outside of the defined boundaries). This went away as we built the batch renderer, since we wanted to avoid switching vital state such as shaders and textures as much as possible (or at least not too frequently).

Imagine this UI without clipping. Or well. You don’t have to. Here’s a screenshot with clipping disabled:


Kind of a mess. So, to reiterate: we needed to solve clipping, we needed to solve transformations, and we needed to do it in as few draw calls as possible. And here’s the thing: our data is pretty uniform. Everything (and I mean absolutely everything) in Cathedral is a quad. Each one needs 4 vertices and 6 indices.
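That uniformity means the index buffer never changes: it’s the same 6-index pattern repeated per quad, so it can be generated once up front. Something like this (a sketch – MAX_SPRITES is a made-up name):

// Two triangles per quad: (0,1,2) and (2,3,0), offset by 4 vertices
// per sprite. Upload once to a GL_ELEMENT_ARRAY_BUFFER and forget it.
constexpr GLuint MAX_SPRITES = 2048; // assumed batch capacity
std::vector<GLuint> indices(MAX_SPRITES * 6);
for (GLuint i = 0; i < MAX_SPRITES; ++i) {
    const GLuint v = i * 4;
    indices[i * 6 + 0] = v + 0;
    indices[i * 6 + 1] = v + 1;
    indices[i * 6 + 2] = v + 2;
    indices[i * 6 + 3] = v + 2;
    indices[i * 6 + 4] = v + 3;
    indices[i * 6 + 5] = v + 0;
}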

So, we simply moved clipping and transformation over to the CPU. In essence, our batch renderer does this (a lot of details are left out here – I’ve tried to keep it to the essentials only):

void begin() {
    // Keep track of how many things we have in our buffer
    idx = 0;

    // The vbo object is an array of 3 buffers that we switch
    // between to avoid locking up the renderer
    glBindBuffer(GL_ARRAY_BUFFER, vbo[bfrIndex]);

    // glEnableVertexAttribArray, glVertexAttribPointer etc.

    // Map a buffer that we can write to this frame
    buffer = static_cast<Vertex*>(glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY));
}

void batch(Surface* s, Time dt, const glm::mat4& view) {
    // For each of our 4 verts per surface, do:
    auto pos =
        transformVertex(view, s, glm::vec4(-1.0f, 1.0f, 0.0f, 1.0f));

    // (color, uo/vo and slot come from the surface and the atlas)
    buffer->position = glm::vec3(pos);
    buffer->color = color;
    buffer->texture = glm::vec2(uo, vo);
    buffer->sampler = slot;
    ++buffer;

    // ... We also have some extra flushing in here if we fill
    // up our buffer or run out of texture slots etc.

    // At the end, add the number of indices per sprite (6)
    idx += 6;
}

void end() {
    // Unmap the buffer and calculate the next buffer index
    glUnmapBuffer(GL_ARRAY_BUFFER);
    bfrIndex = (bfrIndex + 1) % BUFFER_COUNT;
}
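And when it’s time to actually draw, everything that was batched goes out in one call. Roughly this (a sketch – it assumes the static index buffer from earlier is bound, and that idx counts indices):

void flush() {
    // One draw call for everything batched this frame; since the
    // index buffer never changes, idx is all we need to know.
    glDrawElements(GL_TRIANGLES, idx, GL_UNSIGNED_INT, nullptr);
}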

… So, we do clipping on the CPU, we do transformations on the CPU, and we batch everything into a single draw call. There’s not a whole lot of branching or anything like that going on in this code either, so for a game like Cathedral, this is pretty well-behaved. It’s measurably much, much faster than the old approach (which, again, wasn’t really meant to last forever anyway).

Clipping, by the way, is basically just interpolation of each vertex, and since we never even rotate sprites in Cathedral (it messes up the pixel art look!), we don’t even have to care about that. We can simply clip along the x and y axes, based on a clipping rectangle.
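For axis-aligned quads, that boils down to clamping the quad’s edges to the clip rectangle and shifting the UVs by the same fraction. A minimal sketch (hypothetical names, not our actual code):

// Clip one axis of an axis-aligned quad: clamp the positions a0/a1
// to [lo, hi] and move the texture coordinates t0/t1 proportionally.
static void clipAxis(float& a0, float& a1, float& t0, float& t1,
                     float lo, float hi) {
    if (a1 <= a0)
        return;
    if (a0 < lo) {
        t0 += (t1 - t0) * (lo - a0) / (a1 - a0);
        a0 = lo;
    }
    if (a1 > hi) {
        t1 -= (t1 - t0) * (a1 - hi) / (a1 - a0);
        a1 = hi;
    }
}

// Called once for x (with u0/u1) and once for y (with v0/v1).

Since nothing rotates, the two axes are completely independent, which is what makes this cheap enough to do on the CPU.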

Oh, and by the way: after doing all this work, it worked fine on Linux and OSX (compiled with GCC and Clang, respectively). Windows, however, was just a black screen (compiled with MSVC++ 2015). Apitrace saved the day and let me inspect my buffers (which were all trashed, by the way).

The issue was our vertex declaration, and A+ to you if you can spot it:

namespace Ganymede {
    // I did something really stupid in this struct!
    struct Vertex {
        glm::vec3 position;
        glm::vec4 color;
        glm::vec2 texture;
        float     sampler;
    };
}

Anyway. Bug solved, time to sleep.
