Zooid Engine: First Time Profiling and Optimizing

I’ve been working on Zooid Engine (my custom game engine) since 2017, but only during holidays or on Sundays. Mostly, I work on features without really optimizing the code. A couple of months ago, I found a real performance issue in the engine, and I had to dedicate a couple of weekends to solving it. This article is a journal of how I did it.

The Sponza map. The engine renders all static game objects (423 game objects), the text, and the small editor UI at about 43.6 FPS, or 22.92 ms per frame.

Before jumping to the topic, I’d like to explain a bit about how the engine handles update and rendering.

The Engine Overview

The engine works in a very straightforward way. Two threads are running. The first one is the Update or Main thread. This thread is in charge of updating all the objects in the game world and gathering all the information for rendering. All the rendering info is packed into a big data list called the Draw List. The second thread is the render thread. This thread processes the Draw List, runs the rendering logic, and tells the GPU to draw and swap the GPU buffer.

In general, these two threads work side by side. The render thread mostly runs while the main thread is updating the objects. When the main thread finishes the update, it needs to wait for the render thread to finish processing the Draw List before gathering the rendering information. During the gathering, the render thread is idle, waiting for the main thread to finish. When the main thread is done, the render thread does its job again. This is the initial implementation of multi-threading in Zooid Engine.

The initial design of multi-threading in Zooid Engine.

Multiple Draw Lists

Now, the initial design has one major problem, and it’s easy to see. What happens if the render thread takes longer to process the draw list than the update section takes in the main thread? The main thread will be idle between the update and the gather render process, which causes the frame update to take longer than it should.

Draw Job takes longer than the main thread update

The best solution I could think of for this problem is to have multiple draw lists, so the main and render threads switch between them. This way the gather render process doesn’t need to wait for the render thread to finish its job: it gathers all the information into the draw list that the render thread is not using. In the real implementation, each thread flip-flops between 2 draw lists.

Draw Job will not make the main thread idle/wait because the two threads use different draw lists during the same frame.
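The flip-flop between two draw lists can be sketched roughly as below. This is a minimal sketch and not the engine's actual code: `DrawItem`, `DrawList`, and `DoubleBufferedDrawList` are hypothetical names, and the synchronization is reduced to one mutex and one condition variable. The main thread only blocks in `submit()` if the render thread is still busy with the previously submitted list.

```cpp
#include <condition_variable>
#include <mutex>
#include <vector>

// Hypothetical draw data; a real draw list would carry much more.
struct DrawItem { int meshId; };
using DrawList = std::vector<DrawItem>;

class DoubleBufferedDrawList {
public:
    // Main thread: the list to fill while gathering render info.
    // Only the main thread changes writeIndex_, so no lock is needed here.
    DrawList& writeList() { return lists_[writeIndex_]; }

    // Main thread: hand the filled list to the render thread and flip.
    void submit() {
        std::unique_lock<std::mutex> lock(mutex_);
        // Wait only if the render thread still owns the other list.
        cv_.wait(lock, [this] { return !pending_ && !rendering_; });
        readIndex_ = writeIndex_;
        writeIndex_ = 1 - writeIndex_;
        lists_[writeIndex_].clear();  // reuse the list the renderer finished
        pending_ = true;
        cv_.notify_all();
    }

    // Render thread: block until a list has been submitted, then take it.
    const DrawList& acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return pending_; });
        pending_ = false;
        rendering_ = true;
        return lists_[readIndex_];
    }

    // Render thread: signal that drawing from the acquired list is done.
    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            rendering_ = false;
        }
        cv_.notify_all();
    }

private:
    DrawList lists_[2];
    int writeIndex_ = 0;
    int readIndex_ = 1;
    bool pending_ = false;    // a submitted list is waiting to be drawn
    bool rendering_ = false;  // the render thread is drawing a list
    std::mutex mutex_;
    std::condition_variable cv_;
};
```

With this shape, the gather step writes into `writeList()` while the render thread is still between `acquire()` and `release()` on the other list.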

Did this solve the issue? In a sense, yeah. The improvement was tiny, though, about 1-3 ms. Something else was dragging down the gather render performance.

Render Command List

At this point, I needed to do proper profiling of the engine, so I decided to use an external profiler. Creating an in-game profiler was an option, but it would take some time, and I would need to reduce the overhead of the in-game profiler itself. For the external profiler, I chose Optick because it’s straightforward to embed into the game: you compile its source with the project.

The multi draw list works like a charm based on the profiling result.

Notice that the Event_GATHER_RENDER will not wait for the Render thread to stop working.

I couldn’t take my eyes off Event_GATHER_RENDER. As the name states, this event gathers all the render information for the render thread. Why was this process taking so much time? So I broke down the timing of Event_GATHER_RENDER and found something interesting.

It turned out the time was spent updating the OpenGL buffer. The engine works with multiple threads, but an OpenGL context can only be current on one thread at a time. So, when the main thread needs to update the GL buffer, it has to grab the OpenGL context from the render thread, which means waiting until the render thread is done processing.

GLRenderer::AcquireRenderThreadOwnership was taking too long. This might cause the main thread to spin and wait for the GL context to be freed by the render thread.

One naive solution, I think, would be an additional context for the main thread. That could work, but I don’t think it’s a good idea. There might be a data race where we update a GPU resource in the main thread while the render thread is still processing the same buffer. And I’m not sure it would also work for other graphics APIs.

I wondered whether the resource update could be called in the render thread before the drawing job. That is exactly what I needed: to defer the call to the next render thread pass. So I came up with the render command list. This is not a GPU command list. It is basically a list of function pointers called in the render thread before the draw starts. With this, I don’t need to worry about a data race on the GPU resource, and I believe it is also safe for other graphics APIs.

“Commands” is a section where all commands in the command list are executed.
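A deferred command list of this kind can be sketched as follows. This is a minimal sketch under my own assumptions, not the engine's code: `RenderCommandList`, `enqueue`, and `executeAll` are hypothetical names, and it stores `std::function` objects rather than raw function pointers so that a command can capture the data it needs.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

class RenderCommandList {
public:
    using Command = std::function<void()>;

    // Any thread: queue work that must run on the thread owning the GPU context.
    void enqueue(Command command) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(command));
    }

    // Render thread: run everything queued so far, before drawing starts.
    // Returns the number of commands executed.
    std::size_t executeAll() {
        std::vector<Command> commands;
        {
            // Swap under the lock so commands run without holding it.
            std::lock_guard<std::mutex> lock(mutex_);
            commands.swap(pending_);
        }
        for (Command& command : commands) {
            command();
        }
        return commands.size();
    }

private:
    std::mutex mutex_;
    std::vector<Command> pending_;
};
```

The main thread would enqueue something like a buffer upload as a lambda, and the render thread would call `executeAll()` once at the start of its frame, so the GL context never has to change threads.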

I’m not sure how this context ownership works with other graphics APIs. (If you know whether the same applies to DirectX or Vulkan, please feel free to say so in the comments. That would help me when I implement other graphics APIs in the engine.)

Okay, did this solve the issue? Big time! The engine was now running at ~13 ms per frame, compared to ~22 ms before.

~12ms per frame after dealing with the threading issues

Zooid UI List and Array

Looking at the previous profile, I thought we could improve the engine further. The second thing I wanted to improve was ApplicationTick. What is ApplicationTick? By design, the game or application uses the engine as a library, and ApplicationTick is the part of the frame spent in the application’s own update. So far, I’ve been using one of my applications in the project, the Scene Editor, which draws a UI list of the objects in the world. I use ZooidUI, an immediate mode UI that I created and maintain.

Application-Engine Layers

I looked into the function breakdown of the Application Tick and found that text drawing took about 75% of the ~6ms application tick. At the time, I thought there were two possible issues: generating text was super expensive, or the application was drawing all the text in the list view whether it was hidden or not. My suspicion fell on the latter, so I ran another map with fewer game objects.

Functions profile during the Application Tick. DrawTextInRect took about 75% of the total time.

And sure enough, with fewer objects, the application tick took less than half a millisecond. It turned out that I was drawing each item in the list regardless of whether it was outside the list view. I fixed this quickly by checking the text box/rectangle against the list view rectangle. If there is no intersection, the UI doesn’t need to process the text.
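The intersection check can be sketched as below. This is a minimal sketch with hypothetical names (`Rect`, `intersects`, `countVisibleItems`) and an assumed layout of fixed-height rows. It counts the rows that would reach the text drawing call instead of actually drawing them.

```cpp
// Hypothetical axis-aligned rectangle: top-left corner plus size.
struct Rect {
    float x, y, width, height;
};

// True when the two rectangles overlap (touching edges do not count).
inline bool intersects(const Rect& a, const Rect& b) {
    return a.x < b.x + b.width && b.x < a.x + a.width &&
           a.y < b.y + b.height && b.y < a.y + a.height;
}

// Walk every row of a scrolling list and count the rows that overlap the
// list view. Rows outside the view are skipped before any text processing.
int countVisibleItems(const Rect& listView, float itemHeight, int itemCount,
                      float scrollOffset) {
    int visible = 0;
    for (int i = 0; i < itemCount; ++i) {
        const Rect item{listView.x,
                        listView.y + i * itemHeight - scrollOffset,
                        listView.width, itemHeight};
        if (!intersects(item, listView)) {
            continue;  // clipped: no need to generate or draw the text
        }
        ++visible;  // the real UI would call its text drawing function here
    }
    return visible;
}
```

The loop still visits every row, so for very long lists the first and last visible indices could be computed directly from the scroll offset. For a few hundred rows, the rectangle test alone removes most of the cost.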

After the changes, the Application Tick took ~1.5ms. That brought the total frame time down to ~9ms. Next, let’s see what else I could optimize here.

In the function breakdown below, I saw that the assignment operator took half of the application time. ZooidUI copies the array of component names from the scene into its internal array. The array reset its memory every time the assignment operator was called, regardless of whether the current capacity was enough. This caused the array to allocate a new memory chunk, initialize it with default values, and then assign the new values from the right-hand side array. Since the map has 423 objects, all of this work takes up time.

The texts outside the list rectangle are clipped. The DrawTextInRect function is taking less than 10% of the time. Now Array::operator= took half of the total time.

For this array problem, I wanted to remove the first two steps of copying an array: the allocation and the default initialization. In this case, it’s an easy change to make. I need to check whether the other array’s capacity is less than or equal to the current array’s capacity. If so, I don’t need to allocate a new memory block and initialize it with default values; I just copy the contents of the other array’s memory block into the current array’s block. In this case, it works like a charm.
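The capacity check can be sketched with a small stand-in container. This is a minimal sketch, not ZooidUI's actual `Array`: the class below is hypothetical, and it compares the source's element count against the destination's capacity, which is a slightly looser condition than comparing the two capacities.

```cpp
#include <algorithm>
#include <cstddef>

template <typename T>
class Array {
public:
    Array() = default;
    Array(const Array& other) { *this = other; }
    ~Array() { delete[] data_; }

    Array& operator=(const Array& other) {
        if (this == &other) return *this;
        if (other.size_ > capacity_) {
            // Not enough room: allocate once, sized to the source.
            T* fresh = new T[other.size_];
            delete[] data_;
            data_ = fresh;
            capacity_ = other.size_;
        }
        // Enough room (or freshly allocated): overwrite elements in place,
        // with no reallocation and no default-initialization pass.
        std::copy(other.data_, other.data_ + other.size_, data_);
        size_ = other.size_;
        return *this;
    }

    void reserve(std::size_t newCapacity) {
        if (newCapacity <= capacity_) return;
        T* fresh = new T[newCapacity];
        std::copy(data_, data_ + size_, fresh);
        delete[] data_;
        data_ = fresh;
        capacity_ = newCapacity;
    }

    void push_back(const T& value) {
        if (size_ == capacity_) reserve(capacity_ == 0 ? 4 : capacity_ * 2);
        data_[size_++] = value;
    }

    std::size_t size() const { return size_; }
    std::size_t capacity() const { return capacity_; }
    const T* data() const { return data_; }
    const T& operator[](std::size_t i) const { return data_[i]; }

private:
    T* data_ = nullptr;
    std::size_t size_ = 0;
    std::size_t capacity_ = 0;
};
```

When the UI copies the same list of names every frame, the destination already has enough capacity after the first frame, so every later assignment takes the in-place path.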

With the solution above, the application tick took only ~0.5ms. This brings the total frame update time to ~8ms.

With the changes to Array::operator=, Application Tick took about 0.5ms.
With UI and Array optimization.

Cache Object Bounding Information

Checking the current state of the frame time and the breakdown so far, I could see that Event_GATHER_SHADOW_LIST was taking about 25% of the total time.

Event_GATHER_SHADOW_LIST took more than 25% of the total time.

Event_GATHER_SHADOW_LIST is the event that gathers all the shadow information for each light. Here the information comes from the directional light, for which I implemented a Cascaded Shadow Map. Most of the time, this covers all objects in the scene.

The current process of gathering information for the Cascaded Shadow Map checks each object’s bounding sphere to determine which cascade level of the shadow area it belongs to, then saves that information for shadow map generation in the render thread.
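One common way to do this kind of per-object check is to compare the sphere's depth range with each cascade's depth range. The sketch below assumes that approach, which may differ from what the engine actually does. `CascadeRange` and `cascadesTouched` are hypothetical names, and the depths are assumed to be distances along the camera's view direction.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical depth interval covered by one shadow cascade.
struct CascadeRange {
    float nearDepth;
    float farDepth;
};

// Returns a bitmask with bit i set when a sphere at the given view depth
// overlaps cascade i. A sphere on a split boundary lands in both cascades.
std::uint32_t cascadesTouched(float centerDepth, float radius,
                              const std::vector<CascadeRange>& cascades) {
    std::uint32_t mask = 0;
    for (std::size_t i = 0; i < cascades.size() && i < 32; ++i) {
        const bool overlaps = centerDepth + radius >= cascades[i].nearDepth &&
                              centerDepth - radius <= cascades[i].farDepth;
        if (overlaps) {
            mask |= (1u << i);
        }
    }
    return mask;
}
```

The loop runs once per object per frame, so anything expensive feeding it, such as computing the bounding sphere, is multiplied by the object count.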

Looking at the event’s breakdown, the checking section takes most of the total time – calculating the bounding sphere and determining which cascade shadow level. Calculating the bounding sphere (getBoundingSphere() function in the image below) took about 30% of the event time. It’s about 3ms of the total frame time.

getBoundingSphere() took about 30% of the total time in the Event_GATHER_SHADOW_LIST.

The same issue also happened in Event_GATHER_RENDER. getBoundingSphere() took about 20% of the event time and about 2ms of the total frame time.

getBoundingSphere() also took about 20% of the total time in the Event_GATHER_RENDER

I found that getBoundingSphere() was doing a somewhat expensive calculation: extracting the scale from the object’s matrix. I tried to optimize it by rearranging the calculation and using SIMD. It didn’t reduce the frame time by much, only about 1ms.
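To show why this is costly per object, here is a scalar sketch of extracting the scale from a transform matrix. It is a minimal sketch and not the engine's code: `Mat4`, `axisLength`, and `maxScale` are hypothetical names, and it assumes a column-vector convention with storage indexed as `m[column][row]`, where each basis column's length is the scale along that axis.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical 4x4 transform, indexed as m[column][row].
struct Mat4 {
    float m[4][4];
};

// Length of one basis column of the upper-left 3x3 block.
inline float axisLength(const Mat4& mat, int column) {
    const float* c = mat.m[column];
    return std::sqrt(c[0] * c[0] + c[1] * c[1] + c[2] * c[2]);
}

// Largest per-axis scale, used to grow a local-space sphere radius so the
// world-space sphere still encloses the scaled object.
inline float maxScale(const Mat4& mat) {
    return std::max({axisLength(mat, 0), axisLength(mat, 1), axisLength(mat, 2)});
}
```

That is three square roots per object per call, which is cheap once but adds up when it is repeated for hundreds of objects in several events every frame.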

The scene contains hundreds of static objects. Since the static object doesn’t move and the matrix is always the same, caching the bounding sphere makes a lot of sense.

Another piece of bounding information I cache is the Axis Aligned Bounding Box. This affects Event_GATHER_BOUND, which calculates the entire scene’s bounds for computing the sections of the Cascaded Shadow Map.

Saving all the bounding information removes the need to recalculate it (bounding sphere and AABB) every frame; it is only recalculated when the transformation changes. However, the first frame will take longer, since the initial bounding information needs to be computed.
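The caching described above is essentially a dirty flag on the transform. The sketch below is a minimal illustration with hypothetical types (`Vec3`, `Sphere`, `Transform`, `SceneObject`), and it simplifies the transform to translation plus per-axis scale. The `recomputeCount()` counter exists only to make the effect observable.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Sphere { Vec3 center; float radius; };

// Hypothetical simplified transform: translation and per-axis scale only.
struct Transform {
    Vec3 position{0.0f, 0.0f, 0.0f};
    Vec3 scale{1.0f, 1.0f, 1.0f};
};

class SceneObject {
public:
    explicit SceneObject(const Sphere& localBounds) : localBounds_(localBounds) {}

    // Any transform change invalidates the cached world-space bounds.
    void setTransform(const Transform& transform) {
        transform_ = transform;
        boundsDirty_ = true;
    }

    // Recomputes only when the transform changed since the last call.
    const Sphere& getBoundingSphere() {
        if (boundsDirty_) {
            const Vec3& s = transform_.scale;
            const float maxScale =
                std::max({std::fabs(s.x), std::fabs(s.y), std::fabs(s.z)});
            worldBounds_.center = {
                transform_.position.x + localBounds_.center.x * s.x,
                transform_.position.y + localBounds_.center.y * s.y,
                transform_.position.z + localBounds_.center.z * s.z};
            worldBounds_.radius = localBounds_.radius * maxScale;
            boundsDirty_ = false;
            ++recomputeCount_;
        }
        return worldBounds_;
    }

    int recomputeCount() const { return recomputeCount_; }

private:
    Sphere localBounds_;
    Sphere worldBounds_{{0.0f, 0.0f, 0.0f}, 0.0f};
    Transform transform_;
    bool boundsDirty_ = true;  // the first query always computes
    int recomputeCount_ = 0;
};
```

For a static object, `setTransform` is called once at load time, so every later frame returns the cached sphere without touching the matrix math.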

After caching all the information mentioned above, the average frame time is now ~4ms.

What does the profile look like now? It’s not bad, I believe. All the events’ times are reduced by about 1.4ms. Event_GATHER_SHADOW_LIST still takes a big portion of the time. This is expected, since the event iterates over all the objects in the scene. This needs to be optimized in the future, e.g., by having pre-calculated shadow maps for static objects and only gathering a shadow list for dynamic objects.

Event_GATHER_RENDER is reduced to ~0.6ms, Event_GATHER_SHADOW_LIST is reduced to ~1.7ms, and Event_GATHER_BOUND to ~0.3ms

Notes and Conclusion

The frame times in this post are all relative to the system I used to run this test: an Intel Core i9-9900K CPU and an RTX 2080 Ti 11GB GPU. The times may vary on a different system, but the optimizations above should give the same kind of improvement.

Here are some notes on what I learned from doing the first profiling and optimization pass on the Zooid Engine.

  • Most of the profiling here targets the system design of the engine. It made me tweak the systems to perform better.
  • Proper profiling helps to better understand the difference between how the engine was designed and how the actual implementation behaves.
  • Proper profiling (and profiling tools) makes it easier to find all the things that impact the overall performance/frame time.
  • Finding and removing unnecessary calculations should be done first before going deeper into optimizing the code/calculations. Caching some of the information might be a good solution if the same result is produced every frame.

Hopefully, this post will give some insight when designing and implementing a game engine from scratch.
