Best Practices for Wrapping C++ Libraries with CGO
Background
Recently, our business needed to reuse a client SDK written in C++. To let our team's primary language, Go, integrate seamlessly with the SDK, we used cgo bridging to wrap the C++11 SDK into a production-ready Go SDK. After reviewing most of the Chinese and English material about cgo online, I found there are still many details to watch out for, especially when calling C++ libraries from Go; most Chinese material only briefly covers wrapping simple C++ STL functions as examples. This article summarizes a series of best-practice tips for wrapping complex C++ libraries, based on that work, in the hope of filling the gap in the available material.
Use CGO Only When Necessary
This is listed as the first rule as a warning, because many cgo beginners fall into a misconception about execution efficiency: that replacing regular Go functions with cgo calls will significantly improve single-core CPU performance. This actually needs case-by-case analysis. According to the source of cgocall in the runtime, each cgo call essentially switches function call stacks (parameters, PC and SP must be set up for the external C code), which is extra overhead compared to a native Go call. For a simple computation, a single cgo call can take hundreds of times longer than the equivalent native Go call.
// Call from Go to C.
func cgocall(fn, arg unsafe.Pointer) int32 {
	// ...
	// Reset traceback.
	mp.cgoCallers[0] = 0

	// Enter a system call: the scheduler may hand this M's P to another
	// system thread so other goroutines keep running during the C call
	entersyscall()

	osPreemptExtEnter(mp)
	mp.incgo = true
	errno := asmcgocall(fn, arg)
	// ...
	osPreemptExtExit(mp)
	exitsyscall()

	// ...
	// There may be C-to-Go callbacks; keep these objects alive so the GC
	// does not collect them before it is certain they can be released
	KeepAlive(fn)
	KeepAlive(arg)
	KeepAlive(mp)
	return errno
}

So how do you determine when to use cgo? My personal experience is to look at the cost-benefit ratio. If cgo lets you reuse a large amount of existing C code and reduces later development and maintenance costs, or if the function is so CPU-intensive that a C library more than compensates for cgo's side effects, then cgo is a good choice. Note, however, that if a third-party library already reaches 80-90% of the C library's performance using pure Go and Plan 9 assembly, bridging through cgo is not recommended, especially when users will call the library heavily and concurrently from goroutines. The scalability of Go's user-space scheduling, which multiplexes goroutines onto reused system threads, is worth far more than slightly faster C-level execution that pays for a thread-level kernel switch on every cgo call.
C ABI Interface Encapsulation Follows Minimalism Principle
If we wrap a C++ library by hand, the actual call chain is Go runtime -> cgocall -> C ABI -> C++ function. The ABI (Application Binary Interface) is the binary bridging contract; as long as its specifications and conventions are followed, external languages can call C interfaces reliably. Therefore, before implementing the bridge, it's recommended to start from the Go layer: define structures and methods that follow Go semantics, then map each Go object to a corresponding C structure or C++ object and each Go method to a corresponding C function or C++ method. This determines the bridging C header file. Once the function signatures of this intermediate layer are fixed, implement the header's logic in C++, and finally compile and link the C++ implementation against the C header into a whole. This is the underlying mechanism by which Go can call C++.
Don’t Use SWIG to Generate C Bridging Code
Not wanting to maintain C interfaces by hand, some people might reach for the common approach used when calling C++ .so libraries from Python: having SWIG automatically generate stub functions for the C++ library. We actually had the same idea when we started writing the bridging code, but ultimately abandoned it. A key reason is that SWIG's Go generator is not actively maintained by the community and doesn't support many C++11 features, such as smart pointers. Moreover, the generated bridging code is unreadable, which increases the mental burden of later maintenance and troubleshooting, and may even pose memory-leak risks, so manual encapsulation is recommended in production environments.
Batching Your Calls
Since each cgo call has certain additional overhead, this call overhead has a linear relationship with the number of calls. A natural optimization approach is to merge multiple cgo calls into one, thereby reducing the additional overhead of cgo calls while processing the same amount of content. After changing N calls to 1 call, the additional overhead of the call itself becomes approximately 1/Nth of the original within a certain period, effectively reducing overall call latency during frequent calls. For example, the C++ calls I wrapped involve extensive string processing, parsing, compression, serialization and other operations internally, so processing an 8MB large character slice typically has lower overall processing latency than making 8 calls of 1MB each.
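The effect of batching can be modeled in pure Go. In the sketch below, processViaBridge is a hypothetical stand-in for a real cgo call (in an actual wrapper each entry would be one cgo transition); the point is only that the per-crossing overhead is paid once per Flush instead of once per chunk:

```go
package main

import (
	"bytes"
	"fmt"
)

// bridgeCalls counts how many times the (stubbed) bridge is entered;
// in a real wrapper each entry would be one cgo transition.
var bridgeCalls int

// processViaBridge stands in for a cgo call such as C.process(ptr, len).
func processViaBridge(p []byte) int {
	bridgeCalls++
	return len(p)
}

// Batcher accumulates payloads and crosses the bridge once on Flush.
type Batcher struct {
	buf bytes.Buffer
}

func (b *Batcher) Add(p []byte) {
	b.buf.Write(p)
}

func (b *Batcher) Flush() int {
	defer b.buf.Reset()
	return processViaBridge(b.buf.Bytes())
}

func main() {
	// Naive shape: 8 separate bridge crossings of 1MB each.
	for i := 0; i < 8; i++ {
		processViaBridge(make([]byte, 1<<20))
	}
	naive := bridgeCalls

	// Batched shape: the same 8MB of data, a single crossing.
	bridgeCalls = 0
	var b Batcher
	for i := 0; i < 8; i++ {
		b.Add(make([]byte, 1<<20))
	}
	n := b.Flush()
	fmt.Println(naive, bridgeCalls, n) // prints: 8 1 8388608
}
```

Batching trades a little latency for the first chunk against amortizing the fixed transition cost, which pays off when calls are frequent.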
Reduce Unnecessary Data Copying
cgo provides some common helper functions by default to convert built-in types in Go to corresponding data types in C. When ensuring input parameters are read-only internally, zero-copy tricks can often be applied.
For example, passing read-only byte slices without data copying:
b := buf.Bytes()
rc := C.the_function((*C.char)(unsafe.Pointer(&b[0])), C.int(buf.Len()))

This is effectively equivalent to:
str := buf.String()
p := C.CString(str)
defer C.free(unsafe.Pointer(p))
rc = C.the_function(p, C.int(len(str)))

The first approach clearly avoids the extra copy, which matters for large slices. Note that it requires the slice to be non-empty (&b[0] panics on an empty slice), and the C side must not retain the pointer after the call returns, per cgo's pointer-passing rules.
Be Careful About Memory Leaks
Cross-language calls with copy-by-value parameter passing are generally safer practices. However, when garbage-collected Go language is mixed with manually memory-managed C language and RAII C++, special attention must be paid to prevent heap memory leaks in C or C++ layers. Go calling C interfaces generally follows the principle of whoever allocates, whoever is responsible for releasing.
For example, if Go allocates C heap memory for char* and copies the content of string variables to the memory space corresponding to C’s char*, then Go layer should actively release the memory.
package main

/*
#include <stdlib.h>
char* my_c_func(char* s); // declared in the bridging header; implemented by the C/C++ layer
*/
import "C"
import (
	"errors"
	"unsafe"
)

func HelloWorld() error {
	cs := C.CString("Hello World!")
	defer C.free(unsafe.Pointer(cs)) // the Go layer explicitly calls C's free to release what it allocated
	// cs can now be used safely in cgo calls;
	// the bridging layer doesn't need to care about releasing the parameter
	cerr := C.my_c_func(cs)
	if cerr != nil {
		// a non-NULL C string return value is released by the receiver
		defer C.free(unsafe.Pointer(cerr))
		return errors.New(C.GoString(cerr))
	}
	return nil
}

By analogy, when passing parameters down through the layers, make sure the receiver never needs to care about releasing input-parameter memory; this avoids omissions or double frees that could coredump the entire program.
Note that return values flow the other way: the callee (the C/C++ layer) allocates the returned memory, and the receiver (the Go layer) is responsible for actively releasing it, as with the error string above.
Correctly Handle Smart Pointers
In many C++ libraries, especially those written in C++11, extensive use of smart pointer features is made. Many third-party libraries habitually wrap newly created object pointers in shared_ptr for automatic reference counting when creating objects, so that destructors can be correctly called for automatic destruction and heap memory release after objects leave their scope. However, smart pointer reference counting is almost useless in cgo, as C++ layer cannot perceive Go layer reference situations, let alone correctly count and automatically destroy objects. Therefore, to correctly release object resources when Go structures hold C++ objects, the best approach is to implement smart pointer reference counting in Go layer, and finally have Go layer actively release objects.
If C++ objects and Go objects have a one-to-one mapping, we can elegantly implement upper-layer transparent C++ object reference counting at the application layer. Because the number of references to Go objects holding raw C++ object pointers equals the number of C++ object references, runtime.SetFinalizer comes in handy perfectly. It acts as a hook function that gets called when Go objects are GC’d. Using this hook, we can easily achieve safe release of underlying held C++ objects when Go objects are released.
package myClient

/*
#include <stdlib.h>
void* client_create(const char* a); // the C++ layer news the object and returns its pointer to the upper layer
void client_release(void* obj);     // the C++ layer deletes the object
*/
import "C"
import (
	"runtime"
	"unsafe"
)

type Client struct {
	_ptr unsafe.Pointer
}

func New(a string) *Client {
	_a := C.CString(a)
	defer C.free(unsafe.Pointer(_a))
	cli := &Client{
		_ptr: C.client_create(_a),
	}
	runtime.SetFinalizer(cli, func(c *Client) {
		// Ensure the C++ object is destroyed when the Client object is GC'd
		c.free()
	})
	return cli
}

func (c *Client) free() {
	C.client_release(c._ptr)
}

Note that C has no concept of classes and objects, so to keep the ABI stable during bridging, the header file should uniformly expose void* opaque pointers; only in the .cpp implementation are they cast back to the corresponding C++ object pointers for further use.
Make Good Use of Exceptions to Improve Code Robustness
Since the underlying implementation is C++, we can wrap calls in try-catch statements to capture and handle exceptions, or return the error information to the upper-level caller if it cannot be handled locally.
func (c *Client) DoSomething() error {
	cerr := C.do_something(c._ptr)
	if cerr != nil {
		defer C.free(unsafe.Pointer(cerr))
		return errors.New(C.GoString(cerr))
	}
	return nil
}

The corresponding C++ implementation of the bridging function:

#include <string.h>   // strdup
#include <exception>

#define CAST_T(_T) reinterpret_cast<_T>

const char* do_something(void* obj) {
	try {
		CAST_T(CppClient*)(obj)->Do();
	}
	catch (MyException& e) {
		// handle this known failure in place
		// ...
		return NULL;
	}
	catch (const std::exception& e) {
		// unhandled: return the message to the Go caller, which frees it
		return strdup(e.what());
	}
	return NULL;
}

CGO Compilation Engineering
Usually our C++ SDK is maintained in a separate code repository, so static and dynamic libraries that cgo compilation depends on are maintained in a separate branch of the C++ library, ensuring that main library upgrades don’t affect maintenance of cgo bridging cpp code, while also easily obtaining subsequent updates to the main library.
The general directory structure of the Go SDK package is as follows:
➜ myclient tree -h
[] .
├── [] lib
├── [] client.go
├── [] client_mock.go
└── [] wrapper.h

For example, if the wrapped Go SDK package is named myclient: the Go-layer objects and cgo call wrappers live in client.go, the C ABI header definitions for bridging go in wrapper.h, and the lib directory stores the static and dynamic libraries needed to compile and link against wrapper.h. These dependencies are built from the dedicated branch of the C++ library and copied in ahead of time. The project layout stays clear and natural, and since the .cpp implementation of wrapper.h is maintained in the C++ library's branch, the C-defined interface effectively decouples the service's callers and providers.
And client_mock.go uses conditional compilation to support local compilation and debugging on macOS, because introducing cgo can break cross-platform builds and debugging convenience. Conditional compilation mitigates, to some extent, the problem of the entire binary failing to compile because of platform barriers.
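A sketch of what that conditional-compilation split might look like; the build tags and the mock body are illustrative, not the project's actual files:

```go
// client.go — the real cgo-backed implementation, only built where cgo is enabled.
//
//go:build cgo

package myclient

// #include "wrapper.h"
// ... import "C" and the real bridging calls live here ...

// client_mock.go — a pure-Go stand-in so the package still compiles and
// tests run locally (e.g. on a Mac without the C++ toolchain and libraries).
//
//go:build !cgo

package myclient

func (c *Client) DoSomething() error { return nil } // mocked behavior
```

With this split, `CGO_ENABLED=0 go build` produces a binary against the mock, while production builds on the target platform link the real bridge.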
Debugging and pprof Optimization
This is probably the point cgo users complain about most: currently there isn't a good way to debug cgo programs. For profiling CPU usage, generating flame graphs with perf still works for analysis, and compiling the C/C++ side with a low GCC optimization level preserves meaningful call-stack information.
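For example, the workflow might look roughly like this (the flags, binary name, and pid are illustrative):

```shell
# Build the C/C++ side with debug info and a low optimization level so
# perf can resolve symbols and keep useful stacks (flags illustrative):
CGO_CFLAGS="-g -O1" CGO_CXXFLAGS="-g -O1" go build -o app .

# Sample on-CPU stacks of the running process, then inspect the hot paths:
perf record -F 99 -g -p <pid> -- sleep 30
perf report
```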
If memory leaks occur within cgo calls, it becomes more troublesome. For debugging cgo programs or memory analysis, I haven’t found good methods myself, and welcome everyone’s suggestions.
Actual Pitfalls Encountered
When running the encapsulated Go SDK in the production environment, we found that all CPU cores were abnormally saturated. Through Go’s CPU pprof, we eventually pinpointed to a single heavy computation cgo call, where approximately hundreds of goroutines were concurrently calling it continuously.
go func() {
	C.heavy_process_func()
}()

However, when analyzing the process CPU with perf top, we discovered that the operation consuming the most CPU time was actually _raw_spin_lock in the kernel. Nothing in the business logic of heavy_process_func used spin locks, and the goroutines calling it shared no state either. Where did the lock contention come from?
Finally, by actively outputting all exception logs from the C++ layer, the truth came to light: it turned out that when C++ throws large amounts of exceptions concurrently across multiple threads, stack unwinding competes for global locks in glibc. Therefore, after properly converging exceptions, the problem was finally resolved. Newer versions of glibc have fixed the issue of multi-threaded exception throwing competing for global locks. Details can be seen in the corresponding issue and patch.
References

Bug - Concurrently throwing exceptions is not scalable