Lots of things have been done with pkcs11mod since my last post on the subject. This is going to be a bit of a “grab bag” post without much structure (because that’s what reality looks like here, and this post is intended to reflect reality).

I noticed that the Chromium tests were hanging. After a bit of investigation, I found that the flags I was passing to Chromium to make it run in Docker weren’t quite sufficient. The alpine-chrome project has a more thorough set of flags, and applying these to pkcs11mod’s test scripts got the tests passing again.

I implemented the GenerateKeyPair function in p11mod, which got the test-rsapub OpenDNSSEC test passing. Thanks again to OpenDNSSEC for the test!

I noticed that trying to use either of the proxy modules with a nonexistent target module path resulted in a Go panic. This is not good, since a panic will crash the host application. Now this is mitigated by both pkcs11mod and p11mod checking for initialization errors, and returning CKR_GENERAL_ERROR, which is used by the PKCS#11 specification to indicate an error that is probably not recoverable. The application can handle this however it likes, but whatever it does will probably be better than crashing.

When trying to test pkcs11mod on Windows, I noticed that it failed very early. So early that the Go runtime never even tried to initialize. By comparing pkcs11proxy.dll to nssckbi.dll in Dependencies, I found that pkcs11mod was failing to export the C_GetFunctionList function, which caused applications to conclude that pkcs11proxy wasn’t a PKCS#11 module. That explained the symptoms. But wait, didn’t ncp11 work on Windows back when it was first released? Yes. Some digging found that this regression was caused by the fix for Go issue 30674. (Apparently, Go used to export all functions, and Windows has a tendency to crash if you export too many functions, so Go now requires developers to explicitly designate which functions will be exported.) I fixed this via the __declspec(dllexport) attribute. And now, pkcs11proxy is detected as a PKCS#11 module on Windows.

I fixed the documentation (and Cirrus scripts) to generate libraries with the idiomatic pkcs11proxy.dll filename on Windows targets, instead of the libpkcs11proxy.so filename that’s supposed to be only a Linux thing. Not a big deal, but correct documentation is important.

I added a trace mode (hidden behind an environment variable) to both pkcs11mod and p11mod, which makes it easier to follow what PKCS#11 calls are being issued. Be warned, even running a proxy module for a minute or so can produce a trace log of over a megabyte, so this should not be done routinely. Not all functions are traced yet; I’d happily accept PR’s to add more detailed tracing.

I added a test for tstclnt on Debian, and noticed that it was failing. Further investigation showed that tstclnt was failing to read the trust bits from the CKBI module when pkcs11proxy was in the middle. This was odd, because I was already testing tstclnt on Fedora, and it worked fine there. The obvious difference was that Fedora uses Red Hat’s p11-kit CKBI module, while Debian uses Mozilla’s CKBI module. Fedora’s replacement is supposed to be compatible with Mozilla’s version, but clearly the Red Hat CKBI module was doing something different from Mozilla’s, and my code was expecting the behavior of Red Hat’s module. Using the trace mode (see above paragraph), I figured out what was going on. NSS queries for 4 different EKU trust attributes: TLS server authentication, TLS client authentication, email, and code signing. (Clearly NSS could get away with only querying for TLS server authentication in this case, since tstclnt isn’t doing anything with the other three, but they don’t do that optimization here. Which is probably okay, since most of the time, the extra queries won’t harm performace.) Red Hat’s CKBI module contains trust bits for all 4 of those EKU attributes. Mozilla’s CKBI module, however, omits the TLS client authentication attribute. This makes sense, I guess, since the Mozila root CA program is not intended to be used for TLS client authentication, while Red Hat’s CKBI module incorporates certificates added by the user, which might be used for that EKU. Here’s where the problem happens: NSS queries for all 4 trust attributes in a single GetAttributeValue function call. Per the PKCS#11 specification, if any of the attributes in a oneshot query are missing, some extra error bits are returned, so that the caller knows which attributes existed and which didn’t. However, Miek’s pkcs11 library doesn’t handle this case; it instead just returns an error, causing pkcs11mod to lose the trust bits attributes that did exist. Patching Miek’s library seemed like an overkill approach, so I instead rigged pkcs11mod to detect the CKR_ATTRIBUTE_SENSITIVE and CKR_ATTRIBUTE_TYPE_INVALID error codes, and retry the attributes one at a time to figure out which ones we can return to the application. This isn’t ideal performance-wise, since we end up retrieving the attributes twice, but the difference should be negligible in the real world. And with that, tstclnt on Debian passed the tests.

Upstream p11 v1.1.0 added some named errors, which allowed me to avoid recognizing those errors via a brittle string comparison. That makes my code cleaner, which is good.

I added tests for Firefox and Chromium on Debian, which passed immediately (presumably because of the above tstclnt fixes).

I added tests for Firefox (both ESR and Nightly) on Windows. These mostly passed, but I did need to fix the logfile output paths, since $HOME is not a thing on Windows. The logfile is now saved to os.UserConfigDir(), which behaves sanely on all major OS’s. It falls back to the current working directory if the user config directory isn’t accessible for some reason (e.g. sandboxing).

I then added some more Firefox variants on Windows: Firefox Rapid Release, Firefox Beta, Firefox Developer Edition, LibreWolf, and GNU IceCat. They all passed. I also tried to add Waterfox and Pale Moon, but Waterfox hung on the install step in my Cirrus VM, and Pale Moon doesn’t support the headless screenshot feature, which my tests rely on. My guess is that both of them would work fine outside of my test environment.

Next, I added tests for certutil (the NSS tool, not the CryptoAPI tool of the same name). Specifically, I dumped the CKBI certificate list, with trust bits, via pkcs11proxy and p11proxy, and compared the output to what I got without the proxy. Surprisingly, this turned out to be a giant gopherhole [1]. Dumping the certificate list caused certutil to segfault halfway through dumping when proxied – but only on Windows, and only via p11proxy (pkcs11proxy didn’t segfault). Inspecting logs revealed that certutil was calling the CloseAllSessions function, which p11mod didn’t implement. I implemented that function (it was pretty simple), but this didn’t fix the segfault. I enabled the trace mode that I had previously added, and found that it was segfaulting after the Finalize function had returned, which seemed odd, since that meant my code had already finished running by the point of the segfault. After some confusion, I found Go issue 11100, which described my problem: certutil had tried to unload p11proxy.dll, which yanked some data structures out from under the Go runtime, causing a segfault. There is a way to instruct Windows (via the GET_MODULE_HANDLE_EX_FLAG_PIN argument to the GetModuleHandleEx Windows API function) not to ever unload a DLL file until the application exits. Conveniently, the golang.org/x/sys/windows package (nicknamed goxsys/windows by the Tor folks [2]) has support for this. I tried it out, and sure enough, that fixed the segfault.

However, this introduced a new issue. Now, certutil wouldn’t segfault, but it would hang after outputting the entire certificate list, until I hit Ctrl-C in the command prompt. Go issue 11100 had some mention of this occurring in Unity: apparently some applications wait for all background threads to exit before closing the program, and the Go runtime (which never terminates) is being recognized as such a background thread. In theory, this is an application bug, but pkcs11proxy is supposed to leave the behavior of buggy software unchanged, so this was something I couldn’t just ignore. Alas, I didn’t find a clean way to solve this. So I did the dumb approach: when the Finalize function is called, pkcs11mod starts a 5-second timer in a goroutine, which then calls os.Exit(0). Sometimes the dumb approach is the best option that presents itself.

Unfortunately, this meant that now, if an application tries to unload pkcs11mod, the application will be forcibly closed 5 seconds later. This works fine for applications like certutil which will exit right after that point anyway. But in Firefox, you probably don’t want your browser to kill itself 5 seconds after you click the Unload button in the Security Devices dialog. So, what to do…?

One does not simply avoid user-agent sniffing.  Even in PKCS#11 modules.

Yep, once again, sometimes the dumb approach is the best option that presents itself. We simply check the process name that pkcs11mod is loaded into. If it’s certutil.exe, we start the 5-second exit timer; otherwise, we don’t. The number of applications that exhibit this bug is, I’m guessing, small enough that we can simply enumerate them.

This got the certutil tests passing. However, now the Firefox tests on Windows were failing. Inspection of Firefox stderr indicated some error messages about a tab subprocess failing to start. A DuckDuckGo search for these errors indicated a Bugzilla thread where the reporter said that disabling the Electrolysis sandbox fixed a crash. I added some Cirrus tasks that ran my usual Firefox tests but with Electrolysis disabled. Sure enough, those tests passed. I initially figured that my usage of GetModuleHandleEx, or perhaps my attempt to read the process name, had violated some kind of sandbox policy, but after bisecting my recent changes, I came to a surprising conclusion: the thing that crashes Electrolysis is simply the act of importing the goxsys/windows package. I’m guessing there’s some kind of init() function inside that package that tries to do something that the sandbox doesn’t like, and panics when it gets a permission failure [3]. After grepping GitHub hoping someone had already found a solution to this, I discovered that the vst2 project had already implemented a similar solution to mine, but using the lower-level syscall package instead of goxsys/windows. I copied their code nearly verbatim, and it worked out of the box: the DLL was pinned properly in certutil, and Electrolysis worked fine. Thus bringing me out of the certutil gopherhole.

I then relaxed a bit by adding macOS builds to Cirrus, and confirmed that pkcs11mod builds without errors on macOS. (I don’t know if the binaries actually work, but they’re on Cirrus now, you can try them yourself and let me know.)

Finally, after discussion with the other pkcs11mod devs (Aerth and Bernard), we unanimously agreed to relicense pkcs11mod and ncp11 from LGPLv3+ to LGPLv2.1+. This makes it unambiguously legal to use pkcs11mod with GPLv2-licensed applications. The failure to do this initially was an oversight on our end; it was kind of ridiculous that we permitted pkcs11mod to be used with non-freedom software but not GPLv2 software.

We’re not yet at the point where I’d consider pkcs11mod to be well-tested enough to be used in production by people other than its 3 authors, but we are getting a lot closer to that point.

This work was funded by NLnet Foundation’s Internet Hardening Fund.

[1] Rabbitholes are a thing in Plan 9, not Go.

[2] Insert Andrew “rasengan” Lee joke here.

[3] The “goxfail” jokes write themselves.