Code Monkey home page Code Monkey logo

Comments (96)

krackers avatar krackers commented on August 21, 2024 2

@Wowfunhappy sorry for the late response.

I'm just trying to figure out where that wait_result value could be coming from. I find it so weird that XNU is returning a value that didn't exist yet

Note that even in the older xnu version, THREAD_RESTART was still a valid defined value for wait_result_t. In fact, you can see that it is returned in wait_queue_assert_wait64 of osfmk/kern/wait_queue.c when the queue becomes invalid. The difference is that in this older kernel version, kqueue scan did not properly handle this, and just panicked instead of returning EBADF.

I agree that this seems way too obvious an omission to be a bug. But I can't think of any other explanation at the moment (that said, my knowledge here is very slim). I think the first step might be to see if we can create a scenario that can reliably reproduce this. Based on my current understanding, it seems like the conditions for triggering would look like:

  • Have one process call kqueue to create a new fd for event queue, and then call kevent to wait upon it

  • While that happens, another process closes the fd

  • Based on what's described above, this would generate a THREAD_RESTART condition in the wait queue code which would bubble up to kqueue_scan and cause the panic.

But this seems way too obvious and easy to trigger, so I'm still skeptical that this is actually what's happening.

As it turns out, compiling XNU is shockingly easy and the build itself takes only a few minutes to complete

I see you're following the "Kernel Shaman's" build instructions? I'm not quite sure why there would be a difference in size assuming that both are release builds, but it's possible that apple internal builds have some other ifdefs set during compile, or the kernel version is different? This wouldn't explain why the hackintosh kernels match in size though...

That's also why I'm personally partial to the "hotpatch via kext" approach. That way you introduce only the minimal difference necessary.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024 1

...well, for what it's worth, I decided to build a custom kernel. As it turns out, compiling XNU is shockingly easy and the build itself takes only a few minutes to complete. The only tricky part was that I had to install Xcode 5.0.2; later versions did not appear to work.

The only source change is that I added the THREAD_RESTART case to kqueue_scan_continue, which sets error = EBADF as in newer versions of XNU.

I am (a bit nervously) going to try using this on my main machine. A half hour in, nothing catastrophic has happened so far.

The kernel is attached in case anyone else on Mavericks wants to try it, but I am not necessarily recommending that anyone do so. If you do use it, make sure at minimum to have a secondary partition available to restore your original kernel should the need arise. Remember that you will need to clear the kernel cache on your first boot with the new kernel.

@krackers Do you have any idea why this kernel would be only 8,257,152 bytes, when the original kernel and every custom Hackintosh kernel I've checked is 8,394,688 bytes? It's weirding me out.

XNU-2422.115.4-THREAD_RESTART add-JA-v2022..01.22.zip

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024 1

Do those (macOS 10.11 Chrome users) actually exist? ;)

I mean, anecdotally, I've seen several of them in the wild! I even helped a third grade schoolteacher add the Let's Encrypt certificate to Keychain Access last fall.

This may well be an kernel bug, but I don't think it's happening in official versions of Chrome. A kernel panic is more difficult to overlook than a missing email, and this one appears to happen to all Chromium Legacy users sooner or later, albeit very infrequently. I don't see how Google, with all of their users and analytics, wouldn't know about this problem if it existed.

It seems much more likely to me that e0cff58 is culpable, somehow...


† Technically, my understanding is that any instance of a userspace application triggering a kernel panic is generally considered a kernel bug, since userspace shouldn't be able to take down the kernel.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024 1

@Wowfunhappy Assumption 1 isn't necessarily true. If we use the working hypothesis that the kpanic is caused by attempting to use an invalid (freed) kqueue, then you'd also have to search for references for code which uses MessagePump, since when that instance goes out of scope its destructor will free the kqueue fd.

There's also exactly 1 other place where THREAD_RESTART is signalled, and that's in some sephamore related method. If the kpanic is triggered by that, then we're really in a pickle since that would mean the root cause might not even be strictly related to kqueue operations.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024 1

@Wowfunhappy

Yes, Lilu will work since it has support for patching kernel functions as well. But I don't like it here because I don't think using Lilu is going to give us any strong benefit that we could not have ourselves by copying a few header files for useful functions. And the Lilu only takes care of boilerplate, but we must still bring core logic. And that too we'd have to learn the Lilu API so the benefits of having boilerplate taking care of is negated. I think it's much simpler and more understandable to just make vanilla kext?

But, I also need to step back for a moment—I'm not sure what we're trying to accomplish here. Even without Lilu, I wouldn't feel comfortable recommending a patch like this to most Chromium Legacy users.

End result will be to create a kext that users can just place in /Library/Extensions or whatever to hotpatch kernel on the fly. Will only work for users with same kernel version that you based offsets on though, other kernel versions people will have to recompile kext with their own offset. It's also fun for learning, but yes it is a bit more work to do than just recompiling xnu. Up to you whether the pain of occasional kernel panic is worth spending a few weekends trying to create a kext though. Happy to provide guidance along the way. I edited previous repose #44 (comment) with some concrete starting steps if you've decided to take this route.

How useful do we think this logging data will be? Do we actually have a chance of fixing this properly in user-space?

To be honest, I don't know. If we can determine for sure that it is caused by a freed kqueue fd, then we need to take next step to figure out why the fd is being freed. This would involve adding some logging code in Chromium (maybe in destructor for messagepump?). If we verify that is indeed the cause, then next step would be to create some work around where we try to only free the kqueue after we ensure no thread is blocked waiting on one. We could try to make that change now itself but to me seems like the effort required to figure out how to implement this (would have to study exactly what type of event to send to interrupt existing kqueue wait) combined with effort to recompile chromium is much more than doing the kext.

Other option is if you can try to find commits involving messagepump we can try to see if one of them might be responsible, but this is more of a shot in the dark compared to the kernel patch which we know will work. If you're optimizing for time spent vs. reward then the kernel patch approach seems to be a surer bet.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024 1

It's also possible that chromium is calling into a system framework/api which then makes use of kqueue in this manner. Can we run chrome under gdb or with dtrace to figure out where the kevent syscall is being made?

Tangential thing Completely unrelated, but if you like reading tales about old versions of chrome and patching things, did you know that newer version of chrome actually fixed (well more like worked-around) a coretext bug in 10.9? Under very rare and infrequent conditions (usually when rendering CJK text + emoji, but not deterministically) coretext shaper will get stuck when using AAT font, and just go in an infinite loop.

You can see in
https://github.com/harfbuzz/harfbuzz/blob/main/src/hb-coretext.cc#L187

that google actually discovered this issue and added a workaround to coretext in v58 or so. But this workaround is not perfect, and there are still conditions which cause this (seems to be a combination of emoji and kanji characters from what I've seen). I actually went a wild chase some time back trying to see if I could fix this myself by hooking and hotpatching that function to determine which parameters caused coretext to crash. I could have recompiled Chromium but figuring out how to build chromium seemed harder than just hotpatching. In fact even figuring out how to parse the crash dump format from non-debug builds (minidump format) was quite an exercise.

As you know, for security chrome divides the main browser process from the helper processes, which are spawned from a different binary. SIMBL works fine to inject into the main chrome process (surprisingly, Chrome is alright with the injection. I would have thought that they do crazy tricks to avoid process injection like in windows, but seems like not.), but since chrome helper does not load any cocoa libraries we cannot simbl inject into that. We can't add dylib to the helper binary because it is codesigned with CS_RESTRICT (and removing codesign here seemed like a bad idea) since there are some entitlements it needs I think, but helper binary also loads chrome framework dylib (manually via dlopen actually, not via usual dynamic linking), so we can actually modify that chrome framework without tripping any security issues. I tried this and it worked, but changing signature does prevent some keychain related things from working which I didn't like.

I finally realized that what I really want is to use mach_inject automatically for every helper process spawned. Newer versions of macforge do this, but for 10.9 mysimbl only does traditional osax injection.

(Btw, @Wowfunhappy I noticed many of your plugins usually done via insert_dylib approach, you should consider packaging as a simbl plugin instead since it's much easier to enable/disable and you don't break signatures).

Continue tangential thing So I thought some more and realized that to do `mach_inject` we only really a way to look for pid of spawned process. Turns out that chrome launches the helper via `posix_spawnp`, and we can hook this call via standard patching techniques. So in my mind I constructed this wacky chain where I'd simble inject into the main chrome process, then hotpatch to interpose `posix_spawnp`, then take the pid result from that call and call `mach_inject` to inject a blob into the helper, and the injected blob would finally hotpatch `harfbuzz_shape`.

_Of course after more thinking I realized this was absolutely insane and I just removed the problematic emoji that was triggering the crash from my webpage. _

EDIT: Fwiw I just realized a way to do this is via DYLD_INSERT_LIBRARIES, it automatically gets propagated to all spawned child processes, and does not break code signing. You can then use mach_override or any other plt hooking library.

Oh so back to how newer version of Chromium fixed it. Turns out that newer versions of harfbuzz don't call into CoreText at all to shape AAT fonts, so we sidestep this problem entirely. Very fortunate, and it probably also avoids a whole class of security vulnerabilities. Anyway, thank you for reading.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024 1

Oh one more crucial defensive check you can do is add the two following asserts:

kqueue_scan_continue_panic_end_location == kqueue_scan_continue_panic_start_location + sizeof(search_bytes)
kqueue_scan_continue_panic_start_location + sizeof(replacement_bytes) + extra_space_to_fill ==  kqueue_scan_continue_panic_end_location

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024 1

@krackers ...y'know what, that might be all it is, I forgot I was literally searching for a header called "MH_MAGIC_64" 🤦‍♂️. I'll take another look.

Recompiling isn't a problem, I can just lipo together two different binaries that technically have different code. (It does make the built cycle a bit more annoying unless I also take the time to automate that.)

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024 1

The kext has been updated with compatibility for 32-bit Lion.

https://gist.github.com/Wowfunhappy/8212f5bea4c601ac9a6297789f232321

(The strategy for finding the kernel base address works fine in 32 bit, but is pointless because Lion lacks KASLR.)

from chromium-legacy.

fhgwright avatar fhgwright commented on August 21, 2024 1

@Wowfunhappy

If I don't discover any new issues in the next week or so, I'm going to add this to my Preference Pane, so it gets installed automatically for anyone who uses that. I don't know if it makes sense to incorporate this into the Chromium Legacy app directly, but you are of course welcome to it.

Please don't. Kernel patches should never be installed without the user's express permission. Patching the kernel, either in an application, or in something which represents itself as an application updater, is a really extreme violation of the principle of least surprise.

While it's true that there's a kernel bug at work here, as a practical matter, coming up with a userspace workaround would be much more user-friendly, given that:

  1. AFAIK this has never been triggered by any application other than Chromium Legacy.
  2. It was never triggered by any supported release of Chrome.
  3. AIUI, the kernel fix still leads to an application crash, so an application-level fix is also needed, anyway.

Given item 3, the kernel patch on test systems could be useful in identifying the combination of syscalls that provokes the bug. And assuming that the offending code doesn't provide any new capability, there must be something in Chromium which changed from an "old way" to a "new way" of doing something. Once that difference can be figured out, the next questions are:

  1. How much of an advantage is there in the "new way"? I.e., how important is it to use the "new way" when it doesn't cause trouble, rather than reverting unconditionally?
  2. How wide is the scope of the difference? I.e., if there were a patch to revert to the "old way", how maintainable would such a patch be?

It would certainly be useful to also have a kernel fix for users who want a more robust kernel and are willing to install a kernel update, but it would be best to avoid making that a requirement for avoiding this bug. And a "proper" kernel update should involve building from source, rather than applying some kludgy on-the-fly binary patch. Some work would be needed to identify the correct sources to use as a starting point, though fortunately that needs to be done only once for any given abandoned OS version.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024 1

@fhgwright

Please don't. Kernel patches should never be installed without the user's express permission. Patching the kernel, either in an application, or in something which represents itself as an application updater, is a really extreme violation of the principle of least surprise.

The preference pane does ask for permission!

Screen Shot 2022-09-17 at 8 34 57 PM

The installation is "automatic" in the sense that, if the user says yes, it's done in one click.

Let me know if you think the above message isn't clear enough. I am trying to be respectful of the user. However, I don't think leaving users with kernel panics is particularly respectful either!

While it's true that there's a kernel bug at work here, as a practical matter, coming up with a userspace workaround would be much more user-friendly

I 100% agree! At the moment, however, I don't think anyone working on Chromium Legacy has the time or expertise to come up with a userspace fix. This is the best that krackers and I can do. If you have the expertise to fix the problem in userspace, please contribute, that would be wonderful!

there must be something in Chromium which changed from an "old way" to a "new way" of doing something.

My current best guess (which could definitely still be wrong) is that it's https://bugs.chromium.org/p/chromium/issues/detail?id=932175

How much of an advantage is there in the "new way"? I.e., how important is it to use the "new way" when it doesn't cause trouble, rather than reverting unconditionally?

It sounds like it's not such a huge advantage. Most users are not exhausting file descriptors. However...

How wide is the scope of the difference? I.e., if there were a patch to revert to the "old way", how maintainable would such a patch be?

Wide enough that Bluebox wasn't able to do it. #44 (comment)

And a "proper" kernel update should involve building from source, rather than applying some kludgy on-the-fly binary patch.

Already done!

#44 (comment)

This kernel is for 10.9.5, but you could easily apply the fix to whichever kernel you'd like. It's literally a three line change.

The problem is that the open source release of XNU is incomplete. Most notably, custom kernels are unable to log into iMessage. This is why we created a kext to patch the memory at runtime instead.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024 1

I was on 10.9.4 because I never realised that it wasn't the latest Mavericks 😂😂😂 I kind a skimmed through the end of the issue and didn't spot the workaround, I'll update to 10.9.5 and try it, will report back after some testing.

Quick heads up that Apple also released further updates after the base 10.9.5 which don't change the version number, you want 10.9.5 build 13F1911. I'm pretty sure the updater will prompt you, but if not, install: https://support.apple.com/kb/dl1886?locale=en_US

from chromium-legacy.

pjpreilly avatar pjpreilly commented on August 21, 2024

When I've triggered this...by chance.. I seem to always be switching windows either through hot corners invoking Mission Control or clicking the Chromium Dock Icon to switch windows..... FWIW.... or maybe switching tabs.... I've noticed sometimes when switching tabs the pointer does not take focus of the new tab until a subsequent click in the tab being switched to...seems to occur after the browser has been running for a good amount of time.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

@Wowfunhappy Mapping is given in https://opensource.apple.com/source/xnu/xnu-4570.61.1/osfmk/kern/kern_types.h

You're right in that 3 corresponds to THREAD_RESTART. Not too familiar with kqueue internals but just based on statically analyzing the call hierarchy, seems to be

kevent_internal -> kqueue_scan -> kqueue_scan_continue

It's really interesting that kevent syscall was only recently updated to return EBADF though. Seems like this is returned whenever the fd becomes invalid (e.g. if it's closed), which would definitely explain why it's sporadic and hard to reproduce. I wonder if this bug was just never noticed on older versions of osx because nothing else other than chrome seems to make such heavy use of IPC, and older versions of Chromium probably used some other method for synchronization/ipc [1]?

There are only a few places in chromium source that make the kevent syscall (search for
"kevent" "#include <sys/event.h>"). Might be interesting to add some debugging code around that to see if there's any pattern.

[1] https://source.chromium.org/chromium/chromium/src/+/2096391cdcc3bdf963e68977e9015aa8cdbe85c6

https://source.chromium.org/chromium/chromium/src/+/c96c632d5501d5ebba14e1130fdea8b15ac29756

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Thanks! I wonder if this was the underlying project which introduced the problem. The timing lines up, 9–12 months after 10.9 support was dropped from Chrome.

Aside, I assume it's not possible to just replace kqueue_scan_continue with a version that handles THREAD_RESTART, right? Because it's an internal kernel function, or something like that?

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

@Wowfunhappy Excellent question. It is possible if you are willing to install a kext. Basically since all the kexts will live in the same address space as the kernel code, you can hot-patch the kernel by modifying the instructions in memory.

Now to actually do this patch you'll need to be a bit clever:

  • Load the kernel mach-o in ghidra/hopper/ida and find the address of the function that we want to modify (kqueue_scan_continue). This will be our magic number that is the offset to patch from base.
  • Then in your kext, you will have to find the base location the kernel is loaded at (due to kaslr this is not fixed), and add that offset from step (1) to find the address of the desired function.
  • Add an extra offset to get to the specific instruction we want to patch. In this case, it's not a trivial modification since we want to insert a block of code, so the best way to do this is to insert a jump to code in your kext that will handle the switch(), then jump back to the original code.
  • Do the patch as described above, and it should all just work.

Conceptually not too difficult, and not that different from code injection in userspace. Only difference is that stakes are higher. If you have a VM and are willing to play around with it, I can give you some code pointers. Of course at this point it might just be easier to recompile the kernel itself since we have the source anyway...

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Thanks, makes sense it would need a kext. Probably not ideal if it can be helped.

(Just for my own use, I am a bit tempted to recompile XNU. But I run into this panic so rarely anyway... Some people on MacRumors were saying it happens daily for them.)

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

MacBiter over on MacRumors captured a backtrace with keepsyms enabled.

panic(cpu 6 caller 0xffffff80055c7ab4): "kqueue_scan_continue: - invalid wait_result (3)"@/SourceCache/xnu/xnu-2422.115.15/bsd/kern/kern_event.c:2167
Backtrace (CPU 6), Frame : Return Address
0xffffff81f05a3ef0 : 0xffffff8005223139 mach_kernel : _panic + 0xc9
0xffffff81f05a3f70 : 0xffffff80055c7ab4 mach_kernel : _kqueue_scan + 0x8f4
0xffffff81f05a3fb0 : 0xffffff80052d7c67 mach_kernel : _call_continuation + 0x17

I assume this is _call_continuation? Still an internal kernel function, I naively thought it might give us Chromium's initial syscall.

Edit: And I think this is the definition... I wasn't expecting to find ASM...


Edit 2: I don't actually understand most of what it's saying, but pages 418 – 420 of Jonathan Levin's Mac OS X and iOS Internals may be relevant here. I think it's saying that wait_result is set by the thread_block_reason() function.

Screen Shot 2022-01-19 at 7 10 46 PM

I'm just trying to figure out where that wait_result value could be coming from. I find it so weird that XNU is returning a value that didn't exist yet. It would make sense if Chromium was the one setting wait_result, since Google is only targeting modern kernels. But this is kernel code calling kernel code. I guess the obvious answer is that it's a bug, but it seems like something that wouldn't have gone unnoticed for years.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

@krackers I'm not ignoring your suggestions of things to try, they're just mostly outside of my skill level. 🙂 I only barely understand explanations of what a file descriptor is...

Anyway, the plot thickens. I neglected to check something earlier:

https://github.com/apple-oss-distributions/xnu/blame/cc9a63552421e7db025ad771974f3e9f479948ad/bsd/kern/kern_event.c

That THREAD_RESTART case was only added to XNU in 3789.1.32, aka macOS 10.12 Sierra.

But, Google Chrome officially supports macOS 10.11. So, why don't Chrome users on El Capitan experience kernel panics?

Could this indicate the problem was introduced by Chromium Legacy? If I'm not missing anything, I think the only Chromium Legacy changes that involve kevent come from this commit. e0cff58

Google's code review comments for this change are interesting...

from chromium-legacy.

pjpreilly avatar pjpreilly commented on August 21, 2024

On latest canary..Version 100.0.4889.0 (Official Build) (x86_64)
two finger scrolling in a facebook messenger chat....

Tue Feb 15 06:40:24 2022
panic(cpu 0 caller 0x55bb8e): "kevent_scan_cont() - invalid wait_result (3)"@/SourceCache/xnu/xnu-1699.32.7/bsd/kern/kern_event.c:1982
Backtrace (CPU 0), Frame : Return Address (4 potential args on stack)
0xe2bf68 : 0x2203de (0x6b08cc 0xe2bf88 0x229fb0 0x0)
0xe2bf98 : 0x55bb8e (0x7006a4 0x3 0x6 0xa4c39d4)
0xe2bfc8 : 0x2c7d0c (0x9f03320 0x3 0x10 0xa4e9190)

BSD process name corresponding to current thread: Chromium Helper

Mac OS version:
11G63

Kernel version:
Darwin Kernel Version 11.4.2: Thu Aug 23 16:26:45 PDT 2012; root:xnu-1699.32.7~1/RELEASE_I386
Kernel UUID: 859B45FB-14BB-35ED-B823-08393C63E13B
System model name: MacBook4,1 (Mac-F22788A9)

System uptime in nanoseconds: 71096824466362
last loaded kext at 27108605406: com.apple.driver.AudioAUUC 1.59 (addr 0x146b000, size 24576)
last unloaded kext at 115835844270: com.apple.driver.USBCameraFirmwareLoader 1.1.0 (addr 0xd35000, size 24576)
loaded kexts:
com.apple.driver.AudioAUUC 1.59
com.apple.filesystems.autofs 3.0
com.apple.driver.AppleHWSensor 1.9.5d0
com.apple.driver.AppleHDA 2.2.5a5
com.apple.driver.AppleUpstreamUserClient 3.5.9
com.apple.driver.AppleBacklight 170.2.2
com.apple.driver.AppleMCCSControl 1.0.33
com.apple.driver.AppleIntelGMAX3100 7.0.4
com.apple.driver.AppleIntelGMAX3100FB 7.0.4
com.apple.driver.SMCMotionSensor 3.0.2d6
com.apple.iokit.IOUserEthernet 1.0.0d1
com.apple.iokit.IOBluetoothSerialManager 4.0.8f17
com.apple.Dont_Steal_Mac_OS_X 7.0.0
com.apple.driver.AudioIPCDriver 1.2.3
com.apple.driver.ApplePolicyControl 3.1.33
com.apple.driver.ACPI_SMC_PlatformPlugin 5.0.0d8
com.apple.driver.AppleLPC 1.6.0
com.apple.driver.AppleSMCPDRC 5.0.0d8
com.apple.driver.CSRUSBBluetoothHCIController 4.0.8f17
com.apple.AppleFSCompression.AppleFSCompressionTypeDataless 1.0.0d1
com.apple.AppleFSCompression.AppleFSCompressionTypeZlib 1.0.0d1
com.apple.BootCache 33
com.apple.driver.AppleUSBTrackpad 227.6
com.apple.driver.AppleUSBTCKeyEventDriver 227.6
com.apple.driver.AppleUSBTCKeyboard 227.6
com.apple.driver.AppleIRController 312
com.apple.iokit.SCSITaskUserClient 3.2.1
com.apple.driver.XsanFilter 404
com.apple.iokit.IOAHCIBlockStorage 2.1.0
com.apple.driver.AirPortBrcm43224 501.36.15
com.apple.driver.AppleFWOHCI 4.9.0
com.apple.driver.AppleHPET 1.7
com.apple.driver.AppleUSBHub 5.1.0
com.apple.driver.AppleAHCIPort 2.3.1
com.apple.driver.AppleIntelPIIXATA 2.5.1
com.apple.driver.AppleUSBEHCI 5.1.0
com.apple.driver.AppleUSBUHCI 5.1.0
com.apple.driver.AppleEFINVRAM 1.6.1
com.apple.driver.AppleRTC 1.5
com.apple.iokit.AppleYukon2 3.2.2b1
com.apple.driver.AppleSmartBatteryManager 161.0.0
com.apple.driver.AppleACPIButtons 1.5
com.apple.driver.AppleSMBIOS 1.9
com.apple.driver.AppleACPIEC 1.5
com.apple.driver.AppleAPIC 1.6
com.apple.driver.AppleIntelCPUPowerManagementClient 195.0.0
com.apple.nke.applicationfirewall 3.2.30
com.apple.security.quarantine 1.4
com.apple.security.TMSafetyNet 8
com.apple.driver.AppleIntelCPUPowerManagement 195.0.0
com.apple.kext.triggers 1.0
com.apple.driver.DspFuncLib 2.2.5a5
com.apple.driver.AppleBacklightExpert 1.0.4
com.apple.driver.AppleSMBusController 1.0.10d0
com.apple.driver.AppleHDAController 2.2.5a5
com.apple.iokit.IOHDAFamily 2.2.5a5
com.apple.iokit.IOSurface 80.0.2
com.apple.iokit.IOFireWireIP 2.2.5
com.apple.iokit.IOAudioFamily 1.8.6fc18
com.apple.kext.OSvKernDSPLib 1.3
com.apple.driver.AppleGraphicsControl 3.1.33
com.apple.iokit.IONDRVSupport 2.3.4
com.apple.iokit.IOGraphicsFamily 2.3.4
com.apple.iokit.IOSerialFamily 10.0.5
com.apple.driver.AppleSMC 3.1.3d10
com.apple.driver.IOPlatformPluginLegacy 5.0.0d8
com.apple.driver.IOPlatformPluginFamily 5.1.1d6
com.apple.driver.AppleUSBBluetoothHCIController 4.0.8f17
com.apple.iokit.IOBluetoothFamily 4.0.8f17
com.apple.driver.AppleUSBMergeNub 5.1.0
com.apple.iokit.IOUSBHIDDriver 5.0.0
com.apple.driver.AppleUSBComposite 5.0.0
com.apple.iokit.IOSCSIMultimediaCommandsDevice 3.2.1
com.apple.iokit.IOBDStorageFamily 1.7
com.apple.iokit.IODVDStorageFamily 1.7.1
com.apple.iokit.IOCDStorageFamily 1.7.1
com.apple.iokit.IOATAPIProtocolTransport 3.0.0
com.apple.iokit.IOSCSIArchitectureModelFamily 3.2.1
com.apple.iokit.IO80211Family 420.3
com.apple.iokit.IOFireWireFamily 4.4.8
com.apple.iokit.IOUSBUserClient 5.0.0
com.apple.iokit.IOAHCIFamily 2.0.8
com.apple.iokit.IOATAFamily 2.5.1
com.apple.iokit.IOUSBFamily 5.1.0
com.apple.driver.AppleEFIRuntime 1.6.1
com.apple.iokit.IONetworkingFamily 2.1
com.apple.iokit.IOHIDFamily 1.7.1
com.apple.iokit.IOSMBusFamily 1.1
com.apple.security.sandbox 177.11
com.apple.kext.AppleMatch 1.0.0d1
com.apple.driver.DiskImages 331.7
com.apple.iokit.IOStorageFamily 1.7.2
com.apple.driver.AppleKeyStore 28.18
com.apple.driver.AppleACPIPlatform 1.5
com.apple.iokit.IOPCIFamily 2.7
com.apple.iokit.IOACPIFamily 1.4

from chromium-legacy.

jgrisham avatar jgrisham commented on August 21, 2024

But, Google Chrome officially supports macOS 10.11. So, why don't Chrome users on El Capitan experience kernel panics?

Do those (macOS 10.11 Chrome users) actually exist? ;)

blah...
  • Those still running 10.11 may be using Safari/Firefox/Opera, or blame their kernel panics on running an old OS version.
  • (I also suspect many are headless or server systems that haven't been updated - these types of systems probably don't see much interactive web browsing.)
  • Personally, my MacBook Pro still is running Mavericks, but I seem to remember thinking (for some long-lost rationale) that if I were ever to bump it forward, I'd be stopping at 10.10, 10.12, 10.13, or 10.15.

AFAIK, there aren't many (any?) macs where that is the latest installable version.

Systems tend to max out on 10.4, 10.6.8, 10.7.5, 10.11, 10.13.x, etc...

  • Having a max supported OS of macOS 10.11 seems mostly limited to MacMini3,1s and other similar late 2008/early 2009 Penryn/Nehalem models ... the late 2009 models tend to support up to 10.13.x ... I don't remember exactly, but the demarcation point may have been 'Metal' framework support by the integrated GPU.

more blah...

We recently identified a MS Outlook / Windows MAPI bug that now silently eats emails... but only for IMAP mailboxes, and only for maybe one out of a thousand. It has existed for at least a decade, likely two. (It was likely dormant for most of that time; surfacing only as mail processors began to support ARC in late 2019/2020.)

  • From the public reporting, it wouldn't be at all surprising if Apple had overlooked intermittent kernel bugs for just as long, especially for one that didn't generally cause panics on iOS / TVOS.
  • The larger the company (and the older/more stable the codebase), the harder it is for even well-meaning and talented engineering teams to find out edge cases exist - there are unintentional gatekeepers at all levels.

Tl;dr: it's plausible that it had been broken all along (i.e. not backported to earlier versions when it was finally discovered / fixed in XNU 3789.1.32).

from chromium-legacy.

jgrisham avatar jgrisham commented on August 21, 2024
{discussion clutter}

Technically, my understanding is that any instance of a userspace application triggering a kernel panic is generally considered a kernel bug, since userspace shouldn't be able to take down the kernel.

That's what I always thought too...

  • anything else seems like users and userspace devs are just being gaslit by kernel devs (I guess it's technically only gaslighting if it's intentional)

A kernel panic is more difficult to overlook ...

Indeed.

this one appears to happen to all Chromium Legacy users sooner or later, albeit very infrequently

Do we actually know that? I mean, how could we without a major analytics operation? (I'm honestly curious - not trying to be argumentative!)

... yet, someone who recompiles XNU for fun in their spare time also said:

I've been avoiding reporting this ...

Chromium is one of the most resource-intensive users of a heterogenous variety of OS resources that many users are likely to run these days.

  • It's also liable to be used in very different ways on each launch; even users visiting the same website day after day will ultimately be running different JS from one month to the next.

with all of their users and analytics

I really want to believe, but for something that doesn't significantly impact PR or ad revenue, causing occasional and non-reproducible crashes for users of older versions of only one operating system?

  • Simple things break complex systems, and yet only seem simple in retrospect.
  • It would not be at all surprising if they had much more impactful bugs languishing in their backlog, both then and now. Knowledge without action and all...

Consider:

  • Apple fixed this at the kernel level (after a few years)
  • You fixed this at the kernel level (after a few weeks)

Also:

  • None of that actually matters 😒

Neither one of those fixed kernels will get to the majority of affected users here, so we do have to continue trying to mitigate this kernel bug by rolling back or modifying our userspace code.


Tl;dr: Thank you for trying to hunt it down!

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

Looking again at the kernel kevent code, it seems there are very few places where THREAD_RESTART is actually signaled. The most likely source for the one received in kqueue_scan seems to be

https://github.com/aosm/xnu/blob/os-x-1095/osfmk/kern/wait_queue.c#L1335

	/* If it is an invalid wait queue, you cant wait on it */
	if (!wait_queue_is_valid(wq))
		return (thread->wait_result = THREAD_RESTART);

and in the kevent code kq->kq_wqs is the wait queue as part of the kq struct. This again makes me think that somehow the kqueue is being closed by some other process and we're not handling that.

@Wowfunhappy can you add some log statements to each of the sites in kern_event.c that call into wait_queue_assert, and then an additional log statement in the place where we would have panicked before? To avoid too much data guard each log statement by !wait_queue_is_valid(wq). (Let me know if you need clarification here). Collecting more data might be useful to try to recreate the exact conditions under which this is triggered.

As for userspace workarounds, assuming that hypothesis is correct then what I can think of is to try to ensure that no thread is waiting on a kqueue that is to be closed. So we'd have to send some sort of interrupt signal to break the wait on kqueue before closing the fd? Not too sure since I'm not too familiar with these parts...

As for the reverted commit you linked, at first glance I don't see anything too suspicious there. Why did that commit need to be reverted for 10.9 compatibility though? I don't see anything that's 10.11+ specific.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

@Wowfunhappy can you add some log statements to each of the sites in kern_event.c that call into wait_queue_assert, and then an additional log statement in the place where we would have panicked before? To avoid too much data guard each log statement by !wait_queue_is_valid(wq). (Let me know if you need clarification here). Collecting more data might be useful to try to recreate the exact conditions under which this is triggered.

I'll try next week!

As for the reverted commit you linked, at first glance I don't see anything too suspicious there.

I don't understand this code, so I'm just looking at the timing of when things were added and the differences between codebases. I'm assuming:

  1. The offending code has to make use of the kevent system call.

  2. There has to be something different in Chromium Legacy's codebase and mainline Chromium's codebase, which is causing Chromium Legacy to kernel panic on 10.11 and below but not mainline Chromium. (This is based on a separate assumption that Chromium Legacy would kernel panic if used on 10.11, since 10.11 contains the offending XNU code.)

The commit I linked contains the only difference I could find between the two codebases which uses kevent.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

@krackers I ran into some stupid trouble guarding the logging statements. wait_queue_is_valid doesn't exist in kern_event.c.

I tried to copy the two defines from: https://github.com/aosm/xnu/blob/0a13c7a4dcb9ab989727c2b0e08f7861457ed069/osfmk/kern/wait_queue.h#L161, adding:

#define _WAIT_QUEUE_inited		0x2
#define wait_queue_is_valid(wq)	\
	(((wq)->wq_type & ~1) == _WAIT_QUEUE_inited)

But the compiler complained:
no member named 'wq_type' in 'struct wait_queue'

The wait_queue struct is defined in kern/wait_queue.h, and wq_type is ostensibly public, but adding #include <kern/wait_queue.h> to kern_event.c had no effect. I don't want to copy-paste the struct into kern_event.c because I don't know if defining it in a new place will do something weird.

I'm sure I'm doing something dumb. 🙂

For now, I've attached a kernel with unguarded logging statements before each call of wait_queue_assert in kern_event.c. The source is this revision of: https://gist.github.com/Wowfunhappy/a214411bb52d39bcee9497363a97558d/440e40ec82da87f4240f22981a020e0f169ea915. It works in my VM, but it's noisy enough that I don't want to leave it running on my real system. It seems wait_queue_assert_wait_with_leeway is called from kqueue_scan many times per second.

Also for reference, the guarded version I can't compile: https://gist.github.com/Wowfunhappy/a214411bb52d39bcee9497363a97558
XNU-2422.115.4-THREAD_RESTART add-JA-v2022.02.21.zip
d

Edit: Issue was closed by accident, please disregard.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

@Wowfunhappy

No you're not doing something wrong, the way their includes work is tricky and took me a good 10 minutes to figure out.

Very roughly, when all the headers for are put together the net result is something like


#include <stdio.h>


typedef struct wait_queue		*wait_queue_t;
#ifndef MACH_KERNEL_PRIVATE

struct wait_queue { unsigned int opaque[2]; int *opaquep[2]; } ; // from kern_types

#else



typedef struct wait_queue { // from wait_queue.h
	unsigned long int                 
	/* boolean_t */	wq_type:2,		/* only public field */
					wq_fifo:1,		/* fifo wakeup policy? */
					wq_prepost:1,	/* waitq supports prepost? set only */
					wq_eventmask:2; 
	int	wq_interlock;	/* interlock */
	int	wq_queue;		/* queue of elements */
} WaitQueue;

#define wait_queue_is_valid(wq)	\
	(((wq)->wq_type & ~1) == _WAIT_QUEUE_inited)
	
#endif
	

In particular, note that we actually have two different declarations of the wait_queue struct that are selected based on whether MACH_KERNEL_PRIVATE. I assume one is intended be used akin to an opaque-type (think "private" in an OO language) to hide internal members, while the other is used only within the wait_queue.c. Interestingly for opaque types you normally just leave the declaration without any fields, but here they explicitly initialize with some dummy fields of the proper size, which I assume was done so sizeof() would work propery.

This is also why just #including didn't work, since the proper #define wasn't set, so it effectively never made it past the preprocessor.

Now let's work around this. You're right in that we don't want to just directly copy-paste the struct since we'll run into duplicate definition error. Instead let's do it in a slightly smarter way where we create our own struct with identical layout and then cast to that to access fields.


typedef struct wait_queue		*wait_queue_t;

struct wait_queue { unsigned int opaque[2]; int *opaquep[2]; } ; // from kern_types


#define EVENT_MASK_BITS  

typedef struct wowfunhappy_wait_queue { // happy little wait queue
	unsigned long int                 
	/* boolean_t */	wq_type:2,		/* only public field */
					wq_fifo:1,		/* fifo wakeup policy? */
					wq_prepost:1,	/* waitq supports prepost? set only */
					wq_eventmask:((sizeof(long) * 8) - 4); 
} WowFunHappyWaitQueue;

#define _WAIT_QUEUE_inited		0x2
#define wait_queue_is_valid(wq)	\
	(((wq)->wq_type & ~1) == _WAIT_QUEUE_inited)

void buzz(wait_queue_t wq) {
	printf("%d", wait_queue_is_valid((WowFunHappyWaitQueue*) wq));
}

** Note 2 important things. 1) This is technically undefined behavior I think, although if you were to use a union instead of a cast to do the type-punning then it might be standards compliant? But who cares about standards so long as it works (and casting should indeed work).

  1. Note that I only included the initial member of the struct here, and left out the other 2. This should be fine because IIRC type-punning is safe for any prefix of shared fields as well. **

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Unfortunately, I have been forced to revert back to a vanilla XNU kernel on my personal machine.

I woke up this morning and found that my Mac had been logged out of iMessage. When I tried to log back in, the app complained of an unknown error. After a lot of reboots, password changes, network changes, and other dead ends, I finally tried reverting back to stock XNU, and iMessage instantly began working again.

If anyone else is using one of the kernels I attached, and iMessage stops working for you, you know how to fix it. And for what it's worth, I experienced zero panics during my month with the custom kernel.

Stuff related to iMessage and custom kernels

I'm not entirely sure what happened, since I was using my custom kernel for nearly a month without problems. It may be that because I was already logged into iMessage when I switched kernels, the app kept working up until Apple sent some kind of heartbeat message. The imessage_debug utility reported failures when retrieving the values of Gq3489ugfi, Fyp98tpgj, kbjfrfpoJU, oycqAZloTNDm, and abKPld1EcMni on the custom kernel (and success with the stock kernel), but the software's authors neglected to include any documentation as to what these values mean, where they come from, or why they're important. 🤷‍♂️

Regardless, I personally value iMessage compatibility, and the system is enough of a crapshoot that I'd rather not mess with it and potentially get my account flagged. Perhaps I'll see if I can figure out a kext patch instead.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

That's interesting. Based on what I found online it seems cause is

Publicly-available XNU sources lack the necessary functions needed for generating the IOPower data (just look at IORootParent::initialize() in the kernel binary vs. what's in the source).

I assume that the hackintosh community has worked around this by just doing patching instead. There seems to be a robust patching framework they've built up (Lilu). Another alternative might have been to create a kext that does populate these values in the io registry, but I don't know why they didn't go that route. I'm not too familiar with the hackintosh scene so I don't know how they obtain these values in the first place if they don't have an existing mac to copy from (basedo n https://dortania.github.io/OpenCore-Post-Install/universal/iservices.html#fixing-en0 looks like they generate random SNs??)

I personally don't like the idea of using a monolithic framework to do the patching (and Lilu is bloated since it tries to handle both userspace and kernel patching). Looking at ([Onyx the black cat[(https://github.com/gdbinit/onyx-the-black-cat)) is a good reference since it contains some useful functions we might want (especially the write enable on kernel pages). Most of those patches are in-place binary patches though, whereas for a slightly more involved patch like this I'd usually want to insert a trampoline.

I tried to find any existing examples of kexts which have a trampoline insert that we can steal (there are many userspace ones like mach_override that we can port if really needed since the core logic is the same, but I'm lazy and don't want to think too much). I found

https://github.com/lilang-wu/EasyBP/blob/d2d9591417df52b94d21945efb8cea393dc46a9b/makedebugpoint/inline-hook/inline_hook.c

https://github.com/seclab-ucr/SyzGen_setup/blob/5a86c59e3e26ba7c259fbc19de89197dc4edd674/kcov/kcov/hook.c

https://github.com/SilverMoonSecurity/PassiveFuzzFrameworkOSX/blob/be29a4df6e7b2394774a8cc3019bde03d816158c/src/kernel_fuzzer/the_flying_circus/InlineHook/inline_hook.c

(interestingly all same to be the same-ish file, but don't have any attribution...)

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

Er actually doing a trampoline is going to be a bit annoying because we need to ensure the return value is actually used. We can either implement the trampoline'd function in assembly so we don't need to worry about C messing things up, try to fuss around with the proper syntax to ensure that wait_result gets mapped to the function param and the return value gets mapped to the register that contains error, or just forget about doing this properly and patch the default section to return EBADF directly and hope that this doesn't mask any other bug.

In terms of hacker spirit I still like the trampoline though, and if we set up the framework for that it will be easy to insert the logging I mentioned. After going through the candidates I like the approach taken by

https://github.com/seclab-ucr/SyzGen_setup/blob/5a86c59e3e26ba7c259fbc19de89197dc4edd674/kcov/kcov/hook.c

since it is the easiest for me to understand and flexible enough to coax into doing what we want.

@Wowfunhappy Either way (whether or not we opt to use trampoline) let's start off with creating a simple kext that just reads the contents of kqueue_scan_continue function address to make sure we are accessing the correct address. https://github.com/lilang-wu/EasyBP/blob/d2d9591417df52b94d21945efb8cea393dc46a9b/makedebugpoint/solve_kernel_symbols/kernel_info.c has a nifty solve_kernel_symbol function but I don't think we can use it because kqueue_scan_continue is not exported function. Instead, we need to find that corresponding block in the disassembled binary to get the relative address (relative to kernel base) and then add the kaslr offset to it to get the absolute address in memory.

Once you have the relative address, can you create a basic "hello world" kext (https://craftware.xyz/tips/Building-kernel-extension.html) that uses the kaslr_slide function from https://github.com/lilang-wu/EasyBP/blob/d2d9591417df52b94d21945efb8cea393dc46a9b/makedebugpoint/solve_kernel_symbols/kernel_info.c to dump out some bytes from the absolute address to see if they match the assembly we expect? Once we have that then we can take the next steps...

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

The Hackintosh community has some good reasons for avoiding custom kernels. Apple releases source code on a delay, and if you're running a non-ancient version of macOS, every update will potentially overwrite your custom kernel with a stock one, and leave you with a non-bootable system if you aren't careful. The more recent OSS releases also leave out really critical things like sleep and power management.

I'd thought that on an EOL OS, there was a certain purity to just fixing an obvious bug in an open-source component and recompiling—after all, what good is code that's never built? But it was never a great solution anyway, and it appears that even Apple's older code releases aren't as complete as I thought.

I am a Hackintosh user (on one of my two machines) and I'm already using Lilu, so it might make sense for me to go that route; it mostly depends on what I can actually manage to do. Downside: AIUI, Lilu has to be injected via a Hackintosh bootloader, because it needs to be loaded before the kexts it's patching. (Although, wait a minute, we're not trying to patch a kext, we're trying to patch the kernel—will this work at all? I know you said it's all the same address space, but then why does the loading order matter...?)

But, I also need to step back for a moment—I'm not sure what we're trying to accomplish here. Even without Lilu, I wouldn't feel comfortable recommending a patch like this to most Chromium Legacy users. The custom kernel had the advantage of being easy, I was done in a couple hours.

How useful do we think this logging data will be? Do we actually have a chance of fixing this properly in userspace?

Edit: I started writing this before your more recent replies/edits @krackers. Okay, if I attempt to keep going with this I'll follow from your starting point, thank you! I would still like to step back for a minute and consider whether this a smart avenue to keep exploring—what are we hoping to find out?

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

@krackers I'm going to continue the tangential discussion a bit here to try to leave this issue as focused as possible.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

It's also possible that chromium is calling into a system framework/api which then makes use of kqueue in this manner.

The crash log is pretty explicit though:

BSD process name corresponding to current thread: Chromium Helper

If it was a system framework, wouldn't that be showing up instead?

I also took a quick look in Instruments, and can confirm that Chromium Helper is certainly making BSC_kevent64 syscalls. Whether these are the syscalls that sometimes lead to a panic, I have no idea.

Edit: All coming from one place:
Screen Shot 2022-02-27 at 10 46 30 AM

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

If it was a system framework, wouldn't that be showing up instead

I think even if call into system framework, it's still part of same chrome helper process?

But yes based on your trace seems like it is indeed chromium message pump. Can you try recompiling chromium add adding some logging in the destructor for messagepumpkqueue (assuming that is the class which holds the kqueue reference)? I assume that it's one messagepump per chrome helper process, which means it might not be too noisy to run it and see if object destroy is related to kernel panic trigger.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

I'll try if I can! It may be a while, grad school is picking up so I'm trying (key word) to spend less time screwing around with computers.

Bluebox, if you happen to read this, (I assume you're busy with other life stuff), you may want to add this logging yourself since you have the hardware to compile Chromium easily.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

I'm not sure why (although my best guess is it's something in the latest stable version of Chromium), but I've encountered many, many more kernel panics than usual over the past few days. This is good in some ways, because it means I may be able to capture useful data more easily.

I decided to switch back to the custom kernel for a time, now with the logging messages. @krackers, this is the code I ended up with: https://gist.github.com/Wowfunhappy/a214411bb52d39bcee9497363a97558d#file-kern_event-c-L112. I hope this was enough, including the other defines led to compile errors (which I didn't capture—sorry, I was actually trying to finish this before I have to leave for work).

In the time I've spent writing this comment, I actually may have already captured an averted panic! In the console I have:

3/4/22 2:39:26.000 PM kernel[0]: kqueue_scan_continue was called with a wait_result of THREAD_RESTART. A vanilla 2422 XNU kernel would have panicked!

But, bad news: that's the only log I have! So either the wait_queue_is_valid check is always returning true (either because I coded it wrong or due to something else), or the call order is something completely different!

Anyway, just wanted to share, I'll look into this more later, gotta go teach Scratch programming to a bunch of children... 💨

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

@Wowfunhappy

Your code seems fine to me, so it seems like whatever is causing THREAD_RESTART to be signaled is via a different code path. Just to be sure we didn't miss anything, can you add some more logging in all of the

	if (!wait_queue_is_valid(wq))
		return (thread->wait_result = THREAD_RESTART);

lines in
https://github.com/aosm/xnu/blob/os-x-1095/osfmk/kern/wait_queue.c#L1335?

I.e. like you did before

	if (!wait_queue_is_valid(wq)) {
               printf("Func name + error message");
		return (thread->wait_result = THREAD_RESTART);
        }

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Thanks for confirming the code looked fine! I've added that additional logging, see the new file in the gist: https://gist.github.com/Wowfunhappy/a214411bb52d39bcee9497363a97558d

Now we just need to wait for a log message to show up again...

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

The panic-averted log appeared once again:

3/4/22 7:45:05.000 PM kernel[0]: kqueue_scan_continue was called with a wait_result of THREAD_RESTART. A vanilla 2422 XNU kernel would have panicked!

Once again, it was not accompanied by any of the other messages I added.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

Hm well there goes that theory. Seems like it's not directly caused by a kqueue event then, it just happens when it's in the kqueue wait loop.

The 3 other places THREAD_ERROR seems to be signalled are in vm_fault.c, sync_sema.c, and ipc_mqueue.c.

Let's try the last one: can you add a printf before the wait_queue_wakeup64_all_locked calls in osfmk/ipc/ipc_mqueue.c? I don't know how noisy this will be though.

We can also try to determine this from the other direction: are there any plausible candidates for commits that would caused these panics to happen more frequently? Anything related to mach ipc or messagepump would be decent candidates.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

Worst case I guess you could binary search for candidate commits, try 2 builds a day and you'd probably arrive at the change within a week or two.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Let's try the last one: can you add a printf before the wait_queue_wakeup64_all_locked calls in osfmk/ipc/ipc_mqueue.c? I don't know how noisy this will be though.

Sure, done, but they're very noisy, running many times per second. I'll give this a try but I'm not sure how much useful information it will give us. And since logs roll off after 4,000 messages I'm not sure I'll be able to catch the critical THREAD_RESTART one...

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

Yeah that's what I feared. Ok I think we have 2 options going forward then

  1. Try to find the commit to Chromium that caused increased rate of kpanic. I tried poking around for changes related to ipc, but didn't find any smoking gun. Bisecting the way to the commit is probably easiest then.

  2. Other option is to patch this on kernel side via the kext approach mentioned before. I think we can just take a shortcut here and change the default condition to just return EBADF instead of panicking (as opposed to the "right" approach of inserting a new jump for a new case condition).

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Try to find the commit to Chromium that caused increased rate of kpanic. I tried poking around for changes related to ipc, but didn't find any smoking gun. Bisecting the way to the commit is probably easiest then.

Thing is, the frequency is still erratic. Case in point, I haven't received that log message again since Friday. I don't actually know that the change was in Chromium as opposed to something in my usage pattern...

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

I found a thing. The more I think about it, the less I'm convinced it's related, but maybe @krackers can take a peek and decide if I may be onto something. :)

Google was running into kernel panics of their own back in 2017. Their kernel backtrace was different than ours, but it touches on same of the same functions: https://bugs.chromium.org/p/chromium/issues/detail?id=756102

Google fixed it here: https://chromium.googlesource.com/chromium/src.git/+/5cd8eef498e28e1aacb9109645c001bbbfe1bc26%5E%21/#F0

   // On macOS 10.11+, using Mach port sets may cause system instability, per
   // https://crbug.com/756102. On macOS 10.12+, a kqueue can be used
   // instead to work around that. On macOS 10.9 and 10.10, kqueue only works
   // for port sets, so port sets are just used directly. On macOS 10.11,
   // libdispatch sources are used. Therefore, there are three different
   // primitives that can be used to implement WaitMany. Which one to use is
   // selected at run-time by OS version checks.

So, kPrimitive is supposed to be set to:

  • 10.12 and newer: KQUEUE
  • 10.11 exactly: DISPATCH
  • 10.10 and older: PORT_SET

This was set by Chromium via:
const WaitManyPrimitive kPrimitive = mac::IsAtLeastOS10_12() ? KQUEUE : (mac::IsOS10_11() ? DISPATCH : PORT_SET);

However, Chromium Legacy uses:
const WaitManyPrimitive kPrimitive = mac::IsAtLeastOS10_12() ? KQUEUE : DISPATCH;

I think this is wrong, isn't it? We're using the 10.11 codepath on 10.7–10.10, whereas Google explicitly built something separate for 10.9–10.10, although it's not clear why. (Since official Chrome doesn't support anything older than 10.11, it seems this code is now only used for iOS?)

Unfortunately, it seems @blueboxd changed this early on to "fix instabilities on loading DevTool and plugins".

Assuming that's still needed... this whole Mac-specific WaitableEvent implementation is a relatively recent addition to Chromium, I wonder if we could just revert the whole thing and fall back to whatever cross-platform implementation Chromium was using before? Could be more maintainable anyway.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

Hm it's an interesting find. The bug comments also suggest that watching videos (or doing something with GPU heavy activity) is more likely to trigger the bug.

Libdispatch does make use of kevent in its work loop from what I can see, so it's a decent guess. But I'm not really familiar with this part of osx internals so I cannot say for sure.

The comment is a bit hard to parse, but from the surrounding discussion it seems like

  • Mach port sets cause issues with 10.11+
  • KQueue doesn't work for non port-sets on <= 10.11
  • Hence 3 separate implementations for 10.9, 10.11, 10.12+

It's not clear why Google didn't use libdispatch on <= 10.9, but I suppose they wanted the minimal fix possible to avoid introducing any additional instability.

I'm not sure about the "fix instabilities on loading DevTool and plugins" thing. What kind of instability was seen, and how was it narrowed down to the WaitMany implementation?

The mac-specific WaitableEvent implementation seems to have been added in v62, and up to v68 officially supported 10.9 so we may not need to revert that entirely (note it's possible that it was never caught, but I never remember running into any kpanics when I was still using that).

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

note it's possible that it was never caught, but I never remember running into any kpanics when I was still using that

Because Chrome v62–68 set kPrimitive = PORT_SET on 10.9/10.10, whereas Chromium Legacy does not. That was the basis of my theory, anyway—I'm still confident that if an official version of Chrome ever led to kernel panics on an officially-supported OS, we would know about it. Someone somewhere would have reported it to Google and they would have treated it as a serious issue.

The easiest and most obvious thing would be to match what Chrome does—except, Bluebox may have had a reason for changing it in the first place. Maybe related to Lion support? (Notably, we know the KP exists on Lion because @pjpreilly has seen it and runs Lion.)

I guess the only way to find out at this point is to try... I'll see if I can find some spare cpu time to recompile Chromium this week, since BlueBox seems to still be away (or has wisely chosen to stay out of this quagmire).

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

The good news is a re-built Chromium with kPrimitive = PORT_SET on 10.7–10.10. The bad news is I triggered the would-be kernel panic in this version of Chromium. Oh well.

But now that I have Chromium building again, maybe I can experiment with some other stuff... I neglected to add the messagepumpkqueue logging Krakers suggested, for example.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

The next thing I want to try is wholesale getting rid of https://bugs.chromium.org/p/chromium/issues/detail?id=932175 and going back to the POSIX implementation. This change was made in 2019 after Mavericks support was dropped, and it clearly touches the kernel and kevent.

This is proving a bit challenging, as a lot of files were touched and things have been renamed, etc.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Okay, actual progress! I just noticed something I should have noticed earlier!

The log message I added to XNU which indicates an averted kernel panic is always immediately followed, within a second, by a crash from a Chromium Helper process! These crashes do not appear to have any user-visible effects, but they do provide a backtrace. See attached!

So, I can now say pretty confidently that the bug is somewhere in MessagePumpKqueue.

Chromium Helper_2022-03-14-112231_Jonathans-Mac-Pro.crash.zip

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

sorry for the late response, that's a very interesting find. It means that the EBADF code path is actually not supposed to happen at all, since their code only has checks for EINTR, hence the process crash. I don't know if this means that this is a case where on a 10.11 machine chromium helper process would have crashed as well, or whether it's a more fundamental bug in 10.9 mach ipc and adding the ebadf case in the kernel source just works around the issue/shifts the symptom from "stop the world kernel panic" to process crash.

Either way as you mentioned it confirms that the kpanic is triggered while one thread is waiting on kevent inside MessagePumpKqueue. It's still not clear though what's the other thread/process doing the action that causes the EBADF to be signalled though. I didn't see anything immediately obvious in the backtrace for the other threads in your dump, so it could also be one of the other chromium helper processes? Then again none of the log statements you added to the kernel got triggered so that means the EBADF is being signalled from either vm_fault.c, sync_sema.c, or ipc_mqueue.c. But even if we narrow it down to one of those I wouldn't know how to find the corresponding userspace chromium code that results in that call since all of those are pretty broad (I still bet it's in ipc_mqueue though).

I haven't looked too much into what it would take to revert the use of MessagePumpKqueue but it seems like since all the platform-specific code is abstracted as a subclass of MessagePumpForIO, can we change that to the posix-generic messagepumplibevent ? Although I see that some of the other mac-specific parts have been switched to communicate using mach-ports which means we'd need to go back and revert those too...

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

It's still not clear though what's the other thread/process doing the action that causes the EBADF to be signalled though. I didn't see anything immediately obvious in the backtrace for the other threads in your dump, so it could also be one of the other chromium helper processes?

The other thing I noticed was WaitableEvent (and specifically WaitableEvent::TimedWait) being near the top in several other threads. I still feel like that's implicated somehow, even though switching kPrimitive didn't resolve the issue.

I haven't looked too much into what it would take to revert the use of MessagePumpKqueue but it seems like since all the platform-specific code is abstracted as a subclass of MessagePumpForIO, can we change that to the posix-generic messagepumplibevent ?

Yes, that's what I was trying to do last week before I noticed the process crash! The original set of changes was documented here. I had a bit of trouble setting it back to the POSIX version, as the changes touched a number of different files which have themselves evolved since 2019, but I'd like to try again at some point.

from chromium-legacy.

blueboxd avatar blueboxd commented on August 21, 2024

Thank you for the deep inspection of panic🙇

I traced this panic with kernel debugger before, but I couldn't find the root cause of panic.
(there was nothing special about the parameters of kevent64 preceding the panic)
Maybe it is not due to a single kevent64, but rather a combination of other events in the queue or running threads...?

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

@blueboxd Hey there! As a quick catch up, I recompiled XNU so that whenever kqueue_scan_continue is called with a wait_result value of THREAD_RESTART, XNU will log a warning and return EBADF instead of panicking.

This resolves the kernel panic, and causes Chromium to merely crash instead. I attached a backtrace here. It points to Chromium's MessagePumpKqueue as the culprit.

I suspect that if we can fix that crash, the kernel panics will go away!

from chromium-legacy.

blueboxd avatar blueboxd commented on August 21, 2024

@Wowfunhappy
Thank you for the catch-up!

Stack trace of kernel debugger also indicates kevent64 in MessagePumpKqueue::DoInternalWork caused panic.
But, if parameters for kevent64 are not the root cause, how we can resolve panics/crashes...?
(simple re-writing of DoInternalWork may not resolve the issue:thinking:)
I'll try to modify the kernel for debugging too.

@aeiouaeiouaeiouaeiouaeiouaeiou
Chromium crashes are not the topic of this issue.
(this issue induces kernel panic in normal environment)
Please open a new issue with the crash log.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

It might be worth trying to figure out where the EBADF is signaled from in the kernel (since we ruled out the origins in the kevent code, that leaves only vm_fault.c, sync_sema.c, or ipc_mqueue.c). As you mentioned, it's likely some race condition somewhere between different threads that's causing ebadf to sent to the thread while inside the kernel wait loop.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

But, if parameters for kevent64 are not the root cause, how we can resolve panics/crashes...?

One thing we could try is using messagepumplibevent instead of MessagePumpKqueue. I think you'd just need to revert the set of changes described here: https://bugs.chromium.org/p/chromium/issues/detail?id=932175. The changes were a bit too complex for me, but you might have more success.

from chromium-legacy.

blueboxd avatar blueboxd commented on August 21, 2024

@krackers
A race condition is worst case scenario I had imagined...😔
OK, I'll embed debug code to the remaining candidates.

@Wowfunhappy
Ah, got it, there's a predecessor MessagePump mechanism.
I'll try to revert to the POSIX one and see what happens.

from chromium-legacy.

blueboxd avatar blueboxd commented on August 21, 2024

I tried to revert to MessagePumpLibevent, but ChannelMac, CurrentIOThread and app_shim are too deeply dependent on MessagePumpKqueue...
MessagePumpForIO cannot be changed to MessagePumpLibevent independently, so the current IPC mechanism needs to be nearly overhauled.
I think this method is a bit harsh at the moment...:pensive:

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Oh well, thanks for trying!

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

I have discovered that I can consistently trigger the problem by idling on this awful page: https://gamebanana.com/games/5866. Make sure you are not using an adblocker. There should be an autoplaying video in the lower right of the screen.

The problem usually appears within 90 minutes. I do not know whether the tab needs to be in front, but the crash can occur when the window is entirely minimized.

When Chromium was launched via the Terminal, these messages are always logged at the time of the crash:

[22670:14595:0403/121124.951621:FATAL:message_pump_kqueue.cc(442)] Check failed: . kevent64: Bad file descriptor (9)
[1965:1287:0403/121126.049808:ERROR:network_service_instance_impl.cc(978)] Network service crashed, restarting service.

I assume the first message is printed by this PCHECK. I don't think this tells us anything we didn't already know; Chromium got a bad file descriptor in MessagePumpKqueue::DoInternalWork.

However, I find the second message somewhat more notable. It looks like Network service may be the responsible party?

Note that I have observed instances of the crash in which only the first message is printed, and not the second. However, when the crash occurs on the specific page I linked above, the second message is always present.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

@krackers Does this look notable to you at all? 6a57f2b

Annoyingly, bug 640281 is private, but it has something to do with guarding against invalid file descriptors, as well as the use of kqueue message pump.

I spotted this because Google recently changed the code for modern macOS compatibility, presumably causing #63. I don't have a complete theory as to why it would only have been failing on Chromium Legacy.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

That change is from 2016 though? And adding the guards should only help things, not hurt us.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Well, I was thinking the guards might not be working for some reason. Or, we could potentially add more.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

You could trace all close syscalls I guess (maybe using dtrace) and see what fd is being used

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Okay @krackers, if you're still willing to help (I would need a lot of hand holding), I'd like to try making a kext patch. The more I look into the crash at the Chromium level, the more unfeasible fixing it there seems to be. I think there are actually a lot of different code paths within Chromium which can trigger the panic, because I keep finding command line switches that will remove the crash on one page, but not somewhere else.

You suggested starting with a kext to find the address of _kqueue_scan_continue and dumping the bytes to confirm the output is correct, using kaslr_slide from https://github.com/lilang-wu/EasyBP/blob/d2d9591417df52b94d21945efb8cea393dc46a9b/makedebugpoint/solve_kernel_symbols/kernel_info.c. However, this relies on a kernel function vm_kernel_unslide_or_perm_external which AIUI doesn't exist prior to El Capitan. Any idea how to get that address?

I also found https://github.com/leiless/ksymresolver which I think may be better, it's said to be able to resolve non-exported symbols. But it too relies on vm_kernel_unslide_or_perm_external.

(Note that I may be away from my computer this weekend, but I'm done with finals so I should have more time in general!)

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

@Wowfunhappy

doesn't exist prior to El Capitan

Interesting, thanks for pointing that out. Based on [1] as well as digging around other kexts, seems like without that function there are only 2 options for discovering the kaslr slide: either "back-read memory to find 0xfeedface" as mentioned in [1] (which I understand to be basically scanning the kernel space for the mach-o header to find the start, which seems to be what Lilu does?), or getting the runtime address of a known exported symbol (e.g. vnode_close) and subtracting it from the address of the symbol in the on-disk binary, as done in [2]. Interestingly prior to 10.11 there was a kas_info() syscall to get this info from userspace [3], although I tried searching and couldn't find a way to read vm_kernel_slide from kernel-space directly (or maybe it exists and I'm not doing it right).

I think the easiest will be to use the approach used in [2] where we basically set

#define VNODE_CLOSE_BINARY_ADDR 0xf00 // Open up kernel in hopper and put the address of the exported vnode_close symbol here
kaslr_offset = (mach_vm_address_t) &vnode_close - VNODE_CLOSE_BINARY_ADDR;

Then once we have the kaslr offset, we can do

#define KQUEUE_SCAN_CONTINUE_BINARY_ADDR 0x00f // Address of symbol in text segment as found in Hopper
kqueue_scan_continue = kaslr_offset + KQUEUE_SCAN_CONTINUE_BINARY_ADDR

uint8_t *kqueu_scan_continue_bytes = (uint8_t*) kqueue_scan_continue

Then hopefully do an IOLog("Dumping bytes at kqueue_scan_continue %02x %02x %02x", kqueu_scan_continue_bytes[0], kqueu_scan_continue_bytes[1], kqueu_scan_continue_bytes[2]) and check if the hexdump matches what you see in diassembler for the function

[1] https://www.zdziarski.com/blog/?p=6901
[2] https://github.com/cocoahuke/shrink_trackpad
[3] https://github.com/gdbinit/kextstat_aslr/

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

This blog post also shows a more programmatic way to do approach (2):

http://ho.ax/posts/2012/02/resolving-kernel-symbols/

Basically manually scanning the LC_SYMTAB section of the binary to find the address of the symbol in the binary instead of just hardcoding after finding via Hopper as I suggested. There's an attached github project you can just use. Although I personally think just hardcoding the address of the smybol in binary is cleaner since there's no need to waste cpu cycles scanning through the list when we know which one it's going to be.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Thank you! I got it working!

https://gist.github.com/Wowfunhappy/8212f5bea4c601ac9a6297789f232321

This outputs 6b 71 75, which matches what is in Hopper.

I initially used your method of getting the slide (which was invaluable, as it gave me a known-good result), but I managed to replace it with code adapted from Lilu, so I could hardcode only one address instead of two. Long term, I'd eventually like to make this work across different builds of XNU. (Long term being the key word, the address of kqueue_scan_continue is still hardcoded...)

What is the next step?

Edit: Eek, we posted at almost the exact same time!

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

Wow that was fast, nice job! Fyi you can also remove the hard-coded KQUEUE_SCAN_CONTINUE_BINARY_ADDR and discover it on-the-fly by scanning LC_SYMTAB as I mentioned in the previous post.

One thing though, the address of kqueue_scan_continue you used seems to be in the strings section, not in the text segment. I loaded it in hopper, and it seems like there's no separate kqueue_scan_continue function, I guess it was inlined during compilation.

You should use ffffff80005c79e0 instead, which is the function that contains the call to _panic we want to avoid. (Since it won't be in LC_SYMTAB either, I guess you can't avoid harcoding it after all).

It contains the line we want to avoid:

_panic("\"%s: - invalid wait_result (%d)\"@/SourceCache/xnu/xnu-2422.115.15/bsd/kern/kern_event.c:2167", "kqueue_scan_continue", LODWORD(r15), rcx, r8, r9, STK-1);

Now the next step is to find which bytes we need to patch. Recall the original switch statement was

	switch (wait_result) {
	case THREAD_AWAKENED:
		kqlock(kq);
		error = kqueue_process(kq, cont_args->call, cont_args, &count,
		    current_proc());
		if (error == 0 && count == 0) {
			wait_queue_assert_wait((wait_queue_t)kq->kq_wqs,
			    KQ_EVENT, THREAD_ABORTSAFE, cont_args->deadline);
			kq->kq_state |= KQ_SLEEP;
			kqunlock(kq);
			thread_block_parameter(kqueue_scan_continue, kq);
			/* NOTREACHED */
		}
		kqunlock(kq);
		break;
	case THREAD_TIMED_OUT:
		error = EWOULDBLOCK;
		break;
	case THREAD_INTERRUPTED:
		error = EINTR;
		break;
	default:
		panic("%s: - invalid wait_result (%d)", __func__,
		    wait_result);
		error = 0;
	}

and the decompiled-assembly is

    if (LODWORD(r15) != 0x2) {
            LODWORD(r12) = 0x23;
            if (LODWORD(r15) != 0x1) {
                    if (LODWORD(r15) == 0x0) {
                            r15 = r14 + 0x8;
                            _lck_spin_lock(r15);
                            r12 = *(rbx + 0x68);
                            _current_proc();
                            LODWORD(r12) = LODWORD(sub_ffffff80005c7390(r14, r12, rbx + 0x68, var_2C));
                            if (LODWORD(LODWORD(var_2C) | LODWORD(r12)) == 0x0) {
                                    rcx = *(r13 + 0x78);
                                    _wait_queue_assert_wait(*r14, 0x0, 0x2, rcx, r8, r9, STK-1);
                                    *(int8_t *)(r14 + 0x58) = *(int8_t *)(r14 + 0x58) | 0x2;
                                    _lck_spin_unlock(r15);
                                    _thread_block_parameter(sub_ffffff80005c79e0, r14, 0x2, rcx, r8, r9);
                            }
                            _lck_spin_unlock(r15);
                    }
                    else {
                            _panic("\"%s: - invalid wait_result (%d)\"@/SourceCache/xnu/xnu-2422.115.15/bsd/kern/kern_event.c:2167", "kqueue_scan_continue", LODWORD(r15), rcx, r8, r9, STK-1);
                            LODWORD(r12) = 0x0;
                    }
            }
    }
    else {
            LODWORD(r12) = 0x4;
    }

I was originally thinking we could try inserting a trampoline to our own code in the else of if (LODWORD(r15) == 0x0) { in order to add the additional check for EBADF, but we can also do the lazy thing and just assume that any unhandled case must be EBADF, noop the call to panic, and set the code to EBADF instead of 0x0. (Basically we'd be betting that there's no way there's no way there's another unhandled wait_result type (there are 2 remaining wait_result_t types, THREAD_WAITING, and THREAD_NOT_WAITING`, but given that even latest xnu source doesn't add anything else I guess it's a safe bet that we won't see those two types here).

If we do the former approach, we use one of the trampolining libraries I think I mentioned in one of the previous posts. Let me know if you want to do this approach, and we can go in more detail.

If we do the latter, we basically need to rewrite

ffffff80005c7a9c         lea        rdi, qword [ds:0xffffff8000766e5e]          ; "\\\"%s: - invalid wait_result (%d)\\\"@/SourceCache/xnu/xnu-2422.115.15/bsd/kern/kern_event.c:2167", argument #1 for method _panic, XREF=sub_ffffff80005c79e0+68
ffffff80005c7aa3         lea        rsi, qword [ds:0xffffff8000766ebb]          ; "kqueue_scan_continue", argument #2 for method _panic
ffffff80005c7aaa         mov        edx, r15d                                   ; argument #3 for method _panic
ffffff80005c7aad         xor        al, al
ffffff80005c7aaf         call       _panic
ffffff80005c7ab4         xor        r12d, r12d

to be mov r12d, (whatever the numeric value of EBADF is, I think its 0x9 from errno.h but please double check) followed by a bunch of nops to fill out the remaining space.

If we do this approach then it's easier (just a bit tedious though). We can do something like

uint8_t replacement_bytes = {41, BC, 09, 00, 00, 00 } // assembled mov r12d, 0x9 with padding to match existing
memcpy(kaslr_base + 0xffffff80005c7a9c, replacement_bytes, 6);
memset(kaslr_base + 0xffffff80005c7aa3, 0x90 /*noop*/, 0xffffff80005c7ab4 - 0xffffff80005c7aa3);

And then maybe dump out the bytes after replacing to make sure it's what you want.

I think we also may need to disable interrupts before doing patching. I found this code in lilu

bool MachInfo::setInterrupts(bool enable) {
	unsigned long flags;

	if (enable)
		asm volatile("pushf; pop %0; sti" : "=r"(flags));
	else
		asm volatile("pushf; pop %0; cli" : "=r"(flags));

	return static_cast<bool>(flags & EFL_IF) != enable;
}

which probably does what we want, but actually seems like apple already provides us ml_set_interrupts_enabled which is the exact same code (probably cleaner to use apple's one if it's already there): https://developer.apple.com/documentation/kernel/1593365-ml_set_interrupts_enabled

I also don't know if there is any sort of W^X in kernel land we need to bypass. I guess you can try it and if it doesn't allow us to write there'll be a kernel panic or something when we try to do so.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

Maybe take a look at Lilu Kernel patching to see if there's any other steps we need:

https://github.com/acidanthera/Lilu/blob/bd48fa7f4d9773bab1f6dfb7890b9a98ab08d829/Lilu/Sources/kern_patcher.cpp#L301

there's a MachInfo::setKernelWriting in there, looks like in addition to disabling interrupts we need to unset a write protect bit. So I guess there is indeed W^X in kernel space (makes sense). Still not sure why they decide to flip CR0 instead of using whatever kernel-space equivalent of vm_protectthere is, but I guess former is easier,

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Thank you! Should be back at my computer tomorrow...

One thing though, the address of kqueue_scan_continue you used seems to be in the strings section, not in the text segment. I loaded it in hopper, and it seems like there's no separate kqueue_scan_continue function, I guess it was inlined during compilation.

Thanks, I was actually thinking that seemed off...

Fyi you can also remove the hard-coded KQUEUE_SCAN_CONTINUE_BINARY_ADDR and discover it on-the-fly by scanning LC_SYMTAB as I mentioned in the previous post.

To confirm, this would basically be the same as how I found the slide value, right? Start with a normal exported symbol and then scan forwards until I find what I'm looking for. Since the actual kqueue_scan_continue function was inlined and the symbol doesn't exist, I guess I could just look for the hex sequence of the panic message?

but we can also do the lazy thing and just assume that any unhandled case must be EBADF, noop the call to panic, and set the code to EBADF instead of 0x0. (Basically we'd be betting that there's no way there's no way there's another unhandled wait_result type (there are 2 remaining wait_result_t types, THREAD_WAITING, and THREAD_NOT_WAITING`, but given that even latest xnu source doesn't add anything else I guess it's a safe bet that we won't see those two types here))

Totally on board with this approach!

I suppose it is worth considering... in the scenario where the kernel did send either THREAD_WAITING or THREAD_NOT_WAITING, how inappropriate would it be to send an EBADF (bad file descriptor)? Is there a more generic error we should consider returning instead? We have some options in https://www.freebsd.org/cgi/man.cgi?query=errno&sektion=2&manpath=freebsd-release-ports, assuming of course that XNU is the same.

(I do find it interesting that there are other unhandled types. Maybe that indicates this bug wasn't quite as stupid as it initially appeared? E.g., maybe Apple had a reason to think this shouldn't ever happen, and didn't just blatantly forget to handle a case.)

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

Start with a normal exported symbol and then scan forwards until I find what I'm looking for.

I guess I could just look for the hex sequence of the panic message?

The panic message is unfortunately also in the strings segment. So I'm not really sure what's the best way to discover the offset we want to patch at runtime. The best option just seems to be hardcoding the offset (discovered via xref using hopper). Alternatively you could try scanning for the specific instructions we want to replace, but I don't know if that's really any better than hardcoding since there's no guarantee the instructions would be the same between kernel versions (e.g. in another version compiler might have reordered the switch statement differently, or used some other register instead of r12d for return code).

in the scenario where the kernel did send either THREAD_WAITING or THREAD_NOT_WAITING

I think EBADF should probably be fine there, since the userspace code can handle it as if the syscall failed. But anyway from what I can tell it's not something that should ever happen anyway for 2 reasons:

  1. The empirical evidence that we've never seen a kernel panic implicating that type, and Apple also hasn't updated recent kernels to add it as a case

  2. Looking at the code flow, it seems like if we enter the kqueue_scan_continue the thread must have been put into a waiting state and now it's not waiting anymore due to some interrupt, context switch, etc., which means that neither THREAD_WAITING or THREAD_NOT_WAITING would be valid values

maybe Apple had a reason to think this shouldn't ever happen, and didn't just blatantly forget to handle a case

Yes it's probably that. Maybe the invariant was true at some point but later it became violated. That's why I think usually doing exhaustive switch statement is better since it forces you to confront all posibilities rather than sweeping under the rug with default. And forces you to document all the implicit assumptions for "unrechable" code.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

@krackers So I'm here at the moment:

https://gist.github.com/Wowfunhappy/8212f5bea4c601ac9a6297789f232321

When I comment out lines 55 and 56 (to dump the bytes again and see if they changed), the kext loads fine. However, with those lines—to see if it actually did anything—it KP's.

I haven't done the W^X thing yet because I figured I'd try without it first—but, why would the kernel not panic unless I try to actually print the bytes I modified? If you think write protection is the problem, I'll do that next, as long as the rest of the code so far looks correct to you.


I'm still trying to get rid of hard coded addresses...

  • Can we dynamically find the size of the procedure, instead of memset'ing 0xffffff80005c7ab4 - 0xffffff80005c7aa3?
  • This might be stupid, but how difficult would it be to find the function address the same way I can in Hopper? So, presumably it should be possible to scan for the panic message in the strings segment, and then find the code that references that message—right? Any idea what that might look like?

I care about this because I eventually want to integrate the kext into my PrefPane. If I can't find the offset dynamically, my backup plan is to grab them from hopper for a bunch of different kernels (at minimum, the final releases of Lion, Mountain Lion, and Mavericks plus Bronya's Ryzen builds), and select the correct one at runtime by seeing which address has the bytes I expect.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

You want to do

   memcpy(kqueue_scan_continue_panic_location, replacement_bytes, 6);
   memset(kqueue_scan_continue_panic_location + 7, 0x90 /*noop*/, 0xffffff80005c7ab4 - 0xffffff80005c7aa3);

Reason is that kqueue_scan_continue_panic_location is already a ptr (vm_offset_t is typedef'd to a ptr type). If you do memcpy to &kqueue_scan_continue_panic_location you are actually overwriting the value of kqueue_scan_continue_panic_location itself, not writing to the location pointed to by it. Or in other words, you are changing value in the stack instead of actually modifying text segment (possibly smashing your stack as well if the number of bytes written is more than allocated for this frame). That is why you get segfault when you try printing subsequently, because you've now rewritten kqueue_scan_continue_panic_location to the value 0x90909090 or something like that, and it's probably pointing to garbage so dereferencing via *kqueue_scan_continue_panic_location creates segfault.

You also need to flip the CR0 write protect bit via the previously mentioned function.

Also btw you probably want to print out the entire basic block after modifying, not just the first few values.


Can we dynamically find the size of the procedure

Yes, in principle. In general what we care about is the size of the basic block, and I suppose if you pulled in some disassembler (there are simple x64 opcode diassemblers) you can scan forward until you find an BB edge (jmp, jne, je, etc.). I think for this specific case it should suffice to scan forward until you see an xor after a call to panic (still need to pull in an x64 disassembler though)

but how difficult would it be to find the function address the same way I can in Hopper

Simple in principle, but maybe tedious in practice. You basically just do same way Hopper (or any static analysis tool does it), by scanning instructions and finding lea with address operand that references the string. But to do this we need to pull in disassembler, and maybe a bit of mach-o header parsing (since x64 is all relative addressing so we need to know the offset between text and data segment.. it's probably fixed, but I don't know for sure). There are some simple library dissassemblers (e.g. Zydis) which you can use for this.

Maybe it's fun to play around with, but I personally am lazy and think it's easier to just have the user disassemble their kernel and recompile the kext with the new constants. Reason is that even if you are able to get away with discovering the basic block size and offset at runtime, it's still not guaranteed that return value will be in r12d. So you also need to discover that at runtime, and basically pattern match for something like

lea _,, <panic string addr>
...
call       _panic
xor        retVal, retVal

I suppose technically you could try just pattern matching the assembled opcodes directly, which means you don't need to bring in a disassembler. But then you need to check how the instructions are assembled (e.g. if we want to match xor r12d, r12d or more generically xor-ing some register to zero, figure out which bits in the assembled instruction 45 31 e4 represent the registers and which is the xor opcode. x64 is irregular enough (to me at least) though that this seems even more brittle.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

Oh hah apparently vm_offset_t is actually typedef'd to uintptr_t which isn't really a pointer type, just an integer type. So I guess you need to cast it (to void*), since we want to let the compiler you actually want to use it as a pointer.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Alright, no finding offsets dynamically! :)

Aaaaand... I think I've done it! @krackers can you look over this?

https://gist.github.com/Wowfunhappy/8212f5bea4c601ac9a6297789f232321

Calling out a couple of things in particular:

  • I added an additional 0x00 to the end of the replacement_bytes you gave me, it looked right based on hopper and was needed for the numbers to line up. Does this seem okay?
  • I have two explicit calls to panic(), in places where I didn't want to risk leaving the kernel in an unknown state (e.g. without write protection), do these seem reasonable to you? How is the level of defensiveness in general?

This definitely builds and is replacing the memory, so I'm going to give it a try on my own (non-virtual) machine. I've also attached the kext—but use at your own risk!

kqueue_scan_continue-patch.kext.zip

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

I added an additional 0x00 to the end of the replacement_bytes you gave me

I don't think we should do that? The address for the individual instruction isn't supposed to line up because lea rdi, qword is different instruction size than mov r12d, 0x9. If you throw mov r12d, 0x9 in an assembler the instruction should only be 6 bytes. We add the noops afterwards to fill out the rest of the basic block though, so nothing else gets affected (i.e. the address alignment is restored for everything after the basic block we're patching).

If you added the 00, then the instruction stream is now parsed as

| 0x41, 0xBC, 0x09, 0x00, 0x00 | 0x00 | 0x90 | 0x90 |

but then the extra 0x00 is left over after decoding, and that doesn't seem to be valid x64 opcode. I'm quite surprised it didn't panic when you did that...

I guess my comment assembled mov r12d, 0x9 with padding was misleading, there's no padding in there, that is the actual assembled form of mov r12d, 0x9. (Maybe I had mistakenly thought r12d was a 64-bit register since the fact that dword is actually only 32-bits always trips me up). The actual version with padding using an imm64 instead of imm32 would be mov r12, 0x9 which is actually 7 bytes long, so maybe you can use that if you really want instruction-level alignment).


Other than that, nice job! Level of defensiveness seems good, and checking for a match on the expected bytes before replacing is a good idea. Fyi you don't need to define a uint8_t *kscpb2 for the second print bytes, the kscpb pointer from before is still valid and you can just use that.

I don't think your strategy of looking for search_bytes will work in other versions though since the lea operand references the string in DS relative to RIP, and that's pretty much guaranteed to not be the same between versions.

Also while the replacement logic seems fine to me, probably better to double check by dumping memory from a few bytes before kqueue_scan_continue_panic_start_location to a little after kqueue_scan_continue_panic_end_location, then throw those bytes in a disassembler. You should expect that the bytes right before kqueue_scan_continue_panic_start_location are untouched, the bytes right after kqueue_scan_continue_panic_end_location are untouched, and the middle bytes are as expected.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

Ah the reason why it didn't panic when you did that is probably because it hadn't hit it yet. If we did begin executing that codepath, willing to bet that you'd have seen a kpanic.

Once you fix that it should work though, but probably better to do the memory dump mentioned above to make sure there's no off-by-one issues anywhere. You probably have to wait to see if it works or not, checking the chromium crash logs (since you mentioned that with patched kernel returning EBADF, no kpanic but chromium process still crashes).

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Ah the reason why it didn't panic when you did that is probably because it hadn't hit it yet. If we did begin executing that codepath, willing to bet that you'd have seen a kpanic.

Yep! I came here to say that I'd gotten a kernel panic. :)

For now as a very quick fix, I've made the last replace_byte 0x90 instead of 0x00. The reason I thought that instruction had to be seven bytes was because otherwise I couldn't seem to end up with the right amount for extra_space_to_fill, when I tried to calculate it with only the start, end, and replacement instructions as hardcoded. But I don't remember exactly what I was thinking at the time, and I'm too tired now... 😴

Just waiting for the crash log...

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Alright, after another day of futzing with it, I can confirm definitively that this works on 10.9.5 (XNU 2422.115.15). I got the "Bad File Descriptor" crash log from Chromium.

kqueue_scan_continue-tune-2022.06.05.zip

https://gist.github.com/Wowfunhappy/8212f5bea4c601ac9a6297789f232321

Note that this kext uses the bundle identifier foo.tun. This allows it to load from /Library/Extensions despite not being codesigned.

In theory, this should also work on fully-updated copies of Lion (XNU 1699.32.7) and Mountain Lion (XNU 2050.48.19).

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

Nice, looks good! Fyi in terms of style, to make the main code easy to follow you could factor out all the constants and helpers into the .h file. Also you can probably take advantage of clang blocks to refactor the flow into something much more readable like:

kern_return_t kqueue_scan_continue_tune_start(kmod_info_t * ki, void *d) {
    vm_offset_t kqueue_scan_continue_panic_start_location = 0;
    vm_offset_t kqueue_scan_continue_panic_end_location = 0;
    if (!get_kqueue_scan_continue_locs(&kqueue_scan_continue_panic_start_location, &kqueue_scan_continue_panic_end_location)) {
        return KERN_FAILURE;
    }

    unsigned long extra_space_to_fill = kqueue_scan_continue_panic_end_location -
                                        kqueue_scan_continue_panic_start_location - sizeof(replacement_bytes);
    

    assert(kqueue_scan_continue_panic_start_location + sizeof(replacement_bytes) + extra_space_to_fill == kqueue_scan_continue_panic_end_location);


    bool succ = do_with_interrupt_disabled(^bool() {
       return do_with_wp_disabled(^bool() {
            memcpy((void *)kqueue_scan_continue_panic_start_location, replacement_bytes, sizeof(replacement_bytes));
            memset((void *)kqueue_scan_continue_panic_start_location + sizeof(replacement_bytes), 0x90 /*nop*/, extra_space_to_fill);
            printf("kqueue_scan_continue-tune: Memory rewritten\n");
            return true;
        });
    });

    if (!succ) {
        return KERN_FAILURE;
    }

    printf("Post-patch bytes: ");
    dump_bytes((void*) kqueue_scan_continue_panic_start_location, 40);
}

where you have the following helpers defined in the .h

void dump_bytes(void *start, int num_bytes) {
    for (int i = 0; i < num_bytes; i++) {
        printf("%02x ", ((uint8_t*) start)[i]);
    }
    printf("\n");
}

bool get_kqueue_scan_continue_locs(vm_offset_t *start_addr_out, vm_offset_t *end_addr_out) {
    vm_offset_t kernel_base = get_kernel_base();
    vm_offset_t kqueue_scan_continue_panic_start_location = 0;
    vm_offset_t kqueue_scan_continue_panic_end_location = 0;
    char search_bytes[sizeof(possible_search_bytes[0])];
    char replacement_bytes[sizeof(possible_replacement_bytes[0])];
    uint8_t *kscpb = NULL;
    
    for (int i = 0; i < LENGTH(possible_kqueue_scan_continue_panic_start_locations); i++) {
        kqueue_scan_continue_panic_start_location = kernel_base + possible_kqueue_scan_continue_panic_start_locations[i];
        kqueue_scan_continue_panic_end_location = kernel_base + possible_kqueue_scan_continue_panic_end_locations[i];
        memcpy(search_bytes, possible_search_bytes[i], sizeof(search_bytes));
        memcpy(replacement_bytes, possible_replacement_bytes[i], sizeof(replacement_bytes));
        
        kscpb = (uint8_t*) kqueue_scan_continue_panic_start_location;
        if (memcmp(kscpb, search_bytes, sizeof(search_bytes)) == 0) {
            break;
        }
        if (i == LENGTH(possible_kqueue_scan_continue_panic_start_locations) - 1) {
            printf("kqueue_scan_continue-tune: Memory region not found. You are probably using an unsupported kernel, or your kernel has already been patched.\n");
            return false;
        }
    }
       
    printf("kqueue_scan_continue-tune: Pre-Patch: Bytes at kqueue_scan_continue panic location" );
    dump_bytes(kscpb, 40);

    *start_addr_out = kqueue_scan_continue_panic_start_location;
    *end_addr_out = kqueue_scan_continue_panic_end_location;
    return true;
}

bool checked_set_interrupt_enabled(bool newVal) {
    printf("kqueue_scan_continue-tune: Set interrupt enabled to %d\n", newVal);
    ml_set_interrupts_enabled(newVal);
    bool succ = ml_get_interrupts_enabled() == newVal;
    if (!succ) {
        printf("kqueue_scan_continue-tune: Failed to set interrupt status to %d\n", newVal);
    }
    return succ;
}

bool checked_set_wp(bool newVal) {
    printf("kqueue_scan_continue-tune: Set CR0 write protect to %d\n", newVal);
    set_cr0(newVal ? get_cr0() | CR0_WP : get_cr0() & ~CR0_WP);
    bool succ = write_protection_is_enabled() == newVal;
    if (!succ) {
        printf("kqueue_scan_continue-tune: Failed to set write protect status to %d\n", newVal);
    }
    return succ;
}

bool do_with_interrupt_disabled(bool (^func)(void)) {
    boolean_t interrupts_were_enabled = ml_get_interrupts_enabled();
    if (interrupts_were_enabled && !checked_set_interrupt_enabled(false)) {
        return false;
    }

    bool succ = func();

    if (interrupts_were_enabled && !ml_get_interrupts_enabled() && !checked_set_interrupt_enabled(true)) {
        panic("kqueue_scan_continue-tune: Failed to re-enable interrupts!\n");
    }
    return succ;
}

bool do_with_wp_disabled(bool (^func)(void)) {
    boolean_t write_protection_was_enabled = write_protection_is_enabled();
    if (write_protection_was_enable && !checked_set_wp(false)) {
        return false;
    }

    bool succ = func();

    if (write_protection_was_enabled && !write_protection_is_enabled() && !checked_set_wp(true)) {
        panic("kqueue_scan_continue-tune: Failed to re-enable write protection!\n");
    }
    return succ;
}

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

Also can you explain more about

Note that this kext uses the bundle identifier foo.tun. This allows it to load from /Extra/Extensions despite not being codesigned.

I didn't even know /Extra/Extensions was a valid place that osx loaded kexts from, nor that it ignores codesigning for tun/tap driver. Do you have a link where I can read more about this?

Also I was never really sure about 10.9 kext codesign requirements. I can manually kextload an unsigned kext fine, but it gives a warning. I haven't tried installing to /Library/Extensions but I assume it would fail to load in that case? But /System/Library/Extensions will accept unsigned? Does ad-hoc signature suffice for /Library/Extensions?

Edit: are you sure /Extra/Extensions works for non-hackintosh? kextd source doesn't seem to show that as a possible location https://opensource.apple.com/source/IOKitUser/IOKitUser-1445.40.1/kext.subproj/OSKextPrivate.h

Btw kext loading is done via kextd which is open soruce. As described in https://reverse.put.as/2013/11/23/breaking-os-x-signed-kernel-extensions-with-a-nop/ seems like codesigning is indeed enforced only for /Library/Extensions



OSStatus  sigResult = checkKextSignature(theKext, true);
--
  | if ( sigResult != 0 ) {
  | if ( isInLibraryExtensionsFolder(theKext) \|\|


Also looks like ad-hoc won't work since it checks specific chain



/* set up correct requirement string.  Apple kexts are signed by B&I while
--
  | * 3rd party kexts are signed through a special developer kext devid
  | * program
  | */
  | myCFString = OSKextGetIdentifier(aKext);
  | if (CFStringHasPrefix(myCFString, __kOSKextApplePrefix)) {
  | requirementsString = CFSTR("anchor apple");
  | }
  | else {
  | /* DevID for kexts cert
  | */
  | requirementsString =
  | CFSTR("anchor apple generic "
  | "and certificate 1[field.1.2.840.113635.100.6.2.6] "
  | "and certificate leaf[field.1.2.840.113635.100.6.1.13] "
  | "and certificate leaf[field.1.2.840.113635.100.6.1.18]" );
  | }


from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

I didn't even know /Extra/Extensions was a valid place that osx loaded kexts from, nor that it ignores codesigning for tun/tap driver. Do you have a link where I can read more about this?

Oops! When I said /Extra/Extensions I meant /Library/Extensions, I just wrote the wrong thing. (/Extra/Extensions is indeed a dumb Hackintosh convention.)

As you noted, Mavericks will load unsigned kexts from any location except /Library/Extensions, aka the one place where third party kexts logically should live. It's frustrating!

Not only will ad-hoc codesigning not work, but even a paid developer account is not sufficient, you need to get special approval by Apple.

But, there's a list of bundle identifiers that are allowed to live in /Library/Extensions and be unsigned: System/Library/Extensions/AppleKextExcludeList.kext/Contents/Info.plist

But, now I'm thinking I may let the kext live in /Library/Extensions, but have a launchdaemon that copies it to a temporary location before loading it, instead of usurping an old bundle identifier...

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

But, there's a list of bundle identifiers that are allowed to live in /Library/Extensions and be unsigned

I see,AppleKextExcludeList.kext contains both a list of blacklisted kexts which are prevented from loading

https://github.com/st3fan/osx-10.9/blob/master/xnu-2422.1.72/libkern/c++/OSKext.cpp#L4643

as well as as a section on whitelisted kexts that are allowed to skip the codesign check:

https://github.com/st3fan/osx-10.9/blob/master/kext_tools-326.1.12/security.c#L1192

Neat, thanks for teaching me something new.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

@Wowfunhappy actually, here's a good workaround. Invalid signature is only warning for kext in /SLE, so you can inform user to manually update the whitelist plist and then place in /LE. This seems like a clean option, better than installing the kext in /SLE directly I guess.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

@krackers Yeah, so I debated which option was the least hacky:

  1. Installing the kext in a non-standard directory.
  2. Installing the kext in /System/Library/Extensions.
  3. Installing the kext in /Library/Extensions and adding a new bundle identifier to AppleKextExcludeList.kext.
  4. Installing the kext in /Library/Extensions and usurping a pre-existing bundle identifier inside AppleKextExcludeList.kext.

Yesterday, I chose option 4, because I don't think anything with the bundle identifier foo is in widespread use. (The list must have been automatically generated somehow, as it also whitelists a bunch of Hackintosh kexts.)

However, I'm actually going with a fifth option that didn't originally occur to me—store the kext in /Library/Extensions/, but have a launchdaemon copy it to /tmp/ before loading. It turns out that non-IOKIt kexts in /Library/Extensions/ aren't autoloaded anyway, so a launchdaemon is needed regardless.


Thanks for the style notes. I'm think I'm going to leave it be for now, there's not a lot of code and it feels naturally procedural, so I actually prefer having it all in one place.

Anyway, unless I discover a new issue I'm going to try to close the book on this for now (partly because I need to mentally focus on other work).


@blueboxd As a quick recap, we've created a kext that fixes this problem on 10.7.5 (all security updates applied), 10.8.5 (12F2560), and 10.9.5 (13F1911) by modifying the procedure in kernel memory to return EBADF instead of panicking. (Other releases are not currently supported.)

If I don't discover any new issues in the next week or so, I'm going to add this to my Preference Pane, so it gets installed automatically for anyone who uses that. I don't know if it makes sense to incorporate this into the Chromium Legacy app directly, but you are of course welcome to it.

from chromium-legacy.

blueboxd avatar blueboxd commented on August 21, 2024

@Wowfunhappy Wow!! thank you for the great work!!
I'll try to find a way to integrate this mitigation into Chromium.app.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Quick note that the kext doesn't work in 32-bit Lion. I briefly looked into adding support today, but from what I can tell, my method of finding the KASLR slide doesn't work, and I don't want to rewrite the entire thing.

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

@Wowfunhappy Wouldn't you have to rewrite (or at least recompile) anyway because in 32-bit mode the mach-o magic is different and header struct layout probably also differs? The approach of scanning memory to look for 0xfeedface should still work though...

Btw I was surprised that 64-bit chromium worked on 32-bit xnu, but then I remember I watched a presentation on how 32-bit xnu had support for running 64-bit userspace processes, so maybe less surprising. I don't remember the presentation though, I'll try to find it since I forget exactly how they supported this.

Edit: Found it, it was a CCC talk: https://www.youtube.com/watch?v=-7GMHB3Plc8
Also found this article: https://appleinsider.com/articles/08/10/28/road_to_mac_os_x_snow_leopard_64_bit_to_the_kernel

from chromium-legacy.

fhgwright avatar fhgwright commented on August 21, 2024

@fhgwright

Please don't. Kernel patches should never be installed without the user's express permission. Patching the kernel, either in an application, or in something which represents itself as an application updater, is a really extreme violation of the principle of least surprise.

The preference pane does ask for permission!

Screen Shot 2022-09-17 at 8 34 57 PM

The installation is "automatic" in the sense that, if the user says yes, it's done in one click.

Let me know if you think the above message isn't clear enough. I am trying to be respectful of the user. However, I don't think leaving users with kernel panics is particularly respectful either!

I've never seen that dialog box (except here), nor is the kext present here, so I assumed that you hadn't yet implemented the preference-pane hack. But maybe the preference pane doesn't update itself.

It should let you know whether the patch is already present, and offer an uninstall option as well. But both goals could be mostly served by giving the full path of the extension in the text.

While it's true that there's a kernel bug at work here, as a practical matter, coming up with a userspace workaround would be much more user-friendly

I 100% agree! At the moment, however, I don't think anyone working on Chromium Legacy has the time or expertise to come up with a userspace fix. This is the best that krackers and I can do. If you have the expertise to fix the problem in userspace, please contribute, that would be wonderful!

there must be something in Chromium which changed from an "old way" to a "new way" of doing something.

My current best guess (which could definitely still be wrong) is that it's https://bugs.chromium.org/p/chromium/issues/detail?id=932175

How much of an advantage is there in the "new way"? I.e., how important is it to use the "new way" when it doesn't cause trouble, rather than reverting unconditionally?

It sounds like it's not such a huge advantage. Most users are not exhausting file descriptors. However...

How wide is the scope of the difference? I.e., if there were a patch to revert to the "old way", how maintainable would such a patch be?

Wide enough that Bluebox wasn't able to do it. #44 (comment)

But if it's true that the change involved moving away from using FDs for signaling due to FD exhaustion issues, then given that Chrome also runs on Linux, the "old way" must still be present (and fully supported) in the Linux build.

I doubt that I'd have trouble with FD exhaustion here anyway, since I crank up the limit for other reasons, and I'm not the sort to keep hundreds of browser tabs open simultaneously.

And a "proper" kernel update should involve building from source, rather than applying some kludgy on-the-fly binary patch.

Already done!

#44 (comment)

But that didn't match the release kernel, and the discrepancy was never investigated.

This kernel is for 10.9.5, but you could easily apply the fix to whichever kernel you'd like. It's literally a three line change.

The problem is that the open source release of XNU is incomplete. Most notably, custom kernels are unable to log into iMessage. This is why we created a kext to patch the memory at runtime instead.

Until a matching kernel can be built from source, I wouldn't blame the iMessage issue on custom kernels per se.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

I've never seen that dialog box (except here), nor is the kext present here, so I assumed that you hadn't yet implemented the preference-pane hack. But maybe the preference pane doesn't update itself.

Correct, it does not update itself.

It should let you know whether the patch is already present, and offer an uninstall option as well.

While this does not currently happen, uninstalling the PrefPane (via the provided uninstall script) will also uninstall the kext. In addition, the kext is installed in /Library/Extensions/, the standard location for third-party kernel extensions.

But that didn't match the release kernel, and the discrepancy was never investigated.

We did investigate. It does not match the release kernel because the release kernel is not open source. Which is why we went with a kernel extension instead.

Until a matching kernel can be built from source, I wouldn't blame the iMessage issue on custom kernels per se.

The open source version of XNU is explicitly missing a function required by iMessage. This was actually documented by the Hackintosh community, I just wasn't aware previously.

given that Chrome also runs on Linux, the "old way" must still be present (and fully supported) in the Linux build.

Yes, but the Linux build uses Linux-specific code that won't work on XNU, so it's harder than it seems. But again, by all means, please give it a shot. I don't know if it will work, but it might, and I'd love to see this fixed within Chromium Legacy itself. It's just beyond my own ability.

from chromium-legacy.

hoangbv15 avatar hoangbv15 commented on August 21, 2024

Not sure if this will be valuable for you but I reproduced this on 10.9.4 and here are the logs.
I was leaving https://piped.mha.fi on, intending to type something but haven't decided, so no interaction with the machine yet. And my Macbook Pro 2012 kernel paniced.

If it's not helpful please let me know and I will remove the comment.

Kernel_2023-01-14-184339_Hoangs-MacBook-Pro.panic.log

Anonymous UUID:       58645209-3F22-8B42-01B7-0C62519750C7

Sat Jan 14 18:43:39 2023
panic(cpu 0 caller 0xffffff80071c6bb4): "kqueue_scan_continue: - invalid wait_result (3)"@/SourceCache/xnu/xnu-2422.110.17/bsd/kern/kern_event.c:2167
Backtrace (CPU 0), Frame : Return Address
0xffffff81f2053ef0 : 0xffffff8006e22f79 
0xffffff81f2053f70 : 0xffffff80071c6bb4 
0xffffff81f2053fb0 : 0xffffff8006ed7417 

BSD process name corresponding to current thread: Chromium Helper 
Boot args: kext-dev-mode=1

Mac OS version:
13E28

Kernel version:
Darwin Kernel Version 13.3.0: Tue Jun  3 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64
Kernel UUID: BBFADD17-672B-35A2-9B7F-E4B12213E4B8
Kernel slide:     0x0000000006c00000
Kernel text base: 0xffffff8006e00000
System model name: MacBookPro9,1 (Mac-4B7AC7E43945597E)

System uptime in nanoseconds: 4066641044880
last loaded kext at 3901636049533: com.apple.driver.AppleIntelMCEReporter	104 (addr 0xffffff7f8924c000, size 49152)
last unloaded kext at 3966662167467: com.apple.driver.AppleIntelMCEReporter	104 (addr 0xffffff7f8924c000, size 32768)
loaded kexts:
...

from chromium-legacy.

krackers avatar krackers commented on August 21, 2024

Just to confirm, is this with Wowfunhappy's kext installed? That should that should hot-patch the kernel to avoid these panics (instead it will "merely" crash just the chromium renderer instead of bringing down your entire system).

Also fyi 10.9.4 is not the latest version of mavericks, you should probably be on 10.9.5.

from chromium-legacy.

Wowfunhappy avatar Wowfunhappy commented on August 21, 2024

Just to confirm, is this with Wowfunhappy's kext installed? That should hot-patch the kernel to avoid these panics

Not if they're on 10.9.4 it won't! I'd need to add memory addresses for that kernel.

Is there a reason this system is on 10.9.4 instead of 10.9.5? If there's something "special" about 10.9.4 in particular, I could add support for it.

from chromium-legacy.

hoangbv15 avatar hoangbv15 commented on August 21, 2024

from chromium-legacy.

hoangbv15 avatar hoangbv15 commented on August 21, 2024

from chromium-legacy.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.