Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PDB] Multiple versions of the same class are created and wrong one is used by default #7101

Open
SmileyAG opened this issue Oct 25, 2024 · 1 comment
Assignees
Labels
Feature: PDB Status: Triage Information is being gathered

Comments

@SmileyAG
Copy link

SmileyAG commented Oct 25, 2024

Describe the bug

When importing some DLLs with their PDBs, Ghidra creates multiple versions of the same class names, all different from each other. By default, the wrong version of the class is used, which makes everything problematic to read.

To Reproduce / Expected behavior

  1. Import the binaries in the attached archive from the 1_0 or 1_1 folder (when importing other versions, i.e. 1.2 and newer, this does not occur, but I included them in the archive just in case. For my testing I used binaries from the 1_1 folder!!!).
  2. Look at CBasePlayer and its copies. The real CBasePlayer has a size of 0x26A0 (1_1 version), in the GIFs below, my research from additional context column and decompile in IDA Pro confirms this. Some classes also have the same issue (you can find them by simply typing .conflict substring into Data Type Manager window).
    If it can help you somehow, then the hierarchy of the same CBasePlayer is as follows: CBaseEntity -> CBaseDelay -> CBaseAnimating -> CBaseToggle -> CBaseMonster -> CBasePlayer

Screenshots

find-cbaseplayer
some-offsets-stored-in-code-and-i-use-it-for-proof

Attachments

application.log
cryoffear-binaries-server.zip

Environment (please complete the following information):

Additional context

I know that the issue title may not sound quite correct, so feel free to edit it. I'll try to be as brief as I can.

I managed before to import a few dozen PDBs from other different games that were compiled on different MSVC versions and linker flags, and in all this time I encountered such a bug only in two DLLs. And the most interesting thing is that both of these DLLs are the first two versions of the same game modification.
But in next versions of that game modification, this did not happen (but the versions as a whole did not differ from each other significantly, rather bug fixes and the addition of a couple of small features), so I have an assumption that perhaps some options were added or removed used in the linker for newer patches, due to which Ghidra perceives PDB reading better? Their toolchain remained the same judging by linker version from dumpbin /headers (MSVC 7.10 aka .NET 2003).

For context on why I was looking for this via the player function in the first GIF to get the size of the class, see the following lines from the source code:
https://github.com/HLSources/Paranoia/blob/c2d8018e3fa9de209ba76c0bbe8491bc09d8d797/code_src/dlls/player.cpp#L342
https://github.com/HLSources/Paranoia/blob/c2d8018e3fa9de209ba76c0bbe8491bc09d8d797/code_src/dlls/util.h#L90-L93
https://github.com/HLSources/Paranoia/blob/c2d8018e3fa9de209ba76c0bbe8491bc09d8d797/code_src/dlls/cbase.h#L887-L908
https://github.com/HLSources/Paranoia/blob/c2d8018e3fa9de209ba76c0bbe8491bc09d8d797/code_src/dlls/cbase.h#L341-L345
https://github.com/HLSources/Paranoia/blob/c2d8018e3fa9de209ba76c0bbe8491bc09d8d797/code_src/dlls/util.h#L102-L107

And fortunately, in the Half-Life code (and the modification is based on it), for the save-restore system, the developers save some fields and their offsets into the game code using offsetof and macros:
https://github.com/HLSources/Paranoia/blob/c2d8018e3fa9de209ba76c0bbe8491bc09d8d797/code_src/engine/eiface.h#L367-L384
https://github.com/HLSources/Paranoia/blob/c2d8018e3fa9de209ba76c0bbe8491bc09d8d797/code_src/dlls/player.cpp#L88
This will serve as an additional assistant in our case, which will confirm whether the offset of a given field matches what is specified in the class or not. I showed this in the example in the second GIF.

Based on the same GUID from DLL and PDB match, the modification date of both files is the same and the fact that they were supplied with the modification by the developer and updated with each new patch - there is no doubt to think that all this is genuine. In IDA Pro, the correct and only version of the class is determined, I can say this based on the size of the class itself and the offsets of the fields to the member variables, comparing with those that are saved in the game code.

@ghizard
Copy link
Contributor

ghizard commented Oct 28, 2024

This is another tough issue, and just wanted to give you an initial reply. I have a related internal issue to re-investigate the mechanism used for mapping forward references to structure definitions. However, that whole mechanism stinks in general, but has been instrumental to getting PDB Universal working. The issue is that, in many places, forward references are used, such as with a pointer to a specific structure. The forward reference is just an empty version of the named structure. Something has to match this forward reference to the actual definition.

Normally, we should have a complete picture. A function might have an argument that is the pointer to struct A. Internal to the structure, there might be access to offsets within A but no real definition to A. In this case, we would need to rely on data type propagation from the calling function (and possibly further upstream from it). If a function has a local struct A, then we would know the precise definition. If the function only has a pointer to A and then does an allocation (new) as in your example, then we might have a chance to limit the definitions (like you did) to a shorter list of those that have the same size (that is unless there is a flexible array at the end of the structure). If we had a complete picture of all data type usage (to include locals and parameters), we would have a better chance of determining the correct definition to use at any place using any of these techniques. I cannot say for sure that this would fix everything, but I feel that it would go a long way.

The specific issue in your 1_1 (the only one I've looked at so far) examples is that there is only one forward reference to CBasePlayer, but 18-20 different definitions. What the "mapper" does, is to try to map one forward reference to one definitions; I would normally see the same number of each from examples I had built. And I had normally always tried to do clean builds. But when I would forget to do a clean build, I would encounter what we see in this 1_1 example: multiple definitions, where generally, the older definitions do not matter, but they still exist within the PDB. However... there are times where multiple different definitions could be found within a program, so we cannot pick just one and run with it either.

I'd like to fix this issue, but it would likely take some time and rely on other fixes, such as for #7099. I wish I could give you the best guidance for before continuing to work. I think you might be able to fix things by hand by noting the size issues (advise to check for flex arrays). You also might find that using the PDB MSDIA solution might work (I have not yet tried).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature: PDB Status: Triage Information is being gathered
Projects
None yet
Development

No branches or pull requests

2 participants