Description
Describe the bug
Reading or writing PDB data whose atomic charges follow the PDB v3.3 format is currently not possible.
The PDB format specifies the charge notation explicitly:
Columns 79 - 80 indicate any charge on the atom, e.g., 2+, 1-. In most cases, these are blank.
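To illustrate the notation, here is a rough sketch of how the two-character field could be interpreted. The helper name `parse_pdb_charge` and the synthetic record are my own illustration and are not part of biopandas.

```python
from typing import Optional


def parse_pdb_charge(field: str) -> Optional[int]:
    """Convert a PDB charge field such as '2+' or '1-' to a signed int.

    Hypothetical helper for illustration only. Blank fields map to None.
    """
    text = field.strip()
    if not text:
        return None
    # The spec examples are one digit followed by a sign, e.g. '2+' or '1-'.
    if len(text) != 2 or text[0] not in "0123456789" or text[1] not in "+-":
        raise ValueError(f"unexpected charge field: {field!r}")
    magnitude = int(text[0])
    return -magnitude if text[1] == "-" else magnitude


# Synthetic 80-character record: only the charge columns are filled in.
record = "ATOM".ljust(78) + "1-"
# Columns 79-80 (1-based) correspond to the slice [78:80] (0-based).
print(parse_pdb_charge(record[78:80]))
```

Note that the 1-based columns 79 - 80 map to the 0-based slice `[78:80]`, which matches the `"line": [78, 80]` entry in the charge definition.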
Since the charge column is typed as float, and (a) these columns are usually blank and (b) strings like 2+ cannot be parsed as float, the type conversion fails and the entire charge column is filled with NaN values.
Writing charge values also doesn't give the expected result, because the formatter is specified as +2.1f (the same goes for anisou entries). In my opinion this doesn't match the PDB format at all, even when using float values.
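Both problems can be reproduced in plain Python, independent of biopandas. This is only a sketch of the underlying behavior; `try_float` is a hypothetical helper and I'm not claiming it mirrors how biopandas performs the conversion internally.

```python
def try_float(field: str):
    """Return float(field), or None when the text is not a valid float."""
    try:
        return float(field)
    except ValueError:
        return None


# Reading: neither a PDB-style charge nor a blank field is a valid float.
print(try_float("2+"))  # None
print(try_float("  "))  # None

# Writing: '+2.1f' always emits a sign and one decimal place, so the result
# is four characters wide and cannot fit into the two charge columns.
written = format(2.0, "+2.1f")
print(repr(written), len(written))  # '+2.0' 4
```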
Steps/Code to Reproduce
The following is an MWE for reading a PDB file with charged atoms. Since the mismatch in formats for writing charge values should be clear from the explanation above, I'll omit an example for it (but I can provide one later if needed).
from biopandas.pdb import PandasPdb
atom_df = PandasPdb().fetch_pdb("2mjz").get_model(1).df["ATOM"]
print(len(atom_df.loc[atom_df["charge"].notnull(), "charge"]))
Expected Results
Detection of charged atoms in PDB data (first model of 2MJZ should have 350 charged atoms).
Actual Results
The output is 0 (since only NaN values are present).
Proposed Fix
I'd suggest changing the definitions in pdb_atomdict and pdb_anisoudict to type charges as str and changing the string formatter accordingly.
A setup that seems to be working for me is:
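# Note: the lambda below requires `import re` in the defining module.
{
    "id": "charge",
    "line": [78, 80],
    "type": str,
    # Emit "<digit><sign>" (e.g. "2+"), or an empty string for blank input.
    "strf": lambda x: (
        str(int(re.sub(r"[+-]", "", x)))[-1] + ("-" if "-" in x else "+")
        if len(x.strip()) > 0
        else ""
    ),
}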
The string formatter can probably be improved but this was the safest option I could come up with.
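One possible way to make the formatter easier to read and test is to pull it out into a named function. The sketch below is an assumption on my side, not existing biopandas code: `format_pdb_charge` is a hypothetical name, and unlike the lambda above it rejects multi-digit input instead of silently keeping only the last digit.

```python
import re

# One digit with an optional sign on either side, e.g. '2+', '+2', '1-', '1'.
_CHARGE_RE = re.compile(r"^([+-]?)(\d)([+-]?)$")


def format_pdb_charge(value: str) -> str:
    """Format a charge string as '<digit><sign>' for PDB columns 79-80.

    Hypothetical helper for illustration. Blank input yields an empty string.
    """
    text = value.strip()
    if not text:
        return ""
    match = _CHARGE_RE.match(text)
    if match is None:
        raise ValueError(f"cannot format charge: {value!r}")
    lead, digit, trail = match.groups()
    # Default to '+' when no sign is given, as the lambda above does.
    sign = "-" if "-" in (lead, trail) else "+"
    return f"{digit}{sign}"


print(format_pdb_charge("2+"))  # 2+
print(format_pdb_charge("-1"))  # 1-
```

Raising on unexpected input is a design choice. If silently truncating is preferred for compatibility, the original lambda behavior could be kept instead.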
Versions
biopandas 0.5.1
Linux-5.4.0-91-generic-x86_64-with-glibc2.31
Python 3.10.15 (main, Oct 3 2024, 07:27:34) [GCC 11.2.0]
NumPy 1.23.5