Skip to content

Conversation

@marinegor
Copy link
Contributor

@marinegor marinegor commented Sep 20, 2024

Fixes #2367 and also extends #4303 and solves #5089

Changes made in this Pull Request:

  • uses gemmi library (link) to parse mmcif files
  • adds a class MMCIFReader(base.SingleFrameReaderBase) and class MMCIFParser(TopologyReaderBase) classes for that

As a bonus, this implementation would potentially allow to read any of the gemmi-supported formats (source):

  • mmCIF (PDBx/mmCIF),
  • PDB (with popular extensions),
  • mmJSON

Also, this (with slight modifications) also would allow reading mmcif with multiple models sharing the same topology, as well as more feature-rich parsing of PDBs (the same code without changes can be used for parsing altlocs, charges, etc, from all of these formats).

However, I'm slightly lost on what's to be done next for this PR to be merged, so I'm asking if someone could help me navigate here (tagging @richardjgowers here as author of original PDBx implementation 4303).

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Developers certificate of origin


📚 Documentation preview 📚: https://mdanalysis--4712.org.readthedocs.build/en/4712/

@pep8speaks
Copy link

pep8speaks commented Sep 20, 2024

Hello @marinegor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 28:80: E501 line too long (84 > 79 characters)
Line 41:80: E501 line too long (85 > 79 characters)
Line 42:80: E501 line too long (93 > 79 characters)
Line 61:80: E501 line too long (104 > 79 characters)
Line 65:80: E501 line too long (87 > 79 characters)
Line 67:80: E501 line too long (107 > 79 characters)

Line 2:24: W291 trailing whitespace
Line 60:80: E501 line too long (111 > 79 characters)
Line 72:80: E501 line too long (123 > 79 characters)
Line 82:80: E501 line too long (122 > 79 characters)
Line 106:80: E501 line too long (108 > 79 characters)
Line 113:80: E501 line too long (80 > 79 characters)
Line 128:80: E501 line too long (91 > 79 characters)
Line 175:80: E501 line too long (126 > 79 characters)
Line 185:80: E501 line too long (125 > 79 characters)
Line 224:80: E501 line too long (126 > 79 characters)
Line 242:80: E501 line too long (140 > 79 characters)
Line 281:80: E501 line too long (87 > 79 characters)
Line 292:80: E501 line too long (90 > 79 characters)

Line 56:80: E501 line too long (80 > 79 characters)
Line 57:80: E501 line too long (84 > 79 characters)

Line 335:26: W292 no newline at end of file

Line 48:80: E501 line too long (103 > 79 characters)
Line 81:80: E501 line too long (80 > 79 characters)
Line 97:80: E501 line too long (86 > 79 characters)
Line 271:80: E501 line too long (90 > 79 characters)
Line 340:80: E501 line too long (104 > 79 characters)
Line 387:80: E501 line too long (83 > 79 characters)
Line 436:80: E501 line too long (80 > 79 characters)
Line 463:80: E501 line too long (80 > 79 characters)
Line 481:80: E501 line too long (80 > 79 characters)
Line 493:80: E501 line too long (80 > 79 characters)
Line 494:80: E501 line too long (80 > 79 characters)
Line 497:80: E501 line too long (83 > 79 characters)
Line 498:80: E501 line too long (86 > 79 characters)
Line 546:80: E501 line too long (82 > 79 characters)
Line 547:80: E501 line too long (82 > 79 characters)
Line 549:80: E501 line too long (88 > 79 characters)
Line 551:80: E501 line too long (88 > 79 characters)
Line 552:80: E501 line too long (81 > 79 characters)
Line 777:80: E501 line too long (81 > 79 characters)
Line 778:80: E501 line too long (87 > 79 characters)
Line 779:80: E501 line too long (84 > 79 characters)
Line 780:80: E501 line too long (85 > 79 characters)
Line 781:80: E501 line too long (83 > 79 characters)

Comment last updated at 2024-10-25 11:17:29 UTC

@github-actions
Copy link

github-actions bot commented Sep 20, 2024

Linter Bot Results:

Hi @marinegor! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location Outcome
main package ⚠️ Possible failure
testsuite ⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/11148966346/job/30986736623


Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far, will require a small test file to check reader/parser halves.

np.array,
list(
zip(
*[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling to follow the logic here, a comment breaking down what this double nested loop iteration into a zip is doing would be nice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a little comment explaining that

"pytng>=0.2.3",
"gsd>3.0.0",
"rdkit>=2020.03.1",
"gemmi", # for mmcif format
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will probably be optional, so other imports will have to respect that too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what to do with that -- it's already in the [project.optional-dependencies] table, is there an example of making it more optional?

@marinegor
Copy link
Contributor Author

thanks to amazing contribution by @PardhavMaradani it seems that the current version is working now! Seems that some extra attributes and/or my own implementation of change_squash were messing things up, and I'm very relieved to see it go.

I've re-requested reviews from @orbeckst @yuxuanzhuang @ljwoods2.

One open question that I have: technically, this topology/coordinates parser can also parse PDB without any changes to it, like that:

import MDAnalysis as mda
mda.Universe('testsuite/MDAnalysisTests/data/mmcif/1BD2.pdb.gz', format='mmcif')
# <Universe with 6378 atoms>

do we want to add this option to the existing PDBParser (e.g. introduce there a backend keyword), or document it here somehow? I feel like it's something very powerful (and potentially more compliant with reading from RCSB's pdb files directly), and would like to have in mdanalysis somehow. But I can also open a separate issue with this discussion.

@orbeckst
Copy link
Member

orbeckst commented Nov 5, 2025

I'd open a separate issue for the PDB discussion – it will only delay this one. If it can be treated separately then do it separately. (Orthogonality is great!)

I don't think I have time to review so please don't wait for me.

-------
MDAnalysis Topology object
"""
structure = gemmi.read_structure(self.filename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bumping so this doesn't get lost, still think this should accept a path obj (unless I'm missing something)

@marinegor
Copy link
Contributor Author

marinegor commented Nov 6, 2025 via email

@ljwoods2
Copy link
Contributor

@marinegor pushed directly here and then reverted, oops. Let me know what you think of the PR, this will allow passing streams, pathlib.Paths, etc, as the existing PDBParser can already handle (also via openany)

@marinegor marinegor requested a review from ljwoods2 November 17, 2025 17:54
@marinegor
Copy link
Contributor Author

not sure who to ping for review -- I feel like @yuxuanzhuang @richardjgowers and @ljwoods2 you've been participating in the process, so you might be able to do it after some months as well? :)

@marinegor
Copy link
Contributor Author

marinegor commented Dec 14, 2025

Just to get back to that: @orbeckst @yuxuanzhuang @ljwoods2 could you please review that again?

Or alternatively, I could add flatstructure (benchmarks here) on top of this PR, since it makes everything faster and more readable (see gemmi syntax here).

I'd be in favor or merging it as is, especially since it also would help with #4943 I guess. After it's merged, we can get to discussion in #5141 and start thinking about perhaps making this reader default, since benchmarks show it's also faster -- e.g. 140 vs 190 ms for Universe).

@BradyAJohnston
Copy link
Member

BradyAJohnston commented Jan 2, 2026

I had another look over this and was looking through some of the gemmi docs as well. I still think overall this if fine for now (and would prefer we get this in to get other things moving).

I did some some small tests with the FlatStructre from gemmi rather than iterating through individual atoms for topology and coordinate reading and saw some speedups:

Changing just the coordinate reader testing on 4OZS

# get_coordinates
In [5]: %timeit u = mda.Universe(p)
6.61 ms ± 51 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# FlatStructure(structure).pos
In [6]: %timeit u = mda.Universe(p)
5.92 ms ± 14.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Changing coordinate reader and some topology reading on 7CGO:

# current map / atom iteration
In [8]: %timeit u = mda.Universe(p)
1.61 s ± 23.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# getting values from FlatStructure
In [5]: %timeit u = mda.Universe(p)
1.35 s ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

From what I can see in the docs the FlatStructure in gemmi still isn't up to scratch (and there are even some useful things coming in 0.7.5 over current 0.7.4) so it's maybe just an improvement that can be made in the future.

Details
diff --git a/package/MDAnalysis/coordinates/MMCIF.py b/package/MDAnalysis/coordinates/MMCIF.py
index f9dbbbf20..21a181bda 100644
--- a/package/MDAnalysis/coordinates/MMCIF.py
+++ b/package/MDAnalysis/coordinates/MMCIF.py
@@ -73,8 +73,8 @@ import warnings

 import numpy as np

-from . import base
 from ..lib import util
+from . import base

 try:
     import gemmi
@@ -132,8 +132,8 @@ class MMCIFReader(base.SingleFrameReaderBase):
                 f"File {self.filename} has {len(structure)=} models, but only the first one will be read"
             )

-        model = structure[0]
-        coords = get_coordinates(model)
+        flat = gemmi.FlatStructure(structure)
+        coords = flat.pos
         self.n_atoms = len(coords)
         self.ts = self._Timestep.from_coordinates(coords, **self._ts_kwargs)
         if np.allclose(cell_dims, np.array([1.0, 1.0, 1.0, 90.0, 90.0, 90.0])):
diff --git a/package/MDAnalysis/topology/MMCIFParser.py b/package/MDAnalysis/topology/MMCIFParser.py
index 0ef64618c..2e08d3e60 100644
--- a/package/MDAnalysis/topology/MMCIFParser.py
+++ b/package/MDAnalysis/topology/MMCIFParser.py
@@ -69,8 +69,8 @@ from ..core.topologyattrs import (
     Segids,
     Tempfactors,
 )
-from .base import TopologyReaderBase, change_squash
 from ..lib import util
+from .base import TopologyReaderBase, change_squash


 class MMCIFParser(TopologyReaderBase):
@@ -118,24 +118,33 @@ class MMCIFParser(TopologyReaderBase):
             )
         model = structure[0]

+        def char_array_to_strings(char_array):
+            """Convert (N, 8) char array to 1D string array."""
+            return np.array(
+                [row.tobytes().decode("utf-8").rstrip("\x00") for row in char_array]
+            )
+
+        flat = gemmi.FlatStructure(structure)
+        # flat.strings_as_numbers = False
+
+        resnames = char_array_to_strings(flat.residue_names)
+        names = char_array_to_strings(flat.atom_names)
+        atomtypes = names
+        chainids = char_array_to_strings(flat.chain_ids)
+        tempfactors = flat.b_iso
+        occupancies = flat.occ
+        formalcharges = flat.charge
+
         (
             altlocs,  # at.altloc
             serials,  # at.serial
-            names,  # at.name
-            atomtypes,  # at.name
-            # ------------------
-            chainids,  # chain.name
             elements,  # at.element.name
-            formalcharges,  # at.charge
             weights,  # at.element.weight
             # ------------------
-            occupancies,  # at.occ
             record_types,  # res.het_flag
-            tempfactors,  # at.b_iso
             # ------------------
             icodes,  # residue.seqid.icode
             resids,  # residue.seqid.num
-            resnames,  # residue.name
         ) = map(  # this construct takes np.ndarray of all lists of attributes, extracted from the `gemmi.Model`
             np.array,
             list(
@@ -147,21 +156,14 @@ class MMCIFParser(TopologyReaderBase):
                             # ------------------
                             atom.altloc,  # altlocs
                             atom.serial,  # serials
-                            atom.name,  # names
-                            atom.name,  # atomtypes
                             # ------------------
-                            chain.name,  # chainids
                             atom.element.name,  # elements
-                            atom.charge,  # formalcharges
                             atom.element.weight,  # weights
                             # ------------------
-                            atom.occ,  # occupancies
                             residue.het_flag,  # record_types
-                            atom.b_iso,  # tempfactors
                             # ------------------
                             residue.seqid.icode,  # icodes
                             residue.seqid.num,  # resids
-                            residue.name,  # resnames
                         )
                         # the main loop over the `gemmi.Model` object
                         for chain in model

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PDBx/mmCIF Reader/Topology Reader

10 participants