Implementing `gemmi`-based mmcif reader (with easy extension to PDB/PDBx and mmJSON) #4712

marinegor · 2024-09-20T21:57:20Z

Fixes #2367 and also extends #4303 and solves #5089

Changes made in this Pull Request:

Uses the gemmi library (link) to parse mmCIF files.
Adds the classes MMCIFReader(base.SingleFrameReaderBase) and MMCIFParser(TopologyReaderBase) for that.

As a bonus, this implementation could potentially read any of the gemmi-supported formats (source):

mmCIF (PDBx/mmCIF),
PDB (with popular extensions),
mmJSON

With slight modifications, this would also allow reading mmCIF files with multiple models sharing the same topology, as well as more feature-rich parsing of PDB files (the same code can be used without changes to parse altlocs, charges, etc., from all of these formats).

However, I'm slightly lost on what needs to be done next for this PR to be merged, so I'm asking if someone could help me navigate here (tagging @richardjgowers as the author of the original PDBx implementation, #4303).

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

Developers certificate of origin

I certify that this contribution is covered by the LGPLv2.1+ license as defined in our LICENSE and adheres to the Developer Certificate of Origin.

📚 Documentation preview 📚: https://mdanalysis--4712.org.readthedocs.build/en/4712/

pep8speaks · 2024-09-20T21:57:28Z

Hello @marinegor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file package/MDAnalysis/coordinates/MMCIF.py:

Line 28:80: E501 line too long (84 > 79 characters)
Line 41:80: E501 line too long (85 > 79 characters)
Line 42:80: E501 line too long (93 > 79 characters)
Line 61:80: E501 line too long (104 > 79 characters)
Line 65:80: E501 line too long (87 > 79 characters)
Line 67:80: E501 line too long (107 > 79 characters)

In the file package/MDAnalysis/topology/MMCIFParser.py:

Line 2:24: W291 trailing whitespace
Line 60:80: E501 line too long (111 > 79 characters)
Line 72:80: E501 line too long (123 > 79 characters)
Line 82:80: E501 line too long (122 > 79 characters)
Line 106:80: E501 line too long (108 > 79 characters)
Line 113:80: E501 line too long (80 > 79 characters)
Line 128:80: E501 line too long (91 > 79 characters)
Line 175:80: E501 line too long (126 > 79 characters)
Line 185:80: E501 line too long (125 > 79 characters)
Line 224:80: E501 line too long (126 > 79 characters)
Line 242:80: E501 line too long (140 > 79 characters)
Line 281:80: E501 line too long (87 > 79 characters)
Line 292:80: E501 line too long (90 > 79 characters)

In the file package/MDAnalysis/topology/PDBParser.py:

Line 56:80: E501 line too long (80 > 79 characters)
Line 57:80: E501 line too long (84 > 79 characters)

In the file package/MDAnalysis/topology/__init__.py:

Line 335:26: W292 no newline at end of file

In the file testsuite/MDAnalysisTests/datafiles.py:

Line 48:80: E501 line too long (103 > 79 characters)
Line 81:80: E501 line too long (80 > 79 characters)
Line 97:80: E501 line too long (86 > 79 characters)
Line 271:80: E501 line too long (90 > 79 characters)
Line 340:80: E501 line too long (104 > 79 characters)
Line 387:80: E501 line too long (83 > 79 characters)
Line 436:80: E501 line too long (80 > 79 characters)
Line 463:80: E501 line too long (80 > 79 characters)
Line 481:80: E501 line too long (80 > 79 characters)
Line 493:80: E501 line too long (80 > 79 characters)
Line 494:80: E501 line too long (80 > 79 characters)
Line 497:80: E501 line too long (83 > 79 characters)
Line 498:80: E501 line too long (86 > 79 characters)
Line 546:80: E501 line too long (82 > 79 characters)
Line 547:80: E501 line too long (82 > 79 characters)
Line 549:80: E501 line too long (88 > 79 characters)
Line 551:80: E501 line too long (88 > 79 characters)
Line 552:80: E501 line too long (81 > 79 characters)
Line 777:80: E501 line too long (81 > 79 characters)
Line 778:80: E501 line too long (87 > 79 characters)
Line 779:80: E501 line too long (84 > 79 characters)
Line 780:80: E501 line too long (85 > 79 characters)
Line 781:80: E501 line too long (83 > 79 characters)

Comment last updated at 2024-10-25 11:17:29 UTC

github-actions · 2024-09-20T21:59:32Z

Linter Bot Results:

Hi @marinegor! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location	Outcome
main package	⚠️ Possible failure
testsuite	⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/11148966346/job/30986736623

Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

richardjgowers

Looks good so far, will require a small test file to check reader/parser halves.

package/MDAnalysis/coordinates/MMCIF.py

package/MDAnalysis/topology/MMCIFParser.py

richardjgowers · 2024-09-21T15:12:17Z

package/MDAnalysis/topology/MMCIFParser.py

+            np.array,
+            list(
+                zip(
+                    *[


I'm struggling to follow the logic here; a comment breaking down what this double nested loop iteration into a zip is doing would be nice.

I've added a little comment explaining that
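For readers following along, here is a minimal sketch of the pattern under discussion: nested loops emit one tuple per atom, and `zip(*rows)` transposes those rows into one sequence per attribute. The `Atom`, `Residue` and `Chain` dataclasses are hypothetical stand-ins for the gemmi objects, and plain lists are used where the PR maps `np.array` over the columns.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Atom:
    name: str
    occ: float


@dataclass
class Residue:
    name: str
    atoms: List[Atom] = field(default_factory=list)


@dataclass
class Chain:
    name: str
    residues: List[Residue] = field(default_factory=list)


# Hypothetical stand-in for a model: one chain, two residues, three atoms.
model = [
    Chain("A", [
        Residue("ALA", [Atom("N", 1.0), Atom("CA", 1.0)]),
        Residue("GLY", [Atom("N", 0.5)]),
    ])
]

# Step 1: the nested loops emit one tuple of attributes per atom (row-wise).
rows = [
    (atom.name, chain.name, atom.occ, residue.name)
    for chain in model
    for residue in chain.residues
    for atom in residue.atoms
]

# Step 2: zip(*rows) transposes the rows into one sequence per attribute
# (column-wise); each column is then converted to a list.
names, chainids, occupancies, resnames = map(list, zip(*rows))
```

The order of names on the left of the assignment has to match the order of fields in the per-atom tuple, which is why the PR annotates both sides with comments.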

richardjgowers · 2024-09-21T15:12:54Z

package/pyproject.toml

    "pytng>=0.2.3",
    "gsd>3.0.0",
    "rdkit>=2020.03.1",
+    "gemmi", # for mmcif format


This will probably be optional, so other imports will have to respect that too

not sure what to do with that -- it's already in the [project.optional-dependencies] table, is there an example of making it more optional?
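One common way to make other imports respect an optional dependency is a guarded import plus a check at the point of use. The sketch below assumes that pattern; `HAS_GEMMI` and `require_gemmi` are hypothetical names, not taken from this PR.

```python
import importlib

try:
    import gemmi  # noqa: F401  (optional dependency)
except ImportError:
    HAS_GEMMI = False
else:
    HAS_GEMMI = True


def require_gemmi():
    """Raise a helpful error only when gemmi-backed code is actually used."""
    if not HAS_GEMMI:
        raise ImportError(
            "Reading mmCIF files requires the optional 'gemmi' package."
        )
    return importlib.import_module("gemmi")
```

With this layout, importing the module never fails when gemmi is missing; the error appears only when a reader or parser that needs it is constructed.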

marinegor · 2025-11-05T18:13:06Z

Thanks to the amazing contribution by @PardhavMaradani, it seems that the current version is working now! It seems that some extra attributes and/or my own implementation of change_squash were messing things up, and I'm very relieved to see it go.
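For context, the idea behind squashing per-atom attributes into per-residue ones can be sketched as follows. This is a simplified pure-Python illustration, not the MDAnalysis `change_squash` implementation; `squash_by_change` is a hypothetical name, and it assumes atoms of the same residue are stored consecutively.

```python
def squash_by_change(criteria, to_squash):
    """Group consecutive atoms into residues wherever any criterion changes.

    Returns (residx, squashed): per-atom residue indices and, for each list
    in `to_squash`, the value taken from the first atom of each residue.
    """
    n_atoms = len(criteria[0])
    residx = []
    starts = []  # index of the first atom of each residue
    for i in range(n_atoms):
        # A new residue starts at the first atom, or when any criterion
        # differs from the previous atom.
        is_new = i == 0 or any(c[i] != c[i - 1] for c in criteria)
        if is_new:
            starts.append(i)
        residx.append(len(starts) - 1)
    squashed = [[values[s] for s in starts] for values in to_squash]
    return residx, squashed


resids = [1, 1, 2, 2, 2]
resnames = ["ALA", "ALA", "GLY", "GLY", "GLY"]
residx, (res_ids, res_names) = squash_by_change(
    (resids, resnames), (resids, resnames)
)
```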

I've re-requested reviews from @orbeckst @yuxuanzhuang @ljwoods2.

One open question that I have: technically, this topology/coordinates parser can also parse PDB files without any changes, like this:

import MDAnalysis as mda
mda.Universe('testsuite/MDAnalysisTests/data/mmcif/1BD2.pdb.gz', format='mmcif')
# <Universe with 6378 atoms>

Do we want to add this option to the existing PDBParser (e.g. introduce a backend keyword there), or document it here somehow? I feel like it's something very powerful (and potentially more compliant with reading RCSB's PDB files directly), and I would like to have it in MDAnalysis somehow. But I can also open a separate issue for this discussion.
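To make the backend idea concrete, a keyword-based dispatch could look roughly like the sketch below. Everything here is hypothetical (the `parse_pdb` function, the registry, and the backend names); the callables only return a description string so that the example stays self-contained.

```python
from typing import Callable, Dict

# Hypothetical registry; names and return values are for illustration only.
_PDB_BACKENDS: Dict[str, Callable[[str], str]] = {
    "native": lambda filename: f"native parser would read {filename}",
    "gemmi": lambda filename: f"gemmi-based parser would read {filename}",
}


def parse_pdb(filename: str, backend: str = "native") -> str:
    """Dispatch to the parser selected by the hypothetical `backend` keyword."""
    try:
        parser = _PDB_BACKENDS[backend]
    except KeyError:
        raise ValueError(
            f"Unknown backend {backend!r}; choose from {sorted(_PDB_BACKENDS)}"
        ) from None
    return parser(filename)
```

Keeping the existing behaviour as the default value means current callers would be unaffected, and the alternative backend stays opt-in.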

orbeckst · 2025-11-05T18:31:15Z

I'd open a separate issue for the PDB discussion – it will only delay this one. If it can be treated separately then do it separately. (Orthogonality is great!)

I don't think I have time to review so please don't wait for me.

ljwoods2 · 2025-11-06T18:47:26Z

package/MDAnalysis/topology/MMCIFParser.py

+        -------
+        MDAnalysis Topology object
+        """
+        structure = gemmi.read_structure(self.filename)


Bumping so this doesn't get lost, still think this should accept a path obj (unless I'm missing something)

marinegor · 2025-11-06T22:57:16Z

oh, sorry @ljwoods2, I forgot to reply earlier! what do you actually mean by that? like, should we explicitly convert a string to path, that's it? or are there more checks to be done?

ljwoods2 · 2025-11-16T07:24:52Z

@marinegor I pushed directly here and then reverted, oops. Let me know what you think of the PR; it will allow passing streams, pathlib.Path objects, etc., which the existing PDBParser can already handle (also via openany).
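As a rough illustration of what "accepting a path object or a stream" involves, the sketch below normalizes the three input kinds to text using only the standard library. `read_text_from_any` is a hypothetical helper, not the `openany`-based code from the linked PR, and it does not handle compressed files.

```python
import io
import os
from pathlib import Path


def read_text_from_any(source):
    """Return file contents as text for a str path, Path, or open stream."""
    if hasattr(source, "read"):
        data = source.read()
        # Binary streams yield bytes; decode so callers always get str.
        return data.decode("utf-8") if isinstance(data, bytes) else data
    if isinstance(source, (str, os.PathLike)):
        return Path(source).read_text(encoding="utf-8")
    raise TypeError(f"Unsupported input type: {type(source).__name__}")


# Usage with an in-memory stream standing in for an open file.
text = read_text_from_any(io.StringIO("data_example\n"))
```

The resulting string could then be handed to any parser entry point that accepts file contents rather than a filename.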

marinegor · 2025-11-19T09:19:06Z

not sure who to ping for review -- I feel like @yuxuanzhuang @richardjgowers and @ljwoods2 you've been participating in the process, so you might be able to do it after some months as well? :)

marinegor · 2025-12-14T16:49:17Z

Just to get back to that: @orbeckst @yuxuanzhuang @ljwoods2 could you please review that again?

Or alternatively, I could add FlatStructure (benchmarks here) on top of this PR, since it makes everything faster and more readable (see gemmi syntax here).

I'd be in favor of merging it as is, especially since it would also help with #4943, I guess. After it's merged, we can get to the discussion in #5141 and start thinking about perhaps making this reader the default, since benchmarks show it's also faster (e.g. 140 vs 190 ms for Universe).

BradyAJohnston · 2026-01-02T12:54:28Z

I had another look over this and was looking through some of the gemmi docs as well. I still think overall this is fine for now (and would prefer we get this in to get other things moving).

I did some small tests with the FlatStructure from gemmi rather than iterating through individual atoms for topology and coordinate reading and saw some speedups:

Changing just the coordinate reader, tested on `4OZS`:

# get_coordinates
In [5]: %timeit u = mda.Universe(p)
6.61 ms ± 51 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# FlatStructure(structure).pos
In [6]: %timeit u = mda.Universe(p)
5.92 ms ± 14.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Changing coordinate reader and some topology reading on `7CGO`:

# current map / atom iteration
In [8]: %timeit u = mda.Universe(p)
1.61 s ± 23.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# getting values from FlatStructure
In [5]: %timeit u = mda.Universe(p)
1.35 s ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

From what I can see in the docs the FlatStructure in gemmi still isn't up to scratch (and there are even some useful things coming in 0.7.5 over current 0.7.4) so it's maybe just an improvement that can be made in the future.
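To put the timings above in relative terms, the short calculation below uses only the mean values from the `%timeit` output; `percent_faster` is a helper introduced here for illustration.

```python
def percent_faster(before: float, after: float) -> float:
    """Relative reduction in wall time, in percent."""
    return (before - after) / before * 100.0


# Mean timings copied from the %timeit output above.
coords_only = percent_faster(6.61, 5.92)          # 4OZS, in ms
coords_and_topology = percent_faster(1.61, 1.35)  # 7CGO, in s

print(f"4OZS: {coords_only:.1f}% less time")          # 4OZS: 10.4% less time
print(f"7CGO: {coords_and_topology:.1f}% less time")  # 7CGO: 16.1% less time
```

So the change is a modest improvement of roughly 10 to 16 percent in these two cases, which fits the suggestion to treat it as a follow-up rather than a blocker.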

Details

diff --git a/package/MDAnalysis/coordinates/MMCIF.py b/package/MDAnalysis/coordinates/MMCIF.py
index f9dbbbf20..21a181bda 100644
--- a/package/MDAnalysis/coordinates/MMCIF.py
+++ b/package/MDAnalysis/coordinates/MMCIF.py
@@ -73,8 +73,8 @@ import warnings

 import numpy as np

-from . import base
 from ..lib import util
+from . import base

 try:
     import gemmi
@@ -132,8 +132,8 @@ class MMCIFReader(base.SingleFrameReaderBase):
                 f"File {self.filename} has {len(structure)=} models, but only the first one will be read"
             )

-        model = structure[0]
-        coords = get_coordinates(model)
+        flat = gemmi.FlatStructure(structure)
+        coords = flat.pos
         self.n_atoms = len(coords)
         self.ts = self._Timestep.from_coordinates(coords, **self._ts_kwargs)
         if np.allclose(cell_dims, np.array([1.0, 1.0, 1.0, 90.0, 90.0, 90.0])):
diff --git a/package/MDAnalysis/topology/MMCIFParser.py b/package/MDAnalysis/topology/MMCIFParser.py
index 0ef64618c..2e08d3e60 100644
--- a/package/MDAnalysis/topology/MMCIFParser.py
+++ b/package/MDAnalysis/topology/MMCIFParser.py
@@ -69,8 +69,8 @@ from ..core.topologyattrs import (
     Segids,
     Tempfactors,
 )
-from .base import TopologyReaderBase, change_squash
 from ..lib import util
+from .base import TopologyReaderBase, change_squash


 class MMCIFParser(TopologyReaderBase):
@@ -118,24 +118,33 @@ class MMCIFParser(TopologyReaderBase):
             )
         model = structure[0]

+        def char_array_to_strings(char_array):
+            """Convert (N, 8) char array to 1D string array."""
+            return np.array(
+                [row.tobytes().decode("utf-8").rstrip("\x00") for row in char_array]
+            )
+
+        flat = gemmi.FlatStructure(structure)
+        # flat.strings_as_numbers = False
+
+        resnames = char_array_to_strings(flat.residue_names)
+        names = char_array_to_strings(flat.atom_names)
+        atomtypes = names
+        chainids = char_array_to_strings(flat.chain_ids)
+        tempfactors = flat.b_iso
+        occupancies = flat.occ
+        formalcharges = flat.charge
+
         (
             altlocs,  # at.altloc
             serials,  # at.serial
-            names,  # at.name
-            atomtypes,  # at.name
-            # ------------------
-            chainids,  # chain.name
             elements,  # at.element.name
-            formalcharges,  # at.charge
             weights,  # at.element.weight
             # ------------------
-            occupancies,  # at.occ
             record_types,  # res.het_flag
-            tempfactors,  # at.b_iso
             # ------------------
             icodes,  # residue.seqid.icode
             resids,  # residue.seqid.num
-            resnames,  # residue.name
         ) = map(  # this construct takes np.ndarray of all lists of attributes, extracted from the `gemmi.Model`
             np.array,
             list(
@@ -147,21 +156,14 @@ class MMCIFParser(TopologyReaderBase):
                             # ------------------
                             atom.altloc,  # altlocs
                             atom.serial,  # serials
-                            atom.name,  # names
-                            atom.name,  # atomtypes
                             # ------------------
-                            chain.name,  # chainids
                             atom.element.name,  # elements
-                            atom.charge,  # formalcharges
                             atom.element.weight,  # weights
                             # ------------------
-                            atom.occ,  # occupancies
                             residue.het_flag,  # record_types
-                            atom.b_iso,  # tempfactors
                             # ------------------
                             residue.seqid.icode,  # icodes
                             residue.seqid.num,  # resids
-                            residue.name,  # resnames
                         )
                         # the main loop over the `gemmi.Model` object
                         for chain in model
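The `char_array_to_strings` helper in the diff above turns fixed-width character rows into Python strings. A standard-library sketch of the same idea is shown below; it assumes each record is NUL-padded to 8 bytes (as the `rstrip("\x00")` in the diff suggests) and uses a flat `bytes` buffer as a stand-in for the (N, 8) array. `fixed_width_to_strings` is a hypothetical name.

```python
def fixed_width_to_strings(buffer: bytes, width: int = 8):
    """Split NUL-padded fixed-width records into a list of Python strings."""
    if len(buffer) % width:
        raise ValueError("buffer length must be a multiple of the record width")
    return [
        buffer[i:i + width].decode("utf-8").rstrip("\x00")
        for i in range(0, len(buffer), width)
    ]


# Three hypothetical residue names, each padded with NUL bytes to 8 bytes.
raw = (
    b"ALA".ljust(8, b"\x00")
    + b"GLY".ljust(8, b"\x00")
    + b"HOH".ljust(8, b"\x00")
)
resnames = fixed_width_to_strings(raw)
```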

marinegor added 13 commits May 22, 2024 20:04

Start working on MMCIF parser

aa2a88f

Add first (not working) version of MMCIFReader and MMCIF topology parser

218cf43

Do some squashing

7f78e02

Remove inherited docs

6682d6e

Try improving the parsing

817f3a0

Try three independent loops over the model

3cc8c80

Merge remote-tracking branch 'upstream/develop' into feature/mmcif

f1bf325

Add gemmi dependency

d21c220

necessary params

2a1be15

finished sorting atom attrs

77645e6

add function for transformation into *idx

91e6942

oh damn seems to finally be working

9a0c086

remove TODOs

9c731df

Remove debug prints

8b40ec7

richardjgowers reviewed Sep 21, 2024

marinegor added 13 commits September 23, 2024 00:31

Merge branch 'develop' into feature/mmcif

bdcbd73

try to pack things into separate class in utils?

401a4d3

remove unnecessary functions

9c336bd

copy all loops into separate functions

def88e4

Move loops over structures into functions

cabfd37

Move coordinate fetching into function for the coordinate reader as well

4c9d930

Fix imports

184491a

Start adding documentation

3de8565

Reference MMCIFParser in PDBParser

ca6ebbb

Add documentation for trajectory and topology parsers

45077ad

Add mmcif tests

9a1a59a

Update format specifications

27c10d6

Write simple tests

950cfcf

marinegor added 3 commits November 5, 2025 18:41

add slightly more smoke tests

0b0ec81

update changelog

b3d7c1c

remove fixmes and move some tests to topology tests

801d85f

marinegor requested review from ljwoods2 and yuxuanzhuang November 5, 2025 18:11

black formatter

bdd070e

marinegor mentioned this pull request Nov 6, 2025

Add option for gemmi backend for PDB reading #5141

Open

match formats in toplogy and coordinate parsers for mmcif

6b2c6c6

marinegor mentioned this pull request Nov 6, 2025

Parse hexadecimal resid from OpenMM in PDBParser #5089

Open

ljwoods2 requested changes Nov 6, 2025

marinegor mentioned this pull request Nov 11, 2025

Add option for gemmi backend for pdb reading #5142

Draft

5 tasks

ljwoods2 added 3 commits November 15, 2025 17:52

replace gemmi.get_structure

1a7f607

Revert "replace gemmi.get_structure"

157c365

This reverts commit 1a7f607.

add input fmt tests

1d01a3f

ljwoods2 mentioned this pull request Nov 16, 2025

Make input method for MMCIFParser and MMCIFReader more flexible marinegor/mdanalysis#4

Merged

ljwoods2 added 3 commits November 15, 2025 18:31

topology format kwarg bug

2020484

error handling tweaks

d362f91

remove tmp test files

959d78b

marinegor added 2 commits November 17, 2025 17:00

apply black

15441f0

fix changelog

e2e097f

marinegor requested a review from ljwoods2 November 17, 2025 17:54

add one more issue to changelog

bc89c0b

marinegor requested a review from BradyAJohnston December 23, 2025 18:50
