Implementation of fetch_pdb() #4943

jauy123 · 2025-03-03T18:28:48Z

Changes made in this Pull Request:

This is a still work in progress, but here's a implementation of @BradyAJohnston 's code wrapped into classes. I still need to write tests and docs for the entire thing.

Added two classes: DownloaderBase and 'PDBDownloader' in order to implement downloading structure file from online sources such as the PDB databank.
Added requests as a dependency
mda.fetch_pdb() is implemented as a wrapper to commonly used option in 'PDBDownloader'

PR Checklist

Issue raised/referenced?
Tests updated/added?
Documentation updated/added?
package/CHANGELOG file updated?

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.

📚 Documentation preview 📚: https://mdanalysis--4943.org.readthedocs.build/en/4943/

jauy123 · 2025-03-03T18:31:48Z

I'm not sure where to put this code in the codebase, so I create a new folder for it right now. I'm open to it being moved somewhere

Some stuff which I like to still add (besides tests and docs):

Verbose option for PdbDownloader.download() (I think tqdm was a dependency last time I saw?)
Integration with Mdanalysis' logger
Probably could wrap the cache logic into a separate function

BradyAJohnston · 2025-03-04T01:20:19Z

I think others will have to confirm, but likely we'll want to have requests be an optional dependency to reduce core library dependencies (as the fetching of structures won't be something that lot of users will be doing).

Additional it's not finalised yet but if the mmcif reader in #2367 gets finalised then the default download shouldn't be .pdb (but can remain for now).

codecov · 2025-03-04T20:44:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.74%. Comparing base (9fb8b0a) to head (f1249e7).

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #4943      +/-   ##
===========================================
+ Coverage    92.72%   92.74%   +0.01%     
===========================================
  Files          180      182       +2     
  Lines        22472    22496      +24     
  Branches      3188     3192       +4     
===========================================
+ Hits         20838    20864      +26     
+ Misses        1176     1175       -1     
+ Partials       458      457       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jauy123 · 2025-03-05T02:49:30Z

I'm ok with that. I can make the code raise an exception if requests is not in the environment.

jauy123 · 2025-03-10T17:41:23Z

Assuming that requests will be optional dependency, how would I exactly specify in the build files? Right now, I'm just hard coding it in, so that the github CLI tests can build successfully and run.

BradyAJohnston · 2025-03-19T12:36:18Z

You've added it to one of the optional dependency categories which is all that should be required. For the actual files where it is used you'll need to have something setup like the usage of biopython:

mdanalysis/package/MDAnalysis/analysis/align.py

Lines 200 to 207 in dcaa087

    
           try: 
        
               import Bio.AlignIO 
        
               import Bio.Align 
        
               import Bio.Align.Applications 
        
           except ImportError: 
        
               HAS_BIOPYTHON = False 
        
           else: 
        
               HAS_BIOPYTHON = True

I'm not an expert on the pipelines so someone else would have to pitch in more on that.

jauy123 · 2025-03-19T23:14:58Z

Thanks for the comment!

jauy123 · 2025-03-19T23:15:52Z

I happen to have another question!

Is it normal for some of the tests to not be consistent across each commit? From what I understand, each github CLI has to get and build each MDAnalysis from source, and this instance can potentially timeout from what I observe across each commit.

The macOS (of the latest commit) failed at 97% of test because it reached the max wall time of two hours.

Even then the latest Azure tests failed because of other tests in the source code which I didn't write (namely due to other tests)

From Azure_Tests Win-Python310-64bit-full (commit 651bf267076d2d7da6491608b1b5136915caf2e2)

FAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDReaderWithRealTrajectory::test_open_filestream - Issue #2884
XFAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDWriterWithRealTrajectory::test_write_with_drivers[core] - occasional PermissionError on windows
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_frame_collect_all_same - reason: memoryreader allows independent coordinates
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice0] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice1] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice2] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/core/test_topologyattrs.py::TestResids::test_set_atoms
XFAIL MDAnalysisTests/lib/test_util.py::test_which - util.which does not get right binary on Windows
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#N] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-N=[N+]=[N-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[O+]=C] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#[C-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/coordinates/test_dcd.py::TestDCDReader::test_frame_collect_all_same - reason: DCDReader allows independent coordinates.This behaviour is deprecated and will be changedin 3.

orbeckst · 2025-03-20T18:17:46Z

In principle, tests should pass everywhere.

The Azure tests time out in the test

_________________________ Test_Fetch_Pdb.test_timeout _________________________

which looks like something that you added. I haven't looked at your code but it might simply be the case that some stuff needs to be written differently for windows.

orbeckst · 2025-06-12T16:18:16Z

@jauy123 do you have time to pick up this PR again? Would be great to have the feature in 2.10!

jauy123 · 2025-06-12T21:25:55Z

I have time again. I was busy starting the end of spring break with comps, classes, and you know what.

…xt files

…since text files are just binary files with special encoding

orbeckst

Thanks for addressing the last comments. All looking good to me now – ✅ !

Let's keep the top level import mda.fetch_pdb() for now. If we expand "fetchers" then we can revisit. Given that we currently do not document our top-level imports it is sufficient to document topology.PDBParser.fetch_pdb. Perhaps I could encourage you to add an entry to the UserGuide (see Contributing to the User Guide)?

Another thought for the future (and a possible fetchers module) is that fetch() should perhaps return pathlib.Path instances. I do not propose to change fetch_pdb() as this can be the bare-bones workhorse function that we may want to wrap with a generic fetch() function or Universe.from_fetched() static method.

package/MDAnalysis/topology/PDBParser.py

jauy123 · 2025-11-01T18:31:37Z

Thanks for the feedback!

Another thought for the future (and a possible fetchers module) is that fetch() should perhaps return pathlib.Path instances.

I would disagree with this idea simply because the standard library still has modules that still takes paths as string (i.e. os.path). Unless the only way to work with paths in python are pathlib Path objects, I prefer to return paths as a string as that has been historical python behavior, and leave users to convert to a pathlib object if they want to.

jauy123 · 2025-11-12T20:46:46Z

@BradyAJohnston Nice to meet you at the UGM!

Just want to remind you about this PR. It deviate quite a bit from the initial conception back in March, but I believe this is robust enough to be accepted within the main codebase.

orbeckst · 2025-11-12T21:37:17Z

@jauy123 I think @BradyAJohnston is currently here, maybe catch him ;-). Lesson for the future: if you have the opportunity to corner a reviewer of one of your PRs, have them review/approve right while they are sitting in front of you ;-). Potentially bribe them with a beer.

orbeckst · 2025-11-12T21:41:12Z

Regarding pathlib: Using Path objects is forward looking and anything that os.path can do, you can do with a Path, but in a more abstract (and thus more robust) manner. I'd argue the opposite of you: if you really need a string representation of a Path for legacy code, use str(pth).

jauy123 · 2025-11-12T21:49:17Z

Can't go to the botanical garden today xD. I got a car, but I walk everyday to campus. It would take too long for me to get back and drive there.

Lesson for the future: if you have the opportunity to corner a reviewer of one of your PRs, have them review/approve right while they are sitting in front of you ;-). Potentially bribe them with a beer.

I forgot! I just remember when I was walking home last night after dinner and brain farted it until five minutes ago. Am taking bribing people as alcohol as official sanctioned Ph.D advice in the future xD.

jauy123 · 2025-11-12T22:02:44Z

Regarding pathlib: Using Path objects is forward looking and anything that os.path can do, you can do with a Path, but in a more abstract (and thus more robust) manner. I'd argue the opposite of you: if you really need a string representation of a Path for legacy code, use str(pth).

I see the point, and I personally use pathlib.Path for all my scripts. But is there like an official PEP saying that pathlib.Path is the preferred way to deal with paths over strings? Or is it like just a community practice since pathlib has been in the standard library for so long?

My next question if anything in the standard library which only take strings, or that they all of them take Paths in python 3.10+, the minimum supported version of python that MDAnalysis supports? If pathlib.Path are basically first class objects on the same level of strings in python 3.10+ (like you mentioned in the os.path), then I see no reason to not return paths as pathlib.Path objects instead of strings.

p-j-smith · 2025-11-12T22:44:16Z

the standard library still has modules that still takes paths as string (i.e. os.path)

Functions in os.path will take anything that is path-like, including a pathlib.Path

orbeckst · 2025-11-12T23:16:48Z

I didn't know that os.path was so flexible. I learned something new.

p-j-smith

Nice work @jauy123!

Blocking whilst we consider moving fetch_pdb into a new module e.g. mda.fetch. It feels like it would be better placed there, rather than in topology.PDBParser as it's not doing any topology parsing

package/MDAnalysis/topology/PDBParser.py

testsuite/MDAnalysisTests/fetch/test_from_PDB.py

package/pyproject.toml

marinegor · 2025-11-13T23:31:58Z

Small suggestion here: could we perhaps make the default format not pdb.gz but cif.gz, in line with #5142, its fixes for #5089 and the fact that:

as of April 2025 3.7% of the PDB archive is not available in legacy PDB format

source

Or at least make it default if gemmi is installed?

orbeckst · 2025-12-05T00:09:53Z

I support @p-j-smith ‘s suggestion to create a new top-level MDAnalysis.fetch module. Then this function could be named from_PDB(). This makes for nice semantics: pth = fetch.from_PDB(“1ake”) and u = Universe(fetch.from_PDB(“4bwz”)). And later we may have fetch.from_alphafold() , fetch.from_CSD(), …

orbeckst

Based on the discussion, please refactor and create a new top-level MDAnalysis.fetch module to house your new function (and any future "fetch" functionality).

Create a new fetch module; organize tests and docs accordingly. The docs will get a new top-level section (after coordinates).
Create fetch/pdb.py and put your function into this file.
Name your function from_PDB(). (This makes for nice semantics: pth = fetch.from_PDB(“1ake”))
Import your function in fetch/__init__.py: from .pdb import from_PDB

(PS: Please also address the other comments.)

jauy123 · 2025-12-18T01:22:59Z

Ok, I can get back to work on this since the semester ended.

hadn't implemented changes yet

orbeckst

Looking good. Primarily add docs and integrate into the docs under package/doc/sphinx/source/documentation_pages. You'll need a new fetch_modules.rst and also link that file from ../index.rst. Check the rendered docs to see that your docs show up correctly. Put yourself in the shoes of someone who wants to learn more about this new functionality.

package/MDAnalysis/__init__.py

package/MDAnalysis/fetch/pdb.py

testsuite/MDAnalysisTests/fetch/test_from_PDB.py

package/MDAnalysis/fetch/__init__.py

Changed default file_format to cif.gz Chaned return type to pathlib.Path Updated test appropriately removed true_basename()

However, I don't know where to put it? And I kinda finicking around with the sphinx code in order to get it actually show up modified: package/MDAnalysis/fetch/__init__.py modified: package/MDAnalysis/fetch/pdb.py new file: package/doc/sphinx/source/documentation_pages/auxiliary/fetch_init.rst new file: package/doc/sphinx/source/documentation_pages/auxiliary/fetch_pdb.rst modified: package/doc/sphinx/source/documentation_pages/auxiliary_modules.rst

jauy123 · 2025-12-23T20:09:33Z

Comments addressed and fixed:
@p-j-smith comment on making pdb_ids required for from_pdb
@p-j-smith comment on making pdb_ids return pathlib.Path instead of strings.
@marinegor comment on making the default file format download as cif.gz instead of pdb.gz -- However since there is no reader for cif yet, a Universe can't actually read cif.gz yet.
@orbeckst comment on removing from_pdb from top level import
@orbeckst various comments on documentation -- I also added various amount of documentation

One thing that I need to do (and something that I need input for from the core devs) is where to place mda.fetch in the docs/sphinx code? I kinda shoehorned in under auxilary just to visualized what I was doing, but I don't see an obvious place where to put this in the web documentation.

BradyAJohnston self-assigned this Mar 4, 2025

jauy123 added 18 commits June 23, 2025 10:25

Added requests as a dependency

7899f3d

Inital download code

44393be

fixed typo

b1f6002

cleaner convert_to_universe()

9c6e87a

Added abc module and allowed closing of file stream for downloaded te…

aecefc9

…xt files

Fixed __all__ -- should fixed pull request test on github

9510cc6

refactored cache logic

8c1a196

Initial tests

f0e30ed

Added __init.py to make tests work

1c7d909

typos fixed

eb23ed1

Refactored Tests -- put them in classes!

b0c7f5a

PdbDownloader().download() now downloads in binary rather than text (…

f2ec203

…since text files are just binary files with special encoding

Updated Tests to comply with pdb.gz

a21fd94

Added Progress bar to PdbDownloader().download()

d58bed9

Added a few clarifications to _requests_progress_bar

1147b6d

Added filename attribute() to BaseDownloader()

91feb16

made _requests_progress_bar a private method of PdbDownloader

ddcef9e

minor comments

bf3e07f

jauy123 requested a review from orbeckst October 29, 2025 15:20

orbeckst approved these changes Oct 30, 2025

View reviewed changes

package/MDAnalysis/topology/PDBParser.py Outdated Show resolved Hide resolved

p-j-smith requested changes Nov 13, 2025

View reviewed changes

directly linked to PDB entry

04275e9

marinegor mentioned this pull request Dec 14, 2025

Implementing gemmi-based mmcif reader (with easy extension to PDB/PDBx and mmJSON) #4712

Open

5 tasks

Merge branch 'develop' into downloads

28daf45

orbeckst requested changes Dec 18, 2025

View reviewed changes

jauy123 added 2 commits December 19, 2025 10:29

Move all existing code to new fetch/ module

bb84830

hadn't implemented changes yet

Removed code from topology/PDBParser

dc2bbee

orbeckst requested changes Dec 19, 2025

View reviewed changes

jauy123 added 5 commits December 20, 2025 11:32

post-black run

2d77d50

Changed from_PDB

bcb42fa

Changed default file_format to cif.gz Chaned return type to pathlib.Path Updated test appropriately removed true_basename()

made pdb_ids a required arguement

d9fc5e6

Removed top level import -- need to do docs still

cac8169

oops forgot to replace docstring examples

f1249e7

jauy123 requested review from orbeckst and p-j-smith December 23, 2025 20:16

Implementation of fetch_pdb() #4943

Are you sure you want to change the base?

Implementation of fetch_pdb() #4943

Uh oh!

Conversation

jauy123 commented Mar 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Checklist

Developers Certificate of Origin

Uh oh!

jauy123 commented Mar 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BradyAJohnston commented Mar 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Mar 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jauy123 commented Mar 5, 2025

Uh oh!

jauy123 commented Mar 10, 2025

Uh oh!

BradyAJohnston commented Mar 19, 2025

Uh oh!

jauy123 commented Mar 19, 2025

Uh oh!

jauy123 commented Mar 19, 2025

Uh oh!

orbeckst commented Mar 20, 2025

Uh oh!

orbeckst commented Jun 12, 2025

Uh oh!

jauy123 commented Jun 12, 2025

Uh oh!

orbeckst left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jauy123 commented Nov 1, 2025

Uh oh!

jauy123 commented Nov 12, 2025

Uh oh!

orbeckst commented Nov 12, 2025

Uh oh!

orbeckst commented Nov 12, 2025

Uh oh!

jauy123 commented Nov 12, 2025

Uh oh!

jauy123 commented Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

p-j-smith commented Nov 12, 2025

Uh oh!

orbeckst commented Nov 12, 2025

Uh oh!

p-j-smith left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

marinegor commented Nov 13, 2025

Uh oh!

orbeckst commented Dec 5, 2025

Uh oh!

orbeckst left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jauy123 commented Dec 18, 2025

Uh oh!

orbeckst left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jauy123 commented Mar 3, 2025 •

edited

Loading

jauy123 commented Mar 3, 2025 •

edited

Loading

BradyAJohnston commented Mar 4, 2025 •

edited

Loading

codecov bot commented Mar 4, 2025 •

edited

Loading

jauy123 commented Nov 12, 2025 •

edited

Loading

orbeckst left a comment •

edited

Loading