Skip to content

Conversation

@jauy123
Copy link
Contributor

@jauy123 jauy123 commented Mar 3, 2025

Fixes #4907

Changes made in this Pull Request:

This is a still work in progress, but here's a implementation of @BradyAJohnston 's code wrapped into classes. I still need to write tests and docs for the entire thing.

  • Added two classes: DownloaderBase and 'PDBDownloader' in order to implement downloading structure file from online sources such as the PDB databank.
  • Added requests as a dependency
  • mda.fetch_pdb() is implemented as a wrapper to commonly used option in 'PDBDownloader'

PR Checklist

  • Issue raised/referenced?
  • Tests updated/added?
  • Documentation updated/added?
  • package/CHANGELOG file updated?

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--4943.org.readthedocs.build/en/4943/

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 3, 2025

I'm not sure where to put this code in the codebase, so I create a new folder for it right now. I'm open to it being moved somewhere

Some stuff which I like to still add (besides tests and docs):

  1. Verbose option for PdbDownloader.download() (I think tqdm was a dependency last time I saw?)
  2. Integration with Mdanalysis' logger
  3. Probably could wrap the cache logic into a separate function

@BradyAJohnston BradyAJohnston self-assigned this Mar 4, 2025
@BradyAJohnston
Copy link
Member

BradyAJohnston commented Mar 4, 2025

I think others will have to confirm, but likely we'll want to have requests be an optional dependency to reduce core library dependencies (as the fetching of structures won't be something that lot of users will be doing).

Additional it's not finalised yet but if the mmcif reader in #2367 gets finalised then the default download shouldn't be .pdb (but can remain for now).

@codecov
Copy link

codecov bot commented Mar 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.74%. Comparing base (9fb8b0a) to head (f1249e7).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4943      +/-   ##
===========================================
+ Coverage    92.72%   92.74%   +0.01%     
===========================================
  Files          180      182       +2     
  Lines        22472    22496      +24     
  Branches      3188     3192       +4     
===========================================
+ Hits         20838    20864      +26     
+ Misses        1176     1175       -1     
+ Partials       458      457       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 5, 2025

I'm ok with that. I can make the code raise an exception if requests is not in the environment.

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 10, 2025

Assuming that requests will be optional dependency, how would I exactly specify in the build files? Right now, I'm just hard coding it in, so that the github CLI tests can build successfully and run.

@BradyAJohnston
Copy link
Member

You've added it to one of the optional dependency categories which is all that should be required. For the actual files where it is used you'll need to have something setup like the usage of biopython:

try:
import Bio.AlignIO
import Bio.Align
import Bio.Align.Applications
except ImportError:
HAS_BIOPYTHON = False
else:
HAS_BIOPYTHON = True

I'm not an expert on the pipelines so someone else would have to pitch in more on that.

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 19, 2025

Thanks for the comment!

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 19, 2025

I happen to have another question!

Is it normal for some of the tests to not be consistent across each commit? From what I understand, each github CLI has to get and build each MDAnalysis from source, and this instance can potentially timeout from what I observe across each commit.

The macOS (of the latest commit) failed at 97% of test because it reached the max wall time of two hours.

Even then the latest Azure tests failed because of other tests in the source code which I didn't write (namely due to other tests)

From Azure_Tests Win-Python310-64bit-full (commit 651bf267076d2d7da6491608b1b5136915caf2e2)

FAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDReaderWithRealTrajectory::test_open_filestream - Issue #2884
XFAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDWriterWithRealTrajectory::test_write_with_drivers[core] - occasional PermissionError on windows
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_frame_collect_all_same - reason: memoryreader allows independent coordinates
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice0] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice1] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice2] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/core/test_topologyattrs.py::TestResids::test_set_atoms
XFAIL MDAnalysisTests/lib/test_util.py::test_which - util.which does not get right binary on Windows
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#N] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-N=[N+]=[N-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[O+]=C] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#[C-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/coordinates/test_dcd.py::TestDCDReader::test_frame_collect_all_same - reason: DCDReader allows independent coordinates.This behaviour is deprecated and will be changedin 3.

@orbeckst
Copy link
Member

In principle, tests should pass everywhere.

The Azure tests time out in the test

_________________________ Test_Fetch_Pdb.test_timeout _________________________

which looks like something that you added. I haven't looked at your code but it might simply be the case that some stuff needs to be written differently for windows.

@orbeckst
Copy link
Member

@jauy123 do you have time to pick up this PR again? Would be great to have the feature in 2.10!

@jauy123
Copy link
Contributor Author

jauy123 commented Jun 12, 2025

I have time again. I was busy starting the end of spring break with comps, classes, and you know what.

@jauy123 jauy123 requested a review from orbeckst October 29, 2025 15:20
Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the last comments. All looking good to me now – ✅ !

Let's keep the top level import mda.fetch_pdb() for now. If we expand "fetchers" then we can revisit. Given that we currently do not document our top-level imports it is sufficient to document topology.PDBParser.fetch_pdb. Perhaps I could encourage you to add an entry to the UserGuide (see Contributing to the User Guide)?

Another thought for the future (and a possible fetchers module) is that fetch() should perhaps return pathlib.Path instances. I do not propose to change fetch_pdb() as this can be the bare-bones workhorse function that we may want to wrap with a generic fetch() function or Universe.from_fetched() static method.

@jauy123
Copy link
Contributor Author

jauy123 commented Nov 1, 2025

Thanks for the feedback!

Another thought for the future (and a possible fetchers module) is that fetch() should perhaps return pathlib.Path instances.

I would disagree with this idea simply because the standard library still has modules that still takes paths as string (i.e. os.path). Unless the only way to work with paths in python are pathlib Path objects, I prefer to return paths as a string as that has been historical python behavior, and leave users to convert to a pathlib object if they want to.

@jauy123
Copy link
Contributor Author

jauy123 commented Nov 12, 2025

@BradyAJohnston Nice to meet you at the UGM!

Just want to remind you about this PR. It deviate quite a bit from the initial conception back in March, but I believe this is robust enough to be accepted within the main codebase.

@orbeckst
Copy link
Member

@jauy123 I think @BradyAJohnston is currently here, maybe catch him ;-). Lesson for the future: if you have the opportunity to corner a reviewer of one of your PRs, have them review/approve right while they are sitting in front of you ;-). Potentially bribe them with a beer.

@orbeckst
Copy link
Member

Regarding pathlib: Using Path objects is forward looking and anything that os.path can do, you can do with a Path, but in a more abstract (and thus more robust) manner. I'd argue the opposite of you: if you really need a string representation of a Path for legacy code, use str(pth).

@jauy123
Copy link
Contributor Author

jauy123 commented Nov 12, 2025

Can't go to the botanical garden today xD. I got a car, but I walk everyday to campus. It would take too long for me to get back and drive there.

Lesson for the future: if you have the opportunity to corner a reviewer of one of your PRs, have them review/approve right while they are sitting in front of you ;-). Potentially bribe them with a beer.

I forgot! I just remember when I was walking home last night after dinner and brain farted it until five minutes ago. Am taking bribing people as alcohol as official sanctioned Ph.D advice in the future xD.

@jauy123
Copy link
Contributor Author

jauy123 commented Nov 12, 2025

Regarding pathlib: Using Path objects is forward looking and anything that os.path can do, you can do with a Path, but in a more abstract (and thus more robust) manner. I'd argue the opposite of you: if you really need a string representation of a Path for legacy code, use str(pth).

I see the point, and I personally use pathlib.Path for all my scripts. But is there like an official PEP saying that pathlib.Path is the preferred way to deal with paths over strings? Or is it like just a community practice since pathlib has been in the standard library for so long?

My next question if anything in the standard library which only take strings, or that they all of them take Paths in python 3.10+, the minimum supported version of python that MDAnalysis supports? If pathlib.Path are basically first class objects on the same level of strings in python 3.10+ (like you mentioned in the os.path), then I see no reason to not return paths as pathlib.Path objects instead of strings.

@p-j-smith
Copy link
Member

the standard library still has modules that still takes paths as string (i.e. os.path)

Functions in os.path will take anything that is path-like, including a pathlib.Path

@orbeckst
Copy link
Member

I didn't know that os.path was so flexible. I learned something new.

Copy link
Member

@p-j-smith p-j-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @jauy123!

Blocking whilst we consider moving fetch_pdb into a new module e.g. mda.fetch. It feels like it would be better placed there, rather than in topology.PDBParser as it's not doing any topology parsing

@marinegor
Copy link
Contributor

Small suggestion here: could we perhaps make the default format not pdb.gz but cif.gz, in line with #5142, its fixes for #5089 and the fact that:

as of April 2025 3.7% of the PDB archive is not available in legacy PDB format

source

Or at least make it default if gemmi is installed?

@orbeckst
Copy link
Member

orbeckst commented Dec 5, 2025

I support @p-j-smith ‘s suggestion to create a new top-level MDAnalysis.fetch module. Then this function could be named from_PDB(). This makes for nice semantics: pth = fetch.from_PDB(“1ake”) and u = Universe(fetch.from_PDB(“4bwz”)). And later we may have fetch.from_alphafold() , fetch.from_CSD(), …

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the discussion, please refactor and create a new top-level MDAnalysis.fetch module to house your new function (and any future "fetch" functionality).

  • Create a new fetch module; organize tests and docs accordingly. The docs will get a new top-level section (after coordinates).
  • Create fetch/pdb.py and put your function into this file.
  • Name your function from_PDB(). (This makes for nice semantics: pth = fetch.from_PDB(“1ake”))
  • Import your function in fetch/__init__.py: from .pdb import from_PDB

(PS: Please also address the other comments.)

@jauy123
Copy link
Contributor Author

jauy123 commented Dec 18, 2025

Ok, I can get back to work on this since the semester ended.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Primarily add docs and integrate into the docs under package/doc/sphinx/source/documentation_pages. You'll need a new fetch_modules.rst and also link that file from ../index.rst. Check the rendered docs to see that your docs show up correctly. Put yourself in the shoes of someone who wants to learn more about this new functionality.

Changed default file_format to cif.gz
Chaned return type to pathlib.Path

Updated test appropriately
removed true_basename()
However, I don't know where to put it? And I kinda finicking around with the sphinx code in order to get it actually show up

	modified:   package/MDAnalysis/fetch/__init__.py
	modified:   package/MDAnalysis/fetch/pdb.py
	new file:   package/doc/sphinx/source/documentation_pages/auxiliary/fetch_init.rst
	new file:   package/doc/sphinx/source/documentation_pages/auxiliary/fetch_pdb.rst
	modified:   package/doc/sphinx/source/documentation_pages/auxiliary_modules.rst
@jauy123
Copy link
Contributor Author

jauy123 commented Dec 23, 2025

Comments addressed and fixed:
@p-j-smith comment on making pdb_ids required for from_pdb
@p-j-smith comment on making pdb_ids return pathlib.Path instead of strings.
@marinegor comment on making the default file format download as cif.gz instead of pdb.gz -- However since there is no reader for cif yet, a Universe can't actually read cif.gz yet.
@orbeckst comment on removing from_pdb from top level import
@orbeckst various comments on documentation -- I also added various amount of documentation

One thing that I need to do (and something that I need input for from the core devs) is where to place mda.fetch in the docs/sphinx code? I kinda shoehorned in under auxilary just to visualized what I was doing, but I don't see an obvious place where to put this in the web documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

mda.fetch_pdb() to generate Universe from Protein Databank structures

7 participants