Enabling of parallelization of `analysis.hydrogenbonds.WaterBridgeAnalysis` #5151

talagayev · 2025-11-20T12:20:32Z

Changes made in this Pull Request:

Addition of parallelization to WaterBridgeAnalysis class in analysis.hydrogenbonds.wbridge_analysis.
Addition of client_WaterBridgeAnalysis to conftest.py.
Adjustment of tests in test_wbridge.py to work with parallelization.

Here currently I adjusted analysis.hydrogenbonds.wbridge_analysis to work with parallelization and added an example of pytests that will show that it works.

The problem that I have encountered is that I had to modify universe_DA_PBC, since if I directly add the client_WaterBridgeAnalysis it will led to an infinite pytest on my local PC, mainly connected with us using here StringIO that appearantly has difficulties with parallelization.

The second thing is that currently the tests in test_wbridge.py are mainly with 1 frame in mind, so to showcase that parallelization works I could add additional tests similar like universe_DA_PBC_10frames().

PR Checklist

Issue raised/referenced?
Tests updated/added?
Documentation updated/added?
package/CHANGELOG file updated?
Is your name in package/AUTHORS? (If it is not, add it!)

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.

📚 Documentation preview 📚: https://mdanalysis--5151.org.readthedocs.build/en/5151/

codecov · 2025-11-20T12:30:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.72%. Comparing base (6b51eac) to head (3b4812f).
⚠️ Report is 1 commits behind head on develop.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #5151   +/-   ##
========================================
  Coverage    92.72%   92.72%           
========================================
  Files          180      180           
  Lines        22458    22466    +8     
  Branches      3186     3187    +1     
========================================
+ Hits         20824    20832    +8     
  Misses        1177     1177           
  Partials       457      457

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

talagayev · 2025-11-20T13:25:24Z

@marinegor I would ping you in this PR, since it is parallelization :)

Here I would need your opinion on how to approach the pytest cases. I didn't add client_WaterBridgeAnalysis to the tests with StringIO since these would lead to the tests being infinite. The question is, is there any goaround for parallelization with StringIO or do we just modify the pytests to work?

Also as mentioned with the pytests being only for 1 frame, should I add for each test cases with multiple frames and test them as in the example that I have now or what would be the prefereable approach?

marinegor · 2025-11-20T19:09:19Z

@talagayev I don't see a reason why these files take StringIO instead of just a filename, hence I'd suggest re-writing these tests to accept filenames, and put respective filenames (ADA.gro, DWA.gro, ...) in e.g. testsuite/MDAnalysisTests/data or in a new folder testsuite/MDAnalysisTests/data/waterbridge. Then everything should work as for other classes, I guess.

talagayev · 2025-11-21T00:07:51Z

@talagayev I don't see a reason why these files take StringIO instead of just a filename, hence I'd suggest re-writing these tests to accept filenames, and put respective filenames (ADA.gro, DWA.gro, ...) in e.g. testsuite/MDAnalysisTests/data or in a new folder testsuite/MDAnalysisTests/data/waterbridge. Then everything should work as for other classes, I guess.

Agree, we can do them similar to the other cases with the files. I will create the test filles, put them into testsuite/MDAnalysisTests/data/waterbridge and adjust the pytests to use them. I will then do for now for the tests the case where I check for now for for each case a test with 1 frame, basically as they are now and a test for multiple frames, so similar like the added test_donor_accepter_pbc_multi, just name them more correspondingly to the tests with 1 frame.

marinegor · 2025-11-26T15:41:41Z

@talagayev ok, sure! just @ me here or request a review when you're done.

talagayev · 2025-11-26T19:59:35Z

@marinegor okay should be ready to be reviewed now, had to do adjust timesteps for parallelization cases, due to an error that was appearing, mainly that one that was appearing

analysis_func = <function TestWaterBridgeAnalysis.test_count_by_time_weight.<locals>.analysis at 0x000002322B345A80>, kwargs = {}, result = []

    def count_by_time(self, analysis_func=None, **kwargs):
        """Counts the number of water bridges per timestep.

        The counting behaviour can be adjusted by supplying analysis_func.
        See :ref:`wb_count_by_time` for details.

        Returns
        -------
        counts : list
             Returns a time series ``N(t)`` where ``N`` is the total
             number of observed water bridges at time ``t``.

        """
        if analysis_func is None:
            analysis_func = self._count_by_time_analysis
        if self.results.network:
            result = []
>           for time, frame in zip(self.timesteps, self.results.network):
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E           TypeError: 'NoneType' object is not iterable

..\..\..\package\MDAnalysis\analysis\hydrogenbonds\wbridge_analysis.py:1966: TypeError

So had to adjust timesteps to not be NoneType during parallelization.

As for the testfiles, I did create those that were created in the tests, put them in the data and import them similar as in the other tests.

talagayev · 2025-11-26T23:48:26Z

@marinegor Ah and I can update the CHANGELOG and add the Documentation if this looks good, wasn't sure if something needs to be modified so I kept it for last :)

marinegor · 2025-11-26T23:57:55Z

@talagayev

I'm slightly worried about these timesteps -- shouldn't they be actual timestamps, and not just consecutive integers?

I.e. if I modify lines 1954-ish it to:

        if self.results.network:
            result = []
            print(f'{self.timesteps=}')
            assert False
            for time, frame in zip(self.timesteps, self.results.network):

and run tests, I get [0.0] before assertion error.

I guess the issue here is that attribute timesteps are implicitly expected to exist after the run is finished, which is not the case when we run it in parallel. Hence they should be moved to self.results.timesteps if we rely on them in _conclude or elsewhere.

It's weirdly the only class that actually does this:

❯ rg -t py self.timesteps
package/MDAnalysis/analysis/hydrogenbonds/wbridge_analysis.py
1027:        self.timesteps = None  # time for each frame
1314:        self.timesteps = []
1409:        self.timesteps.append(self._ts.time)
1969:        if self.timesteps is None:
1972:            timesteps = self.timesteps
2036:            if self.timesteps is None:
2039:                timesteps = self.timesteps
2140:        for t, hframe in zip(self.timesteps, timeseries):

@MDAnalysis/coredevs do you have any opinion on moving timesteps to results.timesteps instead? If I'm understanding correctly, the right way to do that is to deprecate self.timesteps in favor of self.results.timesteps, while for now copying self.results.timesteps to timesteps.

talagayev · 2025-11-27T00:48:22Z

@talagayev

I'm slightly worried about these timesteps -- shouldn't they be actual timestamps, and not just consecutive integers?

I.e. if I modify lines 1954-ish it to:
        if self.results.network:
            result = []
            print(f'{self.timesteps=}')
            assert False
            for time, frame in zip(self.timesteps, self.results.network):
and run tests, I get [0.0] before assertion error.

I guess the issue here is that attribute timesteps are implicitly expected to exist after the run is finished, which is not the case when we run it in parallel. Hence they should be moved to self.results.timesteps if we rely on them in _conclude or elsewhere.

It's weirdly the only class that actually does this:
❯ rg -t py self.timesteps
package/MDAnalysis/analysis/hydrogenbonds/wbridge_analysis.py
1027:        self.timesteps = None  # time for each frame
1314:        self.timesteps = []
1409:        self.timesteps.append(self._ts.time)
1969:        if self.timesteps is None:
1972:            timesteps = self.timesteps
2036:            if self.timesteps is None:
2039:                timesteps = self.timesteps
2140:        for t, hframe in zip(self.timesteps, timeseries):
@MDAnalysis/coredevs do you have any opinion on moving timesteps to results.timesteps instead? If I'm understanding correctly, the right way to do that is to deprecate self.timesteps in favor of self.results.timesteps, while for now copying self.results.timesteps to timesteps.

True, yes was not sure about the solution, if it will work for other cases, that are not covered by the pytests 🙈 Yes should be timesteps, which is currently tricky to get during multiprocessing, which is as you mentioned that they have to exist after the run, which works well with serial, but difficult with multiprocessing.

hm, yes the moving to of timesteps to results.timesteps could be an option I think, would need to look more into it as well.

marinegor · 2025-11-27T00:54:21Z

I vaguely remember having this conversation already, actually.

I think #3720 and #3735 (reading changelog) implemented something like that, so we could in principle do this right away too🫣 But I need other coderev opinions.

talagayev · 2025-11-27T01:18:55Z

I vaguely remember having this conversation already, actually.

I think #3720 and #3735 (reading changelog) implemented something like that, so we could in principle do this right away too🫣 But I need other coderev opinions.

Yep, sounds good, let's see what the others say, in the meantime I can check the PR and how it was done there.

orbeckst · 2025-11-27T01:30:26Z

timesteps is not part of the AnalysisBase API (as far as I can tell) so I don't see an issue putting it under results.

If it was documented as a top level attribute of the specific class before then you have to deprecate it and keep a pointer around until we can remove it in 3.0.

(Apologies, quick comment before I am now starting my 🦃 days.)

IAlibay · 2025-11-27T01:53:22Z

testsuite/MDAnalysisTests/analysis/test_wbridge.py

-    4ALA      H    4   0.500   0.000   0.000
-    4ALA      N    5   0.600   0.000   0.000
-  1.0   1.0   1.0"""
-        u = MDAnalysis.Universe(StringIO(grofile), format="gro")


I'm not sure I follow the whole StringIO thing - but being able to read a StringIO is a core component of MDAnalysis. Are we saying that this is no longer possible?

The problem in the current iteration is that when StringIO is combined with parallelization it led to infinite pytests, so reading StringIO + doing analysis with dask/multiprocessing is not compatible. Maybe an option would to add then an exception case and some check before the running of an analysis if the file was read from StringIO 🤔

@talagayev can you perhaps provide a minimal running example of what was happening with StringIO, so that we could try running it locally and see the infiniteness for ourselves?

yes, I can try maybe to do a small push, put one StringIO case back so that you can then git clone it and run/test locally. Can do it in around 30 minutes.

okay got the test_loop to the intial state where it reads the StringIO to create the universe. It currently fails the tests here due to the time limit, so you can git clone and run it locally

for anyone wondering, here's the standalone code:

from io import StringIO import MDAnalysis as mda from MDAnalysis.analysis.hydrogenbonds.wbridge_analysis import ( WaterBridgeAnalysis, ) s = """Test gro file 5 1ALA O 1 0.000 0.001 0.000 2SOL OW 2 0.300 0.001 0.000 2SOL HW1 3 0.200 0.002 0.000 2SOL HW2 4 0.200 0.000 0.000 4ALA O 5 0.600 0.000 0.000 1.0 1.0 1.0""" grofile = StringIO(s) wb = WaterBridgeAnalysis( mda.Universe(grofile, format="gro"), "protein and (resid 1)", "protein and (resid 1 or resid 4)", ) wb.run(backend="multiprocessing")

and here's the stacktrace:

In [9]: wb.run(backend='multiprocessing') /home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/reduction.py:51: UserWarning: Reader has no dt information, set to 1.0 ps cls(buf, protocol).dump(obj) Process ForkPoolWorker-1: Traceback (most recent call last): File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/process.py", line 313, in _bootstrap self.run() ~~~~~~~~^^ File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/pool.py", line 114, in worker task = get() File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/queues.py", line 387, in get return _ForkingPickler.loads(res) ~~~~~~~~~~~~~~~~~~~~~^^^^^ File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__ return getattr(self.stream, x) ^^^^^^^^^^^ File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__ return getattr(self.stream, x) ^^^^^^^^^^^ File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__ return getattr(self.stream, x) ^^^^^^^^^^^ [Previous line repeated 2970 more times] RecursionError: maximum recursion depth exceeded Exception ignored in: <function NamedStream.__del__ at 0x70e968095800> Traceback (most recent call last): File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__ return getattr(self.stream, x) File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__ return getattr(self.stream, x) File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__ return getattr(self.stream, x) [Previous line repeated 997 more times] RecursionError: maximum recursion depth exceeded ^CProcess ForkPoolWorker-2:

@talagayev I'd suggest to make a separate issue out of it and not include it in this PR.

not include it in this PR

If you decide to move everything out of the StringIO representation, I would ask that you open an issue to fix it once the underlying issue is fixed.

I'd prefer if we didn't need to add all these files and could continue using the in-line test snippets.

More broadly, before working around issues we should understand what's going wrong.

get one StringIO case back for showcase

add missing

import addition

marinegor · 2025-11-27T21:54:27Z

also @talagayev you can have a look at InterRDS_s and DeprecationWarning there to have a look at how things get moved to results. I'd do something similar here as well.

marinegor · 2025-11-27T22:06:29Z

package/MDAnalysis/analysis/hydrogenbonds/wbridge_analysis.py

+        if self.timesteps is None:
+            timesteps = range(len(self.results.network))
+        else:
+            timesteps = self.timesteps


should be actual timesteps, and not integer indices. Solution would be to save them to results._timesteps and later use them where needed, while adding deprecation warning to self.timesteps alike InterRDS_s's approach.

IAlibay · 2025-11-28T01:06:48Z

testsuite/MDAnalysisTests/datafiles.py

    "SURFACE_PDB",  # 111 FCC lattice topology for NSGrid bug #2345
    "SURFACE_TRR",  # full precision coordinates for NSGrid bug #2345
    "DSSP",  # DSSP test suite
+    "WB_AD",


Please add desriptors for all these files, the filenames themselves won't make a ton of sense a couple months down the line.

IAlibay · 2025-11-28T01:08:03Z

testsuite/MDAnalysisTests/analysis/conftest.py

    return request.param
+
+
+# MDAnalysis.analysis.hydrogenbonds.wbridge_analysis


orbeckst · 2025-12-01T16:40:00Z

@talagayev why is StringIO not working here?

marinegor · 2025-12-01T17:37:47Z

I pasted traceback above -- seems that StringIO doesn't work well with multiprocessing.

talagayev · 2025-12-01T18:22:37Z

I pasted traceback above -- seems that StringIO doesn't work well with multiprocessing.

Could it be connected somehow to the case, that at some point in the parallelization if it uses ReaderBase to read the file it would try to read the file and would not be able to access it, since it gets only a string that is not pickable and iteratable due to this part of the code in the coordinates/base.py? Since here he only exception being NamedStream files.

    if isinstance(filename, NamedStream):
        self.filename = filename
    else:
        self.filename = str(filename)

maybe we can adjust it somehow for StringIO cases. How is it currently regulated with NamedStream cases? I assume those would possibly also fall into the same line of them being difficult to parallelize?

marinegor · 2025-12-14T16:11:25Z

@orbeckst @IAlibay do you have any feedback on moving StringIO-based tests for this PR?

I honestly see this as a separate issue, which we totally must raise (@talagayev please tell me if you wanna do that, otherwise I'll do it). But, I don't see why it would block this particular PR enabling parallelization.

@talagayev could you please implement all non-StringIO related changes requested by me and Irfan and re-request review when it's ready?

orbeckst · 2025-12-15T15:52:47Z

Do you suggest to make the tests here run with file-based access and then revert to StringIO when it's fixed?

marinegor · 2025-12-15T16:03:18Z

I generally do not understand why particularly here tests are StringIO-based -- it doesn't seem to be that common afaik. I'd suggest making them file-based (as the rest of the tests) and raising an issue about using StringIO topology/universe with non-serial backends.

orbeckst · 2025-12-15T16:46:28Z

StringIO cuts down on the number of files to ship and it's neat when you only have mini-files that you can keep in-line.

orbeckst

Although I prefer the StringIO-based testing here, I am ok with moving forward with the file-based workaround, provided you raise an issue and add comments to tests.filenames referencing this PR, the new issue, and a note that we may want to turn files back into in-line when the issue about StringIO/multiprocessing/pickling has been resolved.

marinegor · 2025-12-15T16:56:47Z

I agree, it reduces numbers of files indeed, and is more readable. But in this particular case I feel like passing tests for the class that actually should run on real-world data (=files) could have a higher priority. I'm not sure if StringIO is ever used in real-world scenarios, and it feels weird to me to block this PR because of the (inconsistent with other AnalysisBase subclasses) way tests are written here. And indeed as you mentioned, we can indeed revert to StringIO when the issue with it in parallel context is resolved.

IAlibay · 2025-12-15T17:24:07Z

@marinegor StringIO is very frequently used in the real world, please do not discount it as a second class citizen, otherwise a chunk of the ecosystem is going to fall apart.

marinegor · 2025-12-15T17:33:29Z

Sorry if I came across a bit too strong, I don't have problems with StringIO in general. But in this particular case, seems that - StringIO doesn't work for any non-serial run for any AnalysisBase subclass - this particular PR is correct, and fails because of the issue above Hence I suggest not fixing StringIO issue here but rather raising it separately, while accepting this PR (and switching it to file-based testing).

talagayev · 2025-12-15T17:35:48Z

@orbeckst @IAlibay do you have any feedback on moving StringIO-based tests for this PR?

I honestly see this as a separate issue, which we totally must raise (@talagayev please tell me if you wanna do that, otherwise I'll do it). But, I don't see why it would block this particular PR enabling parallelization.

@talagayev could you please implement all non-StringIO related changes requested by me and Irfan and re-request review when it's ready?

Yup can do, but will quickly wait for an decision how we proceed with StringIO and can then adjust the code and create the Issue.

added parallelization to WaterBridgeAnalysis

de9cd4c

talagayev added 4 commits November 25, 2025 03:30

adjusted the test files for WB

84c74d0

black formatting

666aabf

removed files that were moved

946ec67

black formatting

8db21b4

fixed case when empty results with parallelization

3b4812f

marinegor self-requested a review November 26, 2025 23:31

remove unused imports

dd2e21b

IAlibay reviewed Nov 27, 2025

View reviewed changes

talagayev added 4 commits November 27, 2025 13:00

Merge branch 'develop' into wb_parallel

6708442

Update test_wbridge.py

d6bdae9

get one StringIO case back for showcase

Update conftest.py

e10772c

add missing

Update test_wbridge.py

fca6ca2

import addition

marinegor requested changes Nov 27, 2025

View reviewed changes

IAlibay requested changes Nov 28, 2025

View reviewed changes

orbeckst reviewed Dec 15, 2025

View reviewed changes

		return request.param


		# MDAnalysis.analysis.hydrogenbonds.wbridge_analysis

Enabling of parallelization of analysis.hydrogenbonds.WaterBridgeAnalysis #5151

Are you sure you want to change the base?

Enabling of parallelization of analysis.hydrogenbonds.WaterBridgeAnalysis #5151

Uh oh!

Conversation

talagayev commented Nov 20, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Checklist

Developers Certificate of Origin

Uh oh!

codecov bot commented Nov 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

talagayev commented Nov 20, 2025

Uh oh!

marinegor commented Nov 20, 2025

Uh oh!

talagayev commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marinegor commented Nov 26, 2025

Uh oh!

talagayev commented Nov 26, 2025

Uh oh!

talagayev commented Nov 26, 2025

Uh oh!

marinegor commented Nov 26, 2025

Uh oh!

talagayev commented Nov 27, 2025

Uh oh!

marinegor commented Nov 27, 2025

Uh oh!

talagayev commented Nov 27, 2025

Uh oh!

orbeckst commented Nov 27, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

marinegor commented Nov 27, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

orbeckst commented Dec 1, 2025

Uh oh!

marinegor commented Dec 1, 2025 via email

Uh oh!

talagayev commented Dec 1, 2025

Uh oh!

marinegor commented Dec 14, 2025

Uh oh!

orbeckst commented Dec 15, 2025

Uh oh!

marinegor commented Dec 15, 2025 via email

Uh oh!

orbeckst commented Dec 15, 2025

Uh oh!

orbeckst left a comment

Choose a reason for hiding this comment

Uh oh!

marinegor commented Dec 15, 2025 via email

Uh oh!

Enabling of parallelization of `analysis.hydrogenbonds.WaterBridgeAnalysis` #5151

Enabling of parallelization of `analysis.hydrogenbonds.WaterBridgeAnalysis` #5151

talagayev commented Nov 20, 2025 •

edited by github-actions bot

Loading

codecov bot commented Nov 20, 2025 •

edited

Loading

talagayev commented Nov 21, 2025 •

edited

Loading