-
Notifications
You must be signed in to change notification settings - Fork 761
Enabling of parallelization of analysis.hydrogenbonds.WaterBridgeAnalysis
#5151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #5151 +/- ##
========================================
Coverage 92.72% 92.72%
========================================
Files 180 180
Lines 22458 22466 +8
Branches 3186 3187 +1
========================================
+ Hits 20824 20832 +8
Misses 1177 1177
Partials 457 457 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@marinegor I would ping you in this PR, since it is parallelization :) Here I would need your opinion on how to approach the pytest cases. I didn't add Also as mentioned with the pytests being only for 1 frame, should I add for each test cases with multiple frames and test them as in the example that I have now or what would be the prefereable approach? |
|
@talagayev I don't see a reason why these files take |
Agree, we can do them similar to the other cases with the files. I will create the test filles, put them into |
|
@talagayev ok, sure! just |
|
@marinegor okay should be ready to be reviewed now, had to do adjust So had to adjust As for the testfiles, I did create those that were created in the tests, put them in the |
|
@marinegor Ah and I can update the |
|
I'm slightly worried about these timesteps -- shouldn't they be actual timestamps, and not just consecutive integers? I.e. if I modify lines 1954-ish it to: if self.results.network:
result = []
print(f'{self.timesteps=}')
assert False
for time, frame in zip(self.timesteps, self.results.network):and run tests, I get I guess the issue here is that attribute It's weirdly the only class that actually does this: @MDAnalysis/coredevs do you have any opinion on moving |
True, yes was not sure about the solution, if it will work for other cases, that are not covered by the pytests 🙈 Yes should be timesteps, which is currently tricky to get during multiprocessing, which is as you mentioned that they have to exist after the run, which works well with serial, but difficult with multiprocessing. hm, yes the moving to of |
Yep, sounds good, let's see what the others say, in the meantime I can check the PR and how it was done there. |
|
If it was documented as a top level attribute of the specific class before then you have to deprecate it and keep a pointer around until we can remove it in 3.0. (Apologies, quick comment before I am now starting my 🦃 days.) |
| 4ALA H 4 0.500 0.000 0.000 | ||
| 4ALA N 5 0.600 0.000 0.000 | ||
| 1.0 1.0 1.0""" | ||
| u = MDAnalysis.Universe(StringIO(grofile), format="gro") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow the whole StringIO thing - but being able to read a StringIO is a core component of MDAnalysis. Are we saying that this is no longer possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem in the current iteration is that when StringIO is combined with parallelization it led to infinite pytests, so reading StringIO + doing analysis with dask/multiprocessing is not compatible. Maybe an option would to add then an exception case and some check before the running of an analysis if the file was read from StringIO 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@talagayev can you perhaps provide a minimal running example of what was happening with StringIO, so that we could try running it locally and see the infiniteness for ourselves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I can try maybe to do a small push, put one StringIO case back so that you can then git clone it and run/test locally. Can do it in around 30 minutes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay got the test_loop to the intial state where it reads the StringIO to create the universe. It currently fails the tests here due to the time limit, so you can git clone and run it locally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for anyone wondering, here's the standalone code:
from io import StringIO
import MDAnalysis as mda
from MDAnalysis.analysis.hydrogenbonds.wbridge_analysis import (
WaterBridgeAnalysis,
)
s = """Test gro file
5
1ALA O 1 0.000 0.001 0.000
2SOL OW 2 0.300 0.001 0.000
2SOL HW1 3 0.200 0.002 0.000
2SOL HW2 4 0.200 0.000 0.000
4ALA O 5 0.600 0.000 0.000
1.0 1.0 1.0"""
grofile = StringIO(s)
wb = WaterBridgeAnalysis(
mda.Universe(grofile, format="gro"),
"protein and (resid 1)",
"protein and (resid 1 or resid 4)",
)
wb.run(backend="multiprocessing")and here's the stacktrace:
In [9]: wb.run(backend='multiprocessing')
/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/reduction.py:51: UserWarning: Reader has no dt information, set to 1.0 ps
cls(buf, protocol).dump(obj)
Process ForkPoolWorker-1:
Traceback (most recent call last):
File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/process.py", line 313, in _bootstrap
self.run()
~~~~~~~~^^
File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/pool.py", line 114, in worker
task = get()
File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/queues.py", line 387, in get
return _ForkingPickler.loads(res)
~~~~~~~~~~~~~~~~~~~~~^^^^^
File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
return getattr(self.stream, x)
^^^^^^^^^^^
File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
return getattr(self.stream, x)
^^^^^^^^^^^
File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
return getattr(self.stream, x)
^^^^^^^^^^^
[Previous line repeated 2970 more times]
RecursionError: maximum recursion depth exceeded
Exception ignored in: <function NamedStream.__del__ at 0x70e968095800>
Traceback (most recent call last):
File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
return getattr(self.stream, x)
File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
return getattr(self.stream, x)
File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
return getattr(self.stream, x)
[Previous line repeated 997 more times]
RecursionError: maximum recursion depth exceeded
^CProcess ForkPoolWorker-2:
@talagayev I'd suggest to make a separate issue out of it and not include it in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not include it in this PR
If you decide to move everything out of the StringIO representation, I would ask that you open an issue to fix it once the underlying issue is fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer if we didn't need to add all these files and could continue using the in-line test snippets.
More broadly, before working around issues we should understand what's going wrong.
get one StringIO case back for showcase
add missing
import addition
|
also @talagayev you can have a look at |
| if self.timesteps is None: | ||
| timesteps = range(len(self.results.network)) | ||
| else: | ||
| timesteps = self.timesteps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be actual timesteps, and not integer indices. Solution would be to save them to results._timesteps and later use them where needed, while adding deprecation warning to self.timesteps alike InterRDS_s's approach.
| "SURFACE_PDB", # 111 FCC lattice topology for NSGrid bug #2345 | ||
| "SURFACE_TRR", # full precision coordinates for NSGrid bug #2345 | ||
| "DSSP", # DSSP test suite | ||
| "WB_AD", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add desriptors for all these files, the filenames themselves won't make a ton of sense a couple months down the line.
| return request.param | ||
|
|
||
|
|
||
| # MDAnalysis.analysis.hydrogenbonds.wbridge_analysis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove?
|
@talagayev why is StringIO not working here? |
|
I pasted traceback above -- seems that StringIO doesn't work well with multiprocessing.
|
Could it be connected somehow to the case, that at some point in the parallelization if it uses maybe we can adjust it somehow for |
|
@orbeckst @IAlibay do you have any feedback on moving I honestly see this as a separate issue, which we totally must raise (@talagayev please tell me if you wanna do that, otherwise I'll do it). But, I don't see why it would block this particular PR enabling parallelization. @talagayev could you please implement all non-StringIO related changes requested by me and Irfan and re-request review when it's ready? |
|
Do you suggest to make the tests here run with file-based access and then revert to StringIO when it's fixed? |
|
I generally do not understand why particularly here tests are StringIO-based -- it doesn't seem to be that common afaik.
I'd suggest making them file-based (as the rest of the tests) and raising an issue about using StringIO topology/universe with non-serial backends.
|
|
StringIO cuts down on the number of files to ship and it's neat when you only have mini-files that you can keep in-line. |
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I prefer the StringIO-based testing here, I am ok with moving forward with the file-based workaround, provided you raise an issue and add comments to tests.filenames referencing this PR, the new issue, and a note that we may want to turn files back into in-line when the issue about StringIO/multiprocessing/pickling has been resolved.
|
I agree, it reduces numbers of files indeed, and is more readable. But in this particular case I feel like passing tests for the class that actually should run on real-world data (=files) could have a higher priority. I'm not sure if StringIO is ever used in real-world scenarios, and it feels weird to me to block this PR because of the (inconsistent with other AnalysisBase subclasses) way tests are written here.
And indeed as you mentioned, we can indeed revert to StringIO when the issue with it in parallel context is resolved.
|
|
@marinegor StringIO is very frequently used in the real world, please do not discount it as a second class citizen, otherwise a chunk of the ecosystem is going to fall apart. |
|
Sorry if I came across a bit too strong, I don't have problems with StringIO in general. But in this particular case, seems that
- StringIO doesn't work for any non-serial run for any AnalysisBase subclass
- this particular PR is correct, and fails because of the issue above
Hence I suggest not fixing StringIO issue here but rather raising it separately, while accepting this PR (and switching it to file-based testing).
|
Yup can do, but will quickly wait for an decision how we proceed with |
Fixes #4666.
Changes made in this Pull Request:
WaterBridgeAnalysisclass inanalysis.hydrogenbonds.wbridge_analysis.client_WaterBridgeAnalysistoconftest.py.test_wbridge.pyto work with parallelization.Here currently I adjusted
analysis.hydrogenbonds.wbridge_analysisto work with parallelization and added an example of pytests that will show that it works.The problem that I have encountered is that I had to modify
universe_DA_PBC, since if I directly add theclient_WaterBridgeAnalysisit will led to an infinite pytest on my local PC, mainly connected with us using hereStringIOthat appearantly has difficulties with parallelization.The second thing is that currently the tests in
test_wbridge.pyare mainly with 1 frame in mind, so to showcase that parallelization works I could add additional tests similar likeuniverse_DA_PBC_10frames().PR Checklist
package/CHANGELOGfile updated?package/AUTHORS? (If it is not, add it!)Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--5151.org.readthedocs.build/en/5151/