Conversation

@caetera caetera commented Mar 20, 2025

Preparation for release 1.5.0

Changes

  • Code is fully migrated to .NET 8; thus, the Mono dependency is obsolete
  • Added an action for automatic building and testing on all code changes
  • Added an action to produce self-contained and framework-based binaries for all major platforms on release
  • Parquet format is updated to version 0.2
  • The charge state array can be included in mzML and MGF output (closes #189: include charge state array)

caetera and others added 18 commits October 18, 2024 19:55
Known issues:
Parquet needs refactoring - new API
AWS S3 needs testing
Tests cannot be executed
* Log4net configuration moved to App.config
* Fixed unhandled exception during mzML precursor creation
* Explicit usings
* Properties (some build/debug information from VS) ignored

TODO Check all "possibly null reference" warnings
The same precursor selection logic as for other file formats
TBD: Move some of the service functions to the SpectrumWriter base class
Use the same method for mass-spec data creation
mzParquet writer v0.2 full implementation
@caetera caetera requested a review from ypriverol March 20, 2025 14:27
@ypriverol ypriverol linked an issue Mar 24, 2025 that may be closed by this pull request
@caetera caetera mentioned this pull request Mar 24, 2025
@caetera caetera mentioned this pull request Apr 9, 2025
@timosachsenberg

awesome :)

@ypriverol

@caetera do you think you can produce some examples of the parquet implementation so we can see how it looks? The idea is to have some parquet users evaluate the implementation.

caetera commented Sep 16, 2025

@ypriverol I certainly can convert a few files to parquet format. What kind of files do you want to test on? Are there any specific requirements?

@ypriverol

I think having a few examples from different instruments would be good for understanding how the format works. I can try to visualize them here; it is only to see whether we need to extend the current implementation further.

caetera commented Sep 16, 2025

I shared some files here - https://syddanskuni-my.sharepoint.com/:f:/g/personal/vgor_bmb_sdu_dk/EiXlHHyZY35LjvUCX_J9eJ0BJXMBbL7lqvQ2HkKqiyEzyA?e=Ikk9au converted with the latest build (5bac912)

File names are [Instrument]-[MS1Mode]-[MS2Mode]_[Other]

OTc - Orbitrap in centroided mode
OTp - Orbitrap in profile mode
ITc - Ion Trap in centroided mode
ITp - Ion Trap in profile mode

I converted them to mzparquet with vendor centroiding enabled (default) and disabled.

@ypriverol

> I shared some files here - https://syddanskuni-my.sharepoint.com/:f:/g/personal/vgor_bmb_sdu_dk/EiXlHHyZY35LjvUCX_J9eJ0BJXMBbL7lqvQ2HkKqiyEzyA?e=Ikk9au converted with the latest build (5bac912)
> […]

Thanks, I will take a look.

@mobiusklein

I reviewed a few of the mzparquet files, but had some questions. A lot of this stems from my lack of familiarity with the intended scope of mzparquet though.

Would it make sense to make the mass analyzer a column? Looking at an Orbitrap-Astral file, the only way to know that a scan was acquired in the Astral and not the Orbitrap is to assume the MS1 scans were Orbitrap and the MS2 scans were Astral. The same goes for Tribrid instruments, where you would want to know whether the ion trap or the Orbitrap was used for the MS2 spectra.

The same applies for centroiding. I get the impression that mzparquet is primarily intended as a container for centroid data, but you have a bunch of profile-mode files here too. Also, Thermo's profile spectra are all zero-padded, so you get long runs of m/z-intensity pairs where the intensity is zero but the m/z is just incrementing by a step function. The 0s can be RLE'd away if there are enough of them to justify using RLE as the encoding for that data page, but the m/z values are stored as-is, and they are the most expensive thing in the file. For https://github.com/mobiusklein/mzpeak_prototyping, I have been dropping long runs of zeros and retaining only the flanking zeros, without a substantial loss of information.
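The zero-run trimming described above can be sketched as follows. This is a minimal illustration of the idea, not code from mzpeak_prototyping; the function name and the `keep` parameter are hypothetical.

```python
def trim_zero_runs(mz, intensity, keep=1):
    """Drop the interior of long zero-intensity runs, retaining up to
    `keep` flanking zero points at each end of a run so peak boundaries
    stay interpretable. Returns filtered (mz, intensity) lists."""
    n = len(intensity)
    keep_mask = [False] * n
    i = 0
    while i < n:
        if intensity[i] != 0.0:
            keep_mask[i] = True
            i += 1
            continue
        # Found a zero run spanning [i, j)
        j = i
        while j < n and intensity[j] == 0.0:
            j += 1
        # Keep up to `keep` zeros at each end of the run
        for k in range(i, min(i + keep, j)):
            keep_mask[k] = True
        for k in range(max(j - keep, i), j):
            keep_mask[k] = True
        i = j
    mz_out = [m for m, f in zip(mz, keep_mask) if f]
    it_out = [v for v, f in zip(intensity, keep_mask) if f]
    return mz_out, it_out
```

Since only interior zeros of a run are dropped, the step structure around each real peak is preserved while the bulk of the padding (and the expensive m/z values that go with it) disappears.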

How would you represent a dataset with mixed dissociation methods, like HCD and ETD, or EThcD? The schema as-is assumes that all you wish to know is whether there is dissociation or not, via the MS level.

timosachsenberg commented Sep 17, 2025

@mobiusklein raises some good points, and the answer depends somewhat on the intended scope of mzParquet. As I understand it, mzParquet is primarily designed as a low-metadata format, for example to support quick visualization of spectra. If the scope were to be extended to use cases such as database searches, the mentioned columns could be valuable in enabling those applications. Ultimately, a more complete approach like mzpeak is needed to support all analyses - but I think this is not in the scope of mzparquet (correct me if I am wrong).

@ypriverol

Thanks @mobiusklein @timosachsenberg @caetera for the feedback. @lazear, it would be great if you could also give some feedback here. mzparquet is not a PSI format and does not aim to replace mzML or other file formats. Instead, it targets the following use cases:

  • to produce a parquet-based format for querying MS2 spectra, for use cases like visualization, clustering, and AI training
  • a possible replacement for MS2-level files, like MGF or MS2, that search engines and other tools can build on top of
  • a possible storage system for archiving results; I can see people querying spectra with parquet for novel PTMs, etc.

These are the major use cases I see.

caetera commented Sep 17, 2025

I can comment on the use of profile data. Profile output is implemented mostly to be consistent with the other output formats; I don't know if it is ever going to be used in real-life applications. Centroided data is the default for all output formats, and to the best of my knowledge, profile data is rarely required. However, I believe it is less confusing if the switch_off_centroiding flag works the same way for all output formats.

If you believe that mzparquet and/or MGF should always imply centroiding, that can be implemented.

mobiusklein commented Sep 17, 2025

MGF is by definition a peak list format, according to Mascot's website: https://www.matrixscience.com/help/data_file_help.html. If mzparquet is a binary replacement for that same use-case, then it doesn't make sense to store profile data in that format.

MGF doesn't specify how to signify dissociation method, mass analyzer, or most other things about a spectrum either, but many tools have learned to shove that information in through custom parameters or by way of the TITLE or INSTRUMENT parameters. The values there don't have any globally controlled rules either, they're just defined by context. The fewer things you promise to define in a format, the fewer places where the consensus reality of how it is used in practice can break.

Given what's described, it's probably workable without those extra columns since we've gotten by without it for so long, but it leaves it up to the consuming tools to decide what to do or not without that information. Do those tools already exist with extant use-cases or support for other formats?

lazear commented Sep 17, 2025

I view mzparquet similarly to mzML, MGF, MS2 - centroided input for a search engine, visualization tools, etc. In my view (not reflective of PSI's view most likely), these are 'necessary evil' intermediate file formats to sidestep proprietary vendor libraries for reading MS data, not intended to necessarily capture 100% of the metadata from the raw files - that information can just be extracted from the raw files if required. In my perfect world, mzML/mzparquet don't exist, because vendors have open-sourced their file formats.

mzparquet reflects the need for a lighter-weight, higher performance mzML/MGF replacement. In contrast to the other formats, the data is readable with pandas/polars/distributed query engines/etc, making it easier to integrate data for both interactive data exploration use cases or repository-scale analyses.

The nice thing about a parquet-defined format is that we could define the required minimum usable set of columns (m/z, intensity, rt, ion-mobility, etc) that cover 90% of use-cases. Additional columns could be added for mass analyzers or whatever other metadata is desired, and downstream applications that don't need those columns can simply skip reading them - it is essentially zero cost (outside of storage space) to add new columns, something not possible with mzML, etc. More sophisticated strategies for building row groups could account for things like DIA isolation windows to enable really fast targeted extraction.

@ypriverol

@lazear can you test the current implementation to see if it works out of the box with SAGE?

lazear commented Sep 17, 2025

AFAIK there are no public users of mzparquet (including sage). I have some internal tools that use it, and I can test it out.

TRFP already had an obscure parquet based format that no one was using, so this was just a swap to another format no one is using yet :)

ypriverol commented Sep 17, 2025

I know; the idea is to have an implementation that at least some groups could potentially use. For example, we are planning to use it in quantms.io for clustering, so that is one and a half users. Better than the previous obscure version.

BTW, if mzparquet is not something definitive, and keeping in mind the main use case @lazear described, that we are aiming for an MGF-style replacement, it would be great if @mobiusklein @timosachsenberg @lazear @caetera and I could share ideas about how to improve it within TRFP.

@ypriverol

@caetera, another thing. This new version of the tool is a big change in the architecture: no Mono dependency anymore, a new parquet format, etc. Should we increase to 2.0.0 rather than continue in 1.X.X versions? What do you think @caetera @timosachsenberg @mobiusklein

caetera commented Sep 18, 2025

In the current implementation, mzparquet is well-suited for visualization, XIC extraction, etc., but if it is also expected to be used by search engines, adding columns such as activation type, collision energy, and supplementary activation type and energy makes sense.

If we want someone to adopt it, I believe it will be necessary to present the schema somehow, i.e., required columns (names) and datatypes (float32/float64 could be interchangeable, though, since the downstream readers will likely take care of that). For activation and other similar fields, I would suggest using a qualified CV term name (collision-induced dissociation, electron transfer dissociation, etc) to keep it lightweight.
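A sketch of such a CV-term-based activation column could look like the following. The mapping itself is hypothetical (it is not part of TRFP or any mzparquet spec), and the accessions in the comments should be verified against the current psi-ms.obo:

```python
# Hypothetical mapping from short activation codes to qualified PSI-MS
# CV term names, as suggested for a lightweight "activation" column.
ACTIVATION_CV = {
    "CID": "collision-induced dissociation",            # MS:1000133
    "HCD": "beam-type collision-induced dissociation",  # MS:1000422
    "ETD": "electron transfer dissociation",            # MS:1000598
    "ECD": "electron capture dissociation",             # MS:1000250
}


def activation_term(code: str) -> str:
    """Return the CV term name for an activation code, falling back to
    the raw code when no mapping is known (e.g. newer methods)."""
    return ACTIVATION_CV.get(code, code)
```

Storing the term name as a plain string column keeps the format lightweight while remaining unambiguous; combined methods like EThcD could be represented as two columns (activation plus supplementary activation) rather than a compound code.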

caetera commented Sep 18, 2025

> Should we increase to 2.0.0 rather than continue in 1.X.X versions?

Strictly speaking, since "the API" (dependencies and one of the output formats) changes, we should increment the major version; however, it does not change much "under the hood".

@ypriverol

I think if we are moving away from Mono and the Thermo API is different, that has enough merit to go to 2.X.X. If you think it is a good idea, including you @caetera, give us a 👍

caetera commented Sep 18, 2025

I mean "the API" of TRFP; the API of the Thermo libraries did not change between the .NET and .NET Framework versions.

@mobiusklein

> If we want someone to adopt it, I believe it will be necessary to present the schema somehow, i.e., required columns (names) and datatypes (float32/float64 could be interchangeable, though, since the downstream readers will likely take care of that)

That's probably the case. For statically typed languages, rather than a query engine embedded in a higher-level language, e.g. C++ and Rust (where I have actually read/written things, but I believe it is also the case in C#), Float32Array and Float64Array are different types, so you effectively have to put a switch/if-else tree over operations of the same kind on slightly different-width types.
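In a dynamic language the same width distinction can be illustrated with a small decoder that branches on element width, much like the switch a statically typed reader needs. This is a hypothetical sketch, not mzparquet reader code:

```python
import struct


def decode_floats(buf: bytes, width: int) -> list[float]:
    """Decode a little-endian packed float array of width 4 (float32)
    or 8 (float64) bytes into Python floats. The width branch here
    mirrors the type switch a C++/Rust/C# reader must write out
    explicitly for each supported column datatype."""
    fmt = {4: "f", 8: "d"}[width]
    count = len(buf) // width
    return list(struct.unpack(f"<{count}{fmt}", buf))
```

Normalizing both widths to one in-memory representation at the decode boundary is one way to keep the rest of a reader free of per-width code paths, at the cost of pinning a precision choice.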

Added new instrument CV mappings and detector assignments to follow PSI-MS 4.1.204
caetera commented Oct 24, 2025

Converted to Release 2.0.0-dev

Summary of changes

@caetera caetera changed the title Version 1.5.0 Version 2.0.0-dev Oct 24, 2025
@ypriverol

@timosachsenberg @caetera I think we can merge this PR and start testing for some of our tools and workflows.

@caetera caetera merged commit a30bbac into master Oct 27, 2025
2 checks passed
@caetera caetera deleted the dotnetcore branch October 27, 2025 19:03