Conversation

@caetera caetera commented Mar 20, 2025

Preparation for release 1.5.0

Changes

  • Code is fully migrated to .NET 8; thus, the Mono dependency is obsolete
  • Added an action for automatic building and testing on all code changes
  • Added an action to produce self-contained and framework-based binaries for all major platforms on release
  • Parquet format is updated to version 0.2
  • The charge state array can be included in mzML and MGF output (closes #189: include charge state array)

caetera and others added 18 commits October 18, 2024 19:55
Known issues:
Parquet needs refactoring - new API
AWS S3 needs testing
Tests cannot be executed
* Log4net configuration moved to App.config
* Fixed unhandled exception during mzML precursor creation
* Explicit usings
* Properties (some build/debug information from VS) ignored

TODO Check all "possibly null reference" warnings
The same precursor selection logic as for other file formats
TBD: Move some of the service functions to the SpectrumWriter base class
Use the same method for mass-spec data creation
mzParquet writer v0.2 full implementation
@caetera caetera requested a review from ypriverol March 20, 2025 14:27
@ypriverol ypriverol linked an issue Mar 24, 2025 that may be closed by this pull request
@caetera caetera mentioned this pull request Mar 24, 2025
@caetera caetera mentioned this pull request Apr 9, 2025
@timosachsenberg

awesome :)

@ypriverol

@caetera do you think you can produce some examples of the parquet implementation so we can see how it looks? The idea is to have some parquet users evaluate the implementation.

caetera commented Sep 16, 2025

@ypriverol I certainly can convert a few files to parquet format. What kind of files do you want to test on? Are there any specific requirements?

@ypriverol

I think having a few examples from different instruments would be good for understanding how the format works. I can try to visualize them here; it is only to see whether we need to extend the current implementation further.

caetera commented Sep 16, 2025

I shared some files here - https://syddanskuni-my.sharepoint.com/:f:/g/personal/vgor_bmb_sdu_dk/EiXlHHyZY35LjvUCX_J9eJ0BJXMBbL7lqvQ2HkKqiyEzyA?e=Ikk9au converted with the latest build (5bac912)

File names are [Instrument]-[MS1Mode]-[MS2Mode]_[Other]

OTc - Orbitrap in centroided mode
OTp - Orbitrap in profile mode
ITc - Ion Trap in centroided mode
ITp - Ion Trap in profile mode

I converted them to mzparquet with vendor centroiding enabled (default) and disabled.

@ypriverol

> I shared some files here - https://syddanskuni-my.sharepoint.com/:f:/g/personal/vgor_bmb_sdu_dk/EiXlHHyZY35LjvUCX_J9eJ0BJXMBbL7lqvQ2HkKqiyEzyA?e=Ikk9au converted with the latest build (5bac912)
> […]

Thanks, I will take a look.

@mobiusklein

I reviewed a few of the mzparquet files, but had some questions. A lot of this stems from my lack of familiarity with the intended scope of mzparquet though.

Would it make sense to make the mass analyzer a column? Looking at an Orbitrap-Astral file, the only way to know that a scan was acquired in the Astral and not the Orbitrap is to assume the MS1 scans were Orbitrap and the MS2 scans were Astral. The same goes for Tribrid instruments, where you would want to know whether the ion trap or the Orbitrap was used for the MS2 spectra.

The same applies for centroiding. I get the impression that mzparquet is primarily intended as a container for centroid data, but you have a bunch of profile-mode files here too. Also, Thermo's profile spectra are all zero-padded, so you get long runs of m/z-intensity pairs where the intensity is zero but the m/z is just incrementing by a step function. The 0s can be RLE'd away if there are enough of them to justify using RLE as the encoding for that data page, but the m/z values are stored as-is, and they are the most expensive thing in the file. For https://github.com/mobiusklein/mzpeak_prototyping, I have been dropping long runs of zeros and retaining only the flanking zeros, without a substantial loss of information.
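The zero-run trimming described above can be sketched as follows. This is a minimal illustration of the idea, not code from mzpeak_prototyping; the function name and the `keep` parameter are hypothetical.

```python
def trim_zero_runs(mz, intensity, keep=1):
    """Drop the interior of long zero-intensity runs, retaining up to
    `keep` flanking zero points at each end of a run so peak boundaries
    stay interpretable. Returns filtered (mz, intensity) lists."""
    n = len(intensity)
    keep_mask = [False] * n
    i = 0
    while i < n:
        if intensity[i] != 0.0:
            keep_mask[i] = True
            i += 1
            continue
        # Found a zero run spanning [i, j)
        j = i
        while j < n and intensity[j] == 0.0:
            j += 1
        # Keep up to `keep` zeros at each end of the run
        for k in range(i, min(i + keep, j)):
            keep_mask[k] = True
        for k in range(max(j - keep, i), j):
            keep_mask[k] = True
        i = j
    mz_out = [m for m, f in zip(mz, keep_mask) if f]
    it_out = [v for v, f in zip(intensity, keep_mask) if f]
    return mz_out, it_out
```

Since only interior zeros of a run are dropped, the step structure around each real peak is preserved while the bulk of the padding (and the expensive m/z values that go with it) disappears.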

How would you represent a dataset with mixed dissociation methods, like HCD and ETD, or EThcD? The schema as-is assumes that all you wish to know is whether there is dissociation or not, via the MS level.

timosachsenberg commented Sep 17, 2025

@mobiusklein raises some good points, and the answer depends somewhat on the intended scope of mzParquet. As I understand it, mzParquet is primarily designed as a low-metadata format, for example to support quick visualization of spectra. If the scope were to be extended to use cases such as database searches, the mentioned columns could be valuable in enabling those applications. Ultimately, a more complete approach like mzpeak is needed to support all analyses - but I think this is not in the scope of mzparquet (correct me if I am wrong).

@ypriverol

Thanks @mobiusklein @timosachsenberg @caetera for the feedback. @lazear, it would be great if you could also give some feedback here. mzparquet is not a PSI format and does not aim to replace mzML or other file formats. Instead, it targets the following use cases:

  • to produce a parquet-based format for querying MS2 spectra, for use cases like visualization, clustering, and AI training
  • a possible replacement for MS2-level files, like MGF or MS2, that search engines and other tools can build on top of
  • a possible storage system for archiving results; I can see people querying spectra with parquet for novel PTMs, etc.

These are the major use cases I see.

caetera commented Sep 17, 2025

I can comment on the use of profile data. Profile output is implemented mostly to be consistent with the other output formats; I don't know if it is ever going to be used in real-life applications. Centroided data is the default for all output formats, and to the best of my knowledge, profile data is rarely required. However, I believe it is less confusing if the switch_off_centroiding flag works the same way for all output formats.

If you believe that mzparquet and/or MGF should always imply centroiding, that can be implemented.

mobiusklein commented Sep 17, 2025

MGF is by definition a peak list format, according to Mascot's website: https://www.matrixscience.com/help/data_file_help.html. If mzparquet is a binary replacement for that same use-case, then it doesn't make sense to store profile data in that format.

MGF doesn't specify how to signify dissociation method, mass analyzer, or most other things about a spectrum either, but many tools have learned to shove that information in through custom parameters or by way of the TITLE or INSTRUMENT parameters. The values there don't have any globally controlled rules either, they're just defined by context. The fewer things you promise to define in a format, the fewer places where the consensus reality of how it is used in practice can break.

Given what's described, it's probably workable without those extra columns since we've gotten by without it for so long, but it leaves it up to the consuming tools to decide what to do or not without that information. Do those tools already exist with extant use-cases or support for other formats?

lazear commented Sep 17, 2025

I view mzparquet similarly to mzML, MGF, MS2 - centroided input for a search engine, visualization tools, etc. In my view (not reflective of PSI's view most likely), these are 'necessary evil' intermediate file formats to sidestep proprietary vendor libraries for reading MS data, not intended to necessarily capture 100% of the metadata from the raw files - that information can just be extracted from the raw files if required. In my perfect world, mzML/mzparquet don't exist, because vendors have open-sourced their file formats.

mzparquet reflects the need for a lighter-weight, higher performance mzML/MGF replacement. In contrast to the other formats, the data is readable with pandas/polars/distributed query engines/etc, making it easier to integrate data for both interactive data exploration use cases or repository-scale analyses.

The nice thing about a parquet-defined format is that we could define the required minimum usable set of columns (m/z, intensity, rt, ion-mobility, etc) that cover 90% of use-cases. Additional columns could be added for mass analyzers or whatever other metadata is desired, and downstream applications that don't need those columns can simply skip reading them - it is essentially zero cost (outside of storage space) to add new columns, something not possible with mzML, etc. More sophisticated strategies for building row groups could account for things like DIA isolation windows to enable really fast targeted extraction.

@ypriverol

@lazear can you test the current implementation to see if it works out of the box with SAGE?

lazear commented Sep 17, 2025

AFAIK there are no public users of mzparquet (including sage). I have some internal tools that use it, and I can test it out.

TRFP already had an obscure parquet based format that no one was using, so this was just a swap to another format no one is using yet :)

ypriverol commented Sep 17, 2025

I know; the idea is to have an implementation that at least some groups could potentially use. For example, we are planning to use it in quantms.io for clustering, so that is one and a half users. Better than the previous obscure version.

BTW, if mzparquet is not something definitive, and keeping in mind the main use case @lazear described, that we are aiming for an MGF-style replacement, it would be great if @mobiusklein @timosachsenberg @lazear @caetera and I could share ideas about how to improve it within TRFP.

@ypriverol

@caetera, another thing. This new version of the tool is a big change in the architecture: no Mono dependency anymore, a new parquet format, etc. Should we increase to 2.0.0 rather than continue in 1.X.X versions? What do you think @caetera @timosachsenberg @mobiusklein

caetera commented Sep 18, 2025

In the current implementation, mzparquet is well-suited for visualization, XIC extraction, etc., but if it is also expected to be used by search engines, adding columns such as activation type, collision energy, and supplementary activation type and energy makes sense.

If we want someone to adopt it, I believe it will be necessary to present the schema somehow, i.e., required columns (names) and datatypes (float32/float64 could be interchangeable, though, since the downstream readers will likely take care of that). For activation and other similar fields, I would suggest using a qualified CV term name (collision-induced dissociation, electron transfer dissociation, etc) to keep it lightweight.
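A sketch of such a CV-term-based activation column could look like the following. The mapping itself is hypothetical (it is not part of TRFP or any mzparquet spec), and the accessions in the comments should be verified against the current psi-ms.obo:

```python
# Hypothetical mapping from short activation codes to qualified PSI-MS
# CV term names, as suggested for a lightweight "activation" column.
ACTIVATION_CV = {
    "CID": "collision-induced dissociation",            # MS:1000133
    "HCD": "beam-type collision-induced dissociation",  # MS:1000422
    "ETD": "electron transfer dissociation",            # MS:1000598
    "ECD": "electron capture dissociation",             # MS:1000250
}


def activation_term(code: str) -> str:
    """Return the CV term name for an activation code, falling back to
    the raw code when no mapping is known (e.g. newer methods)."""
    return ACTIVATION_CV.get(code, code)
```

Storing the term name as a plain string column keeps the format lightweight while remaining unambiguous; combined methods like EThcD could be represented as two columns (activation plus supplementary activation) rather than a compound code.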

caetera commented Sep 18, 2025

> Should we increase to 2.0.0 rather than continue in 1.X.X versions?

Strictly speaking, since "the API" (dependencies and one of the output formats) changes, we should increment the major version; however, it does not change much "under the hood".

@ypriverol

I think if we are moving away from Mono and the Thermo API is different, that has enough merit to go to 2.X.X. If you think it is a good idea, including you @caetera, give us a 👍

caetera commented Sep 18, 2025

I mean "the API" of TRFP; the API of the Thermo libraries did not change between the .NET and .NET Framework versions.

@mobiusklein

> If we want someone to adopt it, I believe it will be necessary to present the schema somehow, i.e., required columns (names) and datatypes (float32/float64 could be interchangeable, though, since the downstream readers will likely take care of that)

That's probably the case. For statically typed languages, rather than a query engine embedded in a higher-level language, e.g. C++ and Rust (where I have actually read/written things, but I believe it is also the case in C#), Float32Array and Float64Array are different types, so you effectively have to put a switch/if-else tree over operations of the same kind on slightly different-width types.
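In a dynamic language the same width distinction can be illustrated with a small decoder that branches on element width, much like the switch a statically typed reader needs. This is a hypothetical sketch, not mzparquet reader code:

```python
import struct


def decode_floats(buf: bytes, width: int) -> list[float]:
    """Decode a little-endian packed float array of width 4 (float32)
    or 8 (float64) bytes into Python floats. The width branch here
    mirrors the type switch a C++/Rust/C# reader must write out
    explicitly for each supported column datatype."""
    fmt = {4: "f", 8: "d"}[width]
    count = len(buf) // width
    return list(struct.unpack(f"<{count}{fmt}", buf))
```

Normalizing both widths to one in-memory representation at the decode boundary is one way to keep the rest of a reader free of per-width code paths, at the cost of pinning a precision choice.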

Added new instrument CV mappings and detector assignments to follow PSI-MS 4.1.204
caetera commented Oct 24, 2025

Converted to Release 2.0.0-dev

Summary of changes

@caetera caetera changed the title Version 1.5.0 Version 2.0.0-dev Oct 24, 2025
@ypriverol

@timosachsenberg @caetera I think we can merge this PR and start testing for some of our tools and workflows.

@caetera caetera merged commit a30bbac into master Oct 27, 2025
2 checks passed
@caetera caetera deleted the dotnetcore branch October 27, 2025 19:03