Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Isse with running Baysor segmentation on some but not all patches #123

Open
marsdenl opened this issue Sep 12, 2024 · 20 comments
Open

Comments

@marsdenl
Copy link

Hey Quentin,

I've been trying to use the CLI workflow to segment a MERSCOPE dataset. After successfully creating patches I see that the formatting for the transcript.csv file was quite variable across patches. In some patches it was completely fine and segmentation ran well, in others some rows were shifted/missing values and I got warning messages so I removed those rows for simplicity (after which segmentation worked). Finally, some patches failed even after removing problematic rows due to:

[16:38:11] Info: Run R493595882
[16:38:12] Info: (2024-09-12) Run Baysor v0.6.2
[16:38:12] Info: Loading data...
ERROR: MethodError: Cannot convert an object of type InlineStrings.String31 to an object of type Float64.

Closest candidates are:
convert(::Type{T}, ::ColorTypes.Gray) where T<:Real
@ ColorTypes C:\Users\marsdenl.julia\packages\ColorTypes\vpFgh\src\conversions.jl:113
convert(::Type{T}, ::ColorTypes.Gray24) where T<:Real
@ ColorTypes C:\Users\marsdenl.julia\packages\ColorTypes\vpFgh\src\conversions.jl:114
convert(::Type{T}, ::Ratios.SimpleRatio{S}) where {T<:AbstractFloat, S}
@ Ratios C:\Users\marsdenl.julia\packages\Ratios\FsiCW\src\Ratios.jl:51
...

Stacktrace:
[1] setindex!(A::Vector{Float64}, x::InlineStrings.String31, i1::Int64)
@ Base .\array.jl:1021
[2] _unsafe_copyto!(dest::Vector{Float64}, doffs::Int64, src::Vector{Union{Missing, InlineStrings.String31}}, soffs::Int64, n::Int64)
@ Base .\array.jl:299
[3] unsafe_copyto!
@ .\array.jl:353 [inlined]
[4] _copyto_impl!
@ .\array.jl:376 [inlined]
[5] copyto!
@ .\array.jl:363 [inlined]
[6] copyto!
@ .\array.jl:385 [inlined]
[7] copyto_axcheck!
@ .\abstractarray.jl:1177 [inlined]
[8] Vector{Float64}(x::Vector{Union{Missing, InlineStrings.String31}})
@ Base .\array.jl:673
[9] convert(::Type{Vector{Float64}}, a::Vector{Union{Missing, InlineStrings.String31}})
@ Base .\array.jl:665
[10] load_df(data_path::String; min_molecules_per_gene::Int64, exclude_genes::Vector{String}, kwargs::@kwargs{x_col::Symbol, y_col::Symbol, z_col::Symbol, gene_col::Symbol, drop_z::Bool, filter_cols::Bool})
@ Baysor.DataLoading C:\Users\marsdenl.julia\packages\Baysor\vZCu7\src\data_loading\data.jl:84
[11] load_df
@ C:\Users\marsdenl.julia\packages\Baysor\vZCu7\src\data_loading\data.jl:69 [inlined]
[12] load_df(coordinates::String, data_opts::Baysor.Utils.DataOptions; kwargs::@kwargs{filter_cols::Bool})
@ Baysor.DataLoading C:\Users\marsdenl.julia\packages\Baysor\vZCu7\src\data_loading\cli_wrappers.jl:165
[13] run(coordinates::String, prior_segmentation::String; config::Baysor.Utils.RunOptions, x_column::String, y_column::String, z_column::String, gene_column::String, min_molecules_per_cell::Int64, scale::Float64, scale_std::String, n_clusters::Int64, prior_segmentation_confidence::Float64, output::String, plot::Bool, save_polygons::String, no_ncv_estimation::Bool, count_matrix_format::String)
@ Baysor.CommandLine C:\Users\marsdenl.julia\packages\Baysor\vZCu7\src\cli\main.jl:100
[14] run
@ C:\Users\marsdenl.julia\packages\Baysor\vZCu7\src\cli\main.jl:51 [inlined]
[15] command_main(ARGS::Vector{String})
@ Baysor.CommandLine C:\Users\marsdenl.julia\packages\Comonicon\F3QqZ\src\codegen\julia.jl:343
[16] command_main()
@ Baysor.CommandLine C:\Users\marsdenl.julia\packages\Comonicon\F3QqZ\src\codegen\julia.jl:90
[17] command_main(; kwargs::@kwargs{})
@ Baysor C:\Users\marsdenl.julia\packages\Baysor\vZCu7\src\Baysor.jl:41
[18] top-level scope
@ none:1

Quick visual inspection shows the files look fine. Weirdly enough, the patches that failed due to that error came at repeating intervals. For instance, if I generated 11 patches total, then patch 1, 4, 7, 10 failed. If I generated 15 patches, 2, 6, 10 and 14 failed (1/3 patches). Do you have any idea why some of the patches failed and not others?

Code run:

cd PATH.zarr.sopa_cache\baysor_boundaries\6
C:\Users\marsdenl.julia\bin\baysor.cmd run --save-polygons GeoJSON -c config.toml transcripts.csv

  • OS: Windows

Thanks!

@quentinblampey
Copy link
Collaborator

Hello @marsdenl, thanks for reporting this weird issue. I have never seen this before. Could you share an example of a transcript.csv file that failed (together with the corresponding config.toml file)?
If it's a problem to share the whole transcript file, you can also delete some unused columns or obfuscate the gene names.

@marsdenl
Copy link
Author

Hey @quentinblampey, thanks for getting back to me. Here is the info:

config.txt
limited_file.csv

The transcripts.csv above is the one that contains both the missing columns error and the float64 error. For missing columns see row 17612 for ex. For float64, I have no idea where it's coming from...I've only attached the first 50,000 rows.
I have managed to make Baysor work independently of sopa on one of my samples so I think it might be linked to Patchify itself.

Thank you!

@marsdenl
Copy link
Author

In case this helps, @quentinblampey, I also tried Comseg and get a similar issue with poor formatting on some patches but not all patches (some rows have fewer or more columns).
Config.json was the default one from your tutorial. I did cellpose prior nucleus segmentation as recommended and that ran fine :)

Code:
sopa patchify comseg region_0.zarr --config-path Z:\Queries\comseg_config_default.json --patch-width-microns 1000 --patch-overlap-microns 100

sopa segmentation comseg region_0.zarr

I get the error on my second (but not first) patch - ParserError: Error tokenizing data. C error: Expected 10 fields in line 112260, saw 14.

Here is a portion of the transcripts.csv containing 2 misformatted rows.
transcripts_cut.csv

Config.json
config.txt

Let me know what you think! Thank you very much.

@quentinblampey
Copy link
Collaborator

Indeed I think the problem is happening when creating the transcripts.csv file. We do that using dask, which process each partition in parallel. Maybe, at some points, two processes try to write on the same file, and weird things happen.

Do you know if the issue always happen on the same files if you run it again? Is it random or deterministic?

@marsdenl
Copy link
Author

I repeated patchify on the same sample with the same config file then simply counted the number of rows that did not have 9 columns in the patch files (I couldn't test re-segmentation with baysor on the patches since the Baysor update, same as issue #125) . Despite getting the same number of patches everytime I patchify, the transcripts.csv for a given patch (say patch 0) has slightly different dimensions and therefore also has a different number of rows that do not have 9 columns. Not too sure how to test for the float64 issue as I don't know where it's coming from...

@quentinblampey
Copy link
Collaborator

Alright, thanks for the details. Can you also let me know if this happens on the toy dataset (--technology uniform)?
I'm trying to reproduce the issue

@marsdenl
Copy link
Author

marsdenl commented Sep 24, 2024

Hey, thanks for trying to reproduce the issue :) Just tried with the --technology uniform dataset but for baysor patchify it only creates one patch so I don't know if it'll reproduce the issue where some patches are okay and others aren't. Also since the baysor update I can't run baysor segmentation to check if it all runs smoothly on that single patch (see issue 125). Looking at the transcripts file for that patch though, it seems okay :) Hope this helps. Let me know if I can send anything else over.

@quentinblampey
Copy link
Collaborator

Can you try to make smaller patches using --patch-width-microns? You can choose any value that creates more than 1 patch

@marsdenl
Copy link
Author

Of course, yes sorry xD Made 8 patches on the tutorial dataset and ran baysor (made the changes needed in the config file for v7.0) and it worked perfectly fine. I tried again on my dataset using the same config and again I get the same errors/warnings:

  1. ERROR: MethodError: Cannot convert an object of type InlineStrings.String31 to an object of type Float64
  2. Warning: thread = 1 warning: only found 9 / 14 columns around data row: 6243. Filling remaining columns with missing

Maybe something to do with the formatting of the input detected_transcripts.csv?

Thanks

@quentinblampey
Copy link
Collaborator

Thanks for trying!

So, now, my hypothesis is either (1) the detected_transcripts.csv file itself has some issue, or (2) dask dataframe reads it wrong.

To test these:

  1. Can you read the file (for instance with pandas), and see if it contains some incomplete rows?
  2. Read your raw files with sopa.io.merscope as usual, and then write the full dask dataframe on disk, as a CSV. See if this file now has incomplete lines.

@marsdenl
Copy link
Author

marsdenl commented Sep 26, 2024

Here is what I have done:

Check initial detected_trascripts.csv

import pandas as pd

data = pd.read_csv('Z:Queries/Data/Batch5_region0/region_0/detected_transcripts.csv', header=None, low_memory=False)

incomplete_rows = data[data.apply(lambda x: len(x.dropna()) != 11, axis=1)] # => this gave 0 rows

Check dask written csv

import spatialdata
import sopa
import dask.dataframe as dd
import pandas as pd
import os

sdata = sopa.io.merscope("Z:Queries/Data/Batch5_region0/region_0/")

sdata.points

{'Batch5_region0_region_0_transcripts': Dask DataFrame Structure:
x y fov cell_id barcode_id transcript_id global_z gene Unnamed: 0
npartitions=12
float64 float64 int64 int64 int64 string float64 category[unknown] int64
... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ...
Dask Name: assign, 21 graph layers}

points_df = sdata.points['Batch5_region0_region_0_transcripts']
points_df.to_csv('dask_df.csv', index=False, single_file=True)
written_panda_df = pd.read_csv('C:/Users/marsdenl/dask_df.csv')

check if any row has fewer than 9 columns -> this gave my 0 rows

written_panda_df[written_panda_df.apply(lambda x: len(x.dropna()) != 9, axis=1)]

check if any row has more than 9 columns -> this gave my 0 rows

written_panda_df[written_panda_df.apply(lambda x: len(x) > 9, axis=1)]

At first hand, it seems both files - the inital csv and the dask written csv - look okay in terms of number of columns (although let me know if code above is not correct). Could it be patchify itself?

@quentinblampey
Copy link
Collaborator

When checking for nan values, can you use data.isna().any(axis=1).mean()? This will give you the ratio of rows that contains at least one NaN value. This should be 0 on MERSCOPE data.

Also, on the CSV file you gave me, I saw that the header is also weird, as it has three empty columns at the end, see below:

x,y,Unnamed: 0,transcript_id,barcode_id,cell_id,fov,gene,global_z,cell,,,,

This is also not expected. Do you have something similar with the raw data? Since the header is weird, I suspect there is something wrong before even making the patches.

@marsdenl
Copy link
Author

Sure thing!

  1. On the original file:

data.isna().any(axis=1).mean()
0.00000010972423337598314

data.isna().any(axis=1).value_counts()
False 9113756
True 1
Name: count, dtype: int64

The only true count being the column names.

  1. On the dask-written file:

written_panda_df.isna().any(axis=1).mean()
0.0

written_panda_df.isna().any(axis=1).value_counts()
False 9113756
Name: count, dtype: int64

Regarding the csv I sent you, that's not in the original file, I must have added that... Happy to send you the original file by email.

@quentinblampey
Copy link
Collaborator

The only true count being the column names.

Do you mean that the only row with NaN values are the columns? The command data.isna().any(axis=1).mean() shouldn't count the columns. Are you sure?

Regarding the csv I sent you, that's not in the original file, I must have added that... Happy to send you the original file by email.

Yes, please, can you send it to me (just the file of the patch that has an issue) at [email protected]?

Btw, I'll be on vacations for two weeks, I'll answer you in mid-October :)

@marsdenl
Copy link
Author

Sorry, I wasn't super clear. The only count with data.isna().any(axis=1).value_counts() is the first row, which corresponds to the column names. The first row of the first column is empty:

Screenshot 2024-09-29 at 12 36 48

Just sent it now :)

No worries at all, enjoy your holidays Quentin!

@quentinblampey
Copy link
Collaborator

Hi @marsdenl, do you also have access to a Linux or MacOS machine by any chance? If yes, do you have the same issue? I think it may be a Windows-specific issue, which would explain why I wasn't able to reproduce it.

@marsdenl
Copy link
Author

Hey @quentinblampey, you're absolutely right. Tried sopa patchify baysor in Mac OS and ran baysor on each patch on my windows machine and it ran perfectly fine. That's good to know! Thank you very much for your help. I'll go ahead and close the issue as I'm likely going to move forward with Proseg.

@quentinblampey
Copy link
Collaborator

Hi @marsdenl, thanks for trying out! It's good to know that it works on MacOS for you. I'll re-open the issue though, because other Windows users may want to have this fixed (even though I'm afraid it will be complex to fix, as it may be related to parallel writing on-disk with dask).

I'm interested in having your feedback about Proseg: how good do you think it is? And how scalable is it?

FYI: proseg may come in Sopa in the future, but there is no expected date for that. There is also this new tool, segger, which we'll probably add soon. In the future, we would like Sopa to be a wrapper to many more segmentation methods, so that we can easily switch from one to another without changing the code much.

@marsdenl
Copy link
Author

Sure thing :) Best of luck with the windows issue.

I am a big fan of proseg - it looks really good. I definitely think you should add it to your segmentation portfolio. I've tried Comseg, Baysor, Cellpose, Proseg and so far Proseg is the most performing method for my tissue type (cortex). It's quite fast - could run 9 million transcripts in ~ 45mn on 64 GB RAM and incredibly easy to use (for MERSCOPE, all you need is a detected_transcripts.csv with a cell_id column giving nuclear prior info.

That being said it is highly dependent on good nuclear segmentation (see Proseg issue #34). According to DC Jones:

"Currently proseg is somewhat anchored to this (nuclear) input. It can't merge or split cells, it only tries to infer better borders (from existing nuclear borders), so if it's initialized with too many cells it's not able to correct that and has to make due."

So bad nuclear segmentation will lead to poor Proseg performance. So I would definitely recommend training a cellpose 2 model on your dataset to get a very good nuclear segmentation, then finishing it off with Proseg.

My other reservation for Proseg is the decimal value cell_gene matrix since the assignment of a transcript to a cell is probabilistic, not deterministic. DC Jones mentionned you could round it to integer value or alternatively use the maxpost-counts.csv matrix instead which is an integer count matrix which filters out ambiguous transcripts.

Hope this helps :)

Thanks again for all your help Quentin

@quentinblampey
Copy link
Collaborator

quentinblampey commented Oct 25, 2024

Thanks @marsdenl for your very detailed comment! It's not the first time I hear good things about proseg.

Indeed, if it's highly dependent on a good prior, it could be a good idea to run it after cellpose.

Regarding the decimal value for the cell-by-gene table, this is also a concern to me. This is actually one of the current blockers for adding proseg to Sopa, because, when resolving the boundaries on patches, I'll need to aggregate the transcript again (which are integers with Sopa). Therefore, we would have some cells with decimal values, and others with integer counts, which is an issue...
One possible solution is to run proseg on the whole slide instead of using the patches (if proseg is not too RAM-demanding), and then the user will have the possibility to keep the decimal values from proseg, or use Sopa to get the actual transcript counts within the boundaries. What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants