Synchronizing Table-like stuff #22

Open
kescobo opened this issue Oct 20, 2021 · 4 comments
Comments

@kescobo
Member

kescobo commented Oct 20, 2021

Purpose

In an effort to improve cross-ecosystem compatibility, it would be nice to make table-like data structures more interoperable. My view of the ecosystem is quite narrow - I'm really only aware of ComMatrix from SpatialEcology.jl and my own CommunityProfile from Microbiome.jl which took quite a bit of inspiration from the former. I also haven't used ComMatrix in the last year or so as I was trying to iterate quickly in Microbiome.jl.

cc @mkborregaard

Current advantages of CommunityProfile

  • Tables.jl interface makes it easy to convert to DataFrame or write to CSV
  • Rows (features) and columns (samples) can be indexed with numbers, strings, or regex
  • Row and column indices are typed (e.g. Taxon or GeneFunction for features, MicrobiomeSample for samples). This makes it possible to store additional information (including metadata) inside the community table type
julia> using Microbiome

julia> s1 = MicrobiomeSample("sample1")
MicrobiomeSample("sample1", {})

julia> s2 = MicrobiomeSample("sample2");

julia> set!(s1, :type, "stool")
MicrobiomeSample("sample1", {:type = "stool"})

julia> set!(s1, :age, 37)
MicrobiomeSample("sample1", {:type = "stool", :age = 37})

julia> sp1 = Taxon("Bifidobacterium_longum", :species)
Taxon("Bifidobacterium_longum", :species)

julia> sp2 = taxon("s__Escherichia_coli")
Taxon("Escherichia_coli", :species)

julia> cm = CommunityProfile([0 1; 3 4], [sp1, sp2], [s1, s2])
CommunityProfile{Int64, Taxon, MicrobiomeSample} with 2 features in 2 samples

Feature names:
Bifidobacterium_longum, Escherichia_coli

Sample names:
sample1, sample2



julia> cm[r"Bifido", :]
CommunityProfile{Int64, Taxon, MicrobiomeSample} with 1 features in 2 samples

Feature names:
Bifidobacterium_longum

Sample names:
sample1, sample2



julia> metadata(cm)
2-element Vector{NamedTuple{(:sample, :type, :age), T} where T<:Tuple}:
 (sample = "sample1", type = "stool", age = 37)
 (sample = "sample2", type = missing, age = missing)

Current advantages of ComMatrix (that I'm aware of)

  • view machinery for cheap subsetting
  • integration with spatial types
  • plot recipes
  • others?

Current incompatibilities

  • Accessor names for columns, rows, and the matrix data. This shouldn't matter too much, since both fall back to the EcoBase thing* and place* methods. If this is done right, one should be able to call featurenames on a ComMatrix or speciesnames on a CommunityProfile and get the same thing.
  • Internal representation. I think ComMatrix is a simple wrapper around a sparse matrix, while I'm using AxisArrays and NamedDims for a few things. I actually think that might be overkill: I mostly adopted them to take advantage of their indexing, but then rewrote the indexing in a way that doesn't rely on them so much. With a few tweaks to SpatialEcology's views, I think I could drop that dependency.
  • Others?
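As a rough illustration of the naming point above, here is a minimal sketch of domain-specific aliases layered on shared primitives. Everything in it is a stand-in: `thingnames` and `placenames` are declared locally rather than imported from EcoBase, and `ToyComMatrix` and `ToyCommunityProfile` are hypothetical types, not the real ComMatrix or CommunityProfile.

```julia
# Hypothetical stand-ins for the EcoBase primitives (declared here, not imported).
function thingnames end
function placenames end

# Two toy containers with different internals; neither is the real package type.
struct ToyComMatrix
    species::Vector{String}
    sites::Vector{String}
end

struct ToyCommunityProfile
    features::Vector{String}
    samples::Vector{String}
end

# Each package implements only the primitives for its own type.
thingnames(c::ToyComMatrix) = c.species
placenames(c::ToyComMatrix) = c.sites
thingnames(c::ToyCommunityProfile) = c.features
placenames(c::ToyCommunityProfile) = c.samples

# Domain-specific aliases are defined once, on top of the primitives,
# so they work for every type that implements them.
speciesnames(x) = thingnames(x)
featurenames(x) = thingnames(x)
sitenames(x) = placenames(x)
samplenames(x) = placenames(x)

cm = ToyComMatrix(["Bifidobacterium_longum"], ["sample1", "sample2"])
cp = ToyCommunityProfile(["Bifidobacterium_longum"], ["sample1", "sample2"])

# Either alias can be called on either type.
same_things = featurenames(cm) == speciesnames(cp)
same_places = samplenames(cm) == sitenames(cp)
```

With this layout, neither package needs to know the other's vocabulary; only the primitives have to agree.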
@richardreeve
Member

richardreeve commented Oct 20, 2021

I've never looked at Microbiome.jl before, but I think there's a bit of incompatibility going on with the underlying EcoBase interface... looking briefly at Microbiome, it seems like it can (maybe?) use types that offer that interface in places (in particular here), but it doesn't offer it itself (I think?). If it did, a lot of these current incompatibilities might go away, and you could do the plotting, etc. directly with Microbiome types. Diversity.jl, on the other hand, implements the EcoBase interface here, for instance, as does SpatialEcology.jl in a variety of places.

More generally, I think that the idea of a common way of actually storing the abundance data - or do you just mean a common tables interface, I wasn't sure? - may not work in practice. Diversity.jl stores abundances as an AbstractMatrix subtype directly in a Metacommunity object, whereas EcoSISTEM.jl stores them in two ways. For simple multithreaded code, it stores them in a GridLandscape object, whereas for multiprocess (MPI) code, it stores them in an MPIGridLandscape object, because the abundance matrix itself is distributed across multiple nodes. Because they all (I hope!) satisfy the EcoBase interface, everything should just work across the ecosystem, and you can use the SpatialEcology plotting and so on directly, irrespective of the underlying storage type. However, the last (MPI) one in particular has no flexibility in how storage is implemented, since the layout is constrained by the need to make inter-process communication efficient.

If you are just proposing a common interface, and not a common storage mechanism, then that's different, but I'm not sure what interface you're proposing - do you just mean implementing the Tables.jl interface? If so, what does implementing that involve? If it's simple and makes sense, it might just be something that can be implemented directly in terms of the EcoBase primitives, so no-one has to do anything to get it to work?

@kescobo
Member Author

kescobo commented Oct 21, 2021

looking briefly at Microbiome, it seems like it can (maybe?) use types that offer that interface in places (in particular here), but it doesn't offer it itself (I think?)

This seems entirely plausible - I didn't do much testing. Come to think of it, do we have a pre-made set of tests that check for compatibility? That might be a nice way to solidify the interface and make it easier to check.
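One possible shape for such a check, sketched under assumptions: the three functions below are hypothetical stand-ins for whatever primitives EcoBase actually requires, and `missing_methods` and `conforms` are made-up helper names. The idea is only to show that conformance can be tested mechanically with `hasmethod` plus a consistency check.

```julia
# Hypothetical primitives standing in for the real required interface.
function thingnames end
function placenames end
function occurrences end

const REQUIRED = (thingnames, placenames, occurrences)

# Return the required functions that have no method for typeof(x).
missing_methods(x) = [f for f in REQUIRED if !hasmethod(f, Tuple{typeof(x)})]

# A type conforms if every primitive exists and the pieces agree in size.
function conforms(x)
    isempty(missing_methods(x)) || return false
    return size(occurrences(x)) == (length(thingnames(x)), length(placenames(x)))
end

# A toy type that implements everything.
struct Complete
    counts::Matrix{Int}
    things::Vector{String}
    places::Vector{String}
end
thingnames(c::Complete) = c.things
placenames(c::Complete) = c.places
occurrences(c::Complete) = c.counts

# A toy type that implements only one primitive.
struct Partial
    things::Vector{String}
end
thingnames(p::Partial) = p.things

full = Complete([0 1; 3 4], ["sp1", "sp2"], ["s1", "s2"])
part = Partial(["sp1"])
```

Here `conforms(full)` is true, while `missing_methods(part)` lists the two primitives that `Partial` still lacks. That list could be turned into a readable test failure message.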

I think that the idea of a common way of actually storing the abundance data - or do you just mean a common tables interface, I wasn't sure? - may not work in practice

I don't mean that they all need to have the same representation or use the same type specifically, I mostly mean that it would be nice to re-use functionality where possible, and try as much as we can to make them inter-convertible.

I think that the idea of a common way of actually storing the abundance data - or do you just mean a common tables interface, I wasn't sure? - may not work in practice

Maybe I'm only re-proposing EcoBase 😆. I am not nearly as up on the rest of the EcoJulia landscape as I should be, it's entirely possible that it's only me that needs to do any work. The impetus for this issue is that I used to use ComMatrix, but wanted some things that it didn't have, so I split off and made my own type because (a) I wasn't super familiar with SpatialEcology internals, and (b) I wanted to be able to experiment and break stuff without needing to burden @mkborregaard every time I made changes. Now, I'd like to come back to being more compatible. As I say, it may be that all of the work is on my end.

do you just mean implementing the Tables.jl interface? If so, what does implementing that involve?

After banging my head against it for a bit, it turns out to be pretty simple. You can be a Tables source, or sink, or both. I've only implemented the source bit, since that was easier and all I wanted for my use-case. To be a source, all you really need is to be able to generate an iterator of named tuples, where the keys are column names (you can implement your own row types too, but a vector of named tuples is the proto-table).
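To make the "iterator of named tuples" idea concrete, here is a minimal sketch in plain Julia, without loading Tables.jl. The layout (one row per feature, one column per sample, plus a `features` column) is an assumption for illustration and may not match what Microbiome.jl actually emits.

```julia
# Toy data in the spirit of the earlier example: 2 features x 2 samples.
counts   = [0 1; 3 4]
features = ["Bifidobacterium_longum", "Escherichia_coli"]
samples  = ["sample1", "sample2"]

# Column names must be Symbols: a `features` column, then one column per sample.
colnames = (:features, Symbol.(samples)...)

# Lazily yield one NamedTuple per feature; keys are the column names.
rows = (NamedTuple{colnames}((features[i], counts[i, :]...)) for i in eachindex(features))

# Collecting gives a Vector of NamedTuples, the "proto-table" mentioned above.
table = collect(rows)
```

Tables.jl treats a vector of named tuples as a row table, so something like `table` should be usable directly by sinks such as DataFrames or CSV writers.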

If it's simple and makes sense, it might just be something that can be implemented directly in terms of the EcoBase primitives, so no-one has to do anything to get it to work?

I think we could definitely implement a fall-back interface on the primitives, which could then be modified as needed by other packages.
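A sketch of what "fall-back, then modify as needed" could look like. All names are hypothetical: `tablerows` is an invented function, the three primitives are local stand-ins for EcoBase's, and both structs are toys. The generic method touches only the primitives, and one type then extends it by dispatch.

```julia
# Hypothetical primitives (stand-ins, not imported from EcoBase).
function thingnames end
function placenames end
function occurrences end

# Generic fallback: uses only the primitives, so any conforming type gets it
# for free. One row per thing, one column per place.
function tablerows(x)
    cols = (:thing, Symbol.(placenames(x))...)
    occ = occurrences(x)
    return [NamedTuple{cols}((name, occ[i, :]...)) for (i, name) in enumerate(thingnames(x))]
end

# A plain type that relies entirely on the fallback.
struct PlainCommunity
    counts::Matrix{Int}
    things::Vector{String}
    places::Vector{String}
end
thingnames(c::PlainCommunity) = c.things
placenames(c::PlainCommunity) = c.places
occurrences(c::PlainCommunity) = c.counts

# A type carrying extra per-thing information.
struct RankedCommunity
    counts::Matrix{Int}
    things::Vector{String}
    places::Vector{String}
    ranks::Vector{Symbol}
end
thingnames(c::RankedCommunity) = c.things
placenames(c::RankedCommunity) = c.places
occurrences(c::RankedCommunity) = c.counts

# Package-specific override: reuse the fallback, then append a `rank` column.
function tablerows(c::RankedCommunity)
    base = invoke(tablerows, Tuple{Any}, c)
    return [merge(row, (rank = r,)) for (row, r) in zip(base, c.ranks)]
end

plain  = tablerows(PlainCommunity([1 2; 3 4], ["a", "b"], ["s1", "s2"]))
ranked = tablerows(RankedCommunity([1 2; 3 4], ["a", "b"], ["s1", "s2"], [:species, :genus]))
```

The override calls the generic method through `invoke` rather than reimplementing it, so changes to the fallback carry over to the specialised type.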

@richardreeve
Member

Cool. That all sounds good to me. I think what I understood originally did sound a bit like a re-proposal of EcoBase, but in fact I think that adding in some tests that answer "Do I implement EcoBase?" would be really helpful - we could even think about it in terms of sources and sinks like the Tables interface you describe. And adding in the core Tables interface through the current EcoBase primitives would be really nice too. Then we can have a think about whether there are enough commonalities in the implementations to think about common storage mechanisms - my feeling is that if we can interoperate anyway, it may not be a high priority though.

There's another suggestion on Zulip that we think about providing the same interface as a trait-like thing rather than imposing inheritance on it, which could tie in nicely with providing the tests. I think the idea would be that if you did the inheritance, you wouldn't need to worry about the traits, but you could provide them instead...
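For readers unfamiliar with the pattern, a minimal sketch of how a trait could sit alongside inheritance (the usual "Holy traits" idiom). All names here are invented for illustration and none come from EcoBase: `InterfaceStyle`, `IsCommunity`, `NotCommunity`, `AbstractCommunity`, and `nthings` are hypothetical.

```julia
# Trait values: does a type provide the (hypothetical) community interface?
abstract type InterfaceStyle end
struct IsCommunity <: InterfaceStyle end
struct NotCommunity <: InterfaceStyle end

# Inheritance route: subtypes of this abstract type get the trait automatically.
abstract type AbstractCommunity end

InterfaceStyle(::Type) = NotCommunity()                      # default: opt-in required
InterfaceStyle(::Type{<:AbstractCommunity}) = IsCommunity()  # inherited for free

struct Inherited <: AbstractCommunity
    names::Vector{String}
end

# Trait route: an unrelated type opts in without changing its supertype.
struct OptedIn
    names::Vector{String}
end
InterfaceStyle(::Type{OptedIn}) = IsCommunity()

# Generic code dispatches on the trait, not on the type hierarchy.
nthings(x) = nthings(InterfaceStyle(typeof(x)), x)
nthings(::IsCommunity, x) = length(x.names)
nthings(::NotCommunity, x) =
    throw(ArgumentError("$(typeof(x)) does not implement the community interface"))

n_inherited = nthings(Inherited(["a", "b"]))
n_opted_in  = nthings(OptedIn(["a"]))
```

Types that inherit never mention the trait, while types that cannot inherit declare it in one line. This looks like the trade-off described above.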

@mkborregaard
Member

Sorry guys, I've been busy, will take a look

3 participants