Skip to content

Conversation

@wasade
Copy link

@wasade wasade commented Jan 11, 2023

SourceTracker2 regrettably relied on a DataFrame transformation of a biom.Table early in its processing leading to substantial resource requirements stemming from the resulting dense matrix.

This pull request fixes SourceTracker2 to use biom.Table for upstream processing and deferring until the last point for a dense transformation. On a large data, we observed a 2.33x reduction in runtime and a 5.59x reduction in memory used.

The results are qualitatively identical to "vanilla" SourceTracker2 relative to the predicted environments.

Note that three tests are failing in SourceTracker2 master, two of which are almost certainly related to changes in NumPy's random number generator and I suspect are sensitive to the seed. The third test was an actual bug that was fixed while adjusting unit tests

I'm uncertain whether it is technically correct to make the inner loop sparse. The use of alpha1 necessitates a dense matrix as it adds a small prior to all features even the zero'd ones. If that is not necessary, then the inner loop can be made sparse which will yield further gains. Note that the unit tests assume alpha1 is applied everywhere though so it would require modifying some of the unit tests.

cc @wdwvt1 @rob-knight

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants