-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sparse matrix from formula #46
Comments
Hello, No, at the moment patsy has no support for directly consuming or producing On Thu, Jun 19, 2014 at 7:02 PM, memeplex [email protected] wrote:
Nathaniel J. Smith |
FWIW, it's easy enough to do this in some settings:
|
Right, but this requires that you first create the dense matrix in memory,
|
Agreed. My particular use case was for doing MCMC with the sparse matrix via pymc2, however, so I was able to get massive speedups when taking products of model parameters and the featurized input data. |
@njsmith could you provide some pointers on what would be needed for patsy |
I would be happy to help. I just don't have time to head on this.
…On Fri, Jan 3, 2020, 12:50 seanv507 ***@***.***> wrote:
@njsmith <https://github.com/njsmith> could you provide some pointers on
what would be needed for patsy
to support generating sparse matrices. Would be interested in submitting a
patch.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#46?email_source=notifications&email_token=ABKTSRP62R445OJC2QFND2TQ34Q73A5CNFSM4AQXKTP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIA62UI#issuecomment-570551633>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABKTSRN7I2GHOZB5DGYUDG3Q34Q73ANCNFSM4AQXKTPQ>
.
|
I'd also be interested in this! @njsmith do you have pointers on what would be needed to build this patch? I'd be happy to help |
Hi all, I was interested in this problem for speeding up some of my code. I've been relying on the excellent patsy library, but the datasets I use tend to have a lot of data points and splines can end up being slow to compute. I wrote up a faster b-spline basis implementation that uses As written below this provides me with a roughly 2x speed up over the non-sparse patsy implementation, and uses 25% of the memory compared to non-sparse patsy. I think this could still be improved upon and sped up, but if anyone needs a sparse implementation of b-splines, this is a good place to start. import numpy as np
from scipy.sparse import csr_matrix, vstack
from numba import jit
# jit provides a massive speed up
@jit(nopython=True)
def basis(x, degree, i, knots):
"""Create a spline basis for an input x, for the ith knot
Parameters
----------
x : np.ndarray
Input x
degree : int
Degree of spline to calculate basis for
i : int
The index of the knot to calculate the basis for
knots : np.ndarray
Array of all knots
"""
if degree == 0:
B = np.zeros(len(x))
B[(x >= knots[i]) & (x <= knots[i+1])] = 1
#B = (B)
else:
# alpha1, alpha2 = 0, 0
da = (knots[degree + i] - knots[i])
db = (knots[i + degree + 1] - knots[i + 1])
if ((knots[degree + i] - knots[i]) != 0):
alpha1 = ((x - knots[i])/da)
else:
alpha1 = np.zeros(len(x))
if ((knots[i+degree+1] - knots[i+1]) != 0):
alpha2 = ((knots[i + degree + 1] - x)/db)
else:
alpha2 = np.zeros(len(x))
B = (basis(x, (degree-1), i, knots)) * (alpha1) + (basis(x, (degree-1), (i+1), knots)) * (alpha2)
return B
def create_sparse_spline_matrix(x, n_knots=20, knots=None, degree=3):
"""Returns a scipy.sparse.csr_matrix which contains the B-spline basis
Parameters
----------
x : np.ndarray
vector to spline
n_knots: int
Number of knots (default: 20).
knots : np.ndarray [optional]
Optional array containing knots
degree: int
Polynomial degree.
Returns
-------
spline_dm: scipy.sparse.csr_matrix
B-spline basis matrix
"""
# To use jit we have to use float64
x = np.asarray(x, np.float64)
if not isinstance(n_knots, int):
raise ValueError('`n_knots` must be an integer.')
if n_knots - degree <= 0:
raise ValueError('n_knots must be greater than degree.')
if (knots is None) and (n_knots is not None):
knots = np.asarray([s[-1] for s in np.array_split(np.argsort(x), n_knots - degree)[:-1]])
knots = [np.mean([x[k], x[k + 1]]) for k in knots]
elif (knots is None) and (n_knots is None):
raise ValueError('Pass either `n_knots` or `knots`.')
knots = np.append(np.append(x.min(), knots), x.max())
knots = np.unique(knots)
knots_wbounds = np.append(np.append([x.min()] * (degree - 1), knots), [x.max()] * (degree))
matrices = [csr_matrix(basis(x, degree, idx, knots_wbounds)) for idx in np.arange(-1, len(knots_wbounds) - degree - 1)]
spline_dm = vstack([m for m in matrices if (m.sum() != 0) ], format='csr').T
return spline_dm |
Hi all! I also tackled this problem late last year, and the result was a new Python library called Formulaic. It is orders of magnitude faster than patsy, even beating R in many cases. I'd welcome feedback there :). Documentation is still lacking, but you can do:
which will result in a sparse model matrix. |
@matthewwardrop I've tried to use the package and the following script works :) |
Yes... Things have evolved! Glad it worked for you! |
Is it currently possible to build a sparse design matrix from a formula? This is desirable for formulas with lot of categorical variables and their interactions, which are naturally represented as a sparse matrix of zeros and a relatively few ones. R sparse.model.matrix produces an sparse matrix but, when the model includes interactions, it fallbacks to the standard model.matrix so a dense matrix is created as an intermediate step. I don't know if patsy includes any sparse matrix support at all; in case it doesn't consider this a request for enhancement.
The text was updated successfully, but these errors were encountered: