Sparse matrix from formula #46

memeplex · 2014-06-19T18:02:11Z

Is it currently possible to build a sparse design matrix from a formula? This is desirable for formulas with lot of categorical variables and their interactions, which are naturally represented as a sparse matrix of zeros and a relatively few ones. R sparse.model.matrix produces an sparse matrix but, when the model includes interactions, it fallbacks to the standard model.matrix so a dense matrix is created as an intermediate step. I don't know if patsy includes any sparse matrix support at all; in case it doesn't consider this a request for enhancement.

njsmith · 2014-06-19T18:44:30Z

Hello,

No, at the moment patsy has no support for directly consuming or producing
sparse matrices. It's been on the TODO list since the beginning, and
there's no particular obstacle to doing so, it's just that neither I nor
anyone else have gotten around to implementing it. (If you want to have a
go, I'll happily review the patch ;-).)

On Thu, Jun 19, 2014 at 7:02 PM, memeplex [email protected] wrote:

Is it currently possible to build a sparse design matrix from a formula?
This is desirable for formulas with lot of categorical variables and their
interactions, which are naturally represented as a sparse matrix of zeros
and a relatively few ones. R sparse.model.matrix produces an sparse matrix
but, when the model includes interactions, it fallbacks to the standard
model.matrix so a dense matrix is created as an intermediate step. I don't
know if patsy includes any sparse matrix support at all; in case it doesn't
consider this a request for enhancement.

—
Reply to this email directly or view it on GitHub
#46.

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

kyleabeauchamp · 2015-01-11T20:45:57Z

FWIW, it's easy enough to do this in some settings:

X = patsy.dmatrices("y ~ col0 * col1 - 1", data, return_type="dataframe")[1]
X_csr = scipy.sparse.csr_matrix(X.values)

njsmith · 2015-01-11T21:02:04Z

Right, but this requires that you first create the dense matrix in memory,
which kinda defeats the purpose in a lot of cases :-)
On 11 Jan 2015 20:45, "Kyle Beauchamp" [email protected] wrote:

FWIW, it's easy enough to do this in some settings:

X = patsy.dmatrices("y ~ col0 * col1 - 1", data, return_type="dataframe")[1]
X_csr = scipy.sparse.csr_matrix(X.values)

—
Reply to this email directly or view it on GitHub
#46 (comment).

kyleabeauchamp · 2015-01-11T21:04:24Z

Agreed. My particular use case was for doing MCMC with the sparse matrix via pymc2, however, so I was able to get massive speedups when taking products of model parameters and the featurized input data.

seanv507 · 2020-01-03T11:50:18Z

@njsmith could you provide some pointers on what would be needed for patsy
to support generating sparse matrices. Would be interested in submitting a patch.

bashtage · 2020-01-03T13:58:47Z

I would be happy to help. I just don't have time to head on this.

…

On Fri, Jan 3, 2020, 12:50 seanv507 ***@***.***> wrote: @njsmith <https://github.com/njsmith> could you provide some pointers on what would be needed for patsy to support generating sparse matrices. Would be interested in submitting a patch. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#46?email_source=notifications&email_token=ABKTSRP62R445OJC2QFND2TQ34Q73A5CNFSM4AQXKTP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIA62UI#issuecomment-570551633>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABKTSRN7I2GHOZB5DGYUDG3Q34Q73ANCNFSM4AQXKTPQ> .

christinahedges · 2020-04-03T23:03:42Z

I'd also be interested in this! @njsmith do you have pointers on what would be needed to build this patch? I'd be happy to help

christinahedges · 2020-04-23T02:42:54Z

Hi all,

I was interested in this problem for speeding up some of my code. I've been relying on the excellent patsy library, but the datasets I use tend to have a lot of data points and splines can end up being slow to compute. I wrote up a faster b-spline basis implementation that uses scipy.sparse. I'm using it in this (open) pull request on the package I develop (lightkurve).

As written below this provides me with a roughly 2x speed up over the non-sparse patsy implementation, and uses 25% of the memory compared to non-sparse patsy.

I think this could still be improved upon and sped up, but if anyone needs a sparse implementation of b-splines, this is a good place to start.

import numpy as np
from scipy.sparse import csr_matrix, vstack
from numba import jit

# jit provides a massive speed up
@jit(nopython=True)
def basis(x, degree, i, knots):
    """Create a spline basis for an input x, for the ith knot

    Parameters
    ----------
    x : np.ndarray
        Input x
    degree : int
        Degree of spline to calculate basis for
    i : int
        The index of the knot to calculate the basis for
    knots : np.ndarray
        Array of all knots
    """
    if degree == 0:
        B = np.zeros(len(x))
        B[(x >= knots[i]) & (x <= knots[i+1])] = 1
        #B = (B)
    else:
#        alpha1, alpha2 = 0, 0
        da = (knots[degree + i] - knots[i])
        db = (knots[i + degree + 1] - knots[i + 1])
        if ((knots[degree + i] - knots[i]) != 0):
            alpha1 = ((x - knots[i])/da)
        else:
            alpha1 = np.zeros(len(x))
        if ((knots[i+degree+1] - knots[i+1]) != 0):
            alpha2 = ((knots[i + degree + 1] - x)/db)
        else:
            alpha2 = np.zeros(len(x))
        B = (basis(x, (degree-1), i, knots)) * (alpha1) + (basis(x, (degree-1), (i+1), knots)) * (alpha2)
    return B


def create_sparse_spline_matrix(x, n_knots=20, knots=None, degree=3):
    """Returns a scipy.sparse.csr_matrix which contains the B-spline basis

    Parameters
    ----------
    x : np.ndarray
        vector to spline
    n_knots: int
        Number of knots (default: 20).
    knots : np.ndarray [optional]
        Optional array containing knots
    degree: int
        Polynomial degree.

    Returns
    -------
    spline_dm: scipy.sparse.csr_matrix
        B-spline basis matrix
    """

    # To use jit we have to use float64
    x = np.asarray(x, np.float64)

    if not isinstance(n_knots, int):
        raise ValueError('`n_knots` must be an integer.')
    if n_knots - degree <= 0:
        raise ValueError('n_knots must be greater than degree.')

    if (knots is None)  and (n_knots is not None):
        knots = np.asarray([s[-1] for s in np.array_split(np.argsort(x), n_knots - degree)[:-1]])
        knots = [np.mean([x[k], x[k + 1]]) for k in knots]
    elif (knots is None)  and (n_knots is None):
        raise ValueError('Pass either `n_knots` or `knots`.')

    knots = np.append(np.append(x.min(), knots), x.max())
    knots = np.unique(knots)

    knots_wbounds = np.append(np.append([x.min()] * (degree - 1), knots), [x.max()] * (degree))
    matrices = [csr_matrix(basis(x, degree, idx, knots_wbounds)) for idx in np.arange(-1, len(knots_wbounds) - degree - 1)]
    spline_dm = vstack([m for m in matrices if (m.sum() != 0) ], format='csr').T
    return spline_dm

matthewwardrop · 2020-07-06T17:10:38Z

Hi all! I also tackled this problem late last year, and the result was a new Python library called Formulaic. It is orders of magnitude faster than patsy, even beating R in many cases. I'd welcome feedback there :).

Documentation is still lacking, but you can do:

import pandas
from formulaic import Formula

df = pandas.DataFrame({
    'y': [0,1,2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

y,X = Formula('y ~ x + z').get_model_matrix(df, sparse=True)

which will result in a sparse model matrix.

jykr · 2022-04-12T16:22:54Z

@matthewwardrop I've tried to use the package and the following script works :)
X = Formula('x + z').get_model_matrix(df, output='sparse')

matthewwardrop · 2022-04-12T20:43:45Z

Yes... Things have evolved! Glad it worked for you!

matthewwardrop added the fixed in formulaic label Apr 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sparse matrix from formula #46

Sparse matrix from formula #46

memeplex commented Jun 19, 2014

njsmith commented Jun 19, 2014

kyleabeauchamp commented Jan 11, 2015

njsmith commented Jan 11, 2015

kyleabeauchamp commented Jan 11, 2015

seanv507 commented Jan 3, 2020

bashtage commented Jan 3, 2020 via email

christinahedges commented Apr 3, 2020

christinahedges commented Apr 23, 2020 •

edited

Loading

matthewwardrop commented Jul 6, 2020

jykr commented Apr 12, 2022

matthewwardrop commented Apr 12, 2022

Sparse matrix from formula #46

Sparse matrix from formula #46

Comments

memeplex commented Jun 19, 2014

njsmith commented Jun 19, 2014

kyleabeauchamp commented Jan 11, 2015

njsmith commented Jan 11, 2015

kyleabeauchamp commented Jan 11, 2015

seanv507 commented Jan 3, 2020

bashtage commented Jan 3, 2020 via email

christinahedges commented Apr 3, 2020

christinahedges commented Apr 23, 2020 • edited Loading

matthewwardrop commented Jul 6, 2020

jykr commented Apr 12, 2022

matthewwardrop commented Apr 12, 2022

christinahedges commented Apr 23, 2020 •

edited

Loading