Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse matrix from formula #46

Open
memeplex opened this issue Jun 19, 2014 · 11 comments
Open

Sparse matrix from formula #46

memeplex opened this issue Jun 19, 2014 · 11 comments

Comments

@memeplex
Copy link

Is it currently possible to build a sparse design matrix from a formula? This is desirable for formulas with lot of categorical variables and their interactions, which are naturally represented as a sparse matrix of zeros and a relatively few ones. R sparse.model.matrix produces an sparse matrix but, when the model includes interactions, it fallbacks to the standard model.matrix so a dense matrix is created as an intermediate step. I don't know if patsy includes any sparse matrix support at all; in case it doesn't consider this a request for enhancement.

@njsmith
Copy link
Member

njsmith commented Jun 19, 2014

Hello,

No, at the moment patsy has no support for directly consuming or producing
sparse matrices. It's been on the TODO list since the beginning, and
there's no particular obstacle to doing so, it's just that neither I nor
anyone else have gotten around to implementing it. (If you want to have a
go, I'll happily review the patch ;-).)

On Thu, Jun 19, 2014 at 7:02 PM, memeplex [email protected] wrote:

Is it currently possible to build a sparse design matrix from a formula?
This is desirable for formulas with lot of categorical variables and their
interactions, which are naturally represented as a sparse matrix of zeros
and a relatively few ones. R sparse.model.matrix produces an sparse matrix
but, when the model includes interactions, it fallbacks to the standard
model.matrix so a dense matrix is created as an intermediate step. I don't
know if patsy includes any sparse matrix support at all; in case it doesn't
consider this a request for enhancement.


Reply to this email directly or view it on GitHub
#46.

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@kyleabeauchamp
Copy link

FWIW, it's easy enough to do this in some settings:

X = patsy.dmatrices("y ~ col0 * col1 - 1", data, return_type="dataframe")[1]
X_csr = scipy.sparse.csr_matrix(X.values)

@njsmith
Copy link
Member

njsmith commented Jan 11, 2015

Right, but this requires that you first create the dense matrix in memory,
which kinda defeats the purpose in a lot of cases :-)
On 11 Jan 2015 20:45, "Kyle Beauchamp" [email protected] wrote:

FWIW, it's easy enough to do this in some settings:

X = patsy.dmatrices("y ~ col0 * col1 - 1", data, return_type="dataframe")[1]
X_csr = scipy.sparse.csr_matrix(X.values)


Reply to this email directly or view it on GitHub
#46 (comment).

@kyleabeauchamp
Copy link

Agreed. My particular use case was for doing MCMC with the sparse matrix via pymc2, however, so I was able to get massive speedups when taking products of model parameters and the featurized input data.

@seanv507
Copy link

seanv507 commented Jan 3, 2020

@njsmith could you provide some pointers on what would be needed for patsy
to support generating sparse matrices. Would be interested in submitting a patch.

@bashtage
Copy link
Contributor

bashtage commented Jan 3, 2020 via email

@christinahedges
Copy link

I'd also be interested in this! @njsmith do you have pointers on what would be needed to build this patch? I'd be happy to help

@christinahedges
Copy link

christinahedges commented Apr 23, 2020

Hi all,

I was interested in this problem for speeding up some of my code. I've been relying on the excellent patsy library, but the datasets I use tend to have a lot of data points and splines can end up being slow to compute. I wrote up a faster b-spline basis implementation that uses scipy.sparse. I'm using it in this (open) pull request on the package I develop (lightkurve).

As written below this provides me with a roughly 2x speed up over the non-sparse patsy implementation, and uses 25% of the memory compared to non-sparse patsy.

I think this could still be improved upon and sped up, but if anyone needs a sparse implementation of b-splines, this is a good place to start.

import numpy as np
from scipy.sparse import csr_matrix, vstack
from numba import jit

# jit provides a massive speed up
@jit(nopython=True)
def basis(x, degree, i, knots):
    """Create a spline basis for an input x, for the ith knot

    Parameters
    ----------
    x : np.ndarray
        Input x
    degree : int
        Degree of spline to calculate basis for
    i : int
        The index of the knot to calculate the basis for
    knots : np.ndarray
        Array of all knots
    """
    if degree == 0:
        B = np.zeros(len(x))
        B[(x >= knots[i]) & (x <= knots[i+1])] = 1
        #B = (B)
    else:
#        alpha1, alpha2 = 0, 0
        da = (knots[degree + i] - knots[i])
        db = (knots[i + degree + 1] - knots[i + 1])
        if ((knots[degree + i] - knots[i]) != 0):
            alpha1 = ((x - knots[i])/da)
        else:
            alpha1 = np.zeros(len(x))
        if ((knots[i+degree+1] - knots[i+1]) != 0):
            alpha2 = ((knots[i + degree + 1] - x)/db)
        else:
            alpha2 = np.zeros(len(x))
        B = (basis(x, (degree-1), i, knots)) * (alpha1) + (basis(x, (degree-1), (i+1), knots)) * (alpha2)
    return B


def create_sparse_spline_matrix(x, n_knots=20, knots=None, degree=3):
    """Returns a scipy.sparse.csr_matrix which contains the B-spline basis

    Parameters
    ----------
    x : np.ndarray
        vector to spline
    n_knots: int
        Number of knots (default: 20).
    knots : np.ndarray [optional]
        Optional array containing knots
    degree: int
        Polynomial degree.

    Returns
    -------
    spline_dm: scipy.sparse.csr_matrix
        B-spline basis matrix
    """

    # To use jit we have to use float64
    x = np.asarray(x, np.float64)

    if not isinstance(n_knots, int):
        raise ValueError('`n_knots` must be an integer.')
    if n_knots - degree <= 0:
        raise ValueError('n_knots must be greater than degree.')

    if (knots is None)  and (n_knots is not None):
        knots = np.asarray([s[-1] for s in np.array_split(np.argsort(x), n_knots - degree)[:-1]])
        knots = [np.mean([x[k], x[k + 1]]) for k in knots]
    elif (knots is None)  and (n_knots is None):
        raise ValueError('Pass either `n_knots` or `knots`.')

    knots = np.append(np.append(x.min(), knots), x.max())
    knots = np.unique(knots)

    knots_wbounds = np.append(np.append([x.min()] * (degree - 1), knots), [x.max()] * (degree))
    matrices = [csr_matrix(basis(x, degree, idx, knots_wbounds)) for idx in np.arange(-1, len(knots_wbounds) - degree - 1)]
    spline_dm = vstack([m for m in matrices if (m.sum() != 0) ], format='csr').T
    return spline_dm

@matthewwardrop
Copy link
Collaborator

Hi all! I also tackled this problem late last year, and the result was a new Python library called Formulaic. It is orders of magnitude faster than patsy, even beating R in many cases. I'd welcome feedback there :).

Documentation is still lacking, but you can do:

import pandas
from formulaic import Formula

df = pandas.DataFrame({
    'y': [0,1,2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

y,X = Formula('y ~ x + z').get_model_matrix(df, sparse=True)

which will result in a sparse model matrix.

@jykr
Copy link

jykr commented Apr 12, 2022

@matthewwardrop I've tried to use the package and the following script works :)
X = Formula('x + z').get_model_matrix(df, output='sparse')

@matthewwardrop
Copy link
Collaborator

Yes... Things have evolved! Glad it worked for you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants