
Slow ingest for relatively big SPSS files #8954

Open
lubitchv opened this issue Sep 1, 2022 · 4 comments


@lubitchv
Contributor

lubitchv commented Sep 1, 2022

For relatively big SPSS files (150-400 MB), ingest is very slow: it usually takes 1-3 hours, and some files can be stuck in the ingest process for 12 hours. Ingest usually takes 100% of a CPU, so the maximum number of simultaneous ingests can only be less than or equal to the number of CPUs on the server.
We are in the process of transferring from Nesstar and have thousands of datasets with relatively large SPSS files. With such slow ingest, the transition is difficult.
It would therefore be useful to optimize the ingest code to speed up ingest.

@donsizemore
Contributor

@lubitchv one proposal I remember was to have Dataverse only allow n-1 concurrent ingests, where n equals the number of cores available on the node. I can't find that in an open issue, though.

@lubitchv
Contributor Author

lubitchv commented Sep 1, 2022

Limiting the number of concurrent ingests would resolve the security issue, but it will not resolve our problem of uploading a relatively large number of datasets to Dataverse, with ingest, in a reasonable timeframe.

@pdurbin
Member

pdurbin commented Dec 13, 2024

@CB-HAL

CB-HAL commented Dec 17, 2024

Slow ingest also affects dta and csv files. I generated two test datasets (uniformly distributed random values 0-10, without variable or value labels) as dta, sav, and csv, and ingested them on two of our test servers (dv03, dv06). The attached Stata code generates the test data; a rough equivalent in Java is sketched after the results below.

11000 observations, 6200 variables:
test_data_11000x6200v.dta 270MB dv03: 14h10, dv06: 13h40
test_data_11000x6200v.sav 533MB dv03: 6h11, dv06: 7h55
test_data_11000x6200v.csv 139MB dv03: 6h36, dv06: 7h52

6200 observations, 11000 variables:
test_data_6200x11000v.dta 273MB dv03: 21h58, dv06: 28h27
test_data_6200x11000v.sav 533MB dv03: 11h46, dv06: 14h54
test_data_6200x11000v.csv 139MB dv03: 14h1, dv06: 14h53

Interestingly, dta takes much longer than sav and csv, and sav is even faster than csv. If the observations-by-variables data matrix is transposed, the duration increases significantly.

gen_test_data.zip
