
Slow ingest for relatively big SPSS files #8954

Open
lubitchv opened this issue Sep 1, 2022 · 4 comments


@lubitchv
Contributor

lubitchv commented Sep 1, 2022

For relatively big SPSS files (150-400 MB), ingest is very slow: it usually takes 1-3 hours, and some files can be stuck in the ingest process for 12 hours. Ingest usually takes 100% of a CPU, so the maximum number of simultaneous ingests can only be less than or equal to the number of CPUs on the server.
We are in the process of transferring from Nesstar and have thousands of datasets with relatively large SPSS files. With such slow ingest, the transition is difficult.
It would therefore be useful to optimize the ingest code to speed up ingest.

@donsizemore
Contributor

@lubitchv one proposal I remember was to have Dataverse only allow n-1 concurrent ingests, where n equals the number of cores available on the node. I can't find that in an open issue, though.

@lubitchv
Contributor Author

lubitchv commented Sep 1, 2022

Limiting the number of concurrent ingests would resolve the security issue, but it will not resolve our problem of uploading a relatively large number of datasets to Dataverse, with ingest, in a reasonable timeframe.

@pdurbin
Member

pdurbin commented Dec 13, 2024

@CB-HAL

CB-HAL commented Dec 17, 2024

Slow ingest also affects dta and csv files. I generated two test datasets (uniformly distributed random values 0-10, without variable or value labels) as dta, sav, and csv, and ingested them on two of our test servers (dv03, dv06). The attached Stata code generates the test data; a rough equivalent in Java is sketched after the results below.

11000 observations, 6200 variables:
test_data_11000x6200v.dta 270MB dv03: 14h10, dv06: 13h40
test_data_11000x6200v.sav 533MB dv03: 6h11, dv06: 7h55
test_data_11000x6200v.csv 139MB dv03: 6h36, dv06: 7h52

6200 observations, 11000 variables:
test_data_6200x11000v.dta 273MB dv03: 21h58, dv06: 28h27
test_data_6200x11000v.sav 533MB dv03: 11h46, dv06: 14h54
test_data_6200x11000v.csv 139MB dv03: 14h1, dv06: 14h53

Interestingly, dta takes much longer than sav and csv, and sav is even faster than csv. If the observations-by-variables data matrix is transposed, the duration increases significantly.

gen_test_data.zip
