
Provide benchmarks testing against available query-server implementations. #5

Open
dmunch opened this issue Mar 5, 2017 · 6 comments


@dmunch
Owner

dmunch commented Mar 5, 2017

Might be something along the lines of this blog post: http://blog.idempotent.ca/2016/12/19/couchdb-indexing-benchmark/

@excieve also mentions working on some benchmarking in
#2 (comment)

@excieve

excieve commented Mar 6, 2017

Some preliminary benchmark results, using a method similar to the one from the article:

1st time (after compaction and view cleanup)
$ python3 utils/addview.py data/reference/income_map_function.js -u admin -p admin -D map_tests_js -v read_income_test -l javascript -b
INFO:dragnet.addview:average = 657.98 (all_tasks) 657.98 (one task)
INFO:dragnet.addview:total time = 55s

2nd time (updating a view with a slightly modified function)
$ python3 utils/addview.py data/reference/income_map_function.js -u admin -p admin -D map_tests_js -v read_income_test -l javascript -b
INFO:dragnet.addview:average = 1196.54 (all_tasks) 149.87 (one task)
INFO:dragnet.addview:total time = 249s

1st time (after compaction and view cleanup)
$ python3 utils/addview.py data/reference/income_map_function.es6 -u admin -p admin -D map_tests_ch -v read_income_test -l chakra -b
INFO:dragnet.addview:average = 676.65 (all_tasks) 676.65 (one task)
INFO:dragnet.addview:total time = 52s

2nd time (updating a view with a slightly modified function)
$ python3 utils/addview.py data/reference/income_map_function.es6 -u admin -p admin -D map_tests_ch -v read_income_test -l chakra -b
INFO:dragnet.addview:average = 1213.02 (all_tasks) 147.49 (one task)
INFO:dragnet.addview:total time = 243s

The first two runs are using couchjs and the last two are with couch-chakra. Note how the first run (after deleting the design doc, compacting the DB and running the view cleanup) differs from the design-doc update: it also runs in just one process instead of 8, as in the update case. I'm not quite sure why this is. My prior understanding of CouchDB's MapReduce was that it traverses the whole B-tree structure, applies map to each item, and builds a new B-tree each time. But with these results it might work a bit differently...
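For anyone trying to reproduce these figures: the per-task numbers presumably come from polling CouchDB's standard `/_active_tasks` endpoint while the view builds. A minimal sketch in Python (the endpoint and its fields are standard CouchDB; the exact averaging done by `addview.py` is my assumption, and the credentials/URL are placeholders):

```python
import json
import urllib.request

COUCH_URL = "http://admin:admin@localhost:5984"  # hypothetical local instance


def average_rate(tasks):
    """Average docs/sec across running indexer tasks (assumed formula for
    the 'average = ...' figures in the log lines above)."""
    rates = [t["changes_done"] / max(t["updated_on"] - t["started_on"], 1)
             for t in tasks if t.get("type") == "indexer"]
    return sum(rates) / len(rates) if rates else 0.0


def sample_indexers(base_url=COUCH_URL):
    """Take one sample of CouchDB's /_active_tasks endpoint."""
    with urllib.request.urlopen(base_url + "/_active_tasks") as resp:
        return json.load(resp)
```

Polling `sample_indexers()` in a loop and feeding the samples to `average_rate()` should roughly recover the numbers reported above.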

@excieve

excieve commented Mar 6, 2017

Please ignore the part about the first view creation running better. Apparently in those cases CouchDB had only indexed one shard out of 8, and the next query then indexed the whole view. So only the "2nd time" runs really matter for the benchmarking process.

@dmunch
Owner Author

dmunch commented Mar 7, 2017

Just checked out your repository https://github.com/excieve/dragnet, great work and write-up on benchmarking views!

I'm just curious: do you think it would be possible to provide a synthetic dataset for benchmarking based on your real-world dataset? I'm aware that your data is most certainly private, but maybe you could come up with some sort of anonymization procedure. It's just that real-world data is so much more meaningful for benchmarks!

Another thing I'd be curious about is the average CPU time and memory used during indexing. I'd like to see how ChakraCore compares to SpiderMonkey there, especially since ChakraCore is supposed to be optimized for IoT scenarios.

Good news from the binary-protocol side too: I have all the view tests running now. CouchApp-specific functions still fail, but I'm going to ignore those for the moment. There's still some cleanup work to do and, more importantly, some minor modifications in CouchDB itself, but conceptually I'm on the right track!

@excieve

excieve commented Mar 7, 2017

Actually, the dataset I'm using is completely public. It's a subset of the tax and property ownership declarations that all public employees in Ukraine have to submit annually (and upon certain events) to an official online registry. After that they are in the public domain, available from the National Agency for Corruption Prevention, like this one for instance: https://public-api.nazk.gov.ua/v1/declaration/3371ace7-177b-44d6-ba2a-53e023f740be.

We're planning to analyse them all in a continuous manner, but for testing purposes I'm only operating on a subset. I can provide you with a file suitable for feeding into this import script, or maybe just an archive of CouchDB's data volume with this dataset already imported. Which do you prefer?
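For what it's worth, a file of documents can be loaded with CouchDB's standard `_bulk_docs` endpoint regardless of which option is chosen. A minimal, generic sketch (the database URL and batching are placeholders, not the actual import script linked above):

```python
import json
import urllib.request


def bulk_docs_payload(docs):
    """Wrap a list of documents in the body expected by /{db}/_bulk_docs."""
    return json.dumps({"docs": docs}).encode("utf-8")


def import_batch(db_url, docs):
    """POST one batch of documents; db_url is e.g.
    http://admin:admin@localhost:5984/map_tests_js (hypothetical)."""
    req = urllib.request.Request(
        db_url + "/_bulk_docs",
        data=bulk_docs_payload(docs),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # one status object per document
```

Importing in batches of a few thousand documents keeps request sizes manageable on larger datasets.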

Another thing I'd be curious about is average CPU time and memory used during indexing.

In all cases the query servers don't really consume much CPU time, due to the I/O bottleneck. The only case where I've seen a query server get more CPU than CouchDB was CPython with multiple iterations in the function (which didn't get optimised away as in JITted PyPy or the JS runtimes). But I agree, it would be nice to monitor and report those too. I'll see if I can get that into the benchmarks.
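One way to sample that on Linux without extra dependencies is reading `/proc` for the query-server processes. A rough sketch (the process name `couchjs` is an assumption, and this is Linux-only):

```python
import os
import re


def rss_kib(status_text):
    """Parse VmRSS (resident memory, KiB) out of /proc/<pid>/status contents."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else 0


def query_server_memory(name="couchjs"):
    """Total resident memory (KiB) of all processes whose comm matches name."""
    total = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                if f.read().strip() != name:
                    continue
            with open(f"/proc/{pid}/status") as f:
                total += rss_kib(f.read())
        except OSError:
            continue  # process exited between listing and reading
    return total
```

Sampling `query_server_memory()` periodically during a view build would give the average (and peak) memory profile per query-server implementation.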

Good news also from the binary protocol side: I have all the view tests running now.

That's awesome! Please let me know when there's something to try out and I will be happy to test.

@dmunch
Owner Author

dmunch commented Mar 8, 2017

That's definitely an interesting dataset :)

One thing I thought might be convenient is to provide a publicly available Cloudant instance with the demo data, which interested parties could replicate from. As far as I understand it's only free up to 1 GB, but maybe that's enough! But I'll take whatever works and is easiest for you; it shouldn't take too much time either!
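Pulling the data from such an instance would then just be a one-shot call to CouchDB's standard `/_replicate` endpoint. A sketch with hypothetical URLs:

```python
import json
import urllib.request


def replicate_request(source, target, create_target=True):
    """Body for CouchDB's /_replicate endpoint (one-shot pull replication)."""
    return json.dumps({
        "source": source,
        "target": target,
        "create_target": create_target,
    }).encode("utf-8")


def pull_dataset(local_url, remote_db, local_db):
    """Pull a remote database into a local one, e.g.
    pull_dataset("http://admin:admin@localhost:5984",
                 "https://example.cloudant.com/declarations",  # hypothetical
                 "declarations")."""
    req = urllib.request.Request(
        local_url + "/_replicate",
        data=replicate_request(remote_db, local_db),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```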

@excieve

excieve commented Mar 11, 2017

FYI: https://medium.com/@excieve/benchmarking-couchdb-views-abb7a0a891b2#.v6px6aid2
It includes a link to the raw dataset for importing. I'll see if I can get a Cloudant instance going though.
