Creating semanticizest model is slow #11
Comments
I'll give it another try using an unpacked dump.
Same thing for an uncompressed dump:

```
isijara1@zookst23 time python -m semanticizest.parse_wikidump nlwiki-20141120-pages-articles.xml nlwiki_20141120.unpacked.sqlite3 2>&1 | tee nlwiki_20141120.unpacked.log
(...)
real    70m35.425s
user    46m24.507s
sys     23m54.258s
```
Yes, it's slow. My line-by-line timings showed 80% of the time being spent in calls into sqlite. Other options for optimization are discussed in the comments below.
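For context, the comment above doesn't say which profiler produced the line-by-line timings; a generic way to get a comparable breakdown is the standard-library cProfile module. The `parse_dump` function below is a placeholder, not the real semanticizest entry point:

```python
import cProfile
import pstats


def parse_dump(path):
    """Placeholder for the actual dump-parsing entry point."""
    pass


# Profile one run and print the 20 most expensive calls by cumulative time;
# in a real run, the sqlite3 calls would be expected to dominate this list.
cProfile.run("parse_dump('nlwiki-20141120-pages-articles.xml')", "parse.prof")
pstats.Stats("parse.prof").sort_stats("cumulative").print_stats(20)
```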
My first guess would be to try to reduce the sqlite calls by batching them up (especially the insert+update pairs).
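As a rough sketch of what batching could look like with Python's built-in sqlite3 module: accumulate counts in memory and flush them in one transaction per batch. The `ngrams` table here is made up for illustration (the real semanticizest schema may differ), and the UPSERT form needs SQLite >= 3.24.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect("model.sqlite3")
# Hypothetical schema for illustration only.
conn.execute(
    "CREATE TABLE IF NOT EXISTS ngrams (ngram TEXT PRIMARY KEY, count INTEGER)")


def flush(counts):
    """Write a whole batch in one transaction instead of one commit per n-gram."""
    with conn:  # single BEGIN/COMMIT around the whole batch
        conn.executemany(
            # The UPSERT collapses the insert+update pair into one statement.
            "INSERT INTO ngrams (ngram, count) VALUES (?, ?) "
            "ON CONFLICT(ngram) DO UPDATE SET count = count + excluded.count",
            counts.items())


batch = Counter()
for ngram in ["de_kat", "op_de", "de_kat"]:  # stand-in for the parser's stream
    batch[ngram] += 1
    if len(batch) >= 10000:
        flush(batch)
        batch.clear()
flush(batch)
```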
The other thing worth trying is not creating the index for redirects when we create the db, but only after all pages have been inserted. That way we don't have to update that index after processing each page (and we still have it for the redirect handling). But this might also make the updates in step 2 slower...
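A minimal sketch of that idea, assuming a hypothetical `redirects` table (the real schema may differ): bulk-load first, then build the index once, right before redirect resolution needs it.

```python
import sqlite3

conn = sqlite3.connect("model.sqlite3")
conn.execute("CREATE TABLE IF NOT EXISTS redirects (source TEXT, target TEXT)")

# Step 1: insert pages *without* the index, so every insert stays cheap.
conn.executemany("INSERT INTO redirects VALUES (?, ?)",
                 [("Holland", "Nederland"), ("NL", "Nederland")])

# Step 2: create the index once, just before the redirect-handling pass.
conn.execute("CREATE INDEX IF NOT EXISTS redirects_source ON redirects (source)")
conn.commit()
```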
Indexing the first 10000 articles from an nlwiki dump previously took:

```
real    1m17.655s
user    0m48.409s
sys     0m25.430s
```

Now:

```
real    0m46.042s
user    0m40.603s
sys     0m4.408s
```
Here are some more SQLite performance tricks. I just turned journaling off for a ~40% speedup. I'm not yet sure how the Python bindings handle transactions, but the docs are here.
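For reference, a sketch of what turning journaling off looks like from Python. These PRAGMAs trade crash-safety for speed, which seems acceptable for a one-off dump import that can simply be re-run if it fails.

```python
import sqlite3

conn = sqlite3.connect("model.sqlite3")
conn.execute("CREATE TABLE IF NOT EXISTS demo (k TEXT, v INTEGER)")

# Skip the rollback journal and per-write fsyncs; if the import crashes,
# the database may be corrupted, so just rebuild it from the dump.
conn.execute("PRAGMA journal_mode = OFF")
conn.execute("PRAGMA synchronous = OFF")

# The stdlib bindings open an implicit transaction before INSERT/UPDATE and
# commit it on conn.commit() or when a `with conn:` block exits, so grouping
# many inserts per commit also cuts per-statement overhead.
with conn:
    conn.executemany("INSERT INTO demo VALUES (?, ?)", [("a", 1), ("b", 2)])
```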
For what it's worth, I played with implementing an SQLite backend to the old semanticizer (not semanticizest!). It was too slow: the data access patterns were too random => lots of cache misses => lots of slow disk access. Probably not related to the current version (this was for live ngram resolution IIRC, nothing to do with wiki preprocessing), and I assume you're using different data structures now and all. Also, counting table rows is super slow in sqlite (full table scan, unless rows are never deleted), though I'm sure you're well aware of that, just saying :)
The n-gram counts for non-links are going to be replaced by count-min sketches. Storing them explicitly takes too much space, and we only need to retrieve the frequencies.
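For readers unfamiliar with them, here is a minimal, unoptimized count-min sketch in pure Python; the 8 × 5000 dimensions match the numbers discussed below, but this is only an illustration, not the implementation semanticizest ended up using.

```python
import random


class CountMinSketch:
    """Approximate counter in fixed memory: depth rows of width counters."""

    def __init__(self, depth=8, width=5000, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]
        # One salt per row; hashing (salt, key) picks a column in that row.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]

    def _columns(self, key):
        for salt in self.salts:
            yield hash((salt, key)) % self.width

    def add(self, key, count=1):
        for row, col in zip(self.tables, self._columns(key)):
            row[col] += count

    def query(self, key):
        # Collisions only ever inflate counters, so the minimum over all
        # rows is the tightest (over)estimate of the true count.
        return min(row[col] for row, col in zip(self.tables, self._columns(key)))


cms = CountMinSketch()
for ngram in ["de_kat", "de_kat", "op_de"]:
    cms.add(ngram)
print(cms.query("de_kat"))  # >= 2; usually exactly 2
```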
Ah, ok. What implementation do you plan to use, Lars? I'm asking because @mfcabrera wants to add count-min sketches to gensim, and @PitZeGlide is working on (practical) extensions. Maybe there's room for some collaboration there?
=D I like the probabilistic (approximate) counting. And it seems that the lower relative error of the log version fits our use case nicely, since we'll be using these counts primarily to compute probabilities. Regarding @PitZeGlide's numbers, how do I read those? He counted 1.1e6 things (of which 60e3 unique ones)? And the resulting sketches are 8 × 5000 × 4 bytes = 160 kB? Which implementation? My vote is for the simplest one that'll work; optimize after we have a working version.
(it might be slow, but I don't think this should be a high priority thing -- parsing a Wikidump overnight is totally acceptable for the typical use-case, I can imagine) |
Hi there, sorry to intrude, but I have a few questions about the performance of model creation. I have downloaded an English Wikipedia dump and am running the extraction process. It processed around 3 million pages and then the disk started swapping. After around 0.5 million pages more, the process had consumed all virtual RAM and all the swap space. CPU usage is virtually nothing. Is there a minimum RAM requirement to run this? Any reason why so much RAM is being used?
That's because we're storing massive numbers of n-grams. We're switching to a hash-based implementation, but since we couldn't get that to perform well in Python, we're also rewriting the dump parser in Go: https://github.com/semanticize/dumpparser
The client code still has to be adapted to work with the hash-based dumps.
I'm too impatient for this...