
Timeline & beta testing #24

Open
piskvorky opened this issue Dec 20, 2014 · 23 comments

@piskvorky
Member

Hello gentlemen,

what is the expected timeline for releasing semanticizest?

I have a client eager to try it out (because semanticizer is so slow, it's a bottleneck for them).

So I'm wondering whether there's a way to share (human) resources -- maybe they could do some beta testing and benchmarking on live data as soon as you declare the new semanticizest production-ready?

@larsmans
Contributor

This was planned for last month, but the deadline slipped because of other projects that had priority. My own plan is to release a bare-bones version in the second week of January.

@piskvorky
Member Author

Great, thanks Lars. Please ping me when you think it's ready (not sure what bare-bones means, perhaps that's enough).

@larsmans
Contributor

Bare-bones means we can semanticize with baseline metrics. No count-min sketch needed (you only need that for fitting more complicated models on the output of semanticizest).
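For background, a count-min sketch is a small, fixed-size table of hashed counters used to approximate large count tables in bounded memory; a generic minimal sketch (not semanticizest's implementation) looks roughly like this:

```python
import hashlib

class CountMinSketch:
    """Approximate counter: queries return an over-estimate of the true
    count, with error bounded by the chosen width and depth."""

    def __init__(self, width=2 ** 18, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        # One hashed bucket per row of the table.
        for row in range(self.depth):
            digest = hashlib.md5(("%d:%s" % (row, key)).encode("utf-8")).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def query(self, key):
        # The minimum over all rows is the tightest over-estimate.
        return min(self.table[row][col] for row, col in self._buckets(key))
```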

@larsmans
Contributor

Hi @piskvorky, I think what we have now is ready for beta-testing. Would you like to have a try?

I wanted to merge #22 before releasing a beta, but it needs tests and I'm not going to postpone any further. The current functionality should be close to what the old semanticizer could do.

@piskvorky
Member Author

Excellent, thanks! Will check the situation with the client and report back.

By the way, I remember we trained an extra model with David Graus, on Yahoo! queries. Is a similar thing possible here? Or does it not make sense? CC @graus.

@larsmans
Contributor

There's no re-ranking model in here. If you want that, you'd need to stack a model on top.

@graus

graus commented Jan 30, 2015

(Which I will definitely work on if it does not magically appear -- I don't recall if or where this item ended up on the roadmap, nor whether there is a roadmap.)

@larsmans
Contributor

The idea was to provide all the information necessary for feature extraction for such a model. The problem is that we can't ship models, training data or anything of the kind, so we can't test this stuff and it will go stale.
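To make that concrete, stacking a re-ranking model on top of the candidate output could look roughly like the sketch below. This is purely illustrative: the candidate fields (`commonness`, `senseprob`, `target`) and the scikit-learn classifier are assumptions, not part of semanticizest.

```python
# Hypothetical re-ranker stacked on semanticizest output; the candidate
# field names and the labelled training data are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(cand):
    # Turn one candidate dict into a small feature vector.
    return [cand["commonness"], cand["senseprob"], len(cand["target"])]

def train_reranker(labelled):
    # labelled: iterable of (candidate_dict, is_correct_link) pairs.
    X = np.array([features(c) for c, _ in labelled])
    y = np.array([label for _, label in labelled])
    return LogisticRegression().fit(X, y)

def rerank(model, candidates):
    # Score each candidate and return them best-first.
    scores = model.predict_proba(np.array([features(c) for c in candidates]))[:, 1]
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in ranked]
```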

@piskvorky
Member Author

Thanks guys.

One question: skimming the docs, I can't see an obvious answer (I'll have to check the API in more detail later). I remember we had some disambiguation issues and needed semanticizer to return "unnormalized" statistics too, for some local post-filtering of results.

We ended up adding something like `link['unnormed'] = self.wpm.get_sense_data(ngram, sense_str)` to our fork of semanticizer.

Is this already included in semanticizest, or will we have to add it again ourselves (+ pull request)? Or does this question not even make sense for the new code/approach?

@larsmans
Contributor

We're not returning that, but we should. There's an XXX in `semanticizest/_semanticizer.py` where it needs to be added...
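To illustrate what is being asked for: besides the normalized sense probability, each candidate would also carry the raw ("unnormalized") link count, so downstream code can post-filter on it. The field names and values below are hypothetical, not the actual semanticizest output format:

```python
# Hypothetical candidate entry once raw statistics are returned; the keys
# and values are made up for illustration only.
candidate = {
    "ngram": "apple",
    "target": "Apple_Inc.",
    "linkcount": 120,    # raw count of links ngram -> target ("unnormalized")
    "senseprob": 0.4,    # linkcount divided by the ngram's total link count
}

def keep(cand, min_count=5):
    # Local post-filtering on the raw count, as discussed above.
    return cand["linkcount"] >= min_count
```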

@c-martinez
Contributor

Speaking of shipping models -- does anyone have an English model they wouldn't mind sharing with me? I've tried building it myself, but my laptop crashes after processing 3,920,000 articles. ;-(

@larsmans
Contributor

larsmans commented Feb 3, 2015

I'm building one right now, ready in three hours. Ping me by email tomorrow morning.

@c-martinez
Contributor

Cool thanks :-)

@piskvorky
Member Author

We ran some initial checks, created the EN model, and I want to make sure I understand correctly:

semanticizEST only has a single API method, `all_candidates(tokens)`. This returns all candidates for the given tokens, with no context and no disambiguation.

There's no API like in semanticizER where we send in a text (string) and get back the detected entities.

Is that correct?

Are you planning on extending the pipeline? What is the timeline for reaching roughly semanticizER-level functionality? The README says that is the goal.

Thanks! CC @tgalery @graus @larsmans
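For reference, calling that single method would look roughly like the sketch below; only `all_candidates(tokens)` comes from the discussion above, while the class name, model file and result handling are assumptions:

```python
# Hypothetical usage sketch of the single-method API discussed above.
from semanticizest import Semanticizer

sem = Semanticizer("enwiki_model.sqlite3")   # model file name is an assumption
tokens = "the prime minister of the netherlands".split()

for cand in sem.all_candidates(tokens):
    # Each candidate covers a token span and points to a Wikipedia target,
    # along with whatever statistics the model stores for that pair.
    print(cand)
```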

@piskvorky
Member Author

Ping @graus @larsmans: clarification of the project goals would be welcome. Please let us know what the status is.

Thanks a lot!

@larsmans
Contributor

The status is that we have a working replacement at http://github.com/semanticize/st that is only lacking:

  1. disambiguation
  2. REST API (currently being written)
  3. Python wrapper (that will be this repo, I guess)
  4. documentation

I plan to have this finished this week (and I'm working on Saturday). You're welcome to test this new version. People at UvA are already using it.

@piskvorky
Member Author

Thanks Lars.

@tgalery can you keep an eye on this? Once we know how to apply semanticizest at a level where it can replace semanticizer (i.e. an API for linking entities from plain text), let's evaluate.

@tgalery

tgalery commented Apr 30, 2015

Will do @piskvorky !

@larsmans
Contributor

larsmans commented May 3, 2015

The REST API now works; simple disambiguation is in the works but not yet finished.

@larsmans
Contributor

larsmans commented May 7, 2015

@tgalery The package is now ready for beta-testing, AFAIC.

@tgalery

tgalery commented May 7, 2015

Thanks @larsmans, I will have a go when I have the time.

@tgalery

tgalery commented Jun 18, 2015

Hi @larsmans @piskvorky, I finally had time to take a look at this. I have been playing with the Danish model, and it seems that the st project has pretty much the same functionality as the semanticizest project. The only difference is that instead of having a single endpoint where you get all the candidates, one can also get the candidates matching a best path through the text, or an exact match of the string (as documented here: https://github.com/semanticize/st/blob/68465fe840a6087698df8963af5980373c5cedb4/cmd/semanticizest/webserver.go).

Although this is functionality added on top of semanticizest, it seems that there is no spotter per se (i.e. something that determines which surface forms in the whole text are worth extracting candidates from), nor any robust incorporation of context. Am I right? Are there any plans to incorporate those in the project?
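For anyone following along, querying the st webserver could look roughly like the sketch below; the host, port and endpoint path are assumptions here, and the authoritative routes are in the webserver.go file linked above:

```python
# Hypothetical client for the st REST API; check the linked webserver.go
# for the actual routes -- the endpoint name and port below are guesses.
import json
import requests

text = "København er hovedstaden i Danmark."
resp = requests.post("http://localhost:5002/all", data=text.encode("utf-8"))
resp.raise_for_status()

for cand in resp.json():
    # One entry per candidate link found in the posted text.
    print(json.dumps(cand, indent=2))
```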

@larsmans
Contributor

Candidate entities are determined by semanticizest itself; this is a consequence of the hash representation. There are also no context features. We don't have plans to add them, but if @dodijk agrees that we're missing them, they could be added.

The plan was to have semanticizest do basic entity linking, and do it fast, without too many dependencies in terms of training sets, with enough useful output for downstream code to improve its results.
