Need pre-trained vectors for POS tags and universal dependencies?

Is there any pre-trained model available for POS tags and universal dependency tags? I am looking to build a dependency parser like the one in https://cs.stanford.edu/~danqi/papers/emnlp2014.pdf. I need dense embeddings for words (I already have a GloVe model), POS tags, and universal dependency tags.
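For what it's worth, in the Chen and Manning (2014) parser the POS-tag and dependency-label embeddings are not pre-trained at all: they are initialized randomly in a small range and learned jointly with the parser, so no pre-trained vectors are needed for them. A minimal sketch in Python (the tag and label inventories here are illustrative, not the full sets used in the paper):

import numpy as np

# illustrative inventories; the paper uses the full POS tag set and the
# full dependency label set of the treebank
POS_TAGS = ["NOUN", "VERB", "ADJ", "DET"]
DEP_LABELS = ["nsubj", "obj", "amod", "det"]
DIM = 50  # embedding dimension, matching the word vectors

rng = np.random.default_rng(0)
# randomly initialized, then updated by backpropagation during training
pos_emb = rng.uniform(-0.01, 0.01, size=(len(POS_TAGS), DIM))
dep_emb = rng.uniform(-0.01, 0.01, size=(len(DEP_LABELS), DIM))

pos_index = {tag: i for i, tag in enumerate(POS_TAGS)}
noun_vector = pos_emb[pos_index["NOUN"]]  # dense vector for the NOUN tag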

Related

SpaCy language neutral Named Entity Recognition

SpaCy provides a tutorial for training custom entities, and one of the lines is
...
nlp = spacy.load('en', entity=False, parser=False)
ner = EntityRecognizer(nlp.vocab, entity_types=['PERSON', 'LOC'])
...
Is it possible to train SpaCy without an explicit (English) language model, so that it looks for patterns and extracts entities from the stream of characters?
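For reference, current spaCy (v3) does allow starting from a blank, language-neutral pipeline rather than the English model. A minimal sketch using the multi-language class "xx" (the toy training example is made up):

import spacy
from spacy.training import Example

nlp = spacy.blank("xx")  # language-neutral pipeline, no English model involved
ner = nlp.add_pipe("ner")
for label in ("PERSON", "LOC"):
    ner.add_label(label)

# hypothetical toy example; entities are (start_char, end_char, label) spans
train_data = [("John lives in Berlin",
               {"entities": [(0, 4, "PERSON"), (14, 20, "LOC")]})]

optimizer = nlp.initialize()
for _ in range(20):
    for text, annotations in train_data:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

Note that the pipeline still tokenizes with rule-based tokenization; it does not operate on a raw character stream.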

Defining new language grammar rules?

Can you help me with editing the .tagger file in Stanford NLP? I have a problem here: I can't open and edit the file to define grammar rules for a new language in order to generate part-of-speech tags.
The .tagger files are serialized statistical models used by a Maximum Entropy based sequence tagger. You can't edit them in any meaningful way.
If you want to create part-of-speech tags for a new language, you will have to create training data consisting of a large set of sentences in that language, with the correct part-of-speech tag for each word, and then train a new part-of-speech tagging model.
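As a rough sketch of that training step, assuming the standard MaxentTagger command-line interface (the file names and the feature architecture below are illustrative, not a recommended setup):

# mytagger.props -- hypothetical training configuration
model = portuguese.tagger
trainFile = format=TSV,pt-train.tsv
arch = words(-1,1),order(2),prefix(3),suffix(4),unicodeshapes(-1,1)
encoding = UTF-8

java -mx1g -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props mytagger.props

The training file is one token and its tag per line, and the resulting .tagger file is the serialized model you then load for tagging.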

How does spacy use word embeddings for Named Entity Recognition (NER)?

I'm trying to train an NER model using spaCy to identify locations, (person) names, and organisations. I'm trying to understand how spaCy recognises entities in text and I've not been able to find an answer. From this issue on Github and this example, it appears that spaCy uses a number of features present in the text such as POS tags, prefixes, suffixes, and other character and word-based features in the text to train an Averaged Perceptron.
However, nowhere in the code does it appear that spaCy uses the GloVe embeddings (although each word in the sentence/document appears to have them, if present in the GloVe corpus).
My questions are -
Are these used in the NER system now?
If I were to switch out the word vectors to a different set, should I expect performance to change in a meaningful way?
Where in the code can I find out how (if at all) spaCy is using the word vectors?
I've tried looking through the Cython code, but I'm not able to understand whether the labelling system uses word embeddings.
spaCy does use word embeddings for its NER model, which is a multilayer CNN. There's quite a nice video by Matthew Honnibal, the creator of spaCy, about how its NER works. All three English models use GloVe vectors trained on Common Crawl, but the smaller models "prune" the number of vectors by mapping similar words to the same vector.
It's quite doable to add custom vectors. There's an overview of the process in the spaCy docs, plus some example code on Github.
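As a rough illustration of the custom-vectors route (the file name and the usual GloVe-style text layout of "word val1 val2 ..." per line are assumed):

import numpy as np
import spacy

nlp = spacy.blank("en")
with open("custom_vectors.txt") as f:  # hypothetical vectors file
    for line in f:
        parts = line.rstrip().split(" ")
        word, vector = parts[0], np.asarray(parts[1:], dtype="float32")
        nlp.vocab.set_vector(word, vector)  # resizes the vector table as needed
nlp.to_disk("model_with_custom_vectors")  # reload later with spacy.load(...)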

Train NER model in NLTK with custom corpus

I have an annotated corpus in the conll2002 format, namely a tab-separated file with a token, POS tag, and IOB-prefixed entity tag. Example:
John NNP B-PERSON
I want to train a Portuguese NER model in NLTK, preferably a MaxEnt model. I do not want to use the "built-in" Stanford NER in NLTK, since I was already able to use the stand-alone Stanford NER. I want the MaxEnt model as a comparison with the Stanford NER.
I found NLTK-trainer but I wasn't able to use it.
How can I achieve this?
Chapters 6 and 7 of the nltk book explain how to train a "chunker" on an IOB-encoded corpus. The example in chapter 7 does NP chunking, but that's incidental: your chunker will chunk whatever you train it on. You'll need to decide what features are useful for named entity recognition; chapter 6 covers the basics of choosing features for a classifier. Finally, look at the source for the features used by the nltk's own named entity chunker. They'll probably do a pretty good job on Portuguese too; then you can try adding stemming or other Portuguese-specific features.
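As a rough sketch of that approach, using the Spanish conll2002 data that ships with the nltk (a Portuguese corpus in the same tab-separated format can be loaded with nltk.corpus.reader.ConllCorpusReader), and a MaxEnt classifier as requested; the feature function is a deliberately minimal placeholder:

import nltk
from nltk.corpus import conll2002  # nltk.download('conll2002') if missing

def ner_features(tokens, index, history):
    # tokens is a list of (word, pos) pairs; history holds previous IOB tags
    word, pos = tokens[index]
    return {
        "word": word.lower(),
        "pos": pos,
        "suffix3": word[-3:],
        "prev_iob": history[-1] if history else "<START>",
    }

# training data: sentences of ((word, pos), iob) pairs
train = [[((w, p), iob) for w, p, iob in sent]
         for sent in conll2002.iob_sents("esp.train")]

tagger = nltk.tag.ClassifierBasedTagger(
    train=train,
    feature_detector=ner_features,
    classifier_builder=lambda toks: nltk.MaxentClassifier.train(
        toks, trace=0, max_iter=10),
)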

Train corpus for NER with NLTK ieer or conll2000 corpus

I have been trying to train a model for Named Entity Recognition for a specific domain, with new entities. It seems there is no complete, suitable pipeline for this, and different packages need to be combined.
I would like to give NLTK a chance. My question is: how can I train the NLTK NER to classify and match new entities using the ieer corpus?
I will of course provide training data in the IOB format, like:
We PRP B-NP
saw VBD O
the DT B-NP
yellow JJ I-NP
dog NN I-NP
I guess I will have to tag the tokens by myself.
What do I do next once I have a text file in this format? What are the steps to train on my data with the ieer corpus, or with a better one such as conll2000?
I know there is some documentation out there, but it is not clear to me what to do after you have a tagged training corpus.
I want to go with NLTK because I then want to use the relextract() function.
Any advice would be appreciated.
Thanks
The nltk provides everything you need. Read the nltk book's chapter 6, on Learning to Classify Text. It gives you a worked example of classification. Then study sections 2 and 3 of chapter 7, which show you how to work with IOB text and write a chunking classifier. Although the example application is not named entity recognition, the code examples should need almost no changes to work, though of course you'll need a custom feature function to get decent performance.
You can also use the nltk's tagger (or another tagger) to add POS tags to your corpus, or you could take your chances and try to train a classifier on data without part-of-speech tags (just the IOB named entity categories). My guess is that POS tagging will improve performance, and you're actually much better off if the same POS tagger is used on the training data as for evaluation (and eventually production use).
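For that POS-tagging step, a minimal sketch (it assumes you have already parsed your file into sentences of (token, IOB-tag) pairs; the helper name is made up):

import nltk  # nltk.download('averaged_perceptron_tagger') if missing

def add_pos_tags(iob_sents):
    for sent in iob_sents:
        words = [word for word, iob in sent]
        tagged = nltk.pos_tag(words)  # nltk's default averaged-perceptron tagger
        # merge back into ((word, pos), iob) pairs ready for chunker training
        yield [((word, pos), iob)
               for (word, pos), (_, iob) in zip(tagged, sent)]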
