How to easily extract Text from anything using spaCy
Here’s a new framework that our AI Developer just unearthed - with this framework you can now extract text in a jiffy and also do a load of other cool stuff. Read on and find out how!
Hey guys, I’d like to tell you about a super amazing NLP framework called spaCy. Most of us reach for NLTK when it comes to any NLP application, because of its simple documentation and because it is the first library most of us are exposed to when we start our NLP journey.
Luckily, I stumbled upon this framework called spaCy. And I started using it because it is faster than NLTK - and I’ll also give you fair warning that I am not here to compare spaCy with NLTK!
So, let’s try understanding spaCy’s working a bit more before going deep into it.
Follow the installation instructions and install spaCy. Make sure you do it in a virtualenv; it’s always good practice to use a virtual environment.
The following are the core features that spaCy provides.
| Tokenization | Segmenting text into words, punctuation marks etc. |
| Part-of-speech (POS) Tagging | Assigning word types to tokens, like verb or noun. |
| Dependency Parsing | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| Lemmatization | Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”. |
| Sentence Boundary Detection (SBD) | Finding and segmenting individual sentences. |
| Named Entity Recognition (NER) | Labelling named “real-world” objects, like persons, companies or locations. |
| Similarity | Comparing words, text spans and documents and how similar they are to each other. |
| Text Classification | Assigning categories or labels to a whole document, or parts of a document. |
| Rule-based Matching | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
| Training | Updating and improving a statistical model’s predictions. |
| Serialization | Saving objects to files or byte strings. |
Here, we are going to look at Rule-based Matching, which is what helps us with text/entity extraction.
One simple example to get started with,
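The snippet below is a minimal sketch of rule-based matching, not necessarily the example the post originally showed. It assumes spaCy 2.2.2 or later, where matcher.add() accepts a list of patterns as its second argument (older 2.x releases used a different argument order). The pattern name HELLO_WORLD and the sample sentence are made up for illustration. A blank English pipeline is used so that no trained model has to be downloaded.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline only tokenizes, which is all this pattern needs.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Each dict describes one token: "hello", then any punctuation, then "world".
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]

# "HELLO_WORLD" is a hypothetical name chosen for this sketch.
matcher.add("HELLO_WORLD", [pattern])

doc = nlp("Hello, world! Hello world!")

# The matcher returns (match_id, start, end) triples of token offsets.
matches = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
print(matches)
```

Only the first "Hello, world" should match here, because the second occurrence has no punctuation token between the two words.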
What else can you really do with this Matching? That was my first question too when I was trying to understand what spaCy could do!
The one thing I admire most about spaCy is its documentation and its code. Both are beautifully written, and any noob can understand them just by reading. No complicated adapters or exceptions.
P.S.: For beginners, spaCy took a big leap from 1.x to 2, and you might need to get used to new functions and changed function names. But it’s worth investing the time.
There are a few attrs that make it easier to extract text from a sentence. They help us build custom patterns that are very stable.
These are defined in spaCy’s attrs file. You can see that they are very simple and helpful attrs like LIKE_URL, LIKE_EMAIL etc., and the best part is that you can define your own flags and attrs for special cases.
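To make the attrs a little more concrete, here is a small sketch that uses LIKE_URL and LIKE_EMAIL as single-token patterns. It again assumes spaCy 2.2.2 or later for the matcher.add() call; the pattern names URL and EMAIL and the sample sentence are invented for this example. Custom flags are not shown here.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One-token patterns built on spaCy's lexical attrs.
# The names "URL" and "EMAIL" are hypothetical labels for this sketch.
matcher.add("URL", [[{"LIKE_URL": True}]])
matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]])

doc = nlp("Write to hello@example.com or visit https://spacy.io for details.")

# Collect (pattern name, matched text) pairs.
found = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
print(found)
```

Because these attrs are computed from the token text alone, the patterns work without a trained model.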
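The matcher.add() function also accepts an on_match callback function. In spaCy 2 it is passed as the second parameter. Whenever a pattern matches, the callback is called with the matcher, the doc, the index of the current match, and the list of all matches, where each match is a (match_id, start, end) triple.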
A sample of the working:
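The following is a sketch of how an on_match callback might be wired up, not the post's original sample. It assumes spaCy 2.2.2 or later, where the callback can be passed with the on_match keyword. The callback name collect_match, the pattern name SPACY_VERSION and the sentence are all hypothetical.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

collected = []

def collect_match(matcher, doc, i, matches):
    # Hypothetical callback: "matches" is the full list of triples,
    # and "i" is the index of the match that triggered this call.
    match_id, start, end = matches[i]
    collected.append((doc.vocab.strings[match_id], doc[start:end].text))

# Match the word "spacy" (case-insensitive) followed by a digit token.
pattern = [{"LOWER": "spacy"}, {"IS_DIGIT": True}]
matcher.add("SPACY_VERSION", [pattern], on_match=collect_match)

# Calling the matcher runs the callback once per match.
matcher(nlp("We moved from spaCy 1 to spaCy 2 last year."))
print(collected)
```

A callback like this is a convenient place to do something with each match as it is found, such as collecting spans or tagging them, instead of looping over the returned triples afterwards.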
I hope you now understand the basic operations that can be done using spaCy. spaCy 2 is the bleeding-edge version and it’s getting loaded with lots and lots of features that every NLP enthusiast has dreamt of - and there are even other libraries, like textacy, that have been built on top of spaCy.
Okay guys, until we meet next time, I hope you have a good time with spaCy’s magic!