Open Source Entity Recognition for Indian Languages (NER)
One of the key components of most successful NLP applications is the Named Entity Recognition (NER) module, which identifies entities in text such as dates, times, locations, quantities, names and product specifications. Sophisticated NER systems such as spaCy and Stanford NER already exist, but most of them are general-purpose tools built for a wide range of NLP applications such as information retrieval, document classification and other kinds of unstructured data analysis. At Haptik, we focus on continuously improving the NLP capabilities of our conversational AI platform, which powers more than a few million exchanges on a daily basis. These conversations are spread across hundreds of enterprise bots built for different use cases such as customer support and e-commerce. Hence, building an accurate and reliable NER system tailored for conversational AI has always been one of the key focus areas of the engineering team at Haptik.
Around 3 years ago we open-sourced one of our key frameworks, Chatbot NER, which is custom-built to support entity recognition in text messages. You can read more about Chatbot NER. After thorough research on existing NER systems, we felt a strong need for a framework that supports entity recognition for Indian languages. This led us to upgrade our own NER module, Chatbot NER, to version 2 and extend its functionality to local languages. The primary focus of this blog is to help you get started with the basic capabilities of Chatbot NER for English and 5 Indian languages, along with their code-mixed forms.
In version 1, we provided support for the following entity types:
- Numeral: Entities that deal with numerals or numbers, such as temperature, budget, size and quantity.
- Pattern: Entities detected using patterns or regular expressions, such as email, phone_number and PNR.
- Temporal: Entities for detecting time and date.
- Textual: Entities detected by looking up a dictionary or analysing sentence structure. This mainly covers entities like cuisine, dish, restaurant, city and location.
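To make the difference between pattern and textual entities concrete, here is a minimal Python sketch. It is not Chatbot NER's implementation: the regular expressions, the `CITY_DICTIONARY` set and the helper names are all hypothetical and chosen only for illustration.

```python
import re

# Illustrative patterns only; these are NOT the expressions Chatbot NER ships with.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{10}\b")

# Hypothetical dictionary backing a textual "city" entity.
CITY_DICTIONARY = {"mumbai", "delhi", "pune"}


def detect_pattern(text, pattern):
    """Return every substring of text matched by the compiled pattern."""
    return pattern.findall(text)


def detect_textual(text, dictionary):
    """Return the dictionary entries that appear as whole words in text."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token in dictionary]


message = "Book a table in Mumbai and mail me at asha@example.com or call 9876543210"
emails = detect_pattern(message, EMAIL_RE)
phones = detect_pattern(message, PHONE_RE)
cities = detect_textual(message, CITY_DICTIONARY)
```

A pattern entity needs nothing beyond the expression itself, which is why it carries over to any language unchanged, whereas a textual entity depends on a curated dictionary for each language.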
In version 2, we have extended support for all of the above entity types (except pattern entities, which are language-independent) to the following five Indian languages:
The above languages were selected based on the availability of linguistic experts who helped us curate the training data needed to scale entities.
We have a Docker-based setup for the Chatbot NER module, which can be installed on your system in less than 5 minutes by following the installation steps given.
Below are Django shell examples of Hindi date detection in Devanagari and Latin scripts:
1. Detection in Devanagari script:
2. Detection in Latin script:
You can follow the structure given below to make a curl request for the above example:
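As a rough Python equivalent of such a request, the sketch below only assembles the request URL and does not send it. The host and port, the `/v2/date/` path, the parameter names (`message`, `entity_name`, `source_language`, `timezone`) and the sample Hindi messages are assumptions made for illustration; please check them against the API documentation in the repository before relying on them.

```python
from urllib.parse import urlencode

# Assumed address of a local Docker deployment and assumed endpoint path.
BASE_URL = "http://localhost:8081/v2/date/"


def build_date_request(message, source_language="hi", timezone="UTC"):
    """Build a GET URL for date detection (parameter names are assumptions)."""
    params = {
        "message": message,
        "entity_name": "date",
        "source_language": source_language,
        "timezone": timezone,
    }
    # urlencode percent-encodes non-ASCII text (such as Devanagari) as UTF-8.
    return BASE_URL + "?" + urlencode(params)


# Hypothetical sample messages meaning "I need a ticket for tomorrow".
devanagari_url = build_date_request("मुझे कल की टिकट चाहिए")
latin_url = build_date_request("mujhe kal ki ticket chahiye")
```

Encoding the message through `urlencode` matters for the Devanagari case, because raw non-ASCII characters in a query string are not guaranteed to survive transport.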
Our team is actively working on this, and we will extend support to more Indian languages within the next few months, as mentioned in our repository milestone. We also plan to add batch processing, more optimisations and better models going forward.
We hope that Chatbot NER helps you add more utilities to your bot and eases the process of detecting entities. Do share your feedback so that we can improve it. Also, if you wish to understand the details of the framework and contribute either data or code to it, please refer to our contribution guidelines. We hope that this repository becomes a powerful resource and contributes to the research and engineering community.
For any questions or support related to this repository, do leave your comments below.
Also, don’t forget, we are actively hiring researchers and engineers. So, if you are interested, check out our careers page.