Using Text Analytics and NLP: An Introduction
Every business wants to get the most from its data. Unlike traditional structured data, however, much of today’s growing data volume is unstructured, especially text data such as conversations, social posts, surveys, product reviews, documents, and customer feedback.
Businesses can tap into the power of text analytics and natural language processing (NLP) to extract actionable insights from text data. Here’s how it works.
Text Analytics Basics
Text analytics (also known as text mining or text data mining) is the process of extracting information and uncovering actionable insights from unstructured text.
Text analytics allows data scientists and analysts to evaluate content to determine its relevancy to a specific topic. Researchers mine and analyze text by leveraging sophisticated software developed by computer scientists.
Example business use cases for text analytics include:
- Customer 360. Analyzing customer email, surveys, call center logs, and social media streams such as blogs, tweets, forum posts, and newsfeeds to understand customers better.
- Warranty analysis. Understanding text from dealer service professionals, warranty claims, orders, and similar sources.
- Product or service reviews. Analysis of customer reviews of products or services helps enterprises understand user sentiment or common issues customers are talking about.
- Recruitment. Keyword analysis (comparing profiles with job descriptions) helps in short-listing suitable candidates.
The Text Analytics Process
There are many ways text analytics can be implemented depending on the business needs, data types, and data sources. All share four key steps.
Step 1: Data Acquisition
Text analytics begins with collecting the text to be analyzed — defining, selecting, acquiring, and storing raw data. This data can include text documents, web pages (blogs, news, etc.), and online reviews, among other sources. Data sources can be internal or external to an organization.
Step 2: Data Preparation
Once data is acquired, the enterprise must prepare it for analysis. The data must be in the proper form to work with machine learning models that will be used for data analysis. There are four stages in data preparation:
- Text cleansing removes unnecessary or unwanted information, such as ads from web pages. The text is then restructured into a consistent form (a step also known as “text normalization”) so it can be read the same way across the system, which improves data integrity.
- Tokenization breaks a string of text into pieces called tokens (such as words, keywords, phrases, symbols, and other elements). Semantically meaningful tokens, such as words, are then used for analysis.
- Part-of-speech tagging (also referred to as “PoS tagging”) assigns a grammatical category to each identified token. Familiar grammatical categories include noun, verb, adjective, and adverb.
- Parsing creates syntactic structures from the text based on the tokens and their PoS tags. Parsing algorithms consider the text’s grammar for syntactic structuring. Sentences with the same meaning but different grammatical structures will result in different syntactic structures.
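As a rough illustration of the first two stages only, here is a minimal Python sketch that uses nothing beyond the standard library. The cleansing rules, the function names (cleanse, tokenize), and the sample review are assumptions made for this example. Part-of-speech tagging and parsing are left out because they normally rely on a trained NLP library.

```python
import re
import unicodedata


def cleanse(raw: str) -> str:
    """Strip markup remnants and normalize the text (illustrative rules only)."""
    text = re.sub(r"<[^>]+>", " ", raw)          # drop HTML-like tags
    text = unicodedata.normalize("NFKC", text)   # unify equivalent character forms
    text = text.lower()                          # case-fold so "TWO" and "two" match
    return re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace


def tokenize(text: str) -> list[str]:
    """Split cleansed text into word tokens, keeping simple contractions whole."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text)


# Hypothetical product review with leftover markup and uneven spacing.
sample = "<p>The battery died   after TWO days. Wouldn't buy again!</p>"
tokens = tokenize(cleanse(sample))
print(tokens)
```

A production pipeline would likely use a dedicated tokenizer, since real text contains hyphenated words, URLs, emoji, and languages that do not separate words with spaces.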
Step 3: Data Analysis
Data analysis is the process of analyzing the prepared text data. Machine learning models can be used to analyze huge volumes of data, and the results are typically delivered through an API as JSON or exported as a CSV or Excel file. There are many ways data can be analyzed; two popular approaches are text extraction and text tagging.
Simply stated, text extraction is the process of identifying structured information from unstructured text. Text tagging is the process of assigning tags to text data based on its content and relevance.
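The difference between the two approaches can be sketched in a few lines of Python. This is a deliberately simple, rule-based illustration rather than a machine learning model. The tag vocabulary, the ORD-1042 order ID format, and the sample message are all hypothetical.

```python
import json
import re

# Hypothetical tag vocabulary: tag name -> keywords that trigger it.
TAG_KEYWORDS = {
    "battery": {"battery", "charge", "charging"},
    "shipping": {"shipping", "delivery", "arrived"},
}


def extract_fields(text: str) -> dict:
    """Text extraction: pull structured values out of free text."""
    emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
    order_ids = re.findall(r"\bORD-\d+\b", text)  # assumed order ID format
    return {"emails": emails, "order_ids": order_ids}


def tag_text(text: str) -> list[str]:
    """Text tagging: assign every tag whose keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(tag for tag, keywords in TAG_KEYWORDS.items() if words & keywords)


message = (
    "Order ORD-1042 arrived late and the battery will not charge. "
    "Reach me at pat@example.com."
)
result = {"fields": extract_fields(message), "tags": tag_text(message)}
print(json.dumps(result, indent=2))  # JSON is one common delivery format
```

Extraction returns specific values found in the text, while tagging labels the text as a whole. A trained model would typically replace the keyword lookup when the categories are less clear-cut.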
Two common ways to represent text for tagging are “bag of words” and “Word2Vec.”
The bag-of-words method is the easiest method to understand, but it is an older technique with well-known limitations. This method simply counts how often each word occurs within the text content, regardless of location and context. The disadvantage of this technique is that it does not offer a way to understand context from words, and content that repeats a word more often is given a higher (and sometimes falsely more relevant) score.
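A minimal sketch of the counting idea, using Python’s standard collections.Counter, shows why context is lost. The helper name and the two sentences are made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count how often each word occurs, ignoring order and context."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


a = bag_of_words("The dog bit the man")
b = bag_of_words("The man bit the dog")

print(a)
print(a == b)  # True: the two sentences get identical representations
```

The two sentences describe opposite events, yet their word counts are identical, which is the limitation described above.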
Word2Vec is often preferred for this kind of work. It uses a shallow neural network, trained on a large body of text, to map each word to a vector that captures how the word is used, so words that appear in similar contexts (including many synonyms) end up with similar vectors. For example, the terms “man” and “boy” can be closely related. Spelling variants such as “humor” and “humour” also tend to receive similar vectors because they occur in similar contexts. The result is a vector space of related words: the closer two words are to each other in this space, the stronger their relationship. This allows algorithms to better understand the context of words, so data scientists can generate better analysis of content relevancy.
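The notion of closeness between word vectors is commonly measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration; they are not the output of a trained Word2Vec model, whose vectors are learned from a corpus and usually have hundreds of dimensions.

```python
import math

# Hand-made toy vectors (an assumption for this example, not trained output).
TOY_VECTORS = {
    "man":    [0.9, 0.8, 0.1],
    "boy":    [0.8, 0.9, 0.2],
    "humor":  [0.1, 0.2, 0.9],
    "humour": [0.1, 0.2, 0.9],
}


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word: str) -> str:
    """Return the other vocabulary word whose vector is closest to `word`."""
    target = TOY_VECTORS[word]
    others = (w for w in TOY_VECTORS if w != word)
    return max(others, key=lambda w: cosine_similarity(target, TOY_VECTORS[w]))


print(most_similar("man"))    # boy
print(most_similar("humor"))  # humour
```

With real embeddings the same lookup is usually provided by the library that trained or loaded the model, so a hand-written version like this is mainly useful for understanding the idea.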
Step 4: Data Visualization
Visualization is the process of transforming analysis into actionable insights, representing the data in graphs, tables, and other easy-to-understand representations. Organizations can use a wide variety of commercial and open source visualization tools.
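Even without a dedicated tool, the idea can be sketched with a plain-text bar chart. The tag counts below are made-up numbers, and the function name text_bar_chart is an assumption for this example.

```python
from collections import Counter


def text_bar_chart(counts: Counter, width: int = 30) -> list[str]:
    """Render positive counts as horizontal bars scaled to the largest value."""
    if not counts:
        return []
    peak = max(counts.values())
    label_width = max(len(label) for label in counts)
    lines = []
    for label, n in counts.most_common():  # largest count first
        bar = "#" * max(1, round(n / peak * width))
        lines.append(f"{label:<{label_width}} | {bar} {n}")
    return lines


# Made-up tag counts, for illustration only.
tag_counts = Counter({"battery": 42, "shipping": 17, "pricing": 8})
chart = text_bar_chart(tag_counts)
print("\n".join(chart))
```

In practice, the same counts would more likely be passed to a charting library or a business intelligence dashboard.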
The Role of Natural Language Processing
NLP is a component of text analytics. Most advanced text analytics platforms and products use NLP algorithms for linguistic (language-driven) analysis that helps machines read text. NLP analyzes words for relevancy, including related words that should be considered equivalent, even if they are expressed differently (e.g., humor vs. humour). It’s the workhorse behind steps 2 and 3 described above.
One popular application of NLP is identifying relevant, quality content for search engines. For example, Google uses NLP in several ways, the most prominent of which is in search engine organization and categorization.
Long ago, a webmaster could achieve a higher rank in Google search results just by stuffing keywords into web content, so Google revised how its search engine processed content using numerous algorithms and NLP. NLP helps Google identify “spammy” content and categorize it. Google may de-index this content, penalize it, or simply rank it much lower than other content.
NLP is also used in email spam filters. Spammers try their best to evade such filters by changing words around, purposely misspelling words, or using synonyms. Email spam filters use a variety of factors to identify and block spam, phishing, and malicious content. Gmail’s filter, for example, incorporates machine learning and NLP to classify incoming messages. If content is determined to likely be spam, it is sent to the user’s spam folder. For some content, Gmail deletes the message.
A decade ago, applying NLP was comparatively complicated. AI-based technologies (including NLP and text analytics) have evolved considerably, and there are now many cloud services, commercial products, and open source platforms businesses can leverage. Here are a few open source NLP tools:
- Stanford CoreNLP
- Natural Language Toolkit
- Apache Lucene and Solr
- Apache OpenNLP
- GATE and Apache UIMA
A Final Word
Text analytics isn’t new, but it is still unfamiliar to many organizations. With APIs, cloud-based AI services, and open source platforms available today, your business can leverage the power of text analytics to get a competitive edge by better understanding your customers and improving your brand’s value.
Published in TDWI UPSIDE.