Every business wants to get the most from its data, but unlike legacy, structured data types, much of today’s rapidly growing data is unstructured. That is especially true of text data, which includes conversations, social posts, surveys, product reviews, documents, and customer feedback.

Businesses can tap into the power of text analytics and natural language processing (NLP) to extract actionable insights from text data. Here’s how it works.

Text Analytics Basics

Text analytics (also known as text mining or text data mining) is the process of extracting information and uncovering actionable insights from unstructured text.

Text analytics allows data scientists and analysts to evaluate content and determine its relevance to a specific topic. Researchers mine and analyze text using sophisticated software.

Example business use cases for text analytics include:

  • Customer 360. Analyzing customer email, surveys, call center logs, and social media streams such as blogs, tweets, forum posts, and newsfeeds to understand customers better.
  • Warranty analysis. Understanding text from dealer service professionals, warranty claims, orders, and similar sources.
  • Product or service reviews. Analysis of customer reviews of products or services helps enterprises understand user sentiment or common issues customers are talking about.
  • Recruitment. Keyword analysis (comparing candidate profiles with job descriptions) helps short-list suitable candidates; a minimal matching sketch follows this list.
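
As a minimal sketch of the recruitment use case, the snippet below ranks candidate profiles against a job description by TF-IDF cosine similarity using scikit-learn. The job description and profiles are invented examples, and real screening systems are considerably more sophisticated.

    # Hypothetical keyword matching: rank candidate profiles against a job
    # description by TF-IDF cosine similarity (a common, generic approach,
    # not any specific vendor's method).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    job_description = "Data engineer with Python, SQL, and cloud ETL experience"
    profiles = [
        "Python developer, built SQL pipelines and cloud ETL jobs",
        "Graphic designer skilled in branding and illustration",
        "Data analyst with SQL reporting experience",
    ]

    # Fit one vocabulary over all documents so the vectors are comparable.
    vectorizer = TfidfVectorizer(stop_words="english")
    vectors = vectorizer.fit_transform([job_description] + profiles)

    # Similarity of each profile to the job description (row 0).
    scores = cosine_similarity(vectors[0], vectors[1:]).ravel()
    for profile, score in sorted(zip(profiles, scores), key=lambda p: -p[1]):
        print(f"{score:.2f}  {profile}")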

The Text Analytics Process

There are many ways text analytics can be implemented depending on the business needs, data types, and data sources. All share four key steps.

Step 1: Data Acquisition

Text analytics begins with collecting the text to be analyzed — defining, selecting, acquiring, and storing raw data. This data can include text documents, web pages (blogs, news, etc.), and online reviews, among other sources. Data sources can be internal or external to an organization.
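
As a minimal illustration of this step, the snippet below gathers raw text from two hypothetical sources, a local folder of review files and a public web page, using only the Python standard library. The folder path and URL are placeholders; production pipelines typically rely on crawlers, APIs, or data integration connectors.

    # Minimal data acquisition sketch: gather raw text from a local folder
    # of review files and from a web page. The path and URL are placeholders.
    import pathlib
    import urllib.request

    raw_documents = []

    # Internal source: text files exported from, say, a call center system.
    for path in pathlib.Path("data/reviews").glob("*.txt"):
        raw_documents.append(path.read_text(encoding="utf-8"))

    # External source: a public web page (raw HTML comes back; cleaning it
    # up is the job of step 2, data preparation).
    url = "https://example.com/blog/post"
    with urllib.request.urlopen(url) as response:
        raw_documents.append(response.read().decode("utf-8", errors="replace"))

    print(f"Acquired {len(raw_documents)} raw documents")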

Step 2: Data Preparation

Once data is acquired, the enterprise must prepare it for analysis. The data must be in a form that the machine learning models used for analysis can work with. There are four stages in data preparation, illustrated in the sketch that follows this list:

  • Text cleansing removes any unnecessary or unwanted information, such as ads from web pages. The text is also restructured so it can be read consistently across the system, a process known as “text normalization” that improves data integrity.
  • Tokenization breaks up a sequence of strings into pieces (such as words, keywords, phrases, symbols, and other elements) called tokens. Semantically meaningful pieces (such as words) will be used for analysis.
  • Part-of-speech tagging (also referred to as “PoS tagging”) assigns a grammatical category to each identified token. Familiar grammatical categories include noun, verb, adjective, and adverb.
  • Parsing creates syntactic structures from the text based on the tokens and PoS models. Parsing algorithms consider the text’s grammar for syntactic structuring. Sentences with the same meaning but different grammatical structures will result in different syntactic structures.
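
The sketch below is a toy walk through all four stages using NLTK, one of the open source toolkits listed later in this article. The raw HTML snippet is invented, and the chunk grammar is a deliberately tiny stand-in for a real parser.

    # A toy walk through the four preparation stages with NLTK. Resource
    # names for the one-time downloads can vary across NLTK versions.
    import html
    import re

    import nltk

    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    raw = "<p>The new phone&rsquo;s battery life is GREAT!</p><!-- ad banner -->"

    # 1. Text cleansing / normalization: decode entities, strip markup,
    #    collapse whitespace, and lowercase.
    text = html.unescape(raw)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip().lower()

    # 2. Tokenization: break the cleaned string into word tokens.
    tokens = nltk.word_tokenize(text)

    # 3. Part-of-speech tagging: label each token, e.g., ('battery', 'NN').
    tagged = nltk.pos_tag(tokens)

    # 4. Parsing: build a (deliberately simple) syntactic structure from
    #    the tagged tokens -- here, noun-phrase chunks.
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
    print(chunker.parse(tagged))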

The Role of Natural Language Processing

NLP is a component of text analytics. Most advanced text analytics platforms and products use NLP algorithms for linguistic (language-driven) analysis that helps machines read text. NLP analyzes words for relevancy, including related words that should be treated as equivalent even when they are expressed differently (e.g., humor vs. humour). It’s the workhorse behind the data preparation described above and the analysis that follows it.
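
As a toy illustration of treating variant spellings as equivalent, the sketch below maps British spellings to American ones with a small hand-built table before counting word frequencies. The mapping table is invented for the example; real systems use large lexicons or learned representations.

    # Toy normalization: treat variant spellings (e.g., humour/humor) as
    # one term before counting. The mapping table is a tiny, invented
    # example; real systems use large lexicons or learned embeddings.
    from collections import Counter

    SPELLING_MAP = {"humour": "humor", "colour": "color", "analyse": "analyze"}

    def normalize(token: str) -> str:
        return SPELLING_MAP.get(token, token)

    words = "the humour in this colour scheme has real humor".split()
    counts = Counter(normalize(w) for w in words)
    print(counts["humor"])   # 2 -- 'humour' and 'humor' counted together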

One popular application of NLP is identifying relevant, quality content for search engines. Google, for example, uses NLP in several ways, most prominently in how its search engine organizes and categorizes content.

Long ago, a webmaster could achieve a higher rank in Google search results just by stuffing keywords into web content, so Google revised how its search engine processed content using numerous algorithms and NLP. NLP helps Google identify “spammy” content and categorize it. Google may de-index this content, penalize it, or simply rank it much lower than other content.

NLP is also used in email spam filters. Spammers do their best to evade these filters by rearranging words, deliberately misspelling them, or substituting synonyms. Email spam filters use a variety of factors to identify and block spam, phishing, and malicious content. Gmail’s filter, for example, incorporates machine learning and NLP to classify message content. If a message is judged likely to be spam, it is routed to the user’s junk folder; some messages Gmail deletes outright.
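
Gmail’s actual models are proprietary, but a bare-bones text classifier of the same general kind can be built in a few lines with scikit-learn. The training messages below are invented, and production filters draw on far more signals than message text alone.

    # A bare-bones spam classifier: bag-of-words features + naive Bayes.
    # The training data is invented; this is a sketch, not Gmail's method.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    messages = [
        "WIN a FREE prize!!! click now",
        "cheap pills, limited offer, act fast",
        "meeting moved to 3pm, agenda attached",
        "can you review the Q3 report before Friday?",
    ]
    labels = ["spam", "spam", "ham", "ham"]

    # Vectorize the text and fit the classifier in one pipeline.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(messages, labels)

    print(model.predict(["click here for a free offer"]))   # likely ['spam']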

A decade ago, applying NLP was comparatively complicated. AI-based technologies (including NLP and text analytics) have evolved considerably since then, and there are many cloud services, commercial products, and open source platforms businesses can leverage. Here are a few open source NLP toolkits (a short example using one of them follows the list):

  • Stanford CoreNLP
  • Natural Language Toolkit
  • Apache Lucene and Solr
  • Apache OpenNLP
  • GATE and Apache UIMA
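
To give a feel for how accessible these toolkits are, the snippet below pulls named entities out of a sentence with NLTK in a handful of lines. The sentence is an invented example, and resource names for the one-time model downloads can vary across NLTK versions.

    # Named-entity extraction with NLTK in a few lines.
    import nltk

    # One-time model downloads; names may vary slightly by NLTK version.
    for resource in ("punkt", "averaged_perceptron_tagger",
                     "maxent_ne_chunker", "words"):
        nltk.download(resource)

    sentence = "Google revised its search algorithms at its Mountain View campus."
    tokens = nltk.word_tokenize(sentence)
    tree = nltk.ne_chunk(nltk.pos_tag(tokens))

    # Print the labeled entity subtrees, e.g., (GPE Google).
    for subtree in tree.subtrees(lambda t: t.label() != "S"):
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))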

A Final Word

Text analytics isn’t new, but it is still unfamiliar to many organizations. With APIs, cloud-based AI services, and open source platforms available today, your business can leverage the power of text analytics to get a competitive edge by better understanding your customers and improving your brand’s value.

Published in TDWI UPSIDE.
