Amazon Comprehend is a Natural Language Processing (NLP) service that uses machine learning to find insights and relationships in text.

Available Features

  • Keyphrase Extraction
  • Sentiment Analysis
  • Syntax Analysis
  • Entity Recognition
  • Language Detection
  • Topic Modelling

At the time of writing, only English and Spanish are supported for text analysis. For other languages, the official page suggests detecting the language, translating the text to English or Spanish (using Amazon Translate), and then using Amazon Comprehend.
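
The suggested workaround amounts to a three-step pipeline: detect, translate if needed, then analyse. The sketch below shows only that control flow, under stated assumptions: the three delegates are hypothetical placeholders for Amazon Comprehend's language detection, Amazon Translate and the Comprehend analysis of your choice. They are not the SDK's actual API, which lets the snippet run without AWS credentials.

```csharp
using System;
using System.Collections.Generic;

public static class MultilingualAnalysis
{
    // Languages supported for text analysis at the time of writing.
    private static readonly HashSet<string> Supported = new HashSet<string> { "en", "es" };

    // detectLanguage, translateToEnglish and analyze are hypothetical placeholders
    // for the real service calls; analyze receives (text, languageCode).
    public static string Analyze(
        string text,
        Func<string, string> detectLanguage,
        Func<string, string, string> translateToEnglish,
        Func<string, string, string> analyze)
    {
        string languageCode = detectLanguage(text);

        // Supported languages are analysed directly.
        if (Supported.Contains(languageCode))
        {
            return analyze(text, languageCode);
        }

        // Anything else is translated to English first.
        string translated = translateToEnglish(text, languageCode);
        return analyze(translated, "en");
    }
}

public static class Program
{
    public static void Main()
    {
        // Fake implementations, used only to exercise the control flow.
        string result = MultilingualAnalysis.Analyze(
            "Bonjour le monde",
            text => "fr",
            (text, source) => "Hello world",
            (text, language) => language + ":" + text);

        Console.WriteLine(result); // prints "en:Hello world"
    }
}
```

In a real integration, each delegate would wrap the corresponding AWS client call, and the set of supported languages would be updated as Amazon Comprehend adds more.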

Our Exploration

At DeltaX, we wanted to try a feature that finds relevant Interest Targeting Options from the textual content of Ads. Initially, we thought this required detecting topical keywords from blocks of text. The terms Keyword Extraction, Topic Modelling and tf-idf, all used in text analytics, came to mind.

Our goal was to do effective Ad Targeting based on the textual content of Ads. At this point, we weren’t exactly clear about what sort of analysis the text content needed.
To facilitate this exploration, we decided to use an easy-to-use third-party API to try to achieve what we wanted.

This approach abstracted away the complex NLP part and spared us a steeper learning curve, which would have involved designing our own model, training and validating it, deploying it, and integrating it with the existing feature we were building on. It also meant a quicker route to production, allowing us to see what the results of this experiment looked like.

We experimented with IBM Watson, Google Cloud ML, ParallelDots, Amazon Comprehend and others, using real data. All of them were very capable.

Salient Characteristics of Amazon Comprehend

  • We chose to go ahead with Amazon Comprehend. The web interface allowed for comprehensive analysis with just a click of a button. After a few tries, we realized that “Entity Recognition” was the best fit for our use case.

  • The concept of a Confidence Score: the level of confidence that Amazon Comprehend has in the accuracy of a detection. This allowed us to choose a threshold that trades off the quality of the recognised entities against their number. In contrast, Google Cloud ML returned results ordered by a salience score, which reflects their relevance to the overall text.

  • Easy integration with .NET Applications.

Example Usage with .NET

  • Install the AWSSDK.Comprehend package in your project. This also installs AWSSDK.Core.
    Install-Package AWSSDK.Comprehend -Version 3.3.2.11

  • Setting up the client

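    // Namespaces used: Amazon.Comprehend, Amazon.Runtime, System.Configuration
    private readonly AmazonComprehendClient _comprehendClient;

    // In the constructor of the containing class: read the keys from
    // app settings and create a client for the chosen region.
    var awsCredentials = new BasicAWSCredentials(ConfigurationManager.AppSettings["AWSAccessKey"], ConfigurationManager.AppSettings["AWSSecretKey"]);

    _comprehendClient = new AmazonComprehendClient(awsCredentials, Amazon.RegionEndpoint.USEast1);

  • Request & Response Handling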

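    // Namespaces used: Amazon.Comprehend, Amazon.Comprehend.Model, Newtonsoft.Json
    public string GetRecognisedEntities(string inputText)
    {
        // Build the request: the text to analyse and its language code.
        DetectEntitiesRequest detectEntitiesRequest = new DetectEntitiesRequest
        {
            LanguageCode = new LanguageCode("en"),
            Text = inputText
        };

        // Call the DetectEntities API synchronously.
        DetectEntitiesResponse detectEntitiesResponse = _comprehendClient.DetectEntities(detectEntitiesRequest);

        // Return only the recognised entities, serialised as JSON.
        return JsonConvert.SerializeObject(detectEntitiesResponse.Entities);
    }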
    
  • Sample Request and Response

    • Request
      {
        "Text": "Amazon.com, Inc. is located in Seattle, WA and was founded July 5th, 1994 by Jeff Bezos, allowing customers to buy everything from books to blenders. Seattle is north of Portland and south of Vancouver, BC. Other notable Seattle - based companies are Starbucks and Boeing.",
        "LanguageCode": "en"
      }
      
    • Response
      "Entities": [
        {
          "Score": 0.9003683924674988,
          "Type": "ORGANIZATION",
          "Text": "Amazon.com, Inc",
          "BeginOffset": 0,
          "EndOffset": 15
        },
        {
          "Score": 0.8933648467063904,
          "Type": "LOCATION",
          "Text": "Seattle, WA",
          "BeginOffset": 31,
          "EndOffset": 42
        },
        {
          "Score": 0.9979841709136963,
          "Type": "DATE",
          "Text": "July 5th, 1994",
          "BeginOffset": 59,
          "EndOffset": 73
        },
        {
          "Score": 0.9998443722724915,
          "Type": "PERSON",
          "Text": "Jeff Bezos",
          "BeginOffset": 77,
          "EndOffset": 87
        },
        {
          "Score": 0.973984956741333,
          "Type": "LOCATION",
          "Text": "Seattle",
          "BeginOffset": 150,
          "EndOffset": 157
        },
        {
          "Score": 0.9932572841644287,
          "Type": "LOCATION",
          "Text": "Portland",
          "BeginOffset": 170,
          "EndOffset": 178
        },
        {
          "Score": 0.942959725856781,
          "Type": "LOCATION",
          "Text": "Vancouver, BC",
          "BeginOffset": 192,
          "EndOffset": 205
        },
        {
          "Score": 0.990336537361145,
          "Type": "LOCATION",
          "Text": "Seattle",
          "BeginOffset": 221,
          "EndOffset": 228
        },
        {
          "Score": 0.9934123158454895,
          "Type": "ORGANIZATION",
          "Text": "Starbucks",
          "BeginOffset": 251,
          "EndOffset": 260
        },
        {
          "Score": 0.9982319474220276,
          "Type": "ORGANIZATION",
          "Text": "Boeing",
          "BeginOffset": 265,
          "EndOffset": 271
        }
      ]
      

Filtering

The entities in the response can be filtered in two ways:

  • Entity Types - Keep only the entities whose Type belongs to a selected set of Entity Types (for example, ORGANIZATION and LOCATION).
  • Confidence Score - Keep only the entities whose Score is above a chosen threshold.
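
Both filters can be combined in a few lines of LINQ. The following is a minimal sketch under stated assumptions: `DetectedEntity` is a hypothetical stand-in type that mirrors only the Type, Text and Score fields of the sample response, so the snippet runs without the AWS SDK. The allowed types and the 0.95 threshold are arbitrary values chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an entity in the DetectEntities response;
// it mirrors only the Type, Text and Score fields of the sample response.
public class DetectedEntity
{
    public string Type { get; set; }
    public string Text { get; set; }
    public double Score { get; set; }
}

public static class EntityFilter
{
    // Keeps entities whose Type is in allowedTypes and whose Score is at least minScore.
    public static List<DetectedEntity> Filter(
        IEnumerable<DetectedEntity> entities,
        ISet<string> allowedTypes,
        double minScore)
    {
        return entities
            .Where(e => allowedTypes.Contains(e.Type))
            .Where(e => e.Score >= minScore)
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // A few entities taken from the sample response.
        var entities = new List<DetectedEntity>
        {
            new DetectedEntity { Type = "ORGANIZATION", Text = "Amazon.com, Inc", Score = 0.9003683924674988 },
            new DetectedEntity { Type = "DATE", Text = "July 5th, 1994", Score = 0.9979841709136963 },
            new DetectedEntity { Type = "PERSON", Text = "Jeff Bezos", Score = 0.9998443722724915 },
            new DetectedEntity { Type = "LOCATION", Text = "Vancouver, BC", Score = 0.942959725856781 },
            new DetectedEntity { Type = "ORGANIZATION", Text = "Starbucks", Score = 0.9934123158454895 }
        };

        // Illustrative choices: organisations and locations scoring at least 0.95.
        var allowedTypes = new HashSet<string> { "ORGANIZATION", "LOCATION" };
        List<DetectedEntity> kept = EntityFilter.Filter(entities, allowedTypes, 0.95);

        foreach (DetectedEntity entity in kept)
        {
            Console.WriteLine(entity.Type + ": " + entity.Text);
        }
    }
}
```

With these values only Starbucks is printed: the date and person are dropped by type, while "Amazon.com, Inc" and "Vancouver, BC" fall below the threshold. Lowering the threshold keeps more entities at the cost of admitting less certain detections.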

References and Further Reading