I’ve always adored literature as much as spreadsheets, so it makes sense that I started wondering about natural language soon after I started at DBC. Regretfully, I haven’t made much progress beyond wondering, but I’m slated to give a brief ‘lightning talk’ on something tomorrow, so I figured now is the time to summarize what I’ve gathered so far on this topic.
What is natural language processing (NLP)?
NLP is a field of computer science that considers human language and how computers can interact with it. It ranges from relatively simple tasks, like describing human-generated text in terms of frequency distributions, to very complex ones, like extracting meaning from texts or generating human-like language.
Incidentally, Google Trends suggests “natural language” is actually a less popular search topic now than it was in 2005 – I wonder if the field has branched out too far for the general term to be used often.
What tools are easily accessible to us (i.e. people who recently started programming, primarily in Ruby) for processing natural language?
Treat – I figure I should mention this first since it’s a Ruby gem. I haven’t tried it yet, but it seems to have basic functions similar to Python’s NLTK: Treat does things like tokenizing, stemming, and parsing groups of words into syntactic trees (more detail on that later).
AlchemyAPI – a company that provides text-analysis services. A few groups have used it for final projects, since it does some high-level language processing for you instead of you having to write your own algorithms (which could be crazy to attempt in a week-long project). They have a nice “getting started” guide for developers with examples of what the service can do, including:
- Entity extraction, keyword extraction – finding the subjects of sentences or larger pieces of text
- Relation extraction – within sentences, isolating subject, action, object
- Sentiment analysis – providing a numerical score for whether the context around specific words is positive or negative
- Language detection
- Taxonomy – grouping articles into topics like politics, gardening, education, etc.
Semantria – seems comparable to Alchemy in that they also have an API that lets developers request sentiment analysis for pieces of text; from a glance, their marketing seems more directed toward Twitter/social media.
Python’s NLTK is a well-known library for natural language processing, and Python is relatively similar to Ruby as a programming language. The NLTK introductory book is easy to read and simultaneously provides an introduction to Python. The basic concepts are easy to understand, but they quickly develop into sophisticated problems that remain open in academic research. Some important concepts/vocabulary words are below, in the order the book introduces them – progressing from tasks that are doable with simple built-in methods to concepts that require writing functions and large data sets to produce meaningful results.
- Tokenizing – splitting text into character groups that are useful. Often these are words, but I think it’s interesting how a word like “didn’t” could be tokenized into “did” and “n’t”
- Frequency distributions are often used – frequencies of words, phrases, parts of speech, and verb tenses are all ways that different types of texts can be characterized
- Corpora – large bodies of text data that may have some structure to make processing easier. The Brown Corpus is a famous one, compiled in the 1960s, that includes texts from a variety of sources (religion, humor, news, hobbies, etc.), and there are many others – e.g. web chat logs and corpora in other languages
- Other resources include things like dictionaries and pronunciation guides; WordNet is a “concept hierarchy” that groups words – e.g. frog and toad both descend from amphibian
- Stemming and lemmas – stemming reduces a word like “running” to a base form; the dictionary form, “run”, is called its lemma
- Word segmentation – how to split text into tokens when boundaries are not clear, e.g. with spoken language or with written languages that don’t mark word boundaries
- Tagging – labeling words with parts of speech to categorize them, using more POS categories than we normally consider in English
- N-gram tagging – deciding on tags using context, e.g. when choosing the most probable tag for word #5, consider the tags of words #1–4
- Classifying texts – a big subject with a lot to consider: depending on what you want to classify, what features can a computer program isolate? How do you judge accuracy? “Entropy” and information gain measure how much more accurately we can classify texts with the addition of a new feature
- Naive Bayes classifiers – classify text based on individual features, moving closer to or farther from each potential classification with each piece of information; “naive” refers to treating all features as independent
- Chunking – segments sentences into groups of multiple tokens, e.g. grabbing a noun phrase like “the first item on the news.” Chunking tools are generally built on a corpus with a large section of training data, where text has already been grouped into the right chunks; the patterns of chunking in the training text inform the tool’s categorizing going forward
- Processing grammar, and ways to translate written information into forms that computers can easily query (this gets into the realm of IBM Watson)
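A few of the early concepts above – tokenizing, stemming, and frequency distributions – can be sketched in plain Python without NLTK or any corpus downloads. This is a toy illustration of the ideas only: the regex tokenizer and suffix-stripping stemmer are my own crude stand-ins, not NLTK’s implementations.

```python
import re
from collections import Counter

def tokenize(text):
    # Crude word tokenizer: lowercase, then grab runs of letters/apostrophes.
    # (Real tokenizers handle cases like splitting "didn't" into "did" + "n't".)
    return re.findall(r"[a-z']+", text.lower())

def stem(word):
    # Toy stemmer: strip a few common suffixes.
    # (NLTK ships real ones, e.g. the Porter stemmer.)
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

text = "The dogs were running and the dog kept running along"
tokens = tokenize(text)
freq = Counter(stem(t) for t in tokens)  # frequency distribution over stems
print(freq.most_common(3))
```

Note how crude the stemmer is: “running” becomes “runn”, not “run” – getting the lemma right is exactly where the hand-rolled approach stops being enough.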
What are some potentially fun beginner projects to do with natural language processing?
So I haven’t done any of these yet; up until last week I was still struggling just to get Python and NLTK running on Ubuntu and to download corpora. However, here are a few things that I think might be fun and not too difficult to make, some of which I’ve discussed before…
- What author are you? Take a sample of your writing and compare it against books available from the Gutenberg corpus
- Portmanteau-ifier – find a dictionary of root words and suggest good portmanteaus when given 2+ words
- Spam vs. not spam email, mean vs. not mean comments
- Rhyming poetry generation
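The spam-vs-not-spam idea maps directly onto the naive Bayes classifier mentioned earlier. Here’s a minimal from-scratch sketch – the four training “emails” are invented toy data, and the add-one smoothing constant is my choice; NLTK’s NaiveBayesClassifier would do the same job with less code:

```python
import math
from collections import Counter

# Tiny invented training set: (text, label)
train = [
    ("win cash prize now", "spam"),
    ("free cash offer", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

# Count word frequencies per label, and how often each label occurs.
word_counts = {"spam": Counter(), "ham": Counter()}
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    # Score each label with log P(label) + sum of log P(word | label),
    # treating each word as an independent feature ("naive"), with
    # add-one (Laplace) smoothing so unseen words don't zero a score out.
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("free prize cash"))     # -> spam
print(classify("notes for tomorrow"))  # -> ham
```

The mean-vs-not-mean comment classifier is the same shape – only the labels and training data change.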