I’ve always adored literature as much as spreadsheets, so it makes sense that I started wondering about natural language soon after I started at DBC. Regretfully, I haven’t made much progress beyond wondering, but I’m slated to give a brief ‘lightning talk’ on something tomorrow, so I figured now is the time to summarize what I’ve gathered so far on this topic.

What is natural language processing (NLP)?

NLP is a field of computer science concerned with human language and how computers can interact with it. It ranges from relatively simple things, like describing human-generated text in terms of frequency distributions, to very complex things, like extracting meaning from texts or generating human-like language.

Incidentally, Google Trends suggests “natural language” is actually a less popular search term now than it was in 2005; I wonder if the field has branched out too far for the general term to be used often.

What tools are easily accessible to us (i.e. people who recently started programming, primarily in Ruby) for processing natural language?

Ruby Treat

I figure I should mention this first since it’s a Ruby gem. I haven’t tried it yet, but it seems to have basic functions similar to Python’s NLTK: Treat does things like tokenizing, stemming, and parsing groups of words into syntactic trees (more detail on that later).


AlchemyAPI – a company that provides text-analysis services. A few groups have used it for final projects, since it does some high-level language processing for you instead of you having to write your own algorithms (which I imagine would be crazy in the context of a week-long project). They have a nice “getting started” guide for developers with examples of what they can do, including:

  • Entity extraction, keyword extraction – finding the subjects of sentences or larger pieces of text
  • Relation extraction – within sentences, isolating subject, action, object
  • Sentiment analysis – providing a numerical score for whether the context around specific words is positive or negative
  • Language detection
  • Taxonomy – grouping articles into topics like politics, gardening, education, etc.

Semantria – seems comparable to Alchemy in that it also has an API that lets developers request sentiment analysis for pieces of text; from a glance, its marketing seems more directed towards Twitter/social media.


 

Python’s NLTK is a well-known library for natural language processing, and Python is relatively similar to Ruby as a programming language. The NLTK introductory book is easy to read and simultaneously provides an introduction to Python. The basic concepts are easy to understand, but they quickly develop into sophisticated problems that remain open in academic research. Some important concepts and vocabulary words are below, in the order the book mentions them, which runs from tasks that are basic and doable with simple built-in methods to concepts that require writing functions and large data sets to produce meaningful results.

  • Tokenizing – splitting text into useful groups of characters. Often these are words, but I think it’s interesting how a word like “didn’t” could be tokenized into “did” and “n’t”
  • Frequency distributions are used often – frequencies of words, phrases, parts of speech, verb tenses – these are all ways that different types of texts can be categorized
  • Corpora – large bodies of text data that may have some structure to make processing easier. The Brown Corpus is a famous one, compiled in the 1960s, that includes texts from a variety of sources (religion, humor, news, hobbies, etc.), and there are many others – e.g. web chat logs and texts in other languages
  • Other resources include things like dictionaries and pronunciation guides; WordNet is a “concept hierarchy” that groups words – e.g. frog and toad both descend from amphibian
  • Stemming and lemmas – stemming a word like “running” would result in its basic form/lemma, “run”
  • Word segmentation – how to split up tokens when boundaries are not clear, e.g. with spoken language or languages where written text does not have grouping boundaries
  • Tagging – parts of speech are often used to categorize words, with more POS categories than we normally consider in English
  • N-gram tagging – deciding on tags using context, e.g. when considering the probable tag for word #5, consider the tags of words #1–4
  • Classifying texts – this is a big subject with a lot to consider – depending on what you want to classify, what features can a computer program isolate? How do you judge accuracy? “Entropy” and information gain – how much more accurately can we classify texts with the addition of a new feature?
  • Naive Bayes classifiers – classify text by examining individual features and moving closer to or farther from each potential classification with each piece of information; “naive” refers to treating all features as independent
  • Chunking – segments sentences into groups of multiple tokens, e.g. grabbing a noun phrase like “the first item on the news.” Chunking tools are generally built on a corpus with a large section of training data, where text has already been grouped into the right chunks; the patterns of chunking in the training text inform the tool’s categorizing going forward.
  • Processing grammar and ways to translate written information into forms that computers can easily process for querying (this gets into the realm of IBM Watson)
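To make the first few of these concrete, here’s a toy sketch in plain Python (no NLTK required) of tokenizing, a throwaway suffix-stripping stemmer, and a frequency distribution via `collections.Counter`. The real NLTK tools (`word_tokenize`, `PorterStemmer`, `FreqDist`) are far smarter, so treat this purely as an illustration of the concepts:

```python
import re
from collections import Counter

def tokenize(text):
    """Toy tokenizer: lowercase words, keeping apostrophes so
    "didn't" stays one token (NLTK would split it into "did"/"n't")."""
    return re.findall(r"[a-z']+", text.lower())

def stem(word):
    """Throwaway suffix-stripper; a real stemmer (e.g. Porter)
    handles far more cases than this."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            if len(word) > 2 and word[-1] == word[-2]:
                word = word[:-1]  # "runn" -> "run"
            return word
    return word

text = "Running runners run. The runner was running and ran."
freq = Counter(stem(t) for t in tokenize(text))
print(freq.most_common(2))  # [('run', 3), ('runner', 2)]
```

Note that the toy stemmer misses “ran” and leaves “runner” alone, which is exactly why real stemmers and lemmatizers are built from large rule sets and dictionaries rather than three suffixes.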

What are some potentially fun beginner projects to do with natural language processing?

So I haven’t done any of these yet; up until last week I was still struggling just to get Python and NLTK running on Ubuntu and to download corpora. However, here are a few things that I think might be fun and not too difficult to make, some of which I’ve discussed before…

  • What author are you? Take a sample of your writing and compare it against books available from the Gutenberg corpus
  • Portmanteau-ifier – find a dictionary of root words and supply suggestions of good portmanteaus when given 2+ words
  • Spam vs. not spam email, mean vs. not mean comments
  • Rhyming poetry generation
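The spam-vs.-not-spam idea is basically the naive Bayes classifier from the NLTK list above, and a bare-bones version fits in a page of plain Python. The four training lines here are made up, so this is a sketch of the mechanics rather than a working filter (NLTK’s `NaiveBayesClassifier` does this properly, on real features):

```python
import math
from collections import Counter, defaultdict

# Four made-up training examples; a real filter needs thousands.
training = [
    ("win money now",        "spam"),
    ("free money offer",     "spam"),
    ("meeting at noon",      "ham"),
    ("lunch at noon please", "ham"),
]

word_counts = defaultdict(Counter)  # word frequencies per label
label_counts = Counter()
vocab = set()

for text, label in training:
    label_counts[label] += 1
    for word in text.split():
        word_counts[label][word] += 1
        vocab.add(word)

def classify(text):
    """Score each label by summing log-probabilities of its words:
    every word independently nudges us toward or away from a label
    ("naive" = pretend the words don't interact)."""
    scores = {}
    total_docs = sum(label_counts.values())
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word in text.split():
            # Laplace smoothing: an unseen word can't zero out a label
            seen = word_counts[label][word]
            score += math.log((seen + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("free money"))    # -> spam
print(classify("noon meeting"))  # -> ham
```

The mean-vs.-not-mean comment classifier would be the same machinery with a different training corpus, which is most of the point: the algorithm is simple, and the labeled data is the hard part.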

Well, it’s been a long week here at DBC. We learned Rails over the weekend (not to mention I put up this blog – still proud of that!) and started a 5-day group project on Wednesday. My group is working on a roommate expense-sharing application modeled on my spreadsheet from back when I tracked expenses for the townhouse on the Upper West Side. It’s been wonderful working on something that could have real applications rather than toy projects – at some point I will write about how I feel about creating games while learning programming.

The roommate application fits under the umbrella of “things that could be good for the world” because I believe increased communal living among adults and nuclear families could be a wonderful thing for western society. Many people suffer from loneliness that partially results from not having a nuclear family or from being isolated to only their nuclear family on a daily basis, which could be alleviated if people belonged to larger, loosely affiliated groups that share spaces and responsibilities. My dream is to someday convince a bunch of my friends to take over a group of adjacent residences and raise children together as a group, cook and eat dinners as a group, etc. It could be highly efficient and beneficial to overall mental health.

I’ve been putting off giving a lightning talk on some sort of technical topic, which is required of us this week or next. I dislike the idea of looking into something purely to explain it to a group, so I’m sort of hoping that I naturally find the inclination to look into something this weekend. The things I’ve been thinking about throughout this program – natural language processing, statistics, image processing – perhaps one of those.

Other things on my mind – Chicago is a little warmer this week, I had a beautiful moment of clarity in a sit-spin attempt this morning (otherwise very wobbly on the ice), and I’m very hungry. I’m wondering how I’m perceived by my peers here, and whether we will keep in touch after the program.

Literally! I missed my stop on the #80 bus and found myself facing a wrought iron fence that blocked the alley that eventually opened into my street.

Also figuratively! I finally (finally!) set up an acceptable wordpress theme for this personal site and experienced a number of wordpress revelations heavily assisted by Ryan Bahniuk.

I meant to post a wordy update two months ago as I was moving to Chicago. I never wrote this update, so the bulleted Q&A version is below:

  • Why did you move?
    • I decided to quit my job in equity research and partake in DevBootcamp Chicago, a 9-week program that teaches people to be web developers.
  • Why did you decide to quit your job?
    • I think most of you I’ve talked to in person recently (i.e. within the last year) know that I genuinely liked my job and particularly liked the people there. And more of you probably know or could infer that I LOVE excel spreadsheets and quantifying things in general. But ultimately I feel that large-cap equity investing lacks purpose on an individual level (even though it is highly meaningful at the global scale), and the pace of the work wasn’t something I felt keen to sustain.
  • Why this web developer thing?
    • I feel the need to clarify that I’m not a closet computer geek, nor do I have ambitions to found the next overvalued tech IPO. I like that programming has become a cheap platform for normal people to create useful things, and I like working with logic. I also like learning new things, and paying $12,000 to quickly learn the basic skills to launch an entirely new career sounded like a remarkably efficient use of time and money.
  • How is Chicago vs. New York?
    • Well, I should caveat this statement since it’s only September and Chicago has had wonderful weather the last two months. But, so far, Chicago beats New York soundly. I like New York and the experiences it has to offer, but the experiences that I value most can be found in basically any major city, although here they cost less and are less crowded. The one exception is keeping up figure skating – the skating culture is much bigger here, and with regional competitions coming up, the rink is now regularly filled with young counterclockwise skaters who are much better than me.
  • How is the program going?
    • Not bad! It’s a decent amount of work, but I’m pretty sure I ironically spend less time staring at a computer screen than when I was working. I think I had a bit of a head start compared to the average person who starts here, due to my engineering major and general math-iness, so it hasn’t been as stressful as I had been warned. The people here are wonderful (I would have expected nothing less, since they are mostly Midwestern), and I’ve been extremely impressed with how much work everyone is putting in and how much we’ve already learned in 6 weeks. The program has a strong and positive culture; I chose to come here partially because I wanted to attend yoga classes and learn about “engineering empathy” while also learning about programming, and I’m happy to report those expectations have been met and exceeded!
  • What will you do when you’re done in early October?
    • Not sure yet. My lease ends Jan 31, but I may still try to find a job in Austin or Portland (i.e. somewhere temperate) before winter. I plan to keep figure skating and to pick up learning piano/voice/other music stuff again. I plan to work on some programming-related projects, and probably visit the east coast sometime in November.

Well, that was much wordier than expected. I hope all of you east coast people (and anyone else) are doing well and haven’t forgotten about me!