Confirmed Sessions at NLP Day Texas 2017
We are just now beginning to announce the confirmed sessions. Check this page regularly for updates.
Chatbots from first principles
Jonathan Mugan - Deep Grammar
There are lots of frameworks for building chatbots, but those abstractions can obscure understanding and hinder application development. In this talk, we will cover building chatbots from the ground up in Python. This can be done with either classic NLP or deep learning. We will cover both approaches, but this talk will focus on how one can build a chatbot using spaCy, pattern matching, and context-free grammars.
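As a rough illustration of the pattern-matching approach mentioned in this abstract, the sketch below builds a tiny rule-based responder using only Python's standard `re` module rather than spaCy. The rules, the replies, and the `respond` function are hypothetical and are not taken from the talk.

```python
import re

# Hypothetical rules: each pairs a regular expression with a reply template.
RULES = [
    (re.compile(r"\bmy name is (\w+)", re.I), "Nice to meet you, {0}."),
    (re.compile(r"\bweather in ([a-z ]+)", re.I), "I cannot check the weather in {0} yet."),
    (re.compile(r"\b(?:hi|hello|hey)\b", re.I), "Hello! How can I help?"),
]

FALLBACK = "Sorry, I did not understand that."


def respond(utterance):
    """Return the reply for the first rule whose pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # Captured groups (if any) fill the numbered template slots.
            return template.format(*match.groups())
    return FALLBACK


print(respond("Hello there"))
print(respond("My name is Ada"))
print(respond("What is the weather in Austin?"))
```

Rules are tried in order, so more specific patterns are listed before the generic greeting.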
An Introduction to Topic Modeling
Jacob Su Wang - Ojo Labs / University of Texas
Topic models (TMs) are a class of generative models that aim to capture the latent distributions underpinning observed data (e.g. the word distribution in a collection of documents). In the context of Natural Language Processing (NLP), TMs are particularly useful for discovering the statistical regularities hidden in textual data in supervised, semi-supervised, and unsupervised settings.
In this talk I address the following questions:
- What is a TM?
- What are its potential applications (in particular in NLP)?
- How do you formulate and learn a TM (to suit your particular purposes)?
- What are the common inference algorithms for TM, and which one should you use?
I focus on building a deep understanding of TMs, so that attendees can make informed decisions when constructing TMs to fit their specific projects.
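The abstract does not name a specific topic model. As one common instance, and purely as an assumption for illustration, the generative process of Latent Dirichlet Allocation (LDA) can be written as follows, with K topics, D documents, topic-word distributions \phi_k, document-topic distributions \theta_d, and hyperparameters \alpha and \beta:

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Here z_{d,n} is the latent topic assigned to the n-th word of document d and w_{d,n} is the observed word. Inference algorithms estimate the latent \theta, \phi, and z from the observed words.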
Textual Analysis and High Finance: What Can We Learn From the Writing of Financial Contracts
Malcolm Wardlaw - UT Dallas
Text analysis and natural language processing have recently seen rapid growth in their application to financial economics, both in industry and academia. This talk will focus on how these tools are being used in academic research, particularly new research on financial contracting. The talk will first provide a brief overview of the research at large, and then turn to its specific application to loan contracting, partially described in Dr. Wardlaw's recent paper (https://ssrn.com/abstract=2952567) with Bernhard Ganglmair. Particular attention will be paid to the implementation of topic models in this setting, along with a high-level technical overview of the challenges faced when trying to analyze contract documents in a systematic way.
Multichannel Event Detection in Twitter
Dr. Joseph A. Gartner III - Sotera Defense Solutions
One survey of event detection techniques in Twitter divides them into two broad categories, document-pivot and feature-pivot techniques, which differ in whether they focus on document features or temporal features in social streams. I propose to begin with a brief discussion of what makes tweet data difficult to work with compared to traditional text, and of the process of cleaning the data. I will then briefly touch upon some of the background work in event detection, such as burst detection of words and document clustering. From there, I will focus on my own work, a novel approach to multi-analytic document grouping that identifies both burst feature representation in hashtags and general text alignment using word2vec document projections. I will then discuss how these local topic groupings can be combined to form larger events. The talk will conclude with a presentation of some results that are available as open-source software.
Intended Audience: The broad task of event detection has applications in a wide range of fields, from marketing to law enforcement. This talk aims to contrast the strengths and weaknesses of a few known techniques, as well as highlight open-source tools available for event detection.
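To make the idea of burst detection concrete, here is a minimal sketch that flags time windows in which a hashtag's count jumps well above its own history. It is a generic baseline and not the speaker's method; the `find_bursts` function, its threshold, and the sample counts are all hypothetical.

```python
from statistics import mean, pstdev


def find_bursts(counts, threshold=2.0, min_history=3):
    """Return indices of windows whose count exceeds the mean of all
    preceding windows by more than `threshold` standard deviations."""
    bursts = []
    for i in range(min_history, len(counts)):
        history = counts[:i]
        mu = mean(history)
        # Floor sigma at 1.0 so a flat history does not flag tiny changes.
        sigma = max(pstdev(history), 1.0)
        if counts[i] > mu + threshold * sigma:
            bursts.append(i)
    return bursts


# Hypothetical hourly counts for a single hashtag.
hourly_counts = [3, 4, 2, 5, 3, 40, 4, 3]
print(find_bursts(hourly_counts))
```

Only the spike at index 5 is flagged in this sample; the windows after it are compared against a history that already contains the spike, so they are not.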
Artificial Intelligence as a Tool (What makes or breaks natural language processing products)
Tad Turpen - NarrativeDX
Since the advent of artificial intelligence, inventors and scientists alike have been trying to monetize their efforts. Some have succeeded and some have failed. In this talk I will outline what makes or breaks natural language processing products, as an example of artificial intelligence: namely, whether or not the products are treated as a tool to get work done. I will go through several historical examples of natural language processing companies that fell into the trap of becoming one-off consultancies, and show how to engineer a single product that generates recurring revenue. It is also important to consider artificial intelligence as a tool to get work done: consider what would have happened if the car had remained a toy for the rich. I will conclude that artificial intelligence can broadly fill the role of providing consulting services, but that there is also a place for a dedicated product that generates recurring revenue if engineered correctly.
Sockpuppets, Secessionists, and Breitbart
Jonathon Morgan - New Knowledge
This presentation is based on Jonathon's recent paper, Sockpuppets, Secessionists, and Breitbart, published on Medium.
From the paper: New evidence points to a highly orchestrated, large-scale influence campaign that infiltrated Twitter, Facebook, and the comments section of Breitbart during the run up to the 2016 election. Tens of thousands of bots and hundreds of human-operated, fake accounts acted in concert to push a pro-Trump, nativist agenda across all three platforms in the spring of 2016. Many of these accounts have since been refocused to support US secessionist movements and far-right candidates in upcoming European elections, all of which have strong ties to Moscow and suggest a coordinated Russian campaign. Jonathon will walk through the methodology used to uncover and detail this campaign.
Words as Vectors - Introduction to Word Embedding
Erik Skiles - SparkCognition
Word embeddings have made a significant impact in NLP over the last few years. The goal of this workshop is to provide attendees with the understanding and the tools needed to create word embeddings and use them in various downstream NLP tasks such as classification. The workshop begins by establishing the core concept of representing words as vectors. We will then take a deeper dive into what information word embeddings learn. This will be followed by a survey of methods for creating word embeddings and some tips on selecting an algorithm or pre-built word embeddings. (Note that I don’t want to get bogged down in the details of any one implementation for learning word embeddings.) Finally, we will explore a few NLP tasks that can benefit from word embeddings.
Install word2vec, sklearn, numpy
$ pip install word2vec scikit-learn numpy
Data for word embedding
Data for document classification - Reuters R8 train and test data. The test data should be concatenated into a single file. We will use sklearn’s cross-fold validation.
I will provide a link to code samples used during the code demonstrations.
Introduction / Overview
Getting comfortable with word vectors - Sparse vectors are commonly used to represent collections of words such as tweets, articles, and other types of documents. In this model, each word can be represented as a sparse (one-hot) vector that indicates whether or not that word is present. These vectors can be aggregated in various ways to produce the sparse document representations we are familiar with.
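The following sketch shows one way to read that description: each token becomes a one-hot vector over the vocabulary, and summing the one-hot vectors of a document gives its bag-of-words counts. The helper names and sample documents are hypothetical, and whitespace tokenization is assumed for simplicity.

```python
def build_vocab(docs):
    """Assign each distinct token an index in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot(token, vocab):
    """Sparse indicator vector: 1 at the token's index, 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


def bag_of_words(doc, vocab):
    """Aggregate the one-hot vectors of a document by summing them."""
    total = [0] * len(vocab)
    for token in doc.lower().split():
        for i, value in enumerate(one_hot(token, vocab)):
            total[i] += value
    return total


docs = ["the cat sat", "the dog sat on the cat"]
vocab = build_vocab(docs)
print(vocab)
print(one_hot("dog", vocab))
print(bag_of_words(docs[1], vocab))
```

Real implementations store only the non-zero entries; dense Python lists are used here to keep the structure visible.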
Distributional Semantics - The idea that words with similar meaning have similar contexts is well known and a number of techniques have been developed to find semantic similarity, often using co-occurrence counts. Word Embeddings attempt to capture this semantic information in a dense vector representation for each word in a corpus without the need for any supervision.
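As a small illustration of the co-occurrence counts mentioned here, the sketch below counts how often each pair of words appears within a fixed window of each other. The function name, window size, and sentences are hypothetical choices for the example.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count (word, context) pairs that occur within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


sentences = [
    "paris is the capital of france",
    "berlin is the capital of germany",
]
counts = cooccurrence_counts(sentences)
print(counts[("capital", "the")])
print(counts[("paris", "capital")])
```

Words that share many context words, such as "paris" and "berlin" here, end up with similar rows in the resulting count table.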
Objective Function - A number of techniques have been proposed for creating word embeddings. Fundamentally, each of them tries to capture the relationships between words as vectors in n-dimensional space. In the example of countries and capitals, the names of a country and its capital generally appear within similar contexts. Those contexts often contain the word “capital”. These objective functions attempt to simultaneously learn a vector for each capital and country while also learning a vector for the term “capital” that relates these pairs.
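The countries-and-capitals relationship can be sketched with hand-picked toy vectors in which every capital differs from its country by the same offset. These vectors are made up for illustration and are not learned embeddings; `analogy` and `cosine` are hypothetical helper names.

```python
import math

# Toy 3-dimensional vectors, hand-picked so that each capital equals its
# country plus the same offset along the third axis.
VECTORS = {
    "france": [1.0, 0.0, 0.0],
    "paris": [1.0, 0.0, 1.0],
    "germany": [0.0, 1.0, 0.0],
    "berlin": [0.0, 1.0, 1.0],
    "river": [0.5, 0.5, -1.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors=VECTORS):
    """Answer 'a is to b as c is to ?' by finding the word closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("france", "paris", "germany"))
```

In learned embeddings the offset is only approximately shared, so the nearest neighbour of the target vector is used rather than an exact match.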
Comparison of objective functions for specific algorithms
Word2Vec / Skip Gram
Word2Vec / CBOW (Continuous Bag of Words)
GloVe (Global Vectors for Word Representation)
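For reference, the three objectives can be written in their commonly published forms. The notation is an assumption of this sketch: T is the number of tokens in the corpus, c the context window size, V the vocabulary size, X_{ij} the co-occurrence count of words i and j, and f a weighting function.

```latex
\begin{align*}
J_{\text{skip-gram}} &= \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t) \\
J_{\text{CBOW}} &= \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}) \\
J_{\text{GloVe}} &= \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
\end{align*}
```

The two Word2Vec objectives are maximized and predict words from local context windows, while the GloVe objective is minimized and fits global co-occurrence counts. In practice, Word2Vec implementations approximate the softmax inside p with hierarchical softmax or negative sampling.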
Building word embeddings - All of these models have implementations for creating word embeddings from a corpus.
Example of working with Google’s word2vec in Python:
Review the data file
Create a word embedding from the data file
Load the word embedding into a model
Examine the model
Make predictions using the model
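The loading, examining, and querying steps can be sketched without any third-party package by reading the plain-text format that word2vec tools can export: a header line with the vocabulary size and dimensionality, then one word and its vector per line. The sample vectors below are made up, the helper names are hypothetical, and training an embedding from a data file is left out of this sketch.

```python
import io
import math

# A tiny stand-in for a word2vec text-format file (values are made up).
SAMPLE = """4 3
king 0.9 0.8 0.1
queen 0.9 0.7 0.2
apple 0.1 0.2 0.9
pear 0.2 0.1 0.8
"""


def load_embeddings(handle):
    """Read '<vocab size> <dim>' followed by one 'word v1 v2 ...' per line."""
    vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.split()
        if parts:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    # Check the file against its own header.
    assert len(vectors) == vocab_size
    assert all(len(v) == dim for v in vectors.values())
    return vectors, dim


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar(word, vectors, n=3):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


vectors, dim = load_embeddings(io.StringIO(SAMPLE))
print(len(vectors), dim)
print(most_similar("king", vectors, n=1))
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle.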
Implementation References - Reference to original paper, code repo, and a helpful tutorial for each of the algorithms.
Pre-built models - There are a number of pre-built word embeddings available. These are built from various data sets and word embedding algorithms with varying vector sizes.
Tips - Tips for selecting a pre-built model or building a model from scratch.
Pre-built vs. Build from scratch
Selecting a pre-built model
Application to Supervised methods - Word embeddings are a powerful unsupervised tool that can be used with downstream processes such as classification (supervised). One of their more interesting advantages is that they are able to reduce the amount of unseen data (a common problem in large corpora where labeled data is lacking).
Example of using word embeddings for document classification
Create a vectorizer (maps each token in a string to a vector, then aggregates the vectors)
Create an sklearn pipeline to load the training data, process it, build a model, and evaluate the model.
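The two steps above can be sketched in plain Python. To keep the example self-contained it does not use sklearn: a simple nearest-centroid classifier stands in for the pipeline's model, and the embeddings, class names, and training texts are hypothetical toy data rather than the Reuters R8 set.

```python
import math

# Hypothetical 2-dimensional embeddings; real ones would be loaded from a model.
EMBEDDINGS = {
    "oil": [1.0, 0.1], "crude": [0.9, 0.0], "barrel": [0.8, 0.2],
    "wheat": [0.1, 1.0], "grain": [0.0, 0.9], "harvest": [0.2, 0.8],
}


class MeanEmbeddingVectorizer:
    """Map each known token to its vector, then average the vectors."""

    def __init__(self, embeddings):
        self.embeddings = embeddings
        self.dim = len(next(iter(embeddings.values())))

    def transform_one(self, text):
        known = [self.embeddings[t] for t in text.lower().split()
                 if t in self.embeddings]
        if not known:
            return [0.0] * self.dim  # no known tokens: fall back to zeros
        return [sum(column) / len(known) for column in zip(*known)]


class NearestCentroid:
    """Predict the label whose mean training vector is closest."""

    def fit(self, vectors, labels):
        grouped = {}
        for vec, label in zip(vectors, labels):
            grouped.setdefault(label, []).append(vec)
        self.centroids = {
            label: [sum(column) / len(vecs) for column in zip(*vecs)]
            for label, vecs in grouped.items()
        }
        return self

    def predict_one(self, vec):
        return min(self.centroids,
                   key=lambda label: math.dist(vec, self.centroids[label]))


train_texts = ["crude oil barrel", "oil barrel", "wheat harvest", "grain harvest wheat"]
train_labels = ["crude", "crude", "grain", "grain"]

vectorizer = MeanEmbeddingVectorizer(EMBEDDINGS)
classifier = NearestCentroid().fit(
    [vectorizer.transform_one(t) for t in train_texts], train_labels)

print(classifier.predict_one(vectorizer.transform_one("crude oil prices")))
```

The test phrase contains "prices", which has no embedding and is skipped. Note also that "crude oil prices" never appears in the training texts, yet it lands near the "crude" centroid because its known words have similar vectors.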
Open to discussion about ideas on how word embeddings can be used in other NLP tasks
If there is time…
Extending Word2Vec to perform word sense disambiguation