Median Age as Predictor Variable
There is a ton of information in the TIGER Census files at the U.S. Gov Census site. Unfortunately, it is not easily mapped to geolocations. I had to get the tract level shapefiles and then transform...
Python Static Dictionaries in Nearest Neighbor Queries
A standard query on geospatial data is the nearest neighbor query, e.g., select the five closest police stations to a given point. The brute force approach to this problem is joining the two tables...
Latent Semantic Analysis in Solr using Clojure
I recently pushed a very alpha Solr plugin to GitHub that does unsupervised clustering on unstructured text documents. The plugin is written in Clojure and utilizes the Incanter and associated...
Destructuring in Mathematica
A technique that I have found particularly useful in Lisp-like languages like Mathematica and Clojure is destructuring. Destructuring is a mechanism for extracting parts of an expression. The Lisp “code as...
Stochastic Gradient Descent
Most machine learning algorithms and statistical inference techniques operate on the entire dataset. Think of ordinary least squares regression or estimating generalized linear models. The...
Data Science Meetup
CCRi was delighted to host the second meeting of the Cville Data Science group earlier this month. A full house packed our conference room, and a good time was had by all. The lineup for the talks...
Going beyond tabulating comentions
In this and my next post, I’ll be showing a few quick analyses we performed using a new tool we developed, called Elias. In today’s post, we’ll see how topic modeling can be used to characterize how...
Which Armstrong?
In my last post, I described how we used Elias, an exploratory analysis tool for large-scale information extractions, to look at which (person,location) pairs are mentioned the most together, and then...
Calculating Feature Importance in Data Streams with Concept Drift using...
I had the privilege of presenting my work on “Calculating Feature Importance in Data Streams with Concept Drift using Online Random Forest” at IEEE Big 2014 in Washington, DC this last week. The...
Cloud Computing With Spark: Using All Your executors
Sometimes data scientists find themselves with a map-reduce cloud architecture and computation that needs to be done on a large scale, but the data isn’t actually cloud scale. One great way to get the...