The NYC Taxi and Limousine Commission (TLC) has publicly released a dataset of taxi trips from January 2009 to June 2016, with GPS coordinates for the start and end points of each trip. Chris Whong originally filed the FOIA request that got the TLC to release the data, and he produced a famous visualization, NYC Taxis: A Day in the Life. Mark Litwintschik benchmarked various relational database and big data technologies using this dataset, given its moderate 400GB size. And notably, Todd W. Schneider produced some really nice summaries of the dataset, some of which are similar to the work I show here. I was not aware of Todd’s work on this topic until after this post was written, so although there is a fair bit of overlap, this post and the graphics in it are original.
I downloaded the data files from the TLC website and (very painfully), using Python, Dask, and Spark, produced a cleaned dataset in Parquet format, which I make available for AWS users at the end of this post.
So I was curious: where do taxis pick up passengers, or more precisely, what does the distribution of taxi pickup locations look like? With 1.3 billion taxi pickups, plotting the distribution in a way that does not wash out detail is very challenging. Scatter plots are useless due to overplotting, and 2D histograms are a form of density estimation that necessarily blurs or pixelates a lot of the detail. Additionally, with the full dataset, the pickup locations alone total 21GB, which is more than the memory of my 16GB laptop. Out-of-core tools can solve that technical problem easily (and subsampling is easier still), but what about the visual problem? Human eyes are incapable of absorbing 21GB of information in a plot.
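As a rough sanity check on the 21GB figure, here is a back-of-the-envelope calculation. It assumes each pickup is stored as a longitude/latitude pair of 64-bit floats; the storage type is an assumption for illustration, not something stated above.

```python
# Back-of-the-envelope memory estimate for the pickup coordinates.
n_points = 1_300_000_000   # pickups, as quoted in the text
bytes_per_point = 2 * 8    # assumption: longitude + latitude as 64-bit floats
total_gb = n_points * bytes_per_point / 1e9

print(f"{total_gb:.1f} GB")  # 20.8 GB
```

Under that assumption the coordinates alone exceed 16GB of RAM, which is why the data has to be streamed or subsampled. That part is an engineering detail; the visual problem is the harder one.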
The solution to this comes from an interesting library called Datashader. It dynamically generates a 2D histogram at the resolution of your display (or of a specified canvas). Each pixel corresponds to a histogram bin with fixed boundaries in data coordinates. The library counts the data points that fall within those boundaries, and that count sets the color intensity of the pixel. By leveraging Dask, the histogram computation can scale to terabytes of data and be spread across a cluster. By leveraging Bokeh, the final plot can be zoomed and panned. Using techniques from high dynamic range photography, intensity ranges are remapped so that maximum contrast is present at any zoom level and in any given viewport.
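To make the per-pixel counting and the intensity remapping concrete, here is a minimal pure-Python sketch. It is not Datashader's implementation: the function names `rasterize` and `equalize` are hypothetical, and the rank-based remapping is a simplified stand-in for histogram equalization.

```python
from bisect import bisect_right


def rasterize(points, x_range, y_range, width, height):
    """Count the points falling in each pixel of a width x height canvas."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # point lies outside the current viewport
        # Scale the coordinate to a pixel index; min() guards against
        # floating-point rounding pushing the index to width or height.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid  # row 0 holds the smallest y values


def equalize(grid):
    """Map each non-zero count to (0, 1] by its rank among non-zero pixels."""
    counts = sorted(c for row in grid for c in row if c > 0)
    if not counts:
        return [[0.0] * len(row) for row in grid]
    return [
        [bisect_right(counts, c) / len(counts) if c > 0 else 0.0 for c in row]
        for row in grid
    ]


# Toy data: one very busy pixel, one moderate pixel, one nearly empty pixel.
points = [(0.5, 0.5)] * 1000 + [(1.5, 0.5)] * 10 + [(0.5, 1.5)]
grid = rasterize(points, (0.0, 2.0), (0.0, 2.0), width=2, height=2)
intensity = equalize(grid)

print(grid)  # [[1000, 10], [1, 0]]
print(intensity)
```

With linear scaling, the pixels holding 10 points and 1 point would sit at 1% and 0.1% of the brightest pixel and be effectively invisible. The rank-based mapping spreads them to two thirds and one third of full intensity. Zooming amounts to calling `rasterize` again with narrower ranges, so the bins always match the pixels on screen. In Datashader itself, these two steps correspond to `Canvas.points(...)` for the aggregation and `transfer_functions.shade(..., how='eq_hist')` for the equalized shading.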
This is what the map of taxi pickup locations (1.3 billion points) looks like over Manhattan, plotted using the perceptually uniform Viridis colormap. The first thing I notice is how clearly I can see the street patterns. In parts of Brooklyn and Queens, the street pattern is sharp. In Manhattan, the pattern is ‘fuzzier’, especially near the southern tip of Manhattan and in Midtown south of Central Park. There are an awful lot of pickups that, according to their GPS coordinates, fall in the Hudson or East rivers, and quite a few that fall in the portion of Central Park where there are no roads. Obviously, not many taxi trips start in the rivers surrounding Manhattan; what the plot shows instead is how significant GPS error is. The fuzziness arises from tall buildings, which make it quite difficult to get a good GPS fix, and the taller the buildings, the fuzzier the streets look. More broadly, the Midtown area south of Central Park is very bright, indicating that a lot of taxi trips start there.
The second image also shows taxi pickups, but on a much wider scale. Zoomed out, most of Manhattan lights up like a beacon, indicating far more pickups in Manhattan than in the surrounding area. But the airports, JFK and LaGuardia in particular, also light up, showing nearly as much visual intensity (trips per unit area starting there) as Midtown.
Now let’s examine the dropoff locations using the Inferno colormap. At first glance, the dropoff locations look a lot like the pickup locations within Manhattan. The same regions, Midtown south of Central Park and the southern tip of Manhattan, show the brightest (and fuzziest) streets.
Zooming out to the broader metro area, the streets in Brooklyn and Queens are much sharper and brighter than in the pickups map. The brightness indicates that there are a lot more dropoffs than pickups in the outer boroughs, and the sharpness suggests that GPS error in these regions tends to be lower, presumably due to fewer tall buildings. In fact, in some places the image looks good enough to use as a street map, which points to a relatively even distribution of taxi dropoffs across Brooklyn and Queens. Many people take taxis from Manhattan to the outer boroughs, but far fewer take taxis from the outer boroughs into Manhattan.
Read more of Ravi Shekhar’s fascinating study in Medium: https://medium.com/towards-data-science/if-taxi-trips-were-fireflies-1-3-billion-nyc-taxi-trips-plotted-b34e89f96cfa
- Fascinating data generate almost-complete ‘photo’ of New York taxi pick-ups and drop-offs.