by Pablo Duboue, PhD
The book has 10,000 lines of Python code in 5 different Jupyter notebooks, operating over 2.1Gb of compressed data. The code behind these case studies is intended as a communication tool for the ideas expressed in the book.
The task tackled in the first four chapters is that of predicting the population of cities and small towns using different data sources. This task can be attacked with structural features, timestamped features, textual features and image features. In particular, for each city, this means predicting from its ontological properties (e.g., the title of its leader or its time zone), from its historical population and historical features (which involves time series analysis), from the textual description of the place (which involves text analysis, particularly as the text sometimes includes the population) and from a satellite image of the city (which involves image processing).
These case studies reflect the author's attempt to solve these problems through feature engineering alone, with the following constraints:
Note that these decisions have two obvious casualties: no deep learning framework (like TensorFlow) is used, and no hyperparameter search is performed. The latter was a decision motivated by these constraints.
As mentioned in the GitHub README, this code is intended as a way of communicating ideas. It is as far from production code as source code can get.
DBpedia + GeoCities + Wikipedia + NASA tiles
Download 2.2Gb Zip compressed.
Download 1.7Gb Tar BZip2 compressed.
The data for each individual chapter is already available for download below. Note that the Chapter 9 data contains the source tiles and is 3x larger than the files above. You will only need the source tiles if you want to try other types of box constructions around each city.
This notebook has 21 cells. It uses numpy, scikit-learn, matplotlib and opencv.
The chapter dataset contains the full NASA tiles (only needed if doing experiments changing the box extraction algorithm in Cell #3). The full all-chapters dataset contains only boxes around each city and is much smaller.
GitHub | Rendered Notebook | Dataset
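To give a feel for what "a box around each city" can mean, here is a minimal sketch using numpy. It is not the algorithm from Cell #3 of the notebook. The function name `extract_box`, the pixel coordinates `row` and `col`, the `half_size` parameter and the zero-padding at tile borders are all assumptions made for illustration.

```python
import numpy as np


def extract_box(tile, row, col, half_size):
    """Crop a square of side 2*half_size+1 centered on (row, col).

    Hypothetical helper: pixels falling outside the tile are zero-padded,
    so the returned box always has the same shape.
    """
    size = 2 * half_size + 1
    # Keep any trailing channel dimension (e.g. RGB) from the tile.
    box = np.zeros((size, size) + tile.shape[2:], dtype=tile.dtype)

    # Requested window in tile coordinates (may extend past the edges).
    r0, r1 = row - half_size, row + half_size + 1
    c0, c1 = col - half_size, col + half_size + 1

    # Part of the window that actually lies inside the tile.
    tr0, tr1 = max(r0, 0), min(r1, tile.shape[0])
    tc0, tc1 = max(c0, 0), min(c1, tile.shape[1])

    if tr0 < tr1 and tc0 < tc1:
        box[tr0 - r0:tr1 - r0, tc0 - c0:tc1 - c0] = tile[tr0:tr1, tc0:tc1]
    return box


# Toy 10x10 single-channel "tile" standing in for a satellite image.
tile = np.arange(100).reshape(10, 10)
center_box = extract_box(tile, 5, 5, 1)   # fully inside the tile
corner_box = extract_box(tile, 0, 0, 2)   # partly outside, zero-padded
```

Trying other box constructions, as suggested above, would amount to changing how the window is chosen (for example, its size or shape) before the crop; that is where the full source tiles become necessary.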