Chapter 7: Timestamped data

Accompanying code for the book The Art of Feature Engineering.

This notebook plus notebooks for the other chapters are available online at https://github.com/DrDub/artfeateng

MIT License

Copyright 2019 Pablo Duboue

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Limitations

  • Simple Python intended for people coming from other languages who plan to use the ideas described in the book outside of Python.
  • Many of these techniques are available as library calls. They are spelled out here for teaching purposes.
  • Resource limitations:
    • At most one day of running time per notebook.
    • No GPU required.
    • Minimal dependencies.
    • At most 8 GB of RAM.
  • Due to resource limitations, these notebooks do not undergo as much hyperparameter tuning as necessary. This is a shortcoming of these case studies; keep it in mind if you want to follow a similar path with your experiments.
  • To help readers try variants of some cells in isolation, the cells are easily executable without having to re-run the whole notebook. As such, most cells read everything they need from disk and write all their results back to disk, which is unnecessary in normal notebooks. The code for each cell might therefore look long and somewhat unusual. In a sense, each cell tries to be a separate Python program.
  • I dislike Pandas, so these notebooks are Pandas-free, which might seem unusual to some.

Chapter 7: Case Study on Timestamped Data

In this chapter I will describe the creation of an extended version of the WikiCities dataset with historical data. I will also present a country population prediction task to exemplify traditional time series algorithms.

WikiCities: Historical

I used the following historical versions of DBpedia:

  Version   Year   Code
  3.6       2010   dbpedia10
  3.7       2011   dbpedia11
  3.8       2012   dbpedia12
  3.9       2013   dbpedia13
  2014      2014   dbpedia14
  2015-04   2015   dbpedia15

Using scripts outside of this notebook, I have extracted the relevant triples from the dumps for all years into files CODE_cities1000_base.ttl (e.g., dbpedia12_cities1000_base.ttl). Due to changes in the extraction script (going from "Phoenix%2C_Arizona" to "Phoenix,Arizona") and in the source Wikipedia (going from "Utrecht" to "Utrecht%28city%29"), the entity designators change from year to year, so I have tried to standardize them using the GeoNames link provided for each year. In the historical versions the GeoNames data was limited, and as many as half the cities are missing; this might also be due to lower coverage in the source Wikipedia in the earlier years. All in all, this represents the type of challenge involved in dealing with historical data in production systems.
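
To give a rough idea of how that standardization can work (the actual scripts live outside this notebook), here is a minimal sketch that uses each year's GeoNames links as a stable pivot between entity designators. The file names dbpedia12_geonames_links.ttl and dbpedia16_geonames_links.ttl and the simplified one-triple-per-line N-Triples format are assumptions for illustration, not the exact files used:

# Hypothetical sketch: align entity designators across DBpedia versions
# using their GeoNames sameAs links as a pivot. Assumes one N-Triples
# line per link: "<dbpedia-uri> <sameAs-uri> <geonames-uri> ."
def geonames_to_dbpedia(path):
    mapping = dict()
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 4 and 'geonames.org' in parts[2]:
                mapping[parts[2]] = parts[0]  # GeoNames URI -> entity URI
    return mapping

old = geonames_to_dbpedia("dbpedia12_geonames_links.ttl")
new = geonames_to_dbpedia("dbpedia16_geonames_links.ttl")
# cities with a GeoNames anchor in both years get renamed to the
# current designator; the rest are dropped (hence the missing cities)
old_to_new = { dbp: new[geo] for geo, dbp in old.items() if geo in new }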

From these files, and also outside the notebooks, I extracted the conservative feature set from Chapter 6, producing files CODE_dev_conservative.tsv (e.g., dbpedia12_dev_conservative.tsv). Again, due to changes in the ontology, many columns (sometimes up to 50% of the columns) are empty. Interestingly, this number changes from year to year (not necessarily starting low and increasing). Therefore, the data lends itself to the specific type of temporal imputation that timestamped data enables.
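
As a rough idea of what that extraction looks like (again, the real scripts are outside the notebooks), here is a minimal sketch for just the rel#count column, assuming the .ttl files contain one simple N-Triples line per triple with the city as subject. The full Chapter 6 pipeline extracts the whole conservative column set and applies further processing (such as log scaling), and the output name here is deliberately different from the real files:

# Hypothetical sketch: derive a per-city relation count from a base
# triples file; one triple per line, subject URI first.
from collections import Counter

rel_count = Counter()
with open("dbpedia12_cities1000_base.ttl") as f:
    for line in f:
        parts = line.split(maxsplit=2)
        if len(parts) == 3:
            rel_count[parts[0]] += 1  # every triple adds one relation

with open("dbpedia12_dev_conservative_sketch.tsv", "w") as w:
    w.write("name\trel#count\n")
    for city, count in sorted(rel_count.items()):
        w.write("{}\t{}\n".format(city, count))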

Let us start doing some EDA and see how this data set behaves.

EDA: Exploratory Data Analysis

The first thing to look at is how the total number of relations behaves year after year (Cell #1).

In [1]:
# CELL 1
import math
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

dbpedia_rels = [
    1, # padding
    19969604, # 2010
    26770236, # 2011
    33742028, # 2012
    41804713, # 2013
    61481509, # 2014
    37791134, # 2015
    38285149 # 2016
]

city_col = 0
rel_col = 1
to_plot = list()
means = list()
city_to_data = None
deleted_cities = set()
for version in range(7):
    datafile = "dbpedia1" + str(version) + "_dev_conservative.tsv"
    if version == 6:
        datafile = "ch6_cell32_dev_feat_conservative.tsv"
    city_relcounts = dict()
    with open(datafile) as f:
        first = True
        for line in f:
            fields = line.split("\t")
            if first:
                first = False
                if fields[city_col] != 'name':
                    raise Exception("Expected 'name', got " + fields[city_col])
                if fields[rel_col] != 'rel#count':
                    raise Exception("Expected 'rel#count', got " + fields[rel_col])
            else:
                city_relcounts[fields[city_col]] = float(fields[rel_col])
    if city_to_data is None:
        city_to_data = dict()
        for city, rel in city_relcounts.items():
            city_to_data[city] = [ rel ]
    else:
        to_delete = list()
        for city, l in city_to_data.items():
            if city in city_relcounts:
                l.append(city_relcounts[city])
            else:
                deleted_cities.add(city)
                to_delete.append(city)
        for city in to_delete:
            del city_to_data[city]
print("Plotting on {:,}".format(len(city_to_data)))
cities = sorted(list(city_to_data.keys()))
for version in range(7):
    relcounts = np.array([ city_to_data[city][version] for city in cities])
    to_plot.append(relcounts)
    means.append(np.mean(relcounts))
fig, ax = plt.subplots()
fig.set_size_inches(15, 15)
ax.set_title('Total relation log counts by year')
ax.boxplot(to_plot, labels=list(range(2010,2017)), meanline=True)
means.insert(0,0)
ax.plot(range(8), means, 'b-', color='gray')
ax2 = ax.twinx()
ax2.plot(list(map(math.log,dbpedia_rels)), linestyle=(0,(5,10)), color='black', linewidth=3.0)
plt.savefig("ch7_cell1_boxplot.pdf", bbox_inches='tight', dpi=300)
plt
Plotting on 44,959
Out[1]:
<module 'matplotlib.pyplot' from 'feateng/lib/python3.7/site-packages/matplotlib/pyplot.py'>

Interestingly, the average number of relations per city has not increased as much as the total size of DBpedia has.

Let us see the behaviour of this number relative to the last year (Cell #2).

In [2]:
# CELL 2
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

city_col = 0
rel_col = 1
to_plot = list()
means = list()
city_to_data = None
deleted_cities = set()
for version in range(7):
    datafile = "dbpedia1" + str(version) + "_dev_conservative.tsv"
    if version == 6:
        datafile = "ch6_cell32_dev_feat_conservative.tsv"
    city_relcounts = dict()
    with open(datafile) as f:
        first = True
        for line in f:
            fields = line.split("\t")
            if first:
                first = False
                if fields[city_col] != 'name':
                    raise Exception("Expected 'name', got " + fields[city_col])
                if fields[rel_col] != 'rel#count':
                    raise Exception("Expected 'rel#count', got " + fields[rel_col])
            else:
                city_relcounts[fields[city_col]] = float(fields[rel_col])
    if city_to_data is None:
        city_to_data = dict()
        for city, rel in city_relcounts.items():
            city_to_data[city] = [ rel ]
    else:
        to_delete = list()
        for city, l in city_to_data.items():
            if city in city_relcounts:
                l.append(city_relcounts[city])
            else:
                deleted_cities.add(city)
                to_delete.append(city)
        for city in to_delete:
            del city_to_data[city]
print("Plotting on {:,}".format(len(city_to_data)))
cities = sorted(list(city_to_data.keys()))
for version in range(7):
    relcounts = np.array([ city_to_data[city][version] - city_to_data[city][-1] for city in cities ])
    to_plot.append(relcounts)
    means.append(np.mean(relcounts))
fig, ax = plt.subplots()
fig.set_size_inches(15, 15)
ax.set_title('Difference relation log counts to 2016 by year')
ax.boxplot(to_plot, labels=list(range(2010,2017)), meanline=True)
means.insert(0,0)
ax.plot(range(8), means, 'b-', color='gray')
plt.savefig("ch7_cell2_boxplot.pdf", bbox_inches='tight', dpi=300)
plt
Plotting on 44,959
Out[2]:
<module 'matplotlib.pyplot' from 'feateng/lib/python3.7/site-packages/matplotlib/pyplot.py'>

Let us now take a look at the number of features whose values are equal to those of the current year (Cell #3).

In [3]:
# CELL 3
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

city_col = 0
to_plot = list()
means = list()
city_to_data = None
deleted_cities = set()
for version in range(7):
    datafile = "dbpedia1" + str(version) + "_dev_conservative.tsv"
    if version == 6:
        datafile = "ch6_cell32_dev_feat_conservative.tsv"
    this_city_data = dict()
    with open(datafile) as f:
        next(f) # drop header
        for line in f:
            fields = line.split("\t")
            this_city_data[fields[city_col]] = list(map(float,fields[1:]))
    if city_to_data is None:
        city_to_data = dict()
        for city, data in this_city_data.items():
            city_to_data[city] = [ data ]
    else:
        to_delete = list()
        for city, l in city_to_data.items():
            if city in this_city_data:
                l.append(this_city_data[city])
            else:
                deleted_cities.add(city)
                to_delete.append(city)
        for city in to_delete:
            del city_to_data[city]
print("Plotting on {:,}".format(len(city_to_data)))
cities = sorted(list(city_to_data.keys()))
for version in range(7):
    equal_counts = list()
    for city in cities:
        equal_count = 0
        for idx in range(len(city_to_data[city][-1])):
            if city_to_data[city][version][idx] == city_to_data[city][-1][idx]:
                equal_count += 1
        equal_counts.append(equal_count)
    equal_counts = np.array(equal_counts)
    to_plot.append(equal_counts)
    means.append(np.mean(equal_counts))
fig, ax = plt.subplots()
fig.set_size_inches(15, 15)
ax.set_title('Equal features to 2016 by year')
ax.boxplot(to_plot, labels=list(range(2010,2017)), meanline=True)
means.insert(0,0)
ax.plot(range(8), means, 'b-', color='gray')
plt.savefig("ch7_cell3_boxplot.pdf", bbox_inches='tight', dpi=300)
plt
Plotting on 44,959
Out[3]:
<module 'matplotlib.pyplot' from 'feateng/lib/python3.7/site-packages/matplotlib/pyplot.py'>

The trend is similar to the one observed for the relation count alone. This data seems reasonable for trying lagged features, but temporal smoothing using a sliding window might provide better results given the bouncy behaviour of the data. The sketch below makes the two options concrete.
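
Here is a small illustration (not part of the case-study pipeline) on a toy per-city series of yearly log relation counts; the values are made up:

import numpy as np

# toy series of yearly log rel#counts for one city, 2010-2015
series = np.array([2.1, 2.3, 2.2, 2.9, 2.4, 2.5])

# lagged features: the raw values from the k most recent years
def lagged(series, k):
    return series[-k:]

# temporal smoothing: a sliding-window moving average that damps the
# year-to-year bounce before the values are used as features
def smoothed(series, window):
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode='valid')

print(lagged(series, 3))    # [2.9 2.4 2.5]
print(smoothed(series, 3))  # approx. [2.2 2.47 2.5 2.6]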

To conclude the EDA, let us take a look at the behaviour of the difference in the target variable, population, with respect to the last year (Cell #4).

In [4]:
# CELL 4
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

city_col = 0
pop_col = 99
to_plot = list()
means = list()
city_to_data = None
deleted_cities = set()
for version in range(7):
    datafile = "dbpedia1" + str(version) + "_dev_conservative.tsv"
    if version == 6:
        datafile = "ch6_cell32_dev_feat_conservative.tsv"
    city_pop  = dict()
    with open(datafile) as f:
        first = True
        for line in f:
            fields = line.strip().split("\t")
            if first:
                first = False
                if fields[city_col] != 'name':
                    raise Exception("Expected 'name', got " + fields[city_col])
                if fields[pop_col] != 'population':
                    raise Exception("Expected 'population', got " + fields[pop_col])
            else:
                city_pop[fields[city_col]] = float(fields[pop_col])
    if city_to_data is None:
        city_to_data = dict()
        for city, rel in city_pop.items():
            city_to_data[city] = [ rel ]
    else:
        to_delete = list()
        for city, l in city_to_data.items():
            if city in city_pop and city_pop[city] > 0:
                l.append(city_pop[city])
            else:
                deleted_cities.add(city)
                to_delete.append(city)
        for city in to_delete:
            del city_to_data[city]
print("Plotting on {:,}".format(len(city_to_data)))
cities = sorted(list(city_to_data.keys()))
for version in range(7):
    pops = np.array([ city_to_data[city][-1] - city_to_data[city][version] for city in cities])
    nonzero = np.count_nonzero(pops)
    print("Year {}, different: {:,} ({:%})".format(2010+version, nonzero, nonzero / len(pops)))
    to_plot.append(pops)
    means.append(np.mean(pops))
fig, ax = plt.subplots()
fig.set_size_inches(15, 15)
ax.set_title('Difference in log population to 2016 by year')
ax.boxplot(to_plot, labels=list(range(2010,2017)), meanline=True)
means.insert(0,0)
ax.plot(range(8), means, 'b-', color='gray')
plt.savefig("ch7_cell4_boxplot.pdf", bbox_inches='tight', dpi=300)
plt
Plotting on 30,380
Year 2010, different: 21,054 (69.302172%)
Year 2011, different: 16,327 (53.742594%)
Year 2012, different: 12,211 (40.194207%)
Year 2013, different: 8,543 (28.120474%)
Year 2014, different: 6,207 (20.431205%)
Year 2015, different: 4,541 (14.947334%)
Year 2016, different: 0 (0.000000%)
Out[4]:
<module 'matplotlib.pyplot' from 'feateng/lib/python3.7/site-packages/matplotlib/pyplot.py'>

Interestingly, the population numbers kept changing from 2010 until 2016, even though censuses happen in each country only once every ten years. Also, restricting the analysis to the cities with a known population for all years does not show any bump in 2015 the way the other years do. I had initially expected the historical population to be so close to the current one that giving it to the ML would constitute a target leak. While the historical value might not be the exact current number, the box plot shows it is on average the same, with some marked outliers.

Let's try to study the missing values now.

Missing values

There are two sources of missing values:

  1. Empty relation for that year (the whole column will be empty)
  2. Missing city for that year (the whole row will be empty)

I will detect these two cases, create an indicator feature for each cell based on these two sources, and save the result for further imputation (Cell #5).

In [5]:
# CELL 5
import random
import math
import numpy as np
from collections import OrderedDict

rand = random.Random(42)

# get data sizes
VERSIONS=6 # + 1 for current

with open("ch6_cell32_dev_feat_conservative.tsv") as f:
    FEATS = len(next(f).split('\t')) - 1 # name
    CITIES = 0
    for _ in f:
        CITIES += 1

# load data
CITY_COL = 0
header_line = None
city_names = OrderedDict() # name to idx

table = np.zeros( (CITIES, FEATS * 2 * VERSIONS + FEATS) )

with open("ch6_cell32_dev_feat_conservative.tsv") as f:
    header_line = next(f)
    city_idx = 0
    for line in f:
        fields = line.strip().split("\t")
        floats = list(map(float,fields[1:]))
        for idx, val in enumerate(floats):
            if math.isnan(val):
                floats[idx] = 0.0
        table[city_idx,-FEATS:] = floats
        city_names[fields[CITY_COL]] = city_idx
        city_idx += 1
header = header_line.strip().split("\t")

year_defined = { city: set() for city in city_names }
for version in range(VERSIONS):
    with open("dbpedia1" + str(version) + "_dev_conservative.tsv") as f:
        header_line = next(f)
        for line in f:
            fields = line.strip().split("\t")
            city = fields[CITY_COL]
            city_idx = city_names.get(city, None)
            if city_idx is None:
                continue
            year_defined[city].add(version)
            floats = list(map(float,fields[1:]))
            for idx, val in enumerate(floats):
                table[city_idx,FEATS * 2 * version + 2 * idx] = 1.0 # defined
                if math.isnan(val):
                    table[city_idx,FEATS * 2 * version + 2 * idx + 1] = 0.0
                else:
                    table[city_idx,FEATS * 2 * version + 2 * idx + 1] = val

# column-based missing values: determine empty/constant columns
empty_cols_per_year = list()
for version in range(VERSIONS):
    empty_cols = 0
    for col in range(FEATS):
        value = None
        constant = True
        for city_idx in range(CITIES):
            this_val = table[city_idx, 2 * version * FEATS + 2 * col + 1]
            if value is None:
                value = this_val
            elif value != this_val:
                constant = False
                break
        if constant:
            empty_cols += 1
            table[:, 2 * version * FEATS + 2 * col] = 0.0 # missing
    empty_cols_per_year.append(empty_cols)
print("Total empty cols {:,}".format(sum(empty_cols_per_year)))
for version in range(VERSIONS):
    print("Empty cols for {}: {}".format(2010+version, empty_cols_per_year[version]))
print("Working on {:,} cities".format(CITIES))

# row-based imputation: determine empty years
empty_rows_per_year = list()
for version in range(VERSIONS):
    empty_rows = 0
    for city, city_idx in city_names.items():
        if version not in year_defined[city] or \
        all(map(lambda x:table[city_idx, 2*version * FEATS + 2 * x + 1] == 0.0, range(FEATS))):
            empty_rows += 1
            for col in range(FEATS):
                table[city_idx, 2 * version * FEATS + 2 * col] = 0.0
    empty_rows_per_year.append(empty_rows)
print("Total empty rows {:,}".format(sum(empty_rows_per_year)))
for version in range(VERSIONS):
    print("Empty rows for {}: {}".format(2010+version, empty_rows_per_year[version]))

# write feature table
missing_cells = 0
total_cells = 0
with open("ch7_cell5_dev_missing.tsv", "w") as w:
    w.write('name')
    for version in range(VERSIONS):
        for featname in header[1:]:
            w.write("\t{}_{}?is_defined".format(2010+version, featname))
            w.write("\t{}_{}".format(2010+version, featname))
    w.write('\t' + '\t'.join(header[1:]) + "\n")
    for city, city_idx in city_names.items():
        w.write(city + "\t" + "\t".join(map(str, table[city_idx,:])) + "\n")
        total_cells += FEATS * VERSIONS
        for version in range(VERSIONS):
            for col in range(FEATS):
                if table[city_idx, 2 * version * FEATS + 2 * col] == 0.0:
                    missing_cells += 1
print("Total cells: {:,} missing: {:,} ({:%})".format(total_cells, missing_cells, missing_cells / total_cells))
Total empty cols 186
Empty cols for 2010: 37
Empty cols for 2011: 29
Empty cols for 2012: 29
Empty cols for 2013: 29
Empty cols for 2014: 62
Empty cols for 2015: 0
Working on 44,959 cities
Total empty rows 37,144
Empty rows for 2010: 9990
Empty rows for 2011: 8147
Empty rows for 2012: 6808
Empty rows for 2013: 2160
Empty rows for 2014: 1538
Empty rows for 2015: 8501
Total cells: 26,705,646 missing: 11,078,309 (41.483022%)

There is a significant amount of missing values. Let us see if we can now visualize individual entries. For visualization, I'll use a heatmap, discretizing features into up to six bins using k-means clustering. The top three clusters will be shown in green, the bottom three in blue. Missing values will be shown in red.

The graph aligns the features for each year so we can follow them over time, uncovering the story of the data (Cell #6).

In [6]:
# CELL 6
import random
import math
import numpy as np
from sklearn.cluster import KMeans

rand = random.Random(42)

header = None
data = list()
with open("ch7_cell5_dev_missing.tsv") as f:
    header = next(f).strip().split('\t')
    header.pop(0) # name
    for line in f:
        fields = line.strip().split('\t')
        name = fields[0]
        feats = list(map(float, fields[1:]))
        data.append( (feats, feats[-1], name) )

table = np.zeros( (len(data), len(header)) )
for idx, row in enumerate(data):
    table[idx] = row[0]

base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
VERSIONS = 6
CITIES = len(data)

# compute boundaries using kMeans
boundaries = list()
for feat in range(FEATS):
    col = table[:,2*VERSIONS*FEATS + feat]
    vals = set(col)
    num_clusters = min(len(vals), 6)
    if num_clusters < 6:
        boundaries.append(sorted(vals))
    else:
        kmeans = KMeans(n_clusters=6, random_state=1).fit(col.reshape(-1,1))
        boundaries.append(sorted(set(list(kmeans.cluster_centers_.reshape(-1)))))

# pick some cities at random to graph
to_show = rand.sample(list(range(len(data))), 20)

def colorize(feat, feat_val):
    num_classes = len(boundaries[feat])
    half_classes = num_classes // 2
    if half_classes == 0:
        half_classes = 1
    dists = sorted(map(lambda x: (abs(x[1]-feat_val), x[0]), enumerate(boundaries[feat])))
    clazz = dists[0][1]
    if clazz >= half_classes:
        component = 1
        clazz -= half_classes
    else:
        component = 2
        clazz = half_classes - clazz - 1
        
    intensity = None
    if clazz == 0:
        if half_classes == 3:
            intensity = 0.3
        elif half_classes == 2:
            intensity = 0.6
        elif half_classes == 1:
            intensity = 1.0
    elif clazz == 1:
        if half_classes == 3:
            intensity = 0.6
        elif half_classes == 2:
            intensity = 1.0
        elif half_classes == 1:
            intensity = 1.0
    elif clazz == 2:
        intensity = 1.0
    box_color = [ 0.0, 0.0, 0.0 ]
    box_color[component] = intensity
    #print(dists, clazz, num_classes, half_classes, box_color)
    return box_color

def gray(box_color):
    if box_color[0] == 1.0:
        col = 0.0
    elif box_color[1] == 0.0:
        col = box_color[2] / 2
    else:
        col = 0.5 + box_color[1] / 2
    return [ col, col, col ]

HEIGHT=2
WIDTH=2

def shrink_name(feat_name):
    if '@' in feat_name:
        parts = feat_name.split('@')
        return parts[0].split("/")[-1] + "@" + parts[1].split("/")[-1][:-1]
    else:
        if '/' in feat_name:
            return feat_name.split("/")[-1]
    return feat_name

img = np.zeros( [ (VERSIONS + 1) * HEIGHT, FEATS * WIDTH, 3 ])
img[:,:,:] = 1.0

from matplotlib import pyplot as plt
%matplotlib inline

plt.rcParams['figure.figsize'] = [20, 40]

plt.axis('on')
yticks_locs = list(map(lambda x:x*HEIGHT, range(VERSIONS+1))) 
yticks_labels = list(map(lambda x:2016-x, range(VERSIONS+1))) 
xticks_locs = list(map(lambda x:x*WIDTH, range(FEATS)))
xticks_labels = list(map(lambda x:shrink_name(x), base_features))
plt.yticks(yticks_locs, yticks_labels)
plt.xticks(xticks_locs, xticks_labels, rotation=90)

idx = list(filter(lambda x:x[1][-1].endswith("Mérida,_Yucatán>"), enumerate(data)))[0][0]
xbox = 0
for feat in range(FEATS):
    ybox = feat * WIDTH
    img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = gray(colorize(feat, 
                                                                      table[idx, 2 * VERSIONS * FEATS + feat]))
for version in range(VERSIONS):
    xbox = (VERSIONS - version) * HEIGHT
    for feat in range(FEATS):
        ybox = feat * WIDTH
        box_color = gray(colorize(feat, table[idx, 2 * version * FEATS + 2 * feat + 1]))
        # mark imputation
        if table[idx,  2 * FEATS * version + 2 * feat] == 0.0:
            img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = [ 1.0, 1.0, 1.0 ]
            img[ xbox + 1, ybox ] = [ 0.0, 0.0, 0.0 ]
            img[ xbox, ybox + 1 ] = [ 0.0, 0.0, 0.0 ]
        else:
            img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = box_color 
print("{}: {} ({:,} people)".format(data[idx][-1].split("/")[-1][:-1],
                                               int(data[idx][1]*1000)/1000.0,
                                               int(10**data[idx][1])))
plt.imshow(img, aspect='equal')
plt.savefig("ch7_cell6_viz.pdf", bbox_inches='tight', dpi=300)


from matplotlib import pyplot as plt
%matplotlib inline

plt.rcParams['figure.figsize'] = [20, 40]

plt.axis('on')
for img_idx, idx in enumerate(to_show):
    plt.subplot(len(to_show), 1, img_idx + 1)
    plt.title("{}: {} ({:,} people)".format(data[idx][-1].split("/")[-1][:-1],
                                               int(data[idx][1]*1000)/1000.0,
                                               int(10**data[idx][1])))
    img = np.zeros( [ (VERSIONS + 1) * HEIGHT, FEATS * WIDTH, 3 ])
    img[:,:,:] = 1.0

    xbox = 0
    for feat in range(FEATS):
        ybox = feat * WIDTH
        img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = colorize(feat, 
                                                                     table[idx, 2 * VERSIONS * FEATS + feat]) 
    for version in range(VERSIONS):
        xbox = (VERSIONS - version) * HEIGHT
        for feat in range(FEATS):
            ybox = feat * WIDTH
            box_color = colorize(feat, table[idx, 2 * version * FEATS + 2 * feat + 1])
            # mark imputation
            if table[idx,  2 * FEATS * version + 2 * feat] == 0.0:
                img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = [1.0, 0, 0 ]
            else:
                img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = box_color 
    plt.imshow(img, aspect='equal')
    plt.yticks(yticks_locs, yticks_labels)
    plt.xticks(xticks_locs, [] if img_idx < len(to_show) - 1 else xticks_labels, rotation=90)
plt.savefig("ch7_cell6_viz_full.pdf", bbox_inches='tight', dpi=300)
Mérida,_Yucatán: 5.986 (970,376 people)

From the figure we can see that the relation counts change from version to version but the target population seems to stay within the same range. There are plenty of missing features, and the colours indicate a relative growth over time.

The "seeAlso" features are missing for all years, that's a big loss.

First featurization: time lagged features

Given the large percentage of cities that are not present in all years, we will need a way to impute them.

Imputation

I will use the closest available year, giving priority to earlier years (Cell #7).

In [7]:
# CELL 7
import random
import math
import numpy as np

rand = random.Random(42)

header = None
data = list()
# load data
with open("ch7_cell5_dev_missing.tsv") as f:
    header = next(f).strip().split('\t')
    header.pop(0) # name
    for line in f:
        fields = line.strip().split('\t')
        name = fields[0]
        feats = list(map(float, fields[1:]))
        data.append( (feats, feats[-1], name) )

table = np.zeros( (len(data), len(header)) )
for idx, row in enumerate(data):
    table[idx] = row[0]
    
base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
VERSIONS = 6
CITIES = len(data)

# impute
back_imputed = 0
forward_imputed = 0
present_imputed = 0
for city_idx in range(CITIES):
    for version in range(VERSIONS):
        for feat in range(FEATS):
            if table[city_idx,  2 * FEATS * version + 2 * feat] == 1.0:
                continue
            lag = 1
            imputed = False
            while version + lag <= VERSIONS or version - lag >= 0:
                if version - lag >= 0 and table[city_idx,  2 * FEATS * (version - lag) + 2 * feat] == 1.0:
                    table[city_idx,  2 * FEATS * version + 2 * feat + 1] = \
                      table[city_idx,  2 * FEATS * (version - lag) + 2 * feat + 1]
                    back_imputed += 1
                    imputed = True
                    break
                if version + lag < VERSIONS and table[city_idx,  2 * FEATS * (version + lag) + 2 * feat] == 1.0:
                    table[city_idx,  2 * FEATS * version + 2 * feat + 1] = \
                      table[city_idx,  2 * FEATS * (version + lag) + 2 * feat + 1]
                    forward_imputed += 1
                    imputed = True
                    break
                if version + lag == VERSIONS: # present
                    table[city_idx,  2 * FEATS * version + 2 * feat + 1] = \
                      table[city_idx,  2 * FEATS * (version + lag) + feat]
                    present_imputed += 1
                    imputed = True
                    break
                lag += 1
            if not imputed:
                print(version, feat, data[city_idx][-1])

# write full feature table
imputed_cells = 0
total_cells = 0
with open("ch7_cell7_dev_imputed.tsv", "w") as w:
    w.write('name')
    w.write('\t' + '\t'.join(header) + "\n")
    for city_idx, city in enumerate(map(lambda x:x[-1], data)):
        total_cells += FEATS * VERSIONS
        w.write(city)
        w.write('\t' + '\t'.join(map(str, table[city_idx,:])))
        w.write('\n')
imputed_cells = back_imputed + forward_imputed + present_imputed
print("Total cells: {:,} imputed: {:,} ({:%})".format(total_cells, imputed_cells, imputed_cells / total_cells))
print("Backward imputed: {:,} ({:%})".format(back_imputed, back_imputed / imputed_cells))
print("Forward imputed: {:,} ({:%})".format(forward_imputed, forward_imputed / imputed_cells))
print("Present imputed: {:,} ({:%})".format(present_imputed, present_imputed / imputed_cells))
Total cells: 26,705,646 imputed: 11,078,309 (41.483022%)
Backward imputed: 1,770,176 (15.978756%)
Forward imputed: 7,154,111 (64.577644%)
Present imputed: 2,154,022 (19.443599%)

So most features are not present-imputed. That is important because otherwise there would be little value in the historical features.

Let us visualize this (Cell #8).

In [8]:
# CELL 8
import random
import math
import numpy as np
from sklearn.cluster import KMeans

rand = random.Random(42)

header = None
data = list()
with open("ch7_cell7_dev_imputed.tsv") as f:
    header = next(f).strip().split('\t')
    header.pop(0) # name
    for line in f:
        fields = line.strip().split('\t')
        name = fields[0]
        feats = list(map(float, fields[1:]))
        data.append( (feats, feats[-1], name) )

table = np.zeros( (len(data), len(header)) )
for idx, row in enumerate(data):
    table[idx] = row[0]

base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
VERSIONS = 6
CITIES = len(data)

# compute boundaries using kMeans
boundaries = list()
for feat in range(FEATS):
    col = table[:,2*VERSIONS*FEATS + feat]
    vals = set(col)
    num_clusters = min(len(vals), 6)
    if num_clusters < 6:
        boundaries.append(sorted(vals))
    else:
        kmeans = KMeans(n_clusters=6, random_state=1).fit(col.reshape(-1,1))
        boundaries.append(sorted(set(list(kmeans.cluster_centers_.reshape(-1)))))
        
# save boundaries
with open("ch7_cell8_boundaries.tsv", "w") as b:
    for idx, boundary_list in enumerate(boundaries):
        b.write("{}\t{}\n".format(idx, "\t".join(map(str,boundary_list))))

# pick some cities at random to graph
to_show = rand.sample(list(range(len(data))), 20)

def cell8_colorize(feat, feat_val, boundaries):
    num_classes = len(boundaries[feat])
    half_classes = num_classes // 2
    if half_classes == 0:
        half_classes = 1
    dists = sorted(map(lambda x: (abs(x[1]-feat_val), x[0]), enumerate(boundaries[feat])))
    clazz = dists[0][1]
    if clazz >= half_classes:
        component = 1
        clazz -= half_classes
    else:
        component = 2
        clazz = half_classes - clazz - 1
        
    intensity = None
    if clazz == 0:
        if half_classes == 3:
            intensity = 0.3
        elif half_classes == 2:
            intensity = 0.6
        elif half_classes == 1:
            intensity = 1.0
    elif clazz == 1:
        if half_classes == 3:
            intensity = 0.6
        elif half_classes == 2:
            intensity = 1.0
        elif half_classes == 1:
            intensity = 1.0
    elif clazz == 2:
        intensity = 1.0
    box_color = [ 0.0, 0.0, 0.0 ]
    box_color[component] = intensity
    #print(dists, clazz, num_classes, half_classes, box_color)
    return box_color

HEIGHT=2
WIDTH=2

def cell8_shrink_name(feat_name):
    if '@' in feat_name:
        parts = feat_name.split('@')
        return parts[0].split("/")[-1] + "@" + parts[1].split("/")[-1][:-1]
    else:
        if '/' in feat_name:
            return feat_name.split("/")[-1]
    return feat_name

img = np.zeros( [ (VERSIONS + 1) * HEIGHT, FEATS * WIDTH, 3 ])
img[:,:,:] = 1.0


yticks_locs = list(map(lambda x:x*HEIGHT, range(VERSIONS+1))) 
yticks_labels = list(map(lambda x:2016-x, range(VERSIONS+1))) 
xticks_locs = list(map(lambda x:x*WIDTH, range(FEATS)))
xticks_labels = list(map(lambda x:cell8_shrink_name(x), base_features))

from matplotlib import pyplot as plt
%matplotlib inline

plt.rcParams['figure.figsize'] = [20, 40]

def cell8_draw_features(plot, row, title, versions, boundaries,
                        yticks_locs, yticks_labels, xticks_locs, xticks_labels):
    if title is not None:
        plot.title(title)
    img = np.zeros( [ (versions + 1) * HEIGHT, FEATS * WIDTH, 3 ])
    img[:,:,:] = 1.0

    xbox = 0
    for feat in range(FEATS):
        ybox = feat * WIDTH
        img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = \
           cell8_colorize(feat, row[2 * versions * FEATS + feat], boundaries) 
    for version in range(versions):
        xbox = (versions - version) * HEIGHT
        for feat in range(FEATS):
            ybox = feat * WIDTH
            box_color = cell8_colorize(feat, row[2 * version * FEATS + 2 * feat + 1], boundaries)
            # mark imputation
            if row[2 * FEATS * version + 2 * feat] == 0.0:
                img[ xbox:(xbox + HEIGHT // 2), ybox: (ybox + WIDTH) ] = [1.0, 0, 0 ]
                img[ (xbox + HEIGHT // 2):(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = box_color
            else:
                img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = box_color 
    plot.imshow(img, aspect='equal')
    plot.yticks(yticks_locs, yticks_labels)
    plot.xticks(xticks_locs, xticks_labels, rotation=90)    

plt.axis('on')
for img_idx, idx in enumerate(to_show):
    plt.subplot(len(to_show), 1, img_idx + 1)
    cell8_draw_features(plt, table[idx,:], "{}: {} ({:,} people)".format(data[idx][-1].split("/")[-1][:-1],
                                            int(table[idx,-1]*1000)/1000.0,
                                            int(10**table[idx,-1])), 
                        VERSIONS, boundaries, 
                        yticks_locs, yticks_labels, xticks_locs, 
                        [] if img_idx < len(to_show) - 1 else xticks_labels)

plt.savefig("ch7_cell8_viz_full.pdf", bbox_inches='tight', dpi=300)