Accompanying code for the book The Art of Feature Engineering.
This notebook plus notebooks for the other chapters are available online at https://github.com/DrDub/artfeateng
Copyright 2019 Pablo Duboue
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
In this chapter I will describe the creation of the extended version of the WikiCities dataset with historical data. I will also present a country population prediction task to exemplify traditional time series algorithms.
I used the following historical versions of DBpedia:
Version | Year | Code |
---|---|---|
3.6 | 2010 | dbpedia10 |
3.7 | 2011 | dbpedia11 |
3.8 | 2012 | dbpedia12 |
3.9 | 2013 | dbpedia13 |
2014 | 2014 | dbpedia14 |
2015-04 | 2015 | dbpedia15 |
Using scripts outside of this notebook, I have extracted the relevant triples from the dumps for all years into files CODE_cities1000_base.ttl (e.g., dbpedia12_cities1000_base.ttl). Because the entity designators change from year to year, due to changes in the extraction script (going from "Phoenix%2C_Arizona" to "Phoenix,Arizona") and in the source Wikipedia (going from "Utrecht" to "Utrecht%28city%29"), I have tried to standardize them using the GeoNames link provided for each year. In the historical versions the GeoNames data was limited and as many as half the cities are missing; this might also be due to lower coverage in the source Wikipedia in the earlier years. All in all, this represents the kind of challenge involved in dealing with historical data in production systems.
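As an aside, those standardization scripts are not part of the notebooks, but the idea is easy to sketch: build, for each year, a mapping from that year's DBpedia URIs to their GeoNames identifiers and use the identifier as the join key across years. The link file layout and the owl:sameAs predicate in the sketch below are assumptions about the DBpedia link dumps, not the exact scripts used for the book.
import re
# hedged sketch: each line of a DBpedia GeoNames link file is assumed to look like
# "<dbpedia URI> owl:sameAs <http://sws.geonames.org/ID/> ."
geo_link_re = re.compile(r'(<[^>]+>)\s+<http://www\.w3\.org/2002/07/owl#sameAs>\s+'
                         r'<http://sws\.geonames\.org/(\d+)/?> \.')
def uri_to_geonames(links_filename):
    mapping = dict()  # year-specific DBpedia URI -> GeoNames numeric id
    with open(links_filename) as f:
        for line in f:
            m = geo_link_re.match(line)
            if m:
                mapping[m.group(1)] = m.group(2)
    return mapping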
From these files, also outside the notebooks, I extracted the conservative feature set into files CODE_dev_conservative.tsv. Again, due to changes in the ontology many columns (sometimes up to 50% of them) are empty. Interestingly, this number changes from year to year (not necessarily starting low and increasing), which enables the specific type of temporal imputation that timestamped data makes possible.
Let us start doing some EDA and see how this data set behaves.
The first thing to look at is how the total number of relations behaves year after year (Cell #1).
# CELL 1
import math
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
dbpedia_rels = [
1, # padding
19969604, # 2010
26770236, # 2011
33742028, # 2012
41804713, # 2013
61481509, # 2014
37791134, # 2015
38285149 # 2016
]
city_col = 0
rel_col = 1
to_plot = list()
means = list()
city_to_data = None
deleted_cities = set()
for version in range(7):
datafile = "dbpedia1" + str(version) + "_dev_conservative.tsv"
if version == 6:
datafile = "ch6_cell32_dev_feat_conservative.tsv"
city_relcounts = dict()
with open(datafile) as f:
first = True
for line in f:
fields = line.split("\t")
if first:
first = False
if fields[city_col] != 'name':
raise Exception("Expected 'name', got " + fields[city_col])
if fields[rel_col] != 'rel#count':
raise Exception("Expected 'rel#count', got " + fields[rel_col])
else:
city_relcounts[fields[city_col]] = float(fields[rel_col])
if city_to_data is None:
city_to_data = dict()
for city, rel in city_relcounts.items():
city_to_data[city] = [ rel ]
else:
to_delete = list()
for city, l in city_to_data.items():
if city in city_relcounts:
l.append(city_relcounts[city])
else:
deleted_cities.add(city)
to_delete.append(city)
for city in to_delete:
del city_to_data[city]
print("Plotting on {:,}".format(len(city_to_data)))
cities = sorted(list(city_to_data.keys()))
for version in range(7):
relcounts = np.array([ city_to_data[city][version] for city in cities])
to_plot.append(relcounts)
means.append(np.mean(relcounts))
fig, ax = plt.subplots()
fig.set_size_inches(15, 15)
ax.set_title('Total relation log counts by year')
ax.boxplot(to_plot, labels=list(range(2010,2017)), meanline=True)
means.insert(0,0)
ax.plot(range(8), means, 'b-', color='gray')
ax2 = ax.twinx()
ax2.plot(list(map(math.log,dbpedia_rels)), linestyle=(0,(5,10)), color='black', linewidth=3.0)
plt.savefig("ch7_cell1_boxplot.pdf", bbox_inches='tight', dpi=300)
plt
Interestingly, the average number of relations has not increased as much as DBpedia as a whole has grown.
Let us see the behaviour of this number relative to the last year (Cell #2).
# CELL 2
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
city_col = 0
rel_col = 1
to_plot = list()
means = list()
city_to_data = None
deleted_cities = set()
for version in range(7):
datafile = "dbpedia1" + str(version) + "_dev_conservative.tsv"
if version == 6:
datafile = "ch6_cell32_dev_feat_conservative.tsv"
city_relcounts = dict()
with open(datafile) as f:
first = True
for line in f:
fields = line.split("\t")
if first:
first = False
if fields[city_col] != 'name':
raise Exception("Expected 'name', got " + fields[city_col])
if fields[rel_col] != 'rel#count':
raise Exception("Expected 'rel#count', got " + fields[rel_col])
else:
city_relcounts[fields[city_col]] = float(fields[rel_col])
if city_to_data is None:
city_to_data = dict()
for city, rel in city_relcounts.items():
city_to_data[city] = [ rel ]
else:
to_delete = list()
for city, l in city_to_data.items():
if city in city_relcounts:
l.append(city_relcounts[city])
else:
deleted_cities.add(city)
to_delete.append(city)
for city in to_delete:
del city_to_data[city]
print("Plotting on {:,}".format(len(city_to_data)))
cities = sorted(list(city_to_data.keys()))
for version in range(7):
relcounts = np.array([ city_to_data[city][version] - city_to_data[city][-1] for city in cities ])
to_plot.append(relcounts)
means.append(np.mean(relcounts))
fig, ax = plt.subplots()
fig.set_size_inches(15, 15)
ax.set_title('Difference relation log counts to 2016 by year')
ax.boxplot(to_plot, labels=list(range(2010,2017)), meanline=True)
means.insert(0,0)
ax.plot(range(8), means, 'b-', color='gray')
plt.savefig("ch7_cell2_boxplot.pdf", bbox_inches='tight', dpi=300)
plt
Let us now take a look at how many feature values remain equal to their current-year (2016) values, year by year (Cell #3).
# CELL 3
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
city_col = 0
to_plot = list()
means = list()
city_to_data = None
deleted_cities = set()
for version in range(7):
datafile = "dbpedia1" + str(version) + "_dev_conservative.tsv"
if version == 6:
datafile = "ch6_cell32_dev_feat_conservative.tsv"
this_city_data = dict()
with open(datafile) as f:
next(f) # drop header
for line in f:
fields = line.split("\t")
this_city_data[fields[city_col]] = list(map(float,fields[1:]))
if city_to_data is None:
city_to_data = dict()
for city, data in this_city_data.items():
city_to_data[city] = [ data ]
else:
to_delete = list()
for city, l in city_to_data.items():
if city in this_city_data:
l.append(this_city_data[city])
else:
deleted_cities.add(city)
to_delete.append(city)
for city in to_delete:
del city_to_data[city]
print("Plotting on {:,}".format(len(city_to_data)))
cities = sorted(list(city_to_data.keys()))
for version in range(7):
equal_counts = list()
for city in cities:
equal_count = 0
for idx in range(len(city_to_data[city][-1])):
if city_to_data[city][version][idx] == city_to_data[city][-1][idx]:
equal_count += 1
equal_counts.append(equal_count)
equal_counts = np.array(equal_counts)
to_plot.append(equal_counts)
means.append(np.mean(equal_counts))
fig, ax = plt.subplots()
fig.set_size_inches(15, 15)
ax.set_title('Equal features to 2016 by year')
ax.boxplot(to_plot, labels=list(range(2010,2017)), meanline=True)
means.insert(0,0)
ax.plot(range(8), means, 'b-', color='gray')
plt.savefig("ch7_cell3_boxplot.pdf", bbox_inches='tight', dpi=300)
plt
The trend is similar to the one observed for the relation count alone. This data seems reasonable for trying lagged features, but temporal smoothing using a sliding window might provide better results given the bouncy behaviour of the data.
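As a toy illustration (not a book cell), here is the difference between taking lagged features and smoothing with a length-3 sliding window on a single made-up yearly series:
import numpy as np
series = np.array([10.0, 12.0, 9.0, 15.0, 14.0, 30.0])  # one feature, years 2010..2015
lag1, lag2 = series[-1], series[-2]                      # lagged features: the two most recent years
smoothed = np.convolve(series, np.ones(3) / 3, mode='valid')  # 3-year moving average
print(lag1, lag2)  # 30.0 14.0
print(smoothed)    # [10.33 12.   12.67 19.67]; the bouncy jumps are dampened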
To conclude the EDA, let us take a look at how the target variable, population, differs from its current (2016) value across the years (Cell #4).
# CELL 4
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
city_col = 0
pop_col = 99
to_plot = list()
means = list()
city_to_data = None
deleted_cities = set()
for version in range(7):
datafile = "dbpedia1" + str(version) + "_dev_conservative.tsv"
if version == 6:
datafile = "ch6_cell32_dev_feat_conservative.tsv"
city_pop = dict()
with open(datafile) as f:
first = True
for line in f:
fields = line.strip().split("\t")
if first:
first = False
if fields[city_col] != 'name':
raise Exception("Expected 'name', got " + fields[city_col])
if fields[pop_col] != 'population':
raise Exception("Expected 'population', got " + fields[pop_col])
else:
city_pop[fields[city_col]] = float(fields[pop_col])
if city_to_data is None:
city_to_data = dict()
for city, rel in city_pop.items():
city_to_data[city] = [ rel ]
else:
to_delete = list()
for city, l in city_to_data.items():
if city in city_pop and city_pop[city] > 0:
l.append(city_pop[city])
else:
deleted_cities.add(city)
to_delete.append(city)
for city in to_delete:
del city_to_data[city]
print("Plotting on {:,}".format(len(city_to_data)))
cities = sorted(list(city_to_data.keys()))
for version in range(7):
pops = np.array([ city_to_data[city][-1] - city_to_data[city][version] for city in cities])
nonzero = np.count_nonzero(pops)
print("Year {}, different: {:,} ({:%})".format(2010+version, nonzero, nonzero / len(pops)))
to_plot.append(pops)
means.append(np.mean(pops))
fig, ax = plt.subplots()
fig.set_size_inches(15, 15)
ax.set_title('Difference in log population to 2016 by year')
ax.boxplot(to_plot, labels=list(range(2010,2017)), meanline=True)
means.insert(0,0)
ax.plot(range(8), means, 'b-', color='gray')
plt.savefig("ch7_cell4_boxplot.pdf", bbox_inches='tight', dpi=300)
plt
Interestingly, the population numbers have changed between 2010 and 2016, even though most countries run a census only once every ten years. Also, restricting to the cities with known population for all the years does not show the bump at 2015 observed in the previous plots. Initially, I expected the target population to be so stable that giving its historical values to the ML algorithm would constitute a target leak. While the historical number might not be exactly the current one, the box plot shows it is on average the same, with some marked outliers.
Let's try to study the missing values now.
There are two sources of missing values that need to be addressed: whole feature columns that are empty for a given year (due to changes in the ontology, as discussed above) and cities that are absent from a given year's extraction.
I will detect these two cases, adding an indicator feature for each cell based on these two sources and saving the result for further imputation (Cell #5).
# CELL 5
import random
import math
import numpy as np
from collections import OrderedDict
rand = random.Random(42)
# get data sizes
VERSIONS=6 # + 1 for current
with open("ch6_cell32_dev_feat_conservative.tsv") as f:
FEATS = len(next(f).split('\t')) - 1 # name
CITIES = 0
for _ in f:
CITIES += 1
# load data
CITY_COL = 0
header_line = None
city_names = OrderedDict() # name to idx
table = np.zeros( (CITIES, FEATS * 2 * VERSIONS + FEATS) )
with open("ch6_cell32_dev_feat_conservative.tsv") as f:
header_line = next(f)
city_idx = 0
for line in f:
fields = line.strip().split("\t")
floats = list(map(float,fields[1:]))
for idx, val in enumerate(floats):
if math.isnan(val):
floats[idx] = 0.0
table[city_idx,-FEATS:] = floats
city_names[fields[CITY_COL]] = city_idx
city_idx += 1
header = header_line.strip().split("\t")
year_defined = { city: set() for city in city_names }
for version in range(VERSIONS):
with open("dbpedia1" + str(version) + "_dev_conservative.tsv") as f:
header_line = next(f)
for line in f:
fields = line.strip().split("\t")
city = fields[CITY_COL]
city_idx = city_names.get(city, None)
if city_idx is None:
continue
year_defined[city].add(version)
floats = list(map(float,fields[1:]))
for idx, val in enumerate(floats):
table[city_idx,FEATS * 2 * version + 2 * idx] = 1.0 # defined
if math.isnan(val):
table[city_idx,FEATS * 2 * version + 2 * idx + 1] = 0.0
else:
table[city_idx,FEATS * 2 * version + 2 * idx + 1] = val
# column-based missing values: determine empty/constant columns
empty_cols_per_year = list()
for version in range(VERSIONS):
empty_cols = 0
for col in range(FEATS):
value = None
constant = True
for city_idx in range(CITIES):
this_val = table[city_idx, 2 * version * FEATS + 2 * col + 1]
if value is None:
value = this_val
elif value != this_val:
constant = False
break
if constant:
empty_cols += 1
table[:, 2 * version * FEATS + 2 * col] = 0.0 # missing
empty_cols_per_year.append(empty_cols)
print("Total empty cols {:,}".format(sum(empty_cols_per_year)))
for version in range(6):
print("Empty cols for {}: {}".format(2010+version, empty_cols_per_year[version]))
print("Working on {:,} cities".format(CITIES))
# row-based imputation: determine empty years
empty_rows_per_year = list()
for version in range(6):
empty_rows = 0
for city, city_idx in city_names.items():
if version not in year_defined[city] or \
all(map(lambda x:table[city_idx, 2*version * FEATS + 2 * x + 1] == 0.0, range(FEATS))):
empty_rows += 1
for col in range(FEATS):
table[city_idx, 2 * version * FEATS + 2 * col] = 0.0
empty_rows_per_year.append(empty_rows)
print("Total empty rows {:,}".format(sum(empty_rows_per_year)))
for version in range(6):
print("Empty rows for {}: {}".format(2010+version, empty_rows_per_year[version]))
# write feature table
missing_cells = 0
total_cells = 0
with open("ch7_cell5_dev_missing.tsv", "w") as w:
w.write('name')
for version in range(VERSIONS):
for featname in header[1:]:
w.write("\t{}_{}?is_defined".format(2010+version, featname))
w.write("\t{}_{}".format(2010+version, featname))
w.write('\t' + '\t'.join(header[1:]) + "\n")
for city, city_idx in city_names.items():
w.write(city + "\t" + "\t".join(map(str, table[city_idx,:])) + "\n")
total_cells += FEATS * VERSIONS
for version in range(VERSIONS):
for col in range(FEATS):
if table[city_idx, 2 * version * FEATS + 2 * col] == 0.0:
missing_cells += 1
print("Total cells: {:,} missing: {:,} ({:%})".format(total_cells, missing_cells, missing_cells / total_cells))
There is a significant amount of missing values. Let us now see if we can visualize individual entries. For visualization, I'll use a heatmap, discretizing each feature into up to six bins using k-means clustering. The top three clusters will be shown in green, the bottom three in blue. Missing values will be shown in red.
The graph aligns the features for each year so they can be followed over time, uncovering the story behind the data (Cell #6).
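A minimal sketch of the discretization step on one made-up feature column (this is what Cell #6 below does for every feature, assuming scikit-learn is available):
import numpy as np
from sklearn.cluster import KMeans
values = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 2.0, 2.1, 5.0, 5.2, 9.0])
centres = np.sort(KMeans(n_clusters=6, random_state=1).fit(values.reshape(-1, 1)).cluster_centers_.reshape(-1))
bins = np.abs(values[:, None] - centres[None, :]).argmin(axis=1)
print(centres)  # the per-feature "boundaries" used for colouring
print(bins)     # bin per value: lower bins are drawn in blue, upper bins in green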
# CELL 6
import random
import math
import numpy as np
from sklearn.cluster import KMeans
rand = random.Random(42)
header = None
data = list()
with open("ch7_cell5_dev_missing.tsv") as f:
header = next(f).strip().split('\t')
header.pop(0) # name
for line in f:
fields = line.strip().split('\t')
name = fields[0]
feats = list(map(float, fields[1:]))
data.append( (feats, feats[-1], name) )
table = np.zeros( (len(data), len(header)) )
for idx, row in enumerate(data):
table[idx] = row[0]
base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
VERSIONS = 6
CITIES = len(data)
# compute boundaries using kMeans
boundaries = list()
for feat in range(FEATS):
col = table[:,2*VERSIONS*FEATS + feat]
vals = set(col)
num_clusters = min(len(vals), 6)
if num_clusters < 6:
boundaries.append(sorted(vals))
else:
kmeans = KMeans(n_clusters=6, random_state=1).fit(col.reshape(-1,1))
boundaries.append(sorted(set(list(kmeans.cluster_centers_.reshape(-1)))))
# pick some cities at random to graph
to_show = rand.sample(list(range(len(data))), 20)
def colorize(feat, feat_val):
num_classes = len(boundaries[feat])
half_classes = num_classes // 2
if half_classes == 0:
half_classes = 1
dists = sorted(map(lambda x: (abs(x[1]-feat_val), x[0]), enumerate(boundaries[feat])))
clazz = dists[0][1]
if clazz >= half_classes:
component = 1
clazz -= half_classes
else:
component = 2
clazz = half_classes - clazz - 1
intensity = None
if clazz == 0:
if half_classes == 3:
intensity = 0.3
elif half_classes == 2:
intensity = 0.6
elif half_classes == 1:
intensity = 1.0
elif clazz == 1:
if half_classes == 3:
intensity = 0.6
elif half_classes == 2:
intensity = 1.0
elif half_classes == 1:
intensity = 1.0
elif clazz == 2:
intensity = 1.0
box_color = [ 0.0, 0.0, 0.0 ]
box_color[component] = intensity
#print(dists, clazz, num_classes, half_classes, box_color)
return box_color
def gray(box_color):
if box_color[0] == 1.0:
col = 0.0
elif box_color[1] == 0.0:
col = box_color[2] / 2
else:
col = 0.5 + box_color[1] / 2
return [ col, col, col ]
HEIGHT=2
WIDTH=2
def shrink_name(feat_name):
if '@' in feat_name:
parts = feat_name.split('@')
return parts[0].split("/")[-1] + "@" + parts[1].split("/")[-1][:-1]
else:
if '/' in feat_name:
return feat_name.split("/")[-1]
return feat_name
img = np.zeros( [ (VERSIONS + 1) * HEIGHT, FEATS * WIDTH, 3 ])
img[:,:,:] = 1.0
from matplotlib import pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 40]
plt.axis('on')
yticks_locs = list(map(lambda x:x*HEIGHT, range(VERSIONS+1)))
yticks_labels = list(map(lambda x:2016-x, range(VERSIONS+1)))
xticks_locs = list(map(lambda x:x*WIDTH, range(FEATS)))
xticks_labels = list(map(lambda x:shrink_name(x), base_features))
plt.yticks(yticks_locs, yticks_labels)
plt.xticks(xticks_locs, xticks_labels, rotation=90)
idx = list(filter(lambda x:x[1][-1].endswith("Mérida,_Yucatán>"), enumerate(data)))[0][0]
xbox = 0
for feat in range(FEATS):
ybox = feat * WIDTH
img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = gray(colorize(feat,
table[idx, 2 * VERSIONS * FEATS + feat]))
for version in range(VERSIONS):
xbox = (VERSIONS - version) * HEIGHT
for feat in range(FEATS):
ybox = feat * WIDTH
box_color = gray(colorize(feat, table[idx, 2 * version * FEATS + 2 * feat + 1]))
# mark imputation
if table[idx, 2 * FEATS * version + 2 * feat] == 0.0:
img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = [ 1.0, 1.0, 1.0 ]
img[ xbox + 1, ybox ] = [ 0.0, 0.0, 0.0 ]
img[ xbox, ybox + 1 ] = [ 0.0, 0.0, 0.0 ]
else:
img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = box_color
print("{}: {} ({:,} people)".format(data[idx][-1].split("/")[-1][:-1],
int(data[idx][1]*1000)/1000.0,
int(10**data[idx][1])))
plt.imshow(img, aspect='equal')
plt.savefig("ch7_cell6_viz.pdf", bbox_inches='tight', dpi=300)
from matplotlib import pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 40]
plt.axis('on')
for img_idx, idx in enumerate(to_show):
plt.subplot(len(to_show), 1, img_idx + 1)
plt.title("{}: {} ({:,} people)".format(data[idx][-1].split("/")[-1][:-1],
int(data[idx][1]*1000)/1000.0,
int(10**data[idx][1])))
img = np.zeros( [ (VERSIONS + 1) * HEIGHT, FEATS * WIDTH, 3 ])
img[:,:,:] = 1.0
xbox = 0
for feat in range(FEATS):
ybox = feat * WIDTH
img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = colorize(feat,
table[idx, 2 * VERSIONS * FEATS + feat])
for version in range(VERSIONS):
xbox = (VERSIONS - version) * HEIGHT
for feat in range(FEATS):
ybox = feat * WIDTH
box_color = colorize(feat, table[idx, 2 * version * FEATS + 2 * feat + 1])
# mark imputation
if table[idx, 2 * FEATS * version + 2 * feat] == 0.0:
img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = [1.0, 0, 0 ]
else:
img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = box_color
plt.imshow(img, aspect='equal')
plt.yticks(yticks_locs, yticks_labels)
plt.xticks(xticks_locs, [] if img_idx < len(to_show) - 1 else xticks_labels, rotation=90)
plt.savefig("ch7_cell6_viz_full.pdf", bbox_inches='tight', dpi=300)
From the figure we can see that the relation counts change from version to version but the target population seems to stay within the same range. There are plenty of missing features, and the colours indicate relative growth over time.
The "seeAlso" features are missing for all historical years; that is a big loss.
Given the large percentage of cities that are not present in all years, we will need a way to impute the missing values.
I will use the closest available year, giving priority to earlier years (Cell #7).
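A minimal sketch of this policy on a single feature timeline (Cell #7 applies the same logic over the whole table): search outward from the missing year, prefer the earlier year at equal distance, and fall back to the current (2016) value once the search reaches it.
def impute_nearest(values, defined, current):
    # values[t] and defined[t] for historical years t = 0..V-1; current is the 2016 value
    V = len(values)
    imputed = list(values)
    for t in range(V):
        if defined[t]:
            continue
        for lag in range(1, V + 1):
            if t - lag >= 0 and defined[t - lag]:   # earlier year wins ties
                imputed[t] = values[t - lag]
                break
            if t + lag < V and defined[t + lag]:    # otherwise the later year
                imputed[t] = values[t + lag]
                break
            if t + lag == V:                        # finally, the current value
                imputed[t] = current
                break
    return imputed
# a feature defined only in 2011 and 2014 (years 2010..2015), with current value 7.0
print(impute_nearest([0, 3.0, 0, 0, 5.0, 0], [False, True, False, False, True, False], 7.0))
# -> [3.0, 3.0, 3.0, 5.0, 5.0, 5.0]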
# CELL 7
import random
import math
import numpy as np
rand = random.Random(42)
header = None
data = list()
# load data
with open("ch7_cell5_dev_missing.tsv") as f:
header = next(f).strip().split('\t')
header.pop(0) # name
for line in f:
fields = line.strip().split('\t')
name = fields[0]
feats = list(map(float, fields[1:]))
data.append( (feats, feats[-1], name) )
table = np.zeros( (len(data), len(header)) )
for idx, row in enumerate(data):
table[idx] = row[0]
base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
VERSIONS = 6
CITIES = len(data)
# impute
back_imputed = 0
forward_imputed = 0
present_imputed = 0
for city_idx in range(CITIES):
for version in range(VERSIONS):
for feat in range(FEATS):
if table[city_idx, 2 * FEATS * version + 2 * feat] == 1.0:
continue
lag = 1
imputed = False
while version + lag <= VERSIONS or version - lag >= 0:
if version - lag >= 0 and table[city_idx, 2 * FEATS * (version - lag) + 2 * feat] == 1.0:
table[city_idx, 2 * FEATS * version + 2 * feat + 1] = \
table[city_idx, 2 * FEATS * (version - lag) + 2 * feat + 1]
back_imputed += 1
imputed = True
break
if version + lag < VERSIONS and table[city_idx, 2 * FEATS * (version + lag) + 2 * feat] == 1.0:
table[city_idx, 2 * FEATS * version + 2 * feat + 1] = \
table[city_idx, 2 * FEATS * (version + lag) + 2 * feat + 1]
forward_imputed += 1
imputed = True
break
if version + lag == VERSIONS: # present
table[city_idx, 2 * FEATS * version + 2 * feat + 1] = \
table[city_idx, 2 * FEATS * (version + lag) + feat]
present_imputed += 1
imputed = True
break
lag += 1
if not imputed:
print(version, feat, data[city_idx][-1])
# write full feature table
imputed_cells = 0
total_cells = 0
with open("ch7_cell7_dev_imputed.tsv", "w") as w:
w.write('name')
w.write('\t' + '\t'.join(header) + "\n")
for city_idx, city in enumerate(map(lambda x:x[-1], data)):
total_cells += FEATS * VERSIONS
w.write(city)
w.write('\t' + '\t'.join(map(str, table[city_idx,:])))
w.write('\n')
imputed_cells = back_imputed + forward_imputed + present_imputed
print("Total cells: {:,} imputed: {:,} ({:%})".format(total_cells, imputed_cells, imputed_cells / total_cells))
print("Backward imputed: {:,} ({:%})".format(back_imputed, back_imputed / imputed_cells))
print("Forward imputed: {:,} ({:%})".format(forward_imputed, forward_imputed / imputed_cells))
print("Present imputed: {:,} ({:%})".format(present_imputed, present_imputed / imputed_cells))
So most features are not present-imputed. That is important because otherwise there would be little value in the historical features.
Let us visualize this (Cell #8).
# CELL 8
import random
import math
import numpy as np
from sklearn.cluster import KMeans
rand = random.Random(42)
header = None
data = list()
with open("ch7_cell7_dev_imputed.tsv") as f:
header = next(f).strip().split('\t')
header.pop(0) # name
for line in f:
fields = line.strip().split('\t')
name = fields[0]
feats = list(map(float, fields[1:]))
data.append( (feats, feats[-1], name) )
table = np.zeros( (len(data), len(header)) )
for idx, row in enumerate(data):
table[idx] = row[0]
base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
VERSIONS = 6
CITIES = len(data)
# compute boundaries using kMeans
boundaries = list()
for feat in range(FEATS):
col = table[:,2*VERSIONS*FEATS + feat]
vals = set(col)
num_clusters = min(len(vals), 6)
if num_clusters < 6:
boundaries.append(sorted(vals))
else:
kmeans = KMeans(n_clusters=6, random_state=1).fit(col.reshape(-1,1))
boundaries.append(sorted(set(list(kmeans.cluster_centers_.reshape(-1)))))
# save boundaries
with open("ch7_cell8_boundaries.tsv", "w") as b:
for idx, boundary_list in enumerate(boundaries):
b.write("{}\t{}\n".format(idx, "\t".join(map(str,boundary_list))))
# pick some cities at random to graph
to_show = rand.sample(list(range(len(data))), 20)
def cell8_colorize(feat, feat_val, boundaries):
num_classes = len(boundaries[feat])
half_classes = num_classes // 2
if half_classes == 0:
half_classes = 1
dists = sorted(map(lambda x: (abs(x[1]-feat_val), x[0]), enumerate(boundaries[feat])))
clazz = dists[0][1]
if clazz >= half_classes:
component = 1
clazz -= half_classes
else:
component = 2
clazz = half_classes - clazz - 1
intensity = None
if clazz == 0:
if half_classes == 3:
intensity = 0.3
elif half_classes == 2:
intensity = 0.6
elif half_classes == 1:
intensity = 1.0
elif clazz == 1:
if half_classes == 3:
intensity = 0.6
elif half_classes == 2:
intensity = 1.0
elif half_classes == 1:
intensity = 1.0
elif clazz == 2:
intensity = 1.0
box_color = [ 0.0, 0.0, 0.0 ]
box_color[component] = intensity
#print(dists, clazz, num_classes, half_classes, box_color)
return box_color
HEIGHT=2
WIDTH=2
def cell8_shrink_name(feat_name):
if '@' in feat_name:
parts = feat_name.split('@')
return parts[0].split("/")[-1] + "@" + parts[1].split("/")[-1][:-1]
else:
if '/' in feat_name:
return feat_name.split("/")[-1]
return feat_name
img = np.zeros( [ (VERSIONS + 1) * HEIGHT, FEATS * WIDTH, 3 ])
img[:,:,:] = 1.0
yticks_locs = list(map(lambda x:x*HEIGHT, range(VERSIONS+1)))
yticks_labels = list(map(lambda x:2016-x, range(VERSIONS+1)))
xticks_locs = list(map(lambda x:x*WIDTH, range(FEATS)))
xticks_labels = list(map(lambda x:cell8_shrink_name(x), base_features))
from matplotlib import pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 40]
def cell8_draw_features(plot, row, title, versions, boundaries,
yticks_locs, yticks_labels, xticks_locs, xticks_labels):
if title is not None:
plot.title(title)
img = np.zeros( [ (versions + 1) * HEIGHT, FEATS * WIDTH, 3 ])
img[:,:,:] = 1.0
xbox = 0
for feat in range(FEATS):
ybox = feat * WIDTH
img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = \
cell8_colorize(feat, row[2 * versions * FEATS + feat], boundaries)
for version in range(versions):
xbox = (versions - version) * HEIGHT
for feat in range(FEATS):
ybox = feat * WIDTH
box_color = cell8_colorize(feat, row[2 * version * FEATS + 2 * feat + 1], boundaries)
# mark imputation
if row[2 * FEATS * version + 2 * feat] == 0.0:
img[ xbox:(xbox + HEIGHT // 2), ybox: (ybox + WIDTH) ] = [1.0, 0, 0 ]
img[ (xbox + HEIGHT // 2):(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = box_color
else:
img[ xbox:(xbox + HEIGHT), ybox: (ybox + WIDTH) ] = box_color
plot.imshow(img, aspect='equal')
plot.yticks(yticks_locs, yticks_labels)
plot.xticks(xticks_locs, xticks_labels, rotation=90)
plt.axis('on')
for img_idx, idx in enumerate(to_show):
plt.subplot(len(to_show), 1, img_idx + 1)
cell8_draw_features(plt, table[idx,:], "{}: {} ({:,} people)".format(data[idx][-1].split("/")[-1][:-1],
int(table[idx,-1]*1000)/1000.0,
int(10**table[idx,-1])),
VERSIONS, boundaries,
yticks_locs, yticks_labels, xticks_locs,
[] if img_idx < len(to_show) - 1 else xticks_labels)
plt.savefig("ch7_cell8_viz_full.pdf", bbox_inches='tight', dpi=300)
The imputation seems to fill in the values quite well. Watertown seems to have some surprises in the center. We are now ready for the first featurization.
For the first featurization, I will add the historical versions of each feature for the last two DBpedia versions (Cell #9). Due to the large number of features, I will switch to random forests (RFs).
# CELL 9
import random
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np
rand = random.Random(42)
header = None
data = list()
with open("ch7_cell7_dev_imputed.tsv") as f:
header = next(f).strip().split('\t')
header.pop(0) # name
pop_locs = list(reversed(list(map(lambda x:x[0],
filter(lambda x:
x[1].endswith("population") or
x[1].endswith("population?is_defined"), enumerate(header))))))
for del_loc in pop_locs:
del header[del_loc]
for line in f:
fields = line.strip().split('\t')
name = fields[0]
pop = float(fields[-1])
feats = list(map(float, fields[1:]))
for del_loc in pop_locs:
del feats[del_loc]
data.append( (feats, pop, name) )
base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
VERSIONS = 6
CITIES = len(data)
# assemble train/test for a given lag
PARAM_LAG = 2
train_data = list()
test_data = list()
print("Features: {:,}".format(FEATS * PARAM_LAG * 2 + FEATS))
full_header = header
header = list()
for lag in range(PARAM_LAG):
header.extend(full_header[2*(VERSIONS-lag-1)*FEATS: 2*(VERSIONS-lag)*FEATS])
header.extend(full_header[2*VERSIONS*FEATS:])
for full_row in data:
feats = list()
for lag in range(PARAM_LAG):
feats.extend(full_row[0][2*(VERSIONS-lag-1)*FEATS: 2*(VERSIONS-lag)*FEATS])
feats.extend(full_row[0][2*VERSIONS*FEATS:])
row = (feats, full_row[1], full_row[-1])
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
# write feature set
with open("ch7_cell9_dev_feat1.tsv", "w") as w:
w.write('name\t' + '\t'.join(header) + '\tpopulation\n')
all_data = list(train_data)
all_data.extend(test_data)
for row in all_data:
w.write("{}\t{}\t{}\n".format(row[-1], '\t'.join(map(str, row[0])), row[1]))
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
# train
print("Training on {:,} cities".format(len(xtrain)))
rfr = RandomForestRegressor(max_features=1.0, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rfr.fit(xtrain, ytrain)
ytest_pred = rfr.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)
print("Writing files for error analysis")
np.savez_compressed("ch7_cell9_feat1_rf.npz",
xtrain=xtrain, ytrain=ytrain,
xtest=xtest, ytest=ytest,
ytest_pred=ytest_pred)
import pickle
with open("ch7_cell9_model.pk", "wb") as pkl:
pickle.dump(rfr, pkl)
with open("ch7_cell9_test_names.tsv", "w") as names:
for idx, name in enumerate(test_names):
names.write("{}\t{}\n".format(idx, name))
rfr=None # free memory
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch7_cell9_rf_feat1.pdf", bbox_inches='tight', dpi=300)
plt.legend()
So this barely improves over the SVR baseline (0.3578). Let us see the RMSE when using other lags in Cell #10.
# CELL 10
import random
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np
header = None
data = list()
with open("ch7_cell7_dev_imputed.tsv") as f:
header = next(f).strip().split('\t')
header.pop(0) # name
pop_locs = list(reversed(list(map(lambda x:x[0],
filter(lambda x:
x[1].endswith("population") or
x[1].endswith("population?is_defined"), enumerate(header))))))
for del_loc in pop_locs:
del header[del_loc]
for line in f:
fields = line.strip().split('\t')
name = fields[0]
pop = float(fields[-1])
feats = list(map(float, fields[1:]))
for del_loc in pop_locs:
del feats[del_loc]
data.append( (feats, pop, name) )
base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
VERSIONS = 6
CITIES = len(data)
for lag in range(0, VERSIONS):
train_data = list()
test_data = list()
rand = random.Random(42)
for full_row in data:
feats = list()
for l in range(lag):
feats.extend(full_row[0][2*(VERSIONS-l-1)*FEATS: 2*(VERSIONS-l)*FEATS])
feats.extend(full_row[0][2*VERSIONS*FEATS:])
row = (feats, full_row[1], full_row[-1])
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
# train
rfr = RandomForestRegressor(max_features=1.0, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rfr.fit(xtrain, ytrain)
ytest_pred = rfr.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("Lag: {}, Features: {:4d}, RMSE RF: {}".format(lag, FEATS * lag * 2 + FEATS, RMSE))
rfr=None # free memory
Before doing Error Analysis, let's look at computing differences (Cell #11).
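Concretely, for each base feature $f$ and each of the most recent $lag$ historical versions $t$, Cell #11 below uses the difference $x^{(2016)}_f - x^{(t)}_f$ in place of the raw historical value (and its indicator), keeping the current-year features unchanged alongside them.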
# CELL 11
import random
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np
header = None
data = list()
with open("ch7_cell7_dev_imputed.tsv") as f:
header = next(f).strip().split('\t')
header.pop(0) # name
pop_locs = list(reversed(list(map(lambda x:x[0],
filter(lambda x:
x[1].endswith("population") or
x[1].endswith("population?is_defined"), enumerate(header))))))
for del_loc in pop_locs:
del header[del_loc]
for line in f:
fields = line.strip().split('\t')
name = fields[0]
pop = float(fields[-1])
feats = list(map(float, fields[1:]))
for del_loc in pop_locs:
del feats[del_loc]
data.append( (feats, pop, name) )
base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
VERSIONS = 6
CITIES = len(data)
for lag in range(0, VERSIONS):
train_data = list()
test_data = list()
rand = random.Random(42)
for full_row in data:
feats = list()
for l in range(lag):
for feat in range(FEATS):
feats.append(full_row[0][2*VERSIONS*FEATS+feat] - full_row[0][2*(VERSIONS-l-1)*FEATS+2*feat+1])
feats.extend(full_row[0][2*VERSIONS*FEATS:])
row = (feats, full_row[1], full_row[-1])
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
# train
rfr = RandomForestRegressor(max_features=1.0, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rfr.fit(xtrain, ytrain)
ytest_pred = rfr.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("Lag: {}, Features: {:4d}, RMSE RF: {}".format(lag, FEATS * lag * 2 + FEATS, RMSE))
rfr=None # free memory
The absolute differences might be a problem, and scaling the differences will not work either, as many features are zero. Let us drill down.
I'll use the feature heatmaps to see which cities did worse with the lags than without them (Cell #12).
# CELL 12
import random
import math
import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# load
arrays = np.load("ch7_cell9_feat1_rf.npz")
xtrain = arrays['xtrain']
ytrain = arrays['ytrain']
xtest = arrays['xtest']
ytest = arrays['ytest']
ytest_pred = arrays['ytest_pred']
with open("ch7_cell9_dev_feat1.tsv") as f:
header = next(f).split('\t')
del header[0] # name
header.pop() # pop
base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
LAGS = 2
test_names = list()
with open("ch7_cell9_test_names.tsv") as names:
for line in names:
test_names.append(line.split("\t")[-1])
# train on base features only
xtrain_base = xtrain[:,2*LAGS*FEATS:]
xtest_base = xtest[:,2*LAGS*FEATS:]
rfr = RandomForestRegressor(max_features=1.0, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rfr.fit(xtrain_base, ytrain)
ytest_pred_base = rfr.predict(xtest_base)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
RMSE_base = math.sqrt(sum((ytest - ytest_pred_base)**2) / len(ytest))
print("RMSE lag 2: {}, RMSE base: {}".format(RMSE, RMSE_base))
rfr=None # free memory
# find the 30 cities the lag features hurt the most and the 10 they helped the most
sqe_lag2 = (ytest - ytest_pred) ** 2
sqe_base = (ytest - ytest_pred_base) ** 2
diff = sqe_base - sqe_lag2
idx_diff = sorted(list(zip(diff, range(len(diff)))))
# visualize
boundaries = list()
with open("ch7_cell8_boundaries.tsv") as b:
for line in b:
boundaries.append(list(map(float, line.split("\t")[1:])))
yticks_locs = list(map(lambda x:x*HEIGHT, range(LAGS+1)))
yticks_labels = list(map(lambda x:2016-x, range(LAGS+1)))
xticks_locs = list(map(lambda x:x*WIDTH, range(FEATS)))
xticks_labels = list(map(lambda x:cell8_shrink_name(x), base_features))
from matplotlib import pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 40]
plt.axis('on')
for idx in range(30):
city_idx = idx_diff[idx][1]
plt.subplot(40, 1, idx + 1)
row = list(xtest[city_idx,:])
row.append(ytest[city_idx])
cell8_draw_features(plt, row,
"WORSE> {}. Pop: {:,}, lag-2: {:,} base: {:,}"
.format(test_names[city_idx].strip().split("/")[-1][:-1],
int(10**row[-1]), int(10**ytest_pred[city_idx]),
int(10**ytest_pred_base[city_idx])),
LAGS, boundaries, yticks_locs, yticks_labels, xticks_locs, [])
for idx in range(10):
city_idx = idx_diff[len(idx_diff) - idx - 1][1]
plt.subplot(40, 1, 30 + idx + 1)
row = list(xtest[city_idx,:])
row.append(ytest[city_idx])
cell8_draw_features(plt, row,
"BEST> {}. Pop: {:,}, lag-2: {:,} base: {:,}"
.format(test_names[city_idx].strip().split("/")[-1][:-1],
int(10**row[-1]), int(10**ytest_pred[city_idx]),
int(10**ytest_pred_base[city_idx])),
LAGS, boundaries, yticks_locs, yticks_labels, xticks_locs,
[] if idx < 9 else xticks_labels)
plt.savefig("ch7_cell12_viz_full.pdf", bbox_inches='tight', dpi=300)
From the error analysis, it seems regressions on the first column might be a problem. Smoothing that behaviour could work.
Let's start with a simple moving average (Cell #13).
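In formula terms, Cell #13 replaces each historical value by the trailing average of a three-version window, $\tilde{x}_t = (x_{t-2} + x_{t-1} + x_t)/3$ for $t$ from 2012 to 2015, and then drops the two earliest versions (2010 and 2011) from the output file.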
# CELL 13
import random
import math
import numpy as np
from sklearn.ensemble import RandomForestRegressor
rand = random.Random(42)
header = None
data = list()
with open("ch7_cell7_dev_imputed.tsv") as f:
header = next(f).strip().split('\t')
header.pop(0) # name
for line in f:
fields = line.strip().split('\t')
name = fields[0]
pop = float(fields[-1])
feats = list(map(float, fields[1:]))
data.append( (feats, pop, name) )
base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
VERSIONS = 6
CITIES = len(data)
table = np.zeros( (len(data), len(header)) )
for idx, row in enumerate(data):
table[idx] = row[0]
# SMA
smoothing_square = 0
smoothed_cells = 0
for city_idx in range(CITIES):
for feat in range(FEATS):
sma = 0
for version in range(3):
sma += table[city_idx, 2 * FEATS * version + 2 * feat + 1] / 3.0
for version in range(3, VERSIONS + 1):
diff = table[city_idx, 2 * FEATS * (version - 1) + 2 * feat + 1] - sma
smoothed_cells += 1
smoothing_square += diff * diff
table[city_idx, 2 * FEATS * (version - 1) + 2 * feat + 1] = sma
if version < VERSIONS:
sma -= table[city_idx, 2 * FEATS * (version - 3) + 2 * feat + 1] / 3.0
sma += table[city_idx, 2 * FEATS * version + 2 * feat + 1] / 3.0
# write full feature table
with open("ch7_cell13_dev_sma.tsv", "w") as w:
w.write('name\t' + '\t'.join(header[2*FEATS*2:]) + "\n")
for city_idx, city in enumerate(map(lambda x:x[-1], data)):
w.write(city + '\t' + '\t'.join(map(str, table[city_idx,2*FEATS*2:])) + '\n')
print("Smoothed {:,} cells, root mean squared smoothing: {:,}".format(
smoothed_cells, math.sqrt(smoothing_square/smoothed_cells)))
That's... a lot of smoothing. Now let's try again with lag-2 over the SMA for the second featurization (Cell #14).
# CELL 14
import random
import math
import numpy as np
from sklearn.ensemble import RandomForestRegressor
rand = random.Random(42)
header = None
data = list()
with open("ch7_cell13_dev_sma.tsv") as f:
header = next(f).strip().split('\t')
header.pop(0) # name
pop_locs = list(reversed(list(map(lambda x:x[0],
filter(lambda x:
x[1].endswith("population") or
x[1].endswith("population?is_defined"), enumerate(header))))))
for del_loc in pop_locs:
del header[del_loc]
for line in f:
fields = line.strip().split('\t')
name = fields[0]
pop = float(fields[-1])
feats = list(map(float, fields[1:]))
for del_loc in pop_locs:
del feats[del_loc]
data.append( (feats, pop, name) )
base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
VERSIONS = 4
CITIES = len(data)
# assemble train/test for a given lag
PARAM_LAG = 2
train_data = list()
test_data = list()
print("Features: {:,}".format(FEATS * PARAM_LAG * 2 + FEATS))
full_header = header
header = list()
for lag in range(PARAM_LAG):
header.extend(full_header[2*(VERSIONS-lag-1)*FEATS: 2*(VERSIONS-lag)*FEATS])
header.extend(full_header[2*VERSIONS*FEATS:])
for full_row in data:
feats = list()
for lag in range(PARAM_LAG):
feats.extend(full_row[0][2*(VERSIONS-lag-1)*FEATS: 2*(VERSIONS-lag)*FEATS])
feats.extend(full_row[0][2*VERSIONS*FEATS:])
row = (feats, full_row[1], full_row[-1])
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
# write feature set
with open("ch7_cell14_dev_feat2.tsv", "w") as w:
w.write('name\t' + '\t'.join(header) + '\tpopulation\n')
all_data = list(train_data)
all_data.extend(test_data)
for row in all_data:
w.write("{}\t{}\t{}\n".format(row[-1], '\t'.join(map(str, row[0])), row[1]))
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
# train
print("Training on {:,} cities".format(len(xtrain)))
rfr = RandomForestRegressor(max_features=1.0, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rfr.fit(xtrain, ytrain)
ytest_pred = rfr.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)
rfr=None # free memory
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch7_cell14_rf_feat2.pdf", bbox_inches='tight', dpi=300)
plt.legend()
It got worse, which is not surprising given the level of smoothing. Let's try a different smoothing.
I will now use an exponential moving average from the present into the past, so I will combine smoothing with imputation (Cell #15).
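In formula terms, with $\alpha = 0.2$ (PARAM_EMA_WEIGHT in Cell #15) and walking backwards from the current value, $\tilde{x}_{2016} = x_{2016}$ and $\tilde{x}_t = \alpha x_t + (1-\alpha)\,\tilde{x}_{t+1}$ when $x_t$ is defined; a missing $x_t$ simply inherits $\tilde{x}_t = \tilde{x}_{t+1}$, which is how the smoothing doubles as imputation.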
# CELL 15
import random
import math
import numpy as np
header = None
data = list()
# load data
with open("ch7_cell5_dev_missing.tsv") as f:
header = next(f).strip().split('\t')
header.pop(0) # name
for line in f:
fields = line.strip().split('\t')
name = fields[0]
feats = list(map(float, fields[1:]))
data.append( (feats, feats[-1], name) )
table = np.zeros( (len(data), len(header)) )
for idx, row in enumerate(data):
table[idx] = row[0]
base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
VERSIONS = 6
CITIES = len(data)
# impute and EMA
PARAM_EMA_WEIGHT = 0.2 # history is worth 80%, strong smoothing
smoothing_square = 0
smoothed_cells = 0
missed_imputed = 0
for city_idx in range(CITIES):
for feat in range(FEATS):
ema = table[city_idx, 2 * FEATS * VERSIONS + feat]
for upversion in range(VERSIONS):
version = VERSIONS - upversion - 1
smoothed = ema
if table[city_idx, 2 * FEATS * version + 2 * feat] == 1.0:
value = table[city_idx, 2 * FEATS * version + 2 * feat + 1]
smoothed = PARAM_EMA_WEIGHT * value + (1.0 - PARAM_EMA_WEIGHT) * ema
smoothed_cells += 1
diff = value - smoothed
smoothing_square += diff * diff
else:
missed_imputed += 1
table[city_idx, 2 * FEATS * version + 2 * feat + 1] = smoothed
ema = smoothed
# write full feature table
imputed_cells = 0
total_cells = 0
with open("ch7_cell15_dev_ema_imputed.tsv", "w") as w:
w.write('name')
w.write('\t' + '\t'.join(header) + "\n")
for city_idx, city in enumerate(map(lambda x:x[-1], data)):
w.write(city)
w.write('\t' + '\t'.join(map(str, table[city_idx,:])))
w.write('\n')
print("Total cells: {:,} missed imputed: {:,}".format(table.shape[0]*table.shape[1], missed_imputed))
print("Smoothed {:,} cells, root mean squared smoothing: {:,}".format(
smoothed_cells, math.sqrt(smoothing_square/smoothed_cells)))
And now with a lag-2 RF (Cell #16).
# CELL 16
import random
import math
import numpy as np
from sklearn.ensemble import RandomForestRegressor
rand = random.Random(42)
header = None
data = list()
with open("ch7_cell15_dev_ema_imputed.tsv") as f:
header = next(f).strip().split('\t')
header.pop(0) # name
pop_locs = list(reversed(list(map(lambda x:x[0],
filter(lambda x:
x[1].endswith("population") or
x[1].endswith("population?is_defined"), enumerate(header))))))
for del_loc in pop_locs:
del header[del_loc]
for line in f:
fields = line.strip().split('\t')
name = fields[0]
pop = float(fields[-1])
feats = list(map(float, fields[1:]))
for del_loc in pop_locs:
del feats[del_loc]
data.append( (feats, pop, name) )
base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
VERSIONS = 6
CITIES = len(data)
# assemble train/test for a given lag
PARAM_LAG = 2
train_data = list()
test_data = list()
print("Features: {:,}".format(FEATS * PARAM_LAG * 2 + FEATS))
full_header = header
header = list()
for lag in range(PARAM_LAG):
header.extend(full_header[2*(VERSIONS-lag-1)*FEATS: 2*(VERSIONS-lag)*FEATS])
header.extend(full_header[2*VERSIONS*FEATS:])
for full_row in data:
feats = list()
for lag in range(PARAM_LAG):
feats.extend(full_row[0][2*(VERSIONS-lag-1)*FEATS: 2*(VERSIONS-lag)*FEATS])
feats.extend(full_row[0][2*VERSIONS*FEATS:])
row = (feats, full_row[1], full_row[-1])
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
# write feature set
with open("ch7_cell16_dev_feat3.tsv", "w") as w:
w.write('name\t' + '\t'.join(header) + '\tpopulation\n')
all_data = list(train_data)
all_data.extend(test_data)
for row in all_data:
w.write("{}\t{}\t{}\n".format(row[-1], '\t'.join(map(str, row[0])), row[1]))
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
# train
print("Training on {:,} cities".format(len(xtrain)))
rfr = RandomForestRegressor(max_features=1.0, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rfr.fit(xtrain, ytrain)
ytest_pred = rfr.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)
rfr = None # free memory
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch7_cell16_rf_feat3.pdf", bbox_inches='tight', dpi=300)
plt.legend()
This improves a tiny bit, but it seems there is little value in the time dimension for this data set.
Finally, let's try using the extra rows as more training data (Cell #17).
# CELL 17
import random
import math
import numpy as np
from sklearn.ensemble import RandomForestRegressor
rand = random.Random(42)
header = None
data = list()
with open("ch7_cell7_dev_imputed.tsv") as f:
header = next(f).strip().split('\t')
header.pop(0) # name
pop_locs = list(reversed(list(map(lambda x:x[0],
filter(lambda x:
x[1].endswith("population") or
x[1].endswith("population?is_defined"), enumerate(header))))))
for del_loc in pop_locs:
del header[del_loc]
for line in f:
fields = line.strip().split('\t')
name = fields[0]
pop = float(fields[-1])
feats = list(map(float, fields[1:]))
for del_loc in pop_locs:
del feats[del_loc]
data.append( (feats, pop, name) )
base_features=list(filter(lambda x: not x.startswith("201"), header))
FEATS = len(base_features)
VERSIONS = 6
CITIES = len(data)
full_train_data = list()
full_test_data = list()
print("Features: {:,}".format(FEATS))
# stratify
for row in data:
if rand.random() < 0.2:
full_test_data.append(row)
else:
full_train_data.append(row)
train_data = list()
test_data = list()
for row in full_train_data:
feats = row[0]
# current
train_data.append( (feats[2*FEATS*VERSIONS:], row[1], row[2]) )
for version in range(VERSIONS):
historical_feats = list()
for feat in range(FEATS):
if feats[2 * FEATS * version + 2 * feat] == 1.0:
historical_feats.append(feats[2 * FEATS * version + 2 * feat + 1])
else: # use current instead of imputed
historical_feats.append(feats[2 * FEATS * VERSIONS + feat])
# historical
train_data.append( (historical_feats, row[1], row[2]) )
for row in full_test_data:
# test only on current
test_data.append( (row[0][2*FEATS*VERSIONS:], row[1], row[2]) )
# write feature set
with open("ch7_cell17_dev_feat4.tsv", "w") as w:
w.write('name\t' + '\t'.join(header[-FEATS:]) + '\tpopulation\n')
all_data = list(train_data)
all_data.extend(test_data)
for row in all_data:
w.write("{}\t{}\t{}\n".format(row[-1], '\t'.join(map(str, row[0])), row[1]))
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
# train
print("Training on {:,} cities".format(len(xtrain)))
rfr = RandomForestRegressor(max_features=1.0, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rfr.fit(xtrain, ytrain)
ytest_pred = rfr.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)
rfr = None # free memory
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch7_cell17_rf_feat4.pdf", bbox_inches='tight', dpi=300)
plt.legend()
That increased the available training material roughly seven-fold (one training row per city for each available year, instead of a single current-year row). However, things got much worse.
At this stage, I'll conclude the exercise; there are several things left for the reader to try.
With the current experimental results I would not advise using the historical data for this task. If I were forced to pick a high-performance feature set, it would be the imputed version using lag-5. A conservative feature set would just add the EMA smoothing of the historical relation counts (and nothing else).
We now turn to the country population prediction task mentioned at the beginning of the chapter, which uses World Bank population data as the time series. Let us start by identifying all entities of type http://dbpedia.org/ontology/Country (Cell #18).
# CELL 18
import re
import bz2
cell18_triple_re = re.compile("(<[^>]+>)\s(<[^>]+>)\s(.*) \.")
def cell18_compressed_triples_gen(filename):
with bz2.BZ2File(filename,"r") as compressed:
for byteline in compressed:
if byteline[0] == ord('#'):
continue
line = byteline.decode("utf-8")
s,v,o = cell18_triple_re.match(line).groups()
yield line,s,v,o
total_countries = 0
with open("ch7_cell18_countries.ttl", "w") as countries:
for line, subj, verb, obj in cell18_compressed_triples_gen("instance_types_en.ttl.bz2"):
if obj == '<http://dbpedia.org/ontology/Country>':
countries.write(line)
total_countries += 1
print("Found {:,} entities of type Country".format(total_countries))
As there are an order of magnitude fewer countries in the world than the 3,424 entities found, that type clearly counts many, many historical and fictional countries.
Now let's transform the source data from the World Bank into a TSV (Cell #19).
# CELL 19
import csv
rows = list()
with open("API_SP.POP.TOTL_DS2_en_csv_v2_10224786.csv") as wb:
dialect = csv.Sniffer().sniff(wb.read(1024))
wb.seek(0)
reader = csv.reader(wb, dialect) #delimiter='"', quotechar='\\')
for row in reader:
rows.append(row)
header=rows[4]
rows=rows[5:]
total_countries = 0
with open("ch7_cell19_wb_pop.tsv", "w") as wb:
del header[1:4]
header[0]='name'
wb.write("\t".join(header) + "\n")
for row in rows:
del row[1:4]
wb.write("\t".join(row) + "\n")
total_countries += 1
print("Written population for {:,} countries for {:,} years".format(total_countries, len(header)-1))
Let us see the intersection and the missing countries (Cell #20).
# CELL 20
from collections import OrderedDict
dbpedia_countries = set()
with open("ch7_cell18_countries.ttl") as c:
for line in c:
s,v,o = cell18_triple_re.match(line).groups()
dbpedia_countries.add(s)
print("DBpedia countries: {:,}".format(len(dbpedia_countries)))
wb_countries = set()
with open("ch7_cell19_wb_pop.tsv") as c:
next(c) # header
for line in c:
wb_countries.add(line.split("\t")[0])
print("World Bank countries: {:,}".format(len(wb_countries)))
country_mapping = OrderedDict()
missing_countries = list()
for country in wb_countries:
uri_name = country.replace(" ","_")
uri = "<http://dbpedia.org/resource/{}>".format(uri_name)
if uri in dbpedia_countries:
country_mapping[country] = uri
else:
missing_countries.append(country)
print("Mapped {:,} countries".format(len(country_mapping)))
print("Missing {:,}:\n".format(len(missing_countries)) + "\n".join(missing_countries))
Many of these are regions (like North America). Others include republic designators that seem to be handled differently in Wikipedia. I do the final mapping by hand in Cell #21.
# CELL 21
dbpedia_countries = set()
with open("ch7_cell18_countries.ttl") as c:
for line in c:
s,v,o = cell18_triple_re.match(line).groups()
dbpedia_countries.add(s)
print("DBpedia countries: {:,}".format(len(dbpedia_countries)))
wb_countries = set()
with open("ch7_cell19_wb_pop.tsv") as c:
next(c) # header
for line in c:
wb_countries.add(line.split("\t")[0])
print("World Bank countries: {:,}".format(len(wb_countries)))
regions = set(["Sub-Saharan Africa (IDA & IBRD countries)", "Not classified",
"East Asia & Pacific (excluding high income)", "Pre-demographic dividend", "Upper middle income",
"Small states", "Latin America & Caribbean", "High income", "IBRD only",
"Central Europe and the Baltics", "Post-demographic dividend",
"Europe & Central Asia (IDA & IBRD countries)",
"Latin America & the Caribbean (IDA & IBRD countries)", "IDA total", "Low & middle income",
"Caribbean small states", "Channel Islands", "Fragile and conflict affected situations",
"Europe & Central Asia (excluding high income)", "IDA blend", "Other small states",
"Sub-Saharan Africa", "Early-demographic dividend", "IDA only", "OECD members",
"Middle East & North Africa", "IDA & IBRD total", "South Asia (IDA & IBRD)", "Lower middle income",
"Late-demographic dividend", "Europe & Central Asia",
"Middle East & North Africa (IDA & IBRD countries)", "East Asia & Pacific",
"Pacific island small states", "Middle East & North Africa (excluding high income)",
"Euro area", "South Asia", "East Asia & Pacific (IDA & IBRD countries)",
"Heavily indebted poor countries (HIPC)", "Arab World",
"Latin America & Caribbean (excluding high income)", "World", "Middle income", "Low income",
"North America", "Least developed countries: UN classification",
"Sub-Saharan Africa (excluding high income)"])
extra_mappings = {
"Georgia": "<http://dbpedia.org/resource/Georgia_(country)>",
"Ireland": "<http://dbpedia.org/resource/Republic_of_Ireland>",
"Korea, Rep.": "<http://dbpedia.org/resource/South_Korea>",
"Macao SAR, China": "<http://dbpedia.org/resource/Macau>",
"Congo, Dem. Rep.": "<http://dbpedia.org/resource/Democratic_Republic_of_the_Congo>",
"Venezuela, RB": "<http://dbpedia.org/resource/Venezuela>",
"Iran, Islamic Rep.": "<http://dbpedia.org/resource/Iran>",
"Yemen, Rep.": "<http://dbpedia.org/resource/Yemen>",
"Egypt, Arab Rep.": "<http://dbpedia.org/resource/Egypt>",
"Cote d'Ivoire": "<http://dbpedia.org/resource/Ivory_Coast>",
"St. Vincent and the Grenadines": "<http://dbpedia.org/resource/Saint_Vincent_and_the_Grenadines> ",
"Eswatini": "<http://dbpedia.org/resource/Swaziland>",
"Congo, Rep.": "<http://dbpedia.org/resource/Republic_of_the_Congo>",
"Micronesia, Fed. Sts.": "<http://dbpedia.org/resource/Federated_States_of_Micronesia>",
"Sao Tome and Principe": "<http://dbpedia.org/resource/São_Tomé_and_Príncipe>",
"Curacao": "<http://dbpedia.org/resource/Curaçao>",
"Korea, Dem. People’s Rep.": "<http://dbpedia.org/resource/North_Korea>",
"Gambia, The": "<http://dbpedia.org/resource/The_Gambia>",
"Hong Kong SAR, China": "<http://dbpedia.org/resource/Hong_Kong>",
"St. Kitts and Nevis": "<http://dbpedia.org/resource/Saint_Kitts_and_Nevis>",
"Virgin Islands (U.S.)": "<http://dbpedia.org/resource/United_States_Virgin_Islands>",
"Brunei Darussalam": "<http://dbpedia.org/resource/Brunei>",
"Cabo Verde": "<http://dbpedia.org/resource/Cape_Verde>",
"St. Lucia": "<http://dbpedia.org/resource/Saint_Lucia>",
"St. Martin (French part)": "<http://dbpedia.org/resource/Collectivity_of_Saint_Martin>",
"Sint Maarten (Dutch part)": "<http://dbpedia.org/resource/Sint_Maarten>",
"Slovak Republic": "<http://dbpedia.org/resource/Slovakia>",
"Macedonia, FYR": "<http://dbpedia.org/resource/Republic_of_Macedonia>",
"Timor-Leste": "<http://dbpedia.org/resource/East_Timor>",
"Russian Federation": "<http://dbpedia.org/resource/Russia>",
"Botswana": "<http://dbpedia.org/resource/Botswana>",
"Kyrgyz Republic": "<http://dbpedia.org/resource/Kyrgyzstan>",
"West Bank and Gaza": "<http://dbpedia.org/resource/Palestinian_territories>",
"Lao PDR": "<http://dbpedia.org/resource/Laos>",
"Bahamas, The": "<http://dbpedia.org/resource/The_Bahamas>",
"Syrian Arab Republic": "<http://dbpedia.org/resource/Syria>",
"Uganda": "<http://dbpedia.org/resource/Uganda>"
}
country_mapping = dict()
missing_countries = list()
for country in wb_countries:
uri_name = country.replace(" ","_")
uri = "<http://dbpedia.org/resource/{}>".format(uri_name)
if uri in dbpedia_countries:
country_mapping[country] = uri
elif country in regions:
pass
elif country in extra_mappings:
if extra_mappings[country] is not None:
country_mapping[country] = extra_mappings[country]
else:
missing_countries.append(country)
print("Mapped {:,} countries".format(len(country_mapping)))
print("Missing {:,}:\n".format(len(missing_countries)) + "\n".join(missing_countries))
with open("ch7_cell21_wb_uri_mapping.tsv", "w") as mapping:
for wb in sorted(country_mapping.keys()):
db = country_mapping[wb]
mapping.write("{}\t{}\n".format(wb, db))
Finding those extra mappings took a while. For example, Georgia appears as "Georgia (country)" and as "Democratic Republic of Georgia" (which ceased to exist in 1921). DBpedia also has occasional errors: "Botswana" and "Uganda" are not listed as having the type "Country".
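A quick way to catch typing gaps like these is to check whether a candidate URI appears at all as a subject in the extracted country triples (a minimal sketch, not one of the numbered cells; it assumes each line of ch7_cell18_countries.ttl starts with the subject URI in angle brackets):
# Sketch (not one of the numbered cells): spot typing gaps by checking for a URI
# among the subjects of the extracted country triples.
def appears_as_country(uri, path="ch7_cell18_countries.ttl"):
    with open(path) as ttl:
        return any(line.startswith(uri) for line in ttl)

for candidate in ["<http://dbpedia.org/resource/Botswana>",
                  "<http://dbpedia.org/resource/France>"]:
    print(candidate, appears_as_country(candidate))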
Now, for the URIs we found, let's compute the number of incoming and outgoing relations (Cell #22).
# CELL 22
country_uris = set(map(lambda x:x.split("\t")[-1].strip(),open("ch7_cell21_wb_uri_mapping.tsv").readlines()))
rels_in = { uri: 0 for uri in country_uris }
rels_out = { uri: 0 for uri in country_uris }
for line, subj, verb, obj in cell18_compressed_triples_gen("mappingbased_literals_en.ttl.bz2"):
if subj in country_uris:
rels_out[subj] += 1
for line, subj, verb, obj in cell18_compressed_triples_gen("mappingbased_objects_en.ttl.bz2"):
if subj in country_uris:
rels_out[subj] += 1
if obj in country_uris:
rels_in[obj] += 1
total_rels_in = 0
total_rels_out = 0
with open("ch7_cell22_rels_in_out.tsv", "w") as tsv:
tsv.write("name\trels\trels?inv\n")
for country in sorted(country_uris):
tsv.write("{}\t{}\t{}\n".format(country, rels_out[country], rels_in[country]))
total_rels_in += rels_in[country]
total_rels_out += rels_out[country]
print("Written {:,} rels in, {:,} rels out".format(total_rels_in, total_rels_out))
Countries have lots of relations coming into them, as expected. We can now merge the time series data with the relations data and split into training data and final test data (Cell #23).
# CELL 23
import random
rels = dict()
with open("ch7_cell22_rels_in_out.tsv") as tsv:
next(tsv) # header
for line in tsv:
country, rels_out, rels_in = line.strip().split("\t")
rels[country] = (int(rels_out), int(rels_in))
mapping = dict()
with open("ch7_cell21_wb_uri_mapping.tsv") as m:
for line in m:
wb, db = line.strip().split("\t")
mapping[wb] = db
data = list()
with open("ch7_cell19_wb_pop.tsv") as wb:
line = next(wb)
header = line.strip().split("\t")
header.insert(1, "rels_out")
header.insert(2, "rels_in")
for line in wb:
fields = line.strip().split("\t")
wb_country = fields.pop(0)
if wb_country not in mapping:
continue
db_country = mapping[wb_country]
fields = list(map(lambda x: 0 if x == '' else int(x),fields))
fields.insert(0, rels[db_country][1])
fields.insert(0, rels[db_country][0])
data.append( (wb_country, fields) )
print("Data for {:,} countries".format(len(data)))
# save all data
with open("ch7_cell23_countries_data.tsv", "w") as tsv:
tsv.write("\t".join(header) + "\n")
for name, fields in data:
tsv.write(name + "\t" + "\t".join(map(str,fields)) + "\n")
# split train and final test
rand = random.Random(42)
rand.shuffle(data)
pivot = int(len(data) * 0.8)
devset = data[:pivot]
heldout = data[pivot:]
with open("ch7_cell23_countries_dev.tsv", "w") as dev:
dev.write("\t".join(header) + "\n")
for name, fields in devset:
dev.write(name + "\t" + "\t".join(map(str,fields)) + "\n")
with open("ch7_cell23_countries_held.tsv", "w") as held:
held.write("\t".join(header) + "\n")
for name, fields in heldout:
held.write(name + "\t" + "\t".join(map(str,fields)) + "\n")
print("Devset size: {:,}".format(len(devset)))
print("Heldout size: {:,}".format(len(heldout)))
With the data split properly, we can start some EDA on it.
Let's plot the relations and current population to see if there is a correlation (Cell #24).
# CELL 24
import math
pops = list()
num_in_rels = list()
num_out_rels = list()
with open("ch7_cell23_countries_dev.tsv") as feats:
next(feats) # skip header
for line in feats:
fields = line.strip().split("\t")
pop = float(fields[-1])
out_rels = float(fields[1])
in_rels = float(fields[2])
if out_rels == 0:
continue
pops.append(math.log(pop, 10))
num_out_rels.append(math.log(out_rels, 10))
num_in_rels.append(math.log(in_rels, 10))
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 10]
plt.plot(pops, num_in_rels, '.', color='gray', label='in')
plt.plot(pops, num_out_rels, '.', color='black', label='out')
plt.xlabel('scaled log population')
plt.ylabel('scaled log number of relations')
plt.savefig("ch7_cell24_logpop_vs_log_items.pdf", bbox_inches='tight', dpi=300)
plt.legend()
plt
From the figure we can see that the number of in relations is informative, but the number of out relations is not: most countries are involved in roughly the same number of standard relations.
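To quantify that impression, we can compute the Pearson correlation between log population and the two log relation counts (a small self-contained sketch, not one of the numbered cells, using the same transformations and the same zero-relation filter as Cell #24):
# Sketch (not one of the numbered cells): correlations behind the scatter plot above.
import math
import numpy as np
pops, in_rels, out_rels = [], [], []
with open("ch7_cell23_countries_dev.tsv") as feats:
    next(feats)  # header
    for line in feats:
        fields = line.strip().split("\t")
        if float(fields[1]) == 0:
            continue  # as in Cell #24, skip countries with no outgoing relations
        pops.append(math.log(float(fields[-1]), 10))
        out_rels.append(math.log(float(fields[1]), 10))
        in_rels.append(math.log(float(fields[2]), 10))
print("corr(log pop, log in rels):  {:.2f}".format(np.corrcoef(pops, in_rels)[0, 1]))
print("corr(log pop, log out rels): {:.2f}".format(np.corrcoef(pops, out_rels)[0, 1]))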
Let us now take 12 random countries and look at their time series data (Cell #25).
# CELL 25
import math
import random
years = list()
data = list()
with open("ch7_cell23_countries_dev.tsv") as feats:
header = next(feats).strip().split("\t")
years = list(map(int,header[3:]))
for line in feats:
fields = line.strip().split("\t")
data.append( (fields[0], list(map(int,fields[3:]))) )
rand = random.Random(42)
rand.shuffle(data)
to_show = data[:12]
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 20]
for idx, row in enumerate(to_show):
plt.subplot(len(to_show) // 4, 4, idx + 1)
plt.plot(years, row[1], '.', color='gray')
plt.xlabel('years')
plt.ylabel('population')
plt.title(row[0])
plt.savefig("ch7_cell25_sample_country_pop.pdf", bbox_inches='tight', dpi=300)
plt
In the figure we can see trend reversals (Ireland in the 1980s-1990s, the Slovak Republic around 2000-2010), missing data (Sint Maarten) and a variety of curve shapes. Let us take a look at the autocorrelation function (ACF) and partial autocorrelation function (PACF) of these series (Cell #26).
# CELL 26
import statsmodels.tsa.api as tsa
import math
import random
years = list()
data = list()
with open("ch7_cell23_countries_dev.tsv") as feats:
header = next(feats).strip().split("\t")
years = list(map(int,header[3:]))
for line in feats:
fields = line.strip().split("\t")
data.append( (fields[0], list(map(int,fields[3:]))) )
rand = random.Random(42)
rand.shuffle(data)
to_show = data[:12]
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 30]
lags=list(range(30))
layout = ( 2 * len(to_show) // 3, 3 * 2 )
for idx, row in enumerate(to_show):
ts_plt = plt.subplot2grid(layout, (2 * (idx // 3), 2 * (idx % 3)), colspan=2)
acf_plt = plt.subplot2grid(layout, (2 * (idx // 3) + 1, 2 * (idx % 3)) )
pacf_plt = plt.subplot2grid(layout, (2 * (idx // 3) + 1, 2 * (idx % 3) + 1) )
ts_plt.set_title(row[0])
ts_plt.plot(years, row[1], '.', color='gray')
tsa.graphics.plot_acf(row[1], lags=lags, ax=acf_plt, alpha=0.5)
tsa.graphics.plot_pacf(row[1], lags=lags, ax=pacf_plt, alpha=0.5)
plt.savefig("ch7_cell26_sample_country_pop_acf.pdf", bbox_inches='tight', dpi=300)
plt
From all these graphs we see no significant structure in the PACF, so an MA component won't be needed. We can also see a strong trend effect.
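The slow decay behind that trend effect is easy to reproduce: even a noisy synthetic linear trend has autocorrelations that stay high for many lags (a quick illustration, not one of the numbered cells):
# Sketch (not one of the numbered cells): the ACF of a trend-plus-noise series
# decays very slowly, which is exactly the pattern dominating the plots above.
import numpy as np
from statsmodels.tsa.stattools import acf
rng = np.random.RandomState(42)
trend_series = np.arange(60, dtype=float) + rng.normal(scale=2.0, size=60)
print(acf(trend_series, nlags=10, fft=False).round(2))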
Let's try detrending with log (Cell #27).
# CELL 27
import statsmodels.tsa.api as tsa
import math
import random
years = list()
data = list()
with open("ch7_cell23_countries_dev.tsv") as feats:
header = next(feats).strip().split("\t")
years = list(map(int,header[3:]))
for line in feats:
fields = line.strip().split("\t")
data.append( (fields[0], list(map(lambda x:0 if x=="0" else math.log(int(x)), fields[3:]))) )
rand = random.Random(42)
rand.shuffle(data)
to_show = data[:12]
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 30]
lags=list(range(30))
layout = ( 2 * len(to_show) // 3, 3 * 2 )
for idx, row in enumerate(to_show):
ts_plt = plt.subplot2grid(layout, (2 * (idx // 3), 2 * (idx % 3)), colspan=2)
acf_plt = plt.subplot2grid(layout, (2 * (idx // 3) + 1, 2 * (idx % 3)) )
pacf_plt = plt.subplot2grid(layout, (2 * (idx // 3) + 1, 2 * (idx % 3) + 1) )
ts_plt.set_title(row[0])
ts_plt.plot(years, row[1], '.', color='gray')
tsa.graphics.plot_acf(row[1], lags=lags, ax=acf_plt, alpha=0.5)
tsa.graphics.plot_pacf(row[1], lags=lags, ax=pacf_plt, alpha=0.5)
plt.savefig("ch7_cell27_sample_country_logpop_acf.pdf", bbox_inches='tight', dpi=300)
plt
That did not work; let's try differencing instead (Cell #28).
# CELL 28
import math
import random
import statsmodels.tsa.api as tsa
import numpy as np
years = list()
data = list()
with open("ch7_cell23_countries_dev.tsv") as feats:
header = next(feats).strip().split("\t")
years = list(map(int,header[3:]))
for line in feats:
fields = line.strip().split("\t")
data.append( (fields[0], list(map(int,fields[3:]))) )
rand = random.Random(42)
rand.shuffle(data)
table = np.array(list(map(lambda x:x[1], data)))
table = np.diff(table, n=1, axis=1) # the defaults, just for clarity
years.pop(0) # differencing loses one year
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 30]
lags = list(range(30))
layout = ( 2 * 12 // 3, 3 * 2 )
for idx in range(12):
row = ( data[idx][0], table[idx,:] )
ts_plt = plt.subplot2grid(layout, (2 * (idx // 3), 2 * (idx % 3)), colspan=2)
acf_plt = plt.subplot2grid(layout, (2 * (idx // 3) + 1, 2 * (idx % 3)) )
pacf_plt = plt.subplot2grid(layout, (2 * (idx // 3) + 1, 2 * (idx % 3) + 1) )
ts_plt.set_title(row[0])
ts_plt.plot(years, row[1], '.', color='gray')
tsa.graphics.plot_acf(row[1], lags=lags, ax=acf_plt, alpha=0.5)
tsa.graphics.plot_pacf(row[1], lags=lags, ax=pacf_plt, alpha=0.5)
plt.savefig("ch7_cell28_sample_country_pop_acf_diff1.pdf", bbox_inches='tight', dpi=300)
plt
That seems more informative; let us also look at second-order differencing (Cell #29).
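As a reminder, second-order differencing is just first differencing applied twice, and it turns a quadratic trend into a constant (a quick check with synthetic data, not one of the numbered cells):
# Sketch (not one of the numbered cells): second-order differencing in a nutshell.
import numpy as np
x = np.array([1., 4., 9., 16., 25., 36.])  # a quadratic trend
print(np.diff(x, n=2))                     # constant: [2. 2. 2. 2.]
print(np.diff(np.diff(x)))                 # identical result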
# CELL 29
import math
import random
import statsmodels.tsa.api as tsa
import numpy as np
years = list()
data = list()
with open("ch7_cell23_countries_dev.tsv") as feats:
header = next(feats).strip().split("\t")
years = list(map(int,header[3:]))
for line in feats:
fields = line.strip().split("\t")
data.append( (fields[0], list(map(int,fields[3:]))) )
rand = random.Random(42)
rand.shuffle(data)
table = np.array(list(map(lambda x:x[1], data)))
table = np.diff(table, n=2, axis=1) # second-order differencing along the year axis
years = years[2:] # differencing loses two years
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 30]
lags = list(range(30))
layout = ( 2 * 12 // 3, 3 * 2 )
for idx in range(12):
row = ( data[idx][0], table[idx,:] )
ts_plt = plt.subplot2grid(layout, (2 * (idx // 3), 2 * (idx % 3)), colspan=2)
acf_plt = plt.subplot2grid(layout, (2 * (idx // 3) + 1, 2 * (idx % 3)) )
pacf_plt = plt.subplot2grid(layout, (2 * (idx // 3) + 1, 2 * (idx % 3) + 1) )
ts_plt.set_title(row[0])
ts_plt.plot(years, row[1], '.', color='gray')
tsa.graphics.plot_acf(row[1], lags=lags, ax=acf_plt, alpha=0.5)
tsa.graphics.plot_pacf(row[1], lags=lags, ax=pacf_plt, alpha=0.5)
plt.savefig("ch7_cell29_sample_country_pop_acf_diff2.pdf", bbox_inches='tight', dpi=300)
plt
Hmm, hard to tell by eye. Let's move to a formal stationarity test, the augmented Dickey-Fuller (ADF) test (Cell #30).
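To calibrate how to read the p-values that follow, here is what ADF reports on a synthetic stationary series versus a random walk; a small p-value rejects the unit-root null, that is, the series looks stationary (a sketch, not one of the numbered cells):
# Sketch (not one of the numbered cells): ADF on synthetic data for intuition.
import numpy as np
from statsmodels.tsa.stattools import adfuller
rng = np.random.RandomState(42)
noise = rng.normal(size=200)            # stationary by construction
walk = np.cumsum(rng.normal(size=200))  # random walk, has a unit root
print("white noise p-value: {:.4f}".format(adfuller(noise, regression='c')[1]))
print("random walk p-value: {:.4f}".format(adfuller(walk, regression='c')[1]))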
# CELL 30
import math
import random
import statsmodels.tsa.api as tsa
import numpy as np
from statsmodels.tsa.stattools import adfuller
years = list()
data = list()
with open("ch7_cell23_countries_dev.tsv") as feats:
header = next(feats).strip().split("\t")
years = list(map(int,header[3:]))
for line in feats:
fields = line.strip().split("\t")
data.append( (fields[0], list(map(lambda x:1 if x=="0" else int(x), fields[3:]))) )
rand = random.Random(42)
rand.shuffle(data)
table = np.array(list(map(lambda x:x[1], data)))
table_log = np.log(table) # natural log; Cell #31 inverts it with math.exp
table_diff1 = np.diff(table, n=1, axis=1) # first-order differencing along the year axis
lags=list(range(12))
for idx in range(12):
x = table[idx]
print("\n" + data[idx][0] + " " + str(idx))
for regression in [ 'nc', 'c', 'ct', 'ctt']:
adf, pvalue, crits, results = adfuller(x, regression=regression,autolag='AIC',store=True,regresults=True)
if pvalue < 0.1:
print("\n\t(reg={}) pvalue: {:1.5}".format(regression, pvalue))
print("\t(reg={}) Lags used: {}".format(regression, results.usedlag))
adf, pvalue, crits, results = adfuller(table_diff1[idx],
regression=regression,autolag='AIC',store=True,regresults=True)
if pvalue < 0.1:
print("\n\t(reg={}) pvalue (diff 1): {:1.5}".format(regression, pvalue))
print("\t(reg={}) Lags used: {}".format(regression, results.usedlag))
adf, pvalue, crits, results = adfuller(table_log[idx],
regression=regression,autolag='AIC',store=True,regresults=True)
if pvalue < 0.1:
print("\n\t(reg={}) pvalue (log): {:1.5}".format(regression, pvalue))
print("\t(reg={}) Lags used: {}".format(regression, results.usedlag))
We can see that many countries do not achieve stationarity with differencing or logs, nor with linear or quadratic trend regression. For the others, it takes some searching to find the exact method that works. The simplest case is Turkmenistan, which is stationary on its log, so let's try to predict it (Cell #31).
# CELL 31
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA( table_log[0,:-1], order=(1,0,0) )
ar_results = model.fit()
print("Log forecast: {}".format(ar_results.forecast(1)))
print("Log actual: {}".format(table_log[0,-1]))
print("Absolute error, predicted: {:,}".format(math.exp(ar_results.forecast(1)[0][0]) - math.exp(table_log[0,-1])))
print("Absolute error, previous: {:,}".format(math.exp(table_log[0,-2]) - math.exp(table_log[0,-1])))
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 10]
plt.plot(table_log[0], color='black')
plt.plot(ar_results.fittedvalues, color='gray')
plt.title("{}: log pop AR(1)".format(data[0][0]))
plt.savefig("ch7_cell31_turmekistan_log_ar1.pdf", bbox_inches='tight', dpi=300)
plt
That looks nice, but there is a clear lag between the prediction and the actual value. Also, the prediction error is about 3,000 people worse than just using the last known value of the sequence. It is a shame to do so much work only to end up worse than where we started. But in Cell #30 the ADF autolag selected 11 lags, so let's try an AR(11) process instead (Cell #32).
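For reference, the two naive baselines used in these comparisons (last value, and last value plus last change) can be written as tiny helpers (a sketch with hypothetical names, not part of the book's cells):
# Sketch (not one of the numbered cells; hypothetical helper names).
def persistence(series):
    # forecast the next value as the last observed value
    return series[-1]

def persistence_plus_drift(series):
    # forecast the next value as the last value plus the last observed change
    return series[-1] + (series[-1] - series[-2])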
# CELL 32
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA( table_log[0,:-1], order=(11, 0, 0) )
ar_results = model.fit()
print("Log forecast: {}".format(ar_results.forecast(1)))
print("Log actual: {}".format(table_log[0,-1]))
log_forecast = ar_results.forecast(1)[0][0]
forecast = math.exp(log_forecast)
log_actual = table_log[0, -1]
actual = math.exp(log_actual)
previous = math.exp(table_log[0, -2])
previous2 = math.exp(table_log[0, -3])
print("Absolute error, predicted: {:,}".format(forecast - actual))
print("Absolute error, previous: {:,}".format(previous - actual))
print("Absolute error, previous + diff: {:,}".format(previous + (previous - previous2) - actual))
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 10]
plt.plot(table_log[0], color='black')
plt.plot(ar_results.fittedvalues, color='gray')
plt.title("{}: log pop AR(11)".format(data[0][0]))
plt.savefig("ch7_cell32_turmekistan_log_ar11.pdf", bbox_inches='tight', dpi=300)
plt
Now we are in business! The question is how to operationalize fitting all these models and obtaining their predictions automatically. I discuss that with the third featurization; let us start with the baselines: no time series at all, and using the raw time series values as training features.
Given the EDA, it looks likely that the TS systems will underperform compared to the baseline approaches.
Let us start by using only the number of relations (Cell #33).
# CELL 33
import math
import random
from sklearn.svm import SVR
import numpy as np
rand = random.Random(42)
train_data = list()
test_data = list()
header = None
with open("ch7_cell23_countries_dev.tsv") as feats:
header = next(feats)
header = header.strip().split("\t")
header = [ header[0], "log_" + header[1], "log_" + header[2], "logpop" ]
for line in feats:
fields = line.strip().split("\t")
name = fields[0]
pop = float(fields[-1])
out_rels = float(fields[1])
in_rels = float(fields[2])
if out_rels == 0:
out_rels = 1
row = ( [ math.log(out_rels, 10), math.log(in_rels, 10) ], math.log(pop, 10), name )
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
with open("ch7_cell33_feat1.tsv", "w") as feat:
feat.write("\t".join(header) + "\n")
for feats, logpop, name in train_data + test_data:
feat.write("{}\t{}\t{}\n".format(name, "\t".join(map(str,feats)), logpop))
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
# SVRs need scaling
xtrain_min = xtrain.min(axis=0); xtrain_max = xtrain.max(axis=0)
# some can be zero if the column is constant in training
xtrain_diff = xtrain_max - xtrain_min
for idx in range(len(xtrain_diff)):
if xtrain_diff[idx] == 0.0:
xtrain_diff[idx] = 1.0
xtrain_scaling = 1.0 / xtrain_diff
xtrain -= xtrain_min; xtrain *= xtrain_scaling
ytrain_orig = ytrain.copy()
ytrain_min = ytrain.min(); ytrain_max = ytrain.max()
ytrain_scaling = 1.0 / (ytrain_max - ytrain_min)
ytrain -= ytrain_min; ytrain *= ytrain_scaling
xtest -= xtrain_min; xtest *= xtrain_scaling
ytest_orig = ytest.copy()
ytest -= ytrain_min; ytest *= ytrain_scaling
# train
print("Training on {:,} countries".format(len(xtrain)))
best_c = 100.0
best_epsilon = 0.05
SEARCH_TEST = 50
best_rmse = 1000
for c in [0.01, 0.1, 0.5, 1.0, 1.5, 2.0, 5.0, 10.0, 50.0, 100.0]:
svr_rbf = SVR(epsilon=0.05, C=c, gamma='auto')
svr_rbf.fit(xtrain[:-SEARCH_TEST,:], ytrain[:-SEARCH_TEST])
ytrain_pred_end = svr_rbf.predict(xtrain[-SEARCH_TEST:,:])
ytrain_pred_end *= 1.0/ytrain_scaling
ytrain_pred_end += ytrain_min
RMSE = math.sqrt(sum((ytrain_orig[-SEARCH_TEST:] - ytrain_pred_end)**2) / SEARCH_TEST)
print("C", c, "RMSE", RMSE)
if RMSE < best_rmse:
best_c = c
best_rmse = RMSE
print("Best C", best_c,"best RMSE", best_rmse)
best_rmse = 1000
for epsilon in [0.0001, 0.001, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 0.0, 1.0, 10.0, 100.0]:
svr_rbf = SVR(epsilon=epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain[:-SEARCH_TEST,:], ytrain[:-SEARCH_TEST])
ytrain_pred_end = svr_rbf.predict(xtrain[-SEARCH_TEST:,:])
ytrain_pred_end *= 1.0/ytrain_scaling
ytrain_pred_end += ytrain_min
RMSE = math.sqrt(sum((ytrain_orig[-SEARCH_TEST:] - ytrain_pred_end)**2) / SEARCH_TEST)
print("Epsilon", epsilon, "RMSE", RMSE)
if RMSE < best_rmse:
best_epsilon = epsilon
best_rmse = RMSE
print("Best epsilon", best_epsilon,"best RMSE", best_rmse)
svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain, ytrain)
ytest_pred = svr_rbf.predict(xtest)
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest_orig, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch7_cell33_svr_feat1.pdf", bbox_inches='tight', dpi=300)
plt.legend()
The RMSE is quite small given the amount of training data and the paucity of the features.
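To put that number in context, always predicting the training mean would give an RMSE equal to the standard deviation of the log population target, which we can check directly (a small self-contained sketch, not one of the numbered cells):
# Sketch (not one of the numbered cells): RMSE of a predict-the-mean baseline.
import math
import numpy as np
logpops = []
with open("ch7_cell23_countries_dev.tsv") as feats:
    next(feats)  # header
    for line in feats:
        pop = float(line.strip().split("\t")[-1])
        if pop > 0:  # guard against missing values
            logpops.append(math.log(pop, 10))
print("std of log10 population: {:.3f}".format(np.std(logpops)))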
Let us now add the TS values as extra features; I will also try to find the history length (how many past years to include) that produces the best results (Cell #34).
# CELL 34
import math
import random
from sklearn.svm import SVR
import numpy as np
rand = random.Random(42)
train_data = list()
test_data = list()
header = None
with open("ch7_cell23_countries_dev.tsv") as feats:
header = next(feats).strip().split("\t")
logheader = list(map(lambda x:"log_"+x, header))
header = [ header[0] ] + logheader[1:-1] + [ "logpop" ]
for line in feats:
fields = line.strip().split("\t")
name = fields[0]
pop = float(fields[-1])
feats = list(map(float, fields[1:-1]))
for idx, feat in enumerate(feats):
if feat == 0:
feats[idx] = 1
row = ( list(map(lambda x:math.log(x, 10), feats)), math.log(pop, 10), name )
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
full_xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
full_xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
# SVRs need scaling
full_xtrain_min = full_xtrain.min(axis=0); full_xtrain_max = full_xtrain.max(axis=0)
# some can be zero if the column is constant in training
full_xtrain_diff = full_xtrain_max - full_xtrain_min
for idx in range(len(full_xtrain_diff)):
if full_xtrain_diff[idx] == 0.0:
full_xtrain_diff[idx] = 1.0
full_xtrain_scaling = 1.0 / full_xtrain_diff
full_xtrain -= full_xtrain_min; full_xtrain *= full_xtrain_scaling
ytrain_orig = ytrain.copy()
ytrain_min = ytrain.min(); ytrain_max = ytrain.max()
ytrain_scaling = 1.0 / (ytrain_max - ytrain_min)
ytrain -= ytrain_min; ytrain *= ytrain_scaling
full_xtest_orig = full_xtest.copy()
full_xtest -= full_xtrain_min; full_xtest *= full_xtrain_scaling
ytest_orig = ytest.copy()
ytest -= ytrain_min; ytest *= ytrain_scaling
# train
print("Training on {:,} countries".format(len(full_xtrain)))
best_history = len(header) - 5
best_history_rmse = 1000
best_history_c = 100.0
best_history_epsilon = 0.05
for history in range(1, len(header) - 5):
print("History", history)
xtrain = np.zeros( (full_xtrain.shape[0], 2 + history))
xtrain[:,:2] = full_xtrain[:,:2] # rels
xtrain[:, 2:] = full_xtrain[:,-history:]
best_c = 100.0
best_epsilon = 0.05
SEARCH_TEST = 50
best_rmse = 1000
for c in [0.01, 0.1, 0.5, 1.0, 1.5, 2.0, 5.0, 10.0, 50.0, 100.0]:
svr_rbf = SVR(epsilon=0.05, C=c, gamma='auto')
svr_rbf.fit(xtrain[:-SEARCH_TEST,:], ytrain[:-SEARCH_TEST])
ytrain_pred_end = svr_rbf.predict(xtrain[-SEARCH_TEST:,:])
ytrain_pred_end *= 1.0/ytrain_scaling
ytrain_pred_end += ytrain_min
RMSE = math.sqrt(sum((ytrain_orig[-SEARCH_TEST:] - ytrain_pred_end)**2) / SEARCH_TEST)
if RMSE < best_rmse:
best_c = c
best_rmse = RMSE
print("Best C", best_c,"best RMSE", best_rmse)
best_rmse = 1000
for epsilon in [0.0001, 0.001, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 0.0, 1.0, 10.0, 100.0]:
svr_rbf = SVR(epsilon=epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain[:-SEARCH_TEST,:], ytrain[:-SEARCH_TEST])
ytrain_pred_end = svr_rbf.predict(xtrain[-SEARCH_TEST:,:])
ytrain_pred_end *= 1.0/ytrain_scaling
ytrain_pred_end += ytrain_min
RMSE = math.sqrt(sum((ytrain_orig[-SEARCH_TEST:] - ytrain_pred_end)**2) / SEARCH_TEST)
if RMSE < best_rmse:
best_epsilon = epsilon
best_rmse = RMSE
print("Best epsilon", best_epsilon,"best RMSE", best_rmse)
if best_rmse < best_history_rmse:
best_history_rmse = best_rmse
best_history = history
best_history_c = best_c
best_history_epsilon = best_epsilon
print("Best history", best_history, "best c", best_c, "best epsilon", best_epsilon,"best RMSE", best_rmse)
xtrain = np.zeros( (full_xtrain.shape[0], 2 + best_history))
xtrain[:,:2] = full_xtrain[:,:2] # rels
xtrain[:, 2:] = full_xtrain[:,-best_history:]
xtest = np.zeros( (full_xtest.shape[0], 2 + best_history))
xtest[:,:2] = full_xtest[:,:2] # rels
xtest[:, 2:] = full_xtest[:,-best_history:]
svr_rbf = SVR(epsilon=best_history_epsilon, C=best_history_c, gamma='auto')
svr_rbf.fit(xtrain, ytrain)
ytest_pred = svr_rbf.predict(xtest)
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)
# use previous year to predict current
RMSE = math.sqrt(sum((full_xtest_orig[:,-1] - ytest_pred)**2) / len(ytest))
print("Previous year RMSE", RMSE)
with open("ch7_cell34_feat2.tsv", "w") as feat:
    feat.write("\t".join(header[:3] + header[-(best_history+1):]) + "\n")
for feats, logpop, name in train_data + test_data:
feat.write("{}\t{}\t{}\n".format(name, "\t".join(map(str,feats[:2] + feats[-best_history:])), logpop))
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest_orig, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch7_cell34_svr_feat2.pdf", bbox_inches='tight', dpi=300)
plt.legend()
That is going to be very difficult to improve upon, but note that the error for just using the previous year is comparable to the SVR's. Let us give the TS a try.
To get TS predictions, I will do the following: for each country, test for stationarity with the ADF test under a number of conditions; if a condition passes, fit an AR model on the detrended residuals (with the lag order suggested by the ADF autolag) and forecast one year ahead; if no condition passes or the fit fails, fall back to the previous value plus the previous difference.
In terms of detrending techniques I will try: the raw series or its log, detrended with a constant or a constant plus linear trend, over the last 25 or 50 years (the conditions list in Cell #35).
I will skip differencing and quadratic regression, as they did not seem particularly useful in the EDA and they make the code more complicated. I use the simplest possible detrending, basically preferring the simplest possible model.
I will then train a model using the history from the previous featurization (2 lags) plus the prediction from the TS (if any; otherwise, the previous value plus the previous difference). Cell #35.
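Condensed to its essence, the per-country gate used in Cell #35 looks like the following sketch (hypothetical helper name, not one of the numbered cells); the real cell additionally re-adds the OLS trend component to the residual forecast, loops over the conditions above, and falls back to the persistence-plus-drift prediction when nothing fits:
# Sketch (not one of the numbered cells; hypothetical helper name).
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima_model import ARIMA

def ar_residual_forecast_or_none(series, regression='c', alpha=0.1):
    adf, pvalue, crits, store = adfuller(series, regression=regression,
                                         autolag='AIC', store=True, regresults=True)
    if pvalue >= alpha:
        return None  # unit-root null not rejected: no TS model for this condition
    try:
        ar = ARIMA(store.resols.resid, order=(store.usedlag, 0, 0)).fit()
    except Exception:
        return None  # fitting can fail on short or degenerate series
    return ar.forecast(1)[0][0]  # one-step-ahead forecast of the detrended residual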
# CELL 35
import math
import random
import statsmodels.tsa.api as tsa
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima_model import ARIMA
years = list()
data = list()
with open("ch7_cell23_countries_dev.tsv") as feats:
header = next(feats).strip().split("\t")
years = list(map(int,header[3:]))
for line in feats:
fields = line.strip().split("\t")
data.append( (fields[0], list(map(lambda x:1 if x=="0" else int(x), fields[3:])),
list(map(lambda x:1 if x=="0" else math.log(float(x), 10), fields[1:3]))) )
table = np.array(list(map(lambda x:x[1], data)))
table_log = np.log10(table)
conditions = [
( 'base', 'c', 50 ),
( 'base', 'c', 25 ),
( 'log', 'c', 50 ),
( 'log', 'c', 25 ),
( 'base', 'ct', 50 ),
( 'base', 'ct', 25 ),
( 'log', 'ct', 50 ),
( 'log', 'ct', 25 ),
]
predicted = list()
condition_used = list()
errors = 0
fitted_countries = set()
for country_idx in range(table.shape[0]):
print(country_idx, data[country_idx][0])
    years = table[country_idx, : -2] # exclude the last two years (final target and validation year)
target = table[country_idx, -2] # use previous to last for prediction
prediction = years[-1] + (years[-1] - years[-2])
best_prediction_se = (target - prediction)**2
best_condition = ( 'none', 'c', 2 )
for condition in conditions:
length = condition[-1]
if condition[0] == 'base':
this_table = table
else:
this_table = table_log
        years = this_table[country_idx, -(length+2): -2] # the last length years, excluding the final two
target = table[country_idx, -2] # use previous to last for prediction
regression = condition[1]
adf, pvalue, crits, results = adfuller(years, regression=regression,autolag='AIC',
store=True,regresults=True)
if pvalue < 0.1:
print("\t{}".format(condition))
# train and predict
model = ARIMA( results.resols.resid, order=(results.usedlag, 0, 0) )
try:
ar_results = model.fit()
except:
print("\t\te")
errors += 1
continue
fitted_countries.add(country_idx)
df_ols = int(results.resols.df_model)
predict = ar_results.forecast(1)[0][0]
ols_predict = results.resols.predict(np.diff(years[-(df_ols+2):]).reshape(1,df_ols+1))[0]
predict += years[-1] + ols_predict
if condition[0] == 'log':
predict = 10**predict
se = (target - predict)**2
if se <= best_prediction_se:
print("\t\ts {:,} and {:,}".format(int(se), best_prediction_se))
best_prediction_se = se
best_condition = condition
break
else:
print("\t\t{:,} vs {:,}".format(int(se), best_prediction_se))
diff_prediction = table[country_idx, -2] + (table[country_idx, -2] - table[country_idx, -3])
if best_condition[0] != 'none':
# compute final prediction
length = best_condition[-1]
if best_condition[0] == 'base':
this_table = table
else:
this_table = table_log
years = this_table[country_idx, -(length+1): -1] # skip last year
regression = best_condition[1]
adf, pvalue, crits, results = adfuller( years, regression=regression,autolag='AIC',
store=True,regresults=True)
if pvalue < 0.1:
# train and predict
model = ARIMA( results.resols.resid, order=(results.usedlag, 0, 0) )
in_error = False
try:
ar_results = model.fit()
except:
print("\t\t\tE")
in_error = True
errors += 1
if in_error:
best_condition = ( 'none', 'c', 2 )
prediction = diff_prediction
else:
df_ols = int(results.resols.df_model)
prediction = ar_results.forecast(1)[0][0]
ols_predict = results.resols.predict(np.diff(years[-(df_ols+2):]).reshape(1,df_ols+1))[0]
prediction += years[-1] + ols_predict
if best_condition[0] == 'log':
prediction = 10**prediction
else:
best_condition = ( 'none', 'c', 2 )
prediction = diff_prediction
else:
prediction = diff_prediction
predicted.append(prediction)
condition_used.append(best_condition)
print("Errors", errors)
print("Fitted", len(fitted_countries))
print("Using TS:", len(list(filter(lambda x:x[0] != 'none',condition_used))))
with open("ch7_cell35_ts_predicted.tsv", "w") as pred:
pred.write("name\tpredicted\tmethod\tregression\tsize\n")
for country_idx, pair in enumerate(zip(predicted, condition_used)):
prediction, condition = pair
pred.write("{}\t{}\t{}\t{}\t{}\n".format(data[country_idx][0], prediction, *condition))
As only 6 out of 163 countries ended up using a TS prediction, there is no point in proceeding with the remaining featurizations.
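For the record, the mix of methods that were selected can be tallied directly from the file written in Cell #35 (a quick sketch, not one of the numbered cells):
# Sketch (not one of the numbered cells): which method did each country end up using?
from collections import Counter
with open("ch7_cell35_ts_predicted.tsv") as pred:
    next(pred)  # header: name, predicted, method, regression, size
    counts = Counter(line.strip().split("\t")[2] for line in pred)
print(counts)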