Accompanying code for the book The Art of Feature Engineering.
This notebook plus notebooks for the other chapters are available online at https://github.com/DrDub/artfeateng
Copyright 2019 Pablo Duboue
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
In this chapter, we will cover an expansion of the WikiCities dataset with textual descriptions and its impact on the population prediction task introduced in Chapter 6.
Text is a domain that exemplifies:
Among the methods exemplified in text we will see:
The text data we will be using in this notebook are the Wikipedia pages for the different cities in the dataset. Text extraction from Wikipedia is a computationally intensive task better handled by specialized tools; in this case, I used the excellent wikiextractor software by Giuseppe Attardi to produce cities1000_wikitext.tsv.bz2, with one city per row and text lines separated by tab characters. That file contains 43,909,804 words and over 270,902,780 characters (an average of 558 words and 3,445 characters per document). For some of the experiments using document structure, I also kept the original markup in the file cities1000_wikiraw.tsv.bz2; with markup, the total number of characters climbs above 730 million.
First, many Wikipedia pages contain the population information mentioned within the text. Not necessarily all of them, but many do; at the Exploratory Data Analysis stage we might want to get an idea of how many. Even when present, however, the population might be indicated in many different ways, including with punctuation (2,152,111 instead of 2152111), and most probably rounded and expressed by intermixing digits with words (like "a little over 2 million"). In that sense, this task is representative of the NLP subfield of Information Extraction.
While NLP in this decade has been overtaken by deep learning approaches, particularly neural language models, this particular task can most probably still profit from non-deep-learning techniques, as we are looking for a very small piece of evidence within a large amount of data.
Following Chapter 6, it is clear that bigger cities tend to have longer pages, so text length will most probably be a great feature. As base features, we will use ch6_cell32_dev_feat_conservative.tsv with its 98 features.
With such an amount of data, aggressive feature selection will be needed, but let us start with some EDA.
Let us start by assembling a simple dataset with an extra feature (the text length) and see whether it helps to better predict the population (Cell #1).
# CELL 1
import random
import bz2
import re
import math
from sklearn.svm import SVR
import numpy as np
# read page lengths
text_lengths = dict()
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
for byteline in wikitext:
cityline = byteline.decode("utf-8")
tab = cityline.index('\t')
name = cityline[:tab]
text = cityline[tab:]
text_lengths[name] = len(text)
# read base features
rand = random.Random(42)
train_data = list()
test_data = list()
header = None
with open("ch8_cell1_dev_textlen.tsv", "w") as ch8:
with open("ch6_cell32_dev_feat_conservative.tsv") as feats:
header = next(feats)
header = header.strip().split("\t")
header.insert(-1, 'logtextlen')
ch8.write("\t".join(header) + "\n")
header.pop(0) # name
header.pop() # population
for line in feats:
fields = line.strip().split("\t")
name = fields[0]
if name not in text_lengths:
raise Exception("City not found: " + name)
fields.insert(-1, str(math.log(text_lengths[name], 10)))
ch8.write("\t".join(fields) + "\n")
logpop = float(fields[-1])
feats = list(map(float,fields[1:-1]))
row = (feats, logpop, name)
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data = None
# SVRs need scaling
xtrain_min = xtrain.min(axis=0); xtrain_max = xtrain.max(axis=0)
# some can be zero if the column is constant in training
xtrain_diff = xtrain_max - xtrain_min
for idx in range(len(xtrain_diff)):
if xtrain_diff[idx] == 0.0:
xtrain_diff[idx] = 1.0
xtrain_scaling = 1.0 / xtrain_diff
xtrain -= xtrain_min; xtrain *= xtrain_scaling
ytrain_min = ytrain.min(); ytrain_max = ytrain.max()
ytrain_scaling = 1.0 / (ytrain_max - ytrain_min)
ytrain -= ytrain_min; ytrain *= ytrain_scaling
xtest -= xtrain_min; xtest *= xtrain_scaling
ytest_orig = ytest.copy()
ytest -= ytrain_min; ytest *= ytrain_scaling
# train
print("Training on {:,} cities".format(len(xtrain)))
best_c = 100.0
best_epsilon = 0.05
svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain, ytrain)
ytest_pred = svr_rbf.predict(xtest)
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)
xtrain = None
xtest = None
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest_orig, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell1_svr.pdf", bbox_inches='tight', dpi=300)
plt.legend()
The resulting RMSE of 0.3434 is an improvement over the one from Chapter 6 (0.3578), which is encouraging, but it is still above the 0.3298 obtained using the full Chapter 6 graph information.
Let us look at ten random cities to see whether their text descriptions include the population explicitly (Cell #2). Notice that I have used a regular Wikipedia dump, not a Cirrus dump. Wikipedia in recent years has moved to include tags expanded from the Wikidata project, and therefore the exact population number might be absent, replaced by a tag instructing the template engine to fetch the number at rendering time.
# CELL 2
PARAM_USE_SET_FOR_BOOK = True
cities_and_pop = list()
with open("ch8_cell1_dev_textlen.tsv") as feats:
first = True
for line in feats:
if first:
first = False
else:
fields = line.split('\t')
cities_and_pop.append( (fields[0], round(10**float(fields[-1]))) )
rand = random.Random(42)
# stable set for book
cities = [ ('Century,_Florida', 1698), ('Isseksi', 2000), ('Volda', 8827), ('Cournonsec', 2149),
('Cape_Neddick,_Maine', 2568), ('Zhlobin', 80200), ('Hangzhou', 9018000), ('Gnosall', 4736),
('Scorbé-Clairvaux', 2412), ('Arizona_City,_Arizona', 10475) ]
if PARAM_USE_SET_FOR_BOOK:
cities = list(map(lambda x: ("<http://dbpedia.org/resource/" + x[0] + ">", x[1]), cities))
else:
cities = set(rand.sample(sorted(cities_and_pop), 10))
to_print = set(map(lambda x:x[0], cities))
pops = { x[0]: x[1] for x in cities }
html = ''
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
for byteline in wikitext:
cityline = byteline.decode("utf-8")
tab = cityline.index('\t')
name = cityline[:tab]
if name in to_print:
text = cityline[tab:]
text = text.replace('\t','<p>')
html += "<h1>{}</h1><h2>Population: {}</h2>{}".format(name[1:-1], pops[name], text)
from IPython.display import HTML, display
display(HTML(html))
From this sample, most pages mention the actual number, albeit with punctuation:
City | Pop | Text |
---|---|---|
Arizona_City,_Arizona | 10475 | The population was 10,475 at the 2010 census. |
Century,_Florida | 1698 | The population was 1,698 at the 2010 United States Census. |
Cape_Neddick,_Maine | 2568 | The population was 2,568 at the 2010 census. |
Hangzhou | 9018000 | Hangzhou prefecture had a registered population of 9,018,000 in 2015. |
Volda | 8827 | The new Volda municipality had 7,207 residents. |
Gnosall | 4736 | Gnosall is a village and civil parish in the Borough of Stafford, Staffordshire, England, with a population of 4,736 across 2,048 households (2011 census). |
Zhlobin | 80200 | As of 2012, the population is 80.200. |
Cournonsec | 2149 | (population not mentioned in the text) |
Scorbé-Clairvaux | 2412 | (population not mentioned in the text) |
Isseksi | 2000 | At the time of the 2004 census, the commune had a total population of 2000 people living in 310 households. |
So 8 out of 10 mention the population, with one case quoting a different population (7,207 vs. 8,827) and another using different punctuation (80.200 instead of 80,200). Clearly there is value in the textual data. In only one case (Isseksi) does the text contain the number verbatim, without any punctuation.
Also, note that Volda is a town with plenty of text and a rich history. The page itself describes its population changes over the years.
Let us see if these percentages carry on to the whole dataset (Cell #3).
# CELL 3
import bz2
cities_and_pop = dict()
with open("ch6_cell32_dev_feat_conservative.tsv") as feats:
first = True
for line in feats:
if first:
first = False
else:
fields = line.split('\t')
cities_and_pop[ fields[0] ] = round(10**float(fields[-1]))
found_verbatim = 0
found_with_commas = 0
found_with_dots = 0
total = 0
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
for byteline in wikitext:
cityline = byteline.decode("utf-8")
tab = cityline.index('\t')
name = cityline[:tab]
text = cityline[tab:]
if name in cities_and_pop:
total += 1
pop = cities_and_pop[name]
pop_verbatim = str(pop)
if pop_verbatim in text:
found_verbatim += 1
else:
pop_commas = "{:,}".format(pop)
if pop_commas in text:
found_with_commas += 1
else:
pop_dots = pop_commas.replace(",",".")
if pop_dots in text:
found_with_dots += 1
print("Total cities: {:,}".format(total))
print("Found verbatim: {:,} ({:%})".format(found_verbatim, found_verbatim * 1.0 / total))
print("Found with commas: {:,} ({:%})".format(found_with_commas, found_with_commas * 1.0 / total))
print("Found with dots: {:,} ({:%})".format(found_with_dots, found_with_dots * 1.0 / total))
found_either = found_verbatim + found_with_commas + found_with_dots
print("Found either: {:,} ({:%})".format(found_either, found_either * 1.0 / total))
Therefore, half the cities contain their population verbatim in the page. Using rule-based information extraction techniques (regular expressions and the like, for example with the Rule-based Text Annotation system, Ruta) would work here. We will instead try more automated techniques, which might also apply to the other cities.
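To make the rule-based idea concrete, here is a minimal sketch of such an extractor; the regular expression, the helper name extract_population_candidates and the example sentence are illustrative assumptions, not part of the pipeline used in the rest of this chapter.
import re

# Illustrative pattern: a number (possibly with thousands separators)
# appearing shortly after the word "population".
POP_RE = re.compile(r'population\D{0,40}?(\d{1,3}(?:[.,]\d{3})+|\d+)', re.IGNORECASE)

def extract_population_candidates(text):
    """Return candidate population numbers mentioned in the text."""
    return [int(re.sub(r'[.,]', '', m.group(1))) for m in POP_RE.finditer(text)]

print(extract_population_candidates("The population was 10,475 at the 2010 census."))  # [10475]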
A question is what type of model to use. This might be a good moment to move away from SVRs: they can have trouble with a large number of features, and their strong tendency against overfitting might fail when any of those features contains the target value. Let us see how an SVR behaves with a target leak (Cell #4).
# CELL 4
import random
import re
import math
from sklearn.svm import SVR
import numpy as np
# read base features
rand = random.Random(42)
train_data = list()
test_data = list()
header = None
with open("ch8_cell1_dev_textlen.tsv") as feats:
header = next(feats)
header = header.strip().split("\t")
header.pop(0) # name
for line in feats:
fields = line.strip().split("\t")
logpop = float(fields[-1])
name = fields[0]
feats = list(map(float,fields[1:])) # keep pop
row = (feats, logpop, name)
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data = None
# SVRs need scaling
xtrain_min = xtrain.min(axis=0); xtrain_max = xtrain.max(axis=0)
# some can be zero if the column is constant in training
xtrain_diff = xtrain_max - xtrain_min
for idx in range(len(xtrain_diff)):
if xtrain_diff[idx] == 0.0:
xtrain_diff[idx] = 1.0
xtrain_scaling = 1.0 / xtrain_diff
xtrain -= xtrain_min; xtrain *= xtrain_scaling
ytrain_min = ytrain.min(); ytrain_max = ytrain.max()
ytrain_scaling = 1.0 / (ytrain_max - ytrain_min)
ytrain -= ytrain_min; ytrain *= ytrain_scaling
xtest -= xtrain_min; xtest *= xtrain_scaling
ytest_orig = ytest.copy()
ytest -= ytrain_min; ytest *= ytrain_scaling
# train
print("Training on {:,} cities".format(len(xtrain)))
best_c = 100.0
best_epsilon = 0.05
svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain, ytrain)
ytest_pred = svr_rbf.predict(xtest)
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
print("RMSE with target leak", RMSE)
xtrain = None
xtest = None
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest_orig, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell4_svr.pdf", bbox_inches='tight', dpi=300)
plt.legend()
That actually looks very nice; sadly, when adding more features the training times become prohibitively long, so I moved to a Random Forest Regressor (Cell #5).
# CELL 5
import re
import random
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np
# read base features
rand = random.Random(42)
header = None
train_data = list()
test_data = list()
with open("ch8_cell1_dev_textlen.tsv") as f:
header = next(f)
header = header.strip().split("\t")
header.pop(0) # name
header.pop() # population
for line in f:
fields = line.strip().split("\t")
logpop = float(fields[-1])
name = fields[0]
feats = list(map(float,fields[1:-1]))
row = (feats, logpop, name)
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data = None
# train
print("Training on {:,} cities".format(len(xtrain)))
rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(xtrain, ytrain)
ytest_pred = rf.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)
xtrain = None
xtest = None
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell5_rf.pdf", bbox_inches='tight', dpi=300)
plt.legend()
It produces worse performance than the SVR, but it trains much faster, so it will do. We can now proceed to our first featurization, where we will treat the documents as bags of words.
The bag-of-words (BoW) approach represents each document as a fixed-size vector whose size equals that of the whole vocabulary (as computed on the training data).
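As a minimal sketch of this representation (the helper names and the toy documents are made up; the cells below build their vectors directly over the Wikipedia text), a binary BoW encoding over a fixed training vocabulary could look like this:
def build_vocab(train_docs):
    # vocabulary is fixed on the training documents only
    vocab = sorted({tok for doc in train_docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(vocab)}

def bow_vector(doc, vocab_to_idx):
    vec = [0.0] * len(vocab_to_idx)
    for tok in doc.lower().split():
        idx = vocab_to_idx.get(tok)   # tokens unseen in training are dropped
        if idx is not None:
            vec[idx] = 1.0
    return vec

vocab = build_vocab(["a small village", "a large capital city"])
print(bow_vector("a tiny village", vocab))  # "tiny" is out of vocabulary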
By far the most important function in a bag-of-words approach is the tokenization function. Good tokenization is key and it is language and subdomain specific (e.g., journalistic text vs. Twitter).
In our case, tokenizing numbers is key. In other domains it is important to handle the different variations of words (what is known as "morphology"), but this problem presents a simpler case, with just numbers.
For the purpose of our regression problem, a difference between 12,001,112 and 12,001,442 constitutes a nuisance variation and needs to be addressed. We can replace each number with a pseudo-word indicating, for example, how many digits the number has (think "TOKNUM1DIGIT", "TOKNUM2DIGIT", etc.). That will produce about 10 tokens for all the population numbers we have. This might not be enough; instead, we might want to also distinguish the first digit of the number (1TOKNUM3DIGIT represents 1,000 to 1,999, 2TOKNUM3DIGIT represents 2,000 to 2,999, and so on), but that will create about 90 tokens, which might be too many.
Instead, we can use the discretization data from Cell #27 in Chapter 6 and transform each number-like token into one of 32 segment pseudo-words, TOKNUMSEG0 through TOKNUMSEG31 (Cell #6 below). To avoid having too many features, we expand the feature vector with only binary features indicating whether these tokens appear.
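Both schemes can be sketched in a few lines; the boundary values below are made up for illustration, while Cell #6 uses the actual splits stored in ch6_cell27_splits.pk.
import bisect

def digit_pseudoword(num):
    # one pseudo-word per number length
    return "TOKNUM{}DIGIT".format(len(str(num)))

FAKE_BOUNDARIES = [1000, 5000, 20000, 100000, 1000000]  # illustrative only

def segment_pseudoword(num, boundaries=FAKE_BOUNDARIES):
    # one pseudo-word per discretization segment
    return "TOKNUMSEG{}".format(bisect.bisect_right(boundaries, num))

print(digit_pseudoword(2152111))    # TOKNUM7DIGIT
print(segment_pseudoword(2152111))  # TOKNUMSEG5
print(segment_pseudoword(1698))     # TOKNUMSEG1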
When operating with documents and vocabularies, it is important to distinguish the vocabulary size from the total document size. Both are measured in "words", but the term "word" means different things in each case. Therefore, in NLP we use the term "word types" to refer to dictionary entries and "word tokens" to refer to document entries. You can think of a word type as a class in object-oriented programming and a word token as an instance of that class.
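As a toy example of the distinction:
from collections import Counter

tokens = "the city and the river and the bridge".split()
counts = Counter(tokens)
print("word tokens:", len(tokens))   # 8, the document length
print("word types:", len(counts))    # 5, the vocabulary entries: the, city, and, river, bridge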
We can now assemble the baseline system, using BoW over the whole documents in the training set. Because the vocabulary is fixed on the training set, there will be many words in the dev set that are missing from it. That is when smoothing techniques (like Good-Turing smoothing) come in handy.
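As a reminder of the idea behind Good-Turing (a simple sketch; the cells below just ignore out-of-vocabulary tokens rather than smoothing), the total probability mass assigned to unseen words is estimated as N1/N, the relative frequency of word types seen exactly once:
from collections import Counter

train_tokens = "the city near the river has a small port".split()
counts = Counter(train_tokens)
n1 = sum(1 for c in counts.values() if c == 1)  # types occurring exactly once
unseen_mass = n1 / len(train_tokens)            # Good-Turing estimate of P(unseen)
print("estimated probability mass of unseen words:", unseen_mass)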
# CELL 6
import re
import pickle
import random
import bz2
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np
with open("ch6_cell27_splits.pk", "rb") as pkl:
segments_at = pickle.load(pkl)
boundaries = list(map(lambda x:( int(round(10**x['min'])),
int(round(10**x['val'])),
int(round(10**x['max'])) ), segments_at[5]))
NUM_RE = re.compile('\d?\d?\d?(,?\d{3})+') # at least 3 digits
def cell6_tokenize(text):
tokens = list(filter(lambda x:len(x)>0,
re.sub('\s+',' ', re.sub('[^A-z,0-9]', ' ', text)).split(' ')))
result = list()
for tok in tokens:
if tok[-1] in set([".", ",", "?", "!"]):
tok = tok[:-1]
if NUM_RE.fullmatch(tok):
num = int(tok.replace(",",""))
if num < boundaries[0][0]:
pass # too small
elif num > boundaries[-1][2]:
pass # too big
else:
found = False
for idx, seg in enumerate(boundaries[1:]):
if num < seg[0]:
result.append("TOKNUMSEG" + str(idx))
found = True
break
if not found:
result.append("TOKNUMSEG" + str(len(boundaries) - 1))
return result
# read base features
rand = random.Random(42)
all_data = list()
city_to_all_data = dict()
header = None
with open("ch8_cell1_dev_textlen.tsv") as f:
header = next(f)
header = header.strip().split("\t")
header.pop(0) # name
header.pop() # population
for line in f:
fields = line.strip().split("\t")
logpop = float(fields[-1])
name = fields[0]
feats = list(map(float,fields[1:-1]))
city_to_all_data[name] = len(all_data)
all_data.append( (feats, logpop, name) )
# add text features
tok_to_col = dict()
for idx, segs in enumerate(boundaries):
header.append("TOKNUMSEG{}-{}-{}".format(idx, segs[0], segs[-1]))
tok_to_col["TOKNUMSEG{}".format(idx)] = idx
remaining = set(map(lambda x:x[-1], all_data))
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
for byteline in wikitext:
cityline = byteline.decode("utf-8")
tab = cityline.index('\t')
name = cityline[:tab]
if name in remaining:
remaining.remove(name)
extra_feats = [0.0] * len(boundaries)
text = cityline[tab:]
for numtoken in cell6_tokenize(text):
extra_feats[tok_to_col[numtoken]] = 1.0
all_data[city_to_all_data[name]][0].extend(extra_feats)
for name in remaining:
extra_feats = [0.0] * len(boundaries)
all_data[city_to_all_data[name]][0].extend(extra_feats)
with open("ch8_cell6_dev_feat1.tsv", "w") as feats:
extheader = header.copy()
extheader.insert(0, 'name')
extheader.append('logpop')
feats.write("\t".join(extheader) + "\n")
for row in all_data:
feats.write("{}\t{}\t{}\n".format(row[-1], "\t".join(map(str,row[0])), row[1]))
# split
train_data = list()
test_data = list()
for row in all_data:
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data = None
# train
print("Training on {:,} cities".format(len(xtrain)))
rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(xtrain, ytrain)
ytest_pred = rf.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)
xtrain = None
xtest = None
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell6_rf_feat1.pdf", bbox_inches='tight', dpi=300)
plt.legend()
At 0.3437, that worked very well for adding only 32 new features, but we are still not at the level of the SVR (an SVR on this dataset takes 2 hours to train with only 130 features and produces an RMSE of 0.3216).
Let us try adding some more words to build a richer BoW representation; to avoid a large feature set, let us take the top 1,000 words ranked by mutual information (M.I.) (Cell #7).
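For reference, the score computed inside Cell #7 is the standard mutual information between a binarized feature (token present or absent) and a binarized target; a small self-contained version of that computation, on made-up counts, looks like this:
import math

def mutual_information(n11, n10, n01, n00):
    # n11: feature present and target class; n10: present, other class;
    # n01: absent, target class; n00: absent, other class
    n = float(n11 + n10 + n01 + n00)
    n1_, n0_ = n11 + n10, n01 + n00   # feature marginals
    n_1, n_0 = n11 + n01, n10 + n00   # target marginals
    return (n11 / n * math.log(n * n11 / (n1_ * n_1), 2) +
            n01 / n * math.log(n * n01 / (n0_ * n_1), 2) +
            n10 / n * math.log(n * n10 / (n1_ * n_0), 2) +
            n00 / n * math.log(n * n00 / (n0_ * n_0), 2))

# e.g. a token appearing in 300 of 400 large-city pages but only 100 of 600 others
print(mutual_information(n11=300, n10=100, n01=100, n00=500))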
# CELL 7
import re
import pickle
import random
import bz2
import math
import numpy as np
with open("ch6_cell27_splits.pk", "rb") as pkl:
segments_at = pickle.load(pkl)
boundaries = list(map(lambda x:( int(round(10**x['min'])),
int(round(10**x['val'])),
int(round(10**x['max'])) ), segments_at[5]))
NUM_RE = re.compile('\d?\d?\d?(,?\d{3})+') # at least 3 digits
def cell7_tokenize(text):
tokens = list(filter(lambda x:len(x)>0,
re.sub('\s+',' ', re.sub('[^A-z,0-9]', ' ', text)).split(' ')))
result = list()
for tok in tokens:
if len(tok) > 1 and tok[-1] in set([".", ",", "?", "!", "\"", "'"]):
tok = tok[:-1]
if NUM_RE.fullmatch(tok):
num = int(tok.replace(",",""))
if num < boundaries[0][0]:
result.append("TOKNUMSMALL")
elif num > boundaries[-1][2]:
result.append("TOKNUMBIG")
else:
found = False
for idx, seg in enumerate(boundaries[1:]):
if num < seg[0]:
result.append("TOKNUMSEG" + str(idx))
found = True
break
if not found:
result.append("TOKNUMSEG" + str(len(boundaries) - 1))
else:
result.append(tok.lower())
return result
# read base features
rand = random.Random(42)
city_pop = dict()
with open("ch8_cell1_dev_textlen.tsv") as f:
header = next(f)
for line in f:
fields = line.strip().split("\t")
logpop = float(fields[-1])
name = fields[0]
city_pop[name] = logpop
cities = sorted(list(city_pop.keys()))
# vocabulary
all_vocab = list()
vocab_to_idx = dict()
city_tok_idxs = dict()
remaining = set(city_pop.keys())
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
for byteline in wikitext:
cityline = byteline.decode("utf-8")
tab = cityline.index('\t')
name = cityline[:tab]
if name in remaining:
if (len(cities) - len(remaining)) % int(len(cities) / 10) == 0:
print("Tokenizing {:>5} out of {:>5} cities, city \"{}\""
.format((len(cities) - len(remaining)), len(cities), name))
remaining.remove(name)
text = cityline[tab:]
toks = set()
for token in cell7_tokenize(text):
idx = vocab_to_idx.get(token, None)
if idx is None:
idx = len(all_vocab)
all_vocab.append(token)
vocab_to_idx[token] = idx
toks.add(idx)
city_tok_idxs[name] = sorted(list(toks))
for name in remaining:
city_tok_idxs[name] = list()
print("Total vocabulary: {:,}".format(len(all_vocab)))
# drop tokens that appear in less than 200 documents
tok_docs = list()
for _ in range(len(all_vocab)):
tok_docs.append([])
for doc_idx, name in enumerate(cities):
tok_idxs = city_tok_idxs[name]
for tok_idx in tok_idxs:
tok_docs[tok_idx].append(doc_idx)
city_tok_idxs = None
threshold = 200
reduced_vocab = list()
for tok_idx in range(len(all_vocab)):
if len(tok_docs[tok_idx]) >= threshold:
reduced_vocab.append(tok_idx)
print("Reduced vocabulary: {:,} (reduction {:%})"
.format(len(reduced_vocab), (len(all_vocab) - len(reduced_vocab)) / len(all_vocab)))
ydata = np.array(list(map(lambda c:city_pop[c], cities)))
def cell7_adjudicate(data, segments):
result = list()
for val in data:
idx = None
if val < segments[0]['min']:
idx = 0
elif val > segments[-1]['max']:
idx = len(segments) - 1
else:
for idx2, segment in enumerate(segments):
if segment['min'] <= val and \
(idx2 == len(segments)-1 or val < segments[idx2+1]['min']):
idx = idx2
break
result.append(idx)
return np.array(result)
ydata = cell7_adjudicate(ydata, segments_at[2])
feature_utility = list()
xdata = np.zeros( ydata.shape )
for pos, tok_idx in enumerate(reduced_vocab):
verbose = False
if pos % int(len(reduced_vocab) / 100) == 0:
print("Computing M.I. for {:>6} out of {:>6} tokens, token \"{}\""
.format(pos, len(reduced_vocab), all_vocab[tok_idx]))
#verbose = True
xdata[:] = 0
for idx in tok_docs[tok_idx]:
xdata[idx] = 1.0
# compute confusion table
table = dict()
for row in range(xdata.shape[0]):
feat_val = int(xdata[row])
target_val = int(ydata[row])
if feat_val not in table:
table[feat_val] = dict()
table[feat_val][target_val] = table[feat_val].get(target_val, 0) + 1
feats = set()
for row in table.values():
feats.update(row.keys())
cols = { val: sum(map(lambda x:x.get(val,0), table.values())) for val in feats }
full_table = sum(cols.values())
if verbose:
print("\tTable:\n\t{}\n\tfull_table: {}\n\tCols: {}"
.format(table, full_table, cols))
best_utility = None
for feat_val in table.keys():
for target_val in table[feat_val].keys():
# binarize
n11 = table[feat_val][target_val]
if n11 < 5:
if verbose:
print("\tFor feat_val={}, target_val={}, n11={}, skipping"
.format(feat_val, target_val, n11))
continue
n10 = sum(table[feat_val].values()) - n11
n01 = cols.get(target_val) - n11
n00 = full_table - n11 - n10 - n01
if n10 == 0 or n01 == 0 or n00 == 0:
if verbose:
print("\tFor feat_val={}, target_val={}, n10={} or n01={} or n00={} is zero, skipping"
.format(feat_val, target_val, n10, n01, n00))
continue
n1_ = n11 + n10
n0_ = n01 + n00
n_1 = n11 + n01
n_0 = n10 + n00
n = float(full_table)
utility = n11/n * math.log(n*n11/(n1_*n_1),2) + \
n01 / n * math.log(n*n01/(n0_*n_1), 2) + \
n10 / n * math.log(n*n10/(n1_*n_0), 2) + \
n00 / n * math.log(n*n00/(n0_*n_0), 2)
if best_utility is None or best_utility < utility:
best_utility = utility
if verbose:
print("\tbest_utility: {}".format(best_utility))
if best_utility is not None:
feature_utility.append( (all_vocab[tok_idx], best_utility) )
all_vocab = None # free memory
feature_utility = sorted(feature_utility, key=lambda x:x[1], reverse=True)
PARAM_KEEP_TOP = 1000
with open("ch8_cell7_vocab.tsv", "w") as kept:
for row in feature_utility[:PARAM_KEEP_TOP]:
kept.write("{}\t{}\n".format(*row))
table1 = ("<table><tr><th>Position</th><th>Token</th><th>Utility</th></tr>" +
"\n".join(list(map(lambda r:
"<tr><td>{}</td><td>{}</td><td>{:5.10f}</td></tr>".format(
r[0], r[1][0], r[1][1]),
enumerate(feature_utility[:100])))) +"</table>")
table2 = ("<table><tr><th>Position</th><th>Feat</th><th>Utility</th></tr>" +
"\n".join(list(map(lambda r:
"<tr><td>{}</td><td>{}</td><td>{:5.10f}</td></tr>".format(
r[0], r[1][0], r[1][1]),
enumerate(reversed(feature_utility[-100:]))))) +"</table>")
with open("ch8_cell7_dev_tokens.tsv", "w") as kept:
kept.write("name\t" + "\t".join(map(lambda x:"token=" + x[0],feature_utility[:PARAM_KEEP_TOP]))+"\n")
matrix = np.zeros( (ydata.shape[0], PARAM_KEEP_TOP) )
for idx_tok, row in enumerate(feature_utility[:PARAM_KEEP_TOP]):
tok = row[0]
for idx_doc in tok_docs[vocab_to_idx[tok]]:
matrix[idx_doc, idx_tok] = 1.0
for idx_doc in range(matrix.shape[0]):
kept.write(cities[idx_doc] + "\t" + "\t".join(map(str,matrix[idx_doc,:])) +"\n")
matrix = None
tok_docs = None
vocab_to_idx = None
from IPython.display import HTML, display
display(HTML("<h3>Top 100 tokens by MI</h3>" + table1 +
"<h3>Last 100 tokens by MI</h3>" + table2))
The most informative tokens look quite helpful, particularly words like "capital", "major" or "international". Interestingly, not all the discretized numbers were chosen, but most of them were (18 out of 32), which shows their value. The missing ones fall into ranges that get confused with years. The use of NER [cite] would help here, but I do not want to increase the running time that much by adding it.
I have dropped capitalization and performed some light punctuation removal to reduce the feature set. Further processing is possible by stemming (conflating "city" and "cities") and dropping stop words, but we will see through error analysis whether that is an issue.
Now we can try the new feature vector in Cell #8.
# CELL 8
import random
import bz2
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np
# read base features
rand = random.Random(42)
all_data = list()
city_to_all_data = dict()
header = None
with open("ch8_cell1_dev_textlen.tsv") as f:
header = next(f)
header = header.strip().split("\t")
header.pop(0) # name
header.pop() # population
for line in f:
fields = line.strip().split("\t")
logpop = float(fields[-1])
name = fields[0]
feats = list(map(float,fields[1:-1]))
city_to_all_data[name] = len(all_data)
all_data.append( (feats, logpop, name) )
# add text features
with open("ch8_cell7_dev_tokens.tsv") as feats:
extra_header = next(feats)
extra_header = extra_header.strip().split("\t")
extra_header.pop(0) # name
header.extend(extra_header)
for line in feats:
fields = line.strip().split("\t")
name = fields[0]
all_data[city_to_all_data[name]][0].extend(list(map(float,fields[1:])))
with open("ch8_cell8_dev_feat2.tsv", "w") as feats:
extheader = header.copy()
extheader.insert(0, 'name')
extheader.append('logpop')
feats.write("\t".join(extheader) + "\n")
for row in all_data:
feats.write("{}\t{}\t{}\n".format(row[-1], "\t".join(map(str,row[0])), row[1]))
# split
train_data = list()
test_data = list()
for row in all_data:
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data = None
# train
print("Training on {:,} cities".format(len(xtrain)))
rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(xtrain, ytrain)
ytest_pred = rf.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)
xtrain = None
xtest = None
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell8_rf_feat2.pdf", bbox_inches='tight', dpi=300)
plt.legend()
That is an improvement; let us drill down with error analysis to see what worked and what did not. I will now look at the documents that gained the most from the text features and at the ones that were hurt the most (Cell #9).
# CELL 9
import random
import bz2
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from collections import OrderedDict
# read base features
rand = random.Random(42)
base_data = list()
city_to_base_data = dict()
base_header = None
with open("ch8_cell1_dev_textlen.tsv") as f:
base_header = next(f)
base_header = base_header.strip().split("\t")
base_header.pop(0) # name
base_header.pop() # population
for line in f:
fields = line.strip().split("\t")
logpop = float(fields[-1])
name = fields[0]
feats = list(map(float,fields[1:-1]))
city_to_base_data[name] = len(base_data)
base_data.append( (feats, logpop, name) )
# read text features
mi_data = list()
city_to_mi_data = dict()
mi_header = None
with open("ch8_cell8_dev_feat2.tsv") as mi:
mi_header = next(mi)
mi_header = mi_header.strip().split("\t")
mi_header.pop(0) # name
mi_header.pop() # population
for line in mi:
fields = line.strip().split("\t")
logpop = float(fields[-1])
name = fields[0]
feats = list(map(float,fields[1:-1]))
city_to_mi_data[name] = len(mi_data)
mi_data.append( (feats, logpop, name) )
# split
base_train_data = list()
base_test_data = list()
mi_train_data = list()
mi_test_data = list()
for row in base_data:
if rand.random() < 0.2:
base_test_data.append(row)
mi_test_data.append(mi_data[city_to_mi_data[row[-1]]])
else:
base_train_data.append(row)
mi_train_data.append(mi_data[city_to_mi_data[row[-1]]])
base_data = None
mi_data = None
base_test_data = sorted(base_test_data, key=lambda t:t[1])
mi_test_data = sorted(mi_test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], base_test_data))
base_xtrain = np.array(list(map(lambda t:t[0], base_train_data)))
ytrain = np.array(list(map(lambda t:t[1], base_train_data)))
base_xtest = np.array(list(map(lambda t:t[0], base_test_data)))
ytest = np.array(list(map(lambda t:t[1], base_test_data)))
base_train_data = None
base_test_data = None
mi_xtrain = np.array(list(map(lambda t:t[0], mi_train_data)))
mi_xtest = np.array(list(map(lambda t:t[0], mi_test_data)))
mi_train_data = None
mi_test_data = None
# train
print("Base training on {:,} cities".format(len(ytrain)))
rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(base_xtrain, ytrain)
base_ytest_pred = rf.predict(base_xtest)
base_se = (base_ytest_pred - ytest)**2
print("M.I. training on {:,} cities".format(len(ytrain)))
rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(mi_xtrain, ytrain)
mi_ytest_pred = rf.predict(mi_xtest)
mi_se = (mi_ytest_pred - ytest)**2
# find the bigger winners and losers
se_ytest_diff = base_se - mi_se # small is better, it's error
named_se = list()
for idx in range(se_ytest_diff.shape[0]):
named_se.append( (se_ytest_diff[idx], test_names[idx], idx) )
named_se = sorted(named_se, key=lambda x:x[0], reverse=True)
to_print = OrderedDict()
for idx, winner in enumerate(named_se[:10]):
to_print[winner[1]] = {
'improv' : winner[0],
'base' : int(round(10**base_ytest_pred[winner[2]])),
'mi' : int(round(10**mi_ytest_pred[winner[2]])),
'pop' : int(round(10**ytest[winner[2]])),
'type' : 'winner',
'pos' : idx }
for idx, loser in enumerate(named_se[-10:]):
to_print[loser[1]] = {
'improv' : loser[0],
'base' : int(round(10**base_ytest_pred[loser[2]])),
'mi' : int(round(10**mi_ytest_pred[loser[2]])),
'pop' : int(round(10**ytest[loser[2]])),
'type' : 'loser',
'pos' : (9-idx)}
kept_terms = set(map(lambda l:l.split('\t')[0], open("ch8_cell7_vocab.tsv").readlines()))
base_xtrain = None
base_xtest = None
mi_xtrain = None
mi_xtest = None
htmls = [""] * 20
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
for byteline in wikitext:
cityline = byteline.decode("utf-8")
tab = cityline.index('\t')
name = cityline[:tab]
if name in to_print:
text = cityline[tab:]
tokens = list(filter(lambda tok: tok in kept_terms, cell7_tokenize(text)))
text = text.replace('\t','<p>')
entry = to_print[name]
this_html = ("<h1>Top {} {}: {}</h1>"+
"<h2>Change: {:1.5} (base: {:,}, MI: {:,}). Population: {:,}. Text length: {:,}</h2>{}"+
"<h2>Tokens (length {:,})</h2>{}") \
.format((entry['pos'] + 1), entry['type'], name[1:-1], entry['improv'], entry['base'], entry['mi'],
entry['pop'], len(text), text[:1000], len(tokens), tokens[:100])
if entry['type'] == 'winner':
htmls[entry['pos']] = this_html
else:
htmls[10+entry['pos']] = this_html
html = "".join(htmls)
from IPython.display import HTML, display
display(HTML(html))
Analysis:
City | Comments |
---|---|
Bailadores | 'town' does it |
Villa Alvarez | no text |
Koro | 'agriculture' is most probably linked to smaller places |
Curug | we got a toknumseg6 for the 2010 census and then the correct toknumseg30; the seg6 does it |
Delgado | no usable terms |
Madina | the 'population toknumseg30' pattern should be captured by stop-word removal plus bigrams |
Banha | I think that if 'cities' were 'city' it would work |
Dunmore | small hamlet with lots of info in Wikipedia; the main signal, 'village', is not there |
Demsa | the 'population toknumseg30' pattern should do its magic |
Xuanwu | no idea; 'capital' ought to have worked |
'census' appears in 21,414 cities but is not picked among the top 1,000, while plenty of stop words are; better to clean up the vocabulary and also conflate variants to see whether we can accommodate more useful terms.
'village' appears in 13,998 cities but had an M.I. of 0.0025583964 (compare the top M.I. of 0.1108237610 for 'city') and was in the bottom 100, at position 2,881. It should have been selected; it might be that with a 4-way split of the target it is not definite enough.
Conclusion: conflate and filter the terms and/or expand the list until 'census' and 'village' are added. Look into bigrams, then skip-grams. Let us start with filtering stop words and doing stemming to include 'census' and 'village' (Cell #10).
In some domains, it is useful to reduce the number of features by conflating the morphological variants of words. For example, if we believe the word "prior" is useful in our domain, its plural variant "priors" might be equally useful but rarer. If we conflate both terms into the same feature, we could obtain better performance.
Larger text samples are needed to profit from this approach, though.
To obtain morphological roots for words, we can use a dictionary of root forms (a "lemmatizer" approach) or a simpler approximation (a "stemmer" approach).
We will use a stemming approach using an implementation of the Porter stemmer (Cell #10).
A common feature selection technique in natural language processing for classification tasks is to drop a small set of highly frequent function words with little semantic content. This is called "stop word removal", an approach shared with information retrieval. We will use the Snowball list of stop words.
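A minimal sketch of both steps, using the same stemming.porter2 package as Cell #10; the in-line stop-word set here is a tiny stand-in for the Snowball list read from stop.txt below.
from stemming.porter2 import stem as porter_stem

TINY_STOPWORDS = {"the", "a", "of", "is", "in", "and"}  # illustrative stand-in

def normalize(tokens):
    # drop stop words, then stem what remains
    return [porter_stem(tok.lower()) for tok in tokens
            if tok.lower() not in TINY_STOPWORDS]

print(normalize("the population of the cities and villages".split()))
# expected something like: ['popul', 'citi', 'villag']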
# CELL 10
import re
import pickle
import random
import bz2
import math
import numpy as np
from stemming.porter2 import stem as porter_stem
with open("ch6_cell27_splits.pk", "rb") as pkl:
segments_at = pickle.load(pkl)
boundaries = list(map(lambda x:( int(round(10**x['min'])),
int(round(10**x['val'])),
int(round(10**x['max'])) ), segments_at[5]))
stopwords = set()
with open("stop.txt") as s:
for line in s:
if '|' in line:
line = line[:line.index('|')]
line = line.strip()
if len(line) > 0:
stopwords.add(line)
NUM_RE = re.compile('\d?\d?\d?(,?\d{3})+') # at least 3 digits
def cell10_tokenize(text):
tokens = list(filter(lambda x: len(x)>0 and x not in stopwords,
map(lambda x: x.lower(),
re.sub('\s+',' ', re.sub('[^A-z,0-9]', ' ', text)).split(' '))))
result = list()
for tok in tokens:
if len(tok) > 1 and tok[-1] == ',':
tok = tok[:-1]
if NUM_RE.fullmatch(tok):
num = int(tok.replace(",",""))
if num < boundaries[0][0]:
result.append("TOKNUMSMALL")
elif num > boundaries[-1][2]:
result.append("TOKNUMBIG")
else:
found = False
for idx, seg in enumerate(boundaries[1:]):
if num < seg[0]:
result.append("TOKNUMSEG" + str(idx))
found = True
break
if not found:
result.append("TOKNUMSEG" + str(len(boundaries) - 1))
else:
result.append(porter_stem(tok))
return result
# read base features
rand = random.Random(42)
city_pop = dict()
with open("ch8_cell1_dev_textlen.tsv") as f:
header = next(f)
for line in f:
fields = line.strip().split("\t")
logpop = float(fields[-1])
name = fields[0]
city_pop[name] = logpop
cities = sorted(list(city_pop.keys()))
# vocabulary
all_vocab = list()
vocab_to_idx = dict()
city_tok_idxs = dict()
remaining = set(city_pop.keys())
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
for byteline in wikitext:
cityline = byteline.decode("utf-8")
tab = cityline.index('\t')
name = cityline[:tab]
if name in remaining:
if (len(cities) - len(remaining)) % int(len(cities) / 10) == 0:
print("Tokenizing {:>5} out of {:>5} cities, city \"{}\""
.format((len(cities) - len(remaining)), len(cities), name))
remaining.remove(name)
text = cityline[tab:]
toks = set()
for token in cell10_tokenize(text):
idx = vocab_to_idx.get(token, None)
if idx is None:
idx = len(all_vocab)
all_vocab.append(token)
vocab_to_idx[token] = idx
toks.add(idx)
city_tok_idxs[name] = sorted(list(toks))
for name in remaining:
city_tok_idxs[name] = list()
print("Total vocabulary: {:,}".format(len(all_vocab)))
# drop tokens that appear in less than 200 documents
tok_docs = list()
for _ in range(len(all_vocab)):
tok_docs.append([])
for doc_idx, name in enumerate(cities):
tok_idxs = city_tok_idxs[name]
for tok_idx in tok_idxs:
tok_docs[tok_idx].append(doc_idx)
city_tok_idxs = None
threshold = 200
reduced_vocab = list()
for tok_idx in range(len(all_vocab)):
if len(tok_docs[tok_idx]) >= threshold:
reduced_vocab.append(tok_idx)
print("Reduced vocabulary: {:,} (reduction {:%})"
.format(len(reduced_vocab), (len(all_vocab) - len(reduced_vocab)) / len(all_vocab)))
ydata = np.array(list(map(lambda c:city_pop[c], cities)))
# use more classes here to see if we can pick 'village'
ydata = cell7_adjudicate(ydata, segments_at[4])
feature_utility = list()
xdata = np.zeros( ydata.shape )
for pos, tok_idx in enumerate(reduced_vocab):
verbose = False
if pos % int(len(reduced_vocab) / 100) == 0:
print("Computing M.I. for {:>6} out of {:>6} tokens, token \"{}\""
.format(pos, len(reduced_vocab), all_vocab[tok_idx]))
#verbose = True
xdata[:] = 0
for idx in tok_docs[tok_idx]:
xdata[idx] = 1.0
# compute confusion table
table = dict()
for row in range(xdata.shape[0]):
feat_val = int(xdata[row])
target_val = int(ydata[row])
if feat_val not in table:
table[feat_val] = dict()
table[feat_val][target_val] = table[feat_val].get(target_val, 0) + 1
feats = set()
for row in table.values():
feats.update(row.keys())
cols = { val: sum(map(lambda x:x.get(val,0), table.values())) for val in feats }
full_table = sum(cols.values())
if verbose:
print("\tTable:\n\t{}\n\tfull_table: {}\n\tCols: {}"
.format(table, full_table, cols))
best_utility = None
for feat_val in table.keys():
for target_val in table[feat_val].keys():
# binarize
n11 = table[feat_val][target_val]
if n11 < 5:
if verbose:
print("\tFor feat_val={}, target_val={}, n11={}, skipping"
.format(feat_val, target_val, n11))
continue
n10 = sum(table[feat_val].values()) - n11
n01 = cols.get(target_val) - n11
n00 = full_table - n11 - n10 - n01
if n10 == 0 or n01 == 0 or n00 == 0:
if verbose:
print("\tFor feat_val={}, target_val={}, n10={} or n01={} or n00={} is zero, skipping"
.format(feat_val, target_val, n10, n01, n00))
continue
n1_ = n11 + n10
n0_ = n01 + n00
n_1 = n11 + n01
n_0 = n10 + n00
n = float(full_table)
utility = n11/n * math.log(n*n11/(n1_*n_1),2) + \
n01 / n * math.log(n*n01/(n0_*n_1), 2) + \
n10 / n * math.log(n*n10/(n1_*n_0), 2) + \
n00 / n * math.log(n*n00/(n0_*n_0), 2)
if best_utility is None or best_utility < utility:
best_utility = utility
if verbose:
print("\tbest_utility: {}".format(best_utility))
if best_utility is not None:
feature_utility.append( (all_vocab[tok_idx], best_utility) )
all_vocab = None # free memory
feature_utility = sorted(feature_utility, key=lambda x:x[1], reverse=True)
PARAM_KEEP_TOP = 1000
with open("ch8_cell10_vocab.tsv", "w") as kept:
for row in feature_utility[:PARAM_KEEP_TOP]:
kept.write("{}\t{}\n".format(*row))
table1 = ("<table><tr><th>Position</th><th>Stem</th><th>Utility</th></tr>" +
"\n".join(list(map(lambda r:
"<tr><td>{}</td><td>{}</td><td>{:5.10f}</td></tr>".format(
r[0], r[1][0], r[1][1]),
enumerate(feature_utility[:100])))) +"</table>")
table2 = ("<table><tr><th>Position</th><th>Stem</th><th>Utility</th></tr>" +
"\n".join(list(map(lambda r:
"<tr><td>{}</td><td>{}</td><td>{:5.10f}</td></tr>".format(
r[0], r[1][0], r[1][1]),
enumerate(reversed(feature_utility[-100:]))))) +"</table>")
with open("ch8_cell10_dev_tokens.tsv", "w") as kept:
kept.write("name\t" + "\t".join(map(lambda x:"token=" + x[0],feature_utility[:PARAM_KEEP_TOP]))+"\n")
matrix = np.zeros( (ydata.shape[0], PARAM_KEEP_TOP) )
for idx_tok, row in enumerate(feature_utility[:PARAM_KEEP_TOP]):
tok = row[0]
for idx_doc in tok_docs[vocab_to_idx[tok]]:
matrix[idx_doc, idx_tok] = 1.0
for idx_doc in range(matrix.shape[0]):
kept.write(cities[idx_doc] + "\t" + "\t".join(map(str,matrix[idx_doc,:])) +"\n")
matrix = None
tok_docs = None
vocab_to_idx = None
from IPython.display import HTML, display
display(HTML("<h3>Top 100 tokens by MI</h3>" + table1 +
"<h3>Last 100 tokens by MI</h3>" + table2))
Tokenization now takes much more time. This is the reason NLP is usually done in batch over multiple machines, using frameworks such as Apache UIMA [LINK] or Spark NLP. Adding NER would make it even slower.
Also, as I add more complexity, the results become more difficult to interpret (what type of tokens does the stem 'civil' capture? 'civilization'? 'civilized'? Intriguing.)
'village' is still not picked (it ranks 1,647th). (Expanding to 2,000 terms was tried but did not help.)
I will now redo the training in Cell #11.
# CELL 11
import random
import bz2
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np
# read base features
rand = random.Random(42)
all_data = list()
city_to_all_data = dict()
header = None
with open("ch8_cell1_dev_textlen.tsv") as f:
header = next(f)
header = header.strip().split("\t")
header.pop(0) # name
header.pop() # population
for line in f:
fields = line.strip().split("\t")
logpop = float(fields[-1])
name = fields[0]
feats = list(map(float,fields[1:-1]))
city_to_all_data[name] = len(all_data)
all_data.append( (feats, logpop, name) )
# add text features
with open("ch8_cell10_dev_tokens.tsv") as feats:
extra_header = next(feats)
extra_header = extra_header.strip().split("\t")
extra_header.pop(0) # name
header.extend(extra_header)
for line in feats:
fields = line.strip().split("\t")
name = fields[0]
all_data[city_to_all_data[name]][0].extend(list(map(float,fields[1:])))
with open("ch8_cell11_dev_feat3.tsv", "w") as feats:
extheader = header.copy()
extheader.insert(0, 'name')
extheader.append('logpop')
feats.write("\t".join(extheader) + "\n")
for row in all_data:
feats.write("{}\t{}\t{}\n".format(row[-1], "\t".join(map(str,row[0])), row[1]))
# split
train_data = list()
test_data = list()
for row in all_data:
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data = None
# train
print("Training on {:,} cities".format(len(xtrain)))
rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(xtrain, ytrain)
ytest_pred = rf.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)
xtrain = None
xtest = None
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell11_rf_feat3.pdf", bbox_inches='tight', dpi=300)
plt.legend()
At RMSE 0.3267, we're getting closer to the best performance of Chapter 6.
And now for the error analysis as before (Cell #12).
# CELL 12
import random
import bz2
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np
# read base features
rand = random.Random(42)
base_data = list()
city_to_base_data = dict()
base_header = None
with open("ch8_cell1_dev_textlen.tsv") as f:
base_header = next(f)
base_header = base_header.strip().split("\t")
base_header.pop(0) # name
base_header.pop() # population
for line in f:
fields = line.strip().split("\t")
logpop = float(fields[-1])
name = fields[0]
feats = list(map(float,fields[1:-1]))
city_to_base_data[name] = len(base_data)
base_data.append( (feats, logpop, name) )
# read text features
mi_data = list()
city_to_mi_data = dict()
mi_header = None
with open("ch8_cell11_dev_feat3.tsv") as mi:
mi_header = next(mi)
mi_header = mi_header.strip().split("\t")
mi_header.pop(0) # name
mi_header.pop() # population
for line in mi:
fields = line.strip().split("\t")
logpop = float(fields[-1])
name = fields[0]
feats = list(map(float,fields[1:-1]))
city_to_mi_data[name] = len(mi_data)
mi_data.append( (feats, logpop, name) )
# split
base_train_data = list()
base_test_data = list()
mi_train_data = list()
mi_test_data = list()
for row in base_data:
if rand.random() < 0.2:
base_test_data.append(row)
mi_test_data.append(mi_data[city_to_mi_data[row[-1]]])
else:
base_train_data.append(row)
mi_train_data.append(mi_data[city_to_mi_data[row[-1]]])
base_data = None
mi_data = None
base_test_data = sorted(base_test_data, key=lambda t:t[1])
mi_test_data = sorted(mi_test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], base_test_data))
base_xtrain = np.array(list(map(lambda t:t[0], base_train_data)))
ytrain = np.array(list(map(lambda t:t[1], base_train_data)))
base_xtest = np.array(list(map(lambda t:t[0], base_test_data)))
ytest = np.array(list(map(lambda t:t[1], base_test_data)))
base_train_data = None
base_test_data = None
mi_xtrain = np.array(list(map(lambda t:t[0], mi_train_data)))
mi_xtest = np.array(list(map(lambda t:t[0], mi_test_data)))
mi_train_data = None
mi_test_data = None
# train
print("Base training on {:,} cities".format(len(base_xtrain)))
rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(base_xtrain, ytrain)
base_ytest_pred = rf.predict(base_xtest)
base_se = (base_ytest_pred - ytest)**2
print("M.I. training on {:,} cities".format(len(mi_xtrain)))
rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(mi_xtrain, ytrain)
mi_ytest_pred = rf.predict(mi_xtest)
mi_se = (mi_ytest_pred - ytest)**2
# find the bigger winners and losers
se_ytest_diff = base_se - mi_se # small is better, it's error
named_se = list()
for idx in range(se_ytest_diff.shape[0]):
named_se.append( (se_ytest_diff[idx], test_names[idx], idx) )
named_se = sorted(named_se, key=lambda x:x[0], reverse=True)
to_print = dict()
for idx, winner in enumerate(named_se[:10]):
to_print[winner[1]] = {
'improv' : winner[0],
'base': int(round(10**base_ytest_pred[winner[2]])),
'mi': int(round(10**mi_ytest_pred[winner[2]])),
'pop': int(round(10**ytest[winner[2]])),
'type': 'winner',
'pos': idx }
for idx, loser in enumerate(named_se[-10:]):
to_print[loser[1]] = {
'improv' : loser[0],
'base': int(round(10**base_ytest_pred[loser[2]])),
'mi': int(round(10**mi_ytest_pred[loser[2]])),
'pop': int(round(10**ytest[loser[2]])),
'type': 'loser',
'pos': (9-idx)}
kept_terms = set(map(lambda l:l.split('\t')[0], open("ch8_cell10_vocab.tsv").readlines()))
base_xtrain = None
base_xtest = None
mi_xtrain = None
mi_xtest = None
htmls = [""] * 20
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
for byteline in wikitext:
cityline = byteline.decode("utf-8")
tab = cityline.index('\t')
name = cityline[:tab]
if name in to_print:
text = cityline[tab:]
tokens = list(filter(lambda tok: tok in kept_terms, cell10_tokenize(text)))
text = text.replace('\t','<p>')
entry = to_print[name]
this_html = ("<h1>Top {} {}: {}</h1>"+
"<h2>Change: {:1.5} (base: {:,}, MI: {:,}). Population: {:,}. Text length: {:,}</h2>{}"+
"<h2>Tokens (length {:,})</h2>{}") \
.format((entry['pos'] + 1), entry['type'], name[1:-1], entry['improv'], entry['base'],
entry['mi'], entry['pop'], len(text), text[:1000], len(tokens), tokens[:100])
if entry['type'] == 'winner':
htmls[entry['pos']] = this_html
else:
htmls[10+entry['pos']] = this_html
html = "".join(htmls)
from IPython.display import HTML, display
display(HTML(html))
We can see some new winners and losers, but many repeats. Now, in the token sequences we see things like 'popul' 'toknumseg30' that ought to inform about the overall size, if the ML were made aware that these words are contiguous, which brings us to the concept of bigrams and the fourth featurization.
To incorporate some ordering among the words, a common technique is to use bigrams, pairs of consecutive words. Using bigrams directly would increase the vocabulary size quite a bit, so I will threshold their minimum number of occurrences (Cell #13).
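As a small sketch of the bigram extraction itself (Cell #13 does the same over the M.I.-filtered tokens, using a '[PAD]' start marker):
def bigrams(tokens):
    # set of adjacent token pairs, with a start-of-document marker
    result = set()
    prev = '[PAD]'
    for tok in tokens:
        result.add(prev + '-' + tok)
        prev = tok
    return result

print(sorted(bigrams(['popul', 'toknumseg30', 'household'])))
# ['[PAD]-popul', 'popul-toknumseg30', 'toknumseg30-household']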
# CELL 13
import random
import bz2
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np
# read text features
rand = random.Random(42)
all_data = list()
city_to_all_data = dict()
header = None
with open("ch8_cell11_dev_feat3.tsv") as mi:
header = next(mi)
header = header.strip().split("\t")
header.pop(0) # name
header.pop() # population
for line in mi:
fields = line.strip().split("\t")
logpop = float(fields[-1])
name = fields[0]
feats = list(map(float,fields[1:-1]))
city_to_all_data[name] = len(all_data)
all_data.append( (feats, logpop, name) )
cities = sorted(list(city_to_all_data.keys()))
kept_terms = set(map(lambda l:l.split('\t')[0], open("ch8_cell10_vocab.tsv").readlines()))
remaining = set(cities)
all_bigrams = list()
bigram_to_idx = dict()
city_bigram_idxs = dict()
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
for byteline in wikitext:
cityline = byteline.decode("utf-8")
tab = cityline.index('\t')
name = cityline[:tab]
if name in remaining:
if (len(cities) - len(remaining)) % int(len(cities) / 10) == 0:
print("Tokenizing+bigrams {:>5} out of {:>5} cities, bigrams {:,} city \"{}\""
.format((len(cities) - len(remaining)), len(cities), len(all_bigrams), name))
remaining.remove(name)
text = cityline[tab:]
bigrams = set()
prev = '[PAD]'
for token in list(filter(lambda tok: tok in kept_terms, cell10_tokenize(text))):
bigram = prev + '-' + token
prev = token
idx = bigram_to_idx.get(bigram, None)
if idx is None:
idx = len(all_bigrams)
all_bigrams.append(bigram)
bigram_to_idx[bigram] = idx
bigrams.add(idx)
city_bigram_idxs[name] = sorted(list(bigrams))
bigram_to_idx = None
for name in remaining:
city_bigram_idxs[name] = list()
print("Total bigrams: {:,}".format(len(all_bigrams)))
# drop bigrams that appear in less than 50 documents
bigram_docs = list()
for _ in range(len(all_bigrams)):
bigram_docs.append([])
for doc_idx, name in enumerate(cities):
bigram_idxs = city_bigram_idxs[name]
for bigram_idx in bigram_idxs:
bigram_docs[bigram_idx].append(doc_idx)
city_bigram_idxs = None
threshold = 50
reduced_bigrams = list()
for bigram_idx in range(len(all_bigrams)):
if len(bigram_docs[bigram_idx]) >= threshold:
reduced_bigrams.append(bigram_idx)
print("Reduced bigrams: {:,} (reduction {:%})"
.format(len(reduced_bigrams), (len(all_bigrams) - len(reduced_bigrams)) / len(all_bigrams)))
matrix = np.zeros( (len(cities), len(reduced_bigrams)) )
for idx, bigram_idx in enumerate(reduced_bigrams):
header.append("bigram=" + all_bigrams[bigram_idx])
for idx_doc in bigram_docs[bigram_idx]:
matrix[idx_doc, idx] = 1.0
for idx_doc in range(len(cities)):
all_data[city_to_all_data[cities[idx_doc]]][0].extend(matrix[idx_doc,:])
bigram_docs = None
matrix = None
with open("ch8_cell13_dev_feat4.tsv", "w") as f:
f.write("name\t" + "\t".join(header)+"\tlogpop\n")
for idx_doc in range(len(cities)):
name = cities[idx_doc]
entry = all_data[city_to_all_data[name]]
f.write("{}\t{}\t{}\n".format(name, "\t".join(map(str,entry[0])), entry[1]))
# split
train_data = list()
test_data = list()
for row in all_data:
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
all_data = None # free memory
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data = None
# train
print("Training on {:,} cities".format(len(xtrain)))
rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(xtrain, ytrain)
ytest_pred = rf.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)
# free memory
all_bigrams = None
xtrain = None
xtest = None
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell13_rf_feat4.pdf", bbox_inches='tight', dpi=300)
plt.legend()
That did worse. Moreover, the bigram we were hoping to capture (the word "population" next to a number segment token) is not among the kept bigrams for any of the number segments. Let us try skip-bigrams with hash encoding instead.
I will combine skip-bigrams with feature hashing to reduce the number of bigrams to a manageable size (Cell #14).
For the hashing function, we will use a pure-Python implementation of FNV-1a by default; Python's built-in hash has been salted per process since version 3.3, so unless you pin the PYTHONHASHSEED environment variable (a bad idea), every run would produce different features.
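As a side note (not used in Cell #14), scikit-learn ships a FeatureHasher based on MurmurHash3, which is also stable across runs; a minimal sketch of hashing skip-bigrams with it, assuming the same 3,000-bucket target space, follows.
# Sketch (an alternative to the hand-rolled hash in Cell #14, not the book's code):
# map skip-bigram strings into a fixed-size vector with scikit-learn's FeatureHasher
from sklearn.feature_extraction import FeatureHasher
def skip_bigrams(tokens, skip=6, pad='[PAD]'):
    # pair each token with the previous `skip` tokens, as Cell #14 does
    prev = [ pad ] * skip
    for token in tokens:
        for p in prev:
            yield p + '-' + token
        prev.pop(0)
        prev.append(token)
hasher = FeatureHasher(n_features=3000, input_type='string', alternate_sign=False)
X = hasher.transform([ list(skip_bigrams("its population was over 2 million".split())) ])
print(X.shape, X.nnz)  # one 3,000-column sparse row; colliding skip-bigrams share a bucket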
# CELL 14
import random
import bz2
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np
PARAM_HASH_SIZE=3000
PARAM_SKIP_SIZE=6
PARAM_STABLE_HASHES=True
def cell14_hash(x):
hashed = 0
if PARAM_STABLE_HASHES:
# pure python FNV1a
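# note: without masking to 64 bits, Python's arbitrary-precision integers keep
# growing in the loop below; the result is still deterministic, just slower than
# (and numerically different from) standard 64-bit FNV-1a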
if x:
hashed = 14695981039346656037
for c in x:
hashed = hashed ^ ord(c)
hashed = (hashed * 1099511628211)
else:
# Python 3.3+ uses SipHash-2-4, which is better and whose C implementation is
# faster, but it is salted: unless you set the PYTHONHASHSEED environment
# variable to 0 (a bad idea), every run will produce different hashes
hashed = hash(x)
return abs(hashed) % PARAM_HASH_SIZE
# read text features
rand = random.Random(42)
all_data = list()
city_to_all_data = dict()
header = None
with open("ch8_cell11_dev_feat3.tsv") as mi:
header = next(mi)
header = header.strip().split("\t")
header.pop(0) # name
header.pop() # population
for line in mi:
fields = line.strip().split("\t")
logpop = float(fields[-1])
name = fields[0]
feats = list(map(float,fields[1:-1]))
city_to_all_data[name] = len(all_data)
all_data.append( (feats, logpop, name) )
cities = sorted(list(city_to_all_data.keys()))
kept_terms = set(map(lambda l:l.split('\t')[0], open("ch8_cell10_vocab.tsv").readlines()))
remaining = set(cities)
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
for byteline in wikitext:
cityline = byteline.decode("utf-8")
tab = cityline.index('\t')
name = cityline[:tab]
if name in remaining:
if (len(cities) - len(remaining)) % int(len(cities) / 10) == 0:
print("Tokenizing+skip-bigrams {:>5} out of {:>5} cities, city \"{}\""
.format((len(cities) - len(remaining)), len(cities), name))
remaining.remove(name)
text = cityline[tab:]
prev = [ '[PAD]' ] * PARAM_SKIP_SIZE
feats = [ 0.0 ] * PARAM_HASH_SIZE
for token in list(filter(lambda tok: tok in kept_terms, cell10_tokenize(text))):
for skip in prev:
bigram = skip + '-' + token
feats[cell14_hash(bigram)] = 1.0
prev.pop(0)
prev.append(token)
all_data[city_to_all_data[name]][0].extend(feats)
for name in remaining:
all_data[city_to_all_data[name]][0].extend([ 0.0 ] * PARAM_HASH_SIZE)
for idx in range(PARAM_HASH_SIZE):
header.append("hashed_skip_bigram#" + str(idx))
with open("ch8_cell14_dev_feat5.tsv", "w") as f:
f.write("name\t" + "\t".join(header)+"\tlogpop\n")
for idx_doc in range(len(cities)):
name = cities[idx_doc]
entry = all_data[city_to_all_data[name]]
f.write("{}\t{}\t{}\n".format(name, "\t".join(map(str,entry[0])), entry[1]))
# split
train_data = list()
test_data = list()
for row in all_data:
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
all_data = None # free memory
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data = None
# train
print("Training on {:,} cities".format(len(xtrain)))
rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(xtrain, ytrain)
ytest_pred = rf.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)
xtrain = None
xtest = None
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell14_rf_feat5.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Finally, we explore using word embeddings (Cell #15). Because the embeddings and TF*IDF scores might overfit, I will compute them on the training set only.
To use these embeddings, we take the weighted average embedding over the whole document. We could also take the per-coordinate maximum and minimum over all token embeddings in the document; Cell #15 only uses the weighted average, but a pooling variant is sketched below.
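A minimal sketch of such per-coordinate pooling (my own illustration, not part of Cell #15; tok2vec stands for a dictionary from token to embedding vector like the one trained below):
# Sketch (not in Cell #15): mean/max/min pooling of token embeddings per coordinate
import numpy as np
def pool_embeddings(tokens, tok2vec):
    vecs = np.array([ tok2vec[tok] for tok in tokens if tok in tok2vec ])
    if len(vecs) == 0:
        dim = len(next(iter(tok2vec.values())))
        return np.zeros(3 * dim)
    # concatenate the mean, max and min over the token axis
    return np.concatenate([ vecs.mean(axis=0), vecs.max(axis=0), vecs.min(axis=0) ])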
Instead of using raw counts, we can perform a traditional feature weighting employed in NLP/IR: adjust the counts by the inverse of the frequency of the word type over the corpus.
Therefore, we replace the Term Frequency (term is the IR synonym for word type) in the document with TF times the Inverse Document Frequency (IDF). To have better informed statistics, we could compute the IDF counts on a larger corpus (e.g., the full Wikipedia dump); in this example we use the training set.
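Concretely, the weighting computed in Cell #15 boils down to idf(t) = log(1 + N / df(t)), where N is the number of training documents and df(t) the number of them containing term t. A compact sketch of that computation:
# Sketch of the IDF computation performed in Cell #15:
# idf(t) = log(1 + N / df(t)) over the N training documents
import math
def compute_idfs(tokenized_docs):
    df = dict()
    for doc in tokenized_docs:
        for tok in set(doc):  # count each document at most once per term
            df[tok] = df.get(tok, 0) + 1
    return { tok: math.log(1 + len(tokenized_docs) / count) for tok, count in df.items() }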
# CELL 15
import re
import pickle
import random
import bz2
import math
from collections import OrderedDict
import numpy as np
import gensim
from sklearn.ensemble import RandomForestRegressor
from stemming.porter2 import stem as porter_stem
with open("ch6_cell27_splits.pk", "rb") as pkl:
segments_at = pickle.load(pkl)
boundaries = list(map(lambda x:( int(round(10**x['min'])),
int(round(10**x['val'])),
int(round(10**x['max'])) ), segments_at[5]))
NUM_RE = re.compile(r'\d?\d?\d?(,?\d{3})+') # at least 3 digits
def cell15_tokenize(text):
tokens = list(filter(lambda x: len(x)>0,
map(lambda x: x.lower(),
re.sub(r'\s+',' ', re.sub('[^A-z,0-9]', ' ', text)).split(' '))))
result = list()
for tok in tokens:
if len(tok) > 1 and tok[-1] == ',':
tok = tok[:-1]
if NUM_RE.fullmatch(tok):
num = int(tok.replace(",",""))
if num < boundaries[0][0]:
result.append("TOKNUMSMALL")
elif num > boundaries[-1][2]:
result.append("TOKNUMBIG")
else:
found = False
for idx, seg in enumerate(boundaries[1:]):
if num < seg[0]:
result.append("TOKNUMSEG" + str(idx))
found = True
break
if not found:
result.append("TOKNUMSEG" + str(len(boundaries) - 1))
else:
result.append(porter_stem(tok))
return result
# read text features
rand = random.Random(42)
all_data = list()
city_to_all_data = dict()
header = None
with open("ch8_cell11_dev_feat3.tsv") as mi:
header = next(mi)
header = header.strip().split("\t")
header.pop(0) # name
header.pop() # population
for line in mi:
fields = line.strip().split("\t")
logpop = float(fields[-1])
name = fields[0]
feats = list(map(float,fields[1:-1]))
city_to_all_data[name] = len(all_data)
all_data.append( (feats, logpop, name) )
cities = sorted(list(city_to_all_data.keys()))
# tokenize documents
city_tokenized = OrderedDict()
PARAM_COMPUTE_NEW_TOKENS = True
if PARAM_COMPUTE_NEW_TOKENS:
remaining = set(city_to_all_data.keys())
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
for byteline in wikitext:
cityline = byteline.decode("utf-8")
tab = cityline.index('\t')
name = cityline[:tab]
if name in remaining:
if (len(cities) - len(remaining)) % int(len(cities) / 10) == 0:
print("Tokenizing {:>5} out of {:>5} cities, city \"{}\""
.format((len(cities) - len(remaining)), len(cities), name))
remaining.remove(name)
text = cityline[tab:]
city_tokenized[name] = cell15_tokenize(text)
for name in remaining:
city_tokenized[name] = list()
print("Saving tokens...")
with open("ch8_cell15_tokens.txt", "w") as f:
for city, tokens in city_tokenized.items():
f.write("{}\t{}\n".format(city, " ".join(tokens)))
else:
print("Reading tokens...")
with open("ch8_cell15_tokens.txt", "r") as f:
for line in f:
(city, toks) = line.split("\t")
city_tokenized[city] = toks.split(" ")
# split
train_data = list()
test_data = list()
for row in all_data:
if rand.random() < 0.2:
test_data.append(row)
else:
train_data.append(row)
all_data = None # free memory
train_cities = list(map(lambda x:x[-1], train_data))
tokenized_train_docs = list(map(lambda city:city_tokenized[city], train_cities))
print("Saving train split...")
with open("ch8_cell15_train_cities.txt", "w") as f:
for city in train_cities:
f.write("{}\n".format(city))
PARAM_EMBEDDING_SIZE = 50
print("Training embeddings of size {}".format(PARAM_EMBEDDING_SIZE))
model = gensim.models.Word2Vec(tokenized_train_docs, size=PARAM_EMBEDDING_SIZE)
tok2vec = dict(zip(model.wv.index2word, model.wv.vectors))
print("Trained ", len(tok2vec), " embeddings")
print("Saving embeddings...")
with open("ch8_cell15_embeddings.tsv", "w") as f:
for token in model.wv.vocab.keys():
f.write("{}\t{}\n".format(token, "\t".join(map(str,model.wv[token]))))
# compute idfs
df_tok = OrderedDict()
for tok_doc in tokenized_train_docs:
seen = set()
for tok in tok_doc:
if tok not in seen:
df_tok[tok] = df_tok.get(tok, 0) + 1
seen.add(tok)
idf_tok = OrderedDict()
for tok in df_tok:
idf_tok[tok] = math.log(1 + len(tokenized_train_docs) * 1.0 / df_tok[tok])
print("Computed {:,} IDFs".format(len(idf_tok)))
print("Saving idfs...")
with open("ch8_cell15_idfs.tsv", "w") as f:
for tok, idf in idf_tok.items():
f.write("{}\t{}\n".format(tok, idf))
# plot
PARAM_PLOT_TSNE = True
if PARAM_PLOT_TSNE:
print("Computing t-SNE...")
from sklearn.manifold import TSNE
vectors = []
words = list(model.wv.vocab.keys())
rand.shuffle(words)
for word in words:
vectors.append(model.wv[word])
tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=500, random_state=23)
projected = tsne_model.fit_transform(vectors)
x = []
y = []
for t in projected:
x.append(t[0])
y.append(t[1])
print("Saving t-SNE points...")
with open("ch8_cell15_tsne.tsv", "w") as embed:
for idx in range(len(words)):
embed.write("{}\t{}\t{}\n".format(words[idx], x[idx], y[idx]))
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure()
plt.rcParams['figure.figsize'] = [20, 20]
plotted_count = 0
plotted_section = set()
# plot a meaningful, visible sample
for idx in range(len(x)):
if df_tok[words[idx]] < 200:
continue # ensure meaningful
section = str(int(x[idx] * 10 * 4)) + "-" + str(int(y[idx] * 10 * 4))
if section in plotted_section:
continue # ensure visible
plotted_section.add(section)
plotted_count += 1
plt.scatter(x[idx] ,y[idx])
plt.annotate(words[idx], xy=(x[idx], y[idx]), xytext=(5, 2),
textcoords='offset points', ha='right', va='bottom')
if plotted_count > 150:
break # ensure visible
plt.savefig("ch8_cell15_tsne.pdf", bbox_inches='tight', dpi=300)
# encode train_data and test_data
def encode(toks):
result = np.zeros( (PARAM_EMBEDDING_SIZE,) )
good_toks = list(filter(lambda t:t in tok2vec, toks))
for tok in good_toks:
tok_vect_scaled = np.copy(tok2vec[tok])
tok_vect_scaled *= idf_tok[tok] / len(good_toks)
result += tok_vect_scaled
return result
for data in (train_data, test_data):
for row in data:
name = row[-1]
row[0].extend(encode(city_tokenized[name]))
test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))
xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest = np.array(list(map(lambda t:t[0], test_data)))
ytest = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data = None
idf_tok = None
df_tok = None
tok2vec = None
city_tokenized = None
# train
print("Training on {:,} cities".format(len(xtrain)))
rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(xtrain, ytrain)
ytest_pred = rf.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)
xtrain = None
xtest = None
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell15_rf_feat6.pdf", bbox_inches='tight', dpi=300)
plt.legend()
# CELL 16
# running example
s = """Its population was 8,361,447 at the 2010 census whom 1,977,253 in the built-up
(or "metro") area made of Zhanggong and Nankang, and Ganxian largely being urbanized.
"""
cell7_tokens = set(map(lambda x: x[len("token="):], next(open("ch8_cell7_dev_tokens.tsv")).split("\t")))
print("cell6", cell6_tokenize(s))
print("cell7", cell7_tokenize(s))
print("cell7-filtered", list(filter(lambda tok: tok in cell7_tokens, cell7_tokenize(s))))
print("cell10", cell10_tokenize(s))
print("cell15", cell15_tokenize(s))
# memory check
import sys
l = list()
for v in dir():
l.append( (int(eval("sys.getsizeof({})".format(v))), v) )
for c, v in sorted(l, reverse=True)[:20]:
print("\t{:,}\t{}".format(c,v))