Chapter 8: Text

Accompanying code for the book The Art of Feature Engineering.

This notebook plus notebooks for the other chapters are available online at https://github.com/DrDub/artfeateng

MIT License

Copyright 2019 Pablo Duboue

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Limitations

  • Simple python intented for people coming from other languages that plan to use the ideas described in the book outside of Python.
  • Many of these techniques are available as library calls. They are spelled out as for teaching purposes.
  • Resource limitations:
    • At most one day of running time per notebook.
    • No GPU required.
    • Minimal dependencies.
    • At most 8Gb of RAM.
  • Due to resource limitations, these notebooks do not undergo as much hyperparameter tuning as necessary. This is a shortcoming of these case studies, keep it in mind if you want to follow a similar path with your experiments.
  • To help readers try variants of some cells in isolation, the cells are easily executable without having to re-run the whole notebook. As such, most cells read everything they need from disk and write all their results back into disk, which is unnecessary with normal notebooks. The code for each cell might look long and somewhat unusual. In a sense, each cell tries to be a separate Python program.
  • I dislike Pandas so these notebooks are Pandas-free, which might seem unusual to some.

Chapter 8: Case Study on Textual Data

In this chapter, we will cover an expansion of the WikiCities dataset with textual descriptions and its impact on the population prediction task introduced in Chapter 6.

Text is a domain that exemplifies:

  • Large number of correlated features
  • A domain where ordering issues are important
  • Variable length feature vectors

Among the methods exemplified in text we will see:

  • Feature selection: dimensionality reduction
  • Feature weighting: TF-IDF
  • Computable features: morphological features

WikiCities Text Dataset

The text data we will be using in this notebook are the Wikipedia pages for the different cities in the dataset. Text extraction from Wikipedia is a computationally intense task and better catered by specialized tools. In this case, I used the excellent software wikiextractor by Giuseppe Attardi and produces cities1000_wikitext.tsv.bz2, with one city per row and text lines separated by tab characters. That 43,909,804 words and over 270,902,780 characters (average of 558 words per document, 3,445 characters) . For some of the experiments using document structure, I also kept the original markup in file cities1000_wikiraw.tsv.bz2, with markup the total number of characters climbs above 730 million.

First, many Wikipedia pages contain the population information mentioned within the text. Not necessarily all of them, but many do. At the Exploratory Data Analysis stage we might want to get an idea of how many do. Even for the ones that do, however, it might be indicated in many different ways, including punctuation (2,152,111 instead of 2152111) but most probably rounded up and expressed intermixing digits with words (like "a little over 2 million"). In that sense, this task is representative of the NLP subfield of Information Extraction.

While NLP this decade has been overtaken by Deep Learning approaches, particularly using Neurolanguage Models, this particular task most probably can still profit from non-deep learning techniques as we are looking for a very small piece of evidence within a large amount of data.

Following Chapter 6, it is clear that bigger cities will have longer pages so text length will most probably be a great feature. As base features, we will use the ch6_cell32_dev_feat_conservative.tsv with its 98 features.

With such an amount of data, aggressive feature selection will be needed, but let us start with some EDA.

Exploratory Data Analyis

Let us start by assembling a simple data set with an extra feature (the text length) and see whether it helps better predict the population (Cell #1)

In [1]:
# CELL 1
import random
import bz2
import re
import math
from sklearn.svm import SVR
import numpy as np

# read page lengths
text_lengths = dict()
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
    for byteline in wikitext:
        cityline = byteline.decode("utf-8")
        tab = cityline.index('\t')
        name = cityline[:tab]
        text = cityline[tab:]
        text_lengths[name] = len(text)

# read base features
rand = random.Random(42)
train_data = list()
test_data  = list()
header = None
with open("ch8_cell1_dev_textlen.tsv", "w") as ch8:
    with open("ch6_cell32_dev_feat_conservative.tsv") as feats:
        header = next(feats)
        header = header.strip().split("\t")
        header.insert(-1, 'logtextlen')
        ch8.write("\t".join(header) + "\n")
        header.pop(0) # name
        header.pop() # population
        for line in feats:
            fields = line.strip().split("\t")
            if name not in text_lengths:
                raise Exception("City not found: " + name)
            fields.insert(-1, str(math.log(text_lengths[name], 10)))
            ch8.write("\t".join(fields) + "\n")
            logpop = float(fields[-1])
            name = fields[0]
            feats = list(map(float,fields[1:-1]))
            row = (feats, logpop, name)
            if rand.random() < 0.2:
                test_data.append(row) 
            else:
                train_data.append(row)

test_data  = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest  = np.array(list(map(lambda t:t[0], test_data)))
ytest  = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data  = None

# SVRs need scaling
xtrain_min = xtrain.min(axis=0); xtrain_max = xtrain.max(axis=0)
# some can be zero if the column is constant in training
xtrain_diff = xtrain_max - xtrain_min
for idx in range(len(xtrain_diff)):
    if xtrain_diff[idx] == 0.0:
        xtrain_diff[idx] = 1.0
xtrain_scaling = 1.0 / xtrain_diff
xtrain -= xtrain_min; xtrain *= xtrain_scaling

ytrain_min = ytrain.min(); ytrain_max = ytrain.max()
ytrain_scaling = 1.0 / (ytrain_max - ytrain_min)
ytrain -= ytrain_min; ytrain *= ytrain_scaling

xtest -= xtrain_min; xtest *= xtrain_scaling
ytest_orig = ytest.copy()
ytest -= ytrain_min; ytest *= ytrain_scaling

# train
print("Training on {:,} cities".format(len(xtrain)))

best_c       = 100.0
best_epsilon = 0.05
svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain, ytrain)
ytest_pred  = svr_rbf.predict(xtest)
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

xtrain = None
xtest  = None

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest_orig, label="actual",    color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell1_svr.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Training on 35,971 cities
RMSE 0.3434770677131145
Out[1]:
<matplotlib.legend.Legend at 0x7fcfdb668450>

The resulting RMSE of 0.3434 is an improvement of the one from Chapter 6, 0.3578, which is encouraging, but it is above using the full Chapter 6 graph information at 0.3298.

Let us look at ten random cities to see whether their text descriptions include the population explicitly (Cell #2). Notice I have used a regular Wikipedia dump, not a Cirrus dump. Wikipedia in recent years have moved to include tags expanded from the Wikidata project and therefore the exact population number might be absent, with a tag indicating the template engine to fetch the number at rendering time.

In [2]:
# CELL 2
PARAM_USE_SET_FOR_BOOK = True

cities_and_pop = list()
with open("ch8_cell1_dev_textlen.tsv") as feats:
    first = True
    for line in feats:
        if first:
            first = False
        else:
            fields = line.split('\t')
            cities_and_pop.append( (fields[0], round(10**float(fields[-1]))) )

rand = random.Random(42)

# stable set for book
cities = [ ('Century,_Florida', 1698), ('Isseksi', 2000), ('Volda', 8827), ('Cournonsec', 2149), 
          ('Cape_Neddick,_Maine', 2568), ('Zhlobin', 80200), ('Hangzhou', 9018000), ('Gnosall', 4736), 
          ('Scorbé-Clairvaux', 2412), ('Arizona_City,_Arizona>', 10475) ]

if PARAM_USE_SET_FOR_BOOK:
    cities = list(map(lambda x: ("<http://dbpedia.org/resource/" + x[0] + ">", x[1]), cities))
else:
    cities = set(rand.sample(sorted(cities_and_pop), 10))
    
to_print = set(map(lambda x:x[0], cities))
pops = { x[0]: x[1] for x in cities }

html = ''
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
    for byteline in wikitext:
        cityline = byteline.decode("utf-8")
        tab = cityline.index('\t')
        name = cityline[:tab]
        if name in to_print:
            text = cityline[tab:]
            text = text.replace('\t','<p>')
            html += "<h1>{}</h1><h2>Population: {}</h2>{}".format(name[1:-1], pops[name], text)
from IPython.display import HTML, display
display(HTML(html))

http://dbpedia.org/resource/Century,_Florida

Population: 1698

Century, Florida

Century, Florida Century is a town in Escambia County, Florida, United States. The population was 1,698 at the 2010 United States Census. It is part of the Pensacola–Ferry Pass–Brent Metropolitan Statistical Area. Century was founded in 1901 as a sawmill company town, and named after the fact the year 1901 was the first year of the 20th century. A post office has been in operation at Century since 1901. On February 15, 2016, the town was hit by an EF3 tornado, heavily damaging and destroying homes, and injuring three people. Century is located at (30.977648, -87.261500). According to the United States Census Bureau, the town has a total area of , of which is land and , or 3.69%, is water. Century is located in the Western Highlands of Florida. This physiographic province of the northern Gulf Coast region is made up of sand, silt, and clay hills. These highlands are deeply incised by creeks and rivers. Century is located on the western edge of the Escambia River floodplain. A small portion of the town (the eastern side) is within the floodplain itself. Most of the community, however, is located above the floodplain on level to gently sloping hillsides. Century's roadway network is highly irregular. It does not conform to the state of Florida's section, township and range survey system, for two reasons. The first is because Spanish land grants were issued along the Escambia River in the 16th and 17th centuries. These boundaries established a unique survey system that contorted east-to-west survey boundaries once Florida became a state and a state survey system was adopted, at which time previously existing survey systems were "grandfathered" in. The second reason for an irregular roadway and property boundary system is due to the community originally being built around the Louisville and Nashville Railroad (now the CSX railway). Automobile highways were eventually constructed, and closely paralleled the railway. A more modern highway (US 29) was constructed and moved many of the commercial operations west of the small original core of the community (now mostly located within the Alger-Sullivan Lumber Company Residential Historic District). U.S. Route 29 is used by residents of Escambia County to reach points north. Alabama State Route 113 leads north from the state line to Interstate 65 and provides the area with a route to Montgomery, Birmingham and Atlanta. From a southbound perspective, Century is en route between these major cities and the coastal beaches at Pensacola Beach and Perdido Key in Florida. Century is the western terminus of State Road 4, which leads east to the communities of Jay, Munson, Baker, and Milligan. Fresh water supplies are abundant, with water being withdrawn as groundwater from the local sand and gravel aquifer. In 1970, oil was discovered in the nearby community of Jay. Oil was also discovered near the town of Century, especially to its northeast. Oil became important to the local economy during the latter quarter of the 20th century. Gravel and sand is mined in open pits in and near Century. These natural mineral deposits are essential to supporting the construction industries in nearby Pensacola and Mobile, especially for use as aggregate materials in concrete. Timber and pulpwood are other valuable natural commodities of the area. Nearby papermills at Cantonment, Florida, and Brewton, Alabama, provide a market for cut pulpwood. Timber processing is conducted by another industry about ten miles south of Century. As of the census of 2000, there were 1,714 people, 680 households, and 448 families residing in the town. The population density was 522.5 inhabitants per square mile (201.8/km²). There were 800 housing units at an average density of 243.9 per square mile (94.2/km²). The racial makeup of the town was 39.67% White, 56.65% African American, 0.58% Native American, 0.64% Asian, 0.06% Pacific Islander, 0.35% from other races, and 2.04% from two or more races. Hispanic or Latino of any race were 1.63% of the population. There were 680 households out of which 29.0% had children under the age of 18 living with them, 36.5% were married couples living together, 25.9% had a female householder with no husband present, and 34.1% were non-families. 31.8% of all households were made up of individuals and 18.1% had someone living alone who was 65 years of age or older. The average household size was 2.52 and the average family size was 3.21. In the town the population was spread out with 30.0% under the age of 18, 6.9% from 18 to 24, 24.0% from 25 to 44, 21.9% from 45 to 64, and 17.2% who were 65 years of age or older. The median age was 37 years. For every 100 females there were 80.6 males. For every 100 females age 18 and over, there were 75.3 males. The median income for a household in the town was $20,703, and the median income for a family was $28,241. Males had a median income of $26,932 versus $17,390 for females. The per capita income for the town was $10,412. About 24.5% of families and 30.1% of the population were below the poverty line, including 41.4% of those under age 18 and 26.1% of those age 65 or over. A Florida prison known as Century Correctional Institution is the only major employer in the region. This facility has employs a full-time staff of 401, which is almost 25% of the entire population of Century. Residents of Century and the surrounding area in Escambia County are served by the Escambia County School District. Century is within the zones of the following schools:

http://dbpedia.org/resource/Cape_Neddick,_Maine

Population: 2568

Cape Neddick, Maine

Cape Neddick, Maine Cape Neddick is a census-designated place (CDP) in the town of York in York County, Maine, United States. The population was 2,568 at the 2010 census. It is part of the Portland–South Portland–Biddeford, Maine Metropolitan Statistical Area. Cape Neddick is located at (43.169023, -70.617341). The CDP as defined includes all of the physical peninsula known as Cape Neddick, plus all of the unincorporated community of York Beach, which consists of two beaches, one on either side of Cape Neddick. The northern limit of the CDP is the town of Ogunquit. The western boundary is unclear though it is generally thought that it includes Firetown and to Old Post Road (Maggy Nason Road).The southern boundary abuts the northern edge of the York Harbor CDP. The eastern edge of the CDP is the shoreline of the Atlantic Ocean. According to the United States Census Bureau, the CDP has a total area of , of which is land and of it, or 6.03%, is water. Cape Neddick Light, also known as Nubble Lighthouse, is the most distinctive feature of the community. Construction began in 1876 and cost $15,000. It was first illuminated on July 1, 1879. The lighthouse was originally red but has been painted white since 1902. The distinctive red house was also built in 1902. The tower stands tall. The lighthouse became automated in 1987. As of the census of 2000, there were 2,997 people, 1,340 households, and 897 families residing in the CDP. The population density was 801.5 people per square mile (309.4/km²). There were 3,424 housing units at an average density of 915.7/sq mi (353.5/km²). The racial makeup of the CDP was 98.26% White, 0.27% African American, 0.47% Asian, 0.10% Pacific Islander, 0.20% from other races, and 0.70% from two or more races. Hispanic or Latino of any race were 0.90% of the population. There were 1,340 households out of which 22.3% had children under the age of 18 living with them, 55.7% were married couples living together, 8.2% had a female householder with no husband present, and 33.0% were non-families. 27.2% of all households were made up of individuals and 11.2% had someone living alone who was 65 years of age or older. The average household size was 2.24 and the average family size was 2.70. In the CDP the population was spread out with 18.2% under the age of 18, 4.4% from 18 to 24, 23.9% from 25 to 44, 33.9% from 45 to 64, and 19.6% who were 65 years of age or older. The median age was 47 years. For every 100 females there were 88.4 males. For every 100 females age 18 and over, there were 86.7 males. The median income for a household in the CDP was $45,500, and the median income for a family was $52,796. Males had a median income of $42,386 versus $30,800 for females. The per capita income for the CDP was $33,788. About 2.2% of families and 5.6% of the population were below the poverty line, including 4.6% of those under age 18 and 5.2% of those age 65 or over. There are two listings on the National Register of Historic Places for Cape Neddick. One is St. Peter's By-The-Sea Protestant Episcopal Church, and the other is Cape Neddick Light just off the coast. Wriggley bridge adjacent to The Lobster Pound has a jump of 15-16' depending on high tide height. No more, definitely not 25'. Rene Cape Neddick Country Club offers golfing on an 18 hole course designed by Donald Ross Before 1655 Cape Neddick was inhabited by John Gooch, Peter Weare, Edward Wanton, Sylvester Stover and Thomas Wheelwright and their families.

http://dbpedia.org/resource/Hangzhou

Population: 9018000

Hangzhou

Hangzhou Hangzhou (), formerly romanized as Hangchow, is the capital and most populous city of Zhejiang Province in east China. It sits at the head of Hangzhou Bay, which separates Shanghai and Ningbo. Hangzhou grew to prominence as the southern terminus of the Grand Canal and has been one of the most renowned and prosperous cities in China for much of the last millennium, due in part to its beautiful natural scenery. The city's West Lake is its best-known attraction. Hangzhou is classified as a sub-provincial city and forms the core of the Hangzhou metropolitan area, the fourth-largest in China. During the 2010 Chinese census, the metropolitan area held 21.102 million people over an area of . Hangzhou prefecture had a registered population of 9,018,000 in 2015. In September 2015, Hangzhou was awarded the 2022 Asian Games. It will be the third Chinese city to play host to the Asian Games after Beijing 1990 and Guangzhou 2010. On November 16, 2015, paramount leader Xi Jinping announced that Hangzhou would host the eleventh G-20 summit on September 4–5, 2016. The celebrated neolithic culture of Hemudu is known to have inhabited Yuyao, south-east of Hangzhou, as far back as seven thousand years ago. It was during this time that rice was first cultivated in southeast China. Excavations have established that the jade-carving Liangzhu culture (named for its type site just northwest of Hangzhou) inhabited the area immediately around the present city around five thousand years ago. The first of Hangzhou's present neighborhoods to appear in written records was Yuhang, which probably preserves an old Baiyue name. Hangzhou was made the seat of the "zhou" (very roughly, "county") of Hang in , entitling it to a city wall which was constructed two years later. By a longstanding convention also seen in other cities like Guangzhou and Fuzhou, the city took on the name of the area it administered and became known as Hangzhou. Hangzhou was at the southern end of China's Grand Canal which extends to Beijing. The canal evolved over centuries but reached its full length by 609. In the Tang dynasty, Bai Juyi was appointed governor of Hangzhou. Already an accomplished and famous poet, his deeds at Hangzhou have led to his being praised as a great governor. He noticed that the farmland nearby depended on the water of West Lake, but due to the negligence of previous governors, the old dyke had collapsed, and the lake so dried out that the local farmers were suffering from severe drought. He ordered the construction of a stronger and taller dyke, with a dam to control the flow of water, thus providing water for irrigation and mitigating the drought problem. The livelihood of local people of Hangzhou improved over the following years. Bai Juyi used his leisure time to enjoy the beauty of West Lake, visiting it almost daily. He also ordered the construction of a causeway connecting Broken Bridge with Solitary Hill to allow walking, instead of requiring a boat. He then had willows and other trees planted along the dyke, making it a beautiful landmark. This causeway was later named "Bai Causeway", in his honor. It is listed as one of the Seven Ancient Capitals of China. It was first the capital of the Wuyue Kingdom from 907 to 978 during the Five Dynasties and Ten Kingdoms Period. Named Xifu at the time, it was one of the three great bastions of culture in southern China during the tenth century, along with Nanjing and Chengdu. Leaders of Wuyue were noted patrons of the arts, particularly of Buddhist temple architecture and artwork. Hangzhou also became a cosmopolitan center, drawing scholars from throughout China and conducting diplomacy with neighboring Chinese states, and also with Japan, Korea, and the Khitan Liao dynasty. In 1089, while another renowned poet Su Shi (Su Dongpo) was the city's governor, he used 200,000 workers to construct a long causeway across West Lake, which the Qianlong Emperor considered particularly attractive in the early morning of the spring time. The lake was once a lagoon tens of thousands of years ago. Silt then blocked the way to the sea and the lake was formed. A drill in the lake-bed in 1975 found the sediment of the sea, which confirmed its origin. Artificial preservation prevented the lake from evolving into a marshland. The Su Causeway built by Su Shi, and the Bai Causeway built by Bai Juyi, a Tang dynasty poet who was once the governor of Hangzhou, were both built out of mud dredged from the lake bottom. The lake is surrounded by hills on the northern and western sides. The Baochu Pagoda sits on the Baoshi Hill to the north of the lake. Arab merchants lived in Hangzhou during the Song dynasty, due to the fact that the oceangoing trade passages took precedence over land trade during this time. There were also Arabic inscriptions from the 13th century and 14th century. During the later period of the Yuan dynasty, Muslims were persecuted through the banning of their traditions, and they participated in revolts against the Mongols. The Fenghuangshi mosque was constructed by an Egyptian trader who moved to Hangzhou. Ibn Battuta is known to have visited the city of Hangzhou in 1345; he noted its charm and described how the city sat on a beautiful lake and was surrounded by gentle green hills. During his stay at Hangzhou, he was particularly impressed by the large number of well-crafted and well-painted Chinese wooden ships with colored sails and silk awnings in the canals. He attended a banquet held by Qurtai, the Yuan Mongol administrator of the city, who according to Ibn Battuta, was fond of the skills of local Chinese conjurers. Hangzhou was chosen as the new capital of the Southern Song dynasty in 1132, when most of northern China had been conquered by the Jurchens in the Jin–Song wars. The Song court had retreated south to the city in 1129 from its original capital in Kaifeng, after it was captured by the Jurchens in the Jingkang Incident of 1127. From Kaifeng they moved to Nanjing, modern Shangqiu, then to Yangzhou in 1128. The government of the Song intended it to be a temporary capital. However, over the decades Hangzhou grew into a major commercial and cultural center of the Song dynasty. It rose from a middling city of no special importance to one of the world's largest and most prosperous. Once the prospect of retaking northern China had diminished, government buildings in Hangzhou were extended and renovated to better befit its status as an imperial capital and not just a temporary one. The imperial palace in Hangzhou, modest in size, was expanded in 1133 with new roofed alleyways, and in 1148 with an extension of the palace walls. From the early 12th century until the Mongol invasion of 1276, Hangzhou remained the capital and was known as Lin'an (臨安). It served as the seat of the imperial government, a center of trade and entertainment, and the nexus of the main branches of the civil service. During that time the city was a gravitational center of Chinese civilization: what used to be considered "central China" in the north was taken by the Jin, an ethnic minority dynasty ruled by Jurchens. Numerous philosophers, politicians, and men of literature, including some of the most celebrated poets in Chinese history such as Su Shi, Lu You, and Xin Qiji came here to live and die. Hangzhou is also the birthplace and final resting place of the scientist Shen Kuo (1031–1095 AD), his tomb being located in the Yuhang district. During the Southern Song dynasty, commercial expansion, an influx of refugees from the conquered north, and the growth of the official and military establishments, led to a corresponding population increase and the city developed well outside its 9th-century ramparts. According to the "Encyclopædia Britannica", Hangzhou had a population of over 2 million at that time, while historian Jacques Gernet has estimated that the population of Hangzhou numbered well over one million by 1276. (Official Chinese census figures from the year 1270 listed some 186,330 families in residence and probably failed to count non-residents and soldiers.) It is believed that Hangzhou was the largest city in the world from 1180 to 1315 and from 1348 to 1358. Because of the large population and densely crowded (often multi-story) wooden buildings, Hangzhou was particularly vulnerable to fires. Major conflagrations destroyed large sections of the city in 1132, 1137, 1208, 1229, 1237, and 1275 while smaller fires occurred nearly every year. The 1237 fire alone was recorded to have destroyed 30,000 dwellings. To combat this threat, the government established an elaborate system for fighting fires, erected watchtowers, devised a system of lantern and flag signals to identify the source of the flames and direct the response, and charged more than 3,000 soldiers with the task of putting out fire. The city of Hangzhou was besieged and captured by the advancing Mongol armies of Kublai Khan in 1276, three years before the final collapse of the empire. The capital of the new Yuan Dynasty was established in the city of Dadu (Beijing). The Venetian merchant Marco Polo supposedly visited Hangzhou in the late 13th century. In his book, he records that the city was "greater than any in the world". He called the city Quinsai, a name that—like Odoric of Pordenone's Cansay—derived from its Southern Song nickname Xingzai, meaning "Temporary Residence". Marco Polo wrote of the city: "The number and wealth of the merchants, and the amount of goods that passed through their hands, was so enormous that no man could form a just estimate thereof." Polo may have exaggerated, describing the city as over one hundred miles in diameter (although if he had meant Chinese mile it would be smaller at 3/8 of the measurement in Italian mile), and had 12,000 stone bridges, although some argued that this may have been a mistake and exaggeration by a copyist who turned the "12 gates" of the city into "12,000 bridges". The renowned 14th-century Moroccan explorer Ibn Battuta said it was "the biggest city I have ever seen on the face of the earth." The city remained an important port until the middle of the Ming dynasty era, when its harbor slowly silted up. Under the Qing, it was the site of an imperial army garrison. In 1856 and 1860, the Taiping Heavenly Kingdom occupied Hangzhou and caused heavy damage to the city. Hangzhou was ruled by the Republic of China government under the Kuomintang from 1928 to 1949. On May 3, 1949, the People's Liberation Army entered Hangzhou and the city came under Communist control. After Deng Xiaoping's reformist policies began in 1978, Hangzhou took advantage of being situated in the Yangtze River Delta to bolster its development. It is now one of China's most prosperous major cities. Hangzhou is located in northwestern Zhejiang province, at the southern end of the Grand Canal of China, which runs to Beijing, in the south-central portion of the Yangtze River Delta. Its administrative area (sub-provincial city) extends west to the mountainous parts of Anhui province, and east to the coastal plain near Hangzhou Bay. The city center is built around the eastern and northern sides of the West Lake, just north of the Qiantang River. Hangzhou's climate is humid subtropical (Köppen "Cfa") with four distinctive seasons, characterised by long, very hot, humid summers and chilly, cloudy and drier winters (with occasional snow). The mean annual temperature is , with monthly daily averages ranging from in January to in July. The city receives an average annual rainfall of and is affected by the plum rains of the Asian monsoon in June. In late summer (August to September), Hangzhou suffers typhoon storms, but typhoons seldom strike it directly. Generally they make landfall along the southern coast of Zhejiang, and affect the area with strong winds and stormy rains. Extremes since 1951 have ranged from on 6 February 1969 up to on 9 August 2013; unofficial readings have reached , set on 29 December 1912 and 24 January 1916, up to , set on 10 August 1930. With monthly percent possible sunshine ranging from 30% in March to 51% in August, the city receives 1,709.4 hours of sunshine annually. The sub-provincial city of Hangzhou comprises 9 districts, 2 county-level cities, and 2 counties. The six central urban districts occupy and have 3,560,400 people. The three suburban districts occupy and have 3,399,300 people. Hangzhou's economy has rapidly developed since its opening up in 1992. It is an industrial city with many diverse sectors such as light industry, agriculture, and textiles. It is considered an important manufacturing base and logistics hub for coastal China. The 2001 GDP of Hangzhou was RMB ¥156.8 billion, which ranked second among all of the provincial capitals after Guangzhou. The city has more than tripled its GDP since then, increasing from RMB ¥156.8 billion in 2001 to RMB ¥1.0054 trillion in 2015 and GDP per capita increasing from US$3,025 to US$18,025. The city has developed many new industries, including medicine, information technology, heavy equipment, automotive components, household electrical appliances, electronics, telecommunication, fine chemicals, chemical fibre and food processing. Hangzhou is renowned for its historic relics and natural beauty. It is known as one of the most beautiful cities in China, also ranking as one of the most scenic cities. Although Hangzhou has been through many recent urban developments, it still retains its historical and cultural heritage. Today, tourism remains an important factor for Hangzhou's economy. One of Hangzhou's most popular sights is West Lake, a UNESCO World Heritage Site. The West Lake Cultural Landscape covers an area of and includes some of Hangzhou's most notable historic and scenic places. Adjacent to the lake is a scenic area which includes historical pagodas, cultural sites, as well as the natural beauty of the lake and hills, including Phoenix Mountain. There are two causeways across the lake. Other places of interest: In March 2013 The Hangzhou Tourism Commission started an online campaign via Facebook, the 'Modern Marco Polo' campaign. Over the next year nearly 26,000 participants applied from around the globe, in the hopes of becoming Hangzhou's first foreign tourism ambassador. In a press conference in Hangzhou on May 20, 2014, Liam Bates was announced as the successful winner. The 26-year-old won a €40,000 contract and is the first foreigner ever to be appointed by China's government in such an official role. In 1848, during the Qing dynasty, Hangzhou was described as the "stronghold" of Islam in China, the city containing several mosques with Arabic inscriptions. A Hui from Ningbo also told an Englishman that Hanzhou was the "stronghold" of Islam in Zhejiang province, containing multiple mosques, compared to his small congregation of around 30 families in Ningbo for his mosque. Within the city of Hangzhou are two notable mosques: the Great Mosque of Hangzhou and the Phoenix Mosque. As late as the latter part of the 16th and early 17th centuries, the city was an important center of Chinese Jewry, and may have been the original home of the better-known Kaifeng Jewish community. There was formerly a Jewish synagogue in Ningbo, as well as one in Hangzhou, but no traces of them are now discoverable, and the only Jews known to exist in China were in Kaifeng. Two of the Three Pillars of Chinese Catholicism were from Hangzhou. There was persecution of Christians in the early 21st century in the city. The native residents of Hangzhou, like those of Zhejiang and southern Jiangsu, speak Hangzhou dialect, which is a Wu dialect. However, Wu Chinese varies throughout the area where it is spoken, hence, Hangzhou's dialect differs from regions in southern Zhejiang and southern Jiangsu. As the official language defined by China's central government, Mandarin is the dominant spoken language. There are several museums located in Hangzhou with regional and national importance. China National Silk Museum (), located near the West Lake, is one of the first state-level museums in China and the largest silk museum in the world. China National Tea Museum () is a national museum with special subjects as tea and its culture. Zhejiang Provincial Museum () features collection of integrated human studies, exhibition and research with its over 100,000 collected cultural relics. Hangzhou's local cuisine is often considered to be representative of Zhejiang provincial cuisine, which is claimed as one of China's eight fundamental cuisines. The locally accepted consensus among Hangzhou's natives defines dishes prepared in this style to be "fresh, tender, soft, and smooth, with a mellow fragrance." Dishes such as Pian Er Chuan Noodles (), West Lake Vinegar Fish (), Dongpo Pork (), Longjing Shrimp (), Beggar's Chicken (), Steamed Rice and Pork Wrapped by Lotus Leaves(), Braised Bamboo Shoots (), Lotus Root Pudding () and Sister Song's Fish Soup () are some of the better-known examples of Hangzhou's regional cuisine. There are lots of theaters in Hangzhou showing performance of opera shows. Shaoxing opera, originated from Shengzhou, Zhejiang Province, is the second-largest opera form in China. Also, there are several big shows themed with the history and culture of Hangzhou like Impression West Lake and the Romance of Song Dynasty. Tea is an important part of Hangzhou's economy and culture. Hangzhou is best known for originating Longjing, a notable variety of green tea, the most notable type being Xi Hu Long Jing. Known as the best type of Long Jing tea, Xi Hu Long Jing is grown in Longjing village near Xi Hu in Hangzhou, hence its name. The local government of Hangzhou heavily invests in promoting tourism and the arts, with emphasis placed upon silk production, umbrellas, and Chinese hand-held folding fans. Hangzhou is served by the Hangzhou Xiaoshan International Airport, which provides direct service to many international destinations such as Thailand, Japan, South Korea, Malaysia, Vietnam, Singapore, Taiwan, Netherlands, Qatar, and the United States. Regional routes reach Hong Kong and Macau. It has an extensive domestic route network within the PRC and is consistently ranked top 10 in passenger traffic among Chinese airports. Hangzhou Xiaoshan International Airport has two terminals, Terminal A and Terminal B. The smaller Terminal A serves all international and regional flights while the larger Terminal B solely handles domestic traffic. The airport is located just outside the city in the Xiaoshan District with direct bus service linking the airport with Downtown Hangzhou. The ambitious expansion project will see the addition of a second runway and a third terminal which will dramatically increase capacity of the fast-growing airport that serves as a secondary hub of Air China. A new elevated airport express highway is under construction on top of the existing highway between the airport and downtown Hangzhou. The second phase of Hangzhou Metro Line 1 has a planned extension to the airport. Hangzhou sits on the intersecting point of some of the busiest rail corridors in China. The city's main station is Hangzhou East Railway Station (colloquially "East Station" ). It is one of the biggest rail traffic hubs in China, consisting of 15 platforms that house the High Speed CRH service to Shanghai, Nanjing, Changsha, Ningbo, and beyond. The subway station beneath the rail complex building is a stop along the Hangzhou Metro Line 1 and Line 4. There are frequent departures for Shanghai with approximately 20-minute headways from 6:00 to 21:00. Non-stop CRH high-speed service between Hangzhou and Shanghai takes 50 minutes and leaves every hour (excluding a few early morning/late night departures) from both directions. Other CRH high-speed trains that stop at one or more stations along the route complete the trip in 59 to 75 minutes. Most other major cities in China can also be reached by direct train service from Hangzhou. The Hangzhou Railway Station (colloquially the "City Station" ) was closed for renovation in mid 2013 but has recently opened again. Direct trains link Hangzhou with more than 50 main cities, including 12 daily services to Beijing and more than 100 daily services to Shanghai; they reach as far as Ürümqi. The China Railway High-Speed service inaugurated on October 26, 2010. The service is operated by the CRH 380A(L), CRH 380B(L) and CRH380CL train sets which travel at a maximum speed of , shortening the duration of the trip to only 45 minutes. The construction of the Shanghai–Hangzhou Maglev Train Line has been debated for several years. On August 18, 2008, Beijing authorities gave the project the go-ahead to start construction in 2010. Transrapid has been contracted to construct the line; however, , construction has not yet started. Central (to the east of the city centre, taking the place of the former east station), north, south, and west long-distance bus stations offer frequent coach service to nearby cities/towns within Zhejiang province, as well as surrounding provinces. Hangzhou has an efficient public transportation network, consisting of a modern fleet of regular diesel bus, trolley bus, hybrid diesel-electric bus and taxi. The first subway line entered into service in late 2012. Hangzhou is known for its extensive Bus Rapid Transit network expanding from downtown to many suburban areas through dedicated bus lanes on some of the busiest streets in the city. Bicycles and electric scooters are very popular, and major streets have dedicated bike lanes throughout the city. Hangzhou has an extensive free public bike rental system, the Hangzhou Public Bicycle system. Taxis are also popular in the city, with the newest line of Hyundai Sonatas and Volkswagen Passats, and tight regulations. In early 2011, 30 electric taxis were deployed in Hangzhou; 15 were Zotye Langyues and the other 15 were Haima Freemas. In April, however, one Zoyte Langyue caught fire, and all of the electric taxis were taken off the roads later that day. The city still intends to have a fleet of 200 electric taxis by the end of 2011. In 2014, a large number of new electric taxis produced by Xihu-BYD (Xihu (westlake) is a local company which is famous for television it produced in the past) were deployed. The Hangzhou Metro began construction in March 2006, and the first line opened on November 24, 2012. Line 1 connects downtown Hangzhou with suburban areas of the city from Xianghu to Wenze Road and Linping. By June 2015, the southeast part of Line 2 (starts in Xiaoshan District, ends to the south of the city centre) and a short part of Line 4 (fewer than 10 stations, connecting Line 1 & Line 2) were completed. The system is expected to have 10 lines upon completion; most lines are still under construction. The extensions of Line 2 (Xihu District) and Line 4 (east of Bingjiang) are expected to be finished in 2016. Hangzhou has a large student population with many higher education institutions based in the city. Public universities include Zhejiang University, Zhejiang University of Technology, and Hangzhou Normal University etc. Xiasha, located near the east end of the city, and Xiaoheshan, located near the west end of the city, are college towns with a cluster of several universities and colleges. "Note: Institutions without full-time bachelor programs are not listed." The most famous high schools in Hangzhou are: Hangzhou International School and the Hangzhou Japanese School (杭州日本人学校) (nihonjin gakko) serve the local expat population in Hangzhou. Hangzhou is twinned with: Fishers, Indiana is in the exploration process of becoming sister cities with Hangzhou. A common Chinese saying about Hangzhou and Suzhou is: This phrase has a similar meaning to the English phrases "Heaven on Earth". Marco Polo in his accounts described Suzhou as "city of the earth" while Hangzhou is "city of the Heaven". The city present itself as "Paradise on Earth" during the G20 summit held in the city in 2016. Another popular saying about Hangzhou is: The meaning here lies in the fact that Suzhou was renowned for its beautiful and highly civilized and educated citizens, Hangzhou for its scenery, Guangzhou for its food, and Liuzhou (of Guangxi) for its wooden coffins which supposedly halted the decay of the body (likely made from the camphor tree).

http://dbpedia.org/resource/Volda

Population: 8827

Volda

Volda Volda is a municipality in Møre og Romsdal county, Norway. It is part of the Sunnmøre region. The administrative centre is the village of Volda. Other villages in the municipality include Dravlaus, Folkestad, Fyrde, Lauvstad, and Straumshamn. The municipality is located about south of the city of Ålesund. The municipality of "Volden" was established on 1 January 1838 (see formannskapsdistrikt). The original municipality was the same as the parish (prestegjeld) of Volden, including the sub-parishes of Ørsta and Dalsfjord. On 1 August 1883, the sub-parish of Ørsta was separated from Volden to form a new municipality of its own. This left Volden with 3,485 residents. On 1 January 1893, the Ytrestølen farm in Ørsta municipality (population: 13) was transferred to Volden municipality. In 1918, the name was changed from "Volden" to "Volda". On 1 July 1924, the sub-parish of Dalsfjord was separated from Volda to become a municipality of its own. This left Volda with 4,715 residents. On 1 January 1964, the municipalities of Dalsfjord and Volda were merged back together. The new Volda municipality had 7,207 residents. The municipality is named after the Voldsfjorden (Old Norse: "Vǫld"). The name is probably derived from an old word meaning "wave". (Compare with the German: "Welle" which means "wave".) Before 1918, the name was written "Volden". The coat-of-arms is from modern times. They were granted on 19 June 1987. The arms show a silver-colored tip of a fountain pen on a blue background. This is a symbol for the long history of education in Volda. The Church of Norway has four parishes "(sokn)" within the municipality of Volda. It is part of the Søre Sunnmøre deanery in the Diocese of Møre. Volda's main geographical feature is the Voldsfjorden which branches off into the Austefjorden, Kilsfjorden, and Dalsfjorden. It is also mountainous, particularly southeast of the fjords, with the Sunnmørsalpene mountains surrounding the region. The tall mountain Eidskyrkja is located in the southeastern part of the municipality. Volda is bordered by Vanylven Municipality to the south-west/west, the municipalities of Herøy and Ulstein (only by sea) to the west, and Ørsta Municipality to the north and east. To the south it is adjacent to the municipalities of Hornindal and Eid in Sogn og Fjordane county. The dominant centre, both in terms of population and administration, is the village of Volda, in the northernmost part of the municipality. Other population concentrations include Mork, Ekset, Folkestad, Fyrde, Steinsvika, Lauvstad, Bjørkedal, and Straumshamn. Volda is primarily known for strong cultural heritage and academic traditions. A private library at Egset, the first rural of its kind in Norway, is said to have inspired the young Ivar Aasen in the 19th century. Martin Ulvestad, Norwegian–American author who published an English-Danish-Norwegian dictionary in 1895, ("Engelsk-Dansk-Norsk Ordbog med fuldstændig Udtalebetegnelse") was born in Volda. The "Norsk Landboeblad" newspaper was based in Volda in the 1800s. "Volda landsgymnas" (established 1910) was the first Norwegian secondary school outside a major city. Among the most important institutions today is the Volda University College. Volda University College is one of 25 university colleges in Norway. Volda University College enrolls about 3,000 students and specializes in education of teachers, animators, and journalists. There is a thriving creative community in the town, with several animation companies, as well as the Norsk Animasjonsentrum/Norwegian Animation Centre and a yearly animation festival, run in cooperation with Volda University College. Volda also hosts a national documentary film festival as well as an annual student festival. The festival, Den Norske Dokumentarfilmfestivalen is usually held in late April. The national ski festival X2 is also held in Volda during April every year. The Volda TI sports club includes a Third Division association football team that competes in Volda. As a logical consequence of the huge influx of students, as well as a county hospital, public services are by far the most dominant sector, representing almost 50% of economic life in Volda. Industry and agriculture are also prevalent. Bjørkedalen is noted for its tradition in building wooden boats. Volda and its environs are featured prominently in the film "Troll Hunter" (2010). The Ørsta-Volda Airport, Hovden is located in neighbouring Ørsta Municipality, just north of the village of Volda. The European route E39 highway passes north through the municipality on its way to the city of Ålesund. As noted, the municipality is criss-crossed by fjords; therefore, both Lauvstad and Folkestad are linked to the population centre Volda by ferry. In February 2008, the underwater Eiksund Tunnel connected the municipalities of Ulstein, Hareid, Herøy, and Sande to Ørsta and Volda. The tunnel is the deepest undersea tunnel in the world. The new Kviven Tunnel was completed in 2012, connecting Fyrde in eastern Volda to the village of Grodås in Hornindal Municipality to the south (in Sogn og Fjordane county).

http://dbpedia.org/resource/Gnosall

Population: 4736

Gnosall

Gnosall Gnosall is a village and civil parish in the Borough of Stafford, Staffordshire, England, with a population of 4,736 across 2,048 households (2011 census). It lies on the A518, approximately halfway between the towns of Newport (in Shropshire) and the county town of Staffordshire, Stafford. Gnosall Heath lies immediately south-west of the main village, joined by Station Road and separated by Doley Brook. Other nearby villages include Woodseaves, Knightley, Cowley, Ranton, Church Eaton, Bromstead Heath, Moreton and Haughton. The village was mentioned in the Domesday Book, in which it was named "Geneshale". It is listed there as having a population of 12 households. The Stafford to Shrewsbury railway line once ran through the village. Gnosall's railway station opened on and closed on . The line was built by the Shropshire Union Railways and Canal Company, which also managed the Shropshire Union Canal which runs through the village. A footpath, the Way for the Millennium, now follows its route. Landmarks of interest include: There are also several old, privately owned, buildings such as the building on the High Street that was previously the Duke's Head, a public house. With a thatched roof, and herring-bone brick pattern between faded, unpainted wooden beams, it is generally regarded as one of the most picturesque scenes in the village, certainly on the High Street. The large primary school was previously Heron Brook High School, but is now St. Lawrence CE (C) Primary School. It was originally designed to look attractive from the railway that passes close by it; however the only people who see its intended front now are walkers, staff and pupils. Gnosall is fairly self-contained in terms of shops and amenities, with its own fire station, supermarket, doctor's surgery, dental practice, two fuel stations, police station, cricket club, take aways, pubs, post office and historic high street. Many of the village's ancient traditions are still honoured today, notably the carnival, where children dress up in themed costumes, and a parade complete with custom made floats and a brass band that runs to the St Lawrence School field from the Royal Oak, another pub. A large health centre was completed in 2006 at the opposite end of Gnosall from the old doctor's surgery by the fire station; tribute to the rapid increase in population of recent years. The village has a community first responder group, a charity consisting of trained local people who provide emergency cover on behalf of West Midlands Ambulance Service in response to 999 calls and administer basic life support, oxygen therapy, defibrillation and first aid whilst an ambulance is "en route". The village's newspaper; "GPN" (Gnosall Parish News), is produced and sold in the village, and serves as a local advertiser of services and events, as well as publishing articles of interest to the local community. Despite there being controversy over the legality, fishing is popular and fruitful on the canal. The Rev. Adam Blakeman, the Puritan minister who founded the early American town of Stratford, Connecticut, was born in Gnosall in 1596.

http://dbpedia.org/resource/Zhlobin

Population: 80200

Zhlobin

Zhlobin Zhlobin (; ) is a city in the Zhlobin District of Gomel Region of Belarus, on the Dnieper river. As of 2012, the population is 80.200. The city is notable for being the location where steelmaker BMZ was established. BMZ is one of the largest companies in Belarus, and an important producer in the worldwide markets of steel wires and cord. The company is the main sustainer of the town's economy. In 1939, Jews formed 19% of the town's population. During WWII, Germans kept them imprisoned in 2 different ghettos, where they suffered from starvation, diseases and abuses. On April 12, 1942, 1,200 Jews of the ghettos were murdered.

http://dbpedia.org/resource/Cournonsec

Population: 2149

Cournonsec

Cournonsec Cournonsec is a commune in the Hérault department in southern France.

http://dbpedia.org/resource/Scorbé-Clairvaux

Population: 2412

Scorbé-Clairvaux

Scorbé-Clairvaux Scorbé-Clairvaux is a commune in the Vienne department in the Poitou-Charentes region in western France.

http://dbpedia.org/resource/Isseksi

Population: 2000

Isseksi

Isseksi Isseksi is a small town and rural commune in Azilal Province of the Tadla-Azilal region of Morocco. At the time of the 2004 census, the commune had a total population of 2000 people living in 310 households.

From here, most pages mention the actual number, albeit with punctuation.

CityPopText
Arizona_City,_ArizonaThe population was 10,475 at the 2010 census.
Century,_FloridaThe population was 1,698 at the 2010 United States Census.
Cape_Neddick,_MaineThe population was 2,568 at the 2010 census.
HangzhouHangzhou prefecture had a registered population of 9,018,000 in 2015.
Volda8827The new Volda municipality had 7,207 residents.
GnosallGnosall Gnosall is a village and civil parish in the Borough of Stafford, Staffordshire, England, with a population of 4,736 across 2,048 households (2011 census).
ZhlobinAs of 2012, the population is 80.200.
Cournonsec2149
Scorbé-Clairvaux2412
IsseksiAt the time of the 2004 census, the commune had a total population of 2000 people living in 310 households.

So 8 out of 10 mention the population, with one case of a different population (7,207 vs. 8,827) and another with the wrong punctuation (80.200 instead of 80,200). Clearly there is value on the textual data. Only one case it has the number verbatim (without any punctuation).

Also, note that Volda is town with plenty of text and a rich history. The page itself describes its population changes over the years.

Let us see if these percentages carry on to the whole dataset (Cell #3).

In [3]:
# CELL 3
import bz2

cities_and_pop = dict()
with open("ch6_cell32_dev_feat_conservative.tsv") as feats:
    first = True
    for line in feats:
        if first:
            first = False
        else:
            fields = line.split('\t')
            cities_and_pop[ fields[0] ] = round(10**float(fields[-1]))

found_verbatim    = 0
found_with_commas = 0
found_with_dots   = 0
total = 0
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
    for byteline in wikitext:
        cityline = byteline.decode("utf-8")
        tab = cityline.index('\t')
        name = cityline[:tab]
        text = cityline[tab:]
        if name in cities_and_pop:
            total += 1
            pop = cities_and_pop[name]
            pop_verbatim = str(pop)
            if pop_verbatim in text:
                found_verbatim += 1
            else:
                pop_commas = "{:,}".format(pop)
                if pop_commas in text:
                    found_with_commas += 1
                else:
                    pop_dots = pop_commas.replace(",",".")
                    if pop_dots in text:
                        found_with_dots += 1

print("Total cities:      {:,}".format(total))
print("Found verbatim:    {:,} ({:%})".format(found_verbatim, found_verbatim * 1.0 / total))
print("Found with commas: {:,} ({:%})".format(found_with_commas, found_with_commas * 1.0 / total))
print("Found with dots:   {:,} ({:%})".format(found_with_dots, found_with_dots * 1.0 / total))
found_either = found_verbatim + found_with_commas + found_with_dots
print("Found either:      {:,} ({:%})".format(found_either, found_either * 1.0 / total))
Total cities:      44,959
Found verbatim:    1,379 (3.067239%)
Found with commas: 22,647 (50.372562%)
Found with dots:   36 (0.080073%)
Found either:      24,062 (53.519874%)

Therefore, half the cities contain their population verbatim in the page. Using rule-based information extraction techniques (regular expressions and the like), for example using the Rule-based Text Annotation system would work. We will try more automated techniques based which might also apply for the other cities.

A question is what type of model to use. This might be a good moment to move away from SVRs, as they could have trouble with large number of features and their strong tendency against overfitting might fail when any of those features contain the target value, so let us see how it behaves with a target leak. (Cell #4).

In [4]:
# CELL 4
import random
import re
import math
from sklearn.svm import SVR
import numpy as np

# read base features
rand = random.Random(42)
train_data = list()
test_data  = list()
header = None
with open("ch8_cell1_dev_textlen.tsv") as feats:
    header = next(feats)
    header = header.strip().split("\t")
    header.pop(0) # name
    for line in feats:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        feats = list(map(float,fields[1:])) # keep pop
        row = (feats, logpop, name)
        if rand.random() < 0.2:
            test_data.append(row) 
        else:
            train_data.append(row)

test_data  = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest  = np.array(list(map(lambda t:t[0], test_data)))
ytest  = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data  = None

# SVRs need scaling
xtrain_min = xtrain.min(axis=0); xtrain_max = xtrain.max(axis=0)
# some can be zero if the column is constant in training
xtrain_diff = xtrain_max - xtrain_min
for idx in range(len(xtrain_diff)):
    if xtrain_diff[idx] == 0.0:
        xtrain_diff[idx] = 1.0
xtrain_scaling = 1.0 / xtrain_diff
xtrain -= xtrain_min; xtrain *= xtrain_scaling

ytrain_min = ytrain.min(); ytrain_max = ytrain.max()
ytrain_scaling = 1.0 / (ytrain_max - ytrain_min)
ytrain -= ytrain_min; ytrain *= ytrain_scaling

xtest -= xtrain_min; xtest *= xtrain_scaling
ytest_orig = ytest.copy()
ytest -= ytrain_min; ytest *= ytrain_scaling

# train
print("Training on {:,} cities".format(len(xtrain)))

best_c       = 100.0
best_epsilon = 0.05
svr_rbf = SVR(epsilon=best_epsilon, C=best_c, gamma='auto')
svr_rbf.fit(xtrain, ytrain)
ytest_pred  = svr_rbf.predict(xtest)
ytest_pred *= 1.0/ytrain_scaling
ytest_pred += ytrain_min
RMSE = math.sqrt(sum((ytest_orig - ytest_pred)**2) / len(ytest))
print("RMSE with target leak", RMSE)

xtrain = None
xtest  = None

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest_orig, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell4_svr.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Training on 35,971 cities
RMSE with target leak 0.10537655348694651
Out[4]:
<matplotlib.legend.Legend at 0x7fcfdeaa7610>

That looks actually very nice, sadly, when adding more features the training times get prohibitively long, so I moved to a Random Forest Regressor (Cell #5).

In [5]:
# CELL 5
import re
import random
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np


# read base features
rand = random.Random(42)
header = None
train_data = list()
test_data  = list()
with open("ch8_cell1_dev_textlen.tsv") as f:
    header = next(f)
    header = header.strip().split("\t")
    header.pop(0) # name
    header.pop() # population
    for line in f:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        feats = list(map(float,fields[1:-1]))
        row = (feats, logpop, name) 
        if rand.random() < 0.2:
            test_data.append(row) 
        else:
            train_data.append(row)

test_data = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest  = np.array(list(map(lambda t:t[0], test_data)))
ytest  = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data  = None

# train
print("Training on {:,} cities".format(len(xtrain)))

rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(xtrain, ytrain)
ytest_pred = rf.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

xtrain = None
xtest  = None

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest,      label="actual",    color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell5_rf.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Training on 35,971 cities
RMSE 0.3547396128278879
Out[5]:
<matplotlib.legend.Legend at 0x7fcfd95817d0>

It produces worse performance than SVR but it trains much faster so it will do. We can now proceed to our first featurization, where we will see the documents as bags of words.

First Featurization: Numbers-only

The bag of words approach represents each document as a fixed size vector with size equals to the whole vocabulary (as computed on training).

By far the most important function in a bag-of-words approach is the tokenization function. Good tokenization is key and it is language and subdomain specific (e.g., journalistic text vs. Twitter).

In our case, tokenizing numbers is key. In other domains it is important to find different variations of words (what is known as "morphology") but this problem presents a simpler case, just with numbers.

For the purpose of our regression problem, a difference between 12,001,112 and 12,001,442 constitutes a nuisance variation and needs to be addressed. We can replace each number with a pseudo-word, indicating, for example, how many digits the number has (think "TOKNUM1DIGIT", "TOKNUM2DIGIT", etc). That will produce about 10 tokens for all the population numbers we have. This might not be enough, instead we might want to distinguish the first digit of the numbers (1TOKNUM3DIGIT represents 1,000 to 1,999; 2TOKNUM3DIGIT represent 2,000 to 2,999 and so one), that will create about 90 tokens, which might be too many.

Instead, we can use the discretization data from Cell #27 in Chapter 6 and transform each number-like token into a TOKNUMSEG1 for 32 segments (Cell #5 below). To avoid having too many features, we are going to expand the feature vector to include only binary features indicating whether these features appear.

Word classes vs. word tokens

When operating with documents and vocabularies, it is important to distinguish the vocabulary size vs. the total document sizes. Both are measured in "words" but the term "word" means different things in each case. Therefore, in NLP we use the terms "word types" to refers to dictionary entries and "word tokens" to refer to document entries. You can think of the word types as a class in object oriented programming and a word token as an instance of the class.

We can now assemble the baseline system, where we are using BoW over the whole documents in the trainset. Because the vocabulary is fixed in the trainset, there will be many words missing in the devset. That is when smoothing techniques (like Good-Turing's smoothing) come handy.

In [6]:
# CELL 6
import re
import pickle
import random
import bz2
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np

with open("ch6_cell27_splits.pk", "rb") as pkl:
    segments_at = pickle.load(pkl)

boundaries = list(map(lambda x:( int(round(10**x['min'])), 
                            int(round(10**x['val'])), 
                            int(round(10**x['max'])) ), segments_at[5]))
                
NUM_RE = re.compile('\d?\d?\d?(,?\d{3})+') # at least 3 digits
def cell6_tokenize(text):
    tokens = list(filter(lambda x:len(x)>0, 
                         re.sub('\s+',' ', re.sub('[^A-z,0-9]', ' ', text)).split(' ')))
    result = list()
    for tok in tokens:
        if tok[-1] in set([".", ",", "?", "!"]):
            tok = tok[:-1]
        if NUM_RE.fullmatch(tok):
            num = int(tok.replace(",",""))
            if num < boundaries[0][0]:
                pass # too small
            elif num > boundaries[-1][2]:
                pass # too big
            else:
                found = False
                for idx, seg in enumerate(boundaries[1:]):
                    if num < seg[0]:
                        result.append("TOKNUMSEG" + str(idx))
                        found = True
                        break
                if not found:
                    result.append("TOKNUMSEG" + str(len(boundaries) - 1))
    return result

# read base features
rand = random.Random(42)
all_data = list()
city_to_all_data = dict()
header = None
with open("ch8_cell1_dev_textlen.tsv") as f:
    header = next(f)
    header = header.strip().split("\t")
    header.pop(0) # name
    header.pop() # population
    for line in f:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        feats = list(map(float,fields[1:-1]))
        city_to_all_data[name] = len(all_data)
        all_data.append( (feats, logpop, name) )
                
# add text features
tok_to_col = dict()
for idx, segs in enumerate(boundaries):
    header.append("TOKNUMSEG{}-{}-{}".format(idx, segs[0], segs[-1]))
    tok_to_col["TOKNUMSEG{}".format(idx)] = idx
    
remaining = set(map(lambda x:x[-1], all_data))
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
    for byteline in wikitext:
        cityline = byteline.decode("utf-8")
        tab = cityline.index('\t')
        name = cityline[:tab]
        if name in remaining:
            remaining.remove(name)
            extra_feats = [0.0] * len(boundaries)
            text = cityline[tab:]
            for numtoken in cell6_tokenize(text):
                extra_feats[tok_to_col[numtoken]] = 1.0
            all_data[city_to_all_data[name]][0].extend(extra_feats)
            
for name in remaining:
    extra_feats = [0.0] * len(boundaries)
    all_data[city_to_all_data[name]][0].extend(extra_feats)

with open("ch8_cell6_dev_feat1.tsv", "w") as feats:
    extheader = header.copy()
    extheader.insert(0, 'name')
    extheader.append('logpop')
    feats.write("\t".join(extheader) + "\n")
    for row in all_data:
        feats.write("{}\t{}\t{}\n".format(row[-1], "\t".join(map(str,row[0])), row[1]))
    
# split
train_data = list()
test_data  = list()
for row in all_data:
    if rand.random() < 0.2:
        test_data.append(row) 
    else:
        train_data.append(row)

test_data  = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest  = np.array(list(map(lambda t:t[0], test_data)))
ytest  = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data  = None

# train
print("Training on {:,} cities".format(len(xtrain)))

rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(xtrain, ytrain)
ytest_pred = rf.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

xtrain = None
xtest  = None

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest,      label="actual",   color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell6_rf_feat1.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Training on 35,971 cities
RMSE 0.3437545061225639
Out[6]:
<matplotlib.legend.Legend at 0x7fcfd4c4db50>

At 0.3437, that worked very well for adding only 32 new features, but we are still not at the level of the SVR (a SVR on this dataset takes 2h to train, with only 130 features and produce a RMSE of 0.3216).

Second Featurization: Bag-of-words

Let's try to add some more words to build a BoW representation, to avoid a large feature set, let's take the top 1000 words filtered by M.I. (Cell #7)

In [7]:
# CELL 7
import re
import pickle
import random
import bz2
import math
import numpy as np

with open("ch6_cell27_splits.pk", "rb") as pkl:
    segments_at = pickle.load(pkl)

boundaries = list(map(lambda x:( int(round(10**x['min'])), 
                            int(round(10**x['val'])), 
                            int(round(10**x['max'])) ), segments_at[5]))
                
NUM_RE = re.compile('\d?\d?\d?(,?\d{3})+') # at least 3 digits
def cell7_tokenize(text):
    tokens = list(filter(lambda x:len(x)>0, 
                         re.sub('\s+',' ', re.sub('[^A-z,0-9]', ' ', text)).split(' ')))
    result = list()
    for tok in tokens:
        if len(tok) > 1 and tok[-1] in set([".", ",", "?", "!", "\"", "'"]):
            tok = tok[:-1]
        if NUM_RE.fullmatch(tok):
            num = int(tok.replace(",",""))
            if num < boundaries[0][0]:
                result.append("TOKNUMSMALL")
            elif num > boundaries[-1][2]:
                result.append("TOKNUMBIG")
            else:
                found = False
                for idx, seg in enumerate(boundaries[1:]):
                    if num < seg[0]:
                        result.append("TOKNUMSEG" + str(idx))
                        found = True
                        break
                if not found:
                    result.append("TOKNUMSEG" + str(len(boundaries) - 1))
        else:
            result.append(tok.lower())
    return result

# read base features
rand = random.Random(42)
city_pop = dict()
with open("ch8_cell1_dev_textlen.tsv") as f:
    header = next(f)
    for line in f:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        city_pop[name] = logpop
cities = sorted(list(city_pop.keys()))
        
# vocabulary
all_vocab     = list()
vocab_to_idx  = dict()
city_tok_idxs = dict()

remaining = set(city_pop.keys())
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
    for byteline in wikitext:
        cityline = byteline.decode("utf-8")
        tab = cityline.index('\t')
        name = cityline[:tab]
        if name in remaining:
            if (len(cities) - len(remaining)) % int(len(cities) / 10) == 0:
                print("Tokenizing {:>5} out of {:>5} cities, city \"{}\""
                      .format((len(cities) - len(remaining)), len(cities), name))
            remaining.remove(name)
            text = cityline[tab:]
            toks = set()
            for token in cell7_tokenize(text):
                idx = vocab_to_idx.get(token, None)
                if idx is None:
                    idx = len(all_vocab)
                    all_vocab.append(token)
                    vocab_to_idx[token] = idx
                toks.add(idx)
            city_tok_idxs[name] = sorted(list(toks))

for name in remaining:
    city_tok_idxs[name] = list()
    
print("Total vocabulary: {:,}".format(len(all_vocab)))

# drop tokens that appear in less than 200 documents
tok_docs = list()
for _ in range(len(all_vocab)):
    tok_docs.append([])
for doc_idx, name in enumerate(cities):
    tok_idxs = city_tok_idxs[name]
    for tok_idx in tok_idxs:
        tok_docs[tok_idx].append(doc_idx)
city_tok_idxs = None

threshold = 200
reduced_vocab = list()
for tok_idx in range(len(all_vocab)):
    if len(tok_docs[tok_idx]) >= threshold:
        reduced_vocab.append(tok_idx)
        
print("Reduced vocabulary: {:,} (reduction {:%})"
      .format(len(reduced_vocab), (len(all_vocab) - len(reduced_vocab)) / len(all_vocab)))    

ydata = np.array(list(map(lambda c:city_pop[c], cities)))

def cell7_adjudicate(data, segments):
    result = list()
    for val in data:
        idx = None
        if val < segments[0]['min']:
            idx = 0
        elif val > segments[-1]['max']:
            idx = len(segments) - 1
        else:
            for idx2, segment in enumerate(segments):
                if segment['min'] <= val and \
                    (idx2 == len(segments)-1 or val < segments[idx2+1]['min']):
                    idx = idx2
                    break
        result.append(idx)
    return np.array(result)

ydata = cell7_adjudicate(ydata, segments_at[2])

feature_utility = list()

xdata = np.zeros( ydata.shape )
for pos, tok_idx in enumerate(reduced_vocab):
    verbose = False
    if pos % int(len(reduced_vocab) / 100) == 0:
        print("Computing M.I. for {:>6} out of {:>6} tokens, token \"{}\""
              .format(pos, len(reduced_vocab), all_vocab[tok_idx]))
        #verbose = True

    xdata[:] = 0
    for idx in tok_docs[tok_idx]:
        xdata[idx] = 1.0

    # compute confusion table
    table = dict()
    for row in range(xdata.shape[0]):
        feat_val = int(xdata[row])
        target_val = int(ydata[row])
        if feat_val not in table:
            table[feat_val] = dict()
        table[feat_val][target_val] = table[feat_val].get(target_val, 0) + 1

    feats = set()
    for row in table.values():
        feats.update(row.keys())
    cols = { val: sum(map(lambda x:x.get(val,0), table.values())) for val in feats }
    full_table = sum(cols.values())
    
    if verbose:
        print("\tTable:\n\t{}\n\tfull_table: {}\n\tCols: {}"
              .format(table, full_table, cols))
    
    best_utility = None
    for feat_val in table.keys():
        for target_val in table[feat_val].keys():
            # binarize
            n11 = table[feat_val][target_val]
            if n11 < 5:
                if verbose:
                    print("\tFor feat_val={}, target_val={}, n11={}, skipping"
                        .format(feat_val, target_val, n11))
                continue
            n10 = sum(table[feat_val].values()) - n11
            n01 = cols.get(target_val) - n11
            n00 = full_table - n11 - n10 - n01
            if n10 == 0 or n01 == 0 or n00 == 0:
                if verbose:
                    print("\tFor feat_val={}, target_val={}, n10={} or n01={} or n00={} is zero, skipping"
                        .format(feat_val, target_val, n10, n01, n00))
                continue
            n1_ = n11 + n10
            n0_ = n01 + n00
            n_1 = n11 + n01
            n_0 = n10 + n00
            n = float(full_table)
            utility = n11/n * math.log(n*n11/(n1_*n_1),2) + \
               n01 / n * math.log(n*n01/(n0_*n_1), 2) + \
               n10 / n * math.log(n*n10/(n1_*n_0), 2) + \
               n00 / n * math.log(n*n00/(n0_*n_0), 2)
            if best_utility is None or best_utility < utility:
                best_utility = utility
    if verbose:
        print("\tbest_utility: {}".format(best_utility))
    if best_utility is not None:
        feature_utility.append( (all_vocab[tok_idx], best_utility) )
all_vocab = None # free memory
    
feature_utility = sorted(feature_utility, key=lambda x:x[1], reverse=True)

PARAM_KEEP_TOP = 1000
with open("ch8_cell7_vocab.tsv", "w") as kept:
    for row in feature_utility[:PARAM_KEEP_TOP]:
        kept.write("{}\t{}\n".format(*row))
        
table1 = ("<table><tr><th>Position</th><th>Token</th><th>Utility</th></tr>" +
            "\n".join(list(map(lambda r: 
                               "<tr><td>{}</td><td>{}</td><td>{:5.10f}</td></tr>".format(
                        r[0], r[1][0], r[1][1]), 
                               enumerate(feature_utility[:100])))) +"</table>")
table2 = ("<table><tr><th>Position</th><th>Feat</th><th>Utility</th></tr>" +
            "\n".join(list(map(lambda r: 
                               "<tr><td>{}</td><td>{}</td><td>{:5.10f}</td></tr>".format(
                        r[0], r[1][0], r[1][1]), 
                               enumerate(reversed(feature_utility[-100:]))))) +"</table>")

with open("ch8_cell7_dev_tokens.tsv", "w") as kept:
    kept.write("name\t" + "\t".join(map(lambda x:"token=" + x[0],feature_utility[:PARAM_KEEP_TOP]))+"\n")
    matrix = np.zeros( (ydata.shape[0], PARAM_KEEP_TOP) )
    for idx_tok, row in enumerate(feature_utility[:PARAM_KEEP_TOP]):
        tok = row[0]
        for idx_doc in tok_docs[vocab_to_idx[tok]]:
            matrix[idx_doc, idx_tok] = 1.0
    for idx_doc in range(matrix.shape[0]):
        kept.write(cities[idx_doc] + "\t" + "\t".join(map(str,matrix[idx_doc,:])) +"\n")
matrix       = None
tok_docs     = None
vocab_to_idx = None

from IPython.display import HTML, display
display(HTML("<h3>Top 100 tokens by MI</h3>" + table1 + 
             "<h3>Last 100 tokens by MI</h3>" + table2))
Tokenizing     0 out of 44959 cities, city "<http://dbpedia.org/resource/Ankara>"
Tokenizing  4495 out of 44959 cities, city "<http://dbpedia.org/resource/Gonzales,_Louisiana>"
Tokenizing  8990 out of 44959 cities, city "<http://dbpedia.org/resource/Laurel_Bay,_South_Carolina>"
Tokenizing 13485 out of 44959 cities, city "<http://dbpedia.org/resource/Nysa,_Poland>"
Tokenizing 17980 out of 44959 cities, city "<http://dbpedia.org/resource/Vilathikulam>"
Tokenizing 22475 out of 44959 cities, city "<http://dbpedia.org/resource/Arroyo_Seco,_Santa_Fe>"
Tokenizing 26970 out of 44959 cities, city "<http://dbpedia.org/resource/Fatehpur,_Barabanki>"
Tokenizing 31465 out of 44959 cities, city "<http://dbpedia.org/resource/Kirchheim_am_Neckar>"
Tokenizing 35960 out of 44959 cities, city "<http://dbpedia.org/resource/Pirching_am_Traubenberg>"
Tokenizing 40455 out of 44959 cities, city "<http://dbpedia.org/resource/Scone,_Perth_and_Kinross>"
Tokenizing 44950 out of 44959 cities, city "<http://dbpedia.org/resource/Babatorun>"
Total vocabulary: 408,793
Reduced vocabulary: 6,254 (reduction 98.470130%)
Computing M.I. for      0 out of   6254 tokens, token ","
Computing M.I. for     62 out of   6254 tokens, token "honey"
Computing M.I. for    124 out of   6254 tokens, token "that"
Computing M.I. for    186 out of   6254 tokens, token "gradually"
Computing M.I. for    248 out of   6254 tokens, token "trading"
Computing M.I. for    310 out of   6254 tokens, token "acts"
Computing M.I. for    372 out of   6254 tokens, token "park"
Computing M.I. for    434 out of   6254 tokens, token "high"
Computing M.I. for    496 out of   6254 tokens, token "council"
Computing M.I. for    558 out of   6254 tokens, token "eastern"
Computing M.I. for    620 out of   6254 tokens, token "resources"
Computing M.I. for    682 out of   6254 tokens, token "buildings"
Computing M.I. for    744 out of   6254 tokens, token "inland"
Computing M.I. for    806 out of   6254 tokens, token "electricity"
Computing M.I. for    868 out of   6254 tokens, token "automotive"
Computing M.I. for    930 out of   6254 tokens, token "virtue"
Computing M.I. for    992 out of   6254 tokens, token "sculpture"
Computing M.I. for   1054 out of   6254 tokens, token "campus"
Computing M.I. for   1116 out of   6254 tokens, token "popular"
Computing M.I. for   1178 out of   6254 tokens, token "enlarged"
Computing M.I. for   1240 out of   6254 tokens, token "ko"
Computing M.I. for   1302 out of   6254 tokens, token "associated"
Computing M.I. for   1364 out of   6254 tokens, token "performances"
Computing M.I. for   1426 out of   6254 tokens, token "mentioned"
Computing M.I. for   1488 out of   6254 tokens, token "list"
Computing M.I. for   1550 out of   6254 tokens, token "spain"
Computing M.I. for   1612 out of   6254 tokens, token "seaside"
Computing M.I. for   1674 out of   6254 tokens, token "build"
Computing M.I. for   1736 out of   6254 tokens, token "TOKNUMSEG14"
Computing M.I. for   1798 out of   6254 tokens, token "ever"
Computing M.I. for   1860 out of   6254 tokens, token "sense"
Computing M.I. for   1922 out of   6254 tokens, token "politics"
Computing M.I. for   1984 out of   6254 tokens, token "implemented"
Computing M.I. for   2046 out of   6254 tokens, token "category"
Computing M.I. for   2108 out of   6254 tokens, token "privately"
Computing M.I. for   2170 out of   6254 tokens, token "watch"
Computing M.I. for   2232 out of   6254 tokens, token "operated"
Computing M.I. for   2294 out of   6254 tokens, token "games"
Computing M.I. for   2356 out of   6254 tokens, token "historically"
Computing M.I. for   2418 out of   6254 tokens, token "opening"
Computing M.I. for   2480 out of   6254 tokens, token "31"
Computing M.I. for   2542 out of   6254 tokens, token "workshops"
Computing M.I. for   2604 out of   6254 tokens, token "provide"
Computing M.I. for   2666 out of   6254 tokens, token "breweries"
Computing M.I. for   2728 out of   6254 tokens, token "fog"
Computing M.I. for   2790 out of   6254 tokens, token "underground"
Computing M.I. for   2852 out of   6254 tokens, token "employers"
Computing M.I. for   2914 out of   6254 tokens, token "basilica"
Computing M.I. for   2976 out of   6254 tokens, token "membership"
Computing M.I. for   3038 out of   6254 tokens, token "me"
Computing M.I. for   3100 out of   6254 tokens, token "twelve"
Computing M.I. for   3162 out of   6254 tokens, token "australian"
Computing M.I. for   3224 out of   6254 tokens, token "problems"
Computing M.I. for   3286 out of   6254 tokens, token "elizabeth"
Computing M.I. for   3348 out of   6254 tokens, token "eighteen"
Computing M.I. for   3410 out of   6254 tokens, token "feast"
Computing M.I. for   3472 out of   6254 tokens, token "tour"
Computing M.I. for   3534 out of   6254 tokens, token "deemed"
Computing M.I. for   3596 out of   6254 tokens, token "53"
Computing M.I. for   3658 out of   6254 tokens, token "candidates"
Computing M.I. for   3720 out of   6254 tokens, token "manager"
Computing M.I. for   3782 out of   6254 tokens, token "reaching"
Computing M.I. for   3844 out of   6254 tokens, token "pools"
Computing M.I. for   3906 out of   6254 tokens, token "germanic"
Computing M.I. for   3968 out of   6254 tokens, token "garrison"
Computing M.I. for   4030 out of   6254 tokens, token "gives"
Computing M.I. for   4092 out of   6254 tokens, token "regained"
Computing M.I. for   4154 out of   6254 tokens, token "maurice"
Computing M.I. for   4216 out of   6254 tokens, token "vital"
Computing M.I. for   4278 out of   6254 tokens, token "journal"
Computing M.I. for   4340 out of   6254 tokens, token "contributed"
Computing M.I. for   4402 out of   6254 tokens, token "vs"
Computing M.I. for   4464 out of   6254 tokens, token "champion"
Computing M.I. for   4526 out of   6254 tokens, token "smith"
Computing M.I. for   4588 out of   6254 tokens, token "virtually"
Computing M.I. for   4650 out of   6254 tokens, token "recycling"
Computing M.I. for   4712 out of   6254 tokens, token "votes"
Computing M.I. for   4774 out of   6254 tokens, token "guided"
Computing M.I. for   4836 out of   6254 tokens, token "argentina"
Computing M.I. for   4898 out of   6254 tokens, token "grave"
Computing M.I. for   4960 out of   6254 tokens, token "varied"
Computing M.I. for   5022 out of   6254 tokens, token "aftermath"
Computing M.I. for   5084 out of   6254 tokens, token "generations"
Computing M.I. for   5146 out of   6254 tokens, token "actor"
Computing M.I. for   5208 out of   6254 tokens, token "exactly"
Computing M.I. for   5270 out of   6254 tokens, token "else"
Computing M.I. for   5332 out of   6254 tokens, token "popularly"
Computing M.I. for   5394 out of   6254 tokens, token "resting"
Computing M.I. for   5456 out of   6254 tokens, token "parkland"
Computing M.I. for   5518 out of   6254 tokens, token "internal"
Computing M.I. for   5580 out of   6254 tokens, token "grandfather"
Computing M.I. for   5642 out of   6254 tokens, token "italians"
Computing M.I. for   5704 out of   6254 tokens, token "khan"
Computing M.I. for   5766 out of   6254 tokens, token "brittany"
Computing M.I. for   5828 out of   6254 tokens, token "thick"
Computing M.I. for   5890 out of   6254 tokens, token "harvest"
Computing M.I. for   5952 out of   6254 tokens, token "artefacts"
Computing M.I. for   6014 out of   6254 tokens, token "shelters"
Computing M.I. for   6076 out of   6254 tokens, token "convert"
Computing M.I. for   6138 out of   6254 tokens, token "ceramic"
Computing M.I. for   6200 out of   6254 tokens, token "ranching"

Top 100 tokens by MI

PositionTokenUtility
0city0.1108237610
1capital0.0679695527
2cities0.0676417294
3largest0.0606824459
4also0.0596949543
5major0.0593810897
6airport0.0581575258
7international0.0546523961
8its0.0512131164
9one0.0502754660
10than0.0499407253
11most0.0497764006
12urban0.0491910525
13government0.0487000936
14are0.0476354436
15during0.0464751543
16into0.0457037805
17headquarters0.0448551514
18such0.0447899121
19important0.0447317335
20national0.0442497759
21many0.0439325441
22known0.0437755166
23well0.0434096614
24climate0.0430365158
25by0.0425911903
26this0.0408151422
27main0.0404480615
28university0.0403195373
29on0.0400469068
30like0.0399242031
31center0.0397057255
32be0.0388905365
33these0.0384321065
34areas0.0380490553
35due0.0379562912
36development0.0378602063
37among0.0374609243
38have0.0373650612
39some0.0371564823
40districts0.0369598676
41famous0.0367834159
42that0.0366869422
43their0.0363332198
44include0.0359582978
45period0.0359366263
46year0.0359325697
47TOKNUMSEG290.0356911021
48large0.0351644409
49several0.0346961961
50but0.0345412779
51estimated0.0341262058
52industrial0.0340581313
53bus0.0339183590
54temperature0.0339066858
55around0.0336944941
56very0.0335933319
57number0.0335447925
58province0.0332509809
59where0.0327465421
60after0.0323553222
61became0.0323550912
62cultural0.0323344373
63companies0.0320323035
64india0.0319812811
65s0.0318898285
66divided0.0318143270
67stations0.0317933826
68not0.0316156080
69commercial0.0316035171
70based0.0315314842
71town0.0313758148
72world0.0313056359
73central0.0310292741
74level0.0309108679
75municipal0.0309018917
76founded0.0307581288
77however0.0306725285
78industries0.0306075132
79station0.0305030795
80they0.0302203802
81established0.0300443521
82to0.0299554552
83industry0.0298852207
84been0.0298833624
85second0.0297793476
86economy0.0297729727
87institutions0.0297075802
88while0.0296185191
89million0.0295494997
90annual0.0293786859
91country0.0293670007
92can0.0293557741
93higher0.0292922646
94region0.0292572415
95various0.0291898767
96being0.0289820850
97season0.0289659138
98april0.0288233996
99only0.0287397055

Last 100 tokens by MI

PositionFeatUtility
0wroc0.0000003671
1fork0.0000010077
2minnesota0.0000027550
3blacksmith0.0000031540
4lordship0.0000061330
5porch0.0000070681
6slate0.0000117212
7tennessee0.0000135922
8nord0.0000150262
9dublin0.0000164071
10marion0.0000164326
11locality0.0000176525
12sk0.0000186769
13norway0.0000191735
14yielded0.0000200087
15abbot0.0000222070
16graveyard0.0000226601
17pays0.0000251598
18bohemian0.0000275182
19gallo0.0000286014
20townsite0.0000292720
21vicar0.0000293665
22teau0.0000299623
23stra0.0000305643
24wi0.0000308261
25carolina0.0000313822
26gro0.0000327638
27dakota0.0000339413
28cherokee0.0000342552
29toledo0.0000344686
30texas0.0000350814
31mascot0.0000360057
32bypassed0.0000368673
33utah0.0000370233
34oregon0.0000370328
35trout0.0000377955
36louisiana0.0000394990
37nave0.0000410843
38gaelic0.0000414329
39indiana0.0000415696
40res0.0000418519
41comarca0.0000433390
42vineyard0.0000450929
43und0.0000482406
44normandy0.0000494604
45seine0.0000520568
46vermont0.0000529229
47les0.0000534858
48villagers0.0000537598
49kentucky0.0000538278
50missouri0.0000539930
51sawmills0.0000541359
52te0.0000557702
53lords0.0000567686
54salmon0.0000581586
55cumberland0.0000586670
56butler0.0000591994
57hay0.0000592873
58wisconsin0.0000664782
59alps0.0000667388
60rhein0.0000682795
61350.0000700897
62fief0.0000707893
63bells0.0000712849
64walkers0.0000713318
65820.0000723083
66quarries0.0000750313
67virginia0.0000759667
68alabama0.0000791244
69landowner0.0000793196
70croats0.0000809431
71benedictine0.0000845939
72780.0000868250
73vineyards0.0000880741
74delaware0.0000881804
75pomerania0.0000882109
76im0.0000882451
77csb0.0000894023
78ne0.0000902184
79postmaster0.0000910172
80bohemia0.0000913958
81milan0.0000916828
82cottages0.0000917960
83gr0.0000922161
84yorkshire0.0000931658
85surveyor0.0000948157
86der0.0000955069
87monroe0.0000985637
88provence0.0000986144
89schoolhouse0.0000987327
90maps0.0000996097
91sawmill0.0000996336
92knight0.0000998679
93suffolk0.0001005785
94zip0.0001013794
95wesleyan0.0001041994
96sweden0.0001056110
97priory0.0001064107
98bakery0.0001066078
99welsh0.0001075209

The most informative tokens look quite helpful, particularly words like "capital". "major" or "international". Interestingly, not all the discretized numbers were chose but most of them (18 out of 32). That shows their value. The ones missing fall into a category that confuses themselves with years. The use of NER [cite] will help here, but I do not want to increase the running time so much to add it.

I have dropped capitalization to reduce the feature set plus some light punctuation removal. Further processing is possible by stemming (conflating "city" and "cities") and dropping stop words, but we'll see if that's an issue through error analysis.

Now we can try the new feature vector in Cell #8.

In [8]:
# CELL 8
import bz2
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# read base features
rand = random.Random(42)
all_data = list()
city_to_all_data = dict()
header = None
with open("ch8_cell1_dev_textlen.tsv") as f:
    header = next(f)
    header = header.strip().split("\t")
    header.pop(0) # name
    header.pop() # population
    for line in f:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        feats = list(map(float,fields[1:-1]))
        city_to_all_data[name] = len(all_data)
        all_data.append( (feats, logpop, name) )
                
# add text features
with open("ch8_cell7_dev_tokens.tsv") as feats:
    extra_header = next(feats)
    extra_header = extra_header.strip().split("\t")
    extra_header.pop(0) # name
    header.extend(extra_header)
    for line in feats:
        fields = line.strip().split("\t")
        name = fields[0]
        all_data[city_to_all_data[name]][0].extend(list(map(float,fields[1:])))
        
with open("ch8_cell8_dev_feat2.tsv", "w") as feats:
    extheader = header.copy()
    extheader.insert(0, 'name')
    extheader.append('logpop')
    feats.write("\t".join(extheader) + "\n")
    for row in all_data:
        feats.write("{}\t{}\t{}\n".format(row[-1], "\t".join(map(str,row[0])), row[1]))
    
# split
train_data = list()
test_data  = list()
for row in all_data:
    if rand.random() < 0.2:
        test_data.append(row) 
    else:
        train_data.append(row)

test_data  = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest  = np.array(list(map(lambda t:t[0], test_data)))
ytest  = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data  = None

# train
print("Training on {:,} cities".format(len(xtrain)))

rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(xtrain, ytrain)
ytest_pred = rf.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

xtrain = None
xtest  = None

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest,      label="actual",    color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell8_rf_feat2.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Training on 35,971 cities
RMSE 0.3318826561100693
Out[8]:
<matplotlib.legend.Legend at 0x7fcfd418e990>

That is an improvement, let us drill down with Error Analysis to see what worked and what did not

I will now proceed to do an Error Analysis looking at the documents that gained the most with the text and the ones that were more hurt (Cel #9)

In [9]:
# CELL 9
import bz2
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from collections import OrderedDict

# read base features
rand = random.Random(42)
base_data         = list()
city_to_base_data = dict()
base_header = None
with open("ch8_cell1_dev_textlen.tsv") as f:
    base_header = next(f)
    base_header = base_header.strip().split("\t")
    base_header.pop(0) # name
    base_header.pop() # population
    for line in f:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        feats = list(map(float,fields[1:-1]))
        city_to_base_data[name] = len(base_data)
        base_data.append( (feats, logpop, name) )
                
# read text features
mi_data         = list()
city_to_mi_data = dict()
mi_header = None
with open("ch8_cell8_dev_feat2.tsv") as mi:
    mi_header = next(mi)
    mi_header = mi_header.strip().split("\t")
    mi_header.pop(0) # name
    mi_header.pop() # population
    for line in mi:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        feats = list(map(float,fields[1:-1]))
        city_to_mi_data[name] = len(mi_data)
        mi_data.append( (feats, logpop, name) )

# split
base_train_data = list()
base_test_data  = list()
mi_train_data   = list()
mi_test_data    = list()
for row in base_data:
    if rand.random() < 0.2:
        base_test_data.append(row)
        mi_test_data.append(mi_data[city_to_mi_data[row[-1]]])
    else:
        base_train_data.append(row)
        mi_train_data.append(mi_data[city_to_mi_data[row[-1]]])
base_data = None
mi_data   = None

base_test_data = sorted(base_test_data, key=lambda t:t[1])
mi_test_data   = sorted(mi_test_data,   key=lambda t:t[1])
test_names     = list(map(lambda t:t[2], base_test_data))

base_xtrain = np.array(list(map(lambda t:t[0], base_train_data)))
ytrain      = np.array(list(map(lambda t:t[1], base_train_data)))
base_xtest  = np.array(list(map(lambda t:t[0], base_test_data)))
ytest       = np.array(list(map(lambda t:t[1], base_test_data)))
base_train_data = None
base_test_data  = None

mi_xtrain = np.array(list(map(lambda t:t[0], mi_train_data)))
mi_xtest  = np.array(list(map(lambda t:t[0], mi_test_data)))
mi_train_data = None
mi_test_data  = None

# train
print("Base training on {:,} cities".format(len(ytrain)))

rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(base_xtrain, ytrain)
base_ytest_pred = rf.predict(base_xtest)
base_se = (base_ytest_pred - ytest)**2

print("M.I. training on {:,} cities".format(len(ytrain)))
rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(mi_xtrain, ytrain)
mi_ytest_pred = rf.predict(mi_xtest)
mi_se = (mi_ytest_pred - ytest)**2

# find the bigger winners and losers
se_ytest_diff = base_se - mi_se # small is better, it's error
named_se = list()
for idx in range(se_ytest_diff.shape[0]):
    named_se.append( (se_ytest_diff[idx], test_names[idx], idx) )

named_se = sorted(named_se, key=lambda x:x[0], reverse=True)

to_print = OrderedDict()
for idx, winner in enumerate(named_se[:10]):
    to_print[winner[1]] = { 
        'improv' : winner[0], 
        'base'   : int(round(10**base_ytest_pred[winner[2]])),
        'mi'     : int(round(10**mi_ytest_pred[winner[2]])),
        'pop'    : int(round(10**ytest[winner[2]])),
        'type'   : 'winner',
        'pos'    : idx }
for idx, loser in enumerate(named_se[-10:]):
    to_print[loser[1]] = { 
        'improv' : loser[0], 
        'base'   : int(round(10**base_ytest_pred[loser[2]])),
        'mi'     : int(round(10**mi_ytest_pred[loser[2]])),
        'pop'    : int(round(10**ytest[loser[2]])),
        'type'   : 'loser',
        'pos'    : (9-idx)}
    
kept_terms = set(map(lambda l:l.split('\t')[0], open("ch8_cell7_vocab.tsv").readlines()))

base_xtrain = None
base_xtest  = None
mi_xtrain   = None
mi_xtest    = None
                 
htmls = [""] * 20
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
    for byteline in wikitext:
        cityline = byteline.decode("utf-8")
        tab = cityline.index('\t')
        name = cityline[:tab]
        if name in to_print:
            text = cityline[tab:]
            tokens = list(filter(lambda tok: tok in kept_terms, cell7_tokenize(text)))
            text = text.replace('\t','<p>')
            entry = to_print[name]
            this_html = ("<h1>Top {} {}: {}</h1>"+
                     "<h2>Change: {:1.5} (base: {:,}, MI: {:,}). Population: {:,}. Text length: {:,}</h2>{}"+
                     "<h2>Tokens (length {:,})</h2>{}") \
                      .format((entry['pos'] + 1), entry['type'], name[1:-1], entry['improv'], entry['base'], entry['mi'],
                          entry['pop'], len(text), text[:1000], len(tokens), tokens[:100])
            if entry['type'] == 'winner':
                htmls[entry['pos']] = this_html
            else:
                htmls[10+entry['pos']] = this_html
html = "".join(htmls)
from IPython.display import HTML, display
display(HTML(html))
Base training on 35,971 cities
M.I. training on 35,971 cities

Top 1 winner: http://dbpedia.org/resource/Ganzhou

Change: 2.6674 (base: 179,493, MI: 3,808,862). Population: 8,368,447. Text length: 5,851

Ganzhou

Ganzhou Ganzhou (), formerly romanized as Kanchow, is a prefecture-level city in southern Jiangxi, China, bordering Fujian to the east, Guangdong to the south, and Hunan to the west. Its administrative seat is at Zhanggong District. Its population was 8,361,447 at the 2010 census whom 1,977,253 in the built-up (or "metro") area made of Zhanggong and Nankang, and Ganxian largely being urbanized. In 201, Emperor Gaozu of Han established a county in the territory of modern Ganzhou. In those early years, Han Chinese settlement and authority in the area was minimal and largely restricted to the Gan River basin. The river, a tributary of the Yangtze via Poyang Lake, provided a route of communication from the north as well as irrigation for rice farming. During the Sui dynasty, the county administration was promoted to prefecture status and the area called Qianzhou (). During the Song, immigration from the north bolstered the local population and drove local aboriginal tribes f

Tokens (length 545)

[',', 'formerly', 'as', 'prefecture', 'level', 'city', 'southern', 'china', 'to', 'the', 'east', 'to', 'the', 'south', 'and', 'to', 'the', 'west', 'its', 'administrative', 'at', 'its', 'population', 'was', 'at', 'the', 'TOKNUMSEG6', 'the', 'built', 'metro', 'area', 'of', 'and', 'and', 'largely', 'being', 'TOKNUMSMALL', 'emperor', 'of', 'established', 'county', 'the', 'territory', 'of', 'modern', 'early', 'chinese', 'and', 'authority', 'the', 'area', 'was', 'and', 'largely', 'to', 'the', 'river', 'the', 'river', 'of', 'the', 'via', 'provided', 'of', 'from', 'the', 'north', 'as', 'well', 'as', 'for', 'rice', 'during', 'the', 'dynasty', 'the', 'county', 'administration', 'was', 'to', 'prefecture', 'status', 'and', 'the', 'area', 'called', 'during', 'the', 'from', 'the', 'north', 'the', 'local', 'population', 'and', 'local', 'tribes', 'further', 'into', 'the']

Top 2 winner: http://dbpedia.org/resource/Sharjah

Change: 2.5115 (base: 12,693, MI: 72,028). Population: 1,400,000. Text length: 12,841

Sharjah

Sharjah Sharjah (; "") is the third largest and third most populous city in the United Arab Emirates, forming part of the Dubai-Sharjah-Ajman metropolitan area. It is located along the southern coast of the Persian Gulf on the Arabian Peninsula. Sharjah is the capital of the emirate of Sharjah. Sharjah shares legal, political, military and economic functions with the other emirates of the UAE within a federal framework, although each emirate has jurisdiction over some functions such as civil law enforcement and provision and upkeep of local facilities. Sharjah has been ruled by the Al Qasimi dynasty since the 18th century. The city is a centre for culture and industry, and alone contributes 7.4% of the GDP of the United Arab Emirates. The city covers an approximate area of 235 km² and has a population of over 800,000 (2008). The sale or consumption of alcoholic beverages is prohibited in the emirate of Sharjah without possession of an alcohol licence and alcohol is not s

Tokens (length 1,229)

['the', 'third', 'largest', 'and', 'third', 'most', 'populous', 'city', 'the', 'part', 'of', 'the', 'metropolitan', 'area', 'it', 'located', 'along', 'the', 'southern', 'coast', 'of', 'the', 'on', 'the', 'the', 'capital', 'of', 'the', 'of', 'political', 'military', 'and', 'economic', 'with', 'the', 'other', 'of', 'the', 'within', 'although', 'each', 'has', 'jurisdiction', 'some', 'such', 'as', 'law', 'and', 'and', 'of', 'local', 'facilities', 'has', 'been', 'ruled', 'by', 'the', 'al', 'dynasty', 'since', 'the', 'century', 'the', 'city', 'centre', 'for', 'culture', 'and', 'industry', 'and', 'alone', 'of', 'the', 'gdp', 'of', 'the', 'the', 'city', 'an', 'area', 'of', 'TOKNUMSMALL', 'and', 'has', 'population', 'of', 'TOKNUMSEG6', 'the', 'of', 'the', 'of', 'of', 'an', 'and', 'not', 'served', 'restaurants', 'other', 'due', 'to']

Top 3 winner: http://dbpedia.org/resource/Az_Zubayr

Change: 2.1715 (base: 6,853, MI: 45,442). Population: 370,000. Text length: 1,634

Az Zubayr

Az Zubayr Az Zubayr () is a town in Basra Governorate in Iraq, just south of Basra. The name can also refer to the old Emirate of Zubair. The name is also sometimes written Az Zubair, Zubair, Zoubair, El Zubair, or Zobier. The city was named al-Zubair because one of the Sahaba (companions) of the Prophet Muhammad, Zubayr ibn al-Awwam, was buried there. During the Ottoman times, the city was a self-ruling emirate ruled by an Emir (or Sheikh) from Najdi families, such as Al Zuhair, Al Meshary, and Al Ibrahim families. Like other Emirates under the Ottoman Empire, the Emirate of Zubair used to pay dues and receive protection from the Ottomans. In the 19th century, the city of Zubair witnessed relatively large migrations from Najd. Up until the 1970s and 1980s, the town was predominantly populated by people of Najdi origin. Now only a few families remain of the old inhabitants. Most of them moved back to their homeland of Najd and other regions of Saudi Arabia and to Kuwai

Tokens (length 160)

['town', 'just', 'south', 'of', 'the', 'name', 'can', 'also', 'to', 'the', 'old', 'of', 'the', 'name', 'also', 'sometimes', 'el', 'the', 'city', 'was', 'named', 'al', 'because', 'one', 'of', 'the', 'of', 'the', 'al', 'was', 'there', 'during', 'the', 'times', 'the', 'city', 'was', 'ruled', 'by', 'an', 'from', 'such', 'as', 'al', 'al', 'and', 'al', 'like', 'other', 'under', 'the', 'empire', 'the', 'of', 'used', 'to', 'and', 'from', 'the', 'the', '19th', 'century', 'the', 'city', 'of', 'relatively', 'large', 'from', 'until', 'the', '1970s', 'and', '1980s', 'the', 'town', 'was', 'populated', 'by', 'of', 'now', 'only', 'few', 'of', 'the', 'old', 'most', 'of', 'moved', 'back', 'to', 'their', 'of', 'and', 'other', 'regions', 'of', 'and', 'to', 'the', 'period']

Top 4 winner: http://dbpedia.org/resource/Qods,_Iran

Change: 2.0675 (base: 6,405, MI: 59,047). Population: 229,354. Text length: 618

Qods, Iran

Qods, Iran Qods (, also known as Shahr-e Qods, meaning "City of Qods"; formerly, Karaj, Qal‘eh Hasan, and Qal‘eh-ye Ḩasan Khān) is a city in and the capital of Qods County, Tehran Province, Iran. At the 2006 census, its population was 229,354, in 60,331 families. Before Qods officially became a municipality in 1989 it was named Qal‘eh Hasan. The Persian Gulf Pro League team Paykan plays in the city at Shahre Qods Stadium. The city has three universities: Islamic Azad University, Shahr-eQods Branch, University of Applied Science and Technology, Shahr-e-Qods Branch and Payam-e-Nour university.

Tokens (length 58)

[',', 'also', 'known', 'as', 'e', 'meaning', 'city', 'of', 'formerly', 'and', 'n', 'city', 'and', 'the', 'capital', 'of', 'county', 'province', 'at', 'the', 'TOKNUMSEG6', 'its', 'population', 'was', 'TOKNUMSEG30', 'TOKNUMSEG28', 'before', 'officially', 'became', 'TOKNUMSEG6', 'it', 'was', 'named', 'the', 'league', 'team', 'plays', 'the', 'city', 'at', 'stadium', 'the', 'city', 'has', 'three', 'universities', 'university', 'branch', 'university', 'of', 'science', 'and', 'technology', 'e', 'branch', 'and', 'e', 'university']

Top 5 winner: http://dbpedia.org/resource/Farah,_Afghanistan

Change: 2.0083 (base: 10,479, MI: 59,115). Population: 540,000. Text length: 5,858

Farah, Afghanistan

Farah, Afghanistan Farah (Pashto/Dari Persian: فراه) is the capital of Farah Province, located in western Afghanistan. It has a population of about 540000, and is mainly ethnic Pashtun people. It is about the 16th-largest city of the country in terms of population. The Farah Airport is located in the area. Farah is located in western Afghanistan, close to Herat and Iran, although it lacks a direct road connection with the latter. Farah has a very clear grid of roads distributed through the higher density residential areas. However barren land (35%) and vacant plots (25%) are the largest land uses and combine for 60% of total land use. The Citadel at Farah is probably one of a series of fortresses constructed by Alexander the Great, the city being an intermediate stop between Herat, the location of another of Alexander's fortresses, and Kandahar. The ‘Alexandria’ prefix was added to the city’s name when Alexander came in 330 BC. Under the Parthian Empire, Farah

Tokens (length 589)

['the', 'capital', 'of', 'province', 'located', 'western', 'it', 'has', 'population', 'of', 'and', 'mainly', 'ethnic', 'it', 'the', 'largest', 'city', 'of', 'the', 'country', 'terms', 'of', 'population', 'the', 'airport', 'located', 'the', 'area', 'located', 'western', 'close', 'to', 'and', 'although', 'it', 'direct', 'road', 'with', 'the', 'has', 'very', 'of', 'roads', 'through', 'the', 'higher', 'residential', 'areas', 'however', 'and', 'are', 'the', 'largest', 'and', 'for', 'of', 'use', 'the', 'at', 'one', 'of', 'series', 'of', 'constructed', 'by', 'the', 'great', 'the', 'city', 'being', 'an', 'between', 'the', 'location', 'of', 'another', 'of', 's', 'and', 'the', 'was', 'to', 'the', 'city', 's', 'name', 'when', 'came', 'TOKNUMSMALL', 'under', 'the', 'empire', 'under', 'the', 'of', 'and', 'was', 'one', 'of', 'its']

Top 6 winner: http://dbpedia.org/resource/Masina,_Kinshasa

Change: 1.928 (base: 8,701, MI: 42,346). Population: 485,167. Text length: 1,086

Masina, Kinshasa

Masina, Kinshasa Masina is a municipality ("commune") in the Tshangu district of Kinshasa, the capital city of the Democratic Republic of the Congo. It is bordered by the Pool Malebo in the north and "Boulevard Lumbumba" to the south. Masina shelters within it the "Marché de la Liberté "M’Zee Laurent-Désiré Kabila"”, one of the largest markets of Kinshasa, which was built under the presidency of Laurent-Désiré Kabila to repay the inhabitants of the district of Tshangu who had resisted the rebels in August 1998. The area is known by the nickname "Chine Populaire" ("People's China"). Masina, together with the communes of Ndjili and Kimbanseke, belong to the district of Tshangu 20 km east of central Kinshasa. Most of the municipality is occupied by a wetland bordering the Pool Malebo, which explains the low population density relative of the municipality. The urban area, stretching along Boulevard Lumumba, however, reaches population densities comparable to those

Tokens (length 89)

['the', 'of', 'the', 'capital', 'city', 'of', 'the', 'of', 'the', 'it', 'by', 'the', 'the', 'north', 'and', 'boulevard', 'to', 'the', 'south', 'within', 'it', 'the', 'march', 'de', 'la', 'm', ',', 'one', 'of', 'the', 'largest', 'of', 'which', 'was', 'built', 'under', 'the', 'of', 'to', 'the', 'of', 'the', 'of', 'the', 'august', 'TOKNUMSEG6', 'the', 'area', 'known', 'by', 'the', 's', 'china', 'with', 'the', 'of', 'and', 'to', 'the', 'of', 'east', 'of', 'central', 'most', 'of', 'the', 'occupied', 'by', 'the', 'which', 'the', 'low', 'population', 'of', 'the', 'the', 'urban', 'area', 'along', 'boulevard', 'however', 'population', 'to', 'of', 'other', 'the', 'heart', 'of', 'TOKNUMSEG28']

Top 7 winner: http://dbpedia.org/resource/Apodaca

Change: 1.6533 (base: 9,805, MI: 36,756). Population: 523,370. Text length: 1,525

Apodaca

Apodaca Apodaca () is a city and its surrounding municipality that is part of Monterrey Metropolitan area. It lies in the northeastern part of the metropolitan area. As of the 2005 census, the city had a population of 393,195 and the municipality had a population of 418,784. The municipality has an area of 183.5 km. The fourth-largest city in the state (behind Monterrey, Guadalupe, and San Nicolás de los Garza), Apodaca is one of the fastest-growing cities in Nuevo León and an important industrial center. Two airports, General Mariano Escobedo International Airport (IATA: MTY) and Del Norte International Airport (IATA: NTR), are located in Apodaca. VivaAerobus Airline and Grupo Aeroportuario Centro Norte have their corporate headquarters on the grounds of Escobedo Airport. The municipality of Apodaca is one of the major industrial centers of the state of Nuevo León. Apodaca's economy is founded basically in manufacturing operations and services. American companies such a

Tokens (length 141)

['city', 'and', 'its', 'surrounding', 'that', 'part', 'of', 'metropolitan', 'area', 'it', 'the', 'part', 'of', 'the', 'metropolitan', 'area', 'as', 'of', 'the', 'TOKNUMSEG6', 'the', 'city', 'population', 'of', 'and', 'the', 'population', 'of', 'the', 'has', 'an', 'area', 'of', 'TOKNUMSMALL', 'the', 'fourth', 'largest', 'city', 'the', 'state', 'and', 'san', 's', 'de', 'los', ',', 'one', 'of', 'the', 'fastest', 'growing', 'cities', 'n', 'and', 'an', 'important', 'industrial', 'center', 'airports', 'general', 'international', 'airport', 'and', 'del', 'international', 'airport', ',', 'are', 'located', 'and', 'have', 'their', 'headquarters', 'on', 'the', 'of', 'airport', 'the', 'of', 'one', 'of', 'the', 'major', 'industrial', 'centers', 'of', 'the', 'state', 'of', 'n', 's', 'economy', 'founded', 'manufacturing', 'operations', 'and', 'services', 'companies', 'such', 'as']

Top 8 winner: http://dbpedia.org/resource/Mejicanos

Change: 1.639 (base: 9,174, MI: 64,981). Population: 224,661. Text length: 2,299

Mejicanos

Mejicanos Mejicanos is a San Salvador suburb in the San Salvador department of El Salvador. Mejicanos is a city located in San Salvador, El Salvador. At the 2009 estimate it had 160,751 inhabitants. It has been characterized by its typical food "Yuca Frita con Merienda". It has a municipal market, where the local citizens can buy groceries, vegetables, dairy products, meat, pupusas, etc. Many of the things available in the local market are produced in the surrounding villages, like vegetables. It is located in a strategic point because is in the main route to other towns or municipalities, like Cuscatancingo, Mariona, San Ramon, San Salvador, etc. However it has-single lane roads, which accounts for the frequent traffic jams in the center of the city. It has always been characterized by being a disorganized city, like other cities in El Salvador. Even though it has a municipal market people use the streets to sell their products. Because the average altitude of the cit

Tokens (length 229)

['san', 'the', 'san', 'of', 'el', 'city', 'located', 'san', 'el', 'at', 'the', 'TOKNUMSEG6', 'estimate', 'it', 'TOKNUMSEG30', 'it', 'has', 'been', 'by', 'its', 'food', 'it', 'has', 'municipal', 'market', 'where', 'the', 'local', 'citizens', 'can', 'products', 'etc', 'many', 'of', 'the', 'available', 'the', 'local', 'market', 'are', 'produced', 'the', 'surrounding', 'like', 'it', 'located', 'strategic', 'point', 'because', 'the', 'main', 'to', 'other', 'towns', 'like', 'san', 'san', 'etc', 'however', 'it', 'has', 'roads', 'which', 'for', 'the', 'frequent', 'traffic', 'the', 'center', 'of', 'the', 'city', 'it', 'has', 'been', 'by', 'being', 'city', 'like', 'other', 'cities', 'el', 'even', 'though', 'it', 'has', 'municipal', 'market', 'use', 'the', 'streets', 'to', 'their', 'products', 'because', 'the', 'of', 'the', 'city', 'around']

Top 9 winner: http://dbpedia.org/resource/Baidoa

Change: 1.5058 (base: 21,615, MI: 96,584). Population: 657,500. Text length: 7,220

Baidoa

Baidoa Baidoa (), is capital in the southwestern Bay region of Somalia.During the Middle Ages, Baidoa and its surrounding area was part of the Ajuran Sultanate. in 2005 Transitional Federal Government established temporary headquarters in Baidoa before an eventual relocation of government offices to Mogadishu. In 2012, it was made the capital of the Southwestern State of Somalia, a prospective Federal Member State. Baidoa and the broader Bay region is home to a number of important ancient sites. Archaeologists have found pre-historic rock art on the city's outskirts, in Buur Heybe. During the Middle Ages, Baidoa and its surrounding area was part of the Ajuran Sultanate. The influential polity covered much of southern Somalia and eastern Ethiopia, with its domain extending from Mareeg in the north, to Qelafo in the west, to Kismayo in the south. In the early modern period, the Baidoa area was ruled by the Geledi Sultanate. The kingdom was eventually incorporated into Ital

Tokens (length 679)

[',', 'capital', 'the', 'region', 'of', 'during', 'the', 'middle', 'and', 'its', 'surrounding', 'area', 'was', 'part', 'of', 'the', 'TOKNUMSEG6', 'government', 'established', 'headquarters', 'before', 'an', 'of', 'government', 'offices', 'to', 'TOKNUMSEG6', 'it', 'was', 'the', 'capital', 'of', 'the', 'state', 'of', 'member', 'state', 'and', 'the', 'region', 'home', 'to', 'number', 'of', 'important', 'ancient', 'have', 'found', 'art', 'on', 'the', 'city', 's', 'during', 'the', 'middle', 'and', 'its', 'surrounding', 'area', 'was', 'part', 'of', 'the', 'the', 'much', 'of', 'southern', 'and', 'eastern', 'with', 'its', 'from', 'the', 'north', 'to', 'the', 'west', 'to', 'the', 'south', 'the', 'early', 'modern', 'period', 'the', 'area', 'was', 'ruled', 'by', 'the', 'the', 'kingdom', 'was', 'into', 'TOKNUMSEG6', 'and', 'TOKNUMSEG6', 'with', 'the']

Top 10 winner: http://dbpedia.org/resource/Borough_Park,_Brooklyn

Change: 1.4972 (base: 4,950, MI: 21,465). Population: 154,210. Text length: 13,446

Borough Park, Brooklyn

Borough Park, Brooklyn Borough Park (also spelled Boro Park) is a neighborhood in the southwestern part of the borough of Brooklyn, in New York City in the United States. The neighborhood covers an extensive grid of streets between Bensonhurst to the south, Bay Ridge to the southwest, Sunset Park to the west, Kensington to the northeast, Flatbush to the east, and Midwood to the southeast. Borough Park is home to one of the largest Orthodox Jewish communities outside of Israel, with one of the largest concentrations of Jews in the United States, and Orthodox traditions rivaling many insular communities. As the average number of children in Hasidic and Hareidi families is 6.72, Borough Park is experiencing a sharp growth in population. It is an economically diverse neighborhood. Originally, the area was called Blythebourne, a small hamlet composed of cottages built and developed in 1887 by Electus Litchfield, and then expanded with more housing by developer

Tokens (length 1,204)

['park', 'park', 'park', 'also', 'park', 'neighborhood', 'the', 'part', 'of', 'the', 'of', 'new', 'city', 'the', 'the', 'neighborhood', 'an', 'extensive', 'of', 'streets', 'between', 'to', 'the', 'south', 'to', 'the', 'park', 'to', 'the', 'west', 'to', 'the', 'to', 'the', 'east', 'and', 'to', 'the', 'park', 'home', 'to', 'one', 'of', 'the', 'largest', 'communities', 'outside', 'of', 'with', 'one', 'of', 'the', 'largest', 'of', 'the', 'and', 'many', 'communities', 'as', 'the', 'number', 'of', 'and', 'park', 'growth', 'population', 'it', 'an', 'diverse', 'neighborhood', 'the', 'area', 'was', 'called', 'small', 'of', 'built', 'and', 'developed', 'TOKNUMSEG6', 'by', 'and', 'then', 'expanded', 'with', 'by', 'it', 'was', 'served', 'by', 'the', 'and', 'island', 'that', 'today', 's', 'west', 'end', 'the', 'from']

Top 1 loser: http://dbpedia.org/resource/Xuanwu_District,_Nanjing

Change: -2.8686 (base: 19,344, MI: 3,383). Population: 634,000. Text length: 1,869

Xuanwu District, Nanjing

Xuanwu District, Nanjing Xuanwu District () is one of 11 districts of Nanjing, the capital of Jiangsu province, China. Xuanwu District is an urban centre located in the north-eastern part of Nanjing. It is the site of the Nanjing Municipal Government. The main industries in the district are the leisure and tourism, information technology, retail and services. Its economy is primarily based upon the delivery of services. Industry zones include the Changjiang Road Cultural Area, Xinjiekou Central Economic Area, and Xuzhuang Software Industry Base. The district has attracted multi-national corporations, such as 3M, American Express, Siemens, Hyundai, Samsung, NYK Line, and Cathay Life Insurance. There are more than 40 colleges, universities and research institutes in the district, including Southeast University, Nanjing University of Science and Technology, Nanjing Agricultural University, Nanjing Forestry University and Jiangsu Academy of Agricultural Scie

Tokens (length 165)

['one', 'of', 'districts', 'of', 'the', 'capital', 'of', 'province', 'china', 'an', 'urban', 'centre', 'located', 'the', 'north', 'eastern', 'part', 'of', 'it', 'the', 'site', 'of', 'the', 'municipal', 'government', 'the', 'main', 'industries', 'the', 'are', 'the', 'and', 'tourism', 'technology', 'retail', 'and', 'services', 'its', 'economy', 'primarily', 'based', 'the', 'of', 'services', 'industry', 'include', 'the', 'road', 'cultural', 'area', 'central', 'economic', 'area', 'and', 'industry', 'base', 'the', 'has', 'attracted', 'national', 'such', 'as', 'express', 'and', 'life', 'there', 'are', 'than', 'colleges', 'universities', 'and', 'research', 'institutes', 'the', 'university', 'university', 'of', 'science', 'and', 'technology', 'agricultural', 'university', 'university', 'and', 'academy', 'of', 'agricultural', 'sciences', 'there', 'are', 'from', 'the', 'chinese', 'academy', 'of', 'engineering', 'and', 'chinese', 'academy', 'of']

Top 2 loser: http://dbpedia.org/resource/Demsa

Change: -1.9819 (base: 14,391, MI: 2,956). Population: 180,251. Text length: 276

Demsa

Demsa Demsa is a Local Government Area of Adamawa State, Nigeria with headquarters located in Demsa. Demsa lies on the Benue River. Population is 180,251. It is inhabited by ethnic groups such as the Bachama, Batta, Yandang, Bille, Mbula, Maya, Bare and fulani.

Tokens (length 21)

['local', 'government', 'area', 'of', 'state', 'with', 'headquarters', 'located', 'on', 'the', 'river', 'population', 'TOKNUMSEG30', 'it', 'by', 'ethnic', 'groups', 'such', 'as', 'the', 'and']

Top 3 loser: http://dbpedia.org/resource/Dunmore_East

Change: -1.9234 (base: 7,616, MI: 55,128). Population: 1,559. Text length: 5,293

Dunmore East

Dunmore East Dunmore East () is a popular tourist and fishing village in County Waterford, Ireland. Situated on the west side of Waterford Harbour on Ireland's southeastern coast, it lies within the barony of Gaultier ("Gáll Tír" – "foreigners' land" in Irish): a reference to the influx of Viking and Norman settlers in the area. Iron Age people established a promontory fort overlooking the sea at Shanoon (referred to in 1832 as meaning the 'Old Camp' but more likely Canon Power's Sean Uaimh, 'Old Cave') at a point known for centuries as Black Nobb, where the old pilot station now stands, and underneath which a cave runs. Henceforth the place was referred to as Dun Mor, the Great Fort. Fish was an important part of the people's diet, and for hundreds of years a fishing community lived here. In 1640, Lord Power of Curraghmore, who owned a large amount of property in the area, built a castle on the cliff overlooking the strand about two hundred metres from St. Andrew's

Tokens (length 495)

['east', 'east', 'east', 'popular', 'tourist', 'and', 'county', 'situated', 'on', 'the', 'west', 'side', 'of', 'on', 's', 'coast', 'it', 'within', 'the', 'of', 't', 'to', 'the', 'of', 'and', 'the', 'area', 'established', 'fort', 'the', 'sea', 'at', 'referred', 'to', 'as', 'meaning', 'the', 'old', 'but', 'power', 's', 'old', 'at', 'point', 'known', 'for', 'centuries', 'as', 'where', 'the', 'old', 'station', 'now', 'and', 'which', 'runs', 'the', 'place', 'was', 'referred', 'to', 'as', 'the', 'great', 'fort', 'was', 'an', 'important', 'part', 'of', 'the', 's', 'and', 'for', 'of', 'here', 'power', 'of', 'owned', 'large', 'of', 'the', 'area', 'built', 'on', 'the', 'the', 'metres', 'from', 's', 'the', 'was', 'into', 'by', 'the', 'middle', 'of', 'the', 'next', 'century']

Top 4 loser: http://dbpedia.org/resource/Banha

Change: -1.6199 (base: 53,574, MI: 7,173). Population: 165,906. Text length: 2,993

Banha

Banha Banha (also spelled "Benha ";  , ) is the capital of the Qalyubia Governorate in north-eastern Egypt. Located between the capital of Cairo and Alexandria, Banha is an important transport hub in the Nile Delta, as rail lines from Cairo to various cities in the Nile Delta pass through Banha. Egyptians call it "Banhā el-'asal", which means "Sweet like honey"; the nomenclature originally comes from when Muhammad sent his message to Muqawqis, ruler of Egypt, to convert to Islam; he replied by sending him gifts; two were slave girls Maria and her sister Sirīn, who were from Upper Egypt, and jar of honey. After Muhammad tasted it, he asked, "Where is it from?" They replied, "From Benha". Muhammad then said, "God bless Benha and its honey" It is located 48 km (30 mins) north of Cairo. located on the east bank of the Damietta Branch of the Nile River in the rich farmland of the southern part of the river's delta. Well-irrigated by canals leading off the Delta Barrage, a dam 3

Tokens (length 281)

['also', ',', 'the', 'capital', 'of', 'the', 'north', 'eastern', 'located', 'between', 'the', 'capital', 'of', 'and', 'an', 'important', 'transport', 'hub', 'the', 'as', 'rail', 'lines', 'from', 'to', 'various', 'cities', 'the', 'through', 'it', 'el', ',', 'which', 'means', 'like', 'the', 'from', 'when', 'his', 'to', 'ruler', 'of', 'to', 'to', 'islam', 'he', 'by', 'and', 'sister', 'n', 'from', 'and', 'of', 'after', 'it', 'he', 'where', 'it', 'from', 'they', 'from', 'then', 'said', 'and', 'its', 'it', 'located', 'north', 'of', 'located', 'on', 'the', 'east', 'bank', 'of', 'the', 'branch', 'of', 'the', 'river', 'the', 'rich', 'of', 'the', 'southern', 'part', 'of', 'the', 'river', 's', 'well', 'by', 'leading', 'the', 'the', 'surrounding', 'and', 'long', 'since', 'ancient', 'times']

Top 5 loser: http://dbpedia.org/resource/Madina,_Ghana

Change: -1.5748 (base: 8,971, MI: 2,580). Population: 137,162. Text length: 476

Madina, Ghana

Madina, Ghana Madina is a suburb of Accra and in the La-Nkwantanang-Madina Municipal Assembly, a district in the Greater Accra Region of southeastern Ghana. Madina is next to the University of Ghana and houses the Institute of Local Government. Madina is the twelfth most populous settlement in Ghana, in terms of population, with a population of 137,162 people. Madina is contained in the Abokobi-Madina electoral constituency of the republic of Ghana.

Tokens (length 36)

['of', 'and', 'the', 'la', 'municipal', 'assembly', 'the', 'greater', 'region', 'of', 'next', 'to', 'the', 'university', 'of', 'and', 'houses', 'the', 'institute', 'of', 'local', 'government', 'the', 'most', 'populous', 'terms', 'of', 'population', 'with', 'population', 'of', 'TOKNUMSEG30', 'the', 'of', 'the', 'of']

Top 6 loser: http://dbpedia.org/resource/Delgado,_San_Salvador

Change: -1.4674 (base: 7,772, MI: 2,675). Population: 174,825. Text length: 145

Delgado, San Salvador

Delgado, San Salvador Delgado, or Ciudad Delgado, is a municipality in the San Salvador department of El Salvador.

Tokens (length 6)

['san', 'san', 'the', 'san', 'of', 'el']

Top 7 loser: http://dbpedia.org/resource/Curug,_Tangerang

Change: -1.4375 (base: 7,352, MI: 2,580). Population: 165,812. Text length: 178

Curug, Tangerang

Curug, Tangerang Curug is a district within Tangerang Regency in the province of Banten, Java, Indonesia. The population at the 2010 Census was 165,812.

Tokens (length 11)

['within', 'the', 'province', 'of', 'the', 'population', 'at', 'the', 'TOKNUMSEG6', 'was', 'TOKNUMSEG30']

Top 8 loser: http://dbpedia.org/resource/Koro,_Mali

Change: -1.4224 (base: 12,727, MI: 2,619). Population: 62,681. Text length: 79

Koro, Mali

Koro, Mali Agriculture is the main income activity in Koro.

Tokens (length 4)

['agriculture', 'the', 'main', 'activity']

Top 9 loser: http://dbpedia.org/resource/Villa_de_Álvarez,_Colima

Change: -1.2677 (base: 8,086, MI: 2,831). Population: 117,600. Text length: 58

Villa de Álvarez, Colima

Villa de Álvarez, Colima

Tokens (length 2)

['de', 'de']

Top 10 loser: http://dbpedia.org/resource/Bailadores

Change: -1.2614 (base: 160,900, MI: 23,298). Population: 345,489. Text length: 316

Bailadores

Bailadores Bailadores is a town in the western part of the Mérida State of Venezuela and is the capital of the Rivas Dávila Municipality. It was founded by Captain Luis Martín Martín, September 14, 1601, by appointment of founder Peter Sandes Court, from the Real Audiencia de Santa Fe de Bogotá.

Tokens (length 30)

['town', 'the', 'western', 'part', 'of', 'the', 'm', 'state', 'of', 'and', 'the', 'capital', 'of', 'the', 'it', 'was', 'founded', 'by', 'n', 'n', 'september', 'by', 'of', 'court', 'from', 'the', 'real', 'de', 'santa', 'de']

Analysis:

CityComments
Bailadores'town' does it
Villa AlvarezNo text
Koro'agriculture' most probably is linked to smaller places
Curugwe got a toknumseg6 for 2010 Census and then the correct toknumseg30, the seg6 does it
Delgadono usable terms
Madinapopulation of toknumseg30 should be gotten by a stop-words + bigram
BanhaI think if 'cities' were 'city' it will work
DunmoreSmall hamlet with lots of info in Wikipedia, the main signal "village" is not there
Demsathe 'population toknumseg30' should do its magic
Xuanwuno idea. capital ought to have worked

census appears in 21,414 cities but it is not picked as a top 1,000, but tons of stop words are, better clean up and also remove variants to see if we can accomodate more

village appears in 13,998 cities but it had a M.I. of 0.0025583964 (compare top M.I. for city of 0.1108237610) and it was at the bottom 100 at position 2881. It should have been selected, but it might be that at 4-way splitting it is not definite enough

conclusion: conflate and filter the terms and/or expand until census and village are added. Look into bigrams, then skip-grams. Let us start with filtering stop-words and doing stemming to include census and village (Cell #8).

Third Featurization: Morphological features

In some domains, it is useful to reduce the number of features by dropping the morphological variants for different words. For example, if we believe the word prior is useful in our domain, its plural variant priors might be equally useful but more rare. If we conflate both terms as the same feature, we could obtain better performance.

Larger text samples are needed to profit from this approach, though.

To obtain morphological roots for words, we can use a dictionary of root forms (a "lemmatizer" approach) or we can use simple approximation (a "stemmer" approach).

We will use a stemming approach using an implementation of the Porter stemmer (Cell #10).

Taboo features (stop words)

A common feature selection technique in natural language processing is to drop a small set of highly frequent function words with little semantic content for classification tasks. This is called "stop word removal", an approach shared with information retrieval. We will use the snowball list of stopwords.

In [10]:
# CELL 10
import re
import pickle
import random
import bz2
import math
import numpy as np
from stemming.porter2 import stem as porter_stem

with open("ch6_cell27_splits.pk", "rb") as pkl:
    segments_at = pickle.load(pkl)

boundaries = list(map(lambda x:( int(round(10**x['min'])), 
                            int(round(10**x['val'])), 
                            int(round(10**x['max'])) ), segments_at[5]))
            
stopwords = set()
with open("stop.txt") as s:
    for line in s:
        if '|' in line:
            line = line[:line.index('|')]
        line = line.strip()
        if len(line) > 0:
            stopwords.add(line)
            
NUM_RE = re.compile('\d?\d?\d?(,?\d{3})+') # at least 3 digits
def cell10_tokenize(text):
    tokens = list(filter(lambda x: len(x)>0 and x not in stopwords,
                         map(lambda x: x.lower(),
                         re.sub('\s+',' ', re.sub('[^A-z,0-9]', ' ', text)).split(' '))))
    result = list()
    for tok in tokens:
        if len(tok) > 1 and tok[-1] == ',':
            tok = tok[:-1]
        if NUM_RE.fullmatch(tok):
            num = int(tok.replace(",",""))
            if num < boundaries[0][0]:
                result.append("TOKNUMSMALL")
            elif num > boundaries[-1][2]:
                result.append("TOKNUMBIG")
            else:
                found = False
                for idx, seg in enumerate(boundaries[1:]):
                    if num < seg[0]:
                        result.append("TOKNUMSEG" + str(idx))
                        found = True
                        break
                if not found:
                    result.append("TOKNUMSEG" + str(len(boundaries) - 1))
        else:
            result.append(porter_stem(tok))
    return result

# read base features
rand = random.Random(42)
city_pop = dict()
with open("ch8_cell1_dev_textlen.tsv") as f:
    header = next(f)
    for line in f:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        city_pop[name] = logpop
cities = sorted(list(city_pop.keys()))
        
# vocabulary
all_vocab     = list()
vocab_to_idx  = dict()
city_tok_idxs = dict()

remaining = set(city_pop.keys())
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
    for byteline in wikitext:
        cityline = byteline.decode("utf-8")
        tab = cityline.index('\t')
        name = cityline[:tab]
        if name in remaining:
            if (len(cities) - len(remaining)) % int(len(cities) / 10) == 0:
                print("Tokenizing {:>5} out of {:>5} cities, city \"{}\""
                      .format((len(cities) - len(remaining)), len(cities), name))
            remaining.remove(name)
            text = cityline[tab:]
            toks = set()
            for token in cell10_tokenize(text):
                idx = vocab_to_idx.get(token, None)
                if idx is None:
                    idx = len(all_vocab)
                    all_vocab.append(token)
                    vocab_to_idx[token] = idx
                toks.add(idx)
            city_tok_idxs[name] = sorted(list(toks))

for name in remaining:
    city_tok_idxs[name] = list()
    
print("Total vocabulary: {:,}".format(len(all_vocab)))

# drop tokens that appear in less that 200 documents
tok_docs = list()
for _ in range(len(all_vocab)):
    tok_docs.append([])
for doc_idx, name in enumerate(cities):
    tok_idxs = city_tok_idxs[name]
    for tok_idx in tok_idxs:
        tok_docs[tok_idx].append(doc_idx)
city_tok_idxs = None

threshold = 200
reduced_vocab = list()
for tok_idx in range(len(all_vocab)):
    if len(tok_docs[tok_idx]) >= threshold:
        reduced_vocab.append(tok_idx)
        
print("Reduced vocabulary: {:,} (reduction {:%})"
      .format(len(reduced_vocab), (len(all_vocab) - len(reduced_vocab)) / len(all_vocab)))    

ydata = np.array(list(map(lambda c:city_pop[c], cities)))

# use more classes here to see if we can pick 'village'
ydata = cell7_adjudicate(ydata, segments_at[4])

feature_utility = list()

xdata = np.zeros( ydata.shape )
for pos, tok_idx in enumerate(reduced_vocab):
    verbose = False
    if pos % int(len(reduced_vocab) / 100) == 0:
        print("Computing M.I. for {:>6} out of {:>6} tokens, token \"{}\""
              .format(pos, len(reduced_vocab), all_vocab[tok_idx]))
        #verbose = True

    xdata[:] = 0
    for idx in tok_docs[tok_idx]:
        xdata[idx] = 1.0

    # compute confusion table
    table = dict()
    for row in range(xdata.shape[0]):
        feat_val = int(xdata[row])
        target_val = int(ydata[row])
        if feat_val not in table:
            table[feat_val] = dict()
        table[feat_val][target_val] = table[feat_val].get(target_val, 0) + 1

    feats = set()
    for row in table.values():
        feats.update(row.keys())
    cols = { val: sum(map(lambda x:x.get(val,0), table.values())) for val in feats }
    full_table = sum(cols.values())
    
    if verbose:
        print("\tTable:\n\t{}\n\tfull_table: {}\n\tCols: {}"
              .format(table, full_table, cols))
    
    best_utility = None
    for feat_val in table.keys():
        for target_val in table[feat_val].keys():
            # binarize
            n11 = table[feat_val][target_val]
            if n11 < 5:
                if verbose:
                    print("\tFor feat_val={}, target_val={}, n11={}, skipping"
                        .format(feat_val, target_val, n11))
                continue
            n10 = sum(table[feat_val].values()) - n11
            n01 = cols.get(target_val) - n11
            n00 = full_table - n11 - n10 - n01
            if n10 == 0 or n01 == 0 or n00 == 0:
                if verbose:
                    print("\tFor feat_val={}, target_val={}, n10={} or n01={} or n00={} is zero, skipping"
                        .format(feat_val, target_val, n10, n01, n00))
                continue
            n1_ = n11 + n10
            n0_ = n01 + n00
            n_1 = n11 + n01
            n_0 = n10 + n00
            n = float(full_table)
            utility = n11/n * math.log(n*n11/(n1_*n_1),2) + \
               n01 / n * math.log(n*n01/(n0_*n_1), 2) + \
               n10 / n * math.log(n*n10/(n1_*n_0), 2) + \
               n00 / n * math.log(n*n00/(n0_*n_0), 2)
            if best_utility is None or best_utility < utility:
                best_utility = utility
    if verbose:
        print("\tbest_utility: {}".format(best_utility))
    if best_utility is not None:
        feature_utility.append( (all_vocab[tok_idx], best_utility) )
all_vocab = None # free memory
    
feature_utility = sorted(feature_utility, key=lambda x:x[1], reverse=True)

PARAM_KEEP_TOP = 1000
with open("ch8_cell10_vocab.tsv", "w") as kept:
    for row in feature_utility[:PARAM_KEEP_TOP]:
        kept.write("{}\t{}\n".format(*row))
        
table1 = ("<table><tr><th>Position</th><th>Stem</th><th>Utility</th></tr>" +
            "\n".join(list(map(lambda r: 
                               "<tr><td>{}</td><td>{}</td><td>{:5.10f}</td></tr>".format(
                        r[0], r[1][0], r[1][1]), 
                               enumerate(feature_utility[:100])))) +"</table>")
table2 = ("<table><tr><th>Position</th><th>Stem</th><th>Utility</th></tr>" +
            "\n".join(list(map(lambda r: 
                               "<tr><td>{}</td><td>{}</td><td>{:5.10f}</td></tr>".format(
                        r[0], r[1][0], r[1][1]), 
                               enumerate(reversed(feature_utility[-100:]))))) +"</table>")

with open("ch8_cell10_dev_tokens.tsv", "w") as kept:
    kept.write("name\t" + "\t".join(map(lambda x:"token=" + x[0],feature_utility[:PARAM_KEEP_TOP]))+"\n")
    matrix = np.zeros( (ydata.shape[0], PARAM_KEEP_TOP) )
    for idx_tok, row in enumerate(feature_utility[:PARAM_KEEP_TOP]):
        tok = row[0]
        for idx_doc in tok_docs[vocab_to_idx[tok]]:
            matrix[idx_doc, idx_tok] = 1.0
    for idx_doc in range(matrix.shape[0]):
        kept.write(cities[idx_doc] + "\t" + "\t".join(map(str,matrix[idx_doc,:])) +"\n")
matrix       = None
tok_docs     = None
vocab_to_idx = None

from IPython.display import HTML, display
display(HTML("<h3>Top 100 tokens by MI</h3>" + table1 + 
             "<h3>Last 100 tokens by MI</h3>" + table2))
Tokenizing     0 out of 44959 cities, city "<http://dbpedia.org/resource/Ankara>"
Tokenizing  4495 out of 44959 cities, city "<http://dbpedia.org/resource/Gonzales,_Louisiana>"
Tokenizing  8990 out of 44959 cities, city "<http://dbpedia.org/resource/Laurel_Bay,_South_Carolina>"
Tokenizing 13485 out of 44959 cities, city "<http://dbpedia.org/resource/Nysa,_Poland>"
Tokenizing 17980 out of 44959 cities, city "<http://dbpedia.org/resource/Vilathikulam>"
Tokenizing 22475 out of 44959 cities, city "<http://dbpedia.org/resource/Arroyo_Seco,_Santa_Fe>"
Tokenizing 26970 out of 44959 cities, city "<http://dbpedia.org/resource/Fatehpur,_Barabanki>"
Tokenizing 31465 out of 44959 cities, city "<http://dbpedia.org/resource/Kirchheim_am_Neckar>"
Tokenizing 35960 out of 44959 cities, city "<http://dbpedia.org/resource/Pirching_am_Traubenberg>"
Tokenizing 40455 out of 44959 cities, city "<http://dbpedia.org/resource/Scone,_Perth_and_Kinross>"
Tokenizing 44950 out of 44959 cities, city "<http://dbpedia.org/resource/Babatorun>"
Total vocabulary: 356,591
Reduced vocabulary: 4,757 (reduction 98.665979%)
Computing M.I. for      0 out of   4757 tokens, token ","
Computing M.I. for     47 out of   4757 tokens, token "situat"
Computing M.I. for     94 out of   4757 tokens, token "rome"
Computing M.I. for    141 out of   4757 tokens, token "bc"
Computing M.I. for    188 out of   4757 tokens, token "hand"
Computing M.I. for    235 out of   4757 tokens, token "element"
Computing M.I. for    282 out of   4757 tokens, token "reason"
Computing M.I. for    329 out of   4757 tokens, token "use"
Computing M.I. for    376 out of   4757 tokens, token "unknown"
Computing M.I. for    423 out of   4757 tokens, token "treatment"
Computing M.I. for    470 out of   4757 tokens, token "seen"
Computing M.I. for    517 out of   4757 tokens, token "cathol"
Computing M.I. for    564 out of   4757 tokens, token "leav"
Computing M.I. for    611 out of   4757 tokens, token "28"
Computing M.I. for    658 out of   4757 tokens, token "fair"
Computing M.I. for    705 out of   4757 tokens, token "root"
Computing M.I. for    752 out of   4757 tokens, token "parti"
Computing M.I. for    799 out of   4757 tokens, token "alleg"
Computing M.I. for    846 out of   4757 tokens, token "restor"
Computing M.I. for    893 out of   4757 tokens, token "plane"
Computing M.I. for    940 out of   4757 tokens, token "sampl"
Computing M.I. for    987 out of   4757 tokens, token "mini"
Computing M.I. for   1034 out of   4757 tokens, token "ad"
Computing M.I. for   1081 out of   4757 tokens, token "director"
Computing M.I. for   1128 out of   4757 tokens, token "basketbal"
Computing M.I. for   1175 out of   4757 tokens, token "indic"
Computing M.I. for   1222 out of   4757 tokens, token "cotton"
Computing M.I. for   1269 out of   4757 tokens, token "philip"
Computing M.I. for   1316 out of   4757 tokens, token "spain"
Computing M.I. for   1363 out of   4757 tokens, token "seasid"
Computing M.I. for   1410 out of   4757 tokens, token "automobil"
Computing M.I. for   1457 out of   4757 tokens, token "eventu"
Computing M.I. for   1504 out of   4757 tokens, token "mere"
Computing M.I. for   1551 out of   4757 tokens, token "40"
Computing M.I. for   1598 out of   4757 tokens, token "say"
Computing M.I. for   1645 out of   4757 tokens, token "specul"
Computing M.I. for   1692 out of   4757 tokens, token "door"
Computing M.I. for   1739 out of   4757 tokens, token "extra"
Computing M.I. for   1786 out of   4757 tokens, token "creat"
Computing M.I. for   1833 out of   4757 tokens, token "ex"
Computing M.I. for   1880 out of   4757 tokens, token "escap"
Computing M.I. for   1927 out of   4757 tokens, token "deleg"
Computing M.I. for   1974 out of   4757 tokens, token "buse"
Computing M.I. for   2021 out of   4757 tokens, token "concept"
Computing M.I. for   2068 out of   4757 tokens, token "ss"
Computing M.I. for   2115 out of   4757 tokens, token "slope"
Computing M.I. for   2162 out of   4757 tokens, token "enact"
Computing M.I. for   2209 out of   4757 tokens, token "eros"
Computing M.I. for   2256 out of   4757 tokens, token "parish"
Computing M.I. for   2303 out of   4757 tokens, token "split"
Computing M.I. for   2350 out of   4757 tokens, token "13th"
Computing M.I. for   2397 out of   4757 tokens, token "cultiv"
Computing M.I. for   2444 out of   4757 tokens, token "movi"
Computing M.I. for   2491 out of   4757 tokens, token "news"
Computing M.I. for   2538 out of   4757 tokens, token "polic"
Computing M.I. for   2585 out of   4757 tokens, token "medic"
Computing M.I. for   2632 out of   4757 tokens, token "liber"
Computing M.I. for   2679 out of   4757 tokens, token "commod"
Computing M.I. for   2726 out of   4757 tokens, token "sunday"
Computing M.I. for   2773 out of   4757 tokens, token "instead"
Computing M.I. for   2820 out of   4757 tokens, token "oliv"
Computing M.I. for   2867 out of   4757 tokens, token "mascot"
Computing M.I. for   2914 out of   4757 tokens, token "ciudad"
Computing M.I. for   2961 out of   4757 tokens, token "pan"
Computing M.I. for   3008 out of   4757 tokens, token "tire"
Computing M.I. for   3055 out of   4757 tokens, token "von"
Computing M.I. for   3102 out of   4757 tokens, token "duke"
Computing M.I. for   3149 out of   4757 tokens, token "vital"
Computing M.I. for   3196 out of   4757 tokens, token "disrupt"
Computing M.I. for   3243 out of   4757 tokens, token "notr"
Computing M.I. for   3290 out of   4757 tokens, token "octagon"
Computing M.I. for   3337 out of   4757 tokens, token "irish"
Computing M.I. for   3384 out of   4757 tokens, token "sam"
Computing M.I. for   3431 out of   4757 tokens, token "leed"
Computing M.I. for   3478 out of   4757 tokens, token "rescu"
Computing M.I. for   3525 out of   4757 tokens, token "preschool"
Computing M.I. for   3572 out of   4757 tokens, token "96"
Computing M.I. for   3619 out of   4757 tokens, token "willow"
Computing M.I. for   3666 out of   4757 tokens, token "provision"
Computing M.I. for   3713 out of   4757 tokens, token "cabin"
Computing M.I. for   3760 out of   4757 tokens, token "cruz"
Computing M.I. for   3807 out of   4757 tokens, token "haut"
Computing M.I. for   3854 out of   4757 tokens, token "adequ"
Computing M.I. for   3901 out of   4757 tokens, token "daughter"
Computing M.I. for   3948 out of   4757 tokens, token "spectat"
Computing M.I. for   3995 out of   4757 tokens, token "toy"
Computing M.I. for   4042 out of   4757 tokens, token "monro"
Computing M.I. for   4089 out of   4757 tokens, token "superintend"
Computing M.I. for   4136 out of   4757 tokens, token "bomber"
Computing M.I. for   4183 out of   4757 tokens, token "porch"
Computing M.I. for   4230 out of   4757 tokens, token "eve"
Computing M.I. for   4277 out of   4757 tokens, token "leonard"
Computing M.I. for   4324 out of   4757 tokens, token "confin"
Computing M.I. for   4371 out of   4757 tokens, token "mango"
Computing M.I. for   4418 out of   4757 tokens, token "strait"
Computing M.I. for   4465 out of   4757 tokens, token "org"
Computing M.I. for   4512 out of   4757 tokens, token "albanian"
Computing M.I. for   4559 out of   4757 tokens, token "pike"
Computing M.I. for   4606 out of   4757 tokens, token "sat"
Computing M.I. for   4653 out of   4757 tokens, token "liaison"
Computing M.I. for   4700 out of   4757 tokens, token "chile"
Computing M.I. for   4747 out of   4757 tokens, token "corregimiento"

Top 100 tokens by MI

PositionStemUtility
0TOKNUMSEG300.0537331244
1citi0.0513839709
2capit0.0482481137
3airport0.0445746900
4temperatur0.0420718228
5climat0.0420551462
6univers0.0404202701
7intern0.0398616893
8TOKNUMSEG290.0371992907
9urban0.0370085142
10import0.0366114102
11largest0.0357752559
12TOKNUMSEG180.0356157632
13TOKNUMSEG190.0349490764
14TOKNUMSEG200.0346995641
15china0.0338219118
16institut0.0322952533
17major0.0318239374
18TOKNUMSEG220.0318067838
19TOKNUMSEG140.0317591474
20industri0.0316884633
21govern0.0315258492
22fachhochschul0.0310251824
23TOKNUMSEG160.0307546347
24cultur0.0306875697
25TOKNUMSEG170.0304567335
26hub0.0298875857
27TOKNUMSEG210.0298263176
28svp0.0294803771
29connect0.0291452869
30level0.0286748120
31headquart0.0285513073
32chines0.0282720126
33ppen0.0279834306
34TOKNUMSEG70.0278148011
35stadium0.0272838925
36transport0.0272540607
37TOKNUMSEG120.0267550674
38also0.0266796079
39TOKNUMSEG280.0264290018
40period0.0264229278
41well0.0260936714
42TOKNUMSEG150.0260672592
43among0.0260045697
44mandatori0.0259701499
45mani0.0255557024
46million0.0255028040
47rainfal0.0254292935
48due0.0252655026
49TOKNUMSEG80.0251558691
50countri0.0251052745
51fdp0.0248824550
52trade0.0248340920
53econom0.0245166730
54product0.0244509987
55like0.0243475120
56month0.0243199686
57technolog0.0242229978
58divid0.0241811798
59agnost0.0241221597
60dri0.0239772716
61dynasti0.0239366088
62scienc0.0238046033
63winter0.0237821004
64one0.0237646216
65billion0.0237496517
66nation0.0236955184
67influenc0.0236604828
68TOKNUMSEG100.0236347021
69main0.0235435461
70modern0.0234926280
71TOKNUMSEG130.0233591576
72famous0.0232191202
73develop0.0230990125
74hot0.0230443182
75administr0.0229519698
76asia0.0228906767
77number0.0228499735
78base0.0227920955
79air0.0226375177
80foreign0.0226302182
81monsoon0.0225439540
82TOKNUMSEG110.0225388462
83flight0.0225015609
84provinc0.0224542477
85prefectur0.0224296865
86humid0.0223838186
87TOKNUMSEG310.0223162477
88season0.0221270289
89switzerland0.0219480403
90annual0.0217980806
91atheist0.0217704899
92TOKNUMSEG90.0216862653
93domest0.0216463631
94economi0.0216171387
95teenag0.0215342943
96relat0.0215287150
97unproduct0.0214950798
98rule0.0214693340
99swiss0.0213665068

Last 100 tokens by MI

PositionStemUtility
0nord0.0000358565
1blacksmith0.0000368659
2trout0.0000408468
3nave0.0000412574
4fork0.0000425422
5gaelic0.0000451307
6marion0.0000505955
7lordship0.0000524189
8dakota0.0000524357
9wroc0.0000531035
10townsit0.0000536172
11serb0.0000544652
12monro0.0000562435
13aisl0.0000563639
14minnesota0.0000564375
15oregon0.0000594520
16gro0.0000624232
17carolina0.0000627931
18texa0.0000628176
19utah0.0000634911
20albani0.0000647898
21wesleyan0.0000655043
22virginia0.0000657541
23provenc0.0000668552
24proprietor0.0000688756
25tennesse0.0000707456
26cumberland0.0000728798
27plough0.0000734097
28rhein0.0000740731
29welsh0.0000754198
30shaft0.0000755003
31somerset0.0000766462
32earl0.0000773754
33windmil0.0000775958
34kentucki0.0000778309
35alabama0.0000781721
36fief0.0000796384
37norway0.0000805684
38mississippi0.0000831047
39newport0.0000841265
40bohemian0.0000843630
41oxford0.0000845648
42mascot0.0000852984
43teau0.0000853105
44lancast0.0000856941
45comarca0.0000865814
46kansa0.0000912376
47footpath0.0000912545
48salmon0.0000917557
49tanneri0.0000936350
50butcher0.0000947079
51dublin0.0000957067
52rev0.0000961479
53graveyard0.0000972356
54bundesstra0.0000976962
55newcastl0.0000990926
56butler0.0000991996
57lumber0.0000995749
58bailey0.0000996819
59serbian0.0001009010
60csb0.0001009558
61georgia0.0001010745
62der0.0001020411
63surveyor0.0001026748
64pomerania0.0001038291
65res0.0001039333
66vermont0.0001039356
67arkansa0.0001046994
68beaver0.0001048955
69leonard0.0001052659
70benedictin0.0001053385
71krak0.0001058145
72holland0.0001064828
73pittsburgh0.0001079832
74cheroke0.0001085113
75burlington0.0001087504
76baroni0.0001088929
77missouri0.0001099265
78inn0.0001103388
79sawmil0.0001105784
80disus0.0001107899
81dale0.0001108712
82grist0.0001122458
83priori0.0001136469
84pike0.0001142573
85shire0.0001146439
86silesia0.0001152157
87suffolk0.0001153128
88normandi0.0001153268
89reverend0.0001156959
90gristmil0.0001162304
91croat0.0001163792
92perri0.0001168874
93bend0.0001170574
94farmhous0.0001176658
95cincinnati0.0001177525
96nors0.0001185797
97pub0.0001199626
98schleswig0.0001203447
99appalachian0.0001204434

Tokenization now takes much more time. It is the reason that NLP is usually done in batch using multiple machines, using frameworks such as Apache UIMA [LINK] or Spark NLP. Adding NER will make it much more slow.

Also, as I add more complexity, the results are more difficult to understand (what type of tokens the stem 'civil' captures? 'civilization'? 'civilized'? intriguing.)

'village' is still not picked (top 1647). (Expanding to 2000 terms was tried but didn't help).

I'll now re-do training the training in Cell #11.

In [11]:
# CELL 11
import bz2
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# read base features
rand = random.Random(42)
all_data = list()
city_to_all_data = dict()
header = None
with open("ch8_cell1_dev_textlen.tsv") as f:
    header = next(f)
    header = header.strip().split("\t")
    header.pop(0) # name
    header.pop() # population
    for line in f:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        feats = list(map(float,fields[1:-1]))
        city_to_all_data[name] = len(all_data)
        all_data.append( (feats, logpop, name) )
                
# add text features
with open("ch8_cell10_dev_tokens.tsv") as feats:
    extra_header = next(feats)
    extra_header = extra_header.strip().split("\t")
    extra_header.pop(0) # name
    header.extend(extra_header)
    for line in feats:
        fields = line.strip().split("\t")
        name = fields[0]
        all_data[city_to_all_data[name]][0].extend(list(map(float,fields[1:])))
        
with open("ch8_cell11_dev_feat3.tsv", "w") as feats:
    extheader = header.copy()
    extheader.insert(0, 'name')
    extheader.append('logpop')
    feats.write("\t".join(extheader) + "\n")
    for row in all_data:
        feats.write("{}\t{}\t{}\n".format(row[-1], "\t".join(map(str,row[0])), row[1]))
    
# split
train_data = list()
test_data  = list()
for row in all_data:
    if rand.random() < 0.2:
        test_data.append(row) 
    else:
        train_data.append(row)

test_data  = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest  = np.array(list(map(lambda t:t[0], test_data)))
ytest  = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data  = None

# train
print("Training on {:,} cities".format(len(xtrain)))

rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(xtrain, ytrain)
ytest_pred = rf.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

xtrain = None
xtest  = None

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest,      label="actual",    color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell11_rf_feat3.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Training on 35,971 cities
RMSE 0.3267168745861243
Out[11]:
<matplotlib.legend.Legend at 0x7fcf470c7110>

At RMSE 0.3267, we're getting closer to the best performance of Chapter 6.

And now for the error analysis as before (Cell #12).

In [12]:
# CELL 12
import bz2
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# read base features
rand = random.Random(42)
base_data         = list()
city_to_base_data = dict()
base_header = None
with open("ch8_cell1_dev_textlen.tsv") as f:
    base_header = next(f)
    base_header = base_header.strip().split("\t")
    base_header.pop(0) # name
    base_header.pop() # population
    for line in f:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        feats = list(map(float,fields[1:-1]))
        city_to_base_data[name] = len(base_data)
        base_data.append( (feats, logpop, name) )
                
# read text features
mi_data         = list()
city_to_mi_data = dict()
mi_header = None
with open("ch8_cell11_dev_feat3.tsv") as mi:
    mi_header = next(mi)
    mi_header = mi_header.strip().split("\t")
    mi_header.pop(0) # name
    mi_header.pop() # population
    for line in mi:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        feats = list(map(float,fields[1:-1]))
        city_to_mi_data[name] = len(mi_data)
        mi_data.append( (feats, logpop, name) )
        
# split
base_train_data = list()
base_test_data  = list()
mi_train_data   = list()
mi_test_data    = list()
for row in base_data:
    if rand.random() < 0.2:
        base_test_data.append(row)
        mi_test_data.append(mi_data[city_to_mi_data[row[-1]]])
    else:
        base_train_data.append(row)
        mi_train_data.append(mi_data[city_to_mi_data[row[-1]]])
base_data = None
mi_data   = None

base_test_data = sorted(base_test_data, key=lambda t:t[1])
mi_test_data   = sorted(mi_test_data, key=lambda t:t[1])
test_names     = list(map(lambda t:t[2], base_test_data))

base_xtrain = np.array(list(map(lambda t:t[0], base_train_data)))
ytrain      = np.array(list(map(lambda t:t[1], base_train_data)))
base_xtest  = np.array(list(map(lambda t:t[0], base_test_data)))
ytest       = np.array(list(map(lambda t:t[1], base_test_data)))
base_train_data = None
base_test_data  = None


mi_xtrain = np.array(list(map(lambda t:t[0], mi_train_data)))
mi_xtest  = np.array(list(map(lambda t:t[0], mi_test_data)))
mi_train_data = None
mi_test_data  = None

# train
print("Base training on {:,} cities".format(len(base_xtrain)))

rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(base_xtrain, ytrain)
base_ytest_pred = rf.predict(base_xtest)
base_se = (base_ytest_pred - ytest)**2

print("M.I. training on {:,} cities".format(len(mi_xtrain)))
rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(mi_xtrain, ytrain)
mi_ytest_pred = rf.predict(mi_xtest)
mi_se = (mi_ytest_pred - ytest)**2

# find the bigger winners and losers
se_ytest_diff = base_se - mi_se # small is better, it's error
named_se = list()
for idx in range(se_ytest_diff.shape[0]):
    named_se.append( (se_ytest_diff[idx], test_names[idx], idx) )

named_se = sorted(named_se, key=lambda x:x[0], reverse=True)

to_print = dict()
for idx, winner in enumerate(named_se[:10]):
    to_print[winner[1]] = { 
        'improv' : winner[0], 
        'base': int(round(10**base_ytest_pred[winner[2]])),
        'mi': int(round(10**mi_ytest_pred[winner[2]])),
        'pop': int(round(10**ytest[winner[2]])),
        'type': 'winner',
        'pos': idx }
for idx, loser in enumerate(named_se[-10:]):
    to_print[loser[1]] = { 
        'improv' : loser[0], 
        'base': int(round(10**base_ytest_pred[loser[2]])),
        'mi': int(round(10**mi_ytest_pred[loser[2]])),
        'pop': int(round(10**ytest[loser[2]])),
        'type': 'loser',
        'pos': (9-idx)}
    
kept_terms = set(map(lambda l:l.split('\t')[0], open("ch8_cell10_vocab.tsv").readlines()))

base_xtrain = None
base_xtest  = None
mi_xtrain   = None
mi_xtest    = None
                 
htmls = [""] * 20
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
    for byteline in wikitext:
        cityline = byteline.decode("utf-8")
        tab = cityline.index('\t')
        name = cityline[:tab]
        if name in to_print:
            text = cityline[tab:]
            tokens = list(filter(lambda tok: tok in kept_terms, cell10_tokenize(text)))
            text = text.replace('\t','<p>')
            entry = to_print[name]
            this_html = ("<h1>Top {} {}: {}</h1>"+
                     "<h2>Change: {:1.5} (base: {:,}, MI: {:,}). Population: {:,}. Text length: {:,}</h2>{}"+
                     "<h2>Tokens (length {:,})</h2>{}") \
                      .format((entry['pos'] + 1), entry['type'], name[1:-1], entry['improv'], entry['base'], 
                              entry['mi'], entry['pop'], len(text), text[:1000], len(tokens), tokens[:100])
            if entry['type'] == 'winner':
                htmls[entry['pos']] = this_html
            else:
                htmls[10+entry['pos']] = this_html
html = "".join(htmls)
from IPython.display import HTML, display
display(HTML(html))
Base training on 35,971 cities
M.I. training on 35,971 cities

Top 1 winner: http://dbpedia.org/resource/Sharjah

Change: 2.7373 (base: 12,693, MI: 88,784). Population: 1,400,000. Text length: 12,841

Sharjah

Sharjah Sharjah (; "") is the third largest and third most populous city in the United Arab Emirates, forming part of the Dubai-Sharjah-Ajman metropolitan area. It is located along the southern coast of the Persian Gulf on the Arabian Peninsula. Sharjah is the capital of the emirate of Sharjah. Sharjah shares legal, political, military and economic functions with the other emirates of the UAE within a federal framework, although each emirate has jurisdiction over some functions such as civil law enforcement and provision and upkeep of local facilities. Sharjah has been ruled by the Al Qasimi dynasty since the 18th century. The city is a centre for culture and industry, and alone contributes 7.4% of the GDP of the United Arab Emirates. The city covers an approximate area of 235 km² and has a population of over 800,000 (2008). The sale or consumption of alcoholic beverages is prohibited in the emirate of Sharjah without possession of an alcohol licence and alcohol is not s

Tokens (length 750)

['third', 'largest', 'third', 'popul', 'citi', 'arab', 'form', 'part', 'metropolitan', 'area', 'locat', 'along', 'southern', 'coast', 'persian', 'capit', 'share', 'polit', 'militari', 'econom', 'function', 'within', 'although', 'jurisdict', 'function', 'civil', 'law', 'local', 'facil', 'rule', 'al', 'dynasti', 'sinc', 'centuri', 'citi', 'centr', 'cultur', 'industri', 'contribut', 'gdp', 'arab', 'citi', 'cover', 'approxim', 'area', 'TOKNUMSMALL', 'popul', 'TOKNUMSEG31', 'TOKNUMSEG6', 'without', 'serv', 'hotel', 'restaur', 'due', 'muslim', 'major', 'area', 'help', 'increas', 'number', 'islam', 'tourist', 'visit', 'countri', 'offici', 'name', 'citi', 'third', 'largest', 'citi', 'arab', 'palac', 'ruler', 'high', 'sultan', 'al', 'locat', 'citi', 'citi', 'persian', 'popul', 'TOKNUMSEG31', 'TOKNUMSEG6', 'contain', 'main', 'administr', 'commerci', 'centr', 'cultur', 'tradit', 'project', 'includ', 'sever', 'museum', 'cover', 'area', 'archaeolog', 'natur', 'histori', 'scienc']

Top 2 winner: http://dbpedia.org/resource/Ganzhou

Change: 2.6664 (base: 179,493, MI: 3,796,002). Population: 8,368,447. Text length: 5,851

Ganzhou

Ganzhou Ganzhou (), formerly romanized as Kanchow, is a prefecture-level city in southern Jiangxi, China, bordering Fujian to the east, Guangdong to the south, and Hunan to the west. Its administrative seat is at Zhanggong District. Its population was 8,361,447 at the 2010 census whom 1,977,253 in the built-up (or "metro") area made of Zhanggong and Nankang, and Ganxian largely being urbanized. In 201, Emperor Gaozu of Han established a county in the territory of modern Ganzhou. In those early years, Han Chinese settlement and authority in the area was minimal and largely restricted to the Gan River basin. The river, a tributary of the Yangtze via Poyang Lake, provided a route of communication from the north as well as irrigation for rice farming. During the Sui dynasty, the county administration was promoted to prefecture status and the area called Qianzhou (). During the Song, immigration from the north bolstered the local population and drove local aboriginal tribes f

Tokens (length 326)

[',', 'former', 'prefectur', 'level', 'citi', 'southern', 'china', 'border', 'east', 'south', 'west', 'administr', 'popul', 'TOKNUMSEG31', 'TOKNUMSEG6', 'TOKNUMSEG31', 'built', 'metro', 'area', 'larg', 'urban', 'TOKNUMSMALL', 'emperor', 'han', 'establish', 'territori', 'modern', 'earli', 'han', 'chines', 'author', 'area', 'larg', 'river', 'river', 'via', 'provid', 'rout', 'communic', 'north', 'well', 'rice', 'dynasti', 'administr', 'promot', 'prefectur', 'status', 'area', 'call', 'immigr', 'north', 'local', 'popul', 'local', 'tribe', 'fall', 'capit', 'TOKNUMSEG1', 'immigr', 'increas', 'provinc', 's', 'name', 'offici', 'chang', 'southern', 'TOKNUMSEG1', 'TOKNUMSEG2', 'late', 'open', 'one', 'southern', 'treati', 'port', 'becam', 'minor', 'base', 'foreign', 'compani', 'TOKNUMSEG6', 'TOKNUMSEG6', 'form', 'part', 'one', 'base', 'parti', 'china', 'due', 'proxim', 'red', 'capit', 'number', 'campaign', 'TOKNUMSEG6', 'TOKNUMSEG6', 'appoint', 'govern', 'republ', 'china', 'prefectur']

Top 3 winner: http://dbpedia.org/resource/Apodaca

Change: 2.3964 (base: 9,805, MI: 89,613). Population: 523,370. Text length: 1,525

Apodaca

Apodaca Apodaca () is a city and its surrounding municipality that is part of Monterrey Metropolitan area. It lies in the northeastern part of the metropolitan area. As of the 2005 census, the city had a population of 393,195 and the municipality had a population of 418,784. The municipality has an area of 183.5 km. The fourth-largest city in the state (behind Monterrey, Guadalupe, and San Nicolás de los Garza), Apodaca is one of the fastest-growing cities in Nuevo León and an important industrial center. Two airports, General Mariano Escobedo International Airport (IATA: MTY) and Del Norte International Airport (IATA: NTR), are located in Apodaca. VivaAerobus Airline and Grupo Aeroportuario Centro Norte have their corporate headquarters on the grounds of Escobedo Airport. The municipality of Apodaca is one of the major industrial centers of the state of Nuevo León. Apodaca's economy is founded basically in manufacturing operations and services. American companies such a

Tokens (length 83)

['citi', 'surround', 'part', 'metropolitan', 'area', 'lie', 'part', 'metropolitan', 'area', 'TOKNUMSEG6', 'citi', 'popul', 'TOKNUMSEG31', 'popul', 'TOKNUMSEG31', 'area', 'TOKNUMSMALL', 'fourth', 'largest', 'citi', 'behind', 's', 'de', ',', 'one', 'fastest', 'grow', 'citi', 'n', 'import', 'industri', 'center', 'airport', 'general', 'intern', 'airport', 'intern', 'airport', ',', 'locat', 'airlin', 'corpor', 'headquart', 'ground', 'airport', 'one', 'major', 'industri', 'center', 'n', 's', 'economi', 'found', 'manufactur', 'oper', 'servic', 'compani', 'general', 'electr', 'industri', 'compani', 'among', 'mani', 'other', 'manufactur', 'oper', 'japanes', 'compani', 'korean', 'compani', 'chines', 'compani', 'compani', 'also', 'manufactur', 'facil', 'countri', 'club', 'cours', 'locat', 'citi', 'name', 'citi']

Top 4 winner: http://dbpedia.org/resource/Az_Zubayr

Change: 2.3887 (base: 6,853, MI: 61,059). Population: 370,000. Text length: 1,634

Az Zubayr

Az Zubayr Az Zubayr () is a town in Basra Governorate in Iraq, just south of Basra. The name can also refer to the old Emirate of Zubair. The name is also sometimes written Az Zubair, Zubair, Zoubair, El Zubair, or Zobier. The city was named al-Zubair because one of the Sahaba (companions) of the Prophet Muhammad, Zubayr ibn al-Awwam, was buried there. During the Ottoman times, the city was a self-ruling emirate ruled by an Emir (or Sheikh) from Najdi families, such as Al Zuhair, Al Meshary, and Al Ibrahim families. Like other Emirates under the Ottoman Empire, the Emirate of Zubair used to pay dues and receive protection from the Ottomans. In the 19th century, the city of Zubair witnessed relatively large migrations from Najd. Up until the 1970s and 1980s, the town was predominantly populated by people of Najdi origin. Now only a few families remain of the old inhabitants. Most of them moved back to their homeland of Najd and other regions of Saudi Arabia and to Kuwai

Tokens (length 82)

['town', 'governor', 'south', 'name', 'can', 'also', 'refer', 'old', 'name', 'also', 'sometim', 'el', 'citi', 'name', 'al', 'one', 'muhammad', 'al', 'time', 'citi', 'rule', 'rule', 'al', 'al', 'al', 'like', 'empir', 'use', 'due', 'receiv', 'protect', '19th', 'centuri', 'citi', 'relat', 'larg', 'migrat', '1970s', '1980s', 'town', 'predomin', 'popul', 'origin', 'now', 'remain', 'old', 'inhabit', 'move', 'back', 'region', 'period', 'inhabit', 'citi', 'domin', 'sunni', 'islam', 'domin', 'nearbi', 'TOKNUMSEG6', 'citi', 'popul', 'around', 'TOKNUMSEG30', 'grown', 'metropolitan', 'area', 'near', 'million', 'inhabit', 'current', 'major', 'still', 'home', 'larg', 'sunni', 'minor', 'howev', 'face', 'mani', 'countri', 'sunni', 'area']

Top 5 winner: http://dbpedia.org/resource/Farah,_Afghanistan

Change: 2.1105 (base: 10,479, MI: 67,065). Population: 540,000. Text length: 5,858

Farah, Afghanistan

Farah, Afghanistan Farah (Pashto/Dari Persian: فراه) is the capital of Farah Province, located in western Afghanistan. It has a population of about 540000, and is mainly ethnic Pashtun people. It is about the 16th-largest city of the country in terms of population. The Farah Airport is located in the area. Farah is located in western Afghanistan, close to Herat and Iran, although it lacks a direct road connection with the latter. Farah has a very clear grid of roads distributed through the higher density residential areas. However barren land (35%) and vacant plots (25%) are the largest land uses and combine for 60% of total land use. The Citadel at Farah is probably one of a series of fortresses constructed by Alexander the Great, the city being an intermediate stop between Herat, the location of another of Alexander's fortresses, and Kandahar. The ‘Alexandria’ prefix was added to the city’s name when Alexander came in 330 BC. Under the Parthian Empire, Farah

Tokens (length 321)

['persian', 'capit', 'provinc', 'locat', 'western', 'popul', 'TOKNUMSEG31', 'main', 'ethnic', 'largest', 'citi', 'countri', 'term', 'popul', 'airport', 'locat', 'area', 'locat', 'western', 'close', 'although', 'lack', 'direct', 'road', 'connect', 'latter', 'road', 'higher', 'residenti', 'area', 'howev', 'largest', 'use', 'combin', 'use', 'one', 'seri', 'construct', 'great', 'citi', 'locat', 'anoth', 's', 'ad', 'citi', 's', 'name', 'came', 'TOKNUMSMALL', 'bc', 'empir', 'one', 'citi', 'centuri', 'ad', 'centuri', 'one', 'major', 'eastern', 'empir', 'region', 'histor', 'part', 'provinc', 'control', 'follow', 'earli', 'centuri', 'becam', 'part', 'dynasti', 'follow', 'empir', 'islam', 'introduc', 'region', 'centuri', 'later', 'dynasti', 'took', 'control', 'centuri', 'took', 'citi', 'follow', 'centuri', 'khan', 'armi', 'pass', 'centuri', 'citi', 'control', 'TOKNUMSEG4', 'defeat', 'forc', 'becam', 'part', 'empir', 'mid', 'centuri']

Top 6 winner: http://dbpedia.org/resource/Dehiwala-Mount_Lavinia

Change: 1.9814 (base: 5,748, MI: 36,834). Population: 245,974. Text length: 670

Dehiwala-Mount Lavinia

Dehiwala-Mount Lavinia Dehiwala-Mount Lavinia ( "Dehiwala-Galkissa", ), population 245,974 (2012) is the second largest municipality of Sri Lanka after capital Colombo (List of cities in Sri Lanka). It is situated immediately south of the Colombo Municipality. It is a combination of certain key urban suburbs and communities combined for administrative purposes. It is home to Sri Lanka's National Zoological Gardens, which remains one of Asia's largest. Dehiwala and Mount Lavinia lie along the Galle Road artery, which runs along the coast to the south of the country. Dehiwala-Mount Lavinia Municipality comprises the following areas.

Tokens (length 36)

[',', ',', 'popul', 'TOKNUMSEG30', 'TOKNUMSEG6', 'second', 'largest', 'capit', 'citi', 'situat', 'south', 'combin', 'urban', 'combin', 'administr', 'purpos', 'home', 's', 'nation', 'garden', 'remain', 'one', 'asia', 's', 'largest', 'lie', 'along', 'road', 'run', 'along', 'coast', 'south', 'countri', 'compris', 'follow', 'area']

Top 7 winner: http://dbpedia.org/resource/Qods,_Iran

Change: 1.9477 (base: 6,405, MI: 47,538). Population: 229,354. Text length: 618

Qods, Iran

Qods, Iran Qods (, also known as Shahr-e Qods, meaning "City of Qods"; formerly, Karaj, Qal‘eh Hasan, and Qal‘eh-ye Ḩasan Khān) is a city in and the capital of Qods County, Tehran Province, Iran. At the 2006 census, its population was 229,354, in 60,331 families. Before Qods officially became a municipality in 1989 it was named Qal‘eh Hasan. The Persian Gulf Pro League team Paykan plays in the city at Shahre Qods Stadium. The city has three universities: Islamic Azad University, Shahr-eQods Branch, University of Applied Science and Technology, Shahr-e-Qods Branch and Payam-e-Nour university.

Tokens (length 38)

[',', 'also', 'known', 'e', 'mean', 'citi', 'former', 'n', 'citi', 'capit', 'provinc', 'TOKNUMSEG6', 'popul', 'TOKNUMSEG30', 'TOKNUMSEG28', 'offici', 'becam', 'TOKNUMSEG6', 'name', 'persian', 'leagu', 'team', 'play', 'citi', 'stadium', 'citi', 'three', 'univers', 'islam', 'univers', 'branch', 'univers', 'scienc', 'technolog', 'e', 'branch', 'e', 'univers']

Top 8 winner: http://dbpedia.org/resource/Masina,_Kinshasa

Change: 1.7305 (base: 8,701, MI: 34,464). Population: 485,167. Text length: 1,086

Masina, Kinshasa

Masina, Kinshasa Masina is a municipality ("commune") in the Tshangu district of Kinshasa, the capital city of the Democratic Republic of the Congo. It is bordered by the Pool Malebo in the north and "Boulevard Lumbumba" to the south. Masina shelters within it the "Marché de la Liberté "M’Zee Laurent-Désiré Kabila"”, one of the largest markets of Kinshasa, which was built under the presidency of Laurent-Désiré Kabila to repay the inhabitants of the district of Tshangu who had resisted the rebels in August 1998. The area is known by the nickname "Chine Populaire" ("People's China"). Masina, together with the communes of Ndjili and Kimbanseke, belong to the district of Tshangu 20 km east of central Kinshasa. Most of the municipality is occupied by a wetland bordering the Pool Malebo, which explains the low population density relative of the municipality. The urban area, stretching along Boulevard Lumumba, however, reaches population densities comparable to those

Tokens (length 42)

['capit', 'citi', 'republ', 'border', 'north', 'south', 'within', 'march', 'de', 'm', ',', 'one', 'largest', 'market', 'built', 'presid', 'inhabit', 'resist', 'august', 'TOKNUMSEG6', 'area', 'known', 'nicknam', 's', 'china', 'east', 'central', 'occupi', 'border', 'low', 'popul', 'relat', 'urban', 'area', 'along', 'howev', 'reach', 'popul', 'compar', 'heart', 'TOKNUMSEG28', 'inhabit']

Top 9 winner: http://dbpedia.org/resource/Fianarantsoa

Change: 1.7047 (base: 5,272, MI: 26,935). Population: 190,318. Text length: 1,815

Fianarantsoa

Fianarantsoa Fianarantsoa is a city (commune urbaine) in south central Madagascar. Fianarantsoa is the capital of Haute Matsiatra Region. It was built in the early 19th century by the Merina as the administrative capital for the newly conquered Betsileo kingdoms. It is at an average altitude of , and has a population of 190,318 (2013 estimate). Fianarantsoa means "Good education" in Malagasy. It is a cultural and intellectual center for the whole island. It is home to some of the oldest Protestant and Lutheran cathedrals on the island, the oldest theological seminary (also Lutheran), as well as the Roman Catholic Archdiocese of Fianarantsoa. The city of "good education" also boasts a university named after it and built in 1972. Fianarantsoa is considered to be the capital of wine in Madagascar, because of the presence of many wine industries in the city. Fianarantsoa has been known for its political activism and was one of the "hot spots" during the political crisis

Tokens (length 95)

['citi', 'south', 'central', 'capit', 'region', 'built', 'earli', '19th', 'centuri', 'administr', 'capit', 'newli', 'conquer', 'kingdom', ',', 'popul', 'TOKNUMSEG30', 'TOKNUMSEG6', 'estim', 'mean', 'good', 'educ', 'cultur', 'center', 'whole', 'home', 'oldest', 'protest', 'cathedr', 'oldest', 'also', ',', 'well', 'citi', 'good', 'educ', 'also', 'univers', 'name', 'built', 'TOKNUMSEG6', 'consid', 'capit', 'presenc', 'mani', 'industri', 'citi', 'known', 'polit', 'activ', 'one', 'hot', 'polit', 'TOKNUMSEG6', 'student', 'univers', 'reput', 'group', 'mayor', 'come', 'polit', 'parti', 'base', 'place', 'world', 'monument', 'fund', 'TOKNUMSEG6', 'TOKNUMSMALL', 'site', 'mani', 'build', 'old', 'town', 'relat', 'will', 'attract', 'fund', 'old', 'town', 'beauti', 'citi', 'c', 'railway', 'also', 'airport', 'intern', 'k', 'ppen', 'climat', 'classif', 'system', 'classifi', 'climat', 'subtrop']

Top 10 winner: http://dbpedia.org/resource/Mejicanos

Change: 1.6833 (base: 9,174, MI: 71,701). Population: 224,661. Text length: 2,299

Mejicanos

Mejicanos Mejicanos is a San Salvador suburb in the San Salvador department of El Salvador. Mejicanos is a city located in San Salvador, El Salvador. At the 2009 estimate it had 160,751 inhabitants. It has been characterized by its typical food "Yuca Frita con Merienda". It has a municipal market, where the local citizens can buy groceries, vegetables, dairy products, meat, pupusas, etc. Many of the things available in the local market are produced in the surrounding villages, like vegetables. It is located in a strategic point because is in the main route to other towns or municipalities, like Cuscatancingo, Mariona, San Ramon, San Salvador, etc. However it has-single lane roads, which accounts for the frequent traffic jams in the center of the city. It has always been characterized by being a disorganized city, like other cities in El Salvador. Even though it has a municipal market people use the streets to sell their products. Because the average altitude of the cit

Tokens (length 131)

['el', 'citi', 'locat', 'el', 'TOKNUMSEG6', 'estim', 'TOKNUMSEG30', 'inhabit', 'typic', 'food', 'market', 'local', 'citizen', 'can', 'veget', 'product', 'etc', 'mani', 'avail', 'local', 'market', 'produc', 'surround', 'like', 'veget', 'locat', 'strateg', 'point', 'main', 'rout', 'town', 'like', 'etc', 'howev', 'road', 'account', 'frequent', 'traffic', 'center', 'citi', 'citi', 'like', 'citi', 'el', 'even', 'though', 'market', 'use', 'street', 'product', 'citi', 'around', 'TOKNUMSMALL', 'meter', 'sea', 'level', 'climat', 'typic', 'warm', 'although', 'local', 'citizen', 'use', 'februari', 'TOKNUMSEG6', 'mani', 'bus', 'rout', 'go', 'though', 'TOKNUMSEG3', 'spanish', 'conquer', 'territori', 'n', 'now', 'el', 'territori', 'arriv', 'conquer', 'north', 'capit', 'known', 'found', 'three', 'core', 'group', 'one', 'group', 'north', 'now', 'citi', 'locat', 'second', 'de', 'third', 'n', 'today', 'canton', 'citi']

Top 1 loser: http://dbpedia.org/resource/Bandundu_Province

Change: -2.982 (base: 499,974, MI: 62,995). Population: 8,062,463. Text length: 4,122

Bandundu Province

Bandundu Province Bandundu is one of eleven former provinces of the Democratic Republic of the Congo. It bordered the provinces of Kinshasa and Bas-Congo to the west, Équateur to the north, and Kasai-Occidental to the east. The provincial capital is also called Bandundu (formerly Banningstad/Banningville). Bandundu was formed in 1966 by merging the three post-colonial political regions: Kwilu, Kwango, and Mai-Ndombe. Under the 2006 constitution, Bandundu was to be broken up again into the aforementioned political regions. Kwilu province was to be formed by combining Kwilu district and the city of Kikwit, Kwango province was to be formed from Kwango district, and Mai-Ndombe province was to be formed by combining Plateaux District, Mai-Ndombe District and the city of Bandundu. Following much delay, by 2016 the change had taken effect. The landscape of Bandundu province consisted primarily of plateaus covered in savanna, cut by rivers and streams that are often b

Tokens (length 208)

['provinc', 'provinc', 'one', 'former', 'provinc', 'republ', 'border', 'provinc', 'west', 'north', 'east', 'provinci', 'capit', 'also', 'call', 'former', 'form', 'TOKNUMSEG6', 'three', 'coloni', 'polit', 'region', 'TOKNUMSEG6', 'constitut', 'polit', 'region', 'provinc', 'form', 'combin', 'citi', 'provinc', 'form', 'provinc', 'form', 'combin', 'citi', 'follow', 'much', 'TOKNUMSEG6', 'chang', 'taken', 'effect', 'provinc', 'consist', 'primarili', 'cover', 'river', 'often', 'border', 'provinc', 'river', 'river', 'provinc', 's', 'western', 'major', 'river', 'largest', 'surround', 'form', 'southern', 'situat', 'higher', 'ground', 'practic', 'shift', 'agricultur', 'main', ',', 'indian', 'chines', 'busi', 'electron', 'televis', 'system', 'open', 'shop', 'last', 'provinc', 'divid', 'citi', 'citi', 'town', 'TOKNUMSEG6', 'popul', 'mani', 'citizen', 'make', 'small', 'shop', 'food', 'various', 'beauti', 'product', 'beauti', 'product', 'increas', 'foreign', 'open', 'electron']

Top 2 loser: http://dbpedia.org/resource/Xuanwu_District,_Nanjing

Change: -2.1171 (base: 19,344, MI: 5,025). Population: 634,000. Text length: 1,869

Xuanwu District, Nanjing

Xuanwu District, Nanjing Xuanwu District () is one of 11 districts of Nanjing, the capital of Jiangsu province, China. Xuanwu District is an urban centre located in the north-eastern part of Nanjing. It is the site of the Nanjing Municipal Government. The main industries in the district are the leisure and tourism, information technology, retail and services. Its economy is primarily based upon the delivery of services. Industry zones include the Changjiang Road Cultural Area, Xinjiekou Central Economic Area, and Xuzhuang Software Industry Base. The district has attracted multi-national corporations, such as 3M, American Express, Siemens, Hyundai, Samsung, NYK Line, and Cathay Life Insurance. There are more than 40 colleges, universities and research institutes in the district, including Southeast University, Nanjing University of Science and Technology, Nanjing Agricultural University, Nanjing Forestry University and Jiangsu Academy of Agricultural Scie

Tokens (length 95)

['one', 'capit', 'provinc', 'china', 'urban', 'centr', 'locat', 'north', 'eastern', 'part', 'site', 'govern', 'main', 'industri', 'tourism', 'inform', 'technolog', 'servic', 'economi', 'primarili', 'base', 'servic', 'industri', 'zone', 'includ', 'road', 'cultur', 'area', 'central', 'econom', 'area', 'industri', 'base', 'attract', 'nation', 'corpor', 'express', 'life', 'colleg', 'univers', 'research', 'institut', 'includ', 'univers', 'univers', 'scienc', 'technolog', 'agricultur', 'univers', 'univers', 'academi', 'agricultur', 'scienc', 'academ', 'chines', 'academi', 'engin', 'chines', 'academi', 'scienc', 'repres', 'academ', 'academi', 'provinc', 'transport', 'within', 'includ', 'metro', 'station', 'within', 'sourc', 'transport', 'railway', 'station', 'railway', 'airport', 'also', 'known', 'tourist', 'attract', 'like', 'emperor', 's', 'one', 'imperi', 'dynasti', 'unesco', 'world', 'heritag', 'site', 'percent', 'cover', 'intern', 'exhibit', 'center']

Top 3 loser: http://dbpedia.org/resource/Dunmore_East

Change: -2.0688 (base: 7,616, MI: 61,324). Population: 1,559. Text length: 5,293

Dunmore East

Dunmore East Dunmore East () is a popular tourist and fishing village in County Waterford, Ireland. Situated on the west side of Waterford Harbour on Ireland's southeastern coast, it lies within the barony of Gaultier ("Gáll Tír" – "foreigners' land" in Irish): a reference to the influx of Viking and Norman settlers in the area. Iron Age people established a promontory fort overlooking the sea at Shanoon (referred to in 1832 as meaning the 'Old Camp' but more likely Canon Power's Sean Uaimh, 'Old Cave') at a point known for centuries as Black Nobb, where the old pilot station now stands, and underneath which a cave runs. Henceforth the place was referred to as Dun Mor, the Great Fort. Fish was an important part of the people's diet, and for hundreds of years a fishing community lived here. In 1640, Lord Power of Curraghmore, who owned a large amount of property in the area, built a castle on the cliff overlooking the strand about two hundred metres from St. Andrew's

Tokens (length 259)

['east', 'east', 'east', 'popular', 'tourist', 'situat', 'west', 's', 'coast', 'lie', 'within', 't', 'foreign', 'refer', 'area', 'establish', 'fort', 'sea', 'refer', 'mean', 'old', 'like', 'power', 's', 'old', 'point', 'known', 'centuri', 'old', 'station', 'now', 'stand', 'run', 'place', 'refer', 'great', 'fort', 'import', 'part', 's', 'hundr', 'TOKNUMSEG4', 'power', 'larg', 'amount', 'area', 'built', 'hundr', 's', 'fall', 'next', 'centuri', 'now', 'one', 'tower', 'remain', 'old', 's', 'built', 'centuri', 'one', 'wall', 'still', 'stand', 'top', 's', 'histori', 'port', 's', 'home', 'situat', 'near', 'launch', 's', 'built', 'work', 'east', 'creat', 'entir', 'new', 'point', 'passeng', 'come', 'southern', 'locat', 'east', 'TOKNUMSEG30', 'set', 'chang', 'took', 'place', 'engin', 's', 'bridg', 'work', 'new', 'station', 'ship', 'carri', 'royal']

Top 4 loser: http://dbpedia.org/resource/Demsa

Change: -2.0304 (base: 14,391, MI: 2,865). Population: 180,251. Text length: 276

Demsa

Demsa Demsa is a Local Government Area of Adamawa State, Nigeria with headquarters located in Demsa. Demsa lies on the Benue River. Population is 180,251. It is inhabited by ethnic groups such as the Bachama, Batta, Yandang, Bille, Mbula, Maya, Bare and fulani.

Tokens (length 12)

['local', 'govern', 'area', 'headquart', 'locat', 'lie', 'river', 'popul', 'TOKNUMSEG30', 'inhabit', 'ethnic', 'group']

Top 5 loser: http://dbpedia.org/resource/Leiyang

Change: -1.7694 (base: 591,709, MI: 55,023). Population: 1,300,000. Text length: 3,958

Leiyang

Leiyang Leiyang () is a county-level city in Hengyang, Hunan province. It is located in the southeast of Hengyang City and borders the prefecture-level city of Chenzhou to the south and east. It has over 1.3 million inhabitants. Cai Lun, the traditionally credited inventor of paper, was born and lived there. A Cai Lun invention square shows respect for Cai Lun. Leiyang history dates to the Qin dynasty (221 to 206 BC). According to the historical novel Romance of the Three Kingdoms, Pang Tong () was chosen as magistrate of Leiyang by Liu Bei (). After three years he had failed to fulfill the duties of his office. Many were upset by his failure and appealed to Liu Bei. Liu Bei sent Zhang Fei (), his sworn brother, to Leiyang to investigate. Before Zhang Fei arrived, Pang Tong, who knew that Zhang Fei loved wine, ordered that all wine must be diluted with water. Once Zhang Fei arrived, true to his reputation, he consumed copious amounts of wine, but wondered why he never be

Tokens (length 181)

['level', 'citi', 'provinc', 'locat', 'citi', 'border', 'prefectur', 'level', 'citi', 'south', 'east', 'million', 'inhabit', 'tradit', 'show', 'respect', 'histori', 'date', 'dynasti', 'TOKNUMSMALL', 'TOKNUMSMALL', 'bc', 'histor', 'three', 'kingdom', 'three', 'mani', ',', 'arriv', 'order', 'arriv', 'reput', 'amount', 'becam', 'go', 'becam', 'order', 'three', 'within', 'three', 'day', 'promot', 'special', 'creat', 'now', 'known', 'TOKNUMSEG6', 'chang', 'along', 'troop', 'command', 'de', 'follow', 'direct', 'origin', 'pass', 'higher', 'offici', 'chines', 'parti', 'leav', 'larg', 'number', 'chines', 'migrat', 'speak', 'dialect', 'dialect', 'chines', 'locat', 'chines', 'chines', 'chines', 'later', 'brought', 'center', 'product', 'materi', 'produc', 'area', 'today', 'includ', 'heavi', 'railway', 'station', 'locat', 'street', 'neighborhood', 'expressway', 'territori', 'new', 'town', 'three', 'high', 'speed', 'fair', 'TOKNUMSMALL', 'road', 'north', 'south']

Top 6 loser: http://dbpedia.org/resource/Bailadores

Change: -1.6468 (base: 160,900, MI: 16,328). Population: 345,489. Text length: 316

Bailadores

Bailadores Bailadores is a town in the western part of the Mérida State of Venezuela and is the capital of the Rivas Dávila Municipality. It was founded by Captain Luis Martín Martín, September 14, 1601, by appointment of founder Peter Sandes Court, from the Real Audiencia de Santa Fe de Bogotá.

Tokens (length 14)

['town', 'western', 'part', 'm', 'capit', 'found', 'n', 'n', 'septemb', 'TOKNUMSEG4', 'appoint', 'real', 'de', 'de']

Top 7 loser: http://dbpedia.org/resource/Madina,_Ghana

Change: -1.4931 (base: 8,971, MI: 2,726). Population: 137,162. Text length: 476

Madina, Ghana

Madina, Ghana Madina is a suburb of Accra and in the La-Nkwantanang-Madina Municipal Assembly, a district in the Greater Accra Region of southeastern Ghana. Madina is next to the University of Ghana and houses the Institute of Local Government. Madina is the twelfth most populous settlement in Ghana, in terms of population, with a population of 137,162 people. Madina is contained in the Abokobi-Madina electoral constituency of the republic of Ghana.

Tokens (length 15)

['assembl', 'greater', 'region', 'next', 'univers', 'institut', 'local', 'govern', 'popul', 'term', 'popul', 'popul', 'TOKNUMSEG30', 'contain', 'republ']

Top 8 loser: http://dbpedia.org/resource/Delgado,_San_Salvador

Change: -1.4645 (base: 7,772, MI: 2,680). Population: 174,825. Text length: 145

Delgado, San Salvador

Delgado, San Salvador Delgado, or Ciudad Delgado, is a municipality in the San Salvador department of El Salvador.

Tokens (length 1)

['el']

Top 9 loser: http://dbpedia.org/resource/Koro,_Mali

Change: -1.4261 (base: 12,727, MI: 2,611). Population: 62,681. Text length: 79

Koro, Mali

Koro, Mali Agriculture is the main income activity in Koro.

Tokens (length 3)

['agricultur', 'main', 'activ']

Top 10 loser: http://dbpedia.org/resource/Cixi_City

Change: -1.4 (base: 324,918, MI: 65,085). Population: 1,462,383. Text length: 2,293

Cixi City

Cixi City Cixi is a city with a rich culture and a long history. It was part of the state of Yue in the Spring and Autumn period (770-476 B.C.). The county was set up in the Qin Dynasty. At first it was called “Juzhang” and has been using the name of “Cixi” since the Kaiyuan reign of the Tang Dynasty (738 A.D.). Cixi City is located on the south of the economic circle of Yangtze River Delta, and is from Ningbo in the east, from Shanghai in the north and from Hangzhou in the west. Cixi has a subtropical monsoon climate, with an average annual temperature of 16℃. Cixi has an effective public transportation system that provides convenience. Highway connections are provided to all major cities, typical travel times are 1.5 hours or less by car, including access to the four major airports, Ningbo Lishe International Airport, Hangzhou Xiaoshan International Airport, Shanghai Hongqiao Airport, and Shanghai Pudong International Airport. Shanghai and Ningbo are also the closes

Tokens (length 138)

['citi', 'citi', 'citi', 'rich', 'cultur', 'long', 'histori', 'part', 'spring', 'autumn', 'period', 'TOKNUMSMALL', 'TOKNUMSMALL', 'c', 'set', 'dynasti', 'first', 'call', 'use', 'name', 'sinc', 'reign', 'dynasti', 'TOKNUMSMALL', 'citi', 'locat', 'south', 'econom', 'river', 'delta', 'east', 'north', 'west', 'subtrop', 'monsoon', 'climat', 'annual', 'temperatur', 'effect', 'public', 'transport', 'system', 'provid', 'highway', 'connect', 'provid', 'major', 'citi', 'typic', 'travel', 'time', 'hour', 'less', 'car', 'includ', 'access', 'four', 'major', 'airport', 'intern', 'airport', 'intern', 'airport', 'airport', 'intern', 'airport', 'also', 'sea', 'port', 'import', 'manufactur', 'citi', 'northern', 'provinc', 'new', 'locat', 'citi', 'construct', 'mid', 'scale', 'modern', 'citi', 'cover', 'area', 'popul', 'million', 'includ', 'million', 'million', 'town', 'five', 'jurisdict', 'citi', 'TOKNUMSMALL', 'administr', 'includ', 'citi', 'mani', 'site', 'wide']

We can see some new winners and losers but many repeats. Now, on the sequence of tokens we see thinks like 'popul' 'toknumseg30' that ought to inform about the overall size, if the ML were made aware this words are contiguous, which bring us to the concept of bigrams and the fourth featurization.

Fourth Featurization: Words in context

To incorporate some ordering among the words, a common technique is to use bigrams, pairs of words in order. If we were ot use bigrams directly, this will increase the vocabulary size quite a bit, so I'll threshold their minimum occurrence (Cell #13).

In [13]:
# CELL 13
import bz2
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# read text features
rand = random.Random(42)
all_data = list()
city_to_all_data = dict()
header = None
with open("ch8_cell11_dev_feat3.tsv") as mi:
    header = next(mi)
    header = header.strip().split("\t")
    header.pop(0) # name
    header.pop() # population
    for line in mi:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name = fields[0]
        feats = list(map(float,fields[1:-1]))
        city_to_all_data[name] = len(all_data)
        all_data.append( (feats, logpop, name) )
cities = sorted(list(city_to_all_data.keys()))

kept_terms = set(map(lambda l:l.split('\t')[0], open("ch8_cell10_vocab.tsv").readlines()))

remaining = set(cities)
all_bigrams = list()
bigram_to_idx    = dict()
city_bigram_idxs = dict()
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
    for byteline in wikitext:
        cityline = byteline.decode("utf-8")
        tab = cityline.index('\t')
        name = cityline[:tab]
        if name in remaining:
            if (len(cities) - len(remaining)) % int(len(cities) / 10) == 0:
                print("Tokenizing+bigrams {:>5} out of {:>5} cities, bigrams {:,} city \"{}\""
                      .format((len(cities) - len(remaining)), len(cities), len(all_bigrams), name))
            remaining.remove(name)
            text = cityline[tab:]
            bigrams = set()
            prev = '[PAD]'
            for token in list(filter(lambda tok: tok in kept_terms, cell10_tokenize(text))):
                bigram = prev + '-' + token
                prev = token
                idx = bigram_to_idx.get(bigram, None)
                if idx is None:
                    idx = len(all_bigrams)
                    all_bigrams.append(bigram)
                    bigram_to_idx[bigram] = idx
                bigrams.add(idx)
            city_bigram_idxs[name] = sorted(list(toks))
bigram_to_idx = None

for name in remaining:
    city_bigram_idxs[name] = list()

print("Total bigrams: {:,}".format(len(all_bigrams)))

# drop bigrams that appear in less that 50 documents
bigram_docs = list()
for _ in range(len(all_bigrams)):
    bigram_docs.append([])
for doc_idx, name in enumerate(cities):
    bigram_idxs = city_bigram_idxs[name]
    for bigram_idx in bigram_idxs:
        bigram_docs[bigram_idx].append(doc_idx)
city_bigram_idxs = None

threshold = 50
reduced_bigrams = list()
for bigram_idx in range(len(all_bigrams)):
    if len(bigram_docs[bigram_idx]) >= threshold:
        reduced_bigrams.append(bigram_idx)
        
print("Reduced bigrams: {:,} (reduction {:%})"
      .format(len(reduced_bigrams), (len(all_bigrams) - len(reduced_bigrams)) / len(all_bigrams)))    

matrix = np.zeros( (len(cities), len(reduced_bigrams)) )
for idx, bigram_idx in enumerate(reduced_bigrams):
    header.append("bigram=" + all_bigrams[bigram_idx])

    for idx_doc in bigram_docs[bigram_idx]:
        matrix[idx_doc, idx] = 1.0

for idx_doc in range(len(cities)):
    all_data[city_to_all_data[cities[idx_doc]]][0].extend(matrix[idx_doc,:])
bigram_docs = None
matrix      = None 

with open("ch8_cell13_dev_feat4.tsv", "w") as f:
    f.write("name\t" + "\t".join(header)+"\tlogpop\n")
    for idx_doc in range(len(cities)):
        name = cities[idx_doc]
        entry = all_data[city_to_all_data[name]]
        f.write("{}\t{}\t{}\n".format(name, "\t".join(map(str,entry[0])), entry[1]))

# split
train_data = list()
test_data  = list()
for row in all_data:
    if rand.random() < 0.2:
        test_data.append(row)
    else:
        train_data.append(row)
all_data = None # free memory

test_data  = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest  = np.array(list(map(lambda t:t[0], test_data)))
ytest  = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data  = None

# train
print("Training on {:,} cities".format(len(xtrain)))

rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(xtrain, ytrain)
ytest_pred = rf.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

# free memory
all_bigrams = None
xtrain      = None
xtest       = None


import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest,      label="actual",    color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell13_rf_feat4.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Tokenizing+bigrams     0 out of 44959 cities, bigrams 0 city "<http://dbpedia.org/resource/Ankara>"
Tokenizing+bigrams  4495 out of 44959 cities, bigrams 335,865 city "<http://dbpedia.org/resource/Gonzales,_Louisiana>"
Tokenizing+bigrams  8990 out of 44959 cities, bigrams 382,256 city "<http://dbpedia.org/resource/Laurel_Bay,_South_Carolina>"
Tokenizing+bigrams 13485 out of 44959 cities, bigrams 458,421 city "<http://dbpedia.org/resource/Nysa,_Poland>"
Tokenizing+bigrams 17980 out of 44959 cities, bigrams 512,669 city "<http://dbpedia.org/resource/Vilathikulam>"
Tokenizing+bigrams 22475 out of 44959 cities, bigrams 540,316 city "<http://dbpedia.org/resource/Arroyo_Seco,_Santa_Fe>"
Tokenizing+bigrams 26970 out of 44959 cities, bigrams 557,120 city "<http://dbpedia.org/resource/Fatehpur,_Barabanki>"
Tokenizing+bigrams 31465 out of 44959 cities, bigrams 566,614 city "<http://dbpedia.org/resource/Kirchheim_am_Neckar>"
Tokenizing+bigrams 35960 out of 44959 cities, bigrams 571,918 city "<http://dbpedia.org/resource/Pirching_am_Traubenberg>"
Tokenizing+bigrams 40455 out of 44959 cities, bigrams 575,277 city "<http://dbpedia.org/resource/Scone,_Perth_and_Kinross>"
Tokenizing+bigrams 44950 out of 44959 cities, bigrams 581,778 city "<http://dbpedia.org/resource/Babatorun>"
Total bigrams: 581,797
Reduced bigrams: 113 (reduction 99.980577%)
Training on 35,971 cities
RMSE 0.32621332109795326
Out[13]:
<matplotlib.legend.Legend at 0x7fcfd42ffcd0>

That did worse. But what we were trying to accomplish (populatio-number) is not among the picked bigrams for any of the population numbers. Let's try skip-bigrams with hash encoding instead.

Fifth Featurization: Skip-bigrams

I will combine skip-bigrams with feature hashing to reduce the number of bigrams to a manageable size (Cell #14).

For hashing function, we will use Python's buit-in hashing function

In [14]:
# CELL 14
import bz2
import math
from sklearn.ensemble import RandomForestRegressor
import numpy as np

PARAM_HASH_SIZE=3000
PARAM_SKIP_SIZE=6
PARAM_STABLE_HASHES=True

def cell14_hash(x):
    hashed = 0
    if PARAM_STABLE_HASHES:
        # pure python FNV1a
        if x:
            hashed = 14695981039346656037
            for c in x:
                hashed = hashed ^ ord(c)
                hashed = (hashed * 1099511628211)
    else:
        # python after 3.3 use siphash24, which is better and the C implementation 
        # is faster but it is salted so unless you have to set PYTHONSEED environment 
        # variable to 0 (a bad idea) otherwise every run will produce different hashes
        hashed = hash(x)
    return abs(hashed) % PARAM_HASH_SIZE 
    
# read text features
rand   = random.Random(42)
all_data         = list()
city_to_all_data = dict()
header = None
with open("ch8_cell11_dev_feat3.tsv") as mi:
    header = next(mi)
    header = header.strip().split("\t")
    header.pop(0) # name
    header.pop() # population
    for line in mi:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name  = fields[0]
        feats = list(map(float,fields[1:-1]))
        city_to_all_data[name] = len(all_data)
        all_data.append( (feats, logpop, name) )
cities = sorted(list(city_to_all_data.keys()))

kept_terms = set(map(lambda l:l.split('\t')[0], open("ch8_cell10_vocab.tsv").readlines()))

remaining = set(cities)
with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
    for byteline in wikitext:
        cityline = byteline.decode("utf-8")
        tab = cityline.index('\t')
        name = cityline[:tab]
        if name in remaining:
            if (len(cities) - len(remaining)) % int(len(cities) / 10) == 0:
                print("Tokenizing+skip-bigrams {:>5} out of {:>5} cities, city \"{}\""
                      .format((len(cities) - len(remaining)), len(cities), name))
            remaining.remove(name)
            text    = cityline[tab:]
            bigrams = set()
            prev = [ '[PAD]' ] * PARAM_SKIP_SIZE
            feats = [ 0.0 ] * PARAM_HASH_SIZE
            for token in list(filter(lambda tok: tok in kept_terms, cell10_tokenize(text))):
                for skip in prev:
                    bigram = skip + '-' + token
                    feats[cell14_hash(bigram)] = 1.0
                prev.pop(0)
                prev.append(token)
            all_data[city_to_all_data[name]][0].extend(feats)

for name in remaining:
    all_data[city_to_all_data[name]][0].extend([ 0.0 ] * PARAM_HASH_SIZE)

for idx in range(PARAM_HASH_SIZE):
    header.append("hashed_skip_bigram#" + str(idx))

with open("ch8_cell14_dev_feat5.tsv", "w") as f:
    f.write("name\t" + "\t".join(header)+"\tlogpop\n")
    for idx_doc in range(len(cities)):
        name = cities[idx_doc]
        entry = all_data[city_to_all_data[name]]
        f.write("{}\t{}\t{}\n".format(name, "\t".join(map(str,entry[0])), entry[1]))

# split
train_data = list()
test_data  = list()
for row in all_data:
    if rand.random() < 0.2:
        test_data.append(row)
    else:
        train_data.append(row)
all_data = None # free memory

test_data  = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest  = np.array(list(map(lambda t:t[0], test_data)))
ytest  = np.array(list(map(lambda t:t[1], test_data)))
train_data = None
test_data  = None

# train
print("Training on {:,} cities".format(len(xtrain)))

rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(xtrain, ytrain)
ytest_pred = rf.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

xtrain = None
xtest  = None

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest, label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell14_rf_feat5.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Tokenizing+skip-bigrams     0 out of 44959 cities, city "<http://dbpedia.org/resource/Ankara>"
Tokenizing+skip-bigrams  4495 out of 44959 cities, city "<http://dbpedia.org/resource/Gonzales,_Louisiana>"
Tokenizing+skip-bigrams  8990 out of 44959 cities, city "<http://dbpedia.org/resource/Laurel_Bay,_South_Carolina>"
Tokenizing+skip-bigrams 13485 out of 44959 cities, city "<http://dbpedia.org/resource/Nysa,_Poland>"
Tokenizing+skip-bigrams 17980 out of 44959 cities, city "<http://dbpedia.org/resource/Vilathikulam>"
Tokenizing+skip-bigrams 22475 out of 44959 cities, city "<http://dbpedia.org/resource/Arroyo_Seco,_Santa_Fe>"
Tokenizing+skip-bigrams 26970 out of 44959 cities, city "<http://dbpedia.org/resource/Fatehpur,_Barabanki>"
Tokenizing+skip-bigrams 31465 out of 44959 cities, city "<http://dbpedia.org/resource/Kirchheim_am_Neckar>"
Tokenizing+skip-bigrams 35960 out of 44959 cities, city "<http://dbpedia.org/resource/Pirching_am_Traubenberg>"
Tokenizing+skip-bigrams 40455 out of 44959 cities, city "<http://dbpedia.org/resource/Scone,_Perth_and_Kinross>"
Tokenizing+skip-bigrams 44950 out of 44959 cities, city "<http://dbpedia.org/resource/Babatorun>"
Training on 35,971 cities
RMSE 0.3261679716302875
Out[14]:
<matplotlib.legend.Legend at 0x7fcfd4641350>

Sixth Featurization: Embeddings

Finally, we explore using word embeddings (Cell #15). Because the embeddings and TF*IDF scores might overfit, I'll compute them on the trainset only.

To use these embeddings, we take the weighted average embedding for the whole document. We can also take the maximum and minimum for each coordinate, over all entries in the document.

TF*IDF

Instead of using raw counts, we can perform a traditional feature weighting employed in NLP/IR: adjust the counts by the inverse of the frequency of the word type over the corpus.

Therefore, we replace the Term Frequency (term is synonym with word type in IR) in the document with the TF times the Inverse Document Frequency (IDF). To have more informed statistics, we can compute the IDF counts on a larger set (e.g., Wikipedia or in our case, the whole StackOverflow dump). In this example we will use the train set.

In [15]:
# CELL 15
import re
import pickle
import random
import bz2
import math
from collections import OrderedDict

import numpy as np
import gensim
from sklearn.ensemble import RandomForestRegressor
from stemming.porter2 import stem as porter_stem

with open("ch6_cell27_splits.pk", "rb") as pkl:
    segments_at = pickle.load(pkl)

boundaries = list(map(lambda x:( int(round(10**x['min'])), 
                            int(round(10**x['val'])), 
                            int(round(10**x['max'])) ), segments_at[5]))
NUM_RE = re.compile('\d?\d?\d?(,?\d{3})+') # at least 3 digits
def cell15_tokenize(text):
    tokens = list(filter(lambda x: len(x)>0,
                         map(lambda x: x.lower(),
                         re.sub('\s+',' ', re.sub('[^A-z,0-9]', ' ', text)).split(' '))))
    result = list()
    for tok in tokens:
        if len(tok) > 1 and tok[-1] == ',':
            tok = tok[:-1]
        if NUM_RE.fullmatch(tok):
            num = int(tok.replace(",",""))
            if num < boundaries[0][0]:
                result.append("TOKNUMSMALL")
            elif num > boundaries[-1][2]:
                result.append("TOKNUMBIG")
            else:
                found = False
                for idx, seg in enumerate(boundaries[1:]):
                    if num < seg[0]:
                        result.append("TOKNUMSEG" + str(idx))
                        found = True
                        break
                if not found:
                    result.append("TOKNUMSEG" + str(len(boundaries) - 1))
        else:
            result.append(porter_stem(tok))
    return result

# read text features
rand = random.Random(42)
all_data         = list()
city_to_all_data = dict()
header = None
with open("ch8_cell11_dev_feat3.tsv") as mi:
    header = next(mi)
    header = header.strip().split("\t")
    header.pop(0) # name
    header.pop() # population
    for line in mi:
        fields = line.strip().split("\t")
        logpop = float(fields[-1])
        name  = fields[0]
        feats = list(map(float,fields[1:-1]))
        city_to_all_data[name] = len(all_data)
        all_data.append( (feats, logpop, name) )
cities = sorted(list(city_to_all_data.keys()))

# tokenize documents
city_tokenized = OrderedDict()
PARAM_COMPUTE_NEW_TOKENS = True
if PARAM_COMPUTE_NEW_TOKENS:
    remaining = set(city_to_all_data.keys())
    with bz2.BZ2File("cities1000_wikitext.tsv.bz2","r") as wikitext:
        for byteline in wikitext:
            cityline = byteline.decode("utf-8")
            tab      = cityline.index('\t')
            name     = cityline[:tab]
            if name in remaining:
                if (len(cities) - len(remaining)) % int(len(cities) / 10) == 0:
                    print("Tokenizing {:>5} out of {:>5} cities, city \"{}\""
                          .format((len(cities) - len(remaining)), len(cities), name))
                remaining.remove(name)
                text = cityline[tab:]
                city_tokenized[name] = cell15_tokenize(text)

    for name in remaining:
        city_tokenized[name] = list()

    print("Saving tokens...")
    with open("ch8_cell15_tokens.txt", "w") as f:
        for city, tokens in city_tokenized.items():
            f.write("{}\t{}\n".format(city, " ".join(tokens)))
else:
    print("Reading tokens...")
    with open("ch8_cell15_tokens.txt", "r") as f:
        for line in f:
            (city, toks) = line.split("\t")
            city_tokenized[city] = toks.split(" ")    
    
# split
train_data = list()
test_data  = list()
for row in all_data:
    if rand.random() < 0.2:
        test_data.append(row)
    else:
        train_data.append(row)
all_data = None # free memory

train_cities         = list(map(lambda x:x[-1], train_data))
tokenized_train_docs = list(map(lambda city:city_tokenized[city], train_cities))
print("Saving train split...")
with open("ch8_cell15_train_cities.txt", "w") as f:
    for city in city_tokenized.keys():
        f.write("{}\n".format(city))

PARAM_EMBEDDING_SIZE = 50
print("Training embeddings of size {}".format(PARAM_EMBEDDING_SIZE))
model   = gensim.models.Word2Vec(tokenized_train_docs, size=PARAM_EMBEDDING_SIZE)
tok2vec = dict(zip(model.wv.index2word, model.wv.vectors))

print("Trained ", len(tok2vec), " embeddings")
print("Saving embeddings...")
with open("ch8_cell15_embeddings.tsv", "w") as f:
    for token in model.wv.vocab.keys():
        f.write("{}\t{}\n".format(token, "\t".join(map(str,model.wv[token]))))

# compute idfs
df_tok = OrderedDict()
for tok_doc in tokenized_train_docs:
    seen = set()
    for tok in tok_doc:
        if tok not in seen:
            df_tok[tok] = df_tok.get(tok, 0) + 1
            seen.add(tok)
idf_tok = OrderedDict()    
for tok in df_tok:
    idf_tok[tok] = math.log(1 + len(tokenized_train_docs) * 1.0 / df_tok[tok])
print("Computed {:,} IDFs".format(len(idf_tok)))
print("Saving idfs...")
with open("ch8_cell15_idfs.tsv", "w") as f:
    for tok, idf in idf_tok.items():
        f.write("{}\t{}\n".format(tok, idf))

# plot
PARAM_PLOT_TSNE = True
if PARAM_PLOT_TSNE:
    print("Computing t-SNE...")
    from sklearn.manifold import TSNE
    vectors = []
    words = list(model.wv.vocab.keys())
    rand.shuffle(words)
    for word in words:
        vectors.append(model.wv[word])
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=500, random_state=23)
    projected = tsne_model.fit_transform(vectors)

    x = []
    y = []
    for t in projected:
        x.append(t[0])
        y.append(t[1])

    print("Saving t-SNE points...")
    with open("ch8_cell15_tsne.tsv", "w") as embed:
        for idx in range(len(words)):
            embed.write("{}\t{}\t{}\n".format(words[idx], x[idx], y[idx]))
            
    import matplotlib.pyplot as plt
    %matplotlib inline
    plt.figure()
    plt.rcParams['figure.figsize'] = [20, 20]
    plotted_count   = 0
    plotted_section = set()
    # plot a meaningful, visible sample
    for idx in range(len(x)):
        if df_tok[words[idx]] < 200:
            continue # ensure meaningful
        section = str(int(x[idx] * 10 * 4)) + "-" + str(int(y[idx] * 10 * 4))
        if section in plotted_section:
            continue # ensure visible
        plotted_section.add(section)
        plotted_count += 1
        plt.scatter(x[idx] ,y[idx])
        plt.annotate(words[idx], xy=(x[idx], y[idx]), xytext=(5, 2), 
                     textcoords='offset points', ha='right', va='bottom')
        if plotted_count > 150:
            break # ensure visible

    plt.savefig("ch8_cell15_tsne.pdf", bbox_inches='tight', dpi=300)

# encode train_data and test_data
def encode(toks):
    result = np.zeros( (PARAM_EMBEDDING_SIZE,) )
    good_toks = list(filter(lambda t:t in tok2vec, toks))
    for tok in good_toks:
        tok_vect_scaled = np.copy(tok2vec[tok])
        tok_vect_scaled *= idf_tok[tok] / len(good_toks)
        result += tok_vect_scaled
    return result

for data in (train_data, test_data):
    for row in data:
        name = row[-1]
        row[0].extend(encode(city_tokenized[name]))

test_data  = sorted(test_data, key=lambda t:t[1])
test_names = list(map(lambda t:t[2], test_data))

xtrain = np.array(list(map(lambda t:t[0], train_data)))
ytrain = np.array(list(map(lambda t:t[1], train_data)))
xtest  = np.array(list(map(lambda t:t[0], test_data)))
ytest  = np.array(list(map(lambda t:t[1], test_data)))
train_data     = None
test_data      = None
idf_tok        = None
df_tok         = None
tok2vec        = None
city_tokenized = None
# train
print("Training on {:,} cities".format(len(xtrain)))

rf = RandomForestRegressor(max_features=0.75, random_state=42, max_depth=10, n_estimators=100, n_jobs=-1)
rf.fit(xtrain, ytrain)
ytest_pred = rf.predict(xtest)
RMSE = math.sqrt(sum((ytest - ytest_pred)**2) / len(ytest))
print("RMSE", RMSE)

xtrain = None
xtest  = None

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(ytest_pred, label="predicted", color='gray')
plt.plot(ytest,      label="actual", color='black')
plt.ylabel('scaled log population')
plt.savefig("ch8_cell15_rf_feat6.pdf", bbox_inches='tight', dpi=300)
plt.legend()
Tokenizing     0 out of 44959 cities, city "<http://dbpedia.org/resource/Ankara>"
Tokenizing  4495 out of 44959 cities, city "<http://dbpedia.org/resource/Gonzales,_Louisiana>"
Tokenizing  8990 out of 44959 cities, city "<http://dbpedia.org/resource/Laurel_Bay,_South_Carolina>"
Tokenizing 13485 out of 44959 cities, city "<http://dbpedia.org/resource/Nysa,_Poland>"
Tokenizing 17980 out of 44959 cities, city "<http://dbpedia.org/resource/Vilathikulam>"
Tokenizing 22475 out of 44959 cities, city "<http://dbpedia.org/resource/Arroyo_Seco,_Santa_Fe>"
Tokenizing 26970 out of 44959 cities, city "<http://dbpedia.org/resource/Fatehpur,_Barabanki>"
Tokenizing 31465 out of 44959 cities, city "<http://dbpedia.org/resource/Kirchheim_am_Neckar>"
Tokenizing 35960 out of 44959 cities, city "<http://dbpedia.org/resource/Pirching_am_Traubenberg>"
Tokenizing 40455 out of 44959 cities, city "<http://dbpedia.org/resource/Scone,_Perth_and_Kinross>"
Tokenizing 44950 out of 44959 cities, city "<http://dbpedia.org/resource/Babatorun>"
Saving tokens...
Saving train split...
Training embeddings of size 50
Trained  72361  embeddings
Saving embeddings...
Computed 309,291 IDFs
Saving idfs...
Computing t-SNE...
Saving t-SNE points...
Training on 35,971 cities
RMSE 0.32706006847260105
Out[15]:
<matplotlib.legend.Legend at 0x7fcea35cab10>
In [16]:
# CELL 16

# running example
s = """Its population was 8,361,447 at the 2010 census whom 1,977,253 in the built-up 
(or "metro") area made of Zhanggong and Nankang, and Ganxian largely being urbanized.
"""
cell7_tokens = set(map(lambda x: x[len("token="):], next(open("ch8_cell7_dev_tokens.tsv")).split("\t")))
print("cell6",          cell6_tokenize(s))
print("cell7",          cell7_tokenize(s))
print("cell7-filtered", list(filter(lambda tok: tok in cell7_tokens, cell7_tokenize(s))))
print("cell10",         cell10_tokenize(s))
print("cell15",         cell15_tokenize(s))
cell6 ['TOKNUMSEG31', 'TOKNUMSEG6', 'TOKNUMSEG31']
cell7 ['its', 'population', 'was', 'TOKNUMSEG31', 'at', 'the', 'TOKNUMSEG6', 'census', 'whom', 'TOKNUMSEG31', 'in', 'the', 'built', 'up', 'or', 'metro', 'area', 'made', 'of', 'zhanggong', 'and', 'nankang', 'and', 'ganxian', 'largely', 'being', 'urbanized']
cell7-filtered ['its', 'population', 'was', 'at', 'the', 'TOKNUMSEG6', 'the', 'built', 'metro', 'area', 'of', 'and', 'and', 'largely', 'being']
cell10 ['popul', 'TOKNUMSEG31', 'TOKNUMSEG6', 'census', 'TOKNUMSEG31', 'built', 'metro', 'area', 'made', 'zhanggong', 'nankang', 'ganxian', 'larg', 'urban']
cell15 ['it', 'popul', 'was', 'TOKNUMSEG31', 'at', 'the', 'TOKNUMSEG6', 'census', 'whom', 'TOKNUMSEG31', 'in', 'the', 'built', 'up', 'or', 'metro', 'area', 'made', 'of', 'zhanggong', 'and', 'nankang', 'and', 'ganxian', 'larg', 'be', 'urban']
In [17]:
# memory check

import sys
l = list()
for v in dir():
    l.append( (int(eval("sys.getsizeof({})".format(v))), v) )
for c, v in sorted(l, reverse=True)[:20]:
    print("\t{:,}\t{}".format(c,v))
	2,621,552	text_lengths
	2,621,552	city_to_mi_data
	2,621,552	city_to_base_data
	2,621,552	city_to_all_data
	2,621,552	city_pop
	2,621,552	cities_and_pop
	2,097,384	remaining
	651,368	words
	651,360	y
	651,360	x
	651,360	vectors
	404,744	cities
	359,768	ydata
	359,768	xdata
	304,584	train_cities
	304,584	tokenized_train_docs
	287,864	ytrain
	131,304	seen
	81,008	named_se
	77,856	data