CS301_Proj10

CS301_Proj10

Introduction

For your final CS 301 project, you’re going to analyze the whole world!

Specifically, you’re going to study various statistics for 174 countries, answering questions such as: what is the correlation between a country’s literacy rate and GDP?

To start, download test.py and expected.html. You’ll also need to download lint.py (see linter documentation under “Testing” below). Do not download any data files manually (you must write Python code to download these automatically). You’ll do all your work in a main.ipynb.

Data

For this project, you’ll be using one large JSON file with statistics about 174 countries adapted from here. and you will also extract data from a snapshot of this page.

First check these resources:

Some of the columns require a little extra explanation:

  • area: measured in square miles
  • coastline: ratio of coast to area
  • birth-rate: births per 1000 people per year
  • death-rate: deaths per 1000 people per year
  • infant-mortality: per 1000 births
  • literacy: (out of 100%)
  • phones: number of phone per 1000 people
#project: p10
#submitter: naixinzhang
#partner: none
import pandas as pd
import requests,json
from bs4 import BeautifulSoup
import os
from pandas import DataFrame, Series
from math import radians, cos, sin, asin, sqrt
import numpy as np
import matplotlib.pyplot as plt
def download(filename, url):
    # We do not download again if the file already exists
    if os.path.exists(filename):
        return (str(filename) + " already exists!")
    else:
        r = requests.get(url)
        r.raise_for_status()
        data = r.text
        f = open(filename, "w", encoding="utf-8")
        f.write(data)
        f.close()
    return (str(filename) + " created!")
download("countries.json","https://raw.githubusercontent.com/tylerharter/caraza-harter-com/master/tyler/cs301/fall19/data/countries.json")
'countries.json already exists!'
countries = pd.read_json('countries.json')
countries

area birth-rate coastline continent country death-rate gdp-per-capita infant-mortality literacy net-migration phones population region
0 756950 15.23 0.85 South America Chile 5.81 9900 8.80 96,2 0.00 213,0 16134219 LATIN AMER. & CARIB
1 665 17.80 24.21 Asia Bahrain 4.14 16900 17.27 89,1 1.05 281,3 698585 NEAR EAST
2 22966 28.84 1.68 North America Belize 5.72 4900 25.69 94,1 0.00 115,7 287730 LATIN AMER. & CARIB
3 582650 39.72 0.09 Africa Kenya 14.02 1000 61.47 85,1 -0.10 8,1 34707817 SUB-SAHARAN AFRICA
4 11854 33.05 3.12 Australia Marshall Islands 4.78 1600 29.45 93,7 -6.04 91,2 60422 OCEANIA
5 111370 44.77 0.52 Africa Liberia 23.10 1000 128.87 57,5 0.00 2,3 3042004 SUB-SAHARAN AFRICA
6 504782 10.06 0.98 Europe Spain 9.72 22000 4.42 97,9 0.99 453,5 40397842 WESTERN EUROPE
7 65610 15.51 2.04 Asia Sri Lanka 6.52 3700 14.35 92,3 -1.31 61,5 20222240 ASIA (EX. NEAR EAST)
8 212460 36.24 0.98 Asia Oman 3.81 13100 19.51 75,8 0.28 85,5 3102229 NEAR EAST
9 316 10.22 62.28 Europe Malta 8.10 17700 3.89 92,8 2.07 505,0 400214 WESTERN EUROPE
10 185180 27.76 0.10 Asia Syria 4.81 3300 29.53 76,9 0.00 153,8 18881361 NEAR EAST
11 1759540 26.49 0.10 Africa Libya 3.48 6400 24.60 82,6 0.00 127,1 5900754 NORTHERN AFRICA
12 752614 41.00 0.00 Africa Zambia 19.93 800 88.29 80,6 0.00 8,2 11502010 SUB-SAHARAN AFRICA
13 10991 20.82 9.30 North America Jamaica 6.52 3900 12.36 87,9 -4.92 124,0 2758124 LATIN AMER. & CARIB
14 447400 26.36 0.00 Asia Uzbekistan 7.84 1700 71.10 99,3 -1.72 62,9 27307134 C.W. OF IND. STATES
15 61 10.02 0.00 Europe San Marino 8.17 34600 5.73 96,0 10.98 704,3 29251 WESTERN EUROPE
16 488100 27.61 0.00 Asia Turkmenistan 8.60 5800 73.08 98,0 -0.86 74,6 5042920 C.W. OF IND. STATES
17 33843 15.70 0.00 Europe Moldova 12.64 1800 40.42 99,1 -0.26 208,1 4466706 C.W. OF IND. STATES
18 28748 15.11 1.26 Europe Albania 5.22 4500 21.52 86,5 -4.93 71,2 3581655 EASTERN EUROPE
19 53 11.40 194.34 North America Bermuda 7.74 36000 8.53 98,0 2.49 851,4 65773 NORTHERN AMERICA
20 803940 29.74 0.13 Asia Pakistan 8.23 2100 72.44 45,7 -2.77 31,8 165803560 ASIA (EX. NEAR EAST)
21 30528 10.38 0.22 Europe Belgium 10.27 29100 4.68 98,0 1.23 462,6 10379067 WESTERN EUROPE
22 47000 33.65 0.00 Asia Bhutan 12.70 1300 100.44 42,2 0.00 14,3 2279723 ASIA (EX. NEAR EAST)
23 587040 41.41 0.82 Africa Madagascar 11.11 800 76.83 68,9 0.00 3,6 18595469 SUB-SAHARAN AFRICA
24 17363 27.41 0.00 Africa Swaziland 29.74 4900 69.27 81,6 0.00 30,8 1136334 SUB-SAHARAN AFRICA
25 4167 16.68 60.60 Australia French Polynesia 4.69 17500 8.44 98,0 2.94 194,5 274578 OCEANIA
26 64589 9.24 0.82 Europe Latvia 13.66 10200 9.55 99,8 -2.23 321,4 2274735 BALTICS
27 93030 9.72 0.00 Europe Hungary 13.11 13900 8.57 99,4 0.86 336,2 9981334 EASTERN EUROPE
28 693 9.34 27.85 Asia Singapore 4.28 23700 2.29 92,5 11.53 411,4 4492150 ASIA (EX. NEAR EAST)
29 160 10.21 0.00 Europe Liechtenstein 7.18 25000 4.70 100,0 4.85 585,5 33987 WESTERN EUROPE
... ... ... ... ... ... ... ... ... ... ... ... ... ...
144 267667 36.16 0.33 Africa Gabon 12.25 5500 53.64 63,2 0.00 27,4 1424906 SUB-SAHARAN AFRICA
145 13790 12.77 3.63 North America Puerto Rico 7.65 16800 8.24 94,1 -1.46 283,1 3927188 LATIN AMER. & CARIB
146 338145 10.45 0.37 Europe Finland 9.86 27400 3.57 100,0 0.95 405,3 5231372 WESTERN EUROPE
147 1240000 49.82 0.00 Africa Mali 16.89 900 116.79 46,4 -0.33 6,4 11716829 SUB-SAHARAN AFRICA
148 2170 36.93 15.67 Africa Comoros 8.20 700 74.93 56,5 0.00 24,5 690948 SUB-SAHARAN AFRICA
149 7686850 12.14 0.34 Australia Australia 7.51 29000 4.69 100,0 3.98 565,5 20264082 OCEANIA
150 825418 24.32 0.19 Africa Namibia 18.86 7200 48.98 84,0 0.00 62,6 2044147 SUB-SAHARAN AFRICA
151 181040 26.90 0.24 Asia Cambodia 9.06 1900 71.48 69,4 0.00 2,6 13881427 ASIA (EX. NEAR EAST)
152 236040 47.35 0.00 Africa Uganda 12.24 1400 67.83 69,9 0.00 3,6 28195754 SUB-SAHARAN AFRICA
153 945087 37.71 0.15 Africa Tanzania 16.39 600 98.54 78,2 -2.06 4,0 37445392 SUB-SAHARAN AFRICA
154 27830 42.22 0.00 Africa Burundi 13.46 600 69.29 51,6 -0.06 3,4 8090068 SUB-SAHARAN AFRICA
155 82880 18.96 1.59 Asia United Arab Emirates 4.40 23200 14.51 77,9 1.03 475,3 2602713 NEAR EAST
156 329750 22.86 1.42 Asia Malaysia 5.05 9000 17.70 88,7 0.00 179,0 24385858 ASIA (EX. NEAR EAST)
157 78866 9.02 0.00 Europe Czech Republic 10.59 15700 3.93 99,9 0.97 314,3 10235455 EASTERN EUROPE
158 514000 13.87 0.63 Asia Thailand 7.04 7400 20.48 92,6 0.00 108,9 64631595 ASIA (EX. NEAR EAST)
159 70280 14.45 2.06 Europe Ireland 7.82 29600 5.39 98,0 4.99 500,5 4062235 WESTERN EUROPE
160 18270 22.55 6.18 Australia Fiji 5.65 5800 12.62 93,7 -3.14 112,6 905949 OCEANIA
161 2717300 16.00 0.00 Asia Kazakhstan 9.42 6300 29.21 98,4 -3.35 164,1 15233244 C.W. OF IND. STATES
162 1564116 21.59 0.00 Asia Mongolia 6.95 1800 53.79 97,8 0.00 55,1 2832224 ASIA (EX. NEAR EAST)
163 198500 22.80 0.00 Asia Kyrgyzstan 7.08 1600 35.64 97,0 -2.45 84,0 5213898 C.W. OF IND. STATES
164 527970 42.89 0.36 Asia Yemen 8.30 800 61.50 50,2 0.00 37,2 21456188 NEAR EAST
165 9984670 10.78 2.02 North America Canada 7.80 29800 4.75 97,0 5.96 552,2 33098932 NORTHERN AMERICA
166 262 12.74 61.07 North America Cayman Islands 4.89 35000 8.19 98,0 18.75 836,3 45436 LATIN AMER. & CARIB
167 118480 43.13 0.00 Africa Malawi 19.33 600 103.32 62,7 0.00 7,9 13013926 SUB-SAHARAN AFRICA
168 86600 20.74 0.00 Europe Azerbaijan 9.75 3400 81.74 97,0 -4.90 137,1 7961619 C.W. OF IND. STATES
169 2381740 17.14 0.04 Africa Algeria 4.61 6000 31.00 70,0 -0.39 78,1 32930091 NORTHERN AFRICA
170 103000 13.64 4.83 Europe Iceland 6.72 30900 3.31 99,9 2.38 647,7 299388 WESTERN EUROPE
171 748 25.37 56.02 Australia Tonga 5.28 2200 12.62 98,5 0.00 97,7 114689 OCEANIA
172 27750 36.44 6.38 North America Haiti 12.17 1600 73.45 52,9 -3.40 16,9 8308504 LATIN AMER. & CARIB
173 1284000 45.73 0.00 Africa Chad 16.38 1200 93.82 47,5 -0.11 1,3 9944201 SUB-SAHARAN AFRICA

174 rows × 13 columns

#q1 How many countries do we have in our dataset?
def get_num_countries():
    return countries['country'].nunique()
get_num_countries()
174
#q2 what is the total area across all the countries in our dataset?
def get_total_area():
    return countries['area'].sum()
get_total_area()
127170843

download('capitals.html', 'https://raw.githubusercontent.com/tylerharter/cs301-projects/master/fall19/p10/techslides-snapshot.html')
'capitals.html already exists!'
def state_stats(path):
    stats = {}
    f = open(path)
    html_string = f.read()
    f.close()
    doc = BeautifulSoup(html_string, "html.parser")
    trs = doc.find_all("tr")
    for tr in trs[1:]:
        cells = tr.find_all("td")[:4]
        j = cells[0].get_text()
        stats.setdefault("country",[]).append(j)
        j = cells[1].get_text()
        stats.setdefault("capital",[]).append(j) 
        j = cells[2].get_text()
        stats.setdefault("latitude",[]).append(float(j)) 
        j = cells[3].get_text()
        stats.setdefault("longitude",[]).append(float(j)) 
    return stats

wi_stats = state_stats("capitals.html")
data = DataFrame({"country":wi_stats['country'],
           "capital":wi_stats['capital'],
           "latitude":wi_stats['latitude'],
           "longitude":wi_stats['longitude'],     
          })
data = data[0:]
full = pd.merge(data, countries, on=['country'], how='inner')
capitals = full[['country','capital','latitude','longitude']]
capitals.head()

country capital latitude longitude
0 Afghanistan Kabul 34.516667 69.183333
1 Albania Tirana 41.316667 19.816667
2 Algeria Algiers 36.750000 3.050000
3 Anguilla The Valley 18.216667 -63.050000
4 Argentina Buenos Aires -34.583333 -58.666667
#q3 What are the country names in capitals.json?
def get_country():
    res = []
    for i in capitals['country']:
        res.append(i)
    return sorted(res)
get_country()
['Afghanistan',
 'Albania',
 'Algeria',
 'Anguilla',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia',
 'Botswana',
 'Brazil',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cape Verde',
 'Cayman Islands',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Czech Republic',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Fiji',
 'Finland',
 'France',
 'French Polynesia',
 'Gabon',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Grenada',
 'Guam',
 'Guatemala',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Haiti',
 'Honduras',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 'Iran',
 'Iraq',
 'Ireland',
 'Israel',
 'Italy',
 'Jamaica',
 'Japan',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Kuwait',
 'Kyrgyzstan',
 'Laos',
 'Latvia',
 'Lebanon',
 'Lesotho',
 'Liberia',
 'Libya',
 'Liechtenstein',
 'Lithuania',
 'Luxembourg',
 'Madagascar',
 'Malawi',
 'Malaysia',
 'Maldives',
 'Mali',
 'Malta',
 'Marshall Islands',
 'Mauritania',
 'Mauritius',
 'Mexico',
 'Moldova',
 'Monaco',
 'Mongolia',
 'Morocco',
 'Mozambique',
 'Namibia',
 'Nepal',
 'Netherlands',
 'New Caledonia',
 'New Zealand',
 'Nicaragua',
 'Niger',
 'Nigeria',
 'Norway',
 'Oman',
 'Pakistan',
 'Palau',
 'Panama',
 'Papua New Guinea',
 'Paraguay',
 'Peru',
 'Philippines',
 'Poland',
 'Portugal',
 'Puerto Rico',
 'Qatar',
 'Romania',
 'Russia',
 'Rwanda',
 'Saint Helena',
 'Saint Lucia',
 'Saint Vincent and the Grenadines',
 'Samoa',
 'San Marino',
 'Saudi Arabia',
 'Senegal',
 'Seychelles',
 'Sierra Leone',
 'Singapore',
 'Slovenia',
 'Somalia',
 'South Africa',
 'Spain',
 'Sri Lanka',
 'Sudan',
 'Suriname',
 'Swaziland',
 'Sweden',
 'Switzerland',
 'Syria',
 'Taiwan',
 'Tajikistan',
 'Tanzania',
 'Thailand',
 'Togo',
 'Tonga',
 'Tunisia',
 'Turkey',
 'Turkmenistan',
 'Uganda',
 'Ukraine',
 'United Arab Emirates',
 'United Kingdom',
 'United States',
 'Uruguay',
 'Uzbekistan',
 'Vanuatu',
 'Venezuela',
 'Vietnam',
 'Yemen',
 'Zambia',
 'Zimbabwe']
#q4 What is the capital of Cuba?
def get_capital():
    capitals_country = capitals[capitals.country == 'Cuba']
    return next(iter(capitals_country['capital']))
get_capital()
'Havana'
#q5 Which country's capital is Tbilisi?
def get_countries():
    capital = capitals[capitals.capital == 'Tbilisi']
    return next(iter(capital['country']))
get_countries()
'Georgia'
#q6 Which 7 countries have the southern-most capitals?
def get_southern_most(): 
    res = []
    for i in capitals.sort_values(by='latitude',ascending=True)[:7]['country']:
        res.append(i)
    return res
get_southern_most()
['New Zealand',
 'Australia',
 'Uruguay',
 'Argentina',
 'Chile',
 'Lesotho',
 'Swaziland']
#q7 Which 5 countries have the northern-most capitals?
def get_northern_most(): 
    res = []
    for i in capitals.sort_values(by='latitude',ascending=False)[:5]['country']:
        res.append(i)
    return res
get_northern_most()
['Iceland', 'Finland', 'Norway', 'Estonia', 'Sweden']
#q8 For birth-rate and death-rate, what are various summary statistics (e.g., mean, max, standard deviation, etc)?
df = pd.DataFrame(data = {'birth-rate':countries['birth-rate'],
                         'death-rate':countries['death-rate']})
df.describe()

birth-rate death-rate
count 174.000000 174.000000
mean 22.463851 9.625172
std 11.278992 5.187143
min 8.250000 2.410000
25% 12.597500 6.027500
50% 20.010000 8.230000
75% 29.860000 11.715000
max 50.730000 29.740000
#q9 For columns literacy and phones, what are various summary statistics?
countries['literacy'] = countries['literacy'].str.replace(',', '.', regex=False)
countries['literacy'] = countries['literacy'].astype('float')

countries['phones'] = countries['phones'].str.replace(',', '.', regex=False)
countries['phones'] = countries['phones'].astype('float')

df = pd.DataFrame(data = {'literacy':countries['literacy'],
                         'phones':countries['phones']})
df.describe()

literacy phones
count 174.000000 174.000000
mean 82.362069 215.770690
std 20.003166 218.713811
min 17.600000 1.300000
25% 70.100000 32.000000
50% 92.100000 140.350000
75% 98.000000 331.750000
max 100.000000 1035.600000
#q10 What is the largest land-locked country in Europe?
def get_EU_largest_land_country():
    countries_EU_land = countries.loc[(countries['continent'] == 'Europe') & (countries['coastline'] == 0)]
    countries_EU_land_largest = countries_EU_land.sort_values(by='area',ascending=False).iloc[0]['country']
    return countries_EU_land_largest
get_EU_largest_land_country()
'Belarus'
#q11 What is the largest land-locked country in Asia?
def get_AS_largest_land_country():
    countries_AS_land = countries.loc[(countries['continent'] == 'Asia') & (countries['coastline'] == 0)]
    countries_AS_land_largest = countries_AS_land.sort_values(by='area',ascending=False).iloc[0]['country']
    return countries_AS_land_largest
get_AS_largest_land_country()
'Kazakhstan'
#q12 What is the most populous land-locked country in Asia?
def get_AS_most_pop_land_country():
    countries_AS_land = countries.loc[(countries['continent'] == 'Asia') & (countries['coastline'] == 0)]
    countries_AS_land_most_pop = countries_AS_land.sort_values(by='population',ascending=False).iloc[0]['country']
    return countries_AS_land_most_pop
get_AS_most_pop_land_country()
'Afghanistan'
# Haversine Formula (https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points)

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 3956 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
#q13 What is the distance between Camp Randall Stadium and the Wisconsin State Capital?

Randall_lon = -89.411893
Randall_lat = 43.070231
Wisconsin_lon = -89.384113
Wisconsin_lat = 43.074645

haversine(Randall_lon, Randall_lat, Wisconsin_lon, Wisconsin_lat)
1.433899492072933
#q14 What is the distance between France and Japan?

def get_distance(country1, country2):
    country1_lon = capitals.loc[(capitals['country'] == country1)].iloc[0]['longitude']
    country1_lat = capitals.loc[(capitals['country'] == country1)].iloc[0]['latitude']
    country2_lon = capitals.loc[(capitals['country'] == country2)].iloc[0]['longitude']
    country2_lat = capitals.loc[(capitals['country'] == country2)].iloc[0]['latitude']
    return haversine(country1_lon, country1_lat, country2_lon, country2_lat)
get_distance('France', 'Japan')
6032.330932363535
#q15 What are the distances between Guam, New Zealand, and Australia?

country_list = ["Guam", "New Zealand", "Australia"]

def get_distances(country_list):
    data = {}
    for country1 in country_list:
        tmp = []
        for country2 in country_list:
            if country1 == country2:
                tmp.append(np.nan)
            else:
                tmp.append(get_distance(country1, country2))
        data[country1] = tmp
    df = pd.DataFrame(data, index = country_list)
    return df
get_distances(country_list)

Guam New Zealand Australia
Guam NaN 4239.314912 3377.103870
New Zealand 4239.314912 NaN 1445.147895
Australia 3377.103870 1445.147895 NaN
#q16 What is the distance between every pair of countries in the Australian continent?

def get_distance_Australian():
    country_list = []
    country_AU = countries.loc[(countries['continent'] == 'Australia')]
    for i in country_AU['country']:
        country_list.append(i)
    return get_distances(country_list)
get_distance_Australian()

Marshall Islands French Polynesia New Zealand Vanuatu Palau Papua New Guinea Guam Samoa New Caledonia Australia Fiji Tonga
Marshall Islands NaN 3159.856137 3348.725378 1727.328621 2516.301728 2019.144119 1861.584428 1849.150692 2054.905160 3263.901948 1806.972599 2151.129410
French Polynesia 3159.856137 NaN 2668.807328 2765.339382 5447.018761 4265.757856 4961.131713 1496.953816 2864.497308 3920.570216 2102.135722 1686.539411
New Zealand 3348.725378 2668.807328 NaN 1671.641167 4218.188412 2767.607095 4239.314912 2062.548020 1400.483623 1445.147895 1613.916992 1510.219450
Vanuatu 1727.328621 2765.339382 1671.641167 NaN 2882.041892 1527.495923 2687.395771 1349.904435 335.602971 1687.810709 663.967196 1098.016273
Palau 2516.301728 5447.018761 4218.188412 2882.041892 NaN 1453.306058 800.243081 3956.305436 2974.892838 3099.023227 3468.037102 3931.843110
Papua New Guinea 2019.144119 4265.757856 2767.607095 1527.495923 1453.306058 NaN 1591.171595 2789.123270 1552.708901 1786.749660 2174.712829 2623.742354
Guam 1861.584428 4961.131713 4239.314912 2687.395771 800.243081 1591.171595 NaN 3520.883355 2871.868601 3377.103870 3165.570014 3617.512808
Samoa 1849.150692 1496.953816 2062.548020 1349.904435 3956.305436 2789.123270 3520.883355 NaN 1542.152644 2840.091065 716.316054 553.393188
New Caledonia 2054.905160 2864.497308 1400.483623 335.602971 2974.892838 1552.708901 2871.868601 1542.152644 NaN 1375.924712 825.890876 1179.069567
Australia 3263.901948 3920.570216 1445.147895 1687.810709 3099.023227 1786.749660 3377.103870 2840.091065 1375.924712 NaN 2145.768971 2362.766922
Fiji 1806.972599 2102.135722 1613.916992 663.967196 3468.037102 2174.712829 3165.570014 716.316054 825.890876 2145.768971 NaN 463.843872
Tonga 2151.129410 1686.539411 1510.219450 1098.016273 3931.843110 2623.742354 3617.512808 553.393188 1179.069567 2362.766922 463.843872 NaN
#q17 What is the most central country in the Australian continent?

def get_most_central_AU():
    df_AU = get_distance_Australian()
    df_AU_mean = df_AU.mean(axis = 1, numeric_only = True)
    df_AU_mean = df_AU_mean.sort_values(ascending=True)
    return df_AU_mean.index[0]

get_most_central_AU()
'Vanuatu'
#q18 What is the least central country in Australia?

def get_least_central_AU():
    df_AU = get_distance_Australian()
    df_AU_mean = df_AU.mean(axis = 1, numeric_only = True)
    df_AU_mean = df_AU_mean.sort_values(ascending=False)
    return df_AU_mean.index[0]

get_least_central_AU()
'French Polynesia'
#q19 How close is each country in Australia to it's nearest neighbor?

def AU_nearest_neighbor():
    df_AU = get_distance_Australian()
    df_AU_idxmin = df_AU.idxmin()
    df_AU_min = df_AU.min()
    country_list = df_AU_min.index
    data = {'nearest':df_AU_idxmin.values, 'distance':df_AU_min.values}
    df = pd.DataFrame(data, index = country_list)
    return df
AU_nearest_neighbor()

nearest distance
Marshall Islands Vanuatu 1727.328621
French Polynesia Samoa 1496.953816
New Zealand New Caledonia 1400.483623
Vanuatu New Caledonia 335.602971
Palau Guam 800.243081
Papua New Guinea Palau 1453.306058
Guam Palau 800.243081
Samoa Tonga 553.393188
New Caledonia Vanuatu 335.602971
Australia New Caledonia 1375.924712
Fiji Tonga 463.843872
Tonga Fiji 463.843872
#q20 How far is each country in Australia to it's furthest neighbor?

def AU_furthest_neighbor():
    df_AU = get_distance_Australian()
    df_AU_idxmax = df_AU.idxmax()
    df_AU_max = df_AU.max()
    country_list = df_AU_max.index
    data = {'furthest':df_AU_idxmax.values, 'distance':df_AU_max.values}
    df = pd.DataFrame(data, index = country_list)
    return df
AU_furthest_neighbor()

furthest distance
Marshall Islands New Zealand 3348.725378
French Polynesia Palau 5447.018761
New Zealand Guam 4239.314912
Vanuatu Palau 2882.041892
Palau French Polynesia 5447.018761
Papua New Guinea French Polynesia 4265.757856
Guam French Polynesia 4961.131713
Samoa Palau 3956.305436
New Caledonia Palau 2974.892838
Australia French Polynesia 3920.570216
Fiji Palau 3468.037102
Tonga Palau 3931.843110
import sqlite3
conn = sqlite3.connect('countries.db')
countries.to_sql("countries", conn, if_exists="replace", index=False)
capitals.to_sql("capitals", conn, if_exists="replace", index=False)
#q21 In our dataset, how many countries are in each continent?
df = pd.read_sql('select continent, count(country) AS `number of countries` from countries group by continent order by `number of countries`',conn)
df

continent number of countries
0 Australia 12
1 South America 12
2 North America 24
3 Asia 38
4 Europe 41
5 Africa 47
#q22 How many countries are in each continent? (answer with a plot)
#plt.figure(figsize = (10,3)) 
plt.bar(df['continent'],df['number of countries'],fc = 'black')
plt.xlabel('continent')
plt.ylabel('number of countries')
plt.xticks(rotation=90)
([0, 1, 2, 3, 4, 5], <a list of 6 Text xticklabel objects>)

png

#q23 What is the total population of each continent?
df_pop = pd.read_sql('select continent,sum(population) AS `total population` from countries group by continent order by `total population` DESC',conn)
bl = df_pop['total population']/1000000
df_pop.loc[:,'total population (million)'] = bl
df_pop.drop(columns = 'total population',inplace = True)
df_pop

continent total population (million)
0 Asia 3739.902863
1 Africa 807.299195
2 Europe 792.053486
3 North America 515.041558
4 South America 375.441666
5 Australia 32.163025
#q24 What is the total population of each continent? (answer with a plot)
#plt.figure(figsize = (10,3)) 
plt.bar(df_pop['continent'],df_pop['total population (million)'],fc = 'black')
plt.xlabel('continent')
plt.ylabel('total population (million)')
plt.xticks(rotation=90)
([0, 1, 2, 3, 4, 5], <a list of 6 Text xticklabel objects>)

png

#q25 What is the population of each country in South America?
df_sou = pd.read_sql("select country,sum(population) AS population from countries where continent = 'South America' group by country order by population DESC ",conn)
df_sou

country population
0 Brazil 188078227
1 Colombia 43593035
2 Argentina 39921833
3 Peru 28302603
4 Venezuela 25730435
5 Chile 16134219
6 Ecuador 13547510
7 Bolivia 8989046
8 Paraguay 6506464
9 Uruguay 3431932
10 Guyana 767245
11 Suriname 439117
#q26 What are the top 10 countries that have the largest real GDP?
df_top = pd.read_sql("select country,population * `gdp-per-capita` AS `real GDP` from countries order by `real GDP` DESC ",conn)
df_top_10 = df_top.head(10).copy()
bl = df_top_10['real GDP']/1000000000 
df_top_10.loc[:,'real GDP (billion)'] = bl
df_top_10.drop(columns = 'real GDP',inplace = True)
df_top_10

country real GDP (billion)
0 United States 11281.191327
1 China 6569.868565
2 Japan 3594.473830
3 India 3176.520785
4 Germany 2274.855452
5 France 1680.181354
6 United Kingdom 1678.873538
7 Italy 1552.164690
8 Brazil 1429.394525
9 Russia 1271.752506
#q27 What are the top 10 countries that have the largest real GDP? (answer with a plot)
ot = sum(df_top[10:]['real GDP'])/1000000000
df_top_10.loc['new'] = ['Other',ot]
df_top_10 = df_top_10.copy()
#plt.figure(figsize = (10,3))
plt.bar(df_top_10['country'],df_top_10['real GDP (billion)'],fc = 'black')
plt.xlabel('contry')
plt.ylabel('GDP (billions)')

plt.ylim((0,20000))

my_y_ticks = np.arange(0, 20000,5000)
plt.yticks(my_y_ticks) 
plt.xticks(rotation=90)
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], <a list of 11 Text xticklabel objects>)

png

#q28 What are the top 15 countries that have the largest gap between birth rate and death rate?
df_gap = pd.read_sql("select country,`birth-rate`,`death-rate` from countries order by (`birth-rate`-`death-rate`) DESC  ",conn)
df_gap.head(15)

country birth-rate death-rate
0 Uganda 47.35 12.24
1 Yemen 42.89 8.30
2 Mali 49.82 16.89
3 Oman 36.24 3.81
4 Madagascar 41.41 11.11
5 Burkina Faso 45.62 15.60
6 Niger 50.73 20.91
7 Chad 45.73 16.38
8 Mauritania 40.99 12.16
9 Burundi 42.22 13.46
10 Comoros 36.93 8.20
11 Somalia 45.13 16.63
12 Marshall Islands 33.05 4.78
13 Maldives 34.81 7.06
14 Togo 37.01 9.83
#q29 What are the top 15 countries that have the largest gap between birth rate and death rate? (answer with a plot)
df_gap.head(15)
ax = df_gap.head(15).plot.barh()
ax.set_xlabel("rate")
ax.set_yticklabels(df_gap.head(15).country)
[Text(0, 0, 'Uganda'),
 Text(0, 0, 'Yemen'),
 Text(0, 0, 'Mali'),
 Text(0, 0, 'Oman'),
 Text(0, 0, 'Madagascar'),
 Text(0, 0, 'Burkina Faso'),
 Text(0, 0, 'Niger'),
 Text(0, 0, 'Chad'),
 Text(0, 0, 'Mauritania'),
 Text(0, 0, 'Burundi'),
 Text(0, 0, 'Comoros'),
 Text(0, 0, 'Somalia'),
 Text(0, 0, 'Marshall Islands'),
 Text(0, 0, 'Maldives'),
 Text(0, 0, 'Togo')]

png

#q30 What is the relationship between death rates and birth rates? (answer with a plot)
plt.rcParams["font.size"] = 10

ax = df_gap.plot.scatter(x='death-rate', y='birth-rate', c='black')
ax.set_xlabel('death-rate')
Text(0.5, 0, 'death-rate')

png

#q31 Did you notice the six outliers? Which countries are they?
df_part = df_gap[df_gap['death-rate']>16]
df_part = df_part[df_part['birth-rate'] < 30]
df_1 = df_part[['country','death-rate','birth-rate']]
df_1.sort_values(by=['death-rate']).reset_index(drop=True)

country death-rate birth-rate
0 Namibia 18.86 24.32
1 Zimbabwe 21.84 28.01
2 South Africa 22.00 18.20
3 Lesotho 28.71 24.75
4 Botswana 29.50 23.08
5 Swaziland 29.74 27.41
#q32 What is the correlation between death rates and birth rates?
countries.corr()['death-rate']["birth-rate"]
0.3965086117752512
#q33 What is the relationship between GDP per capita and the number of phones per capita?
plt.rcParams["font.size"] = 10
ax = countries.plot.scatter(x='gdp-per-capita', y='phones', c='black')

png

#q34 What is the correlation between GDP per capita and number of phones per capita?
countries.corr()['phones']['gdp-per-capita']
0.8786946640591311
#q35 What is the relationship between literacy and GDP per capita?
plt.rcParams["font.size"] = 10
ax = countries.plot.scatter(x='literacy', y='gdp-per-capita', c='black')

png

#q36 What is the relationship between GDP per capita and death rates?:
plt.rcParams["font.size"] = 10
ax = countries.plot.scatter(x='gdp-per-capita', y='death-rate',c='black')

png

#q37 What is the relationship between GDP per capita and birth rates?
plt.rcParams["font.size"] = 10
ax = countries.plot.scatter(x='gdp-per-capita', y='birth-rate',c='black')

png

#q38 What is the relationship between GDP per capita and the inverse of birth rates?
countries['inverse'] = 1/countries['birth-rate']
plt.rcParams["font.size"] = 10
ax = countries.plot.scatter(x='gdp-per-capita', y='inverse',c='black')

png

#q39 fit a y = m * x + n line to the scatter plot in question 38 using the least-squares method; what are the coefficients m and n?
countries["1"] = 1
res = np.linalg.lstsq(countries[["gdp-per-capita", "1"]], countries["inverse"], rcond=None)

coefficients = res[0] # coefficients is (m,n:
m = coefficients[0] # slope
n = coefficients[1] # intercept

coef_tuple = (m,n)
coef_tuple
(1.8365703926233426e-06, 0.03916624252079354)
#q40 plot the regression line to the scatter plot
ax = countries.plot.scatter(x='gdp-per-capita', y='inverse', c = 'gray')
countries["fit"] = countries["gdp-per-capita"] * m + n
countries.plot.line(x='gdp-per-capita', y='fit', c='red', ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x1a6a75308d0>

png

conn.close()

  TOC