Web Scraping Wiki Tables with Beautiful Soup

Modified: April 7, 2023

Web scraping and extracting the lists of municipalities, ghost towns, and unincorporated places in Texas from Wikipedia.

Packages needed
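The scraping itself only needs requests and Beautiful Soup; pandas is imported as well because it comes in handy later for loading the saved files. If any of these are missing, something like pip install requests beautifulsoup4 pandas should cover it.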

Code
import os
import requests
import pandas as pd  # handy later for loading the saved files
from bs4 import BeautifulSoup

List of Municipalities in Texas

Code
def table_extraction(url, name, table_class="wikitable sortable"):
    """Scrape the first matching wiki table and save it as pipe-delimited text."""
    file = f"list of texas places/{name}.txt"
    response = requests.get(url)
    response.raise_for_status()  # fail fast if the page could not be fetched
    soup = BeautifulSoup(response.text, "html.parser")  # parse the page as HTML
    table = soup.find("table", class_=table_class)
    os.makedirs("list of texas places", exist_ok=True)  # make sure the output folder exists
    with open(file, "w", encoding="utf-8") as f:
        for row in table.find_all("tr"):
            # join the <td> cell texts with "|" so each table row becomes one line
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            f.write("|".join(cells) + "\n")

table_extraction(
    "https://en.wikipedia.org/wiki/List_of_municipalities_in_Texas",
    "List of municipalities in Texas",
)

List of Unincorporated Communities in Texas

Code
# The same table_extraction function handles this page; its table just carries
# the extra mw-collapsible class, so the full class string is passed explicitly.
table_extraction(
    "https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Texas",
    "List of unincorporated communities in Texas",
    table_class="wikitable sortable mw-collapsible",
)

List of Ghost Towns in Texas

Code
# Ghost towns use the default "wikitable sortable" class, so no extra argument is needed.
table_extraction(
    "https://en.wikipedia.org/wiki/List_of_ghost_towns_in_Texas",
    "List of ghost towns in Texas",
)