1 Web Scraping Wikipedia tables with Beautiful Soup
This section is a guide to scraping tabular data from Wikipedia. The goal is to build a pipeline that identifies Texas' ghost towns, unincorporated areas (i.e., colonias), and municipalities. Each code block below defines a function, states its purpose and expected outcome, and is followed by the call that executes it and writes the extracted rows to a text file, which can then be converted to JSON for storage. Before running anything, it is recommended to inspect the HTML layout of the page you are scraping; a sketch of that inspection step follows the package imports below.
1.0.1 Packages needed
Code
import requests
import pandas as pd
import json
from bs4 import BeautifulSoup
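Before extracting anything, it helps to confirm which CSS classes Wikipedia assigns to the table you are after. The sketch below is not part of the pipeline itself: it fetches a page (the URL is just an example) and prints the class list of every table on it, so you know what to pass to soup.find().

Code

import requests
from bs4 import BeautifulSoup

# Example page; swap in any article whose tables you want to inspect.
url = "https://en.wikipedia.org/wiki/List_of_municipalities_in_Texas"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

# Print each table's index and class attribute.
for i, table in enumerate(soup.find_all("table")):
    print(i, table.get("class"))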
List of Municipalities in Texas
Code
def table_extraction(url, title):
    file = f"list of texas places/{title}.txt"
    s = requests.get(url)
    soup = BeautifulSoup(s.text, "html.parser")  # parse the response body as HTML
    # Grab the first table styled as a sortable wikitable.
    table = soup.find("table", class_="wikitable sortable")
    with open(file, "w", encoding="utf-8") as f:
        dataString = ""
        for row in table.find_all("tr"):
            # Header rows (<th>) are skipped; only data cells (<td>) are written.
            for td in row.find_all("td"):
                dataString = dataString + td.get_text(strip=True) + "|"
            dataString = dataString + "\n"  # one table row per line in the file
        f.write(dataString)

def do_all(url, title):
    table_extraction(url, title)

do_all("https://en.wikipedia.org/wiki/List_of_municipalities_in_Texas",
       "List of municipalities in Texas")
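pandas is imported above but never used in the extraction itself. Below is a minimal sanity-check sketch, assuming the file written by the call above and assuming every row has the same number of cells: it reads the pipe-delimited output back into a DataFrame. Because every row ends with a trailing "|", the last parsed column is always empty and is dropped.

Code

import pandas as pd

# Read the pipe-delimited rows back in; header=None because the
# extraction above skips the table's header row.
df = pd.read_csv(
    "list of texas places/List of municipalities in Texas.txt",
    sep="|",
    header=None,
)
df = df.dropna(axis=1, how="all")  # drop the empty column from the trailing "|"
print(df.head())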
List of Unincorporated Communities in Texas
Code
def table_extraction(url, title):
    file = f"list of texas places/{title}.txt"
    s = requests.get(url)
    soup = BeautifulSoup(s.text, "html.parser")  # parse the response body as HTML
    # This page's table is collapsible, so its class string differs.
    table = soup.find("table", class_="wikitable sortable mw-collapsible")
    with open(file, "w", encoding="utf-8") as f:
        dataString = ""
        for row in table.find_all("tr"):
            # Header rows (<th>) are skipped; only data cells (<td>) are written.
            for td in row.find_all("td"):
                dataString = dataString + td.get_text(strip=True) + "|"
            dataString = dataString + "\n"  # one table row per line in the file
        f.write(dataString)

def do_all(url, title):
    table_extraction(url, title)

do_all("https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Texas",
       "List of unincorporated communities in Texas")
List of Ghost Towns in Texas
Code
def table_extraction(url, title):
    file = f"list of texas places/{title}.txt"
    s = requests.get(url)
    soup = BeautifulSoup(s.text, "html.parser")  # parse the response body as HTML
    # Grab the first table styled as a sortable wikitable.
    table = soup.find("table", class_="wikitable sortable")
    with open(file, "w", encoding="utf-8") as f:
        dataString = ""
        for row in table.find_all("tr"):
            # Header rows (<th>) are skipped; only data cells (<td>) are written.
            for td in row.find_all("td"):
                dataString = dataString + td.get_text(strip=True) + "|"
            dataString = dataString + "\n"  # one table row per line in the file
        f.write(dataString)

def do_all(url, title):
    table_extraction(url, title)

do_all("https://en.wikipedia.org/wiki/List_of_ghost_towns_in_Texas",
       "List of ghost towns in Texas")
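The introduction mentions storing the data as JSON, but the functions above only write pipe-delimited .txt files. Below is a minimal sketch of that final step, assuming the three files written above and a hypothetical output name texas_places.json; treating the first cell of each row as the place name is an assumption, not something the pages guarantee.

Code

import json

# Map each place type to the .txt file written for it above.
files = {
    "municipality": "List of municipalities in Texas",
    "unincorporated community": "List of unincorporated communities in Texas",
    "ghost town": "List of ghost towns in Texas",
}

places = []
for kind, name in files.items():
    with open(f"list of texas places/{name}.txt", encoding="utf-8") as f:
        for line in f:
            cells = [c for c in line.strip().split("|") if c]
            if cells:  # skip blank lines
                # Assumption: the first cell holds the place name.
                places.append({"name": cells[0], "type": kind, "cells": cells})

with open("texas_places.json", "w", encoding="utf-8") as out:
    json.dump(places, out, ensure_ascii=False, indent=2)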