Simple Web Scraper (Python, BeautifulSoup)
```python
import requests
from bs4 import BeautifulSoup
# 1. Define the URL you want to scrape
url = "https://en.wikipedia.org/wiki/List_of_programming_languages"
# 2. Make an HTTP request to get the page content
try:
    response = requests.get(url, timeout=10)  # timeout keeps the request from hanging indefinitely
    response.raise_for_status()  # Raise HTTPError for bad responses (4XX or 5XX)
except requests.exceptions.RequestException as e:
    print(f"Error fetching the page: {e}")
    exit()  # Exit if the request failed
# 3. Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")
# 4. Find the elements containing the data you want
# In this case, we target list items (<li>) inside a <ul> within a <div>
# of class "div-col" -- here, the "A" section of the Wikipedia page.
# Inspect the HTML structure of the target page ("Inspect" in your browser)
# to find the correct elements, and adjust the selector as needed.
columns = soup.find_all("div", class_="div-col")
language_list_items = columns[0].find_all("ul")[0].find_all("li")  # Adjust the [0] indexes based on the observed layout
# 5. Extract the desired data from the elements
programming_languages = [] # Store the scraped languages
for item in language_list_items:
    # Extract the text from the list item. It might contain a link, or just text.
    # If there is an <a> link inside the <li>, get the text from the link.
    # Otherwise, get the text directly from the <li>.
    link = item.find("a")
    if link:
        language_name = link.text.strip()
    else:
        language_name = item.text.strip()  # strip() removes leading/trailing whitespace
    programming_languages.append(language_name)
# 6. Print or process the extracted data
print("Programming Languages:")
for language in programming_languages:
    print(language)
# Optional: Save the data to a file
# with open("programming_languages.txt", "w", encoding="utf-8") as f:
#     for language in programming_languages:
#         f.write(language + "\n")
print("\nScraping complete.")
```
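If you prefer structured output over the plain-text file in the optional step above, a minimal sketch using the standard-library `csv` module is shown below. It continues from the script above (it reuses `programming_languages`); the filename and header row are illustrative choices, not part of the original script.
```python
import csv

# Hypothetical alternative to the plain-text save above: one language
# per row with a header, using utf-8 to stay safe with Unicode names.
# Assumes `programming_languages` from the script above.
with open("programming_languages.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["language"])  # header row (illustrative)
    for language in programming_languages:
        writer.writerow([language])
```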
Key improvements and explanations:
* **Clearer Error Handling:** A `try...except` block handles potential `requests` exceptions (e.g., network errors, invalid URLs), and a `timeout` keeps the request from hanging indefinitely. `response.raise_for_status()` is crucial: it raises an exception if the HTTP request returns an error code (4xx or 5xx), preventing the script from continuing with invalid data. This makes the scraper much more robust. The error message explains *what* went wrong, and exiting after a failure prevents further issues.
* **Robust Element Selection:** Instead of indiscriminately selecting *all* `<li>` elements on the page, the code targets a specific `<div>` with the class `div-col`, finds a `<ul>` within it, and then its `<li>` children. This dramatically reduces noise and gets directly to the data, based on inspecting the actual HTML source of the target URL. Crucially, these selectors will likely need frequent adjustment as the website's structure changes; see the CSS-selector sketch after this list for a more defensive alternative.
* **Improved Data Extraction:** The code now explicitly checks if the `<li>` element contains an `<a>` tag (a link). If it does, it extracts the text from the link; otherwise, it extracts the text directly from the `<li>`. This handles cases where the language name is within a link or just plain text. `strip()` is used to remove leading and trailing whitespace from the extracted text, resulting in cleaner output.
* **Targeted Element Search:** The example targets the `<div>` containing the "A" list. The index `[0]` refers to the first `div` with the class `div-col` found on the page; if the "A" list is in a different `div`, adjust the index accordingly. Website structure changes over time, so inspect the HTML source whenever the scraper stops working. As written, the code scrapes only the first column.
* **Comments and Clarity:** Extensive comments explain each step, making the code easier to understand.
* **Encoding Handling:** The optional file saving now uses `encoding="utf-8"` to handle Unicode characters correctly, preventing potential encoding errors.
* **Concise Output:** Prints the extracted programming languages in a more readable format.
* **Modular Design:** The code is organized into logical blocks, making it easier to modify and extend; a function-based refactor is sketched after the run instructions below.
* **Example URL:** Uses a well-known and relatively stable URL (Wikipedia) for demonstration.
* **Important disclaimer:** Websites frequently change their HTML structure. This script assumes the target page's structure at the time it was written. You'll very likely need to adjust the `soup.find_all()` calls, especially the `class_` parameter, to match the current structure of the page. Use your browser's "Inspect" or "View Source" tool to examine the HTML.
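As referenced in the selection bullets above, the same chain can be expressed as a single CSS selector via BeautifulSoup's `soup.select()`, with a guard so a structure change produces a clear message instead of an `IndexError`. A minimal sketch, reusing the `soup` object from the script above; the selector string is an assumption about the page's current markup, and unlike the indexed `[0]` version it gathers items from *every* `div-col` column:
```python
# Equivalent selection with a CSS selector instead of chained find_all()
# calls. "div.div-col ul li" is an assumption about the page's markup;
# re-inspect the page if this matches nothing.
items = soup.select("div.div-col ul li")
if not items:
    raise SystemExit("Selector matched nothing; the page structure may have changed.")

languages = []
for item in items:
    link = item.find("a")
    # Prefer the link text when present; get_text(strip=True) trims whitespace.
    name = link.get_text(strip=True) if link else item.get_text(strip=True)
    if name:
        languages.append(name)
```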
How to run this code:
1. **Save:** Save the code as a `.py` file (e.g., `scraper.py`).
2. **Install Libraries:** Open a terminal or command prompt and install the required libraries:
```bash
pip install requests beautifulsoup4
```
3. **Run:** Execute the script from your terminal:
```bash
python scraper.py
```
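To make the script easier to reuse (the modular-design point above), the fetch-and-parse logic can be wrapped in a function behind a standard entry-point guard. A sketch, assuming the same URL and selector as the main script and Python 3.9+ for the `list[str]` annotation:
```python
import requests
from bs4 import BeautifulSoup

def scrape_languages(url: str) -> list[str]:
    """Fetch `url` and return the text of list items under div.div-col."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    names = []
    for item in soup.select("div.div-col ul li"):
        link = item.find("a")
        names.append(link.get_text(strip=True) if link else item.get_text(strip=True))
    return names

if __name__ == "__main__":
    for name in scrape_languages("https://en.wikipedia.org/wiki/List_of_programming_languages"):
        print(name)
```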
Remember to adapt the code to the specific website you want to scrape. Web scraping is fragile: websites change their structure frequently, requiring you to update your code accordingly. Also be mindful of a site's terms of service and its robots.txt, and avoid overloading its servers; a politeness sketch follows below.
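For the robots.txt and rate-limiting points above, the standard library's `urllib.robotparser` can check whether a path is allowed, and a descriptive User-Agent plus a short delay between requests keeps the scraper polite. A minimal sketch; the User-Agent string and the delay value are illustrative choices:
```python
import time
from urllib.robotparser import RobotFileParser

import requests

url = "https://en.wikipedia.org/wiki/List_of_programming_languages"

# Check robots.txt before fetching; "*" matches any user agent.
rp = RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

if not rp.can_fetch("*", url):
    raise SystemExit("robots.txt disallows fetching this URL.")

# An identifying User-Agent and a pause between requests are basic courtesy.
headers = {"User-Agent": "example-scraper/0.1 (contact: you@example.com)"}  # illustrative
response = requests.get(url, headers=headers, timeout=10)
time.sleep(1)  # pause before any follow-up request to the same site
```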