
Building a Web Scraper with Python

Jv Cyberguard




I've been working on my Python skills for cybersecurity, taking on projects that expose me to different facets of Python that could meet security use cases. Today I will be building a web scraper to crawl a website. We often use these when we don't have an API that allows us to easily interact with an application; even without an API endpoint, there is often still a need to parse HTML pages for the information we want. Are you familiar with CeWL?


CeWL (Custom Word List generator) is a Ruby app which spiders a given URL to a specified depth, optionally following external links, and returns a list of words which can then be used for password cracking, bucket discovery, and reconnaissance in general. Learn more about CeWL at https://github.com/digininja/CeWL.


The objective of this lab is to gain familiarity with BeautifulSoup for web scraping. In this lab, I am going to start building a web scraper that, like CeWL, will:


  1. Scrape not just the initial home page, but also iteratively scrape the URLs found on it.

  2. Look for email addresses affiliated with a target domain.

  3. Curate a list of words that could be used in password cracking attempts.


If I were to write pseudocode for this, it would look like:


Import the necessary modules: requests, re, and BeautifulSoup from bs4.

Create a function to scrape for words and append them to a text file.

Create a function to scrape for emails and append them to a text file.

Create a function to scrape for URLs from the webpage and return the list.

Specify the domain.

Send a GET request to the page.

Store the response in a variable.

Pass the HTTP response data to the functions created above to scrape data from the webpage.




The documentation I will be using along the way includes the Requests and BeautifulSoup docs.


For this project I asked ChatGPT to create a webpage with emails and a good amount of content.


It gave me this initially.



However, I prompted it for more advanced styling and it rendered this webpage.





I am now going to locally host this website, so we can leverage the requests module.
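A simple way to do this, assuming index.html sits in the current directory, is Python's built-in http.server module, which serves the folder on port 8000:

python -m http.server 8000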



In our Python code, our URL will be the one shown in the browser: http://localhost:8000/.



To begin, we need to import a few modules. We use the requests module to make HTTP requests to the website, and we will also leverage BeautifulSoup from bs4 to parse the HTML. The GET request to our locally hosted page was successful. I am going to continue building out the functions as specified in the pseudocode above.
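For reference, here is the setup the rest of this walkthrough builds on: the imports, the target URL, and a quick GET request to confirm the page is reachable (this mirrors the full program at the end of the post):

import re
import string

import requests
from bs4 import BeautifulSoup

url = "http://localhost:8000/"

response = requests.get(url)
print(response.status_code)  # 200 means the GET request was successful
soup = BeautifulSoup(response.text, "html.parser")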



create_url_list() function


One of the functions we want to create first extracts all URLs found within a page's <a> tags. According to the BeautifulSoup docs, one way we can do it is below:
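This is the canonical example from the BeautifulSoup documentation: find_all('a') returns every anchor tag, and get('href') pulls out the link target:

for link in soup.find_all('a'):
    print(link.get('href'))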



One thing to note is that some href attributes may not be strictly URLs but mailto: links. We want to handle these edge cases gracefully by excluding them, since we will have a separate function to scrape emails from the page. To address this I am going to create a conditional statement that uses the regex search method: if the link begins with 'mailto:' then we exclude it.

Here is the initial output:


The initial code that I used to remedy this is as follows.


def create_url_list(parsed_response: BeautifulSoup):
    # Extract all URLs found within the webpage's <a> tags
    with open("urls-targetdomain.txt", "a") as f:
        # Opens urls-targetdomain.txt, where we will write the URLs
        for link in parsed_response.find_all('a'):
            if re.search(r'^mailto:', link["href"]) is None:
                f.write(f"{url}{link['href']}\n")



Testing the function, I discovered that I had not accounted for absolute URLs that may also be found; that is why the last two entries in the output are not as expected. Therefore, I added another condition to my if statement, so it not only checks for links beginning with mailto: but also for absolute URLs, using the regex search method (re.search) to look for a string pattern beginning with http and only executing if no match is found.


I then separately handle the absolute URLs that begin with http but do not begin with mailto:. For these cases, we can write the URL as is.


def create_url_list(parsed_response: BeautifulSoup):
    # Extract all URLs found within the webpage's <a> tags
    with open("urls-targetdomain.txt", "a") as f:
        # Opens urls-targetdomain.txt, where we will write the URLs
        for link in parsed_response.find_all('a'):
            if (re.search(r'^mailto:', link["href"]) is None) and (re.search(r'^http', link["href"]) is None):
                # Relative URLs: written with the base URL prepended when the href
                # doesn't begin with mailto: or http
                f.write(f"{url}{link['href']}\n")
            elif not (re.search(r'^http', link["href"]) is None) and (re.search(r'^mailto:', link["href"]) is None):
                # Absolute URLs: written exactly as they appear in the href when http
                # matches and the mailto: pattern does not
                f.write(f"{link['href']}\n")
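As an aside, a more robust alternative (not used in the rest of this post) would be urllib.parse.urljoin, which resolves relative hrefs against a base URL and leaves absolute URLs untouched, removing the need for the regex branching. A minimal sketch, assuming the same file layout:

from urllib.parse import urljoin

def create_url_list_alt(parsed_response: BeautifulSoup, base_url: str):
    # urljoin(base, href) returns href unchanged when it is already absolute,
    # and joins it onto base_url when it is relative
    with open("urls-targetdomain.txt", "a") as f:
        for link in parsed_response.find_all('a'):
            href = link.get('href')
            if href is None or href.startswith('mailto:'):
                continue  # skip anchors without an href, and mailto: links
            f.write(f"{urljoin(base_url, href)}\n")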




create_wordlist() function


The next function we are going to create builds the word list. It should output a file of words scraped from the webpage. To get only the human-readable text between the tags, we use the get_text() method on the soup object. The separator argument lets us specify a string to join the bits of text together, and we set strip=True to remove excess whitespace.
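This is the call as it appears in the finished function:

# separator=" " joins the text fragments with spaces; strip=True trims whitespace
words = soup.get_text(separator=" ", strip=True)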


Learn more about the get_text() method in the BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/




It may not be totally obvious, but punctuation marks are still present in the text, and they will interfere with further string manipulation. I will show you two techniques we could use to clean up the text so that only words remain.


Option 1 - Regex


We substitute the punctuation marks using the regex function re.sub(). It accepts the regex pattern, the replacement string, and the data to be processed, and returns the cleaned data.

# re.sub matches any character that is NOT a word character (\w) or whitespace (\s),
# where \w covers letters, digits, and underscore
cleaned_text = re.sub(r'[^\w\s]', "", words)

More information on how this works can be found here: https://www.geeksforgeeks.org/re-sub-python-regex/


If you want to learn more about regex, the Python re module documentation is a good place to start: https://docs.python.org/3/library/re.html




Look at the text below. It now has no punctuation marks, just words and spaces.


Option 2 - str.translate()


For this method, make sure to import the string module at the top of the file.

import string

The str.maketrans() method allows us to create a mapping that includes:


  1. List of characters that we want to replace.

  2. List of characters that they need to be replaced with.

  3. List of characters that need to be removed.


In this case, our goal is simply to remove the punctuation marks from our text; we do not need to add or replace anything. Therefore, we create a translation table called translator whose first two arguments are empty strings, because we have no characters to replace, and pass string.punctuation as the third argument, because we want to remove all punctuation marks from the text.

Finally, we call the translate method passing the translation table we created as the argument.

translator = str.maketrans("", "", string.punctuation)
cleaned_text = words.translate(translator)

The result is that the words string, which contains all our text and punctuation, has the translate() method called on it, using the translator we made to remove the punctuation.
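A quick illustration on a made-up string (the sample text here is hypothetical):

import string

sample = "Hello, world! It's only a test."
translator = str.maketrans("", "", string.punctuation)
print(sample.translate(translator))  # prints: Hello world Its only a test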



Resume here after completing either of the above options. My code will reflect the regex method.


The text is now in a format that makes further processing easier. We will now use the split() string method, which returns a list with each word as an element. split() defaults to splitting on whitespace, so we do not have to specify a delimiter, since there is already whitespace between each word. We now have our list.






The use case for these kinds of wordlists is most often passwords, bucket discovery, and so on. Therefore, it would help if we could filter the list for words that are at least 4 letters long; that would eliminate fillers such as "the" and "are".


It's pretty simple: we create a variable called min_length set to 4, then build filtered_list using a list comprehension with a conditional.


min_length: int = 4
word_list = cleaned_text.split()
filtered_list = [word for word in word_list if len(word) >= min_length]

Let's see how much the size of the list went down.
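One quick way to compare the sizes (illustrative):

print(f"Before filtering: {len(word_list)} words")
print(f"After filtering:  {len(filtered_list)} words")
print(f"Removed:          {len(word_list) - len(filtered_list)} words")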



We reduced the size of our list by 104 words.




Our filtered list is what will be written to the file. Actually, let's do that now. To write it to a file, we open the file and iterate through the list, writing each element to the text file.

with open("wordlist-targetdomain.txt","a") as f:
 #opens file and loops through the filtered list for us to write our words to.
   for word in filtered_list:
       f.write(f"{word}\n")




create_emaillist() function


The final function we are going to create is create_emaillist(). It will write emails scraped from the website to an email-list-targetdomain.txt file. We will create a regex email pattern using re.compile(), and, for the first time in this program, we will use the findall() regex method, passing the email pattern and the extracted text from the webpage.


The regex pattern was a bit challenging to build, but the one constructed below does the job and accounts for most common email addresses.
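Here is the heart of the function, taken from the full program at the end of this post:

# Matches a local part, an @, a domain, and a TLD of at least two letters
email_pattern = re.compile(r'[a-zA-Z0-9._+%.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
email_list = re.findall(email_pattern, page_text)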




It successfully scraped and wrote the emails to the file.




See the emails on the actual webpage below.


We now have all our files being written to the folder, and we are successfully scraping emails, URLs, and words from a webpage.


Conclusion


It was great working through building out this program with you. The index.html code is below, as well as the code for the entire program. Use it as a reference if you do decide to tackle this project. If I had a longer weekend, I would have researched how to make my code accept command-line arguments, just like CeWL does. Additionally, because the code is so modular, it would be very easy to recursively scrape the subsequent URLs found on the initial webpage.
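For the curious, here is a minimal sketch of what that command-line handling could look like with argparse (the flag names are hypothetical, not CeWL's):

import argparse

parser = argparse.ArgumentParser(description="Scrape words, emails, and URLs from a site")
parser.add_argument("url", help="target URL to scrape")
parser.add_argument("-m", "--min-length", type=int, default=4,
                    help="minimum word length to keep in the wordlist")
args = parser.parse_args()
# args.url and args.min_length would then replace the hard-coded values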


As it currently stands, the main method to scrape the initial URL is as follows:


if __name__ == '__main__':
    parsed_page = get_page(url)
    create_url_list(parsed_page)
    create_wordlist(parsed_page)
    create_emaillist(parsed_page)

To scrape two levels deep (meaning the initial homepage and the URLs linked from it), you simply open the URL list text file and iterate through each URL, calling the same functions. Since each function writes to its file in "a" (append) mode, the number of emails, URLs, and words will accumulate.


if __name__ == '__main__':
    parsed_page = get_page(url)
    create_url_list(parsed_page)
    create_wordlist(parsed_page)
    create_emaillist(parsed_page)

    with open ("urls-targetdomain.txt","r") as file:
        for url in file:
            url = url.strip()
            parsed_page = get_page(url)
            create_url_list(parsed_page)
            create_wordlist(parsed_page)
            create_emaillist(parsed_page)


So that is definitely something to explore in the future. I really enjoyed working on this, though, as it allowed me to refresh and apply some other things I have been learning, such as string manipulation, requests, regex, and now scraping. Until next time!


-TheSocSpot



The entire code can be found below.


import re
import requests
from bs4 import BeautifulSoup
import string

url = "http://localhost:8000/"

def get_page(url: str) -> BeautifulSoup:
    # Fetches the webpage and returns the parsed HTML as a BeautifulSoup object
    response = requests.get(url)
    print(response.text)
    print(response.headers)
    soup = BeautifulSoup(response.text, "html.parser")
    return soup

def create_wordlist(soup: BeautifulSoup):
    min_length: int = 4
    # Extracts all words found within the page's HTML;
    # get_text gets all the text from the page and strip=True removes extra whitespace
    words = soup.get_text(separator=" ", strip=True)
    # re.sub removes characters that are NOT word characters (\w) or whitespace (\s),
    # where \w covers letters, digits, and underscore
    cleaned_text = re.sub(r'[^\w\s]', "", words)
    word_list = cleaned_text.split()
    filtered_list = [word for word in word_list if len(word) >= min_length]
    with open("wordlist-targetdomain.txt", "a") as f:
        # Opens the file and loops through the filtered list, writing each word
        for word in filtered_list:
            f.write(f"{word}\n")


def create_url_list(parsed_response: BeautifulSoup):
    # Extract all URLs found within the webpage's <a> tags
    with open("urls-targetdomain.txt", "a") as f:
        # Opens urls-targetdomain.txt, where we will write the URLs
        for link in parsed_response.find_all('a'):
            if (re.search(r'^mailto:', link["href"]) is None) and (re.search(r'^http', link["href"]) is None):
                # Relative URLs: written with the base URL prepended when the href
                # doesn't begin with mailto: or http
                f.write(f"{url}{link['href']}\n")
            elif not (re.search(r'^http', link["href"]) is None) and (re.search(r'^mailto:', link["href"]) is None):
                # Absolute URLs: written exactly as they appear in the href when http
                # matches and the mailto: pattern does not
                f.write(f"{link['href']}\n")

def create_emaillist(soup: BeautifulSoup):
    # From the page text, look for emails
    page_text = soup.get_text()
    print(page_text)
    # Creates a regex pattern object that matches email addresses
    email_pattern = re.compile(r'[a-zA-Z0-9._+%.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
    # re.findall looks for all matches to our email pattern and stores them in a list
    email_list = re.findall(email_pattern, page_text)
    # We then write each element from the list to the text file below
    with open("email-list-targetdomain.txt", "a") as f:
        for email in email_list:
            f.write(f"{email}\n")

if __name__ == '__main__':
    parsed_page = get_page(url)
    create_url_list(parsed_page)
    create_wordlist(parsed_page)
    create_emaillist(parsed_page)



The index.html file can be found below.


<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Acme Corporation - About Us</title>
  <meta name="description" content="Acme Corporation, a leader in innovative solutions. Learn about our services, team, and contact information.">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <!-- Google Font for improved typography -->
  <link href="https://fonts.googleapis.com/css2?family=Roboto:wght@300;400;700&display=swap" rel="stylesheet">
  <style>
    /* CSS Variables for consistency */
    :root {
      --primary-color: #2c3e50;
      --secondary-color: #18bc9c;
      --accent-color: #e74c3c;
      --light-color: #ecf0f1;
      --dark-color: #34495e;
      --font-family: 'Roboto', sans-serif;
    }
    
    /* Global Reset */
    * {
      margin: 0;
      padding: 0;
      box-sizing: border-box;
    }
    
    body {
      font-family: var(--font-family);
      line-height: 1.6;
      background-color: var(--light-color);
      color: var(--primary-color);
    }
    
    header {
      background: var(--primary-color);
      color: #fff;
      padding: 40px 20px;
      text-align: center;
    }
    
    header h1 {
      font-size: 2.8rem;
      margin-bottom: 10px;
    }
    
    header p {
      font-size: 1.2rem;
      font-weight: 300;
    }
    
    nav {
      background: var(--secondary-color);
    }
    
    nav ul {
      display: flex;
      justify-content: center;
      list-style: none;
      padding: 10px 0;
    }
    
    nav ul li {
      margin: 0 20px;
    }
    
    nav ul li a {
      color: #fff;
      text-decoration: none;
      font-size: 1.1rem;
      font-weight: 500;
      transition: color 0.3s ease;
    }
    
    nav ul li a:hover {
      color: var(--dark-color);
    }
    
    .container {
      max-width: 1200px;
      margin: 40px auto;
      padding: 0 20px;
    }
    
    section {
      margin-bottom: 50px;
    }
    
    section h2 {
      font-size: 2rem;
      margin-bottom: 15px;
      border-bottom: 3px solid var(--secondary-color);
      display: inline-block;
      padding-bottom: 5px;
    }
    
    p, li {
      margin-bottom: 15px;
    }
    
    ul {
      margin-left: 20px;
    }
    
    /* Team Section Grid */
    .team-grid {
      display: grid;
      grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
      gap: 20px;
    }
    
    .team-member {
      background: #fff;
      border: 1px solid #ddd;
      border-radius: 8px;
      padding: 20px;
      box-shadow: 0 2px 5px rgba(0,0,0,0.1);
      transition: transform 0.3s ease, box-shadow 0.3s ease;
    }
    
    .team-member:hover {
      transform: translateY(-5px);
      box-shadow: 0 4px 10px rgba(0,0,0,0.15);
    }
    
    .team-member h3 {
      margin-bottom: 10px;
      font-size: 1.5rem;
      color: var(--accent-color);
    }
    
    footer {
      background: var(--dark-color);
      color: #fff;
      text-align: center;
      padding: 20px 10px;
    }
    
    footer p {
      margin: 5px 0;
    }
    
    footer a {
      color: var(--secondary-color);
      text-decoration: underline;
    }
    
    @media (max-width: 768px) {
      header h1 {
        font-size: 2rem;
      }
      
      nav ul {
        flex-direction: column;
        gap: 10px;
      }
    }
  </style>
</head>
<body>
  <header>
    <h1>Acme Corporation</h1>
    <p>Leading the future of innovation and excellence.</p>
  </header>
  
  <nav>
    <ul>
      <li><a href="index.html">Home</a></li>
      <li><a href="about.html">About Us</a></li>
      <li><a href="services.html">Services</a></li>
      <li><a href="team.html">Team</a></li>
      <li><a href="contact.html">Contact</a></li>
    </ul>
  </nav>
  
  <div class="container">
    <section id="about">
      <h2>About Acme Corporation</h2>
      <p>Acme Corporation has been a pioneer in innovative solutions since 1999. We specialize in providing top-notch services and products to clients worldwide. Our commitment to excellence drives us to continually improve and adapt to new challenges in the ever-evolving market.</p>
      <p>With a diverse portfolio ranging from cutting-edge technology solutions to sustainable energy systems, Acme Corporation stands at the forefront of industry advancements.</p>
    </section>
    
    <section id="services">
      <h2>Our Services</h2>
      <ul>
        <li>Technology Consulting</li>
        <li>Software Development</li>
        <li>Renewable Energy Solutions</li>
        <li>Business Strategy and Analysis</li>
        <li>Customer Support and Maintenance</li>
      </ul>
      <p>For more details, visit our <a href="services.html">services page</a> where we outline our offerings in greater detail.</p>
    </section>
    
    <section id="team">
      <h2>Our Team</h2>
      <div class="team-grid">
        <div class="team-member">
          <h3>Jane Doe, CEO</h3>
          <p>With over 20 years of experience in leading successful enterprises, Jane has been driving Acme Corporation to new heights.</p>
          <p>Email: <a href="mailto:jane.doe@acmecorp.com">jane.doe@acmecorp.com</a></p>
        </div>
        <div class="team-member">
          <h3>John Smith, CTO</h3>
          <p>John is responsible for the technical strategy and innovation at Acme Corporation. His expertise has been a cornerstone in our technological advancements.</p>
          <p>Email: <a href="mailto:john.smith@acmecorp.com">john.smith@acmecorp.com</a></p>
        </div>
        <div class="team-member">
          <h3>Emily Johnson, CFO</h3>
          <p>Emily ensures our financial stability and growth, handling complex financial strategies that support our overall mission.</p>
          <p>Email: <a href="mailto:emily.johnson@acmecorp.com">emily.johnson@acmecorp.com</a></p>
        </div>
      </div>
    </section>
    
    <section id="blog">
      <h2>Latest News</h2>
      <article>
        <h3>Acme Corporation Launches New Product Line</h3>
        <p>Today, Acme Corporation unveiled its latest line of innovative products aimed at revolutionizing the tech industry. Read more about our product launch on our <a href="blog.html">blog</a>.</p>
      </article>
      <article>
        <h3>Sustainable Energy: The Future is Now</h3>
        <p>Our renewable energy solutions are setting a new standard in the industry. Find detailed insights and case studies on our <a href="blog.html">blog</a>.</p>
      </article>
    </section>
    
    <section id="contact">
      <h2>Contact Us</h2>
      <p>If you have any questions or would like to learn more about Acme Corporation, please reach out to us. You can contact our main office at <a href="mailto:info@acmecorp.com">info@acmecorp.com</a> or use the form on our <a href="contact.html">contact page</a>.</p>
      <p>Alternatively, for media inquiries, please email our PR team at <a href="mailto:pr@acmecorp.com">pr@acmecorp.com</a>.</p>
    </section>
  </div>
  
  <footer>
    <p>&copy; 2023 Acme Corporation. All rights reserved.</p>
    <p>Follow us on <a href="https://www.twitter.com/acmecorp">Twitter</a> and <a href="https://www.facebook.com/acmecorp">Facebook</a>.</p>
  </footer>
</body>
</html>

 
 
 
