SpiderFoot has over 200 modules, many of which were contributed by the community. Modules exist for extracting OSINT from third parties using APIs, but modules also exist for analysing content from the target directly, for example for extracting email addresses from web content. In this post, community contributor Jess Williams documented her experience writing a SpiderFoot module for extracting OSINT from StackOverflow.

One of the coolest features of SpiderFoot is the ability to create custom modules to enhance your scans. In this article, we will be walking through the creation of a module to automatically extract data from posts on StackOverflow as a way to illustrate the process.

First, we’ll run through how the module works, then we’ll dive into how it was created.

About the Module

Since 2008, StackOverflow has been an essential part of knowledge sharing within the IT industry. It’s a fantastic public platform for people to post questions and get answers from other professionals and students alike. StackOverflow sees on average 8000 new questions per day, any one of which may contain information pivotal to your OSINT investigation!

Fun Fact:

One out of every four users will copy text within five minutes of hitting the page! So if you’ve ever felt guilty about copy-pasting code straight from StackOverflow, you can rest assured you’re certainly not alone!

A wealth of information is leaked on StackOverflow, from overly verbose error messages to secrets in .env files. In its 2021 developer survey, StackOverflow found that 30% of users are relatively inexperienced, with only 1-4 years under their belt. Users often post secrets while troubleshooting, whether by accident or because they weren’t aware that a piece of data was meant to be kept secret. Additionally, the site provides user profiles with information on job titles, companies, and even geolocation.

The SpiderFoot StackOverflow module will help you gain additional context on your target by searching the site for mentions of your target domain. Simply provide the domain, and SpiderFoot will return a link to the question where your domain was identified, as well as a preview of the text. Additionally, the module will extract any email addresses and IP addresses seen in the text, and return the poster’s username and display name.

Let’s pretend our target is contoso.com. Go to “New Scan” and enter “contoso.com” into the “Scan Target” field:

After entering our target, we’ll configure the scan to run by module:

At the bottom of the page, select “Run Scan Now”. 

Since we used Microsoft’s fictional company, there seem to be a lot of hits:

There are a few different events to look at here. Most of them are self-explanatory, but the Raw Data is more interesting.

Here we can see the preview of the question where the domain was identified; clicking on the link, we then see that “contoso.com” was part of the log file posted in the question:

https://stackoverflow.com/questions/18939846/how-to-generate-wcf-service-host-from-existing-wsdl-xsd-files

Exploring the question above by reviewing the raw data, we can infer that this contoso.com host is running IIS 8 with ASP.NET 4.5 using WAS (Windows Process Activation Service). Additionally, we can see that contoso.com has an endpoint called “/MyService.svc/IMyService/Login”. The user has also shared some code containing insights on how login requests are handled. Obviously, this isn’t exactly exciting considering it’s Microsoft’s fake company, but if this were a real company, this sort of information would be ideal for an OSINT investigation. It’s worth noting that the module also returns the display name of the poster.

By browsing the poster’s profile, you may be able to confirm their job title and employer.

Creating the Module

Even for those who don’t have a whole lot of experience in open-source projects or Python development, creating a module for SpiderFoot is very beginner-friendly. 

Quite often, when deciding to contribute to an open-source project, it can take some time before actually getting into writing the code. There’s usually a significant amount of time spent understanding how the application works, reading through wiki entries, and planning how your contribution will fit into the application. SpiderFoot has been set up in a way that makes contributing really easy. Its modular design means most contributors only need to write one file for their module.

SpiderFoot has even gone so far as to create a template for contributors wanting to make modules. The template contains plenty of comments to walk you through creating the metadata for your module, identifying the events your module will listen for and produce, and some guidance on how best to write the logic for your module. 

The StackOverflow module works by listening for domain name events, which are entered through the “Scan Target” input in the UI, as we saw in the above example, or detected by other modules during the scan. The events produced and watched for are defined in the below two functions:

    # What events is this module interested in for input
    def watchedEvents(self):
        return ["DOMAIN_NAME"]
 
    # What events this module produces
    def producedEvents(self):
        return ["RAW_RIR_DATA", "EMAILADDR", "USERNAME", "IP_ADDRESS"]

After defining the events, the “sfp_template.py” file has a function dedicated to the actual query. This contains the logic to make a request to the StackOverflow API. This function contains two different queries, one to search the posts for the target domain and another for returning the username from the questions endpoint. Below is an example of one of the queries:

    def query(self, qry, qryType):
        # The StackOverflow excerpts endpoint searches the site for mentions of a keyword and returns a snippet of relevant results
        if qryType == "excerpts":
            try:
                res = self.sf.fetchUrl(
                    f"https://api.stackexchange.com/2.3/search/excerpts?order=desc&q={qry}&site=stackoverflow",
                    timeout=self.opts['_fetchtimeout'],
                    useragent="SpiderFoot"
                )
            except Exception as e:
                self.error(f"Error querying StackExchange API: {e}")
                self.errorState = True
                return None
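
One detail worth keeping in mind when building a query URL like the one above is that the search term should be URL-encoded, so that characters such as spaces or “&” do not break the query string. The following is a minimal standalone sketch using only the Python standard library. The helper name build_excerpts_url is hypothetical and is not part of the SpiderFoot module; the endpoint and parameters are the ones shown in the excerpt above.

```python
from urllib.parse import urlencode

# Endpoint taken from the module excerpt above
API_BASE = "https://api.stackexchange.com/2.3/search/excerpts"


def build_excerpts_url(term: str) -> str:
    """Return a search/excerpts URL for `term` (hypothetical helper)."""
    params = {"order": "desc", "q": term, "site": "stackoverflow"}
    # urlencode percent-escapes characters such as '&' and turns spaces into '+'
    return f"{API_BASE}?{urlencode(params)}"


print(build_excerpts_url("contoso.com"))
```

A plain domain such as contoso.com passes through unchanged, so the result matches the URL built by the f-string in the module; the encoding only matters for search terms containing reserved characters.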

After querying StackOverflow, the logic in the “handleEvent” function will iterate through the items returned by the query and extract out the juicy information.

        for item in items:
            if self.checkForStop():
                return

            # create raw_rir_data event
            body = item["body"]
            excerpt = item["excerpt"]
            question = item["question_id"]
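
To make the loop more concrete, here is a small standalone sketch that walks a hand-written sample payload containing the same three fields the loop reads (“body”, “excerpt” and “question_id”) inside an “items” list. The sample data is invented for illustration and is not real API output, and summarise_items is a hypothetical helper. The question URL format is an assumption based on the link shown earlier in this article.

```python
import json

# Hand-written sample shaped like the fields the loop reads; not real API output
SAMPLE_RESPONSE = json.dumps({
    "items": [
        {"question_id": 101, "excerpt": "connecting to contoso.com fails", "body": "see log"},
        {"question_id": 102, "excerpt": "unrelated", "body": "mail admin@contoso.com"},
    ]
})


def summarise_items(raw_json: str) -> list:
    """Return (question_url, combined_text) pairs, one per item (hypothetical helper)."""
    data = json.loads(raw_json)
    results = []
    for item in data.get("items", []):
        # .get() tolerates items that lack one of the text fields
        text = f"{item.get('body', '')} {item.get('excerpt', '')}".strip()
        # Assumed URL format, based on the question link shown earlier
        url = f"https://stackoverflow.com/questions/{item['question_id']}"
        results.append((url, text))
    return results


for url, text in summarise_items(SAMPLE_RESPONSE):
    print(url, "->", text)
```

Joining the body and excerpt into one string mirrors how the module later passes both to its extraction functions as a single piece of text.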

Once an event is extracted from the API response, the module must notify any listener modules, so that they may process the event. For example, the following code returns all of the identified emails as events:

        for email in allEmails:
            email = str(email)
            e = SpiderFootEvent('EMAILADDR', email, self.__name__, event)
            self.notifyListeners(e)
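
The watch, produce and notify pattern can be illustrated without SpiderFoot installed. The sketch below uses tiny stand-in classes: ToyEvent, ToyModule, DomainSearcher and EmailCollector are all hypothetical names invented for this example, not SpiderFoot’s real classes, and the filtering shown here is only a simplified picture of how events might be routed to interested modules.

```python
class ToyEvent:
    """Stand-in for an event: a type, a value, and the event that led to it."""

    def __init__(self, event_type, data, source=None):
        self.event_type = event_type
        self.data = data
        self.source = source


class ToyModule:
    """Stand-in base class: keeps a list of listeners and forwards events."""

    def __init__(self):
        self.listeners = []

    def watched_events(self):
        return []

    def handle_event(self, event):
        pass

    def notify_listeners(self, event):
        # In this toy version, deliver the event only to listeners watching its type
        for listener in self.listeners:
            if event.event_type in listener.watched_events():
                listener.handle_event(event)


class DomainSearcher(ToyModule):
    def watched_events(self):
        return ["DOMAIN_NAME"]

    def handle_event(self, event):
        # Pretend a search for the domain turned up one email address
        found = ToyEvent("EMAILADDR", f"admin@{event.data}", source=event)
        self.notify_listeners(found)


class EmailCollector(ToyModule):
    def __init__(self):
        super().__init__()
        self.seen = []

    def watched_events(self):
        return ["EMAILADDR"]

    def handle_event(self, event):
        self.seen.append(event.data)


searcher = DomainSearcher()
collector = EmailCollector()
searcher.listeners.append(collector)
searcher.handle_event(ToyEvent("DOMAIN_NAME", "contoso.com"))
print(collector.seen)  # ['admin@contoso.com']
```

Each new event keeps a reference to the event that triggered it, which is the role the final `event` argument plays in the SpiderFootEvent call above.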

Email addresses and IP addresses are extracted by three separate functions. The IPs are extracted by “extractIP4s” and “extractIP6s”. These functions receive the “body” and “excerpt” as one string and run a regular expression over it to find any matches. For example, this is what the “extractIP4s” function looks like:

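    def extractIP4s(self, text):
        ips = set()

        # No ^/$ anchors: an IP may appear anywhere in the text, not only as the whole string
        matches = re.findall(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b', text)

        if matches:
            for match in matches:
                # Keep only valid addresses that are not local or loopback
                if self.sf.validIP(match) and not self.sf.isValidLocalOrLoopbackIP(match):
                    ips.add(match)
            return list(ips)
        else:
            return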
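
For readers who want to experiment with this kind of extraction outside SpiderFoot, here is a standalone sketch of the same idea. It is an approximation, not the module’s implementation: it swaps in Python’s standard ipaddress module for the SpiderFoot validation helpers, so the filtering rules may differ slightly, and extract_public_ipv4s is a hypothetical name.

```python
import ipaddress
import re

# No ^/$ anchors, so addresses are found anywhere in the text
IPV4_PATTERN = re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b")


def extract_public_ipv4s(text: str) -> list:
    """Return sorted, de-duplicated public IPv4 addresses found in `text`."""
    found = set()
    for candidate in IPV4_PATTERN.findall(text):
        try:
            address = ipaddress.IPv4Address(candidate)
        except ValueError:
            # e.g. 999.1.1.1 matches the regex but is not a valid address
            continue
        if address.is_private or address.is_loopback:
            # Skip internal ranges such as 10.0.0.0/8 and 127.0.0.0/8
            continue
        found.add(candidate)
    return sorted(found)


log = "upstream 8.8.8.8 failed, retrying via 10.0.0.1 then 127.0.0.1; bad 999.1.1.1; 8.8.8.8 again"
print(extract_public_ipv4s(log))  # ['8.8.8.8']
```

The regex only finds candidates; the validation step is what rejects strings that merely look like addresses.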

Extracting the email addresses was extremely easy, as there was already a function for it in the SpiderFoot library file called “sflib.py”. This file contains a large number of common functions used by the modules. All I had to do was pass the function the text I wanted to search, and it would return the emails:

    emails = self.sf.parseEmails(text)
    if emails is not None:
        # parseEmails returns a list, so extend() keeps allEmails flat
        allEmails.extend(emails)
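
As a standalone illustration of collecting addresses from several pieces of text, here is a minimal sketch. The find_emails function is a hypothetical, deliberately simplified stand-in and not SpiderFoot’s own parser; its regex will miss some valid addresses and is only meant to show the flow of gathering results into one flat list.

```python
import re

# Deliberately simple pattern; real-world email matching has many more edge cases
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text: str) -> list:
    """Return lower-cased, de-duplicated email-like strings in order of appearance."""
    seen = []
    for match in EMAIL_PATTERN.findall(text):
        email = match.lower()
        if email not in seen:
            seen.append(email)
    return seen


all_emails = []
for chunk in ["contact Admin@contoso.com", "cc admin@contoso.com, dev@contoso.com"]:
    # extend() adds each address individually; append() would nest the whole list
    all_emails.extend(find_emails(chunk))

print(all_emails)  # ['admin@contoso.com', 'admin@contoso.com', 'dev@contoso.com']
```

Note that duplicates across different chunks survive in this sketch, so a final de-duplication step may be worthwhile before emitting one event per address.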

Overall, making this module was an awesome learning experience. The documentation for writing a module is easy to follow and the SpiderFoot team is extremely helpful and responsive if you ever get stuck. Before writing this module, I reached out over Discord to validate if a StackOverflow module would be useful. I received a response very quickly and was thoughtfully guided through the best way to go about achieving my goals with the module.

Resources for Writing Your Own Module

Repository:

Discord:

SpiderFoot documentation for creating a module:

If you have an idea for a module, you can check if it already exists here:

If this is your first time contributing to open-source, here are some great resources to get you started: