Web Application Enumeration
Overview
Web Application Enumeration focuses on identifying technologies, frameworks, hidden content, and potential vulnerabilities in web applications. This phase builds upon subdomain discovery to analyze the actual web services and applications running on discovered hosts.
Key Objectives:
Identify web technologies and frameworks
Discover hidden directories and files
Enumerate parameters and API endpoints
Analyze security headers and configurations
Identify CMS-specific vulnerabilities
Discover virtual hosts and applications
Technology Stack Identification
whatweb - Command Line Technology Detection
# Basic scan
whatweb https://example.com
# Aggressive scan with all plugins
whatweb -a 3 https://example.com
# Output to JSON format
whatweb --log-json=results.json https://example.com
# Scan multiple URLs from file
whatweb -i urls.txt
# Scan with specific user agent
whatweb --user-agent "Mozilla/5.0..." https://example.com
Wappalyzer (Browser Extension)
Automatically identifies technologies on visited pages
Shows: CMS, frameworks, libraries, servers, databases
Real-time analysis during browsing
BuiltWith - Web Technology Profiler
Netcraft - Web Security Services
Nikto - Web Server Scanner
Nmap HTTP Scripts for Technology Detection
Manual Header Analysis
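Response headers alone often reveal the server, framework, and proxy layers. A minimal sketch with curl (example.com is a placeholder target):
# Fetch only the response headers
curl -s -I https://example.com
# Pull out the most technology-revealing headers
curl -s -I https://example.com | grep -iE "server|x-powered-by|x-aspnet-version|via"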
Directory & File Enumeration
Gobuster - Directory Brute Forcing
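A minimal gobuster directory brute force; the SecLists wordlist path is an assumption, and the URL, extensions, and thread count should be tuned to the engagement:
# Directory and file brute force with common extensions
gobuster dir -u https://example.com -w /usr/share/seclists/Discovery/Web-Content/common.txt -x php,txt,bak -t 40 -o gobuster_dirs.txt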
ffuf - Fast Web Fuzzer
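The equivalent directory fuzz with ffuf; the wordlist path is an assumption and the match options should be adjusted to the target's default responses:
# Fuzz the path component and keep only interesting status codes
ffuf -w /usr/share/seclists/Discovery/Web-Content/common.txt -u https://example.com/FUZZ -mc 200,204,301,302,307,401,403 -e .php,.bak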
dirb - Recursive Directory Scanner
Virtual Host Discovery
Understanding Virtual Hosts
Virtual hosting allows web servers to host multiple websites or applications on a single server by leveraging the HTTP Host header. This is crucial for discovering hidden applications and services that might not be publicly listed in DNS.
How Virtual Hosts Work
Key Concepts:
Subdomains: Extensions of the main domain (e.g., blog.example.com) that have their own DNS records
Virtual Hosts (VHosts): Server configurations that allow multiple sites to be hosted on the same IP
Host Header: HTTP header that tells the server which website is being requested
Process Flow:
Browser Request: Sends HTTP request to server IP with Host header
Host Header: Contains the requested domain name (e.g., Host: www.example.com)
Server Processing: The web server examines the Host header and consults its virtual host configuration
Content Serving: The server serves the content of whichever virtual host matches (illustrated in the sketch below)
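The behaviour is easy to demonstrate with curl by sending different Host headers to the same IP (192.0.2.10 is a documentation/placeholder address):
# Same IP, different Host header - the server may return entirely different sites
curl -s http://192.0.2.10/ -H "Host: www.example.com" | head -n 20
curl -s http://192.0.2.10/ -H "Host: app.example.com" | head -n 20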
Types of Virtual Hosting
Name-Based
Uses HTTP Host header to distinguish sites
Cost-effective, flexible, no multiple IPs needed
Requires Host header support, SSL/TLS limitations
IP-Based
Assigns unique IP to each website
Protocol independent, better isolation
Expensive, requires multiple IPs
Port-Based
Different ports for different websites
Useful when IPs limited
Not user-friendly, requires port in URL
Example Apache Configuration
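A minimal name-based vhost sketch for a Debian-style Apache layout (domains, paths, and the site file name are placeholders):
# Two sites served from one IP, distinguished purely by ServerName
cat <<'EOF' | sudo tee /etc/apache2/sites-available/example-vhosts.conf
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/www
</VirtualHost>
<VirtualHost *:80>
    ServerName admin.example.com
    DocumentRoot /var/www/admin
</VirtualHost>
EOF
sudo a2ensite example-vhosts.conf && sudo systemctl reload apache2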
Key Point: Even without DNS records, virtual hosts can be accessed by modifying the local /etc/hosts file or by fuzzing the Host header directly.
gobuster - Virtual Host Enumeration
gobuster is highly effective for virtual host discovery with its dedicated vhost mode:
Basic gobuster vhost Usage
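A typical invocation, assuming a SecLists subdomain wordlist and gobuster v3.2+ for the --append-domain flag:
# Brute force Host header values against the target
gobuster vhost -u http://example.com -w /usr/share/seclists/Discovery/DNS/subdomains-top1million-5000.txt --append-domain -t 50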
Important gobuster Flags
gobuster vhost Example Output
ffuf - Fast Virtual Host Fuzzing
ffuf provides flexible and fast virtual host discovery with powerful filtering:
Basic ffuf Virtual Host Discovery
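A sketch of Host header fuzzing with ffuf; the -fs value is an assumption and should be set to the size of the server's default (catch-all) response:
# Fuzz the Host header and filter out the default response by size
ffuf -w /usr/share/seclists/Discovery/DNS/subdomains-top1million-5000.txt -u http://example.com/ -H "Host: FUZZ.example.com" -fs 10918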
Advanced ffuf Filtering
feroxbuster - Rust-Based Virtual Host Discovery
Virtual Host Discovery Strategies
1. Preparation Phase
2. Initial Discovery
3. Filtering Setup
4. Comprehensive Enumeration
Manual Virtual Host Testing
Local Testing with /etc/hosts
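Once a candidate vhost is found, it can be tested locally without DNS; the IP address below is a placeholder for the target:
# Map the vhost to the target IP in /etc/hosts
echo "10.129.42.195 admin.example.com" | sudo tee -a /etc/hosts
curl -s http://admin.example.com/
# Or resolve on the fly without editing /etc/hosts
curl -s http://admin.example.com/ --resolve admin.example.com:80:10.129.42.195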
HTB Academy Lab Examples
Lab: Virtual Host Discovery
Analysis Process
Security Considerations
Detection Avoidance
Traffic Analysis
Virtual host discovery generates significant HTTP traffic
Monitor for IDS/WAF detection
Use proper authorization before testing
Document all discovered virtual hosts
False Positive Management
Defensive Measures
Server Hardening
Monitoring
Parameter Discovery
ffuf Parameter Fuzzing
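A sketch for GET and POST parameter discovery; the endpoint and the -fs filter value are placeholders that depend on the target's baseline response:
# GET parameter name fuzzing
ffuf -w /usr/share/seclists/Discovery/Web-Content/burp-parameter-names.txt -u "https://example.com/admin.php?FUZZ=test" -fs 1234
# POST parameter name fuzzing
ffuf -w /usr/share/seclists/Discovery/Web-Content/burp-parameter-names.txt -u https://example.com/admin.php -X POST -d "FUZZ=test" -H "Content-Type: application/x-www-form-urlencoded" -fs 1234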
Arjun - Parameter Discovery Tool
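Arjun probes a single endpoint against its built-in parameter wordlist; a minimal run might look like:
# Discover hidden GET parameters on an endpoint
arjun -u https://example.com/endpoint
# Probe POST bodies instead of the query string
arjun -u https://example.com/api/endpoint -m POST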
paramspider - Parameter Mining
API Enumeration
Common API Endpoints
API Fuzzing with ffuf
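A sketch for brute forcing API routes and versions; the wordlist names are assumptions:
# Fuzz API endpoint names
ffuf -w api-endpoints.txt -u https://example.com/api/FUZZ -mc 200,201,204,301,302,401,403
# Fuzz API versions and objects together using named keywords
ffuf -w versions.txt:VER -w objects.txt:OBJ -u https://example.com/api/VER/OBJ -mc 200,401,403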
GraphQL Enumeration
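If a GraphQL endpoint is exposed (commonly /graphql; the path here is an assumption), an introspection query is a quick first check:
# Ask the schema for all type names via introspection
curl -s -X POST https://example.com/graphql -H "Content-Type: application/json" -d '{"query":"{__schema{types{name}}}"}'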
Web Crawling & Spidering
Popular Web Crawlers Overview
Professional Tools:
Burp Suite Spider - Active crawler for web application mapping and vulnerability discovery
OWASP ZAP - Free, open-source web application security scanner with spider component
Scrapy - Versatile Python framework for building custom web crawlers
Apache Nutch - Highly extensible and scalable open-source web crawler
ReconSpider - HTB Academy Custom Spider
ReconSpider Results Analysis
ReconSpider saves data in results.json with the following structure:
JSON Key Analysis:
| JSON Key | Description | Reconnaissance Use |
|---|---|---|
| emails | Email addresses found on the domain | User enumeration, social engineering |
| links | URLs of links within the domain | Site mapping, hidden pages |
| external_files | External files (PDFs, docs) | Information disclosure |
| js_files | JavaScript files | Endpoint discovery, sensitive data |
| form_fields | Form fields discovered | Parameter discovery, injection points |
| images | Image URLs | Metadata extraction |
| videos | Video URLs | Content analysis |
| audio | Audio file URLs | Content analysis |
| comments | HTML comments | Information disclosure |
ReconSpider Data Mining
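Assuming results.json follows the structure above, jq can pull out the most useful fields quickly:
# Extract discovered email addresses and HTML comments
jq '.emails, .comments' results.json
# Unique JavaScript files for follow-up endpoint analysis (assumes js_files is an array)
jq -r '.js_files[]' results.json | sort -u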
hakrawler - Fast Web Crawler
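hakrawler reads URLs from stdin, which makes it easy to chain with other tools:
# Crawl a single target and save discovered URLs
echo https://example.com | hakrawler | tee hakrawler_urls.txt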
wget Recursive Download
Burp Suite Spider
OWASP ZAP Spider
Scrapy Custom Spider
Ethical Crawling Practices
Critical Guidelines
Always obtain permission before crawling a website
Respect robots.txt and website terms of service
Be mindful of server resources - avoid excessive requests
Implement delays between requests to prevent server overload
Use appropriate scope - don't crawl beyond authorized targets
Monitor impact - watch for 429 (rate limit) responses
Responsible Crawling Configuration
Legal Considerations
Penetration Testing Authorization - Ensure proper scope documentation
Rate Limiting Compliance - Don't bypass intentional restrictions
Data Protection - Handle discovered data responsibly
Service Availability - Don't impact legitimate users
Disclosure - Report findings through proper channels
Search Engine Discovery (OSINT)
Overview
Search Engine Discovery, a core OSINT (Open Source Intelligence) technique, leverages search engines as powerful reconnaissance tools to uncover information about target websites, organizations, and individuals. It relies on specialized search operators to extract data that may not be readily visible on websites.
Why Search Engine Discovery Matters:
Open Source - Information is publicly accessible, making it legal and ethical
Breadth of Information - Search engines index vast portions of the web
Ease of Use - User-friendly and requires no specialized technical skills
Cost-Effective - Free and readily available resource for information gathering
Applications:
Security Assessment - Identifying vulnerabilities, exposed data, and potential attack vectors
Competitive Intelligence - Gathering information about competitors' products and services
Threat Intelligence - Identifying emerging threats and tracking malicious actors
Investigative Research - Uncovering hidden connections and financial transactions
Search Operators
Search operators are specialized commands that unlock precise control over search results, allowing you to pinpoint specific types of information.
| Operator | Description | Example | Use Case |
|---|---|---|---|
| site: | Limits results to a specific website/domain | site:example.com | Find all publicly accessible pages |
| inurl: | Finds pages with a specific term in the URL | inurl:login | Search for login pages |
| filetype: | Searches for files of a particular type | filetype:pdf | Find downloadable PDF documents |
| intitle: | Finds pages with a specific term in the title | intitle:"confidential report" | Look for confidential documents |
| intext: | Searches for a term within the body text | intext:"password reset" | Identify password reset pages |
| cache: | Displays the cached version of a webpage | cache:example.com | View previous content |
| link: | Finds pages linking to a specific webpage | link:example.com | Identify websites linking to the target |
| related: | Finds websites related to a specific webpage | related:example.com | Discover similar websites |
| info: | Provides summary information about a webpage | info:example.com | Get basic details about the target |
| define: | Provides definitions of a word/phrase | define:phishing | Get definitions from various sources |
| numrange: | Searches for numbers within a specific range | site:example.com numrange:1000-2000 | Find pages with numbers in that range |
| allintext: | Finds pages containing all specified words in the body | allintext:admin password reset | Search for multiple terms in the body |
| allinurl: | Finds pages containing all specified words in the URL | allinurl:admin panel | Look for multiple terms in the URL |
| allintitle: | Finds pages containing all specified words in the title | allintitle:confidential report 2023 | Search for multiple terms in the title |
Advanced Search Operators
| Operator | Description | Example | Use Case |
|---|---|---|---|
| AND | Requires all terms to be present | site:example.com AND (inurl:admin OR inurl:login) | Find admin or login pages |
| OR | Includes pages with any of the terms | "linux" OR "ubuntu" OR "debian" | Search for any Linux distribution |
| NOT | Excludes results containing the specified term | site:bank.com NOT inurl:login | Exclude login pages |
| * | Wildcard representing any character/word | site:company.com filetype:pdf user* manual | Find user manuals (user guide, etc.) |
| .. | Range search for numerical values | site:ecommerce.com "price" 100..500 | Products priced between 100 and 500 |
| " " | Searches for exact phrases | "information security policy" | Find exact phrase matches |
| - | Excludes terms from search results | site:news.com -inurl:sports | Exclude sports content |
Google Dorking Examples
Finding Login Pages
Identifying Exposed Files
Uncovering Configuration Files
Locating Database Backups
Finding Sensitive Information
Directory Listings
Error Pages and Debug Information
Specialized Google Dorks
WordPress-Specific Dorks
Database-Specific Dorks
Version Control Systems
OSINT Tools and Resources
Google Hacking Database
Automated Google Dorking Tools
Search Engine Alternatives
Bing Search Operators
DuckDuckGo Search
Yandex Search
Practical OSINT Workflow
Phase 1: Initial Discovery
Phase 2: Deep Enumeration
Phase 3: Vulnerability Discovery
Phase 4: Intelligence Analysis
Legal and Ethical Considerations
Best Practices
Stay within legal boundaries - Only search publicly indexed information
Respect robots.txt - Understand website crawling policies
Avoid automation abuse - Don't overload search engines with requests
Document findings responsibly - Handle discovered information ethically
Report vulnerabilities - Follow responsible disclosure practices
Limitations
Not all information is indexed - Some data may be hidden or protected
Information may be outdated - Search engine caches may not reflect current state
False positives - Search results may include irrelevant information
Rate limiting - Search engines may limit query frequency
Web Archives (Wayback Machine)
Overview
Web Archives provide access to historical snapshots of websites, allowing reconnaissance professionals to explore how websites appeared and functioned in the past. The Internet Archive's Wayback Machine is the most prominent web archive, containing billions of web pages captured since 1996.
What is the Wayback Machine? The Wayback Machine is a digital archive of the World Wide Web operated by the Internet Archive, a non-profit organization. It allows users to "go back in time" and view snapshots of websites as they appeared at various points in their history.
How the Wayback Machine Works
The Wayback Machine operates through a three-step process:
Crawling - Automated web crawlers browse the internet systematically, following links and downloading webpage copies
Archiving - Downloaded webpages and resources are stored with specific timestamps, creating historical snapshots
Accessing - Users can view archived snapshots through the web interface by entering URLs and selecting dates
Archive Frequency:
Popular websites: Multiple captures per day
Regular websites: Weekly or monthly captures
Less popular sites: Few snapshots over years
Factors: Website popularity, update frequency, available resources
Why Web Archives Matter for Reconnaissance
Critical Applications:
Uncovering Hidden Assets - Discover old pages, directories, files, or subdomains no longer accessible
Vulnerability Discovery - Find exposed sensitive information or security flaws from past versions
Change Tracking - Observe website evolution, technology changes, and structural modifications
Intelligence Gathering - Extract historical OSINT about target's activities, employees, strategies
Stealthy Reconnaissance - Passive activity that doesn't interact with target infrastructure
Wayback Machine Usage
Basic Web Interface
URL Format Structure
Advanced Wayback Machine Techniques
Subdomain Discovery
Directory and File Discovery
Technology Evolution Tracking
Automated Wayback Machine Tools
waybackurls - URL Extraction
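waybackurls reads domains from stdin; a minimal run:
# Pull every archived URL the Wayback Machine knows for the domain
echo example.com | waybackurls > wayback_urls.txt
# Focus on archived JavaScript files
grep -E "\.js(\?|$)" wayback_urls.txt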
gau (GetAllURLs)
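gau aggregates known URLs from several archive sources; basic usage:
# Collect known URLs for a domain from multiple archive providers
gau example.com > gau_urls.txt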
Wayback Machine Downloader
Historical Intelligence Gathering
Employee and Contact Discovery
Technology Stack Evolution
Sensitive Information Discovery
Manual Investigation Techniques
Timeline Analysis
Content Comparison
HTB Academy Lab Examples
Lab 6: Wayback Machine Investigation
Practical Investigation Workflow
Alternative Web Archives
Archive.today
Common Crawl
Library and Government Archives
Limitations and Considerations
Technical Limitations
Not all content archived - Dynamic content, JavaScript-heavy sites may not work
Incomplete captures - Some resources (images, CSS) may be missing
No interaction - Forms, logins, and dynamic features don't work
robots.txt respect - Some content excluded by website owners
Legal restrictions - Some content removed due to legal requests
Investigation Challenges
Content authenticity - Verify information with other sources
Timestamp accuracy - Archive dates may not reflect actual publication dates
Context missing - Surrounding events and circumstances
Selective preservation - Popular sites better archived than obscure ones
Legal and Ethical Guidelines
Best Practices
Respect copyright - Archived content still subject to intellectual property laws
Privacy considerations - Personal information in archives should be handled responsibly
Purpose limitation - Use archived data only for legitimate security research
Disclosure responsibility - Report significant findings through proper channels
Documentation - Maintain records of research methodology and sources
JavaScript Analysis
LinkFinder - Extract Endpoints from JS
JSFScan.sh - JavaScript File Scanner
Manual JavaScript Analysis
CMS-Specific Enumeration
WordPress
Joomla
Drupal
Security Headers Analysis
Security Headers Check
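A quick check with curl; the grep pattern covers the headers most commonly reviewed:
# Inspect which security headers the application sets
curl -s -I https://example.com | grep -iE "strict-transport-security|content-security-policy|x-frame-options|x-content-type-options|referrer-policy"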
SSL/TLS Analysis
HTTP Methods Testing
Method Enumeration
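The Allow header returned for an OPTIONS request is a simple starting point; risky methods such as TRACE are worth probing individually:
# Ask the server which methods it advertises
curl -s -i -X OPTIONS https://example.com/ | grep -i "^allow"
# Check whether TRACE is enabled
curl -s -i -X TRACE https://example.com/ | head -n 1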
robots.txt and Sitemap Analysis
robots.txt Enumeration
Sitemap Discovery
WAF Detection and Bypass
WAF Detection
Basic WAF Bypass Techniques
HTB Academy Lab Examples
Lab 1: Fingerprinting inlanefreight.com
Banner Grabbing with curl
WAF Detection with wafw00f
Comprehensive Scanning with Nikto
Technology Stack Analysis
Lab 2: Virtual Host Discovery
Lab 3: Directory Discovery
Lab 4: ReconSpider Web Crawling
ReconSpider Results Analysis
Lab 5: Search Engine Discovery (OSINT)
OSINT Intelligence Analysis
Automated Reconnaissance Frameworks
Overview
While manual reconnaissance can be effective, it can also be time-consuming and prone to human error. Automating web reconnaissance tasks significantly enhances efficiency and accuracy, allowing you to gather information at scale and identify potential vulnerabilities more rapidly.
Why Automate Reconnaissance?
Key Advantages:
Efficiency - Automated tools perform repetitive tasks much faster than humans
Scalability - Scale reconnaissance efforts across large numbers of targets
Consistency - Follow predefined rules ensuring reproducible results
Comprehensive Coverage - Perform wide range of tasks: DNS, subdomains, crawling, port scanning
Integration - Easy integration with other tools creating seamless workflows
Reconnaissance Frameworks
FinalRecon - All-in-One Python Framework
FinalRecon Features:
Header Information - Server details, technologies, security misconfigurations
Whois Lookup - Domain registration details, registrant information
SSL Certificate Information - Certificate validity, issuer, security details
Web Crawler - HTML/CSS/JavaScript analysis, internal/external links
DNS Enumeration - 40+ DNS record types including DMARC
Subdomain Enumeration - Multiple sources (crt.sh, AnubisDB, ThreatMiner, etc.)
Directory Enumeration - Custom wordlists and file extensions
Wayback Machine - URLs from last 5 years
Port Scanning - Fast port enumeration
FinalRecon Command Options
| Option | Argument | Description |
|---|---|---|
| --url | URL | Specify the target URL |
| --headers | - | Retrieve header information |
| --sslinfo | - | Get SSL certificate information |
| --whois | - | Perform a Whois lookup |
| --crawl | - | Crawl the target website |
| --dns | - | Perform DNS enumeration |
| --sub | - | Enumerate subdomains |
| --dir | - | Search for directories |
| --wayback | - | Retrieve Wayback Machine URLs |
| --ps | - | Fast port scan |
| --full | - | Full reconnaissance scan |
FinalRecon Advanced Options
| Option | Default | Description |
|---|---|---|
| -dt | 30 | Number of threads for directory enumeration |
| -pt | 50 | Number of threads for port scanning |
| -T | 30.0 | Request timeout |
| -w | dirb_common.txt | Path to the wordlist |
| -r | False | Allow redirects |
| -s | True | Toggle SSL verification |
| -d | 1.1.1.1 | Custom DNS servers |
| -e | - | File extensions (e.g., txt,xml,php) |
| -o | txt | Export format |
| -k | - | Add API key (e.g., shodan@key) |
FinalRecon Practical Examples
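Using the options listed above (assuming finalrecon.py is run from its repository directory):
# Headers, SSL certificate, and Whois in one pass
python3 finalrecon.py --headers --sslinfo --whois --url https://example.com
# Full reconnaissance scan
python3 finalrecon.py --full --url https://example.com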
Other Reconnaissance Frameworks
Recon-ng - Modular Framework
Recon-ng Features:
Modular Structure - Various modules for different tasks
Database Integration - Store and manage reconnaissance data
API Integration - Multiple third-party services
Report Generation - HTML, XML, CSV output formats
Extensible - Custom module development
theHarvester - OSINT Data Gathering
theHarvester Features:
Email Address Discovery - Multiple search engines and sources
Subdomain Enumeration - Various databases and APIs
Employee Name Discovery - Social media and public records
Host Discovery - Active and passive techniques
Port Scanning - Basic port enumeration
Banner Grabbing - Service identification
SpiderFoot - OSINT Automation
SpiderFoot Features:
100+ Modules - Comprehensive data source integration
Web Interface - User-friendly dashboard
API Support - RESTful API for automation
Real-time Analysis - Live data correlation
Threat Intelligence - Malware, blacklist checking
Social Media - Profile and relationship discovery
OSINT Framework - Tool Collection
Automation Workflow Design
Phase 1: Initial Reconnaissance
Phase 2: Deep Enumeration
Phase 3: Data Analysis
Custom Automation Scripts
Bash Automation Example
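A hypothetical wrapper that chains a few of the tools covered above; tool availability, wordlist paths, and the output layout are all assumptions:
#!/bin/bash
# Usage: ./web_recon.sh example.com
TARGET="$1"
OUTDIR="recon_${TARGET}"
mkdir -p "$OUTDIR"

# Technology fingerprinting
whatweb -a 3 --log-json="$OUTDIR/whatweb.json" "https://$TARGET"

# Directory brute force (wordlist path is an assumption)
gobuster dir -u "https://$TARGET" -w /usr/share/seclists/Discovery/Web-Content/common.txt -o "$OUTDIR/gobuster.txt"

# Historical URLs from the Wayback Machine
echo "$TARGET" | waybackurls > "$OUTDIR/wayback_urls.txt"

echo "[+] Results saved to $OUTDIR"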
Python Automation Example
Tool Integration Strategies
API-Based Integration
Output Standardization
Best Practices for Automation
Performance Optimization
Parallel Execution - Run multiple tools simultaneously
Rate Limiting - Respect target server resources
Caching - Store results to avoid duplicate work
Threading - Use appropriate thread counts
Resource Management - Monitor CPU and memory usage
Error Handling
Graceful Failures - Continue execution if one tool fails
Retry Logic - Implement retry mechanisms for network issues
Logging - Comprehensive logging for debugging
Validation - Verify tool outputs and results
Backup Plans - Alternative tools for critical functions
Security Considerations
API Key Management - Secure storage of credentials
Network Isolation - Run in controlled environments
Output Sanitization - Clean and validate results
Access Controls - Restrict tool usage and access
Audit Trails - Maintain records of automation activities
HTB Academy Lab Examples
Lab 7: FinalRecon Automation
Automation Workflow Example
Security Assessment
Vulnerability Indicators
Exposed admin interfaces - /admin, /wp-admin, /administrator
Default credentials - admin:admin, admin:password
Information disclosure - Error messages, debug information
Weak authentication - No rate limiting, weak passwords
Missing security headers - e.g., Content-Security-Policy, X-Frame-Options, HSTS
Outdated software - Old CMS versions, known vulnerabilities
Common Misconfigurations
Directory listing enabled - Apache/Nginx misconfiguration
Backup files accessible - .bak, .old, .backup files
Source code exposure - .git directories, .svn folders
Configuration files - .env, config.php, web.config
Temporary files - Editors' backup files (~, .swp)
Defensive Measures
Web Application Hardening
Remove server banners - Hide version information
Implement security headers - CSP, HSTS, X-Frame-Options
Disable directory listing - Prevent folder browsing
Remove default files - Default pages, documentation
Secure configuration - Error handling, debug modes off
Monitoring and Detection
WAF implementation - Block malicious requests
Access logging - Monitor enumeration attempts
Rate limiting - Prevent brute force attacks
Anomaly detection - Unusual request patterns
Regular security assessments - Automated vulnerability scanning
Tools Summary
| Tool | Purpose | Best Use Case |
|---|---|---|
| whatweb | Technology detection | Initial reconnaissance |
| nikto | Web server scanning | Comprehensive security assessment |
| builtwith | Technology profiling | Detailed technology stack analysis |
| netcraft | Web security services | Security posture assessment |
| gobuster | Directory/file discovery | Finding hidden content |
| ffuf | Web fuzzing | Parameter/vhost discovery |
| wpscan | WordPress security | CMS-specific testing |
| burp suite | Web application testing | Manual analysis |
| arjun | Parameter discovery | Finding hidden parameters |
| wafw00f | WAF detection | Security control identification |
| reconspider | Custom web crawling | HTB Academy reconnaissance |
| hakrawler | Web crawling | Content discovery |
| burp spider | Professional crawling | Web application mapping |
| owasp zap | Security scanning | Vulnerability discovery |
| scrapy | Custom crawling | Python framework |
| google dorking | OSINT reconnaissance | Search engine discovery |
| pagodo | Automated dorking | Google Hacking Database |
| wayback machine | Web archives | Historical website analysis |
| waybackurls | Archive URL extraction | Historical endpoint discovery |
| gau | URL aggregation | Multiple source URL collection |
| finalrecon | Automated framework | All-in-one Python reconnaissance |
| recon-ng | Modular framework | Database-driven reconnaissance |
| theharvester | OSINT gathering | Email, subdomain, employee discovery |
| spiderfoot | OSINT automation | 100+ module automation platform |
| linkfinder | JavaScript analysis | Endpoint extraction |
Key Takeaways
Technology identification guides subsequent testing approaches
Directory enumeration reveals hidden functionality and files
Parameter discovery uncovers additional attack surface
Web crawling provides comprehensive content discovery
Search engine discovery exposes publicly indexed sensitive information
Web archives reveal historical assets and vulnerabilities
JavaScript analysis exposes client-side vulnerabilities
Virtual hosts may contain additional applications
Security headers indicate the security posture
CMS enumeration requires specialized tools and techniques
WAF detection is crucial for bypass strategy
API enumeration focuses on modern application architectures
OSINT techniques reveal organizational intelligence
Automated frameworks significantly enhance reconnaissance efficiency
Comprehensive methodology combines multiple tools and techniques
References
HTB Academy: Information Gathering - Web Edition
OWASP Web Security Testing Guide
SecLists: https://github.com/danielmiessler/SecLists
Burp Suite Documentation
FFUF Documentation: https://github.com/ffuf/ffuf
Google Hacking Database: https://www.exploit-db.com/google-hacking-database
Pagodo: https://github.com/opsdisk/pagodo
ReconSpider: https://academy.hackthebox.com/storage/modules/144/ReconSpider.v1.2.zip
Wayback Machine: https://web.archive.org/
waybackurls: https://github.com/tomnomnom/waybackurls
gau (GetAllURLs): https://github.com/lc/gau
Wayback Machine Downloader: https://github.com/hartator/wayback-machine-downloader
FinalRecon: https://github.com/thewhiteh4t/FinalRecon
Recon-ng: https://github.com/lanmaster53/recon-ng
theHarvester: https://github.com/laramies/theHarvester
SpiderFoot: https://github.com/smicallef/spiderfoot
OSINT Framework: https://osintframework.com/