LINKSCOPE

Documentation

The development of LINKSCOPE

Introduction

Phishing URLs pose a grave threat to individuals in Thailand, and the risk is expected to escalate considerably in the future. While existing phishing link detection tools primarily focus on technical aspects, there remains a significant gap in addressing the informational needs of users beyond just numbers and data.

In response to the increasing threat of phishing attacks and the need for user-centric cybersecurity solutions, this project aims to develop a phishing detection tool powered by interpretable machine learning models.


URL components

A Uniform Resource Locator (URL) is designed to locate web pages. The diagram below outlines the structure of a common URL and its key components.

url-structure

A phisher has complete control over the subdomain portion of a URL and can assign any value to it. The URL may also include path and file components, which the phisher can likewise set as desired. In the rest of this article, we refer to these portions of the URL as FreeURL.
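As a rough illustration of the components described above, the following sketch splits a URL with Python's standard `urllib.parse`. The function name `split_url` and the example URL are hypothetical, and the rule "the registered domain is the last two hostname labels" is a simplifying assumption, not LINKSCOPE's actual logic.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into scheme, subdomain, domain, path, and FreeURL parts.

    Assumption: the registered domain is the last two hostname labels
    (e.g. "example.com"). This guess is wrong for suffixes such as "co.th";
    a real tool would consult the Public Suffix List instead.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    domain = ".".join(labels[-2:])
    subdomain = ".".join(labels[:-2])
    return {
        "scheme": parts.scheme,
        "subdomain": subdomain,
        "domain": domain,
        "path": parts.path,
        # FreeURL: the portions a phisher can set to any value.
        "free_url": [p for p in (subdomain, parts.path) if p],
    }


# Hypothetical example URL.
info = split_url("https://login.secure.example.com/account/verify.php")
print(info["free_url"])  # ['login.secure', '/account/verify.php']
```

Here the registered domain stays `example.com`, while the subdomain and path are the parts that can be made to look like a trusted brand.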


Machine Learning Techniques

Algorithm: Random Forest

The Random Forest algorithm can be utilized to enhance accuracy and mitigate overfitting in URL classification. The algorithm combines the outputs of multiple decision trees to categorize URLs. By using random subsets of features, each decision tree focuses on different combinations of URL features. The diverse set of decision trees creates a robust ensemble model that effectively analyzes and classifies URLs. A final decision is made through majority voting based on the predictions from each tree, resulting in more reliable and precise classifications compared to a single decision tree model.
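The voting idea can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in, not the project's model: each "tree" is a one-split decision stump, the random feature subset is drawn once per tree rather than at every split, and the toy rows (hostname length, '-' count, digit count) and all function names are invented for illustration.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Return the one-split rule with the fewest errors on this sample."""
    best = None  # (errors, feature, threshold, label_if_above)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            for above in (0, 1):
                errors = sum(
                    (above if r[f] > t else 1 - above) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, above)
    _, f, t, above = best
    return lambda row: above if row[f] > t else 1 - above


def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Train stumps on bootstrap samples, each seeing a random feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        )
    return forest


def predict(forest, row):
    """Majority vote over the individual tree predictions."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): [hostname length, '-' count, digit count].
X = [[10, 0, 0], [12, 0, 1], [14, 1, 0], [11, 0, 2], [13, 1, 1],
     [30, 3, 6], [35, 4, 8], [28, 2, 5], [40, 5, 9], [33, 3, 7]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 = legitimate, 1 = phishing
forest = train_forest(X, y)
print(predict(forest, [8, 0, 0]), predict(forest, [45, 6, 10]))
```

Because every stump sees a different sample and feature subset, a single poorly trained stump is outvoted by the rest, which is the robustness the ensemble is meant to provide.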

Feature selection method: Particle Swarm Optimization

The purpose of feature selection is to select a subset of relevant features from a large number of available features to achieve similar or even better classification performance than using all features. By eliminating/reducing irrelevant and redundant features, feature selection could reduce the number of features, shorten the training time, simplify the learned classifiers, and/or improve the classification performance.

feature-importance-chart

Particle Swarm Optimization (PSO) is a population-based technique. This project uses it for feature selection because it offers a good representation of candidate solutions, can search large spaces, is computationally inexpensive, is easy to implement, and requires few parameters. PSO simulates social behavior such as bird flocking and fish schooling. A population of candidate solutions, called a swarm, is encoded as particles in the search space. PSO starts by randomly initialising the particles. Each particle then moves through the search space toward the optimal solution, updating its position based on its own best experience and on the best experience found by the swarm.
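One common way to apply PSO to feature selection is the binary variant, in which each particle is a bit mask over the features. The sketch below is a minimal version under stated assumptions: the parameter values (`w`, `c1`, `c2`, the velocity clamp) are conventional defaults rather than the project's settings, and `toy_fitness` is a made-up stand-in for the real objective, which would be classifier accuracy on the selected features.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=20, n_iters=50,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for the feature mask (1 = keep, 0 = drop) that maximises fitness."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # each particle's best mask
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # the swarm's best mask

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                # Pull toward the particle's own best and the swarm's best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-4.0, min(4.0, vel[i][d]))  # clamp velocity
                # The sigmoid of the velocity is the probability of bit = 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


USEFUL = {0, 2, 5}  # hypothetical: indices of the informative features


def toy_fitness(mask):
    """Reward informative features, lightly penalise every selected one."""
    return sum(mask[i] for i in USEFUL) - 0.1 * sum(mask)


mask, score = binary_pso(toy_fitness, n_features=6)
print(mask, score)
```

The small penalty on mask size mirrors the goal stated earlier: keep classification performance while reducing the number of features.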

After performing feature selection using PSO, 19 features were selected, resulting in the highest accuracy of 96.12 percent. The following are the features used for the model.


Model Results

In this study, we developed a phishing URL detection model and evaluated its performance across key metrics, achieving highly promising results. The model's overall accuracy was 95.07%.

Precision, Recall, and F1-Score

To further assess the model's reliability, we considered additional performance metrics:

  • Precision: The model achieved a precision of 96.00%, meaning that of all the URLs flagged as phishing, 96% were actual phishing URLs. A high precision score is essential in this context, as it reduces false positives, ensuring that legitimate URLs are not misclassified as phishing.
  • Recall: The recall was measured at 94.05%, indicating the model's ability to detect most phishing URLs correctly. High recall ensures that phishing attacks are not missed, which is critical for minimizing the risk of undetected security threats.
  • F1-Score: The F1-score, which balances precision and recall, was 95.07%, demonstrating that the model maintains a strong balance between minimizing false positives and false negatives. This metric highlights the overall reliability of the model in a real-world application where both accuracy and coverage are important.
  • Accuracy: 96.13%
  • Precision: 96.28%
  • Recall: 95.97%
  • F1-Score: 96.13%
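For reference, all four metrics follow directly from the confusion-matrix counts. The sketch below uses the standard definitions; the function name and the counts in the example are hypothetical and are not the project's evaluation data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard metrics, treating phishing as the positive class.

    tp: phishing URLs flagged as phishing
    fp: legitimate URLs flagged as phishing
    fn: phishing URLs that were missed
    tn: legitimate URLs correctly passed
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
for name, value in classification_metrics(tp=80, fp=20, fn=10, tn=90).items():
    print(f"{name}: {value:.2%}")
```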

Information About Each Feature

Name | Type | Explanation
domainlength | Address Bar based | Count the characters in the hostname string.
www | Address Bar based | If the URL has 'www' as the subdomain, return 0; otherwise, return 1.
subdomain | Address Bar based | If the URL has more than 1 subdomain, return 1; otherwise, return 0.
https | Address Bar based | If the URL contains 'https', return 0; otherwise, return 1.
short_url | Address Bar based | If the URL is a short URL, return 1; otherwise, return 0.
@ | Address Bar based | Count the '@' characters in the URL.
- | Address Bar based | Count the '-' characters in the URL.
= | Address Bar based | Count the '=' characters in the URL.
. | Address Bar based | Count the '.' characters in the URL's hostname.
_ | Address Bar based | Count the '_' characters in the URL.
/ | Address Bar based | Count the '/' characters in the URL.
digit | Address Bar based | Count the digit (0-9) characters in the URL.
log | Address Bar based | If the URL contains the word 'log', return 0; otherwise, return 1.
pay | Address Bar based | If the URL contains the word 'pay', return 0; otherwise, return 1.
web | Address Bar based | If the URL contains the word 'web', return 0; otherwise, return 1.
account | Address Bar based | If the URL contains the word 'account', return 0; otherwise, return 1.
pcemptylinks | HTML/DOM Structure based | Percentage of empty links. An empty link does not lead to a different page.
pcextlinks | HTML/DOM Structure based | Percentage of external links, which direct the user to another site with a different domain.
pcrequrl | HTML/DOM Structure based | Percentage of external resource URLs hosted on a different domain.
zerolink | HTML/DOM Structure based | If the page has no links in its HTML body, return 1; otherwise, return 0.
extfavicon | HTML/DOM Structure based | If the favicon URL is from a different domain than the submitted URL, return 1; otherwise, return 0.
submit2Email | HTML/DOM Structure based | If the HTML page contains "\b(mail()|mailto:?)\b", return 1; otherwise, return 0.
sfh | HTML/DOM Structure based | If the server form handler (SFH) is an empty string or leads to a different domain from the submitted URL, return 1; otherwise, return 0.
redirection | Abnormal based | If clicking the submitted URL results in a redirection to another URL, return 1; otherwise, return 0.
domainage | Domain based | If the domain age is less than 6 months, return 1; otherwise, return 0.
domainend | Domain based | If the difference in days between the current date and the expiration date is less than or equal to one year, return 1; otherwise, return 0.
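Several of the Address Bar based features can be computed from the URL string alone. The sketch below follows a literal reading of the table and is not LINKSCOPE's actual extraction code; the function name and example URL are hypothetical. It omits `subdomain` and `short_url`, which would need a public-suffix list and a list of shortening services respectively.

```python
import string
from urllib.parse import urlsplit

KEYWORDS = ("log", "pay", "web", "account")


def address_bar_features(url):
    """Compute a subset of the Address Bar based features for one URL."""
    host = urlsplit(url).hostname or ""
    lowered = url.lower()
    features = {
        "domainlength": len(host),
        "www": 0 if host.startswith("www.") else 1,
        # Literal reading of the table: any occurrence of 'https' counts.
        "https": 0 if "https" in lowered else 1,
        "@": url.count("@"),
        "-": url.count("-"),
        "=": url.count("="),
        ".": host.count("."),  # dots are counted in the hostname only
        "_": url.count("_"),
        "/": url.count("/"),
        "digit": sum(ch in string.digits for ch in url),
    }
    # Keyword features return 0 when the word is present, as in the table.
    for word in KEYWORDS:
        features[word] = 0 if word in lowered else 1
    return features


# Hypothetical example URL.
print(address_bar_features("http://secure-login.example.com/account/update?id=42"))
```

The HTML/DOM, abnormal, and domain based features need the fetched page or a WHOIS lookup, so they are left out of this string-only sketch.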

Feature Importance

The success of this model is largely due to its careful selection and weighting of relevant features. In phishing URL detection, the following features proved to be most influential:

url-structure

LINKSCOPE | Developed by Senior Synergy