Detection of malicious uniform resource locator using machine learning

Chatterjee, Saptaparni

Please use this identifier to cite or link to this item: https://irju.jdvu.ac.in/jspui/handle/123456789/9001

Title:	Detection of malicious uniform resource locator using machine learning
Authors:	Chatterjee, Saptaparni
Advisors:	Mukherjee, Joydeep
Keywords:	Uniform Resource Locators;Artificial intelligence (AI);QuantileTransformer
Issue Date:	2023
Publisher:	Jadavpur University, Kolkata, West Bengal
Abstract:	The increasing use of the internet has been accompanied by a rise in cyber-attacks, with malicious webpages posing a significant threat to users. This has necessitated effective tools for detecting and preventing such attacks. In recent years, artificial intelligence (AI) has emerged as a promising tool in this regard, able to learn from vast amounts of data and identify patterns that may not be immediately apparent to human analysts. This work discusses the use of AI in malicious webpage detection, focusing on lexical features, hyperparameter tuning, the QuantileTransformer, Support Vector Machine, K-Nearest Neighbors, Logistic Regression, Random Forest, and the chi-squared test of independence, a statistical test used to determine whether there is a significant association between two categorical variables. Detecting malicious Uniform Resource Locators is a critical step in preserving the safety of online platforms and shielding users from cyber threats. Automating this process with machine learning has shown promising results, enabling quicker and more precise identification of harmful Uniform Resource Locators. A number of essential approaches and methods are used in this field, as detailed below. The chi-square test is a statistical technique for assessing the independence of two categorical variables. It is frequently used during the feature selection stage of developing a machine learning model for detecting malicious Uniform Resource Locators. By measuring the association between each feature and the target variable, the chi-square test helps identify the most relevant features for classification. Model tuning is a method for optimising the hyperparameters of machine learning models. 
This method examines various combinations of hyperparameters by randomly sampling from a predetermined search space in order to identify a well-performing configuration, which improves the performance and generalisability of the model. After running these machine learning algorithms, the researcher observed that Random Forest performed considerably better than the other algorithms. By utilising the power of neural networks, machine learning models for malicious Uniform Resource Locator identification can achieve improved accuracy, adapt to changing threats, and defend people and systems from potential security hazards. Cross-validation is a method for assessing the performance of machine learning models. It involves partitioning the data into several subsets, training the model on some of them, and evaluating it on the remaining data. Repeating this procedure with different subsets gives a reliable estimate of the model's performance and helps guard against overfitting. The QuantileTransformer is a preprocessing technique that transforms features by mapping them to a uniform or Gaussian distribution. It belongs to the family of quantile-based transforms. Its main purpose is to make feature distributions closer to Gaussian or uniform, which can benefit machine learning algorithms that assume such a distribution of features. The transformation estimates the cumulative distribution function (CDF) of the input feature values and then maps them to the desired output distribution using the inverse CDF (quantile function) of that distribution, so that the transformed values approximately follow the desired distribution. The QuantileTransformer in scikit-learn has two modes, selected through its output_distribution parameter: "uniform" and "normal". 
In the "uniform" mode, the transformed values are mapped to a uniform distribution over the range [0, 1]. In the "normal" mode, the transformed values are mapped to a standard normal distribution (mean=0, standard deviation=1). Models for detecting malicious Uniform Resource Locators are best evaluated using tools such as the ROC curve and the classification report. The ROC curve depicts the trade-off between the true positive rate and the false positive rate, which makes it possible to choose an appropriate classification threshold. The classification report includes performance statistics such as precision, recall, and F1 score. Data processing plays a significant part in preparing the data for model training. To ensure data integrity and avoid bias, steps such as removing duplicates from the dataset are required. To further assure the accuracy and reliability of the data, it is crucial to check for null values and handle missing entries. Lexical features are important for identifying the properties of Uniform Resource Locators that suggest malicious intent. These features include the Uniform Resource Locator's length, the presence of particular words or patterns, the use of special characters, and the structure of the domain. By examining these lexical properties, machine learning models can learn to differentiate between trustworthy and malicious Uniform Resource Locators. To carry out the research for this study, a sizable dataset of Uniform Resource Locators, including both trustworthy and harmful examples, was collected and pre-processed. The dataset is used to train a variety of machine learning techniques, including decision trees, support vector machines, and neural networks, to create precise classifiers. 
The models are then assessed using pertinent performance measures to determine how well they can identify malicious Uniform Resource Locators. In conclusion, detecting malicious Uniform Resource Locators using machine learning draws on a variety of methods: the chi-square test, model tuning with grid search, ensemble methods such as Random Forest, cross-validation, the QuantileTransformer, evaluation tools such as the ROC curve, classification report, and confusion matrix, and proper data processing. By using these techniques, organisations can create reliable and accurate methods for identifying and reducing the cyber risks associated with malicious Uniform Resource Locators.
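The abstract describes extracting lexical features from a Uniform Resource Locator and using the chi-square test of independence to judge how strongly a feature is associated with the malicious/benign label. The following is a minimal sketch of that idea using only the Python standard library; it is not the dissertation's code. The feature names, the keyword list SUSPICIOUS_WORDS, and the six sample URLs and labels are all hypothetical and chosen purely for illustration.

```python
from urllib.parse import urlparse

# Hypothetical keyword list, chosen only for illustration.
SUSPICIOUS_WORDS = ("login", "verify", "secure", "update")


def lexical_features(url):
    """Return a few simple lexical features of a URL string."""
    host = urlparse(url).hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_special_chars": sum(not ch.isalnum() for ch in url),
        "has_at_symbol": "@" in url,
        "has_suspicious_word": any(w in lowered for w in SUSPICIOUS_WORDS),
    }


def contingency_table(flags, labels):
    """Count (feature value, label) pairs into a 2x2 table.

    Rows: feature False / True. Columns: label 0 (benign) / 1 (malicious).
    """
    table = [[0, 0], [0, 0]]
    for flag, label in zip(flags, labels):
        table[int(flag)][label] += 1
    return table


def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table.

    Assumes every row and every column has a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the feature and the label were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up toy data: (URL, label) with 1 = malicious, 0 = benign.
SAMPLES = [
    ("http://example.com/", 0),
    ("https://example.org/docs/index.html", 0),
    ("http://example.net/login", 0),
    ("http://198.51.100.7/secure-login/verify", 1),
    ("http://example.com.account-update.example/verify", 1),
    ("http://example.test/download.exe", 1),
]

flags = [lexical_features(url)["has_suspicious_word"] for url, _ in SAMPLES]
labels = [label for _, label in SAMPLES]
table = contingency_table(flags, labels)
print(table, chi_square_statistic(table))
```

A 2x2 table has one degree of freedom, for which the critical value at the 0.05 significance level is about 3.84; a statistic above that value would suggest the feature and the label are not independent. Six samples are far too few for any real conclusion, so this toy data only shows the mechanics. In practice a library routine such as scikit-learn's sklearn.feature_selection.chi2 would likely be used on the full feature matrix.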
URI:	http://20.198.91.3:8080/jspui/handle/123456789/9001
Appears in Collections:	Dissertation

Files in This Item:

File	Description	Size	Format
M.Tech (School of Education Technology) Saptaparni Chatterjee.pdf		3.71 MB	Adobe PDF

IR@JU Digital Repository

IR@JU preserves and enables easy and open access to all types of digital content including text, images, moving images, mpegs and data sets