The Homestretch: SCA Mentorship ML Project
It has been 3 months of a wonderful SCA mentorship experience on the data science track. I am now on the homestretch of Cohort 4. As I conclude this journey, I marvel at the skills and knowledge I have gained so far. You can read all about my experience and journey here.
This article is part of my final project and gives a clear snapshot of how I went about my final SCA mentorship project. Here I share vital tips on machine learning, which was the focus this month. As part of the project, I have had the opportunity to build my data science and machine learning skills. This process has definitely boosted my self-confidence in handling data science and analytics projects.
Learning Path: Month 3
With the evolution of ICT came electronic commerce (e-commerce), which has transformed the way people buy goods, exchange services, and manage their lifestyles.
Ecommerce refers to the buying and selling of goods or services using the internet, and the transfer of money and data to execute these transactions. ~ Shopify
E-commerce has gained traction over the years, allowing a global shopping experience from the comfort of your locale. This revolution has changed the marketplace, reshaping the retail industry while improving the users' online shopping experience. With these advancements, cutting-edge technologies like machine learning and artificial intelligence are recommended for meeting customer needs and personalizing the shopping experience.
Machine learning helps redefine the future of the e-commerce sector. This is through:
- Improved customer recommendations
- Improved customer service
- Data product management: businesses are able to create new products
- Precise search results
- AI-powered virtual assistants (chatbots)
- Customer churn predictions
It is important to gain in-depth insights into e-commerce through data-driven analytics and to identify the factors affecting product sales, including the impact of customer characteristics on purchase habits.
The aim of this machine learning project is gender prediction for e-commerce. It is important for e-commerce businesses to understand the demands, habits, concerns, perceptions, and interests of their customers, and gender offers a useful clue.
However, on some e-commerce platforms the gender of users is generally unavailable. To address this gap, this project predicts the gender of e-commerce participants from their product viewing records.
Submissions are evaluated on accuracy between the predicted and observed gender for the sessions in the test set.
Machine Learning with Python
Step 1: Data Gathering
I used a pre-collected dataset from Analytics Vidhya to train and evaluate the model. It contains train, test, and submission files. The data dictionary is as below:
Import libraries & Load Dataset
- In this step, I downloaded data from the website, imported the modules required, and loaded data onto my local environment using Visual Studio Code.
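As a minimal sketch of the loading step, the snippet below reads a CSV into a pandas DataFrame. The two-row inline sample, the column names (session_id, startTime, endTime, ProductList, gender), and the file name train.csv in the comment are assumptions made for illustration, not the actual competition files.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the downloaded train file.
# Column names are assumed from the fields mentioned in this article.
sample_csv = io.StringIO(
    "session_id,startTime,endTime,ProductList,gender\n"
    "u1,2014-12-15 18:11:00,2014-12-15 18:12:00,"
    "A00002/B00003/C00006/D28435/;A00002/B00003/C00006/D02554/,female\n"
    "u2,2014-12-16 14:35:00,2014-12-16 14:41:00,"
    "A00001/B00009/C00031/D29404/,male\n"
)

# With the real download this would be something like pd.read_csv("train.csv")
train = pd.read_csv(sample_csv)

print(train.shape)   # number of rows and columns
print(train.dtypes)  # data type of each column
print(train.head())  # first few rows
```

The same call would be repeated for the test and submission files.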
Step 2: Data Preprocessing/ Preparation
Exploratory Data Analysis was performed on both the train and test data, as they share the same features except for the target variable, which is absent from the test data.
- Here, I performed data wrangling on the training and test data to prepare them for training and testing, exploring the data with pandas functions.
- Data cleaning was performed to handle missing values, check for duplicates, encode data, and correct any errors present.
- I also visualized some of the data to check for relevant correlations between variables.
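The checks listed above could look roughly like the sketch below. The toy frame and its column names (duration_s, n_products) are hypothetical, and filling a missing value with the median is just one possible choice, not necessarily what I did in the project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value and one duplicated row
df = pd.DataFrame({
    "session_id": ["u1", "u2", "u2", "u3"],
    "duration_s": [60.0, 360.0, 360.0, np.nan],
    "n_products": [2, 1, 1, 4],
    "gender": ["female", "male", "male", "female"],
})

print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows

# Drop exact duplicates, then fill the missing duration with the median
clean = df.drop_duplicates().copy()
clean["duration_s"] = clean["duration_s"].fillna(clean["duration_s"].median())

# Pairwise correlation between the numeric columns
corr = clean[["duration_s", "n_products"]].corr()
print(corr)
```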
Data Cleaning on Product List
- The product list holds the products viewed by the user in a given session. Each product is encoded as a category, sub-category, sub-sub-category, and product ID separated by slashes, and consecutive products are separated by semicolons. I split the list into these four levels.
- Gender Distribution
- Gender Distribution by the product viewing
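To illustrate the product list split described above, here is a small sketch using pandas string methods. The sample values, the ProductList column name, and the four output column names are assumptions; the real data may differ in detail, for example in whether each path ends with a slash.

```python
import pandas as pd

# Hypothetical sessions; the ProductList format follows the description above
df = pd.DataFrame({
    "session_id": ["u1", "u2"],
    "ProductList": [
        "A00002/B00003/C00006/D28435/;A00002/B00003/C00006/D02554/",
        "A00001/B00009/C00031/D29404/",
    ],
})

# One row per viewed product: split on ";" and explode the resulting lists
products = (
    df.assign(product_path=df["ProductList"].str.split(";"))
      .explode("product_path")
      .reset_index(drop=True)
)

# Strip any leading/trailing slash, then split each path into its four levels
levels = products["product_path"].str.strip("/").str.split("/", expand=True)
levels.columns = ["category", "sub_category", "sub_sub_category", "product"]

products = pd.concat([products[["session_id"]], levels], axis=1)
print(products)
```

Keeping session_id next to the four levels makes it possible to aggregate back to one row per session later, for instance by counting products per category.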
Feature Engineering: Dates
- On the date features, I changed the data type of the date columns to datetime.
- I subtracted startTime from endTime and used the dt.total_seconds() function to convert the difference between the two datetime values to seconds. This is the time the customer spent viewing products in the session.
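A minimal sketch of the two date steps above, assuming the columns are named startTime and endTime and hold parseable timestamp strings. The sample values and the duration_s column name are made up for illustration.

```python
import pandas as pd

# Hypothetical session timestamps stored as strings
df = pd.DataFrame({
    "startTime": ["2014-12-15 18:11:00", "2014-12-16 14:35:00"],
    "endTime": ["2014-12-15 18:12:00", "2014-12-16 14:41:00"],
})

# Convert the string columns to datetime
df["startTime"] = pd.to_datetime(df["startTime"])
df["endTime"] = pd.to_datetime(df["endTime"])

# Subtracting two datetime columns gives a timedelta column;
# .dt.total_seconds() turns it into a float number of seconds
df["duration_s"] = (df["endTime"] - df["startTime"]).dt.total_seconds()

print(df)
```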
Feature encoding of categorical data
- Transformed the categorical data and the target variable into numerical values by encoding them before fitting a model. I used dummy variables so that sklearn can work with them.
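The encoding step could be sketched as follows with pd.get_dummies. The toy columns and the 0/1 mapping for the target are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical frame with one categorical feature and the target
df = pd.DataFrame({
    "category": ["A00002", "A00001", "A00002", "A00003"],
    "duration_s": [60.0, 360.0, 120.0, 30.0],
    "gender": ["female", "male", "female", "male"],
})

# Dummy (one-hot) columns for the categorical feature;
# numeric columns pass through unchanged
X = pd.get_dummies(df[["category", "duration_s"]], columns=["category"])

# Encode the target as integers; which class gets 0 or 1 is arbitrary
y = df["gender"].map({"female": 0, "male": 1})

print(X)
print(y.tolist())
```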
Step 3: Model Selection
- Different algorithms suit different tasks, so I first chose the target and the features.
- I built several models to predict gender and set up a test harness with 3-fold cross-validation to compare them.
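A test harness of this kind might look like the sketch below. It runs on synthetic data from make_classification rather than the competition data, and the three models and their settings are only examples, so the printed scores will not match the ones reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared features and target
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate models to compare (an illustrative subset)
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# 3-fold cross-validation, scored on accuracy
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=3, scoring="accuracy")
    results[name] = scores.mean()
    print(f"Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f}) [{name}]")
```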
Step 4: Train Model
- Split the data into training and validation sets for both features and target, using a random number generator. This is a quick way to evaluate the performance of an algorithm on a problem.
- Defined the selected model. In my case, I selected the GradientBoostingClassifier, which gave a cross-validation accuracy score of 0.8846.
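The split and fit steps above can be sketched as follows. The data is synthetic, and the 80/20 split, the stratification, and the random_state values are assumptions rather than the settings from my project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hold out 20% for validation; random_state fixes the random number
# generator so the split is reproducible
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define and train the selected model with default hyperparameters
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

# Mean accuracy on the held-out validation set
val_accuracy = model.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.4f}")
```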
Step 5: Model Evaluation
- Used the validation set to assess how well the trained model performs on unseen data.
- Since this was a classification problem, I used accuracy, alone or combined with other metrics, to evaluate the performance of the model.
- Tested the model on the unseen data.
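As a small worked example of the evaluation step, the sketch below computes accuracy and a confusion matrix with sklearn.metrics. The label lists are hand-made for illustration and are not predictions from my model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for eight validation sessions
y_val = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# Accuracy: share of sessions where prediction and truth agree (6 of 8)
acc = accuracy_score(y_val, y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_val, y_pred)

print(acc)  # 0.75
print(cm)   # [[4 1]
            #  [1 2]]
```

When one gender dominates the data, the confusion matrix helps reveal whether a high accuracy comes mostly from predicting the majority class.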
Step 6: Hyperparameter Tuning
- Tuned the parameters of the GradientBoostingClassifier to improve performance. These included the number of training steps (boosting stages) and the learning rate, among others.
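One common way to tune these parameters is a grid search with cross-validation. The sketch below uses GridSearchCV on synthetic data; the grid values are arbitrary examples, not the ones I used, and a grid search is only one of several possible tuning approaches.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared features and target
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Illustrative grid: n_estimators is the number of boosting stages
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
}

# Try every combination with 3-fold cross-validation, scored on accuracy
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```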
Step 7: Make Predictions
- Generated predictions with the trained model and submitted results to the competition
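The prediction step could be sketched as below. The data is synthetic, and the submission layout (session_id and gender columns), the label mapping, and the file name in the comment are assumptions; the competition's submission file defines the real format.

```python
import io

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in: the first 100 rows act as train, the last 20 as test
X, y = make_classification(n_samples=120, n_features=6, random_state=0)
X_train, y_train, X_test = X[:100], y[:100], X[100:]

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Map the integer predictions back to labels (mapping is an assumption)
label_names = {0: "female", 1: "male"}
submission = pd.DataFrame({
    "session_id": [f"u{i}" for i in range(len(X_test))],
    "gender": pd.Series(model.predict(X_test)).map(label_names),
})

# Written to an in-memory buffer here; with real data this would be
# something like submission.to_csv("submission.csv", index=False)
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue()[:60])
```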
Selection of best model
I implemented a regular gradient boosting classifier, an XGBoost classifier, LightGBM, and Random Forest on the same data set. The performance comparison showed that the Gradient Boosting Classifier worked best, so I tuned it to improve the accuracy score.
3-fold cross validation:
Accuracy: 0.78 (+/- 0.00) [KNN]
Accuracy: 0.86 (+/- 0.01) [Random Forest]
Accuracy: 0.77 (+/- 0.00) [Naive Bayes]
Accuracy: 0.85 (+/- 0.00) [StackingClassifier]
Baseline Models
Model Name | CV
XGBClassifie | 0.8799
GradientBoos | 0.8846
LogisticRegr | 0.8322
RandomForest | 0.8600
DecisionTree | 0.8185
AdaBoostClas | 0.8803
ExtraTreeCla | 0.8230
ExtraTreesCl | 0.8533
KNeighborsCl | 0.7860
BaggingClass | 0.8619
I settled on a gradient boosting model, a powerful boosting algorithm used for both classification and regression tasks. Gradient boosting classifiers are easy to implement in Scikit-Learn. The algorithm builds a strong learner from an ensemble of weak learners, with each new learner correcting the errors of the ones before it.
I attained a private score of 86.55% accuracy, ranking me in the top 10% of the contest. The final submission code can be accessed on my GitHub profile.
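To make the idea of building a strong learner from weak learners concrete, here is a small sketch on synthetic data. It uses depth-1 trees (stumps) as the weak learners and staged_predict to follow training accuracy as stages are added. The settings are illustrative assumptions, not my final configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the prepared features and target
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# max_depth=1 makes every tree a weak learner (a decision stump)
model = GradientBoostingClassifier(
    n_estimators=50, max_depth=1, random_state=0
).fit(X, y)

# staged_predict yields the ensemble's predictions after each stage
staged_acc = [accuracy_score(y, pred) for pred in model.staged_predict(X)]

print(f"After  1 stump:  {staged_acc[0]:.3f}")
print(f"After 50 stumps: {staged_acc[-1]:.3f}")
```

This tracks accuracy on the training data only; a held-out set would be needed to judge generalization.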
A special thanks to She Code Africa for the 3-month data science mentorship opportunity, which has enabled me to learn end-to-end analysis on real-world tasks using relevant analytics tools.
Thank you to my mentor Kolawole Precious for being a part of my journey and guiding me through the data science-focused problems, amidst your busy schedule.