How I Used Web Scraping to Train a Machine Learning Model: A Case Study


As a developer with a passion for web scraping and machine learning, I've always been interested in how the two fields intersect. One of the most exciting projects I've worked on combined both disciplines in a practical, hands-on way: training a machine learning model using data that I scraped from the web. In this article, I’ll take you through the journey of how I leveraged web scraping to gather data and trained a model that brought real insights to the table. Along the way, I’ll share the challenges I encountered, the solutions I implemented, and the lessons I learned.


The Goal: Predictive Insights from Real-World Data

My primary objective in this project was to predict price trends for a popular category of consumer electronics: smartphones. Given the rapid changes in product prices and the competitive nature of online marketplaces, this seemed like the perfect use case for scraping and machine learning. My goal was to scrape price data from multiple e-commerce platforms, clean and preprocess that data, and build a machine learning model that could forecast future price changes.


Step 1: Scraping the Data

The first step was to identify the sources of the data I needed. I decided to focus on three major e-commerce platforms that list smartphones: Amazon, Jumia, and Konga. These sites provide detailed product listings, including prices, specifications, and reviews.

For this project, I used Node.js along with the got-scraping and Cheerio libraries to scrape the data. Here’s an overview of the key data points I collected:

  • Product name: To ensure that I was tracking the same products across different sites.
  • Price: The central feature of interest for my model.
  • Date: The timestamp for when the price was scraped, as price trends would need to be analyzed over time.
  • Seller: To account for price differences between vendors.

The scraping process was not without its challenges. Some sites had anti-bot mechanisms that needed to be bypassed using rotating proxies and CAPTCHA solvers. Additionally, ensuring that I captured consistent data across multiple platforms required fine-tuning my scrapers to handle each site’s specific structure.


Step 2: Data Cleaning and Preprocessing

After collecting the raw data, I quickly realized it was messy. There were inconsistencies between platforms, such as different currencies, missing values, and duplicate entries. Cleaning this data was crucial before feeding it into a machine learning model.

Here are the main steps I took to clean and preprocess the data:

  • Removing duplicates: On many occasions, the same product was listed multiple times due to pagination or different vendors selling the same item. I wrote a script to remove duplicates based on product name and specifications.
  • Handling missing data: Some listings were missing prices or timestamps. I opted to drop these entries, as they represented a small portion of the dataset.
  • Currency conversion: Prices were listed in different currencies (Naira for Jumia and Konga, USD for Amazon). I used an API to convert all prices to a common currency, ensuring consistency.
  • Date formatting: I standardized the date and time formats across all platforms to prepare for time-series analysis.

At this point, I had a clean dataset with thousands of data points, including historical price data that I could use to train a model.
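
For readers who want to reproduce this step, a rough pandas sketch of the cleaning pass might look like the following. The column names, file path, platform labels, and exchange rate are placeholders rather than the project's actual schema:

```python
# Rough sketch of the cleaning pass; column names, the CSV path, the platform
# labels, and the exchange rate are illustrative placeholders.
import pandas as pd

df = pd.read_csv("scraped_prices.csv")  # raw rows from all three platforms

# Remove duplicates: keep one row per product (name + specs) per scrape
df = df.drop_duplicates(subset=["product_name", "specifications", "scraped_at"])

# Drop listings with missing prices or timestamps (a small share of the data)
df = df.dropna(subset=["price", "scraped_at"])

# Convert Naira prices (Jumia, Konga) to a common currency; in practice the
# rate came from a currency-conversion API rather than a hard-coded constant
NGN_TO_USD = 0.00065  # placeholder rate
is_ngn = df["platform"].isin(["jumia", "konga"])
df.loc[is_ngn, "price"] = df.loc[is_ngn, "price"] * NGN_TO_USD

# Standardize date formats for time-series analysis
df["scraped_at"] = pd.to_datetime(df["scraped_at"], utc=True, errors="coerce")
df = df.dropna(subset=["scraped_at"]).sort_values("scraped_at")
```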


Step 3: Feature Engineering

Feature engineering was a critical step in this project. I needed to extract meaningful features from the raw data that would help my model understand the relationships between various factors and price trends.

Here are the features I engineered (a rough code sketch follows the list):

  • Time-based features: I created features based on the timestamp, such as the day of the week and month. This would allow the model to capture potential seasonal price changes.
  • Price change rate: I calculated the rate of price change over different periods (e.g., week-over-week, month-over-month). This gave the model a sense of how fast prices were rising or falling.
  • Seller influence: I included a categorical feature that represented the seller of the product. This helped the model understand if certain sellers consistently offered higher or lower prices.
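
To make these concrete, here is a rough pandas sketch of how such features could be derived. It assumes the cleaned dataframe from the earlier sketch, with roughly one scraped row per product per day:

```python
# Illustrative feature engineering on the cleaned dataframe from the previous sketch.
import pandas as pd

df = df.sort_values(["product_name", "scraped_at"])

# Time-based features, so the model can pick up seasonal effects
df["day_of_week"] = df["scraped_at"].dt.dayofweek
df["month"] = df["scraped_at"].dt.month

# Price change rate per product: week-over-week and month-over-month,
# assuming roughly one row per product per day
price_by_product = df.groupby("product_name")["price"]
df["wow_change"] = price_by_product.pct_change(periods=7)
df["mom_change"] = price_by_product.pct_change(periods=30)

# Seller as a categorical feature, one-hot encoded for the regressor
df = pd.get_dummies(df, columns=["seller"], drop_first=True, dtype=int)
```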

Step 4: Training the Machine Learning Model

With the cleaned and processed data in hand, I began training my machine learning model. Because I was predicting a continuous variable (price), I opted for a Random Forest Regressor as my baseline, an algorithm that handles both numerical and categorical features well.

Here’s how I set up the training process (sketched in code after the list):

  • Training/Test Split: I split the dataset into 80% training data and 20% test data to ensure that the model was evaluated fairly.
  • Cross-Validation: I used 5-fold cross-validation to avoid overfitting and to ensure that the model generalizes well across different subsets of the data.
  • Hyperparameter Tuning: I used grid search to tune hyperparameters like the number of trees in the forest and the maximum depth of each tree.
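
To make this concrete, here is a minimal scikit-learn-style sketch of the setup; the hyperparameter grid, random seed, and column handling are placeholders rather than my exact values:

```python
# Minimal sketch of the training setup: 80/20 split, 5-fold CV, grid search.
# The hyperparameter grid and random seed are placeholders.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Keep the numeric and one-hot features from the earlier sketches; drop the target
X = df.select_dtypes(include=["number", "bool"]).drop(columns=["price"]).fillna(0)
y = df["price"]

# 80% training data, 20% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Grid search over the number of trees and maximum depth, with 5-fold cross-validation
param_grid = {"n_estimators": [100, 300, 500], "max_depth": [None, 10, 20]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

# Evaluate the best model on the held-out 20%
best_model = search.best_estimator_
print("R-squared on test data:", r2_score(y_test, best_model.predict(X_test)))
```

One caveat worth flagging: with time-series data, a purely random split can leak future prices into training, so a chronological split (or scikit-learn's TimeSeriesSplit) is usually safer. The sketch simply mirrors the 80/20 split described above.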

The model training took a few hours, but the results were promising. After evaluating the model on the test data, I achieved an R-squared score of 0.85, indicating that the model explained 85% of the variance in the price data. While this wasn’t perfect, it was a solid start for a first iteration.


Step 5: Continuous Data Scraping and Model Updating

One of the most valuable lessons I learned during this project was the importance of continuously updating the data. Price trends are dynamic and can change rapidly, especially in the fast-moving electronics market. To keep my model relevant, I set up a cron job that scraped new price data every day. This allowed me to retrain the model on a regular basis, ensuring that it remained accurate and up-to-date.

To automate the retraining process, I used Airflow to schedule the daily scrapes and the model retraining as dependent tasks. This automation saved me countless hours and ensured that the model was always learning from the latest data.
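
A minimal Airflow DAG for this kind of daily scrape-and-retrain cycle might look like the sketch below; the DAG id, schedule, and task bodies are placeholders rather than my production pipeline:

```python
# Minimal Airflow DAG sketch: scrape daily, then retrain on the refreshed dataset.
# The DAG id, schedule, and task bodies are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def scrape_prices():
    ...  # placeholder: run the scrapers and append the results to storage


def retrain_model():
    ...  # placeholder: reload the full dataset and refit the regressor


with DAG(
    dag_id="smartphone_price_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    scrape = PythonOperator(task_id="scrape_prices", python_callable=scrape_prices)
    retrain = PythonOperator(task_id="retrain_model", python_callable=retrain_model)

    scrape >> retrain  # retrain only after the day's scrape has finished
```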


Key Challenges and Solutions

While this project was successful overall, it wasn’t without its challenges. Here are some of the key obstacles I faced and how I overcame them:

  • Anti-Scraping Measures: Some platforms employed sophisticated anti-scraping mechanisms, such as CAPTCHAs and IP blocking. I mitigated these challenges by using rotating proxies and integrating CAPTCHA solvers like 2Captcha to handle the blocks.
  • Data Inconsistencies: Dealing with inconsistent data formats across different platforms was a major hurdle. I had to write custom scripts for each platform to normalize the data.
  • Computational Resources: Training a machine learning model on large datasets requires substantial computational power. I leveraged cloud services like AWS to handle the more intensive model training tasks.

Results and Insights

The results of this project were not only a functioning machine learning model but also valuable insights into how prices fluctuate over time. I discovered that prices for certain products, particularly flagship smartphones, tended to drop significantly around major sales events like Black Friday. Additionally, I found that certain vendors consistently offered lower prices, which was useful information for consumers.

The project also provided a clear roadmap for future work. I plan to expand the scope of the model to include more product categories, such as laptops and accessories, and to refine the feature set for even better predictions.


Conclusion

This project taught me the power of combining web scraping with machine learning. By scraping a steady stream of real-world price data, cleaning and preprocessing it, and feeding it into a robust machine learning model, I was able to gain valuable insights into pricing trends in the electronics market.

For anyone looking to undertake a similar project, I’d recommend paying close attention to the quality of your data and investing time in feature engineering. In my experience, these two factors are key to building successful machine learning models. Additionally, continuously updating your model with fresh data will ensure it remains relevant in the face of constantly changing trends.

If you’re interested in learning more about how to integrate web scraping and machine learning in your projects, feel free to reach out! I’m always happy to share tips, strategies, and lessons learned from my own experience.
